MAGIC Agent Skills

Quality Gating

Every skill in the Linguistic Agent Skills suite is evaluated against a standardized rubric before it can be used in the pipeline. The rubric, called skill-judge, has 8 dimensions worth 120 points in total. Skills that score below their tier's threshold are rejected or flagged for improvement.

The 8 Dimensions

| # | Dimension | Max Points | What It Measures |
| --- | --- | --- | --- |
| D1 | Trigger accuracy | 20 | Does the skill activate on the right queries and not on wrong ones? |
| D2 | Scope clarity | 15 | Are the "when to use" and "when NOT to use" boundaries clear and correct? |
| D3 | Decision quality | 15 | Are the decisions the skill makes technically sound? |
| D4 | Anti-pattern coverage | 18 | Does the skill prevent the important mistakes? |
| D5 | Edge-case handling | 17 | Does the skill handle the non-obvious scenarios correctly? |
| D6 | Output format | 15 | Is the output structured and usable by downstream skills? |
| D7 | Workflow completeness | 10 | Is the step-by-step workflow complete and actionable? |
| D8 | Integration | 10 | Does the skill correctly reference related skills and hand off cleanly? |
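As a minimal sketch, the rubric can be held as a plain mapping so that totals and per-dimension ranges are checked mechanically. The point values are copied from the table above; the names `RUBRIC_MAX` and `validate_scores` are hypothetical and are not part of skill-judge itself.

```python
# Maximum points per dimension, copied from the rubric table.
RUBRIC_MAX = {
    "D1": 20,  # Trigger accuracy
    "D2": 15,  # Scope clarity
    "D3": 15,  # Decision quality
    "D4": 18,  # Anti-pattern coverage
    "D5": 17,  # Edge-case handling
    "D6": 15,  # Output format
    "D7": 10,  # Workflow completeness
    "D8": 10,  # Integration
}

TOTAL_POINTS = sum(RUBRIC_MAX.values())  # 120


def validate_scores(scores: dict[str, int]) -> int:
    """Check that every dimension is present and in range; return the total."""
    if set(scores) != set(RUBRIC_MAX):
        raise ValueError("scores must cover exactly D1 through D8")
    for dim, value in scores.items():
        if not 0 <= value <= RUBRIC_MAX[dim]:
            raise ValueError(f"{dim}={value} is outside 0..{RUBRIC_MAX[dim]}")
    return sum(scores.values())
```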

Per-Tier Requirements

| Tier | Skills | Required Score | Per-Dimension Floors |
| --- | --- | --- | --- |
| Entry-point (A−) | orchestrator, scope, ethics, eval | ≥ 102/120 | D1 ≥ 15, D3 ≥ 10, D4 ≥ 13, D5 ≥ 12 |
| Specialist (A−) | All other specialists | ≥ 96/120 | D1 ≥ 15, D3 ≥ 10, D4 ≥ 13, D5 ≥ 12 |
| Mindset stub (B+) | codeswitch, historical, lexicon | ≥ 96/120 | D1 ≥ 12, D3 ≥ 8, D4 ≥ 10, D5 ≥ 9 |

Entry-point skills have higher requirements because failures there cascade through the entire pipeline. A wrong trigger on linguistic-orchestrator misdirects the whole session; a wrong trigger on linguistic-morph only affects the Analyze phase.
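The gate implied by the tier table has two parts: a minimum total and a floor on four individual dimensions. The sketch below shows one way to express that check. The thresholds are taken from the table, while the tier keys, the `TierRequirement` class, and the `gate` function are assumptions made for illustration, not the project's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class TierRequirement:
    min_total: int          # required score out of 120
    floors: dict[str, int]  # per-dimension minimums


# Values copied from the Per-Tier Requirements table; keys are hypothetical.
TIERS = {
    "entry-point": TierRequirement(102, {"D1": 15, "D3": 10, "D4": 13, "D5": 12}),
    "specialist": TierRequirement(96, {"D1": 15, "D3": 10, "D4": 13, "D5": 12}),
    "mindset-stub": TierRequirement(96, {"D1": 12, "D3": 8, "D4": 10, "D5": 9}),
}


def gate(tier: str, scores: dict[str, int]) -> list[str]:
    """Return the reasons a skill fails its tier; an empty list means it passes."""
    req = TIERS[tier]
    failures = []
    total = sum(scores.values())
    if total < req.min_total:
        failures.append(f"total {total} < required {req.min_total}")
    for dim, floor in req.floors.items():
        value = scores.get(dim, 0)
        if value < floor:
            failures.append(f"{dim} {value} < floor {floor}")
    return failures
```

Returning a list of reasons rather than a boolean makes it easy to report whether a skill missed the total, a floor, or both.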

Scores by Skill (Snapshot 2026-04-23)

| Skill | Score | Tier |
| --- | --- | --- |
| linguistic-ethics | 106/120 | Entry-point A− |
| linguistic-scope | 105/120 | Entry-point A− |
| linguistic-transfer | 105/120 | Specialist A− |
| linguistic-eval | 104/120 | Entry-point A− |
| linguistic-tokenize | 104/120 | Specialist A− |
| linguistic-scripts | 104/120 | Specialist A− |
| linguistic-corpus | 103/120 | Specialist A− |
| linguistic-annotate | 103/120 | Specialist A− |
| linguistic-bitext | 102/120 | Specialist A− |
| linguistic-morph | 102/120 | Specialist A− |
| linguistic-syntax | 102/120 | Specialist A− |
| linguistic-semantics | 102/120 | Specialist A− |
| linguistic-discourse | 102/120 | Specialist A− |
| linguistic-orchestrator | 102/120 | Entry-point A− |
| linguistic-speech | 101/120 | Specialist A− |
| linguistic-lexicon | 98/120 | Mindset stub B+ |
| linguistic-codeswitch | 97/120 | Mindset stub B+ |
| linguistic-historical | 97/120 | Mindset stub B+ |
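To see how much headroom each skill has, the snapshot totals can be compared against the required score for each tier. This is a small sketch using the numbers from the two tables above. Only totals are published here, so the per-dimension floors are not checked, and the names `SNAPSHOT`, `REQUIRED_TOTAL`, and `margins` are hypothetical.

```python
# Required totals per tier, from the Per-Tier Requirements table.
REQUIRED_TOTAL = {"entry-point": 102, "specialist": 96, "mindset-stub": 96}

# (score, tier) per skill, from the 2026-04-23 snapshot table.
SNAPSHOT = {
    "linguistic-ethics": (106, "entry-point"),
    "linguistic-scope": (105, "entry-point"),
    "linguistic-transfer": (105, "specialist"),
    "linguistic-eval": (104, "entry-point"),
    "linguistic-tokenize": (104, "specialist"),
    "linguistic-scripts": (104, "specialist"),
    "linguistic-corpus": (103, "specialist"),
    "linguistic-annotate": (103, "specialist"),
    "linguistic-bitext": (102, "specialist"),
    "linguistic-morph": (102, "specialist"),
    "linguistic-syntax": (102, "specialist"),
    "linguistic-semantics": (102, "specialist"),
    "linguistic-discourse": (102, "specialist"),
    "linguistic-orchestrator": (102, "entry-point"),
    "linguistic-speech": (101, "specialist"),
    "linguistic-lexicon": (98, "mindset-stub"),
    "linguistic-codeswitch": (97, "mindset-stub"),
    "linguistic-historical": (97, "mindset-stub"),
}


def margins(snapshot=SNAPSHOT):
    """Points above (or below, if negative) the tier's required total."""
    return {
        skill: score - REQUIRED_TOTAL[tier]
        for skill, (score, tier) in snapshot.items()
    }


if __name__ == "__main__":
    # Print the tightest margins first.
    for skill, margin in sorted(margins().items(), key=lambda kv: kv[1]):
        status = "PASS" if margin >= 0 else "FAIL"
        print(f"{status} {skill}: {margin:+d}")
```

In this snapshot every skill clears its tier's total, and linguistic-orchestrator sits exactly at the 102-point entry-point threshold.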

Eval Methodology

Each skill is evaluated using Anthropic's skill-creator plugin:

  • 3 eval prompts for standard skills (each prompt is run with and without the skill, and the outputs are compared)
  • 5 eval prompts for A-tier entry-point skills (broader coverage)
  • Baseline runs use the same model version as the with-skill runs, so any difference in output reflects the skill alone (a clean knowledge-delta signal)
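The paired-run idea in the list above can be sketched generically. The skill-creator plugin's real interface is not shown on this page, so the `run_model` callable, its argument order, and the `PROMPTS_PER_TIER` keys below are all assumptions for illustration; the only points taken from the text are the prompt counts and the rule that both runs share one model version.

```python
from typing import Callable, Optional

# Prompt counts from the list above; the keys are labels chosen for this sketch.
PROMPTS_PER_TIER = {"entry-point": 5, "standard": 3}

# Hypothetical signature: (prompt, skill name or None, model version) -> output text.
RunModel = Callable[[str, Optional[str], str], str]


def paired_runs(
    prompts: list[str], run_model: RunModel, skill: str, model_version: str
) -> list[dict[str, str]]:
    """Run every prompt twice on the same model version: baseline, then with the skill."""
    results = []
    for prompt in prompts:
        baseline = run_model(prompt, None, model_version)
        with_skill = run_model(prompt, skill, model_version)
        results.append(
            {"prompt": prompt, "baseline": baseline, "with_skill": with_skill}
        )
    return results
```

Passing the model runner in as a callable keeps the sketch independent of any particular client library and makes it easy to substitute a stub in tests.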

Scores are stored in tests/e2e/scores.json (machine-readable) and docs/skill-judge-dashboard.md (human-readable). The regression test (tests/integration/test_regression.py) verifies that earlier-phase skills maintain their scores at every merge.
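The contents of tests/integration/test_regression.py and the schema of tests/e2e/scores.json are not shown here, so the following is only a sketch of what a score-regression check might look like. It assumes a flat JSON object mapping skill names to total scores; `load_scores` and `find_regressions` are hypothetical helpers.

```python
import json
from pathlib import Path
from typing import Optional


def load_scores(path: Path) -> dict[str, int]:
    """Load a {skill_name: total_score} mapping (assumed schema)."""
    return json.loads(path.read_text(encoding="utf-8"))


def find_regressions(
    baseline: dict[str, int], current: dict[str, int]
) -> dict[str, tuple[int, Optional[int]]]:
    """Return skills whose score dropped or disappeared, as (old, new) pairs."""
    regressions = {}
    for skill, old in baseline.items():
        new = current.get(skill)
        if new is None or new < old:
            regressions[skill] = (old, new)
    return regressions
```

A pytest test could then assert that `find_regressions(baseline, current)` is empty, where the baseline comes from the previously merged scores file.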

Why Quality Gating Matters

Without gating, skill quality would drift: a skill that initially handles D5 edge cases correctly could degrade through later edits that are never re-tested against those scenarios. The rubric makes quality measurable and regressions detectable.

For contributors: before submitting a skill improvement PR, run the skill-judge eval locally to verify the score hasn't dropped. The pre-push hook (ruff check --fix && mypy && pytest tests/) catches code-level issues; skill-judge catches content-level issues.
