Quality Gating
Every skill in the Linguistic Agent Skills suite is evaluated against a standardized rubric before it can be used in the pipeline. The rubric, called skill-judge, scores 8 dimensions for a total of 120 points. Skills that fall below the threshold for their tier are rejected or flagged for improvement.
The 8 Dimensions
| # | Dimension | Max Points | What It Measures |
|---|---|---|---|
| D1 | Trigger accuracy | 20 | Does the skill activate on the right queries and not on wrong ones? |
| D2 | Scope clarity | 15 | Are the "when to use" and "when NOT to use" boundaries clear and correct? |
| D3 | Decision quality | 15 | Are the decisions the skill makes technically sound? |
| D4 | Anti-pattern coverage | 18 | Does the skill prevent the important mistakes? |
| D5 | Edge-case handling | 17 | Does the skill handle the non-obvious scenarios correctly? |
| D6 | Output format | 15 | Is the output structured and usable by downstream skills? |
| D7 | Workflow completeness | 10 | Is the step-by-step workflow complete and actionable? |
| D8 | Integration | 10 | Does the skill correctly reference related skills and hand off cleanly? |
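As a rough illustration, the rubric above can be held as a small lookup table. This is a minimal sketch, not the project's actual skill-judge code: the names `RUBRIC_MAX_POINTS` and `total_score` are hypothetical, and only the dimension IDs and maximum points come from the table.

```python
# Maximum points per dimension, copied from the table above.
# The dictionary name and layout are illustrative assumptions.
RUBRIC_MAX_POINTS = {
    "D1": 20,  # Trigger accuracy
    "D2": 15,  # Scope clarity
    "D3": 15,  # Decision quality
    "D4": 18,  # Anti-pattern coverage
    "D5": 17,  # Edge-case handling
    "D6": 15,  # Output format
    "D7": 10,  # Workflow completeness
    "D8": 10,  # Integration
}


def total_score(dimension_scores: dict) -> int:
    """Validate per-dimension scores against the rubric and return their sum."""
    if set(dimension_scores) != set(RUBRIC_MAX_POINTS):
        raise ValueError("scores must cover exactly D1 through D8")
    for dim, score in dimension_scores.items():
        maximum = RUBRIC_MAX_POINTS[dim]
        if not 0 <= score <= maximum:
            raise ValueError(f"{dim} score {score} is outside 0..{maximum}")
    return sum(dimension_scores.values())


# The eight maxima add up to the 120-point total stated above.
assert sum(RUBRIC_MAX_POINTS.values()) == 120
```

Validating each dimension against its own maximum, rather than checking only the total, keeps an out-of-range value in one dimension from being masked by low scores elsewhere.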
Per-Tier Requirements
| Tier | Skills | Required Score | Per-Dimension Floors |
|---|---|---|---|
| Entry-point (A−) | orchestrator, scope, ethics, eval | ≥ 102/120 | D1 ≥ 15, D3 ≥ 10, D4 ≥ 13, D5 ≥ 12 |
| Specialist (A−) | All other specialists | ≥ 96/120 | D1 ≥ 15, D3 ≥ 10, D4 ≥ 13, D5 ≥ 12 |
| Mindset stub (B+) | codeswitch, historical, lexicon | ≥ 96/120 | D1 ≥ 12, D3 ≥ 8, D4 ≥ 10, D5 ≥ 9 |
Entry-point skills have a higher total-score requirement because failures there cascade through the entire pipeline. A wrong trigger on linguistic-orchestrator misdirects the whole session; a wrong trigger on linguistic-morph only affects the Analyze phase.
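The gate combines two checks: a minimum total and a floor on four individual dimensions. The sketch below shows one way this could be expressed. The thresholds are taken from the table above; the `TierRequirement` class, the tier keys, the `gate` function, and the example scores are all hypothetical and do not reflect the real skill-judge implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierRequirement:
    min_total: int  # required score out of 120
    floors: dict    # per-dimension minimums


# Thresholds from the Per-Tier Requirements table; the keys are illustrative.
TIERS = {
    "entry-point": TierRequirement(102, {"D1": 15, "D3": 10, "D4": 13, "D5": 12}),
    "specialist": TierRequirement(96, {"D1": 15, "D3": 10, "D4": 13, "D5": 12}),
    "mindset-stub": TierRequirement(96, {"D1": 12, "D3": 8, "D4": 10, "D5": 9}),
}


def gate(tier: str, scores: dict) -> list:
    """Return the reasons a skill fails its tier; an empty list means it passes."""
    requirement = TIERS[tier]
    failures = []
    total = sum(scores.values())
    if total < requirement.min_total:
        failures.append(f"total {total} < {requirement.min_total}")
    for dim, floor in requirement.floors.items():
        got = scores.get(dim, 0)
        if got < floor:
            failures.append(f"{dim} {got} < floor {floor}")
    return failures


# Hypothetical scores: the total (104) clears the entry-point bar,
# but D5 sits below its floor, so the skill is still rejected.
example_scores = {"D1": 20, "D2": 15, "D3": 15, "D4": 18,
                  "D5": 11, "D6": 15, "D7": 5, "D8": 5}
print(gate("entry-point", example_scores))  # ['D5 11 < floor 12']
```

The example illustrates why the floors exist: a skill cannot offset weak edge-case handling with high scores in other dimensions.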
Scores by Skill (Snapshot 2026-04-23)
| Skill | Score | Tier |
|---|---|---|
| linguistic-ethics | 106/120 | Entry-point A− |
| linguistic-scope | 105/120 | Entry-point A− |
| linguistic-transfer | 105/120 | Specialist A− |
| linguistic-eval | 104/120 | Entry-point A− |
| linguistic-tokenize | 104/120 | Specialist A− |
| linguistic-scripts | 104/120 | Specialist A− |
| linguistic-corpus | 103/120 | Specialist A− |
| linguistic-annotate | 103/120 | Specialist A− |
| linguistic-bitext | 102/120 | Specialist A− |
| linguistic-morph | 102/120 | Specialist A− |
| linguistic-syntax | 102/120 | Specialist A− |
| linguistic-semantics | 102/120 | Specialist A− |
| linguistic-discourse | 102/120 | Specialist A− |
| linguistic-orchestrator | 102/120 | Entry-point A− |
| linguistic-speech | 101/120 | Specialist A− |
| linguistic-lexicon | 98/120 | Mindset stub B+ |
| linguistic-codeswitch | 97/120 | Mindset stub B+ |
| linguistic-historical | 97/120 | Mindset stub B+ |
Eval Methodology
Each skill is evaluated using Anthropic's skill-creator plugin:
- 3 eval prompts for standard skills (the model is run with and without the skill, and the outputs are compared)
- 5 eval prompts for A-tier entry-point skills (broader coverage)
- Baseline runs use the same model version as the with-skill runs, so any difference in output is attributable to the skill (a clean knowledge-delta signal)
Scores are stored in tests/e2e/scores.json (machine-readable) and docs/skill-judge-dashboard.md (human-readable). The regression test (tests/integration/test_regression.py) verifies that earlier-phase skills maintain their scores at every merge.
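The idea behind the regression check can be sketched as a comparison between two score snapshots. The schema of tests/e2e/scores.json is not documented here, so the flat mapping of skill name to total score used below is an assumption, and `find_regressions` is a hypothetical helper rather than the contents of test_regression.py.

```python
import json
from typing import Dict, Optional, Tuple


def find_regressions(
    baseline: Dict[str, int], current: Dict[str, int]
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Map each regressed skill to (old score, new score).

    A skill counts as regressed if its score dropped or if it is
    missing from the current snapshot (new score reported as None).
    """
    regressions = {}
    for skill, old_score in baseline.items():
        new_score = current.get(skill)
        if new_score is None or new_score < old_score:
            regressions[skill] = (old_score, new_score)
    return regressions


# Assumed schema: {"<skill name>": <total score>}. The values are illustrative.
baseline = json.loads('{"linguistic-morph": 102, "linguistic-syntax": 102}')
current = json.loads('{"linguistic-morph": 100, "linguistic-syntax": 103}')

print(find_regressions(baseline, current))  # {'linguistic-morph': (102, 100)}
```

Treating a missing skill as a regression is a design choice in this sketch: it prevents a score from silently disappearing from the snapshot instead of failing the check.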
Why Quality Gating Matters
Without gating, skill quality would drift — a skill that initially handles D5 edge cases correctly could degrade through edits that don't re-test those scenarios. The rubric makes quality measurable and regression-detectable.
For contributors: before submitting a skill improvement PR, run the skill-judge eval locally to verify the score hasn't dropped. The pre-push hook (ruff check --fix && mypy && pytest tests/) catches code-level issues; skill-judge catches content-level issues.