Quality Gating

Every skill in the Linguistic Agent Skills suite is evaluated against a standardized rubric before it can be used in the pipeline. The rubric, called skill-judge, has 8 dimensions worth 120 points in total. Skills that fall below the threshold for their tier are rejected or flagged for improvement.

The 8 Dimensions

#	Dimension	Max Points	What It Measures
D1	Trigger accuracy	20	Does the skill activate on the right queries and not on wrong ones?
D2	Scope clarity	15	Are the "when to use" and "when NOT to use" boundaries clear and correct?
D3	Decision quality	15	Are the decisions the skill makes technically sound?
D4	Anti-pattern coverage	18	Does the skill prevent the important mistakes?
D5	Edge-case handling	17	Does the skill handle the non-obvious scenarios correctly?
D6	Output format	15	Is the output structured and usable by downstream skills?
D7	Workflow completeness	10	Is the step-by-step workflow complete and actionable?
D8	Integration	10	Does the skill correctly reference related skills and hand off cleanly?
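As a rough illustration of how the rubric adds up, the sketch below encodes the per-dimension maxima from the table and validates a scorecard against them. The `MAX_POINTS` mapping and the `total_score` helper are hypothetical names introduced here for illustration; they are not part of skill-judge itself.

```python
# Maximum points per dimension, copied from the table above.
MAX_POINTS = {
    "D1": 20, "D2": 15, "D3": 15, "D4": 18,
    "D5": 17, "D6": 15, "D7": 10, "D8": 10,
}


def total_score(scorecard: dict) -> int:
    """Check a scorecard against the rubric maxima and return its total.

    Hypothetical helper: assumes a scorecard is a mapping from
    dimension id ("D1".."D8") to awarded points.
    """
    missing = MAX_POINTS.keys() - scorecard.keys()
    unknown = scorecard.keys() - MAX_POINTS.keys()
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    for dim, points in scorecard.items():
        if not 0 <= points <= MAX_POINTS[dim]:
            raise ValueError(f"{dim}={points} is outside 0..{MAX_POINTS[dim]}")
    return sum(scorecard.values())


# The eight maxima account for the full 120 points.
assert sum(MAX_POINTS.values()) == 120
```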

Per-Tier Requirements

Tier	Skills	Required Score	Per-Dimension Floors
Entry-point (A−)	orchestrator, scope, ethics, eval	≥ 102/120	D1 ≥ 15, D3 ≥ 10, D4 ≥ 13, D5 ≥ 12
Specialist (A−)	All other specialists	≥ 96/120	D1 ≥ 15, D3 ≥ 10, D4 ≥ 13, D5 ≥ 12
Mindset stub (B+)	codeswitch, historical, lexicon	≥ 96/120	D1 ≥ 12, D3 ≥ 8, D4 ≥ 10, D5 ≥ 9

Entry-point skills have higher requirements because failures there cascade through the entire pipeline. A wrong trigger on linguistic-orchestrator misdirects the whole session; a wrong trigger on linguistic-morph only affects the Analyze phase.
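The gate described above combines a total-score threshold with per-dimension floors, so a skill can reach the required total and still fail. The following is a minimal sketch of that check, assuming a scorecard is a mapping from dimension id to points. The thresholds are copied from the table; the tier keys, the `Tier` class, `gate_failures`, and the example scorecard are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    min_total: int   # required total out of 120
    floors: dict     # per-dimension minimum points


# Thresholds copied from the Per-Tier Requirements table.
# The dictionary keys are illustrative names, not official identifiers.
TIERS = {
    "entry-point": Tier(102, {"D1": 15, "D3": 10, "D4": 13, "D5": 12}),
    "specialist": Tier(96, {"D1": 15, "D3": 10, "D4": 13, "D5": 12}),
    "mindset-stub": Tier(96, {"D1": 12, "D3": 8, "D4": 10, "D5": 9}),
}


def gate_failures(tier_name: str, scorecard: dict) -> list:
    """Return human-readable reasons a scorecard fails its tier (empty = pass)."""
    tier = TIERS[tier_name]
    failures = []
    total = sum(scorecard.values())
    if total < tier.min_total:
        failures.append(f"total {total} < {tier.min_total}")
    for dim, floor in tier.floors.items():
        got = scorecard.get(dim, 0)
        if got < floor:
            failures.append(f"{dim} {got} < {floor}")
    return failures


# Made-up scorecard: the total is 102, but D4 sits below its floor.
example = {"D1": 18, "D2": 13, "D3": 13, "D4": 12,
           "D5": 15, "D6": 13, "D7": 9, "D8": 9}
print(gate_failures("entry-point", example))   # ['D4 12 < 13']
print(gate_failures("mindset-stub", example))  # []
```

Returning the list of reasons rather than a bare boolean makes it easier to report which dimension needs work when a skill is flagged for improvement.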

Scores by Skill (Snapshot 2026-04-23)

Skill	Score	Tier
`linguistic-ethics`	106/120	Entry-point A−
`linguistic-scope`	105/120	Entry-point A−
`linguistic-transfer`	105/120	Specialist A−
`linguistic-eval`	104/120	Entry-point A−
`linguistic-tokenize`	104/120	Specialist A−
`linguistic-scripts`	104/120	Specialist A−
`linguistic-corpus`	103/120	Specialist A−
`linguistic-annotate`	103/120	Specialist A−
`linguistic-bitext`	102/120	Specialist A−
`linguistic-morph`	102/120	Specialist A−
`linguistic-syntax`	102/120	Specialist A−
`linguistic-semantics`	102/120	Specialist A−
`linguistic-discourse`	102/120	Specialist A−
`linguistic-orchestrator`	102/120	Entry-point A−
`linguistic-speech`	101/120	Specialist A−
`linguistic-lexicon`	98/120	Mindset stub B+
`linguistic-codeswitch`	97/120	Mindset stub B+
`linguistic-historical`	97/120	Mindset stub B+

Eval Methodology

Each skill is evaluated using Anthropic's skill-creator plugin:

3 eval prompts for standard skills (the model is run with and without the skill, and the outputs are compared)
5 eval prompts for entry-point skills (broader coverage)
Baseline runs use the same model version as the with-skill run, so the difference between the two outputs reflects only what the skill adds

Scores are stored in tests/e2e/scores.json (machine-readable) and docs/skill-judge-dashboard.md (human-readable). The regression test (tests/integration/test_regression.py) verifies that earlier-phase skills maintain their scores at every merge.
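The core of such a regression check can be sketched as a comparison between a baseline score map and the current one. This is not the contents of `test_regression.py`. It assumes, purely for illustration, that `scores.json` deserializes to a flat mapping from skill name to total score, and `find_regressions` is a hypothetical helper.

```python
from __future__ import annotations


def find_regressions(
    baseline: dict[str, int],
    current: dict[str, int],
) -> dict[str, tuple[int, int | None]]:
    """Return skills whose score dropped or disappeared.

    Assumed schema (not confirmed by the source): each argument maps a
    skill name to its total score, e.g. the result of json.load() on a
    scores file. The value is (old_score, new_score), with new_score set
    to None when the skill is missing from the current scores.
    """
    regressions = {}
    for skill, old in baseline.items():
        new = current.get(skill)
        if new is None or new < old:
            regressions[skill] = (old, new)
    return regressions


# Baseline values taken from the snapshot table; the "current" values are made up.
baseline = {"linguistic-ethics": 106, "linguistic-morph": 102}
current = {"linguistic-ethics": 106, "linguistic-morph": 100}
print(find_regressions(baseline, current))  # {'linguistic-morph': (102, 100)}
```

A test built on this idea would assert that the returned mapping is empty, so any drop fails the merge and names the affected skill.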

Why Quality Gating Matters

Without gating, skill quality would drift: a skill that initially handles D5 edge cases correctly could degrade through later edits that are never re-tested against those scenarios. The rubric makes quality measurable and makes regressions detectable.

For contributors: before submitting a skill improvement PR, run the skill-judge eval locally to verify the score hasn't dropped. The pre-push hook (ruff check --fix && mypy && pytest tests/) catches code-level issues; skill-judge catches content-level issues.
