
linguistic-eval

Honest evaluation for low-resource LLMs: benchmark selection, metric selection, BLiMP-style grammatical-knowledge probes, and contamination-aware reporting. A-tier because eval results drive release decisions — weak eval cascades into wrong investment for months.

Overview

Picking BLEU over chrF for a morphologically-rich language doesn't just produce one bad number — it produces a misleading number that drives wrong investment decisions for months. linguistic-eval enforces the right metric choices, surfaces contamination risks, and ensures per-dialect/per-register breakdowns that make systematic failures visible.

Pipeline Position

Phase: Evaluate (Phase 3) — orchestrator's last specialist before Release

Before this skill: All Acquire and Analyze phase skills; model training complete

After this skill: linguistic-ethics (Release gate)

When It Activates

  • Reporting quality numbers for any non-English LLM
  • Choosing benchmark + metric for a language pair / task
  • Building grammatical-knowledge probes (BLiMP-style)
  • Contamination audit (cross-reference linguistic-corpus)
  • Adding fairness eval (per-dialect / per-register breakdown)

What It Does

Benchmark Selection

| Task | Benchmark Options |
| --- | --- |
| MT (En ↔ X) | FLORES+ (broad); NTREX-128 (cleaner re contamination); AfroBench, IndicXTREME, SEACrowd (regional) |
| Reading comprehension | Belebele (122 languages) |
| NER | MasakhaNER 2.0 (Africa); WikiAnn (broad) |
| Sentiment | AfriSenti (Africa); IndicSenti; SemEval per-language |
| QA | TyDi-QA, XQuAD, MLQA |
| General | XNLI, BIG-bench, BUFFET |

FLORES-200 contamination: FLORES-200 is in many pretrain mixes (Llama 3, whose data cutoff postdates the FLORES-200 release, has likely seen it). Contamination inflates scores, so report FLORES results as an optimistic upper bound on quality, not a fair eval. NTREX-128 and Belebele are cleaner alternatives.

Metric Selection

| Task | Primary | Secondary | Never Primary |
| --- | --- | --- | --- |
| MT (general) | chrF++ + COMET-22 | spBLEU | BLEU on morphologically-rich languages (MRL) |
| MT (low COMET coverage) | chrF++ + GEMBA-MQM | spBLEU | BLEU on MRL |
| Reading comprehension | accuracy | per-Q breakdown | |
| NER | F1 (per-tag) | exact-match | accuracy alone |
| Sentiment | F1 (per-class) | accuracy | accuracy alone |
| Speech ASR | CER (preferred) | WER | WER on space-less script |
| Speech TTS | MOS (human) | PESQ / STOI | metric-only |

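BLEU is pathological for morphologically-rich languages. BLEU matches whole word tokens, so a single wrong morpheme turns the entire word, and every n-gram containing it, into a miss; a near-correct inflection is penalized almost as harshly as a full mistranslation. For Turkish, Finnish, Swahili, Yoruba, Inuktitut: chrF/chrF++ are primary, with spBLEU as secondary. Report BLEU as supplementary only.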
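
To make the word-level vs character-level difference concrete, here is a minimal sketch. It is a simplified stand-in, not sacreBLEU's BLEU or chrF++: the helper names (`overlap_f1`, `word_score`, `char_score`), the n-gram orders, and the plain F1 averaging are assumptions chosen for illustration. The Turkish sentences are illustrative examples, not benchmark data.

```python
from collections import Counter

def ngrams(units, n):
    """Multiset of n-grams over a sequence of units (words or characters)."""
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))

def overlap_f1(hyp_units, ref_units, max_n):
    """Mean n-gram F1 over orders 1..max_n.

    Simplified on purpose: no brevity penalty, no smoothing, no beta weighting.
    """
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hyp_units, n), ngrams(ref_units, n)
        if not hyp or not ref:
            continue  # sequence shorter than n: skip this order
        matched = sum((hyp & ref).values())
        if matched == 0:
            scores.append(0.0)
            continue
        precision = matched / sum(hyp.values())
        recall = matched / sum(ref.values())
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores) if scores else 0.0

def word_score(hyp, ref):
    """Word-level overlap (BLEU-like view): whole tokens must match."""
    return overlap_f1(hyp.split(), ref.split(), max_n=2)

def char_score(hyp, ref):
    """Character-level overlap (chrF-like view): partial word matches count."""
    return overlap_f1(list(hyp.replace(" ", "")), list(ref.replace(" ", "")), max_n=6)

REF = "evlerimizden geldik"    # "we came from our houses"
NEAR = "evlerimizde geldik"    # one case suffix wrong (-de instead of -den)
WRONG = "arabaya bindik"       # unrelated sentence: "we got in the car"

for name, hyp in [("near miss", NEAR), ("mistranslation", WRONG)]:
    print(f"{name}: word={word_score(hyp, REF):.2f} char={char_score(hyp, REF):.2f}")
```

With these toy sentences the near miss loses most of its word-level score while keeping most of its character-level score, which is the behavior that motivates chrF-family metrics as primary.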

COMET coverage varies. COMET-22 has good European and Indic coverage but spotty coverage for Bantu and Indigenous American languages. Always check per-language coverage before reporting: COMET still returns a score for a language it does not cover, but that number is unreliable. Use GEMBA-MQM (an LLM judge with a structured MQM error rubric) as a supplement or fallback.
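
The MT rows of the metric table and the COMET fallback rule can be sketched as a small lookup. The function name `select_mt_metrics` and its boolean argument are hypothetical; deciding whether COMET-22 actually covers a given language is left to the caller.

```python
def select_mt_metrics(comet_covers_language: bool) -> dict:
    """Return the MT metric plan from the table above.

    comet_covers_language is an assumption supplied by the caller after
    checking per-language coverage; this sketch does not verify it.
    """
    learned_metric = "COMET-22" if comet_covers_language else "GEMBA-MQM"
    return {
        "primary": ["chrF++", learned_metric],
        "secondary": ["spBLEU"],
        "supplementary": ["BLEU"],  # never primary for MT
    }

print(select_mt_metrics(comet_covers_language=False))
```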

Grammatical-Knowledge Probes

English BLiMP-style probes don't transfer. Build per-language minimal-pair probes:

| Phenomenon | Example |
| --- | --- |
| Subject-verb agreement | "la luna brilla" / *"la luna brillan" |
| Gender agreement | "el libro" / *"la libro" |
| Tone (lexical) preservation | Yoruba: á/à/ọ̀ contrast |
| Case marking | Russian instrumental vs nominative |
| Word order | SOV vs SVO violation |

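Target: ≥100 minimal pairs per phenomenon. For each pair, compute log P(grammatical) − log P(ungrammatical) under the model; a difference > 0 counts as a correct preference, and the fraction of pairs with a correct preference is the probe accuracy.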
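
A minimal sketch of the scoring step, assuming the caller supplies a `log_likelihood` function that returns the model's total log-probability for a sentence. The function name `probe_accuracy` is hypothetical, and the scores in `TOY_SCORES` are made-up numbers standing in for a real model.

```python
from typing import Callable, Iterable, Tuple

def probe_accuracy(
    pairs: Iterable[Tuple[str, str]],
    log_likelihood: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs the model prefers correctly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(
        1 for good, bad in pairs
        if log_likelihood(good) - log_likelihood(bad) > 0
    )
    return correct / len(pairs)

# Made-up log-likelihoods for illustration only; a real probe would query the model.
TOY_SCORES = {
    "la luna brilla": -12.1,
    "la luna brillan": -15.8,
    "el libro": -6.3,
    "la libro": -5.9,  # toy model wrongly prefers the ungrammatical form
}
PAIRS = [("la luna brilla", "la luna brillan"), ("el libro", "la libro")]

print(probe_accuracy(PAIRS, TOY_SCORES.__getitem__))
```

Reporting this accuracy per phenomenon, rather than pooled, keeps a weak phenomenon (here, gender agreement in the toy data) from being averaged away.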

Per-Stratum Breakdown

Always report:

  • Per-dialect (Egyptian Arabic vs MSA; Cusco vs Ayacucho Quechua)
  • Per-register (Bible vs news vs web vs conversation)
  • Per-direction (En→X vs X→En — not comparable in difficulty)
  • Per-class (NER per-tag; sentiment per-class)

Aggregate-only reporting hides systematic failures.
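
A small sketch of why the breakdown matters, using made-up per-example results. The record layout (`dialect`, `correct`) and the helper name `stratified_accuracy` are assumptions for illustration.

```python
from collections import defaultdict

def stratified_accuracy(records, key):
    """Accuracy per stratum, where each record has a boolean 'correct' field."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [hits, count]
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += int(rec["correct"])
        bucket[1] += 1
    return {stratum: hits / count for stratum, (hits, count) in totals.items()}

# Toy data: 9/10 correct on MSA, 3/10 correct on Egyptian Arabic.
records = (
    [{"dialect": "MSA", "correct": i < 9} for i in range(10)]
    + [{"dialect": "Egyptian", "correct": i < 3} for i in range(10)]
)

aggregate = sum(r["correct"] for r in records) / len(records)
per_dialect = stratified_accuracy(records, "dialect")
print("aggregate:", aggregate)
print("per-dialect:", per_dialect)
```

In this toy data the aggregate looks moderate while one dialect is failing badly, which is exactly what an aggregate-only report would hide.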

Contamination Check (Two-Sided)

(a) Train mix vs eval set: exact + n-gram overlap
(b) Eval set vs base-model pretrain proxies: proxy via release dates + known inclusions
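
Side (a) can be sketched as follows. This is a minimal illustration, not a production deduplication pipeline: the function names, the 8-gram window, and the 0.5 flagging threshold are all assumptions, and whitespace tokenization will not suit space-less scripts. Side (b) depends on release dates and published data cards, so it is not automated here.

```python
def word_ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_report(train_docs, eval_items, n=8, threshold=0.5):
    """Flag eval items that match a training doc exactly or share many n-grams.

    n and threshold are illustrative defaults, not recommended values.
    """
    train_exact = {" ".join(doc.lower().split()) for doc in train_docs}
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= word_ngrams(doc, n)

    flagged = []
    for idx, item in enumerate(eval_items):
        normalized = " ".join(item.lower().split())
        grams = word_ngrams(item, n)
        # Items shorter than n tokens have no n-grams; only exact match applies.
        overlap = len(grams & train_ngrams) / len(grams) if grams else 0.0
        if normalized in train_exact or overlap >= threshold:
            flagged.append((idx, round(overlap, 2)))
    return {"status": "FAIL" if flagged else "PASS", "flagged": flagged}

train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
evalset = [
    "quick brown fox jumps over the lazy dog near the river",
    "a completely different sentence about evaluation of low resource language models",
]
print(contamination_report(train, evalset))
```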

Inputs & Outputs

| Input | Description |
| --- | --- |
| Target language + task | For benchmark + metric selection |
| Joshi class | For benchmark availability expectations |
| Training corpus manifest | For contamination audit |

| Output | Description |
| --- | --- |
| Benchmark selection | With contamination flag |
| Metric selection | With rationale |
| Probe spec | Phenomena + pair count |
| Contamination report | PASS / FAIL |
| Stratified results | Per-dialect / register / direction |

Example Usage

Language: Yoruba (yor), task: MT En↔Yor

Eval Plan: Yoruba MT
- Benchmark: FLORES+ (flag as optimistic upper bound: in pretrain mix);
    NTREX-128 (cleaner; 128 languages includes Yoruba)
- Metric: chrF++ (primary) + GEMBA-MQM (COMET-22 coverage check needed)
    BLEU: supplementary only
- Probes: tone preservation (150 pairs); subject-verb agreement (120 pairs)
- Contamination: FLORES contamination confirmed; scores likely inflated, report as upper bound
- Stratified: per-direction (En→Yor separately from Yor→En)
    per-register (Bible-domain vs news vs web)
- CER: not applicable (no ASR task)