linguistic-eval
Honest evaluation for low-resource LLMs: benchmark selection, metric selection, BLiMP-style grammatical-knowledge probes, and contamination-aware reporting. A-tier because eval results drive release decisions — weak eval cascades into wrong investment for months.
Overview
Picking BLEU over chrF for a morphologically-rich language doesn't just produce one bad number — it produces a misleading number that drives wrong investment decisions for months. linguistic-eval enforces the right metric choices, surfaces contamination risks, and ensures per-dialect/per-register breakdowns that make systematic failures visible.
Pipeline Position
Phase: Evaluate (Phase 3) — orchestrator's last specialist before Release
Before this skill: All Acquire and Analyze phase skills; model training complete
After this skill: linguistic-ethics (Release gate)
When It Activates
- Reporting quality numbers for any non-English LLM
- Choosing benchmark + metric for a language pair / task
- Building grammatical-knowledge probes (BLiMP-style)
- Contamination audit (cross-reference linguistic-corpus)
- Adding fairness eval (per-dialect / per-register breakdown)
What It Does
Benchmark Selection
| Task | Benchmark Options |
|---|---|
| MT (En ↔ X) | FLORES+ (broad); NTREX-128 (lower contamination risk); AfroBench, IndicXTREME, SEACrowd (regional) |
| Reading comprehension | Belebele (122 languages) |
| NER | MasakhaNER 2.0 (Africa); WikiAnn (broad) |
| Sentiment | AfriSenti (Africa); IndicSenti; SemEval per-language |
| QA | TyDi-QA, XQuAD, MLQA |
| General | XNLI, BIG-bench, BUFFET |
FLORES-200 contamination: FLORES-200 has been public since 2022 and is in many pretrain mixes, so recent base models (Llama 3, for example) have likely seen it. Contaminated scores overstate quality, so report FLORES as an upper bound on quality, not a fair eval. NTREX-128 and Belebele are cleaner alternatives.
Metric Selection
| Task | Primary | Secondary | Never Primary |
|---|---|---|---|
| MT (general) | chrF++ + COMET-22 | spBLEU | BLEU on morphologically-rich |
| MT (low COMET coverage) | chrF++ + GEMBA-MQM | spBLEU | BLEU on morphologically-rich |
| Reading comprehension | accuracy | per-Q breakdown | — |
| NER | F1 (per-tag) | exact-match | accuracy alone |
| Sentiment | F1 (per-class) | accuracy | accuracy alone |
| Speech ASR | CER (preferred) | WER | WER on space-less script |
| Speech TTS | MOS (human) | PESQ / STOI | metric-only |
BLEU is pathological for morphologically-rich languages. BLEU matches whole words, so a single wrong suffix turns the entire word into a miss and is penalized as harshly as a mistranslated word. For Turkish, Finnish, Swahili, Yoruba, Inuktitut: chrF/chrF++ are primary, with spBLEU as secondary. Report BLEU as supplementary only.
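A minimal sketch of why character n-gram metrics are gentler on inflection errors than word-level matching. It is a simplified pure-Python character n-gram F-score, not the official chrF++ implementation, and `simple_chrf` and `word_match_rate` are illustrative names. For reported numbers, use an established implementation such as sacreBLEU.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams with whitespace removed."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over precision/recall averaged across n = 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # string too short for this n-gram order
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

def word_match_rate(hyp: str, ref: str) -> float:
    """Clipped unigram precision over whitespace tokens (what word-level metrics see)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    return sum((h & r).values()) / max(sum(h.values()), 1)

# Turkish: one wrong case suffix (locative instead of ablative).
ref = "evlerimizden geldik"  # "we came from our houses"
hyp = "evlerimizde geldik"   # "-de" instead of "-den"
print(f"word match rate: {word_match_rate(hyp, ref):.2f}")
print(f"simplified chrF: {simple_chrf(hyp, ref):.2f}")
```

The word-level view counts the whole inflected word as wrong, while the character-level score still credits the shared stem and most of the suffix chain.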
COMET coverage varies. COMET-22 has good European + Indic coverage but spotty coverage for Bantu and Indigenous American languages. Always check per-language coverage before reporting: COMET still returns a score for a language its encoder was not trained on, but that score is close to noise. Use GEMBA-MQM (an LLM judge with a structured MQM error rubric) as a supplement or fallback.
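For the ASR row of the metric table, a small sketch of why WER breaks down on scripts written without spaces. It assumes whitespace tokenization for WER and counts every character (spaces included) for CER. The function names are illustrative, not a library API.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref_units: list, hyp_units: list) -> float:
    """Edit distance normalized by reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)

def cer(ref: str, hyp: str) -> float:
    return error_rate(list(ref), list(hyp))

def wer(ref: str, hyp: str) -> float:
    return error_rate(ref.split(), hyp.split())

# Japanese has no spaces, so each sentence is a single whitespace "word".
ref = "今日は晴れです"  # "it is sunny today"
hyp = "今日は雨です"    # "it is rainy today"
print(f"WER: {wer(ref, hyp):.2f}")
print(f"CER: {cer(ref, hyp):.2f}")
```

With whitespace tokenization, any error in the sentence makes the single "word" wrong, so WER saturates and stops discriminating between near-misses and garbage. CER still reflects how much of the transcript is correct.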
Grammatical-Knowledge Probes
English BLiMP-style probes don't transfer. Build per-language minimal-pair probes:
| Phenomenon | Example |
|---|---|
| Subject-verb agreement | "la luna brilla" / *"la luna brillan" |
| Gender agreement | "el libro" / *"la libro" |
| Tone (lexical) preservation | Yoruba: á / a / à (high / mid / low) contrast |
| Case marking | Russian instrumental vs nominative |
| Word order | SOV vs SVO violation |
Target: ≥100 minimal pairs per phenomenon. For each pair, compute the model log-likelihood difference, log P(grammatical) - log P(ungrammatical); a difference > 0 counts as a correct preference.
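The scoring rule above can be sketched as follows. `log_likelihood` is a hypothetical callable standing in for whatever returns the model's summed token log-probability for a sentence, and the toy scores are invented for illustration only.

```python
from typing import Callable, Iterable, Tuple

def probe_accuracy(
    pairs: Iterable[Tuple[str, str]],
    log_likelihood: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the model prefers the grammatical one."""
    correct = 0
    total = 0
    for good, bad in pairs:
        total += 1
        # Correct preference: log P(good) - log P(bad) > 0
        if log_likelihood(good) - log_likelihood(bad) > 0:
            correct += 1
    if total == 0:
        raise ValueError("no minimal pairs supplied")
    return correct / total

# Stand-in scores; a real probe would query the model under evaluation.
toy_scores = {
    "la luna brilla": -10.0,
    "la luna brillan": -14.0,
    "el libro": -6.0,
    "la libro": -5.5,
}
pairs = [("la luna brilla", "la luna brillan"), ("el libro", "la libro")]
print(probe_accuracy(pairs, toy_scores.__getitem__))
```

Run this per phenomenon and report each accuracy separately, so that a model that handles agreement but fails on case marking is visible as such.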
Per-Stratum Breakdown
Always report:
- Per-dialect (Egyptian Arabic vs MSA; Cusco vs Ayacucho Quechua)
- Per-register (Bible vs news vs web vs conversation)
- Per-direction (En→X vs X→En — not comparable in difficulty)
- Per-class (NER per-tag; sentiment per-class)
Aggregate-only reporting hides systematic failures.
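A short sketch of a stratified report, assuming each eval record carries a stratum label and a boolean `correct` field. The record layout and the numbers are invented for illustration.

```python
from collections import defaultdict

def accuracy_by_stratum(records, key):
    """Return {stratum: accuracy} for records carrying a boolean 'correct' field."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_correct, n_total]
    for rec in records:
        counts[rec[key]][0] += int(rec["correct"])
        counts[rec[key]][1] += 1
    return {stratum: c / n for stratum, (c, n) in counts.items()}

# Invented toy results, for illustration only.
records = (
    [{"dialect": "MSA", "correct": True}] * 9
    + [{"dialect": "MSA", "correct": False}] * 1
    + [{"dialect": "Egyptian", "correct": True}] * 2
    + [{"dialect": "Egyptian", "correct": False}] * 3
)

aggregate = sum(r["correct"] for r in records) / len(records)
per_dialect = accuracy_by_stratum(records, "dialect")

print(f"aggregate: {aggregate:.2f}")
for dialect, acc in per_dialect.items():
    print(f"{dialect}: {acc:.2f}")
```

In this toy data the aggregate looks acceptable while one dialect fails most of the time, which is the pattern the breakdown is meant to expose. The same function works for register, direction, or class by changing `key`.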
Contamination Check (Two-Sided)
- (a) Train mix vs eval set: exact + n-gram overlap
- (b) Eval set vs base-model pretrain proxies: proxy via release dates + known inclusions
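Side (a) can be sketched as below. The n-gram size, the overlap threshold, and the whitespace tokenization are all assumptions to tune per language (whitespace tokens will not work for space-less scripts), and `contamination_report` is an illustrative name. Side (b) needs release dates and known-inclusion lists, so it is not covered here.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def token_ngrams(text: str, n: int) -> set:
    """Set of whitespace-token n-grams in the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_report(train_docs, eval_items, n=8, threshold=0.5):
    """Flag eval items found verbatim in training docs or sharing many n-grams with them."""
    train_norm = [f" {normalize(d)} " for d in train_docs]
    train_grams = set()
    for doc in train_docs:
        train_grams |= token_ngrams(doc, n)

    flagged = []
    for idx, item in enumerate(eval_items):
        item_norm = normalize(item)
        # Exact check: the eval item appears verbatim (on token boundaries) in a training doc.
        exact = bool(item_norm) and any(f" {item_norm} " in d for d in train_norm)
        grams = token_ngrams(item, n)
        ratio = len(grams & train_grams) / len(grams) if grams else 0.0
        if exact or ratio >= threshold:
            flagged.append({"index": idx, "exact": exact, "ngram_overlap": ratio})
    return {"status": "FAIL" if flagged else "PASS", "flagged": flagged}

train_docs = ["The quick brown fox jumps over the lazy dog near the river bank"]
eval_items = [
    "quick brown fox jumps over the lazy dog",
    "An entirely unrelated sentence about tonal languages",
]
print(contamination_report(train_docs, eval_items, n=3))
```

The substring scan is quadratic in corpus size, so a real audit over a full training manifest would need hashing or an index. The sketch only shows the shape of the PASS / FAIL decision.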
Inputs & Outputs
| Input | Description |
|---|---|
| Target language + task | For benchmark + metric selection |
| Joshi class | For benchmark availability expectations |
| Training corpus manifest | For contamination audit |

| Output | Description |
|---|---|
| Benchmark selection | With contamination flag |
| Metric selection | With rationale |
| Probe spec | Phenomena + pair count |
| Contamination report | PASS / FAIL |
| Stratified results | Per-dialect / register / direction |
Example Usage
Language: Yoruba (yor), task: MT En↔Yor
Eval Plan: Yoruba MT
- Benchmark: FLORES+ (flag as upper bound; in pretrain mix);
  NTREX-128 (cleaner; its 128 languages include Yoruba)
- Metric: chrF++ (primary) + GEMBA-MQM (COMET-22 coverage check needed);
  BLEU: supplementary only
- Probes: tone preservation (150 pairs); subject-verb agreement (120 pairs)
- Contamination: FLORES contamination confirmed; report as upper bound
- Stratified: per-direction (En→Yor separately from Yor→En);
  per-register (Bible-domain vs news vs web)
- CER: not applicable (no ASR task)
Related Skills
- linguistic-corpus: contamination audit cross-reference
- linguistic-syntax: agreement probes from syntax analysis
- linguistic-ethics: Release gate after eval
Related Skills from Other Suites
- Statistical Analysis — quantitative assessment methods