linguistic-eval

Honest evaluation for low-resource LLMs: benchmark selection, metric selection, BLiMP-style grammatical-knowledge probes, and contamination-aware reporting. A-tier because eval results drive release decisions — weak eval cascades into wrong investment for months.

Overview

Picking BLEU over chrF for a morphologically-rich language doesn't just produce one bad number — it produces a misleading number that drives wrong investment decisions for months. linguistic-eval enforces the right metric choices, surfaces contamination risks, and ensures per-dialect/per-register breakdowns that make systematic failures visible.

Pipeline Position

Phase: Evaluate (Phase 3) — orchestrator's last specialist before Release

Before this skill: All Acquire and Analyze phase skills; model training complete

After this skill: linguistic-ethics (Release gate)

When It Activates

Reporting quality numbers for any non-English LLM
Choosing benchmark + metric for a language pair / task
Building grammatical-knowledge probes (BLiMP-style)
Contamination audit (cross-reference linguistic-corpus)
Adding fairness eval (per-dialect / per-register breakdown)

What It Does

Benchmark Selection

Task	Benchmark Options
MT (En ↔ X)	FLORES+ (broad); NTREX-128 (cleaner re contamination); AfroBench, IndicXTREME, SEACrowd (regional)
Reading comprehension	Belebele (122 languages)
NER	MasakhaNER 2.0 (Africa); WikiAnn (broad)
Sentiment	AfriSenti (Africa); IndicSenti; SemEval per-language
QA	TyDi-QA, XQuAD, MLQA
General	XNLI, BIG-bench, BUFFET

FLORES-200 contamination: FLORES-200 has been public since 2022 and is in many pretrain mixes (Llama 3, with pretraining data cutoffs in 2023, has likely seen it). Contaminated scores are inflated, so report FLORES as an optimistic upper bound on quality, not a fair eval. NTREX-128 and Belebele are cleaner alternatives.

Metric Selection

Task	Primary	Secondary	Never Primary
MT (general)	chrF++ + COMET-22	spBLEU	BLEU on morphologically-rich
MT (low COMET coverage)	chrF++ + GEMBA-MQM	spBLEU	BLEU on morphologically-rich
Reading comprehension	accuracy	per-Q breakdown	—
NER	F1 (per-tag)	exact-match	accuracy alone
Sentiment	F1 (per-class)	accuracy	accuracy alone
Speech ASR	CER (preferred)	WER	WER on space-less script
Speech TTS	MOS (human)	PESQ / STOI	automatic metrics alone
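
The ASR row above can be illustrated with a minimal sketch. The helper names `edit_distance`, `cer`, and `wer` are hypothetical and not taken from any particular toolkit. On text written without spaces the whole utterance is a single "word", so WER collapses to all-or-nothing, while CER still reflects how close the hypothesis is.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate using whitespace tokenization."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


# Illustrative example: one wrong character in a six-character sentence
# written without spaces.
reference = "今天天气很好"
hypothesis = "今天天气真好"

space_less_wer = wer(reference, hypothesis)  # whole sentence counts as one wrong word
space_less_cer = cer(reference, hypothesis)  # one edit out of six characters
```

A production pipeline would normally rely on an established scoring tool. This sketch only shows why the two metrics diverge on unsegmented text.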

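BLEU is pathological for morphologically-rich languages. It matches whole-word n-grams, so a single wrong morpheme turns the entire word into a miss, and the near-miss is penalized as harshly as a full mistranslation of that word. For Turkish, Finnish, Swahili, Yoruba, Inuktitut: chrF/chrF++ are primary and spBLEU is secondary, as in the table above. Report BLEU as supplementary only.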
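
The difference between word-level and character-level matching can be sketched as follows. This is a simplified character n-gram F-score written for illustration, not the official chrF/chrF++ implementation; the function names are hypothetical, and real reports should use an established scorer. The Turkish example compares a one-morpheme near-miss ("in my houses" for "in our houses") with an unrelated word ("in the car").

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: mean n-gram precision and recall combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def word_match_rate(hypothesis: str, reference: str) -> float:
    """Fraction of reference words matched exactly (the word-level view BLEU builds on)."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    return sum((hyp & ref).values()) / max(sum(ref.values()), 1)


reference = "evlerimizde"  # ev-ler-imiz-de, "in our houses"
near_miss = "evlerimde"    # ev-ler-im-de, "in my houses" (one morpheme off)
unrelated = "arabada"      # "in the car"

# Word-level matching scores both hypotheses identically (zero)...
word_near = word_match_rate(near_miss, reference)
word_unrelated = word_match_rate(unrelated, reference)
# ...while character-level matching gives the near-miss partial credit.
chrf_near = simple_chrf(near_miss, reference)
chrf_unrelated = simple_chrf(unrelated, reference)
```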

COMET coverage varies. COMET-22 has good European + Indic coverage; spotty Bantu / Indigenous Americas. Always check per-language coverage before reporting: COMET still outputs a score for a language outside its coverage, but that number is close to noise. Use GEMBA-MQM (an LLM judge with a structured MQM error rubric) as a supplement or fallback.

Grammatical-Knowledge Probes

English BLiMP-style probes don't transfer. Build per-language minimal-pair probes:

Phenomenon	Example
Subject-verb agreement	"la luna brilla" / *"la luna brillan"
Gender agreement	"el libro" / *"la libro"
Tone (lexical) preservation	Yoruba: á / a / à (high / mid / low) contrast
Case marking	Russian instrumental vs nominative
Word order	SOV vs SVO violation

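Target: ≥100 minimal pairs per phenomenon. For each pair, compute the model's log-likelihood difference, log P(grammatical) - log P(ungrammatical); a difference > 0 counts as a correct preference. Report the fraction of correct pairs per phenomenon.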
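
A minimal sketch of the scoring rule, assuming a hypothetical `log_likelihood(sentence)` callable that returns the model's total log-probability for a sentence. The toy scores below are invented purely to make the example runnable; a real run would sum token log-probabilities from the model under evaluation.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    log_likelihood: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the model prefers the grammatical one."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(
        1 for good, bad in pairs
        if log_likelihood(good) - log_likelihood(bad) > 0
    )
    return correct / len(pairs)


# Invented toy scores standing in for a real model (assumption, not measured data).
TOY_SCORES = {
    "la luna brilla": -4.1,
    "la luna brillan": -7.9,
    "el libro": -2.0,
    "la libro": -1.5,  # the toy "model" gets this pair wrong
}

pairs = [
    ("la luna brilla", "la luna brillan"),  # subject-verb agreement
    ("el libro", "la libro"),               # gender agreement
]
accuracy = minimal_pair_accuracy(pairs, TOY_SCORES.__getitem__)
```

When the two sentences in a pair differ in token count, raw log-likelihood can favor the shorter one, so keeping pairs length-matched is one way to keep the comparison fair.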

Per-Stratum Breakdown

Always report:

Per-dialect (Egyptian Arabic vs MSA; Cusco vs Ayacucho Quechua)
Per-register (Bible vs news vs web vs conversation)
Per-direction (En→X vs X→En — not comparable in difficulty)
Per-class (NER per-tag; sentiment per-class)

Aggregate-only reporting hides systematic failures.
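
A small sketch of how an aggregate can mask a per-dialect failure. The record layout (`dialect`, `correct`) and the counts are invented for illustration and are not results from any real system.

```python
from collections import defaultdict


def stratified_accuracy(records, key):
    """Accuracy per stratum, where each record has a boolean 'correct' field."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [hits, count]
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += int(rec["correct"])
        bucket[1] += 1
    return {stratum: hits / n for stratum, (hits, n) in totals.items()}


# Invented toy results: 10 items per dialect.
records = (
    [{"dialect": "MSA", "correct": True}] * 9
    + [{"dialect": "MSA", "correct": False}] * 1
    + [{"dialect": "Egyptian", "correct": True}] * 3
    + [{"dialect": "Egyptian", "correct": False}] * 7
)

aggregate = sum(r["correct"] for r in records) / len(records)
by_dialect = stratified_accuracy(records, "dialect")
# The aggregate looks moderate, but the breakdown shows that one dialect
# carries most of the errors.
```

The same function works for register, direction, or class by changing `key`.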

Contamination Check (Two-Sided)

(a) Train mix vs eval set: exact + n-gram overlap
(b) Eval set vs base-model pretrain data: approximated via release dates + known inclusions, since the pretrain mix itself is usually not public
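
Side (a) can be sketched as below, assuming the training mix fits in memory as a list of strings. The function names, the 8-word window, and the PASS / FAIL rule (any flagged item fails) are illustrative assumptions, not a prescribed threshold.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not hide overlap."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def word_ngrams(text: str, n: int) -> set:
    """Set of word n-grams from the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_report(train_docs, eval_items, n: int = 8) -> dict:
    """Flag eval items that appear verbatim in, or share an n-gram with, the training docs."""
    train_norm = [f" {normalize(doc)} " for doc in train_docs]
    train_grams = set()
    for doc in train_docs:
        train_grams |= word_ngrams(doc, n)

    flagged = []
    for idx, item in enumerate(eval_items):
        needle = f" {normalize(item)} "
        exact = any(needle in doc for doc in train_norm)    # verbatim containment
        shared = bool(word_ngrams(item, n) & train_grams)   # long shared span
        if exact or shared:
            flagged.append(idx)
    return {"flagged": flagged, "status": "FAIL" if flagged else "PASS"}


train_docs = [
    "Breaking news: the quick brown fox jumps over the lazy dog near the river today.",
    "An unrelated sentence about cooking rice.",
]
eval_items = [
    "The quick brown fox jumps over the lazy dog near the bridge.",
    "Farmers planted maize after the first rains.",
]
report = contamination_report(train_docs, eval_items)
```

For web-scale corpora the in-memory sets would need to be replaced by hashed or sharded indexes. Side (b) cannot be checked this way and remains a judgment based on dates and documented inclusions.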

Inputs & Outputs

Input	Description
Target language + task	For benchmark + metric selection
Joshi class	For benchmark availability expectations
Training corpus manifest	For contamination audit

Output	Description
Benchmark selection	With contamination flag
Metric selection	With rationale
Probe spec	Phenomena + pair count
Contamination report	PASS / FAIL
Stratified results	Per-dialect / register / direction

Example Usage

Language: Yoruba (yor), task: MT En↔Yor

Eval Plan: Yoruba MT
- Benchmark: FLORES+ (flag as inflated upper bound: in pretrain mix);
    NTREX-128 (cleaner; 128 languages includes Yoruba)
- Metric: chrF++ (primary) + GEMBA-MQM (COMET-22 coverage check needed)
    BLEU: supplementary only
- Probes: tone preservation (150 pairs); subject-verb agreement (120 pairs)
- Contamination: FLORES contamination confirmed; report as upper bound
- Stratified: per-direction (En→Yor separately from Yor→En)
    per-register (Bible-domain vs news vs web)
- CER: not applicable (no ASR task)

Related Skills

linguistic-corpus — contamination audit cross-reference
linguistic-syntax — agreement probes from syntax analysis
linguistic-ethics — Release gate after eval

