Joshi Classification
The Joshi classification system, introduced in Joshi et al. (ACL 2020, "The State and Fate of Linguistic Diversity and Inclusion in the NLP World"), provides a 6-level taxonomy of language resource availability. Every downstream decision in the linguistic pipeline — tokenizer strategy, adapter configuration, eval benchmark — depends on the Joshi class of the target language.
The Six Classes
Class 0 — "The Left-Behinds"
Characteristics: No labeled data. Often no written standard. Frequently endangered or dormant. NLP tooling essentially nonexistent.
Examples: Dahalo (Kenya, ~400 speakers), Yagua (Peru/Colombia, ~6,000 speakers), many Indigenous Americas languages, many languages documented only in field archives.
Typical strategy:
- Field documentation partnership (ELAR, AILLA, PARADISEC)
- Bootstrap from related Class 3+ language via cognate sets (linguistic-historical)
- FPIC mandatory: community partnership before any data acquisition
- Synthetic bitext via dictionary substitution and Swadesh-list bootstrap
- HyperOfa vocab extension + LoRA from multilingual base (mBART, NLLB, BLOOM)
Eval: No benchmarks exist. Construct custom probes; report as "prototype quality" with wide error bars.
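The dictionary-substitution step listed above can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: the function names, the `unk_marker` parameter, and the toy lexicon (with made-up target forms) are all assumptions introduced here. Word-by-word substitution ignores morphology and word order, so the output is only a rough bootstrap signal.

```python
import re

def dictionary_substitute(sentence, lexicon, unk_marker=None):
    """Word-by-word substitution of a source sentence using a bilingual lexicon.

    Words missing from the lexicon are copied through unchanged (or replaced
    by `unk_marker` when one is given), so coverage gaps stay visible.
    """
    out = []
    # Split into word tokens and single punctuation marks.
    for token in re.findall(r"\w+|[^\w\s]", sentence):
        key = token.lower()
        if key in lexicon:
            out.append(lexicon[key])
        elif unk_marker is not None and token.isalpha():
            out.append(unk_marker)
        else:
            out.append(token)
    return " ".join(out)

def make_synthetic_bitext(sentences, lexicon):
    """Return (source, synthetic_target, lexicon_coverage) triples."""
    triples = []
    for s in sentences:
        words = [w.lower() for w in re.findall(r"\w+", s)]
        covered = sum(w in lexicon for w in words)
        coverage = covered / len(words) if words else 0.0
        triples.append((s, dictionary_substitute(s, lexicon), coverage))
    return triples

# Placeholder lexicon: made-up target forms, not real data for any language.
toy_lexicon = {"water": "TGT_water", "fire": "TGT_fire", "two": "TGT_two"}
pairs = make_synthetic_bitext(["Two fire, water!", "The water"], toy_lexicon)
```

Recording coverage per sentence makes it possible to drop pairs that are mostly untranslated before they reach training.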
Class 1 — "The Scraping-Bys"
Characteristics: Wikipedia dump exists. Minimal labeled data (possibly one NER or POS dataset). No comprehensive benchmarks. Growing online presence.
Examples: Igbo (Nigeria, ~30M speakers), Marathi (India, lower end), many West African languages (Fon, Ewe), several Indigenous Americas languages with revitalization programs.
Typical strategy:
- Continued pretraining on Wikipedia + OPUS + available web data
- OFA or HyperOfa vocab extension depending on parallel data availability
- LoRA r=16–32 from a multilingual base
- Tokenizer: fertility audit mandatory; byte fallback mandatory
- Ethics: check community norms even for CC-BY data
Eval: FLORES-200 if included; MasakhaNER for NER (African languages); custom probes for other tasks.
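A fertility audit, as required in the strategy list above, measures how many subword tokens a tokenizer spends per word. The sketch below assumes whitespace-delimited words, which does not hold for unsegmented scripts, and takes the tokenizer as a plain callable. `toy_chunk_tokenizer` is a stand-in invented for illustration; in practice the callable would wrap the tokenizer of the multilingual base model. No pass/fail threshold is given because the source does not specify one.

```python
from typing import Callable, Iterable, List

def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words to audit")
    return n_tokens / n_words

def toy_chunk_tokenizer(word: str, size: int = 3) -> List[str]:
    """Stand-in tokenizer: splits a word into fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]

sample = ["abc defghi", "jk"]
score = fertility(sample, toy_chunk_tokenizer)  # 4 tokens over 3 words
```

Tokenizing each word in isolation is a simplification: some tokenizers segment a word differently in context, so comparing against a full-sentence token count is a reasonable cross-check.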
Class 2 — "Hopefuls"
Characteristics: Some labeled data. Growing number of NLP papers. Included in multilingual benchmarks (FLORES-200, Belebele). Active NLP research community. Still significant gaps.
Examples: Yoruba (Nigeria, ~40M speakers), Khmer (Cambodia, ~16M speakers), Twi (Ghana, ~9M speakers), Amharic (Ethiopia, ~25M speakers), Hausa (West Africa, ~70M speakers).
Typical strategy:
- OFA vocab extension + LoRA (parallel data usually available)
- FLORES-200 eval (flag possible contamination: FLORES-200 appears in many pretraining mixes, so scores may be inflated)
- chrF++ and COMET-22 as primary metrics (never BLEU for morphologically rich Class 2 languages)
- Ethics: standard FPIC; check Bible-NLP register traps
Eval: FLORES-200, Belebele (reading comprehension), regional benchmarks (AfroBench for African languages, IndicXTREME for Indic).
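To illustrate why a character-level metric suits morphologically rich languages, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch under stated assumptions, not a substitute for a reference implementation: it omits the word n-gram component that distinguishes chrF++, strips all spaces, and scores a single sentence pair. The function names and the example strings are placeholders.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: averaged char n-gram precision/recall combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Two placeholder word forms that differ only in the final character.
near_miss = simple_chrf("evlerimizde", "evlerimizden")
exact = simple_chrf("evlerimizde", "evlerimizde")
```

A word-level exact-match metric would give the near-miss pair no credit at all, while the character-level score stays high. For reported numbers, an established implementation should be used rather than this sketch.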
Class 3 — "Rising Stars"
Characteristics: Multiple benchmarks. Growing tooling ecosystem. Often a regional standard or lingua franca. Reasonable data availability but still behind Class 4–5 in tooling depth.
Examples: Swahili (East Africa, ~200M speakers including L2), Indonesian (Indonesia, ~270M speakers including L2), Bengali (Bangladesh/India, lower-resource end), Nepali, Sinhala.
Typical strategy:
- Continued pretraining OR standard fine-tune (enough data for either)
- Vocabulary extension may or may not be needed (fertility audit still required)
- Standard LoRA fine-tune; forgetting mitigation less critical at this class
- Full benchmark suite available
Eval: FLORES-200, Belebele, XNLI, per-task benchmarks.
Class 4 — "Underdogs"
Characteristics: Many benchmarks. Reasonable tooling. Regional standard with significant NLP research. Still meaningfully behind Class 5 in dataset breadth and model quality.
Examples: Vietnamese (~95M speakers), Turkish (~80M speakers), Tamil (~75M speakers), Hindi (higher end), Polish, Czech, Dutch.
Typical strategy:
- Standard fine-tune; continued pretraining if adapting from English-only model
- Vocabulary extension rarely needed (fertility usually adequate)
- Full eval suite; per-dialect breakdown matters (especially for Hindi vs Urdu, Vietnamese regional variants)
Eval: Full FLORES-200 + COMET; per-benchmark comparison with published baselines.
Class 5 — "Winners"
Characteristics: Benchmark-saturated. Abundant tooling. Dominant in NLP research. Most pretrained models are optimized for Class 5 languages.
Examples: English, Mandarin, Spanish, French, German, Japanese, Arabic (MSA), Russian, Portuguese, Korean.
Typical strategy:
- Standard everything: full fine-tune, standard tokenizer, BLEU acceptable as supplementary
- Most pretrained models already handle Class 5 well
- The linguistic suite skills are rarely needed; use standard ML practices
Note: Arabic (MSA) is Class 5, but Arabic dialects (Egyptian, Levantine, Maghrebi) range from Class 2–4. Always disambiguate.
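One way to encode the six classes and the disambiguation rule above is an enum plus a lookup keyed by language variety rather than by language name. This is an illustrative sketch: `JoshiClass`, `VOCAB_EXTENSION`, `VARIETY_CLASS`, and `class_range` are names invented here, and assigning every listed dialect the full 2 to 4 range is a coarse reading of the note above, not a per-dialect assessment.

```python
from enum import IntEnum
from typing import Dict, Tuple

class JoshiClass(IntEnum):
    LEFT_BEHINDS = 0
    SCRAPING_BYS = 1
    HOPEFULS = 2
    RISING_STARS = 3
    UNDERDOGS = 4
    WINNERS = 5

# Vocabulary-extension guidance per class, condensed from the strategy lists.
VOCAB_EXTENSION: Dict[JoshiClass, str] = {
    JoshiClass.LEFT_BEHINDS: "HyperOfa",
    JoshiClass.SCRAPING_BYS: "OFA or HyperOfa (depends on parallel data)",
    JoshiClass.HOPEFULS: "OFA",
    JoshiClass.RISING_STARS: "decide via fertility audit",
    JoshiClass.UNDERDOGS: "rarely needed",
    JoshiClass.WINNERS: "standard tokenizer",
}

# (low, high) class per variety. The dialect ranges are an assumption that
# copies the overall "Class 2-4" span to each dialect.
VARIETY_CLASS: Dict[str, Tuple[int, int]] = {
    "Arabic (MSA)": (5, 5),
    "Arabic (Egyptian)": (2, 4),
    "Arabic (Levantine)": (2, 4),
    "Arabic (Maghrebi)": (2, 4),
}

# Bare names that span several classes and must be disambiguated first.
AMBIGUOUS_NAMES = {"Arabic"}

def class_range(variety: str) -> Tuple[JoshiClass, JoshiClass]:
    """Return the (low, high) Joshi class for a variety, refusing bare names."""
    if variety in AMBIGUOUS_NAMES:
        raise ValueError(f"'{variety}' is ambiguous; specify the variety")
    low, high = VARIETY_CLASS[variety]
    return JoshiClass(low), JoshiClass(high)
```

Raising on the bare name forces the caller to pick a variety instead of silently inheriting the Class 5 treatment of MSA.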
Class Is Multi-Dimensional
A critical principle: Class is data + benchmarks + tooling — not just Wikipedia size.
- Cebuano has 6M+ Wikipedia articles (mostly bot-generated) → effective NLP Class: 2
- Welsh has smaller Wikipedia but rich NLP tooling and community investment → effective Class: 3–4
- Swahili has broad L2 speaker base but limited native-speaker NLP data → treat as Class 3 with caveats
linguistic-scope computes Joshi class from multiple signals: Wikipedia presence, OPUS presence, FLORES-200 inclusion, NLLB inclusion, HuggingFace dataset count. Single-signal classification is unreliable.
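The multi-signal idea can be sketched as below. The five fields mirror the signals named above, but everything else is hypothetical: the source does not document how linguistic-scope weighs the signals, so the thresholds (`min_wiki`, `min_hf`) and the mapping from signal count to class range are placeholders chosen only to show the shape of the computation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class ResourceSignals:
    wikipedia_articles: int
    in_opus: bool
    in_flores200: bool
    in_nllb: bool
    hf_dataset_count: int

def positive_signals(s: ResourceSignals, min_wiki: int = 1, min_hf: int = 1) -> List[str]:
    """Names of the signals that are present (thresholds are placeholders)."""
    checks = [
        ("wikipedia", s.wikipedia_articles >= min_wiki),
        ("opus", s.in_opus),
        ("flores200", s.in_flores200),
        ("nllb", s.in_nllb),
        ("hf_datasets", s.hf_dataset_count >= min_hf),
    ]
    return [name for name, present in checks if present]

def estimate_class_range(s: ResourceSignals) -> Tuple[int, int]:
    """Hypothetical mapping: n present signals gives the range (n - 1, n)."""
    high = min(len(positive_signals(s)), 5)
    low = max(high - 1, 0)
    return low, high

# A language with a huge Wikipedia but nothing else stays at the low end,
# because article count is only one of five signals.
wiki_only = ResourceSignals(6_000_000, False, False, False, 0)
well_covered = ResourceSignals(50_000, True, True, True, 12)
```

Because each signal contributes at most one step, no single source (for example a bot-inflated Wikipedia) can raise the estimate on its own.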
Classification in Practice
When linguistic-scope produces a range (e.g., "Class 1–2") rather than a single number, it means the deciding factor is ambiguous — present the range with the deciding factor and defer to user judgment. The heuristics should be transparent, not hidden in an opaque classification.
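A small formatter makes the reporting rule concrete: a single class is printed as is, while a range must carry its deciding factor. The function name, message wording, and the example deciding factor are assumptions for illustration, not the tool's actual output format.

```python
from typing import Optional

def format_classification(low: int, high: int,
                          deciding_factor: Optional[str] = None) -> str:
    """Render a Joshi class or class range; ranges must explain themselves."""
    if not (0 <= low <= high <= 5):
        raise ValueError("expected 0 <= low <= high <= 5")
    if low == high:
        return f"Class {low}"
    if not deciding_factor:
        raise ValueError("a range must name its deciding factor")
    return (f"Class {low}–{high} (deciding factor: {deciding_factor}; "
            "defer to user judgment)")

report = format_classification(1, 2, "benchmark inclusion is unclear")
```

Refusing to print a range without a deciding factor keeps the heuristic visible to the user instead of hiding it behind a bare number.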