Joshi Classification

The Joshi classification system, introduced in Joshi et al. (ACL 2020, "The State and Fate of Linguistic Diversity and Inclusion in the NLP World"), provides a 6-level taxonomy of language resource availability. Every downstream decision in the linguistic pipeline — tokenizer strategy, adapter configuration, eval benchmark — depends on the Joshi class of the target language.

The Six Classes

Class 0 — "The Left-Behinds"

Characteristics: No labeled data. Often no written standard. Frequently endangered or dormant. NLP tooling essentially nonexistent.

Examples: Dahalo (Kenya, ~400 speakers), Yagua (Peru/Colombia, ~6,000 speakers), many Indigenous Americas languages, many languages documented only in field archives.

Typical strategy:

Field documentation partnership (ELAR, AILLA, PARADISEC)
Bootstrap from related Class 3+ language via cognate sets (linguistic-historical)
FPIC (free, prior and informed consent) mandatory: community partnership before any data acquisition
Synthetic bitext via dictionary substitution and Swadesh-list bootstrap
HyperOfa vocab extension + LoRA from multilingual base (mBART, NLLB, BLOOM)

Eval: No benchmarks exist. Construct custom probes; report as "prototype quality" with wide error bars.
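The dictionary-substitution step in the strategy above can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: the function name, the coverage score, and the placeholder lexicon are all assumptions introduced here.

```python
import re

def dictionary_substitute(sentence, lexicon):
    """Gloss `sentence` word by word using `lexicon` (source word -> target word).

    Returns (glossed_sentence, coverage), where coverage is the fraction of
    word tokens found in the lexicon. Unknown words are copied through
    unchanged so that low-coverage pairs can be filtered out afterwards.
    """
    # Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    out, words, hits = [], 0, 0
    for tok in tokens:
        if re.fullmatch(r"\w+", tok):
            words += 1
            if tok in lexicon:
                hits += 1
                tok = lexicon[tok]
        out.append(tok)
    coverage = hits / words if words else 0.0
    return " ".join(out), coverage

# Placeholder lexicon: the "tgt_*" strings stand in for real target-language
# entries (for example from a Swadesh list). They are not words of any language.
LEXICON = {"i": "tgt_i", "drink": "tgt_drink", "water": "tgt_water"}

gloss, coverage = dictionary_substitute("I drink cold water.", LEXICON)
```

Word-by-word glossing ignores word order and morphology, so the resulting bitext is noisy. Filtering on the coverage score is one simple way to discard the worst pairs.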

Class 1 — "The Scraping-Bys"

Characteristics: Wikipedia dump exists. Minimal labeled data (possibly one NER or POS dataset). No comprehensive benchmarks. Growing online presence.

Examples: Igbo (Nigeria, ~30M speakers), Marathi (India, lower end), many West African languages (Fon, Ewe), several Indigenous Americas languages with revitalization programs.

Typical strategy:

Continued pretraining on Wikipedia + OPUS + available web data
OFA or HyperOfa vocab extension depending on parallel data availability
LoRA r=16–32 from a multilingual base
Tokenizer: fertility audit mandatory; byte fallback mandatory
Ethics: check community norms even for CC-BY data

Eval: FLORES-200 if included; MasakhaNER for NER (African languages); custom probes for other tasks.
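The fertility audit named in the strategy above measures how many subword tokens a tokenizer emits per whitespace-separated word. The sketch below assumes the tokenizer is available as a plain callable from a word to a list of pieces. `fertility` and `toy_tokenize` are hypothetical names, and the toy tokenizer only stands in for a real subword model.

```python
def fertility(texts, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    words = tokens = 0
    for text in texts:
        for word in text.split():
            words += 1
            tokens += len(tokenize(word))
    if words == 0:
        raise ValueError("no words to audit")
    return tokens / words

def toy_tokenize(word, piece_len=2):
    """Stand-in tokenizer: cut each word into fixed-length pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]

# Three words, five pieces in total.
score = fertility(["abcd ab", "abc"], toy_tokenize)
```

Tokenizing word by word is a simplification. A real tokenizer may segment differently in sentence context, so running the same audit on full sentences is a useful cross-check. Whitespace splitting also does not apply to scripts written without spaces between words.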

Class 2 — "The Hopefuls"

Characteristics: Some labeled data. Growing number of NLP papers. Included in multilingual benchmarks (FLORES-200, Belebele). Active NLP research community. Still significant gaps.

Examples: Yoruba (Nigeria, ~40M speakers), Khmer (Cambodia, ~16M speakers), Twi (Ghana, ~9M speakers), Amharic (Ethiopia, ~25M speakers), Hausa (West Africa, ~70M speakers).

Typical strategy:

OFA vocab extension + LoRA (parallel data usually available)
FLORES-200 eval (flag scores as likely inflated: the benchmark appears in many pretraining mixes)
chrF++ and COMET-22 as primary metrics (never BLEU for morphologically rich Class 2 languages)
Ethics: standard FPIC; check Bible-NLP register traps

Eval: FLORES-200, Belebele (reading comprehension), regional benchmarks (AfroBench for African languages, IndicXTREME for Indic).
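To illustrate why character-level metrics are preferred over BLEU for morphologically rich languages, here is a simplified character n-gram F-score in the spirit of chrF. It is a teaching sketch only: it omits the word n-grams that chrF++ adds, it is not compatible with any library's scores, and `simple_chrf` is a hypothetical name. Use an established implementation for reported numbers.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams; this sketch drops spaces first."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def simple_chrf(hyp, ref, max_n=6, beta=2.0):
    """Average n-gram precision and recall over orders 1..max_n, then F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # string too short for this n-gram order
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Two Turkish forms of the same stem that differ only in the case suffix.
# An exact word match gives zero credit; the character score stays high.
score = simple_chrf("kitaplarımızda", "kitaplarımızdan")
```

Because the hypothesis shares every character n-gram with the reference except those touching the final suffix, the score stays close to 1, whereas a word-level match would count the translation as entirely wrong.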

Class 3 — "The Rising Stars"

Characteristics: Multiple benchmarks. Growing tooling ecosystem. Often a regional standard or lingua franca. Reasonable data availability but still behind Class 4–5 in tooling depth.

Examples: Swahili (East Africa, ~200M speakers including L2), Indonesian (Indonesia, ~270M speakers including L2), Bengali (Bangladesh/India, lower-resource end), Nepali, Sinhala.

Typical strategy:

Continued pretraining OR standard fine-tune (enough data for either)
Vocabulary extension may or may not be needed (fertility audit still required)
Standard LoRA fine-tune; forgetting mitigation less critical at this class
Full benchmark suite available

Eval: FLORES-200, Belebele, XNLI, per-task benchmarks.

Class 4 — "The Underdogs"

Characteristics: Many benchmarks. Reasonable tooling. Regional standard with significant NLP research. Still meaningfully behind Class 5 in dataset breadth and model quality.

Examples: Vietnamese (~95M speakers), Turkish (~80M speakers), Tamil (~75M speakers), Hindi (higher end), Polish, Czech, Dutch.

Typical strategy:

Standard fine-tune; continued pretraining if adapting from English-only model
Vocabulary extension rarely needed (fertility usually adequate)
Full eval suite; per-variety breakdown matters (especially for Hindi vs. Urdu and for Vietnamese regional variants)

Eval: Full FLORES-200 + COMET; per-benchmark comparison with published baselines.

Class 5 — "The Winners"

Characteristics: Benchmark-saturated. Abundant tooling. Dominant in NLP research. Most pretrained models are optimized for Class 5 languages.

Examples: English, Mandarin, Spanish, French, German, Japanese, Arabic (MSA), Russian, Portuguese, Korean.

Typical strategy:

Standard everything: full fine-tune, standard tokenizer, BLEU acceptable as supplementary
Most pretrained models already handle Class 5 well
The linguistic suite skills are rarely needed; use standard ML practices

Note: Arabic (MSA) is Class 5, but Arabic dialects (Egyptian, Levantine, Maghrebi) range from Class 2 to Class 4. Always disambiguate.
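One way to enforce this disambiguation is to reject the bare label "Arabic" at lookup time. The sketch below is an assumption-laden illustration: the variety keys and the function name are hypothetical, and each dialect is assigned the full 2 to 4 range because the note above does not give per-dialect values.

```python
# Class ranges as (low, high), taken from the note above.
# The keys are hypothetical labels, not a standard code set.
ARABIC_VARIETIES = {
    "msa": (5, 5),
    "egyptian": (2, 4),
    "levantine": (2, 4),
    "maghrebi": (2, 4),
}

# Labels that name the macrolanguage without identifying a variety.
AMBIGUOUS = {"arabic", "ar", "ara"}

def arabic_class_range(variety):
    """Return the (low, high) class range for a specific Arabic variety."""
    key = variety.strip().lower()
    if key in AMBIGUOUS:
        raise ValueError(
            f"{variety!r} is ambiguous; choose one of {sorted(ARABIC_VARIETIES)}"
        )
    if key not in ARABIC_VARIETIES:
        raise KeyError(f"unknown Arabic variety: {variety!r}")
    return ARABIC_VARIETIES[key]
```

Raising an error instead of silently defaulting to MSA keeps a dialect project from inheriting Class 5 assumptions by accident.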

Class Is Multi-Dimensional

A critical principle: class reflects data, benchmarks, and tooling together, not just Wikipedia size.

Cebuano has 6M+ Wikipedia articles (mostly bot-generated) → effective NLP Class: 2
Welsh has a smaller Wikipedia but rich NLP tooling and community investment → effective Class: 3–4
Swahili has a broad L2 speaker base but limited native-speaker NLP data → treat as Class 3 with caveats

linguistic-scope computes Joshi class from multiple signals: Wikipedia presence, OPUS presence, FLORES-200 inclusion, NLLB inclusion, HuggingFace dataset count. Single-signal classification is unreliable.
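The sketch below shows one way several signals could be combined so that a single inflated signal cannot dominate. It is not the linguistic-scope implementation. Every threshold, the 1-or-2 mapping for the boolean signals, and the choice of the median are assumptions made purely for illustration, and the example values are invented.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Signals:
    wikipedia_articles: int
    opus_sentence_pairs: int
    in_flores200: bool
    in_nllb: bool
    hf_dataset_count: int

def bucket(value, thresholds):
    """Count how many thresholds `value` meets or exceeds (0..len(thresholds))."""
    return sum(value >= t for t in thresholds)

def per_signal_estimates(s):
    """Class guess from each signal on its own. All cut-offs are hypothetical."""
    return {
        "wikipedia": bucket(s.wikipedia_articles,
                            (1_000, 10_000, 100_000, 500_000, 1_000_000)),
        "opus": bucket(s.opus_sentence_pairs,
                       (10_000, 100_000, 1_000_000, 10_000_000, 100_000_000)),
        "flores200": 2 if s.in_flores200 else 1,
        "nllb": 2 if s.in_nllb else 1,
        "hf_datasets": bucket(s.hf_dataset_count, (1, 5, 20, 100, 500)),
    }

def classify(s):
    """Median of the per-signal guesses, plus the guesses for transparency."""
    estimates = per_signal_estimates(s)
    return int(median(estimates.values())), estimates

# Invented profile: a very large, bot-inflated Wikipedia with modest
# resources everywhere else.
inflated = Signals(6_000_000, 500_000, True, True, 10)
joshi_class, estimates = classify(inflated)
```

In this invented profile the Wikipedia signal alone suggests Class 5, while the median across all five signals gives Class 2. Returning the per-signal estimates alongside the result keeps the disagreement visible.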

Classification in Practice

When linguistic-scope produces a range (e.g., "Class 1–2") rather than a single number, it means the deciding factor is ambiguous. Present the range, name that factor, and defer to user judgment. The heuristics should be transparent, not hidden in an opaque classification.
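A small record type can make this reporting rule hard to skip. The sketch below is hypothetical: `ClassEstimate`, its fields, and the example deciding factor are not taken from linguistic-scope. It simply refuses to build a range that does not name its deciding factor.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClassEstimate:
    low: int
    high: int
    deciding_factor: str = ""

    def __post_init__(self):
        if not 0 <= self.low <= self.high <= 5:
            raise ValueError("expected 0 <= low <= high <= 5")
        if self.low != self.high and not self.deciding_factor:
            raise ValueError("a range must name its deciding factor")

    def describe(self):
        """Render the estimate for the user."""
        if self.low == self.high:
            return f"Class {self.low}"
        return (f"Class {self.low}–{self.high} "
                f"(deciding factor: {self.deciding_factor}; "
                "defer to user judgment)")

# Illustrative deciding factor, invented for this example.
estimate = ClassEstimate(1, 2, "in FLORES-200 but no task-labeled data found")
message = estimate.describe()
```

Validating in the constructor means an unexplained range fails early, at the point where it is created, instead of reaching the user as a bare "Class 1–2".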
