
Joshi Classification

The Joshi classification system, introduced in Joshi et al. (ACL 2020, "The State and Fate of Linguistic Diversity and Inclusion in the NLP World"), is a six-level taxonomy of language resource availability. Every downstream decision in the linguistic pipeline (tokenizer strategy, adapter configuration, evaluation benchmark) depends on the Joshi class of the target language.

The Six Classes

Class 0 — "The Left-Behinds"

Characteristics: No labeled data. Often no written standard. Frequently endangered or dormant. NLP tooling essentially nonexistent.

Examples: Dahalo (Kenya, ~400 speakers), Yagua (Peru/Colombia, ~6,000 speakers), many Indigenous Americas languages, many languages documented only in field archives.

Typical strategy:

  • Field documentation partnership (ELAR, AILLA, PARADISEC)
  • Bootstrap from related Class 3+ language via cognate sets (linguistic-historical)
  • FPIC (free, prior and informed consent) is mandatory: establish a community partnership before any data acquisition
  • Synthetic bitext via dictionary substitution and Swadesh-list bootstrap
  • HyperOfa vocab extension + LoRA from multilingual base (mBART, NLLB, BLOOM)
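The dictionary-substitution step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the function name `substitute_with_dictionary`, the toy lexicon, and the placeholder target forms are all hypothetical. It swaps each source word for its dictionary entry and marks uncovered words, so lexicon coverage can be measured per sentence.

```python
def substitute_with_dictionary(sentence, lexicon, unk_format="<{}>"):
    """Word-by-word substitution. Returns (pseudo_target, coverage).

    coverage is the fraction of source tokens found in the lexicon.
    """
    tokens = sentence.lower().split()
    output = []
    hits = 0
    for token in tokens:
        if token in lexicon:
            output.append(lexicon[token])
            hits += 1
        else:
            # Keep the source word, visibly marked, instead of dropping it.
            output.append(unk_format.format(token))
    coverage = hits / len(tokens) if tokens else 0.0
    return " ".join(output), coverage


# Hypothetical lexicon: the target forms are placeholders, not real words.
toy_lexicon = {"water": "tgt_water", "fire": "tgt_fire", "i": "tgt_1sg"}
pseudo, cov = substitute_with_dictionary("I see water", toy_lexicon)
print(pseudo, round(cov, 2))
```

Word order and morphology are left untouched here, so output of this kind is noisy training signal at best; low-coverage sentences are natural candidates for filtering.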

Eval: No benchmarks exist. Construct custom probes; report as "prototype quality" with wide error bars.
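One way to make "wide error bars" concrete on a small custom probe set is a percentile bootstrap over per-item scores. The sketch below assumes each probe item yields a numeric score (here 1 for pass, 0 for fail); the function name `bootstrap_ci`, the resample count, and the sample scores are illustrative assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-item scores."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(scores)
    # Resample the probe set with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), low, high


# Hypothetical probe results: 10 items, 6 passed.
probe_scores = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
point, low, high = bootstrap_ci(probe_scores)
print(f"accuracy {point:.2f} (95% CI {low:.2f} to {high:.2f})")
```

With only a handful of probe items the interval is very wide, which is the point: the width is what the reader of the report needs to see.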


Class 1 — "The Scraping-Bys"

Characteristics: Wikipedia dump exists. Minimal labeled data (possibly one NER or POS dataset). No comprehensive benchmarks. Growing online presence.

Examples: Igbo (Nigeria, ~30M speakers), Marathi (India, lower end), many West African languages (Fon, Ewe), several Indigenous Americas languages with revitalization programs.

Typical strategy:

  • Continued pretraining on Wikipedia + OPUS + available web data
  • OFA or HyperOfa vocab extension depending on parallel data availability
  • LoRA r=16–32 from a multilingual base
  • Tokenizer: fertility audit mandatory; byte fallback mandatory
  • Ethics: check community norms even for CC-BY data
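A fertility audit measures how many subword tokens a tokenizer produces per word; a high value means the language is being fragmented into many small pieces. The sketch below is a minimal version that takes any tokenize callable. The `chunk_tokenizer` is a toy stand-in, not a real tokenizer, and what counts as "too high" is a project decision that this page does not specify.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str],
              tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    total_words = 0
    total_tokens = 0
    for sentence in sentences:
        words = sentence.split()
        total_words += len(words)
        total_tokens += sum(len(tokenize(word)) for word in words)
    return total_tokens / total_words if total_words else 0.0


def chunk_tokenizer(word: str, size: int = 3) -> List[str]:
    """Toy stand-in for a subword tokenizer: fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


sample = ["abcdef ghi", "jklmnopq"]
score = fertility(sample, chunk_tokenizer)
print(score)
```

With a Hugging Face tokenizer, passing its `tokenize` method as the callable should work the same way. Whitespace splitting is itself an assumption and does not fit languages written without spaces between words.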

Eval: FLORES-200 if included; MasakhaNER for NER (African languages); custom probes for other tasks.


Class 2 — "The Hopefuls"

Characteristics: Some labeled data. Growing number of NLP papers. Included in multilingual benchmarks (FLORES-200, Belebele). Active NLP research community. Still significant gaps.

Examples: Yoruba (Nigeria, ~40M speakers), Khmer (Cambodia, ~16M speakers), Twi (Ghana, ~9M speakers), Amharic (Ethiopia, ~25M speakers), Hausa (West Africa, ~70M speakers).

Typical strategy:

  • OFA vocab extension + LoRA (parallel data usually available)
  • FLORES-200 eval (flag scores as potentially inflated, since FLORES-200 appears in many pretraining mixes)
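  • chrF++ and COMET-22 as primary metrics (never BLEU for morphologically rich Class 2 languages)
  • Ethics: standard FPIC; check Bible-NLP register traps (parallel data drawn largely from Bible translations carries a narrow religious register)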
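To see why a character-level metric suits morphologically rich languages, here is a simplified, character-only chrF sketch. It is not chrF++ (which also includes word unigrams and bigrams) and it is not a substitute for an established implementation when reporting results; the function names and the example strings are illustrative. A hypothesis that differs from the reference only by a suffix still earns substantial credit, whereas a word-level exact match would score it as a complete miss.

```python
from collections import Counter


def char_ngrams(text, n):
    """Counter of character n-grams, ignoring whitespace (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Character n-gram F-beta score, averaged over n = 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Two inflected forms sharing a stem: partial credit rather than zero.
partial = simple_chrf("kitaplarim", "kitaplarimiz")
print(round(partial, 3))
```

The default `beta=2.0` weights recall more heavily than precision, following the usual chrF convention.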

Eval: FLORES-200, Belebele (reading comprehension), regional benchmarks (AfroBench for African languages, IndicXTREME for Indic).


Class 3 — "The Rising Stars"

Characteristics: Multiple benchmarks. Growing tooling ecosystem. Often a regional standard or lingua franca. Reasonable data availability but still behind Class 4–5 in tooling depth.

Examples: Swahili (East Africa, ~200M speakers including L2), Indonesian (Indonesia, ~270M speakers including L2), Bengali (Bangladesh/India, lower-resource end), Nepali, Sinhala.

Typical strategy:

  • Continued pretraining OR standard fine-tune (enough data for either)
  • Vocabulary extension may or may not be needed (fertility audit still required)
  • Standard LoRA fine-tune; forgetting mitigation less critical at this class
  • Full benchmark suite available

Eval: FLORES-200, Belebele, XNLI, per-task benchmarks.


Class 4 — "The Underdogs"

Characteristics: Many benchmarks. Reasonable tooling. Regional standard with significant NLP research. Still meaningfully behind Class 5 in dataset breadth and model quality.

Examples: Vietnamese (~95M speakers), Turkish (~80M speakers), Tamil (~75M speakers), Hindi (higher end), Polish, Czech, Dutch.

Typical strategy:

  • Standard fine-tune; continued pretraining if adapting from English-only model
  • Vocabulary extension rarely needed (fertility usually adequate)
  • Full eval suite; per-dialect breakdown matters (especially for Hindi vs Urdu, Vietnamese regional variants)
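A per-dialect breakdown can be as simple as grouping per-example scores by a dialect tag before averaging. The sketch below assumes each evaluation record carries such a tag; the function name, the tag values, and the scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def per_dialect_means(records):
    """Group (dialect, score) pairs and return the mean score per dialect."""
    buckets = defaultdict(list)
    for dialect, score in records:
        buckets[dialect].append(score)
    return {dialect: mean(scores) for dialect, scores in buckets.items()}


# Hypothetical records: an aggregate mean of 50.0 would hide the gap.
records = [("northern", 60.0), ("northern", 50.0), ("southern", 40.0)]
breakdown = per_dialect_means(records)
print(breakdown)
```

Reporting the per-dialect sample counts alongside the means helps readers judge how much weight each number can bear.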

Eval: Full FLORES-200 + COMET; per-benchmark comparison with published baselines.


Class 5 — "The Winners"

Characteristics: Benchmark-saturated. Abundant tooling. Dominant in NLP research. Most pretrained models are optimized for Class 5 languages.

Examples: English, Mandarin, Spanish, French, German, Japanese, Arabic (MSA), Russian, Portuguese, Korean.

Typical strategy:

  • Standard everything: full fine-tune, standard tokenizer, BLEU acceptable as supplementary
  • Most pretrained models already handle Class 5 well
  • The linguistic suite skills are rarely needed; use standard ML practices

Note: Modern Standard Arabic (MSA) is Class 5, but Arabic dialects (Egyptian, Levantine, Maghrebi) range from Class 2 to Class 4. Always disambiguate which variety is the target.
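The per-class strategies above can be condensed into a lookup table that downstream code consults. The sketch below is one possible encoding, with entries paraphrased from this page; the `Strategy` type, its field names, and `strategy_for` are hypothetical and not part of any published API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    vocab_extension: str
    adaptation: str
    evaluation: str


# Entries paraphrase the "Typical strategy" and "Eval" notes for each class.
STRATEGY_BY_CLASS = {
    0: Strategy("HyperOfa", "LoRA from a multilingual base",
                "custom probes only"),
    1: Strategy("OFA or HyperOfa", "continued pretraining + LoRA r=16-32",
                "FLORES-200 if included, MasakhaNER, custom probes"),
    2: Strategy("OFA", "LoRA",
                "FLORES-200, Belebele, regional benchmarks"),
    3: Strategy("only if the fertility audit calls for it",
                "continued pretraining or standard LoRA fine-tune",
                "FLORES-200, Belebele, XNLI"),
    4: Strategy("rarely needed", "standard fine-tune",
                "full FLORES-200 + COMET"),
    5: Strategy("none (standard tokenizer)", "full fine-tune",
                "standard benchmarks; BLEU acceptable as supplementary"),
}


def strategy_for(joshi_class: int) -> Strategy:
    """Return the default strategy for a Joshi class (0 to 5)."""
    if joshi_class not in STRATEGY_BY_CLASS:
        raise ValueError(f"Joshi class must be 0-5, got {joshi_class!r}")
    return STRATEGY_BY_CLASS[joshi_class]


print(strategy_for(2))
```

A table like this only captures defaults. Language varieties that straddle classes, such as the Arabic dialects noted above, still need a per-variety decision.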


Class Is Multi-Dimensional

A critical principle: class reflects data, benchmarks, and tooling together, not just Wikipedia size.

  • Cebuano has 6M+ Wikipedia articles (mostly bot-generated) → effective NLP Class: 2
  • Welsh has smaller Wikipedia but rich NLP tooling and community investment → effective Class: 3–4
  • Swahili has broad L2 speaker base but limited native-speaker NLP data → treat as Class 3 with caveats

linguistic-scope computes Joshi class from multiple signals: Wikipedia presence, OPUS presence, FLORES-200 inclusion, NLLB inclusion, HuggingFace dataset count. Single-signal classification is unreliable.
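The actual rule linguistic-scope applies is not given here, so the following is only a sketch of what combining the five signals might look like. The `Signals` type, the thresholds (10,000 articles, 20 datasets), and the one-point-per-signal scoring are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    wikipedia_articles: int
    in_opus: bool
    in_flores200: bool
    in_nllb: bool
    hf_datasets: int


def estimate_class(s: Signals) -> int:
    """Hypothetical scoring: one point per signal, giving a value in 0..5."""
    points = 0
    points += int(s.wikipedia_articles >= 10_000)  # illustrative threshold
    points += int(s.in_opus)
    points += int(s.in_flores200)
    points += int(s.in_nllb)
    points += int(s.hf_datasets >= 20)             # illustrative threshold
    return points


# A very large Wikipedia with no other signal still scores low.
wiki_only = Signals(6_000_000, False, False, False, 0)
broad = Signals(50_000, True, True, True, 40)
print(estimate_class(wiki_only), estimate_class(broad))
```

Because each signal contributes at most one point, no single inflated signal can push the estimate up on its own, which is the property the multi-signal approach is meant to provide.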

Classification in Practice

When linguistic-scope produces a range (e.g., "Class 1–2") rather than a single number, the signals disagree and the class is genuinely ambiguous. Present the range together with the factor that would decide it, and defer to user judgment. The heuristics should be transparent, not hidden inside an opaque classification.
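One way to keep a range and its deciding factor together is to carry both in the result object, so the factor cannot be dropped when the estimate is displayed. The `ClassEstimate` type and its message format below are hypothetical, not the output format of linguistic-scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassEstimate:
    low: int
    high: int
    deciding_factor: Optional[str] = None

    def __post_init__(self):
        if not 0 <= self.low <= self.high <= 5:
            raise ValueError("expected 0 <= low <= high <= 5")

    def describe(self) -> str:
        """Render a single class, or a range with its deciding factor."""
        if self.low == self.high:
            return f"Class {self.low}"
        factor = self.deciding_factor or "not recorded"
        return (f"Class {self.low} to {self.high} "
                f"(deciding factor: {factor}; user judgment required)")


# Hypothetical ambiguous case.
estimate = ClassEstimate(1, 2, "labeled data exists for only one task")
print(estimate.describe())
```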
