Low-Resource Languages Guide
Low-resource languages are not a homogeneous category. A language with 10M speakers but no NLP benchmarks (Joshi Class 2) needs a different strategy than a language with 500 speakers and no written tradition (Joshi Class 0). Understanding this spectrum is the prerequisite for every decision in the linguistic pipeline.
The Joshi Classification System
Joshi et al. (ACL 2020, "The State and Fate of Linguistic Diversity and Inclusion in the NLP World") established a six-class (0–5) taxonomy based on the availability of labeled and unlabeled data. This guide also weighs benchmark coverage and tooling support when assigning a class.
| Class | Label | Characteristics | Examples | Strategy |
|---|---|---|---|---|
| 0 | The Left-Behinds | No labeled data; often no written standard; endangered or dormant | Dahalo, Yagua, many Indigenous Americas languages | Field documentation; bootstrap from related language; community partnership mandatory |
| 1 | The Scraping-Bys | Wikipedia dump exists; minimal labeled data; no benchmarks | Igbo, Marathi (low end), many West African languages | Continued pretraining + adapter; vocabulary extension |
| 2 | The Hopefuls | Some labeled data; no full benchmark coverage; growing community | Yoruba, Khmer, Twi, Amharic | Vocabulary extension + LoRA; FLORES-200 eval |
| 3 | The Rising Stars | Multiple benchmarks; growing tooling; regional standard | Swahili, Indonesian (low end), Bengali (lower end) | Standard fine-tune + careful eval |
| 4 | The Underdogs | Many benchmarks; reasonable tooling; regional standard | Vietnamese, Turkish, Tamil, Hindi | Standard fine-tune |
| 5 | The Winners | Benchmark-saturated; abundant tooling; dominant in NLP | English, Mandarin, Spanish, French | Standard everything |
Critical: Class is multi-dimensional — data + benchmarks + tooling. A language is NOT Class 5 just because Wikipedia is large. Cebuano has 6M Wikipedia articles (mostly bot-generated) but is effectively Class 2 for practical NLP.
Finding Data for Low-Resource Languages
Catalog Sources (start here)
| Source | Coverage | Best For |
|---|---|---|
| OLDI (Open Language Data Initiative) | 100+ languages | High-quality community-contributed data |
| CulturaX | 167 languages | Web-scale multilingual |
| MADLAD-400 | 400+ languages | Broad coverage; overlaps with CulturaX |
| Glot500 | 500+ languages | Very broad; varied quality |
| Wikipedia dumps | 300+ languages | Encyclopedic; watch for bot-generated |
| OPUS | Parallel data, many languages | Best for bitext mining |
| Bible-NLP | 1,500+ languages | Broad but liturgical register only |
The Bible-NLP Warning
Bible-NLP appears in nearly every low-resource catalog. For Joshi Class 0–1 languages, it is often the only available data. This creates a register trap: training on Bible-only data produces a model that sounds like 17th-century scripture. Always:
- Limit Bible text to ≤30% of any training mix
- Supplement with contemporary web, news, or social media data when available
- Flag the register imbalance in model cards
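The 30% cap can be turned into a token budget before sampling. The following is a minimal sketch, assuming per-source token counts are already known; the function name `cap_source_share` and the example counts are hypothetical, not part of any existing tool.

```python
def cap_source_share(token_counts, source, max_share=0.30):
    """Return per-source token budgets so `source` is at most `max_share` of the mix.

    Other sources are kept in full; only the capped source is downsampled.
    """
    if not 0 <= max_share < 1:
        raise ValueError("max_share must be in [0, 1)")
    others = sum(n for name, n in token_counts.items() if name != source)
    # capped / (capped + others) <= max_share
    #   =>  capped <= others * max_share / (1 - max_share)
    limit = int(others * max_share / (1 - max_share))
    budgets = dict(token_counts)
    budgets[source] = min(token_counts.get(source, 0), limit)
    return budgets


# Illustrative counts only (assumed, not measured).
mix = cap_source_share({"bible": 800_000, "news": 140_000, "web": 70_000}, "bible")
bible_share = mix["bible"] / sum(mix.values())
```

Note the design choice: when Bible text is the only source, the budget drops to zero, because no amount of downsampling can satisfy the cap. In that case the register imbalance has to be documented rather than sampled away.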
Community Archives (for endangered languages)
| Archive | Focus | Access |
|---|---|---|
| ELAR (Endangered Languages Archive) | Endangered language documentation | Researcher access; some community-restricted |
| AILLA (Archive of the Indigenous Languages of Latin America) | Latin American Indigenous languages | Graduated access |
| PARADISEC | Pacific and Southeast Asian languages | Researcher access |
| DELAMAN | Network linking multiple archives | Meta-catalog |
| DoBeS | Documentation of Endangered Languages | Researcher access |
Always route through linguistic-ethics before accessing community archive data. These archives contain materials with specific use restrictions; license text alone is insufficient.
Typological Considerations
Low-resource languages span enormous typological diversity. Common patterns that require special handling:
Agglutinative Languages (Turkish, Finnish, Swahili, Korean, Tamil)
Morphemes stack as separate units: "evlerinizden" (Turkish, "from your houses") = ev + ler + iniz + den. BPE fertility is 2–4× higher than English. Vocabulary extension is typically required. UniMorph paradigms + SIGMORPHON segmenters help; standard BPE treats the word as an opaque sequence.
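Fertility here means the average number of subword tokens per word. The sketch below measures it with a toy longest-match tokenizer, which is a stand-in for a real BPE tokenizer and not an implementation of BPE; both vocabularies are hypothetical and chosen only to show how adding morphemes lowers the token count.

```python
from typing import Callable, Iterable, List, Set


def fertility(words: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per word."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word")
    return sum(len(tokenize(w)) for w in words) / len(words)


def greedy_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Toy longest-match-first tokenizer; unmatched text falls back to single characters."""
    def tokenize(word: str) -> List[str]:
        pieces, i = [], 0
        while i < len(word):
            # Try the longest candidate starting at i, shrinking until a
            # vocabulary entry matches or only one character is left.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    pieces.append(word[i:j])
                    i = j
                    break
        return pieces
    return tokenize


base = greedy_tokenizer({"ev", "en"})                      # assumed small vocabulary
extended = greedy_tokenizer({"ev", "ler", "iniz", "den"})  # after adding Turkish morphemes

before = fertility(["evlerinizden"], base)
after = fertility(["evlerinizden"], extended)
```

With these toy vocabularies the word falls from ten pieces to four. A real audit would pass the model's actual tokenizer as `tokenize` and run it over a representative word sample, not a single word.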
Polysynthetic Languages (Inuktitut, Navajo, West Greenlandic)
Entire sentences can compress into a single word with 8–20+ morphemes. Fertility can reach 5–7×. Morpheme segmentation is mandatory — vocabulary extension alone is insufficient. FST analyzers (HFST) are essential when available.
Tonal Languages (Yoruba, Vietnamese, Hausa, Mandarin)
Diacritics carry lexical meaning. In Yoruba, "ọkọ̀" (boat), "ọkọ" (husband), and "ọkọ́" (hoe) are different words distinguished only by tone marks. Stripping diacritics is catastrophic data corruption. Any pipeline that calls unidecode() on tone-language text must be blocked.
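The collapse is easy to reproduce with the standard library. In this sketch, `strip_marks` is a stand-in for what an ASCII-folding step effectively does to combining marks, and `safe_normalize` shows NFC as one normalization that keeps them; both helper names are assumptions made for illustration.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop every combining mark, as an ASCII-folding step effectively does."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def safe_normalize(text: str) -> str:
    """NFC unifies equivalent encodings while keeping tone marks and under-dots."""
    return unicodedata.normalize("NFC", text)


# Three Yoruba spellings that differ only in tone marks (written with
# escapes so the code points are unambiguous).
words = [
    "\u1ecdk\u1ecd",        # no tone mark on the final vowel
    "\u1ecdk\u1ecd\u0300",  # grave accent on the final vowel
    "\u1ecdk\u1ecd\u0301",  # acute accent on the final vowel
]

folded = {strip_marks(w) for w in words}        # all three collapse to one string
preserved = {safe_normalize(w) for w in words}  # all three stay distinct
```

After folding, the three words are indistinguishable in the corpus, so no downstream step can recover the distinction.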
Script-Diverse Languages
Many low-resource languages have multiple co-existing scripts (Kazakh: Cyrillic + Latin + Arabic; Punjabi: Gurmukhi + Shahmukhi, the Perso-Arabic script usually written in Nastaliq style; Uzbek: Cyrillic + Latin). Per-script normalization policies are required, because a single-policy approach silently corrupts one script variant.
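One way to enforce per-script policies is to detect the script first and dispatch on it, failing loudly when no policy exists. The sketch below guesses the script from Unicode character names, which is a rough heuristic; the entries in `POLICIES` are hypothetical placeholders, and real policies should follow each orthography's own conventions.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Guess the script from the first word of each letter's Unicode name."""
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


# Hypothetical per-script policies, for illustration only.
POLICIES = {
    "LATIN": lambda s: unicodedata.normalize("NFC", s),
    "CYRILLIC": lambda s: unicodedata.normalize("NFC", s),
    # Also drop tatweel (U+0640), a purely typographic elongation character.
    "ARABIC": lambda s: unicodedata.normalize("NFC", s).replace("\u0640", ""),
}


def normalize(text: str) -> str:
    """Apply the policy for the text's script; refuse to guess for unknown scripts."""
    script = dominant_script(text)
    if script not in POLICIES:
        raise ValueError(f"no normalization policy for script {script!r}")
    return POLICIES[script](text)
```

Raising on an unregistered script is deliberate: a fallback policy would reintroduce the silent corruption this section warns about.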
Ethical Considerations
Working with low-resource languages requires heightened ethical awareness because many under-resourced languages are also endangered, Indigenous, or community-controlled.
The Community Partnership Principle
For languages at EGIDS 6b or weaker vitality (levels 6b through 10, threatened to extinct): community partnership is mandatory before any data acquisition. This is not optional ethics; it is a practical prerequisite. Projects that skip community engagement risk:
- Access revocation after training begins
- Regulatory consequences (Indigenous data sovereignty laws)
- Harm to already-vulnerable communities
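Because EGIDS labels such as "6a" and "8b" do not sort numerically, the threshold is easiest to check against an explicit ordering. This is a minimal sketch; the function name `partnership_required` is hypothetical, and the list runs from strongest vitality (0) to weakest (10).

```python
# EGIDS levels ordered from strongest to weakest vitality.
EGIDS_ORDER = ["0", "1", "2", "3", "4", "5", "6a", "6b", "7", "8a", "8b", "9", "10"]


def partnership_required(egids_level: str) -> bool:
    """True when the level is 6b (Threatened) or weaker on the EGIDS scale."""
    # list.index raises ValueError for labels that are not valid EGIDS levels.
    rank = EGIDS_ORDER.index(egids_level)
    return rank >= EGIDS_ORDER.index("6b")
```

For example, `partnership_required("8a")` returns `True` and `partnership_required("6a")` returns `False`.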
FPIC for Class 0–1 Languages
Class 0–1 languages are often endangered. Many of their data sources are community archives with explicit use restrictions. FPIC (Free, Prior, Informed Consent) is required, and it is a process, not a one-time signature.
See the Ethics and FPIC Guide for full detail.
Data Sovereignty
Indigenous communities increasingly assert data sovereignty — the right to govern how data about their language and culture is collected, used, and shared. This is distinct from copyright. A dataset can be publicly available and still subject to community governance claims. Route all Indigenous-language data through linguistic-ethics.
Practical Starting Point
For any new low-resource language project:
- Run linguistic-scope: get ISO 639-3 code, Joshi class, typology, transfer source
- Run linguistic-ethics seed: set ethics depth based on vitality
- Check OLDI + CulturaX + Wikipedia for existing data
- Run linguistic-corpus: catalog, dedup, register balance
- Run linguistic-tokenize: fertility audit before any training decision
This 5-step sequence takes minutes and prevents the most costly low-resource ML mistakes.