Low-Resource Languages Guide

Low-resource languages are not a homogeneous category. A language with 10M speakers but only partial NLP benchmark coverage (Joshi Class 2) needs a different strategy than a language with 500 speakers and no written tradition (Joshi Class 0). Understanding this spectrum is the prerequisite for every decision in the linguistic pipeline.

The Joshi Classification System

Joshi et al. (ACL 2020, "The State and Fate of Linguistic Diversity and Inclusion in the NLP World") established a six-class taxonomy (Classes 0–5) based on the availability of labeled and unlabeled data. This guide also weighs benchmark coverage and tooling support when assigning a class.

Class	Label	Characteristics	Examples	Strategy
0	The Left-Behinds	No labeled data; often no written standard; endangered or dormant	Dahalo, Yagua, many Indigenous languages of the Americas	Field documentation; bootstrap from related language; community partnership mandatory
1	The Scraping-Bys	Wikipedia dump exists; minimal labeled data; no benchmarks	Igbo, Marathi (low end), many West African languages	Continued pretraining + adapter; vocabulary extension
2	The Hopefuls	Some labeled data; no full benchmark coverage; growing community	Yoruba, Khmer, Twi, Amharic	Vocabulary extension + LoRA; FLORES-200 eval
3	The Rising Stars	Multiple benchmarks; growing tooling; regional standard	Swahili, Indonesian (low end), Bengali (lower end)	Standard fine-tune + careful eval
4	The Underdogs	Many benchmarks; reasonable tooling; regional standard	Vietnamese, Turkish, Tamil, Hindi	Standard fine-tune
5	The Winners	Benchmark-saturated; abundant tooling; dominant in NLP	English, Mandarin, Spanish, French	Standard everything

Critical: Class is multi-dimensional — data + benchmarks + tooling. A language is NOT Class 5 just because Wikipedia is large. Cebuano has 6M Wikipedia articles (mostly bot-generated) but is effectively Class 2 for practical NLP.
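
As a rough sketch of how the table and the multi-dimensional caveat might combine in practice, the snippet below encodes the Strategy column as a lookup and picks a working class from three per-dimension scores. The rule that the weakest dimension bounds the class is an assumption made for illustration, not part of the Joshi taxonomy, and the function name and example scores are hypothetical.

```python
# Strategy column of the classification table, keyed by class number.
STRATEGY_BY_CLASS = {
    0: "Field documentation; bootstrap from related language; "
       "community partnership mandatory",
    1: "Continued pretraining + adapter; vocabulary extension",
    2: "Vocabulary extension + LoRA; FLORES-200 eval",
    3: "Standard fine-tune + careful eval",
    4: "Standard fine-tune",
    5: "Standard everything",
}


def working_class(data: int, benchmarks: int, tooling: int) -> int:
    """Pick a working class from three 0-5 scores.

    Assumption (not from Joshi et al.): the weakest dimension bounds
    the class, so abundant raw text cannot compensate for missing
    benchmarks or tooling.
    """
    scores = (data, benchmarks, tooling)
    if any(not 0 <= s <= 5 for s in scores):
        raise ValueError("each dimension must be scored 0-5")
    return min(scores)


# Hypothetical scores: plenty of raw text, thin benchmarks and tooling.
cls = working_class(data=4, benchmarks=2, tooling=2)
print(cls, "->", STRATEGY_BY_CLASS[cls])
```

Taking the minimum is deliberately conservative: it errs toward the strategy for the lower class rather than over-trusting corpus size.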

Finding Data for Low-Resource Languages

Catalog Sources (start here)

Source	Coverage	Best For
OLDI (Open Language Data Initiative)	100+ languages	High-quality community-contributed data
CulturaX	167 languages	Web-scale multilingual
MADLAD-400	400+ languages	Broad coverage; overlaps with CulturaX
Glot500	500+ languages	Very broad; varied quality
Wikipedia dumps	300+ languages	Encyclopedic; watch for bot-generated
OPUS	Parallel data, many languages	Best for bitext mining
Bible-NLP	1,500+ languages	Broad but liturgical register only

The Bible-NLP Warning

Bible-NLP appears in nearly every low-resource catalog. For Joshi Class 0–1 languages, it is often the only available data. This creates a register trap: training on Bible-only data produces a model that sounds like 17th-century scripture. Always:

Limit Bible text to ≤30% of any training mix
Supplement with contemporary web, news, or social media data when available
Flag the register imbalance in model cards
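
The 30% cap above reduces to simple arithmetic: if the mix holds `other` non-Bible tokens, the Bible share bible / (bible + other) stays at or below 30% when bible ≤ 30 × other / 70. A minimal sketch follows; the function name and the choice of tokens as the counting unit are assumptions for illustration.

```python
def max_bible_tokens(other_tokens: int, cap_percent: int = 30) -> int:
    """Largest Bible token count that keeps Bible text at or below
    cap_percent of the total mix.

    Solves bible / (bible + other) <= cap_percent / 100 for bible,
    using integer arithmetic to avoid floating-point rounding.
    """
    if not 0 <= cap_percent < 100:
        raise ValueError("cap_percent must be in [0, 100)")
    if other_tokens < 0:
        raise ValueError("other_tokens must be non-negative")
    return other_tokens * cap_percent // (100 - cap_percent)


other = 7_000_000  # hypothetical count of web/news/social tokens
bible = max_bible_tokens(other)
print(bible, bible / (bible + other))
```

One consequence worth noting: when no non-Bible data exists at all, the cap allows zero Bible tokens, so the rule cannot be met for a Bible-only language and the imbalance has to be documented instead.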

Community Archives (for endangered languages)

Archive	Focus	Access
ELAR (Endangered Languages Archive)	Endangered language documentation	Researcher access; some community-restricted
AILLA (Archive of the Indigenous Languages of Latin America)	Latin American Indigenous languages	Graduated access
PARADISEC	Pacific and Southeast Asian languages	Researcher access
DELAMAN	Network linking multiple archives	Meta-catalog
DoBeS	Documentation of Endangered Languages	Researcher access

Always route through linguistic-ethics before accessing community archive data. These archives contain materials with specific use restrictions; license text alone is insufficient.

Typological Considerations

Low-resource languages span enormous typological diversity. Common patterns that require special handling:

Agglutinative Languages (Turkish, Finnish, Swahili, Korean, Tamil)

Morphemes stack as separate units: "evlerinizden" (Turkish, "from your houses") = ev + ler + iniz + den. BPE fertility (subword tokens per word) is typically 2–4× that of English. Vocabulary extension is typically required. UniMorph paradigms + SIGMORPHON segmenters help; standard BPE ignores morpheme boundaries and treats the word as an opaque character sequence.
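
Fertility here means the average number of subword tokens per word. The sketch below shows one way to measure it for any tokenizer exposed as a callable. `toy_tokenize` is a hypothetical stand-in that cuts fixed-width character chunks; it is not BPE, and the numbers it produces are illustrative only. In a real audit the callable would be the project's actual tokenizer.

```python
from typing import Callable, Iterable, List


def fertility(words: Iterable[str],
              tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per word."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word")
    return sum(len(tokenize(w)) for w in words) / len(words)


def toy_tokenize(word: str, size: int = 3) -> List[str]:
    """Stand-in tokenizer: fixed-width character chunks (NOT real BPE)."""
    return [word[i:i + size] for i in range(0, len(word), size)]


turkish = ["evlerinizden"]            # one word
english = ["from", "your", "houses"]  # the same meaning in three words

tr = fertility(turkish, toy_tokenize)
en = fertility(english, toy_tokenize)
print(f"Turkish {tr:.1f}, English {en:.1f}, ratio {tr / en:.1f}")
```

Comparing the ratio against a high-resource baseline, rather than reading the raw number, is what makes the audit comparable across tokenizers.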

Polysynthetic Languages (Inuktitut, Navajo, West Greenlandic)

Entire sentences can compress into a single word with 8–20+ morphemes. Fertility can reach 5–7× that of English. Morpheme segmentation is mandatory; vocabulary extension alone is insufficient. FST analyzers (HFST) are essential when available.
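
One way to combine morpheme segmentation with subword tokenization is to segment first and then tokenize each morpheme separately, so that no subword token straddles a morpheme boundary. The sketch below illustrates the idea under stated assumptions: `toy_analyzer` is a hypothetical lookup standing in for a real FST analyzer, `toy_subword` is a fixed-width chunker rather than BPE, and the Turkish word from the previous section is reused only because its segmentation is given in this guide.

```python
from typing import Callable, Dict, List

# Hypothetical analyzer output; a real pipeline would query an
# FST analyzer here instead of a hard-coded table.
TOY_ANALYSES: Dict[str, List[str]] = {
    "evlerinizden": ["ev", "ler", "iniz", "den"],
}


def toy_analyzer(word: str) -> List[str]:
    """Return morphemes for known words, else the word unsegmented."""
    return TOY_ANALYSES.get(word, [word])


def toy_subword(piece: str, size: int = 3) -> List[str]:
    """Stand-in subword tokenizer: fixed-width chunks (NOT real BPE)."""
    return [piece[i:i + size] for i in range(0, len(piece), size)]


def tokenize_within_morphemes(
    word: str,
    analyze: Callable[[str], List[str]],
    subword: Callable[[str], List[str]],
) -> List[str]:
    """Tokenize each morpheme on its own so tokens never cross a boundary."""
    tokens: List[str] = []
    for morpheme in analyze(word):
        tokens.extend(subword(morpheme))
    return tokens


word = "evlerinizden"
print(toy_subword(word))  # boundaries ignored
print(tokenize_within_morphemes(word, toy_analyzer, toy_subword))
```

Unknown words fall back to plain subword tokenization here; whether to fall back or to reject unanalyzed words is a design decision for the actual pipeline.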

Tonal Languages (Yoruba, Vietnamese, Hausa, Mandarin)

In orthographies that mark tone, diacritics carry lexical meaning. In Yoruba, "ọkọ̀" (boat), "ọkọ" (husband), and "ọkọ́" (hoe) are different words distinguished only by tone marks. Stripping diacritics is catastrophic data corruption. Any pipeline that calls unidecode() on tone-language text must be blocked.
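
The following sketch, using only Python's standard `unicodedata` module, shows what mark stripping does to the three Yoruba words and how a pipeline might refuse to continue when combining marks disappear. The helper names are hypothetical, and the guard only counts marks, so it is a coarse check rather than a full validation.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """What diacritic stripping does: decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


def count_marks(text: str) -> int:
    """Number of combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(unicodedata.category(ch) == "Mn" for ch in decomposed)


def check_marks_preserved(before: str, after: str) -> None:
    """Raise if a processing step lost combining marks."""
    if count_marks(after) < count_marks(before):
        raise ValueError("combining marks were lost; refusing to continue")


# Written with escapes so the code points are unambiguous.
words = [
    "\u1ecdk\u1ecd\u0300",  # boat: o-underdot, k, o-underdot + grave
    "\u1ecdk\u1ecd",        # husband: no tone mark
    "\u1ecdk\u1ecd\u0301",  # hoe: o-underdot, k, o-underdot + acute
]

print({strip_marks(w) for w in words})                     # one form left
print(len({unicodedata.normalize("NFC", w) for w in words}))  # still distinct
```

NFC normalization is safe for this purpose because it only recomposes characters; it never removes a mark. Note that stripping also removes the underdot, which distinguishes vowel quality rather than tone.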

Script-Diverse Languages

Many low-resource languages have multiple co-existing scripts (Kazakh: Cyrillic + Latin + Arabic; Punjabi: Gurmukhi + Shahmukhi; Uzbek: Cyrillic + Latin). Per-script normalization policies are required; a single-policy approach silently corrupts one script variant.
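
A per-script policy can be sketched as a dispatch table keyed by detected script. The example below guesses the script from Unicode character names, which is a rough heuristic and an assumption of this sketch; a production pipeline would more likely rely on a dedicated script-detection library. The policy functions are hypothetical placeholders, and dropping tatweel in Arabic-script text is shown only as one example of a script-specific rule.

```python
import unicodedata
from collections import Counter
from typing import Callable, Dict


def dominant_script(text: str) -> str:
    """Guess the script from the first word of each character's name."""
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    if not counts:
        raise ValueError("no alphabetic characters to classify")
    return counts.most_common(1)[0][0]


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def arabic_policy(text: str) -> str:
    # Assumption: tatweel (U+0640) is purely typographic and can go.
    return nfc(text).replace("\u0640", "")


# Hypothetical policy table; each script gets its own function.
POLICIES: Dict[str, Callable[[str], str]] = {
    "LATIN": nfc,
    "CYRILLIC": nfc,
    "ARABIC": arabic_policy,
}


def normalize(text: str) -> str:
    """Apply the policy for the text's script; fail loudly if none exists."""
    script = dominant_script(text)
    if script not in POLICIES:
        raise ValueError(f"no normalization policy for script {script!r}")
    return POLICIES[script](text)


for sample in ("qazaq", "\u049b\u0430\u0437\u0430\u049b",
               "\u0642\u0627\u0632\u0627\u0642"):
    print(dominant_script(sample))
```

Raising on an unregistered script, rather than falling back to a default, is what keeps a missing policy from silently corrupting one variant.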

Ethical Considerations

Working with low-resource languages requires heightened ethical awareness because many under-resourced languages are also endangered, Indigenous, or community-controlled.

The Community Partnership Principle

For languages at EGIDS 6b (Threatened) or any less vital level, through 10 (Extinct), community partnership is mandatory before any data acquisition. This is not optional ethics; it is a practical prerequisite. Projects that skip community engagement risk:

Access revocation after training begins
Regulatory consequences (Indigenous data sovereignty laws)
Harm to already-vulnerable communities
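
The EGIDS threshold above can be expressed as a small gate. The sketch below is illustrative only: the function name is hypothetical, and it assumes the EGIDS level has already been looked up for the language. Levels are kept in an ordered list because comparing the labels as strings would misorder them ("10" sorts before "6b").

```python
# EGIDS levels from most to least vital.
EGIDS_LEVELS = ["0", "1", "2", "3", "4", "5",
                "6a", "6b", "7", "8a", "8b", "9", "10"]

# Threshold stated in this guide: 6b (Threatened) and beyond.
PARTNERSHIP_THRESHOLD = "6b"


def partnership_required(egids: str) -> bool:
    """True if community partnership must precede data acquisition."""
    if egids not in EGIDS_LEVELS:
        raise ValueError(f"unknown EGIDS level: {egids!r}")
    threshold = EGIDS_LEVELS.index(PARTNERSHIP_THRESHOLD)
    return EGIDS_LEVELS.index(egids) >= threshold


for level in ("5", "6a", "6b", "8b", "10"):
    print(level, partnership_required(level))
```

A gate like this only encodes the minimum rule; partnership may still be appropriate for languages above the threshold.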

FPIC for Class 0–1 Languages

Class 0–1 languages are often endangered. Many of their data sources are community archives with explicit use restrictions. FPIC (Free, Prior, Informed Consent) is required, and it is an ongoing process, not a one-time signature.

See the Ethics and FPIC Guide for full detail.

Data Sovereignty

Indigenous communities increasingly assert data sovereignty — the right to govern how data about their language and culture is collected, used, and shared. This is distinct from copyright. A dataset can be publicly available and still subject to community governance claims. Route all Indigenous-language data through linguistic-ethics.

Practical Starting Point

For any new low-resource language project:

Run linguistic-scope — get ISO 639-3 code, Joshi class, typology, transfer source
Run linguistic-ethics seed — set ethics depth based on vitality
Check OLDI + CulturaX + Wikipedia for existing data
Run linguistic-corpus — catalog, dedup, register balance
Run linguistic-tokenize — fertility audit before any training decision

This 5-step sequence takes minutes and prevents the most costly low-resource ML mistakes.
