Low-Resource Languages Guide

Low-resource languages are not a homogeneous category. A language with 10M speakers but no NLP benchmarks (Joshi Class 2) needs a different strategy than a language with 500 speakers and no written tradition (Joshi Class 0). Understanding this spectrum is the prerequisite for every decision in the linguistic pipeline.

The Joshi Classification System

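Joshi et al. (ACL 2020, "The State and Fate of Linguistic Diversity and Inclusion in the NLP World") established a six-class (0–5) taxonomy of languages. The original taxonomy is defined by the availability of labeled and unlabeled data; this guide also weighs benchmark coverage and tooling support when assigning a class.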

| Class | Label | Characteristics | Examples | Strategy |
|---|---|---|---|---|
| 0 | The Left-Behinds | No labeled data; often no written standard; endangered or dormant | Dahalo, Yagua, many Indigenous Americas languages | Field documentation; bootstrap from related language; community partnership mandatory |
| 1 | The Scraping-Bys | Wikipedia dump exists; minimal labeled data; no benchmarks | Igbo, Marathi (low end), many West African languages | Continued pretraining + adapter; vocabulary extension |
| 2 | The Hopefuls | Some labeled data; no full benchmark coverage; growing community | Yoruba, Khmer, Twi, Amharic | Vocabulary extension + LoRA; FLORES-200 eval |
| 3 | The Rising Stars | Multiple benchmarks; growing tooling; regional standard | Swahili, Indonesian (low end), Bengali (lower end) | Standard fine-tune + careful eval |
| 4 | The Underdogs | Many benchmarks; reasonable tooling; regional standard | Vietnamese, Turkish, Tamil, Hindi | Standard fine-tune |
| 5 | The Winners | Benchmark-saturated; abundant tooling; dominant in NLP | English, Mandarin, Spanish, French | Standard everything |

Critical: Class is multi-dimensional (data, benchmarks, and tooling). A language is not Class 5 just because its Wikipedia is large. The Cebuano Wikipedia has about 6M articles, most of them bot-generated, yet Cebuano is effectively Class 2 for practical NLP.
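
One conservative way to apply this is to score each dimension separately and let the weakest one decide the working class. The sketch below is an illustration only: the `ResourceProfile` type, the minimum rule, and the example scores are assumptions made for this guide, not part of Joshi et al.'s method and not measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceProfile:
    """Per-dimension scores on the 0-5 scale (hypothetical, for illustration)."""
    data: int
    benchmarks: int
    tooling: int


def effective_class(profile: ResourceProfile) -> int:
    """Return the weakest dimension as a conservative working class."""
    scores = (profile.data, profile.benchmarks, profile.tooling)
    if any(not 0 <= s <= 5 for s in scores):
        raise ValueError("each score must be between 0 and 5")
    return min(scores)


# Illustrative scores only: plenty of raw text, weak benchmarks and tooling.
large_wiki_profile = ResourceProfile(data=4, benchmarks=2, tooling=2)
print(effective_class(large_wiki_profile))  # 2
```

Taking the minimum is deliberately pessimistic: a large raw corpus cannot raise the class on its own.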

Finding Data for Low-Resource Languages

Catalog Sources (start here)

| Source | Coverage | Best For |
|---|---|---|
| OLDI (Open Language Data Initiative) | 100+ languages | High-quality community-contributed data |
| CulturaX | 167 languages | Web-scale multilingual |
| MADLAD-400 | 400+ languages | Broad coverage; overlaps with CulturaX |
| Glot500 | 500+ languages | Very broad; varied quality |
| Wikipedia dumps | 300+ languages | Encyclopedic; watch for bot-generated |
| OPUS | Parallel data, many languages | Best for bitext mining |
| Bible-NLP | 1,500+ languages | Broad but liturgical register only |

The Bible-NLP Warning

Bible-NLP appears in nearly every low-resource catalog. For Joshi Class 0–1 languages, it is often the only available data. This creates a register trap: a model trained only on Bible text produces output that reads like scripture rather than contemporary usage. Always:

  • Limit Bible text to ≤30% of any training mix
  • Supplement with contemporary web, news, or social media data when available
  • Flag the register imbalance in model cards
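
The 30% cap above can be turned into a token budget. This is a minimal sketch, assuming the mix is measured in tokens and that non-Bible data is kept in full; the function name and the integer-percent parameter are hypothetical. If Bible tokens are b and other tokens are o, the condition b / (b + o) ≤ s gives b ≤ s · o / (1 − s).

```python
def cap_bible_tokens(bible_tokens: int, other_tokens: int,
                     max_share_pct: int = 30) -> int:
    """Return how many Bible tokens to keep so their share stays <= max_share_pct.

    Integer percentages keep the arithmetic exact.
    """
    if not 0 <= max_share_pct < 100:
        raise ValueError("max_share_pct must be in [0, 100)")
    if bible_tokens < 0 or other_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # b / (b + o) <= s  is equivalent to  b <= s * o / (1 - s)
    limit = max_share_pct * other_tokens // (100 - max_share_pct)
    return min(bible_tokens, limit)


# Hypothetical corpus: 800k Bible tokens, 700k tokens from other sources.
kept = cap_bible_tokens(800_000, 700_000)
print(kept)                     # 300000
print(kept / (kept + 700_000))  # 0.3
```

When the non-Bible pool is very small, the budget shrinks with it, which is a signal to look for more contemporary data rather than to relax the cap silently.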

Community Archives (for endangered languages)

| Archive | Focus | Access |
|---|---|---|
| ELAR (Endangered Languages Archive) | Endangered language documentation | Researcher access; some community-restricted |
| AILLA (Archive of the Indigenous Languages of Latin America) | Latin American Indigenous languages | Graduated access |
| PARADISEC | Pacific and Southeast Asian languages | Researcher access |
| DELAMAN | Network linking multiple archives | Meta-catalog |
| DoBeS | Documentation of Endangered Languages | Researcher access |

Always route through linguistic-ethics before accessing community archive data. These archives contain materials with specific use restrictions; license text alone is insufficient.

Typological Considerations

Low-resource languages span enormous typological diversity. Common patterns that require special handling:

Agglutinative Languages (Turkish, Finnish, Swahili, Korean, Tamil)

Morphemes attach to the stem one after another, each with a clear boundary: "evlerinizden" (Turkish, "from your houses") = ev + ler + iniz + den. BPE fertility (subword tokens per word) is typically 2–4× higher than for English, so vocabulary extension is usually required. UniMorph paradigms and SIGMORPHON segmenters help; standard BPE merges by frequency and ignores morpheme boundaries.
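
Fertility is simple to measure: count subword tokens and divide by the number of words. The sketch below assumes whitespace-delimited words and uses a toy tokenizer (fixed 3-character chunks) as a stand-in; in a real audit, `tokenize` would be the tokenizer of the model under consideration. Both function names are hypothetical.

```python
from typing import Callable, Iterable


def fertility(words: Iterable[str],
              tokenize: Callable[[str], list[str]]) -> float:
    """Average number of subword tokens per word."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word")
    return sum(len(tokenize(w)) for w in words) / len(words)


def toy_tokenize(word: str) -> list[str]:
    """Toy stand-in for a trained tokenizer: fixed 3-character chunks."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


turkish = fertility(["evlerinizden"], toy_tokenize)
english = fertility("from your houses".split(), toy_tokenize)
print(turkish, english, turkish / english)  # 4.0 2.0 2.0
```

The ratio between the target language and English, computed on comparable text, is the number to report; the toy values here only show the mechanics.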

Polysynthetic Languages (Inuktitut, Navajo, West Greenlandic)

Entire sentences can compress into a single word with 8–20+ morphemes. Fertility can reach 5–7×. Morpheme segmentation is mandatory; vocabulary extension alone is insufficient. FST analyzers (for example, those built with HFST) are essential when available.

Tonal Languages (Yoruba, Vietnamese, Hausa, Mandarin)

In orthographies that mark tone with diacritics, such as Yoruba and Vietnamese, those diacritics carry lexical meaning. In Yoruba, "ọkọ̀" (boat), "ọkọ" (husband), and "ọkọ́" (hoe) are different words distinguished only by tone marks. Stripping diacritics is catastrophic data corruption. Any pipeline that calls unidecode() on tone-language text must be blocked.
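
The difference between safe and destructive normalization can be shown with the standard library alone. In this sketch, `nfc` is the safe step (canonical composition, which keeps every mark), while `strip_marks` is a stand-in for the effect of ASCII transliteration and is shown only as the failure mode. `loses_diacritics` is a hypothetical guard that a pipeline could run around any normalization step.

```python
import unicodedata


def nfc(text: str) -> str:
    """Canonical composition: unifies encodings, keeps every tone mark."""
    return unicodedata.normalize("NFC", text)


def strip_marks(text: str) -> str:
    """What NOT to do: decompose, then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_marks(text: str) -> int:
    """Number of combining marks in the fully decomposed text."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.combining(ch))


def loses_diacritics(before: str, after: str) -> bool:
    """Guard: True if a pipeline step removed combining marks."""
    return count_marks(after) < count_marks(before)


# Yoruba: boat, husband, hoe (written with escapes to pin the code points).
words = ["\u1ecdk\u1ecd\u0300", "\u1ecdk\u1ecd", "\u1ecdk\u1ecd\u0301"]

print(len({nfc(w) for w in words}))          # 3 distinct words survive
print({strip_marks(w) for w in words})       # {'oko'}: all three collapse
print(loses_diacritics(words[0], strip_marks(words[0])))  # True
```

A guard like this can fail the pipeline run instead of letting three words silently merge into one.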

Script-Diverse Languages

Many low-resource languages have multiple co-existing scripts (Kazakh: Cyrillic + Latin + Arabic; Punjabi: Gurmukhi + Shahmukhi; Uzbek: Cyrillic + Latin). Per-script normalization policies are required; a single-policy approach silently corrupts one script variant.
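
A per-script policy can be implemented as a dispatch on the dominant script of each text. The sketch below is an assumption-laden illustration: it infers the script from the first word of each character's Unicode name, and the policies themselves (NFC everywhere, plus removal of the typographic tatweel U+0640 for Arabic script) are hypothetical examples, not recommendations for any particular language. An unknown script raises an error so that no default policy is applied silently.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Guess the main script from Unicode character names (e.g. 'CYRILLIC')."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def normalize_default(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def normalize_arabic(text: str) -> str:
    # Hypothetical policy: NFC, then drop tatweel (a typographic elongation).
    return unicodedata.normalize("NFC", text).replace("\u0640", "")


# Hypothetical policy table; one entry per script the project supports.
POLICIES = {
    "LATIN": normalize_default,
    "CYRILLIC": normalize_default,
    "ARABIC": normalize_arabic,
}


def normalize(text: str) -> str:
    script = dominant_script(text)
    if script not in POLICIES:
        raise KeyError(f"no normalization policy for script {script}")
    return POLICIES[script](text)


print(dominant_script("Қазақстан"))  # CYRILLIC
print(dominant_script("Qazaqstan"))  # LATIN
print(dominant_script("قازاقستان"))  # ARABIC
```

Mixed-script documents would need segment-level detection; this sketch only handles one script per string.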

Ethical Considerations

Working with low-resource languages requires heightened ethical awareness because many under-resourced languages are also endangered, Indigenous, or community-controlled.

The Community Partnership Principle

For languages at EGIDS 6b or weaker vitality (6b Threatened through 10 Extinct; weaker vitality means a numerically higher level), community partnership is mandatory before any data acquisition. This is not only an ethical obligation; it is a practical prerequisite. Projects that skip community engagement risk:

  • Access revocation after training begins
  • Regulatory consequences (Indigenous data sovereignty laws)
  • Harm to already-vulnerable communities
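
The EGIDS threshold above can be encoded as a gate that runs before any data acquisition. This is a minimal sketch: the level labels follow the published EGIDS scale, while the function name and its use as a pipeline gate are assumptions. Because the scale mixes numbers and letters (6a, 6b, 8a, 8b), levels are compared by position in an ordered list rather than as numbers.

```python
# EGIDS levels from strongest vitality (0) to weakest (10).
EGIDS_LEVELS = ["0", "1", "2", "3", "4", "5",
                "6a", "6b", "7", "8a", "8b", "9", "10"]

PARTNERSHIP_THRESHOLD = "6b"  # Threatened


def requires_community_partnership(egids: str) -> bool:
    """True if the level is 6b or weaker (numerically higher) on the scale."""
    if egids not in EGIDS_LEVELS:
        raise ValueError(f"unknown EGIDS level: {egids!r}")
    return (EGIDS_LEVELS.index(egids)
            >= EGIDS_LEVELS.index(PARTNERSHIP_THRESHOLD))


print(requires_community_partnership("6a"))  # False
print(requires_community_partnership("6b"))  # True
print(requires_community_partnership("9"))   # True
```

A False result only means the mandatory threshold is not met; it does not imply that community engagement can be skipped.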

FPIC for Class 0–1 Languages

Class 0–1 languages are often endangered. Many of their data sources are community archives with explicit use restrictions. FPIC (Free, Prior and Informed Consent) is required, and it is an ongoing process, not a one-time signature.

See the Ethics and FPIC Guide for full detail.

Data Sovereignty

Indigenous communities increasingly assert data sovereignty — the right to govern how data about their language and culture is collected, used, and shared. This is distinct from copyright. A dataset can be publicly available and still subject to community governance claims. Route all Indigenous-language data through linguistic-ethics.

Practical Starting Point

For any new low-resource language project:

  1. Run linguistic-scope — get ISO 639-3 code, Joshi class, typology, transfer source
  2. Run linguistic-ethics seed — set ethics depth based on vitality
  3. Check OLDI + CulturaX + Wikipedia for existing data
  4. Run linguistic-corpus — catalog, dedup, register balance
  5. Run linguistic-tokenize — fertility audit before any training decision

This 5-step sequence takes minutes and prevents the most costly low-resource ML mistakes.
