Typological Profiling
Typological profiling is the process of characterizing a language's structural features and computing its distance from other languages. In the linguistic pipeline, it serves two purposes: predicting which aspects of a language will be hardest for ML models, and selecting the best transfer source.
What Is a Typological Profile?
A typological profile is a vector of linguistic features extracted from cross-linguistic databases (WALS, Grambank, URIEL). Key dimensions:
| Feature | Values | ML Implication |
|---|---|---|
| Word order | SOV, SVO, VSO, VOS, OVS, OSV | Parser transfer; word-order probes |
| Morphological type | Isolating, agglutinative, fusional, polysynthetic | Tokenizer fertility; segmenter choice |
| Tone | None, lexical, grammatical | Diacritic preservation MANDATORY |
| Alignment | Nominative-accusative, ergative-absolutive | Parser eval; case probes |
| Pro-drop | Yes, no | Zero-anaphora for coreference |
| Classifier system | Yes, no | Numeral handling fragility |
| Evidentiality | Yes, no | Translation systems silently drop |
URIEL Distance
URIEL (University Research Initiative on Endangered Languages) provides typological distance vectors for thousands of language pairs. The distance is computed from combined WALS, Grambank, and EGIDS features and normalized to [0, 1].
Lower URIEL distance = more typologically similar = better transfer source candidate.
| Distance Range | Interpretation |
|---|---|
| 0.0–0.2 | Very close (same family + shared features) |
| 0.2–0.4 | Close (related family or shared typological features) |
| 0.4–0.6 | Moderate (different family but some overlap) |
| 0.6–0.8 | Distant (different families, different features) |
| 0.8–1.0 | Very distant (fundamentally different structure) |
Transfer Source Selection
The most common mistake in low-resource NLP is defaulting to English as the transfer source. English is typologically unusual: SVO, almost no morphology, no tone, no evidentiality, accusative alignment, Latin script. For most of the world's languages, it is a poor transfer source.
Process:
linguistic-scoperunsuriel_distance.pyto compute distances to top-100 candidate languages- Candidates are filtered by data availability (Joshi class ≥ 2 preferred)
- Top-3 are recommended with bounded justifications
Example — Yoruba (yor):
Transfer-source candidates for Yoruba (yor):
1. Igbo (ibo) — distance 0.18 — same family + tone + Latin script + Class 1 data
2. Hausa (hau) — distance 0.34 — regional contact + tone + Class 2 data
3. Swahili (swa) — distance 0.41 — same family > Bantu + Class 3 + Latin script
English (eng) — distance 0.62 — NOT recommended (no tone, no same family, no shared morphology)For Yoruba, Igbo as transfer source outperforms English by 2–5× on parser transfer and 15–25% on NER, purely due to typological proximity.
Outlier Features That Require Special Handling
linguistic-scope surfaces outlier features that require targeted intervention:
Polysynthesis (Inuktitut, Navajo, West Greenlandic)
Tokenizer fertility 4–7×. A single word encodes a full sentence. Vocabulary extension is mandatory. FST morpheme segmentation is essential — standard BPE treats the polysynthetic word as an opaque unit and fails.
Tone (Yoruba, Vietnamese, Hausa, Mandarin, Igbo)
Diacritic preservation is non-negotiable. Tone distinctions are lexical — different words, not pronunciation variants. Any pipeline that strips diacritics from these languages is corrupting the training data, not cleaning it.
Root-and-Pattern Morphology (Arabic, Hebrew, Amharic)
BPE captures Arabic roots poorly. The trilateral root system means surface forms that look unrelated share a root (k-t-b: kataba "he wrote", kitāb "book", kātib "writer"). Morphological pre-processing or root-aware tokenization is recommended.
Agglutination (Turkish, Finnish, Hungarian, Korean, Swahili Bantu noun class)
Morphemes concatenate as distinct units. Tokenizer fertility 2–4×. UniMorph paradigms + SIGMORPHON segmenters help significantly. Standard BPE has higher fertility than necessary but does not fundamentally fail the way it does for polysynthetic languages.
Ergative-Absolutive Alignment (Basque, many Caucasian, Tibetan, many Indigenous Americas)
Subject and object roles are inverted compared to nominative-accusative languages. English-trained parsers handle ergative-absolutive poorly — parser F1 on Basque is misleading because the underlying role assignment is wrong for English-style subject detection.
Evidentiality (Quechua, Tibetan, many Turkic)
Verbs grammatically encode the source of information (direct witness vs hearsay vs inference). Translation systems silently drop evidential distinctions — the generated translation is grammatical but loses crucial information. Targeted eval probes are needed.
Databases Used
| Database | Coverage | What It Provides |
|---|---|---|
| WALS (World Atlas of Language Structures) | 2,662 languages | 192 structural features; per-language entries |
| Grambank | 2,467 languages | 195 features; more recent + broader coverage |
| URIEL / lang2vec | Thousands of languages | Combined feature vectors; distance computation |
| Glottolog | 8,000+ languages | Language catalog, genealogy, geographic data |
Limitations
URIEL distances are heuristics — they predict transfer success with high variance. Always note uncertainty:
- URIEL coverage is uneven; many Class 0–1 languages have partial or estimated vectors
- Typological distance predicts structural transfer; it does not predict data-domain transfer
- A language with low URIEL distance but only liturgical data may still produce poor transfer for general-domain tasks
linguistic-scope always presents transfer recommendations with explicit uncertainty bounds. Never present URIEL distances as deterministic predictions.
Last updated on
Pipeline Architecture
The 5-phase linguistic pipeline model — Scope, Acquire, Analyze, Evaluate, Release — and how specialist skills map to each phase.
Joshi Classification
The 6-level resource classification system (Classes 0–5) for characterizing language data availability, with language examples and strategy implications for each level.