Typological Profiling

Typological profiling is the process of characterizing a language's structural features and computing its distance from other languages. In the linguistic pipeline, it serves two purposes: predicting which aspects of a language will be hardest for ML models, and selecting the best transfer source.

What Is a Typological Profile?

A typological profile is a vector of linguistic features extracted from cross-linguistic databases (WALS, Grambank, URIEL). Key dimensions:

Feature	Values	ML Implication
Word order	SOV, SVO, VSO, VOS, OVS, OSV	Parser transfer; word-order probes
Morphological type	Isolating, agglutinative, fusional, polysynthetic	Tokenizer fertility; segmenter choice
Tone	None, lexical, grammatical	Diacritic preservation MANDATORY
Alignment	Nominative-accusative, ergative-absolutive	Parser eval; case probes
Pro-drop	Yes, no	Zero-anaphora for coreference
Classifier system	Yes, no	Numeral handling fragility
Evidentiality	Yes, no	Translation systems silently drop evidential marking
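A profile of this kind can be held in a small record type. The sketch below is only an illustration: the class name `TypologicalProfile`, its field names, and the `missing()` helper are hypothetical and not part of linguistic-scope. Unknown features are stored as `None` so that gaps in database coverage stay visible.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypologicalProfile:
    """Hypothetical container for the dimensions listed in the table above."""
    iso: str
    word_order: Optional[str] = None          # e.g. "SOV", "SVO"
    morphological_type: Optional[str] = None  # e.g. "agglutinative"
    tone: Optional[str] = None                # "none", "lexical", "grammatical"
    alignment: Optional[str] = None           # e.g. "ergative-absolutive"
    pro_drop: Optional[bool] = None
    classifiers: Optional[bool] = None
    evidentiality: Optional[bool] = None

    def missing(self) -> list[str]:
        """Return the names of features with no recorded value."""
        return [name for name, value in vars(self).items() if value is None]


# Illustrative values only; a real profile would be read from the databases.
profile = TypologicalProfile(iso="yor", word_order="SVO", tone="lexical")
print(profile.missing())
```

Keeping missing values explicit, rather than defaulting them, makes it harder to mistake an unrecorded feature for an absent one.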

URIEL Distance

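URIEL is a typological knowledge base, distributed with the lang2vec library, that provides feature vectors and precomputed distances for thousands of languages. Its features are aggregated from sources such as WALS, SSWL, PHOIBLE, and Ethnologue, with genealogy from Glottolog; Grambank is included in the later URIEL+ release. Distances are normalized to [0, 1].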

Lower URIEL distance = more typologically similar = better transfer source candidate.
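To make the idea of a normalized distance concrete, here is a minimal sketch that compares two binary feature vectors. It assumes a normalized Hamming distance over the features both languages have values for; this is an illustrative stand-in, not the computation URIEL itself uses. The function name `feature_distance` is hypothetical.

```python
from typing import List, Optional, Tuple


def feature_distance(
    a: List[Optional[int]], b: List[Optional[int]]
) -> Tuple[Optional[float], float]:
    """Return (distance, coverage) for two equally long feature vectors.

    distance: share of mismatching features among those known for both
              languages, in [0, 1]; None if no feature is shared.
    coverage: share of all features that are known for both languages.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    if not a:
        raise ValueError("feature vectors must not be empty")
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    coverage = len(shared) / len(a)
    if not shared:
        return None, coverage
    mismatches = sum(1 for x, y in shared if x != y)
    return mismatches / len(shared), coverage


# Toy vectors; None marks a feature with no database entry.
distance, coverage = feature_distance([1, 0, 1, None], [1, 1, None, 0])
print(distance, coverage)  # 0.5 0.5
```

Returning coverage alongside the distance lets a caller see how much of the vector the number actually rests on.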

Distance Range	Interpretation
0.0–0.2	Very close (same family + shared features)
0.2–0.4	Close (related family or shared typological features)
0.4–0.6	Moderate (different family but some overlap)
0.6–0.8	Distant (different families, different features)
0.8–1.0	Very distant (fundamentally different structure)
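The bands above translate directly into a lookup. In this sketch the function name `interpret_distance` is hypothetical, and because the table's ranges share their endpoints, the code assumes each boundary value belongs to the upper band (so 0.2 counts as "close").

```python
def interpret_distance(distance: float) -> str:
    """Map a distance in [0, 1] to the interpretation bands in the table."""
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must lie in [0, 1]")
    # (exclusive upper bound, label); boundary handling is an assumption.
    bands = [
        (0.2, "very close"),
        (0.4, "close"),
        (0.6, "moderate"),
        (0.8, "distant"),
    ]
    for upper, label in bands:
        if distance < upper:
            return label
    return "very distant"


for d in (0.18, 0.34, 0.41, 0.62):
    print(d, interpret_distance(d))
```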

Transfer Source Selection

The most common mistake in low-resource NLP is defaulting to English as the transfer source. English is a poor match for many target languages: it has fixed SVO order, little inflectional morphology, no tone, no grammatical evidentiality, nominative-accusative alignment, and Latin script. For most of the world's languages, a typologically closer source is a better choice.

Process:

1. linguistic-scope runs uriel_distance.py to compute distances to the top 100 candidate languages.
2. Candidates are filtered by data availability (Joshi class ≥ 2 preferred).
3. The top 3 are recommended with bounded justifications.

Example — Yoruba (yor):

Transfer-source candidates for Yoruba (yor):
1. Igbo (ibo): distance 0.18; same family, tone, Latin script, Class 1 data
2. Hausa (hau): distance 0.34; regional contact, tone, Class 2 data
3. Swahili (swa): distance 0.41; same family (Niger-Congo, Bantu branch), Class 3 data, Latin script
Not recommended: English (eng): distance 0.62; no tone, different family, no shared morphology

For Yoruba, Igbo as a transfer source outperforms English by 2–5× on parser transfer and by 15–25% on NER, a gap attributed mainly to typological proximity.
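The selection steps can be sketched as a rank-and-annotate function. This is not the actual uriel_distance.py; the names `Candidate` and `recommend` are hypothetical. Because the Yoruba example recommends a Class 1 language, the sketch treats the Joshi class threshold as a soft preference that is flagged rather than a hard filter. The distances and classes come from the example above, except the class for English, which is an assumption taken from the Joshi taxonomy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    iso: str
    distance: float   # typological distance to the target, in [0, 1]
    joshi_class: int  # data-availability class, 0 (least) to 5 (most)


def recommend(
    candidates: List[Candidate], k: int = 3, preferred_class: int = 2
) -> List[Tuple[Candidate, str]]:
    """Return the k closest candidates, each with a data-availability note."""
    ranked = sorted(candidates, key=lambda c: c.distance)[:k]
    results = []
    for cand in ranked:
        if cand.joshi_class >= preferred_class:
            note = "meets preferred data class"
        else:
            note = "below preferred data class; check data availability"
        results.append((cand, note))
    return results


candidates = [
    Candidate("English", "eng", 0.62, 5),  # class 5 is an assumption
    Candidate("Swahili", "swa", 0.41, 3),
    Candidate("Igbo", "ibo", 0.18, 1),
    Candidate("Hausa", "hau", 0.34, 2),
]
for cand, note in recommend(candidates):
    print(f"{cand.name} ({cand.iso}): distance {cand.distance:.2f}; {note}")
```

A hard filter at class 2 would drop Igbo here and promote English into the top 3, which is why the threshold is only reported as a note in this sketch.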

Outlier Features That Require Special Handling

linguistic-scope surfaces outlier features that require targeted intervention:

Polysynthesis (Inuktitut, Navajo, West Greenlandic)

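Tokenizer fertility (subword tokens per word) is 4–7×. A single word can encode what other languages express as a full sentence. Vocabulary extension is mandatory. FST morpheme segmentation is essential: standard BPE splits the polysynthetic word into many fragments that do not align with morpheme boundaries, so the model never sees the morphemes as units.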
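Fertility is straightforward to measure once a tokenizer is available. The sketch below takes fertility to mean the average number of subword tokens per word, which is the usual definition. The function names and the fixed-width `chunk_tokenizer` are hypothetical stand-ins; in practice the callable would wrap a real subword tokenizer.

```python
from typing import Callable, Iterable, List


def fertility(words: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens produced per word."""
    word_list = list(words)
    if not word_list:
        raise ValueError("need at least one word")
    total_tokens = sum(len(tokenize(word)) for word in word_list)
    return total_tokens / len(word_list)


def chunk_tokenizer(word: str, width: int = 4) -> List[str]:
    """Toy tokenizer: fixed-width character chunks, standing in for BPE."""
    return [word[i:i + width] for i in range(0, len(word), width)]


# Placeholder strings, not real vocabulary: short words vs. one long word.
short_words = ["abcd", "efgh", "ijkl"]
long_words = ["abcdefghijklmnopqrstuvwx"]

baseline = fertility(short_words, chunk_tokenizer)
target = fertility(long_words, chunk_tokenizer)
print(baseline, target, target / baseline)  # 1.0 6.0 6.0
```

Reporting the ratio against a baseline language is one way to read a figure such as "4–7×"; whether the figures above are relative to a specific baseline is not stated in the text.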

Tone (Yoruba, Vietnamese, Hausa, Mandarin, Igbo)

Where the orthography marks tone with diacritics, preserving them is non-negotiable. Tone distinctions are lexical or grammatical: they separate different words or forms, not pronunciation variants. Any pipeline that strips diacritics from these languages is corrupting the training data, not cleaning it.
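The damage is easy to reproduce with the standard library. The sketch below contrasts a common "cleaning" step (decompose, then drop combining marks) with a safe alternative (NFC normalization, which unifies encodings without removing marks). The function names are hypothetical, and the two sample strings are Yoruba spellings that differ only in their tone mark.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Lossy 'cleaning': decompose, then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_preserving(text: str) -> str:
    """Safe normalization: canonical composition, all marks kept."""
    return unicodedata.normalize("NFC", text)


high_tone = "\u1ecdk\u1ecd\u0301"  # final vowel carries an acute (high tone)
low_tone = "\u1ecdk\u1ecd\u0300"   # final vowel carries a grave (low tone)

# Stripping collapses both spellings, and the underdots, into one string.
print(strip_diacritics(high_tone), strip_diacritics(low_tone))  # oko oko

# NFC keeps the two spellings distinct.
print(normalize_preserving(high_tone) == normalize_preserving(low_tone))  # False
```

NFC is still worth applying, since the same accented letter can arrive precomposed or as base letter plus combining mark, and the two encodings would otherwise be treated as different tokens.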

Root-and-Pattern Morphology (Arabic, Hebrew, Amharic)

BPE captures Arabic roots poorly. The triliteral root system means surface forms that look unrelated share a root (k-t-b: kataba "he wrote", kitāb "book", kātib "writer"). Morphological pre-processing or root-aware tokenization is recommended.
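The shared root becomes visible once vowels are removed from the romanized forms. The sketch below is a toy illustration on transliterations only, not a root-aware tokenizer; the function name `consonant_skeleton` is hypothetical, and real root extraction needs a morphological analyzer.

```python
import unicodedata

VOWELS = set("aeiou")


def consonant_skeleton(transliteration: str) -> str:
    """Drop vowels (including long vowels) from a romanized form."""
    # NFD splits a long vowel such as "a with macron" into "a" plus a
    # combining mark; the mark is not alphabetic, so the filter drops it.
    decomposed = unicodedata.normalize("NFD", transliteration.lower())
    return "".join(ch for ch in decomposed if ch.isalpha() and ch not in VOWELS)


for form in ("kataba", "kit\u0101b", "k\u0101tib"):
    print(form, consonant_skeleton(form))  # each prints the skeleton "ktb"
```

The skeleton equals the root only for forms without consonantal affixes: maktab, also from k-t-b, yields "mktb", which is one reason a proper analyzer is needed.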

Agglutination (Turkish, Finnish, Hungarian, Korean, Swahili and other Bantu noun-class languages)

Morphemes concatenate as distinct units. Tokenizer fertility 2–4×. UniMorph paradigms + SIGMORPHON segmenters help significantly. Standard BPE has higher fertility than necessary but does not fundamentally fail the way it does for polysynthetic languages.

Ergative-Absolutive Alignment (Basque, many Caucasian languages, Tibetan, many Indigenous languages of the Americas)

In ergative-absolutive languages, the single argument of an intransitive verb is marked like the object of a transitive verb (absolutive), while the transitive subject takes a distinct ergative case. English-trained parsers handle this alignment poorly: aggregate parser F1 on Basque can be misleading because English-style subject detection assigns the wrong roles to core arguments.
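The difference between the two alignment systems can be stated as a small mapping over the three core argument roles used in typology: S (sole argument of an intransitive verb), A (agent of a transitive verb), and P (patient of a transitive verb). The sketch below is illustrative; the names `CASE_MARKING` and `case_of` are hypothetical.

```python
CASE_MARKING = {
    # Accusative systems group S with A and single out P.
    "nominative-accusative": {"S": "nominative", "A": "nominative", "P": "accusative"},
    # Ergative systems group S with P and single out A.
    "ergative-absolutive": {"S": "absolutive", "A": "ergative", "P": "absolutive"},
}


def case_of(role: str, alignment: str) -> str:
    """Return the case a core argument role receives under an alignment."""
    try:
        return CASE_MARKING[alignment][role]
    except KeyError:
        raise ValueError(f"unknown role or alignment: {role!r}, {alignment!r}")


for alignment in CASE_MARKING:
    groups_s_with_a = case_of("S", alignment) == case_of("A", alignment)
    print(alignment, "groups S with", "A" if groups_s_with_a else "P")
```

A subject detector that assumes S and A always share marking will therefore misread exactly the arguments that distinguish the two systems.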

Evidentiality (Quechua, Tibetan, many Turkic)

Verbs grammatically encode the source of information (direct witness vs hearsay vs inference). Translation systems silently drop evidential distinctions — the generated translation is grammatical but loses crucial information. Targeted eval probes are needed.

Databases Used

Database	Coverage	What It Provides
WALS (World Atlas of Language Structures)	2,662 languages	192 structural features; per-language entries
Grambank	2,467 languages	195 features; more recent, with denser per-language feature coverage
URIEL / lang2vec	Thousands of languages	Combined feature vectors; distance computation
Glottolog	8,000+ languages	Language catalog, genealogy, geographic data

Limitations

URIEL distances are heuristics — they predict transfer success with high variance. Always note uncertainty:

URIEL coverage is uneven; many Class 0–1 languages have partial or estimated vectors
Typological distance predicts structural transfer; it does not predict data-domain transfer
A language with low URIEL distance but only liturgical data may still produce poor transfer for general-domain tasks

linguistic-scope always presents transfer recommendations with explicit uncertainty bounds. Never present URIEL distances as deterministic predictions.
