
Typological Profiling

Typological profiling is the process of characterizing a language's structural features and computing its distance from other languages. In the linguistic pipeline, it serves two purposes: predicting which aspects of a language will be hardest for ML models, and selecting the best transfer source.

What Is a Typological Profile?

A typological profile is a vector of linguistic features extracted from cross-linguistic databases (WALS, Grambank, URIEL). Key dimensions:

| Feature | Values | ML Implication |
| --- | --- | --- |
| Word order | SOV, SVO, VSO, VOS, OVS, OSV | Parser transfer; word-order probes |
| Morphological type | Isolating, agglutinative, fusional, polysynthetic | Tokenizer fertility; segmenter choice |
| Tone | None, lexical, grammatical | Diacritic preservation MANDATORY |
| Alignment | Nominative-accusative, ergative-absolutive | Parser eval; case probes |
| Pro-drop | Yes, no | Zero-anaphora for coreference |
| Classifier system | Yes, no | Numeral handling fragility |
| Evidentiality | Yes, no | Translation systems silently drop |
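
As a minimal sketch of what "a vector of linguistic features" can look like, the snippet below one-hot encodes the dimensions from the table. The names `FEATURE_VALUES` and `encode_profile` are illustrative assumptions, not part of linguistic-scope, and real URIEL vectors use a much larger feature inventory.

```python
from typing import Dict, List, Optional

# Value inventories copied from the table above (keys are hypothetical names).
FEATURE_VALUES: Dict[str, List[str]] = {
    "word_order": ["SOV", "SVO", "VSO", "VOS", "OVS", "OSV"],
    "morphological_type": ["isolating", "agglutinative", "fusional", "polysynthetic"],
    "tone": ["none", "lexical", "grammatical"],
    "alignment": ["nominative-accusative", "ergative-absolutive"],
    "pro_drop": ["yes", "no"],
    "classifier_system": ["yes", "no"],
    "evidentiality": ["yes", "no"],
}


def encode_profile(profile: Dict[str, Optional[str]]) -> List[float]:
    """One-hot encode a profile; unknown features stay all-zero."""
    vector: List[float] = []
    for feature, values in FEATURE_VALUES.items():
        observed = profile.get(feature)
        if observed is not None and observed not in values:
            raise ValueError(f"unexpected value {observed!r} for {feature}")
        vector.extend(1.0 if observed == value else 0.0 for value in values)
    return vector


# Partial English profile; features not listed are treated as unknown.
english = {
    "word_order": "SVO",
    "tone": "none",
    "alignment": "nominative-accusative",
    "evidentiality": "no",
}
vector = encode_profile(english)
print(len(vector), sum(vector))
```

Leaving unknown features as zeros keeps partially documented languages usable, at the cost of making them look artificially similar to each other.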

URIEL Distance

URIEL, distributed with the lang2vec library, provides typological feature vectors and precomputed distances for thousands of languages. It aggregates features from sources such as WALS, PHOIBLE, Ethnologue, and Glottolog (with Grambank added in the later URIEL+ release), and the resulting distances are normalized to [0, 1].

Lower URIEL distance = more typologically similar = better transfer source candidate.
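
One common way to obtain a [0, 1] distance from non-negative feature vectors is cosine distance. The sketch below assumes that choice for illustration; this page does not state how uriel_distance.py computes its values, and the vectors are toy data.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity, clamped to [0, 1] for non-negative vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("distance is undefined for an all-zero vector")
    # Clamp to absorb floating-point error around 0 and 1.
    return min(1.0, max(0.0, 1.0 - dot / (norm_a * norm_b)))


# Toy binary feature vectors (not real URIEL data).
lang_a = [1.0, 0.0, 1.0, 1.0]
lang_b = [1.0, 0.0, 1.0, 0.0]
lang_c = [0.0, 1.0, 0.0, 0.0]

print(cosine_distance(lang_a, lang_b))  # small: two of three features shared
print(cosine_distance(lang_a, lang_c))  # 1.0: no features shared
```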

| Distance Range | Interpretation |
| --- | --- |
| 0.0–0.2 | Very close (same family + shared features) |
| 0.2–0.4 | Close (related family or shared typological features) |
| 0.4–0.6 | Moderate (different family but some overlap) |
| 0.6–0.8 | Distant (different families, different features) |
| 0.8–1.0 | Very distant (fundamentally different structure) |
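
The ranges above can be turned into a small lookup. This is a sketch with a hypothetical helper name, `interpret_distance`. The table leaves boundary values such as 0.2 ambiguous, so the code assumes a boundary belongs to the higher band.

```python
from bisect import bisect_right

# Upper bounds of the first four bands; a value equal to a bound
# falls into the next band (an assumption, see the note above).
_BOUNDS = [0.2, 0.4, 0.6, 0.8]
_LABELS = ["very close", "close", "moderate", "distant", "very distant"]


def interpret_distance(distance: float) -> str:
    """Map a normalized URIEL distance to its interpretation band."""
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must lie in [0, 1]")
    return _LABELS[bisect_right(_BOUNDS, distance)]


print(interpret_distance(0.18))  # very close
print(interpret_distance(0.62))  # distant
```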

Transfer Source Selection

The most common mistake in low-resource NLP is defaulting to English as the transfer source. English is typologically unusual: SVO, almost no morphology, no tone, no evidentiality, accusative alignment, Latin script. For most of the world's languages, it is a poor transfer source.

Process:

  1. linguistic-scope runs uriel_distance.py to compute distances to top-100 candidate languages
  2. Candidates are filtered by data availability (Joshi class ≥ 2 preferred)
  3. The top 3 are recommended, each with a justification and explicit uncertainty bounds

Example — Yoruba (yor):

Transfer-source candidates for Yoruba (yor):
1. Igbo (ibo) — distance 0.18 — same family + tone + Latin script + Class 1 data
2. Hausa (hau) — distance 0.34 — regional contact + tone + Class 2 data
3. Swahili (swa) — distance 0.41 — same family > Bantu + Class 3 + Latin script
English (eng) — distance 0.62 — NOT recommended (no tone, no same family, no shared morphology)

For Yoruba, using Igbo as the transfer source is reported to outperform English by 2–5× on parser transfer and by 15–25% on NER, a gap attributed to typological proximity.
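
The selection process described above can be sketched as follows. The `Candidate` type, the `select_transfer_sources` function, and the hard `min_class` floor are illustrative assumptions, not the interface of uriel_distance.py. Distances and Joshi classes come from the Yoruba example, except English's Class 5, which follows the Joshi et al. taxonomy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    iso: str
    distance: float   # URIEL distance to the target language, in [0, 1]
    joshi_class: int  # 0 (least data) to 5 (most data)


def select_transfer_sources(
    candidates: List[Candidate], min_class: int = 2, top_k: int = 3
) -> List[Candidate]:
    """Drop candidates below the data floor, then keep the closest top_k."""
    eligible = [c for c in candidates if c.joshi_class >= min_class]
    return sorted(eligible, key=lambda c: c.distance)[:top_k]


YORUBA_CANDIDATES = [
    Candidate("eng", 0.62, 5),
    Candidate("swa", 0.41, 3),
    Candidate("ibo", 0.18, 1),
    Candidate("hau", 0.34, 2),
]

strict = select_transfer_sources(YORUBA_CANDIDATES)
relaxed = select_transfer_sources(YORUBA_CANDIDATES, min_class=1)
print([c.iso for c in strict])   # ['hau', 'swa', 'eng']
print([c.iso for c in relaxed])  # ['ibo', 'hau', 'swa']
```

The example output keeps Igbo despite its Class 1 data, which suggests the real preference is soft rather than a hard cutoff. Only the relaxed call reproduces that ranking.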

Outlier Features That Require Special Handling

linguistic-scope surfaces outlier features that require targeted intervention:

Polysynthesis (Inuktitut, Navajo, West Greenlandic)

Tokenizer fertility is 4–7×. A single word can encode what other languages express as a full sentence. Vocabulary extension is mandatory, and FST-based morpheme segmentation is essential: standard BPE has no notion of the word's internal morphology and fragments it into pieces that ignore morpheme boundaries.
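
Fertility is usually measured as the average number of subword tokens per word. The sketch below assumes that definition; the "×" figures on this page may instead be relative to a reference language. `chunk_tokenizer` is a toy stand-in, not BPE, and in practice you would pass in the tokenizer under evaluation.

```python
from typing import Callable, Iterable, List


def fertility(words: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens the tokenizer produces per word."""
    word_list = list(words)
    if not word_list:
        raise ValueError("need at least one word")
    return sum(len(tokenize(word)) for word in word_list) / len(word_list)


def chunk_tokenizer(word: str, size: int = 4) -> List[str]:
    """Toy tokenizer: split a word into fixed-width character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


# "abcdefgh" -> 2 chunks, "abcd" -> 1 chunk, so 3 tokens over 2 words.
print(fertility(["abcdefgh", "abcd"], chunk_tokenizer))  # 1.5
```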

Tone (Yoruba, Vietnamese, Hausa, Mandarin, Igbo)

Diacritic preservation is non-negotiable. Tone distinctions are lexical — different words, not pronunciation variants. Any pipeline that strips diacritics from these languages is corrupting the training data, not cleaning it.
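
A short illustration of why diacritic stripping destroys information, using a commonly cited Yoruba minimal set. The function names are hypothetical. The "cleaning" step shown here removes every combining mark, which is what many generic normalizers do.

```python
import unicodedata


def strip_all_marks(text: str) -> str:
    """Lossy 'cleaning': decompose, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_preserving_tone(text: str) -> str:
    """Safe normalization: canonical composition only, all marks kept."""
    return unicodedata.normalize("NFC", text)


# Three distinct Yoruba words, written with explicit combining marks:
# U+0323 dot below, U+0301 acute (high tone), U+0300 grave (low tone).
WORDS = {
    "o\u0323ko\u0323": "husband",
    "o\u0323ko\u0323\u0301": "hoe",
    "o\u0323ko\u0323\u0300": "vehicle",
}

stripped = {strip_all_marks(word) for word in WORDS}
preserved = {normalize_preserving_tone(word) for word in WORDS}
print(stripped)        # {'oko'}: all three words collapse into one string
print(len(preserved))  # 3: the words stay distinct
```

Note that the stripping step also removes the underdot, which marks a vowel-quality contrast and not tone, so the damage goes beyond tone alone.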

Root-and-Pattern Morphology (Arabic, Hebrew, Amharic)

BPE captures Arabic roots poorly. The triliteral root system means surface forms that look unrelated share a root (k-t-b: kataba "he wrote", kitāb "book", kātib "writer"). Morphological pre-processing or root-aware tokenization is recommended.
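
The k-t-b example can be made concrete with a toy consonant-skeleton function. This is only a sketch for bare transliterated forms: `consonant_skeleton` is a hypothetical helper, and forms with affixes (for instance maktab) would need a real morphological analyzer.

```python
import os

# Short and long vowels used in the transliterations above.
VOWELS = set("aiuāīū")


def consonant_skeleton(transliteration: str) -> str:
    """Drop vowels from a transliterated form, leaving the consonants."""
    return "".join(
        ch for ch in transliteration.lower() if ch.isalpha() and ch not in VOWELS
    )


forms = ["kataba", "kitāb", "kātib"]
skeletons = [consonant_skeleton(form) for form in forms]
shared_prefix = os.path.commonprefix(forms)

print(skeletons)      # ['ktb', 'ktb', 'ktb']
print(shared_prefix)  # 'k'
```

The forms share only a one-character prefix, so a tokenizer that merges contiguous character sequences has little to latch onto, while the non-contiguous skeleton is identical across all three.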

Agglutination (Turkish, Finnish, Hungarian, Korean, Swahili and other Bantu noun-class languages)

Morphemes concatenate as distinct units. Tokenizer fertility 2–4×. UniMorph paradigms + SIGMORPHON segmenters help significantly. Standard BPE has higher fertility than necessary but does not fundamentally fail the way it does for polysynthetic languages.

Ergative-Absolutive Alignment (Basque, many Caucasian languages, Tibetan, many Indigenous languages of the Americas)

In ergative-absolutive languages, the single argument of an intransitive verb is marked like the object of a transitive verb, while the transitive subject takes its own ergative marking. English-trained parsers handle this poorly: parser F1 on Basque can be misleading because English-style subject detection assigns the underlying roles incorrectly.

Evidentiality (Quechua, Tibetan, many Turkic languages)

Verbs grammatically encode the source of information (direct witness vs hearsay vs inference). Translation systems silently drop evidential distinctions — the generated translation is grammatical but loses crucial information. Targeted eval probes are needed.
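
A targeted probe can be as simple as checking whether an English translation carries any lexical cue for the evidential category marked in the source. The sketch below is a rough heuristic under stated assumptions: the cue lists and the `keeps_evidential` helper are hypothetical, and direct-witness evidentials are out of scope because plain English declaratives leave them unmarked.

```python
from typing import Dict, Tuple

# Hypothetical English cue phrases per evidential category.
EVIDENTIAL_CUES: Dict[str, Tuple[str, ...]] = {
    "hearsay": ("reportedly", "allegedly", "they say", "it is said", "i heard"),
    "inference": ("apparently", "evidently", "it seems", "must have", "probably"),
}


def keeps_evidential(translation: str, evidential: str) -> bool:
    """Return True if the translation contains a cue for the given category."""
    if evidential not in EVIDENTIAL_CUES:
        raise ValueError(f"unsupported evidential category: {evidential!r}")
    lowered = translation.lower()
    return any(cue in lowered for cue in EVIDENTIAL_CUES[evidential])


print(keeps_evidential("It rained last night.", "hearsay"))              # False
print(keeps_evidential("Reportedly, it rained last night.", "hearsay"))  # True
print(keeps_evidential("It must have rained.", "inference"))             # True
```

A cue-matching probe will miss paraphrases, so it works best as a cheap first filter ahead of human review.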

Databases Used

| Database | Coverage | What It Provides |
| --- | --- | --- |
| WALS (World Atlas of Language Structures) | 2,662 languages | 192 structural features; per-language entries |
| Grambank | 2,467 languages | 195 features; more recent + broader coverage |
| URIEL / lang2vec | Thousands of languages | Combined feature vectors; distance computation |
| Glottolog | 8,000+ languages | Language catalog, genealogy, geographic data |

Limitations

URIEL distances are heuristics — they predict transfer success with high variance. Always note uncertainty:

  • URIEL coverage is uneven; many Class 0–1 languages have partial or estimated vectors
  • Typological distance predicts structural transfer; it does not predict data-domain transfer
  • A language with low URIEL distance but only liturgical data may still produce poor transfer for general-domain tasks
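
One way to keep these caveats attached to a number is to report feature coverage alongside every distance. The sketch below assumes missing features are stored as `None`. The 0.5 coverage threshold and both helper names are hypothetical choices, not linguistic-scope behavior.

```python
from typing import Optional, Sequence


def coverage(vector: Sequence[Optional[float]]) -> float:
    """Fraction of features that have an observed (non-missing) value."""
    if not vector:
        raise ValueError("vector must not be empty")
    return sum(value is not None for value in vector) / len(vector)


def describe_distance(
    distance: float, cov_a: float, cov_b: float, min_coverage: float = 0.5
) -> str:
    """Format a distance together with a caveat based on the weaker coverage."""
    lowest = min(cov_a, cov_b)
    if lowest < min_coverage:
        caveat = "low feature coverage; treat as a rough estimate"
    else:
        caveat = "heuristic; transfer results vary"
    return f"distance {distance:.2f} ({caveat}; coverage {lowest:.0%})"


sparse = [1.0, None, None, None]
dense = [1.0, 0.0, 1.0, None]
report = describe_distance(0.18, coverage(sparse), coverage(dense))
print(report)
```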

linguistic-scope always presents transfer recommendations with explicit uncertainty bounds. Never present URIEL distances as deterministic predictions.
