linguistic-historical
Historical/comparative linguistics primitives for ML data augmentation: cognate sets across related languages, Swadesh lists for cheap bilingual-lexicon bootstrapping, and regular sound-correspondence rules. Optional Mindset specialist.
Overview
For Class 0–1 languages with no usable training data, comparative-historical linguistics offers the only practical bootstrapping path: cognate sets, Swadesh lists, and sound correspondences drawn from a related higher-resource language. These decades-mature tools are the operationalizable primitives for low-resource ML data augmentation.
Pipeline Position
Phase: Optional (Phase 4) — activate when target is Class 0–1 and a related higher-resource language exists
When to activate: After linguistic-scope identifies URIEL distance ≤ 0.3 to a related Class 3+ language and linguistic-corpus confirms insufficient monolingual data
When It Activates
- Class 0–1 target language; need bilingual lexicon from related-language cognates
- Bootstrapping bitext from Swadesh-list pairs
- Validating typological-distance recommendations from linguistic-scope
- Cognate-based data augmentation between related-language pairs
When NOT to use: Class 3+ language — standard data is ample; cognate bootstrap not needed. Pure typology lookup → linguistic-scope.
What It Does
Key Primitives
Cognate sets: Etymologically related words across related languages. Spanish "noche" ↔ Italian "notte" ↔ French "nuit" ↔ Portuguese "noite": same Latin source (noctem), predictable sound shifts. For Class 0–1, cognate sets bootstrap bilingual lexicons cheaply.
Swadesh lists: 100-word / 200-word core-vocabulary lists. Standard starting point for Class 0 bilingual-lexicon work. NOT a complete lexicon — a starting bootstrap only. Modern terms, technical vocabulary, and cultural concepts diverge; Swadesh covers core.
Regular sound correspondences: Grimm's Law (Germanic), Proto-Bantu reflexes, etc. Encode as rewrite rules → automatic cognate detection or augmentation between related-language pairs.
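As a rough illustration of "encode as rewrite rules", the sketch below applies a table of segment correspondences in a single simultaneous pass, so that one rule's output is never re-rewritten by another (a chain shift such as b → p, p → f would otherwise feed itself). The table `GRIMM_TOY` and the function `apply_correspondences` are hypothetical names, and the rules are a deliberately simplified, orthography-level caricature of Grimm's Law applied to Latin-like input. The outputs are candidate forms for matching, not attested words.

```python
import re
from typing import Dict

# Toy correspondences loosely modeled on Grimm's Law (assumption: applied in
# every position and on spelling, which real sound laws do not do).
GRIMM_TOY: Dict[str, str] = {
    "p": "f", "t": "th", "k": "h",   # voiceless stops -> fricatives
    "b": "p", "d": "t", "g": "k",    # voiced stops -> voiceless stops
}

def apply_correspondences(word: str, table: Dict[str, str]) -> str:
    """Rewrite every source segment in one pass (no rule feeds another)."""
    # Longest keys first so multi-character segments win over their prefixes.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], word)

if __name__ == "__main__":
    for source in ["pater", "tres", "bat"]:
        print(source, "->", apply_correspondences(source, GRIMM_TOY))
```

A production rule set would operate on phonemic transcriptions and condition each rule on its phonological environment. The single-pass design is kept here only because it makes the no-feeding property easy to see.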
Key Data Sources
| Resource | Coverage | Use |
|---|---|---|
| LingPy | Python comparative-linguistics toolkit | Cognate-detection algorithms |
| NorthEuraLex | Large lexical database for Northern Eurasian languages | Lexicon bootstrapping |
| IE-CoR | Indo-European Cognate Relationships database | Indo-European languages |
| BDPROTO | Proto-language phoneme inventories | Reconstruction reference |
| CogNet | Multilingual cognate database | Cross-family lookup |
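Before reaching for a toolkit from the table above, it can help to have a dependency-free baseline to compare against. The sketch below flags candidate cognates by normalized Levenshtein distance over spelling. The function names and the 0.5 threshold are assumptions made for illustration; dedicated cognate-detection methods work on phonetic segments rather than raw orthography and should outperform this.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer word's length."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

def is_candidate_cognate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Hypothetical cutoff: pairs at or below the threshold are candidates."""
    return normalized_distance(a, b) <= threshold

if __name__ == "__main__":
    print(normalized_distance("noche", "notte"))   # Spanish vs Italian 'night'
    print(is_candidate_cognate("noche", "notte"))
```

A baseline like this produces both false positives (chance resemblances, loanwords) and false negatives (true cognates obscured by sound change), so its output is a candidate list for review, not a finished cognate set.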
When Cognate Augmentation Is Highest-Value
Cognate-based augmentation is most valuable when:
- Target is Class 0–1
- Source is Class 3+
- URIEL distance ≤ 0.3 (cross-reference linguistic-scope)
"Related" does not mean "mutually intelligible": Italian and Romanian are both Romance, but their speakers generally cannot understand each other. Typological proximity does not directly yield a usable bilingual lexicon.
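The three conditions above can be written as a simple gate. This is a minimal sketch: the function name `should_use_cognate_bootstrap` and its parameters are hypothetical, and the resource classes are assumed to be plain integers. Passing the gate only says the specialist is worth activating; per the caveat above, it does not guarantee a usable lexicon.

```python
def should_use_cognate_bootstrap(
    target_class: int,
    source_class: int,
    uriel_distance: float,
    max_distance: float = 0.3,  # threshold taken from the criteria above
) -> bool:
    """Return True when all three highest-value conditions hold."""
    return (
        target_class <= 1                   # target is Class 0-1
        and source_class >= 3               # source is Class 3+
        and uriel_distance <= max_distance  # languages are close enough
    )

if __name__ == "__main__":
    # A Class 1 target with a Class 4 relative at distance 0.19.
    print(should_use_cognate_bootstrap(1, 4, 0.19))
```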
Inputs & Outputs
| Input | Description |
|---|---|
| Target language + related language | For cognate set lookup |
| URIEL distance from scope | Confirms relationship strength |
| Output | Description |
|---|---|
| Cognate set | Core vocabulary pairs with sound correspondences |
| Swadesh bootstrap lexicon | 100–200 word pairs |
| Sound-correspondence rules | Rewrite rules for augmentation |
| Augmentation estimate | Expected lexicon coverage gain |
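One way the outputs in the table might be bundled is sketched below. The `BootstrapResult` class and its field names are assumptions made for illustration, not a defined schema. The augmentation estimate is computed as the share of concept-list items that have a usable cognate pair.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class BootstrapResult:
    # concept gloss -> (source-language form, target-language form)
    cognate_pairs: Dict[str, Tuple[str, str]]
    # sound-correspondence rewrite rules as (source segment, target segment)
    rules: List[Tuple[str, str]] = field(default_factory=list)
    # size of the concept list used, e.g. a 100- or 200-item Swadesh list
    concept_list_size: int = 200

    def coverage(self) -> float:
        """Fraction of concept-list items with a usable cognate pair."""
        if self.concept_list_size <= 0:
            raise ValueError("concept_list_size must be positive")
        return len(self.cognate_pairs) / self.concept_list_size

if __name__ == "__main__":
    # Placeholder entries only; real forms would come from a cognate database.
    pairs = {f"concept_{i}": ("source_form", "target_form") for i in range(180)}
    result = BootstrapResult(cognate_pairs=pairs, rules=[("oa", "ua")])
    print(f"coverage: {result.coverage():.0%}")
```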
Example Usage
Target: Aromanian (rup, Class 1), related to Romanian (ron, Class 4), URIEL distance 0.19
Historical Bootstrap: Aromanian from Romanian
- URIEL distance: 0.19 (close; cognate augmentation high-value)
- Cognate source: Romanian (ron) Class 4 — ample data
- Swadesh bootstrap: 200-word list (Aromanian ↔ Romanian cognate pairs) with sound-correspondence rules:
  - ron /oa/ → rup /ua/ (diphthong shift)
  - ron drops final -u → rup retains it
- LingPy cognate detection: configured with Romance sound-correspondence rules
- Expected coverage: ~180/200 Swadesh items have usable cognates
- Note: modern/technical vocabulary diverges; Swadesh bootstrap only
- Next: linguistic-bitext for synthetic bitext via cognate substitution
Related Skills
- linguistic-scope: URIEL distance confirms relationship; transfer-source recommendation
- linguistic-bitext: cognate-based bitext synthesis
- linguistic-lexicon: cognate bootstrap feeds lexicon construction