linguistic-historical

Historical/comparative linguistics primitives for ML data augmentation: cognate sets across related languages, Swadesh lists for cheap bilingual-lexicon bootstrapping, and regular sound-correspondence rules. Optional Mindset specialist.

Overview

For Class 0–1 languages with no usable training data, comparative-historical linguistics is often the most practical bootstrapping path: cognate sets, Swadesh lists, and sound correspondences drawn from a related higher-resource language. These tools have decades of use behind them, and they are the primitives this specialist operationalizes for low-resource ML data augmentation.

Pipeline Position

Phase: Optional (Phase 4) — activate when target is Class 0–1 and a related higher-resource language exists

When to activate: After linguistic-scope identifies URIEL distance ≤ 0.3 to a related Class 3+ language and linguistic-corpus confirms insufficient monolingual data
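
The activation conditions above can be read as a simple gate. The following is a minimal sketch under stated assumptions: the function name, its parameters, and the boolean `monolingual_data_sufficient` flag are hypothetical and are not an interface of linguistic-scope or linguistic-corpus.

```python
def activation_blockers(
    target_class: int,
    source_class: int,
    uriel_distance: float,
    monolingual_data_sufficient: bool,
    max_distance: float = 0.3,
) -> list[str]:
    """Return the reasons NOT to activate; an empty list means activate."""
    blockers = []
    if target_class > 1:
        blockers.append("target is not Class 0-1")
    if source_class < 3:
        blockers.append("related language is below Class 3")
    if uriel_distance > max_distance:
        blockers.append(f"URIEL distance {uriel_distance} exceeds {max_distance}")
    if monolingual_data_sufficient:
        blockers.append("monolingual data is already sufficient")
    return blockers


# A Class 1 target with a Class 4 relative at URIEL distance 0.19.
print(activation_blockers(1, 4, 0.19, monolingual_data_sufficient=False))  # []
```

Returning the list of blockers rather than a bare boolean makes it easier to report why the phase was skipped.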

When It Activates

Class 0–1 target language; need bilingual lexicon from related-language cognates
Bootstrapping bitext from Swadesh-list pairs
Validating typological-distance recommendations from linguistic-scope
Cognate-based data augmentation between related-language pairs

When NOT to use: Class 3+ language — standard data is ample; cognate bootstrap not needed. Pure typology lookup → linguistic-scope.

What It Does

Key Primitives

Cognate sets: Sets of etymologically related words across related languages. Spanish "noche" ↔ Italian "notte" ↔ French "nuit" ↔ Portuguese "noite" all descend from the same Latin source (noctem) through predictable sound shifts. For Class 0–1 targets, cognate sets bootstrap bilingual lexicons cheaply.
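
As a rough illustration of how cognate candidates might be scored, the sketch below computes a length-normalized edit distance over the "night" forms above. This is an assumption for illustration only: orthographic edit distance is a crude proxy, it is not how LingPy or any listed database detects cognates, and the ISO 639-3 keys are added here for labeling.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer form."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# The "night" cognate set from the text, keyed by ISO 639-3 code.
night = {"spa": "noche", "ita": "notte", "fra": "nuit", "por": "noite"}

for (l1, w1), (l2, w2) in combinations(night.items(), 2):
    print(f"{l1}-{l2}: {normalized_distance(w1, w2):.2f}")
```

Low scores flag candidate pairs for review; they do not establish cognacy, since chance resemblance and borrowing also produce similar forms.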

Swadesh lists: 100-word and 200-word core-vocabulary lists. The standard starting point for Class 0 bilingual-lexicon work. NOT a complete lexicon, only a starting bootstrap: modern terms, technical vocabulary, and cultural concepts diverge, and Swadesh lists cover core vocabulary only.
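
A Swadesh bootstrap amounts to joining two word lists on their shared concept glosses. The sketch below assumes each list is a plain dict from English gloss to word form; the function name and the four-item Spanish/Italian sample are illustrative, not a real Swadesh list.

```python
def bootstrap_lexicon(source: dict[str, str], target: dict[str, str]) -> dict[str, str]:
    """Pair source and target forms that share a Swadesh concept gloss."""
    shared = source.keys() & target.keys()
    return {source[concept]: target[concept] for concept in sorted(shared)}


# Tiny illustrative sample; a real list would have 100 or 200 concepts.
spa = {"night": "noche", "water": "agua", "moon": "luna", "dog": "perro"}
ita = {"night": "notte", "water": "acqua", "moon": "luna"}

lexicon = bootstrap_lexicon(spa, ita)
missing = sorted(spa.keys() - ita.keys())

print(lexicon)  # {'luna': 'luna', 'noche': 'notte', 'agua': 'acqua'}
print(missing)  # ['dog']
```

Tracking the concepts with no target form is useful because those gaps are where the bootstrap lexicon needs other sources.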

Regular sound correspondences: Systematic correspondences such as Grimm's Law (Germanic) or Proto-Bantu reflexes. Encoded as rewrite rules, they drive automatic cognate detection or augmentation between related-language pairs.
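
One way to encode correspondences as rewrite rules is a mapping applied in a single pass, so that the output of one rule is never fed into another. That matters for chain shifts like Grimm's Law, where applying bh → b and then b → p in sequence would give the wrong result. The sketch below is a simplified assumption: the rule table covers only the stop series, ignores Verner's Law, vowels, and laryngeals, and the input strings are toy ASCII transcriptions rather than reconstructions.

```python
import re
from typing import Callable


def make_rewriter(rules: dict[str, str]) -> Callable[[str], str]:
    """Build a function that applies all rules simultaneously in one pass."""
    # Longest patterns first so digraphs like "bh" win over their prefix "b".
    ordered = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in ordered))
    return lambda form: pattern.sub(lambda m: rules[m.group(0)], form)


# Simplified slice of Grimm's Law; "bh", "dh", "gh" stand for aspirated stops.
GRIMM = {
    "bh": "b", "dh": "d", "gh": "g",  # voiced aspirates -> plain voiced stops
    "b": "p", "d": "t", "g": "k",     # voiced stops -> voiceless stops
    "p": "f", "t": "þ", "k": "h",     # voiceless stops -> fricatives
}

shift = make_rewriter(GRIMM)

print(shift("bhrater"))  # braþer
print(shift("ped"))      # fet
print(shift("dekm"))     # tehm
```

Because `re.sub` does not rescan replaced text, each segment is rewritten exactly once, which is the behavior a chain shift requires.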

Key Data Sources

Resource	Coverage	Use
LingPy	Python comparative-linguistics toolkit	Cognate-detection algorithms
NorthEuraLex	Large lexical database for Northern Eurasian languages	Lexicon bootstrapping
IE-CoR	Indo-European Cognate Relationships database	Indo-European branches
BDPROTO	Proto-language phoneme inventories	Reconstruction reference
CogNet	Multilingual cognate database	Cross-family lookup

When Cognate Augmentation Is Highest-Value

Cognate-based augmentation is most valuable when:

Target is Class 0–1
Source is Class 3+
URIEL distance ≤ 0.3 (cross-reference linguistic-scope)

"Related" does not mean "mutually intelligible": Italian and Romanian are both Romance, but their speakers do not readily understand each other. Typological proximity alone does not guarantee a directly usable bilingual lexicon.

Inputs & Outputs

Input	Description
Target language + related language	For cognate set lookup
URIEL distance from scope	Confirms relationship strength

Output	Description
Cognate set	Core vocabulary pairs with sound correspondences
Swadesh bootstrap lexicon	100–200 word pairs
Sound-correspondence rules	Rewrite rules for augmentation
Augmentation estimate	Expected lexicon coverage gain

Example Usage

Target: Aromanian (rup, Class 1), related to Romanian (ron, Class 4), URIEL distance 0.19

Historical Bootstrap: Aromanian from Romanian
- URIEL distance: 0.19 (close; cognate augmentation high-value)
- Cognate source: Romanian (ron) Class 4 — ample data
- Swadesh bootstrap: 200-word list (Aromanian ↔ Romanian cognate pairs)
    with sound-correspondence rules:
    ron /oa/ → rup /ua/ (diphthong shift)
    ron drops final -u; rup retains it
- LingPy cognate detection: configured with Romance sound-correspondence rules
- Expected coverage: ~180/200 Swadesh items have usable cognates
- Note: modern/technical vocabulary diverges; Swadesh bootstrap only
- Next: linguistic-bitext for synthetic bitext via cognate substitution
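
The expected-coverage figure in the example is a simple ratio of Swadesh concepts with a usable cognate to concepts on the list. The sketch below shows that arithmetic under stated assumptions: the `SwadeshEntry` record and `coverage` function are hypothetical names, and the entries are placeholders that only mirror the ~180/200 estimate, not real Aromanian or Romanian forms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SwadeshEntry:
    concept: str
    source_form: str
    target_form: Optional[str]  # None when no usable cognate was found


def coverage(entries: list[SwadeshEntry]) -> float:
    """Fraction of Swadesh concepts that have a usable target cognate."""
    if not entries:
        return 0.0
    usable = sum(entry.target_form is not None for entry in entries)
    return usable / len(entries)


# Placeholder entries only: 180 of 200 concepts marked as having a cognate.
entries = [
    SwadeshEntry(f"concept_{i}", "src", "tgt" if i < 180 else None)
    for i in range(200)
]

print(f"{coverage(entries):.0%}")  # 90%
```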

Related

linguistic-scope: URIEL distance confirms relationship; transfer-source recommendation
linguistic-bitext: cognate-based bitext synthesis
linguistic-lexicon: cognate bootstrap feeds lexicon construction
