MAGIC Agent Skills

linguistic-historical

Historical/comparative linguistics primitives for ML data augmentation: cognate sets across related languages, Swadesh lists for cheap bilingual-lexicon bootstrapping, and regular sound-correspondence rules. Optional Mindset specialist.

Overview

For Class 0–1 languages with no usable training data, comparative-historical linguistics is often the most practical bootstrapping path: cognate sets, Swadesh lists, and sound correspondences drawn from a related higher-resource language. These tools have been in use for decades and serve as the operational primitives for low-resource ML data augmentation.

Pipeline Position

Phase: Optional (Phase 4) — activate when target is Class 0–1 and a related higher-resource language exists

When to activate: After linguistic-scope identifies URIEL distance ≤ 0.3 to a related Class 3+ language and linguistic-corpus confirms insufficient monolingual data

When It Activates

  • Class 0–1 target language; need bilingual lexicon from related-language cognates
  • Bootstrapping bitext from Swadesh-list pairs
  • Validating typological-distance recommendations from linguistic-scope
  • Cognate-based data augmentation between related-language pairs

When NOT to use: Class 3+ languages, where standard data is ample and a cognate bootstrap is not needed. For a pure typology lookup, use linguistic-scope instead.

What It Does

Key Primitives

Cognate sets: Etymologically related words across related languages. Spanish "noche" ↔ Italian "notte" ↔ French "nuit" ↔ Portuguese "noite" all descend from the same Latin source (noctem) through predictable sound shifts. For Class 0–1 languages, cognate sets bootstrap bilingual lexicons cheaply.
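As a rough illustration of why such sets are easy to work with, the sketch below scores the four "night" forms by normalized edit distance. It is a simplification under stated assumptions: it compares spelling rather than phonetic transcription, and the function names are hypothetical, not part of LingPy or any other toolkit listed on this page.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a single rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer form."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Orthographic forms of the cognate set quoted above (keys are ISO 639-3 codes).
NIGHT = {"spa": "noche", "ita": "notte", "fra": "nuit", "por": "noite"}


def pairwise_distances(forms: dict[str, str]) -> dict[tuple[str, str], float]:
    """Normalized distance for every unordered language pair."""
    langs = sorted(forms)
    return {
        (x, y): normalized_distance(forms[x], forms[y])
        for i, x in enumerate(langs)
        for y in langs[i + 1:]
    }


if __name__ == "__main__":
    ranked = sorted(pairwise_distances(NIGHT).items(), key=lambda kv: kv[1])
    for pair, dist in ranked:
        print(pair, round(dist, 2))
```

Surface distance is only a weak signal: true cognates can look quite different after regular sound change, and chance resemblances can score well, so a score like this is better treated as a candidate filter than as a cognacy judgment.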

Swadesh lists: 100-word and 200-word core-vocabulary lists. The standard starting point for Class 0 bilingual-lexicon work. NOT a complete lexicon, only a starting bootstrap: modern terms, technical vocabulary, and cultural concepts diverge between languages, and Swadesh lists cover core vocabulary only.
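One way to picture the bootstrap step is a join of two concept-keyed word lists. The sketch below assumes each list is a plain dictionary from Swadesh concept to word form. The helper names and the five-item sample are illustrative only, and the missing Italian entry simulates a concept that was never elicited.

```python
def bootstrap_lexicon(source_list: dict[str, str],
                      target_list: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Pair source and target forms for every concept attested in both lists."""
    return {
        concept: (source_form, target_list[concept])
        for concept, source_form in source_list.items()
        if target_list.get(concept)  # skip missing or empty target entries
    }


def coverage(lexicon: dict[str, tuple[str, str]], concepts: list[str]) -> float:
    """Fraction of the concept list that received a word pair."""
    return len(lexicon) / len(concepts) if concepts else 0.0


# Tiny illustrative sample; a real list has 100 to 200 concepts.
CONCEPTS = ["night", "water", "sun", "moon", "dog"]
SPANISH = {"night": "noche", "water": "agua", "sun": "sol", "moon": "luna", "dog": "perro"}
ITALIAN = {"night": "notte", "water": "acqua", "sun": "sole", "moon": "luna"}  # "dog" not elicited

if __name__ == "__main__":
    lexicon = bootstrap_lexicon(SPANISH, ITALIAN)
    for concept, (src, tgt) in lexicon.items():
        print(f"{concept}: {src} <-> {tgt}")
    print("coverage:", coverage(lexicon, CONCEPTS))
```

A pair produced this way is a translation pair, not necessarily a cognate pair: Spanish "perro" and Italian "cane" would both fill the "dog" slot without sharing an etymology, which is why cognate judgment remains a separate step.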

Regular sound correspondences: Grimm's Law (Germanic), Proto-Bantu reflexes, etc. Encode as rewrite rules → automatic cognate detection or augmentation between related-language pairs.
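A minimal sketch of the rewrite-rule idea follows. The rules are a toy, unconditioned version of the voiceless-stop part of Grimm's Law (p → f, t → θ, k → h), applied to Latin-like forms that stand in for the older stage. Real correspondences are context-sensitive and operate on phonetic transcriptions, and `compile_rules` is a hypothetical helper, not an API of any listed toolkit.

```python
import re
from typing import Callable


def compile_rules(rules: dict[str, str]) -> Callable[[str], str]:
    """Build a function that applies all rewrite rules in one parallel pass.

    Matching longer sources first lets multi-character rules take priority,
    and the single pass prevents one rule's output from feeding another rule.
    """
    ordered = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(src) for src in ordered))
    return lambda form: pattern.sub(lambda m: rules[m.group(0)], form)


# Toy, context-free approximation of Grimm's Law for voiceless stops.
GRIMM_TOY = {"p": "f", "t": "θ", "k": "h"}
shift = compile_rules(GRIMM_TOY)

if __name__ == "__main__":
    # Roughly phonemic spellings; compare English "father", "three", "horn".
    for form in ["pater", "tres", "kornu"]:
        print(form, "->", shift(form))
```

Applying the rules in parallel rather than one after another is a deliberate choice here: with sequential application, a chain such as a → b followed by b → c would turn every a into c, which is rarely what a correspondence table means.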

Key Data Sources

| Resource | Coverage | Use |
|---|---|---|
| LingPy | Python comparative-linguistics toolkit | Cognate-detection algorithms |
| NorthEuraLex | Large cognate database for Eurasian languages | Lexicon bootstrapping |
| IE-CoR | Indo-European Cognate Relationships database | IE language families |
| BDPROTO | Proto-language phoneme inventories | Reconstruction reference |
| CogNet | Multilingual cognate database | Cross-family lookup |

When Cognate Augmentation Is Highest-Value

Cognate-based augmentation is most valuable when:

  • Target is Class 0–1
  • Source is Class 3+
  • URIEL distance ≤ 0.3 (cross-reference linguistic-scope)

"Related" does not mean "mutually intelligible": Italian and Romanian are both Romance languages, but their speakers generally cannot understand each other without study. Typological proximity alone does not guarantee a usable bilingual lexicon.
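The three activation conditions listed above can be expressed as a simple gate. This is a sketch only: the `LanguagePair` type and the function name are hypothetical, and the 0.3 threshold is the value quoted on this page rather than a constant from any library.

```python
from dataclasses import dataclass

URIEL_MAX_DISTANCE = 0.3  # threshold quoted in this document


@dataclass(frozen=True)
class LanguagePair:
    """Resource classes (0-5) and URIEL distance for a target/source pair."""
    target_class: int
    source_class: int
    uriel_distance: float


def cognate_augmentation_recommended(pair: LanguagePair,
                                     max_distance: float = URIEL_MAX_DISTANCE) -> bool:
    """True only when all three conditions hold at once."""
    return (
        pair.target_class <= 1                   # target is Class 0-1
        and pair.source_class >= 3               # source is Class 3+
        and pair.uriel_distance <= max_distance  # languages are close enough
    )


if __name__ == "__main__":
    # Values from the Aromanian/Romanian example on this page.
    aromanian_from_romanian = LanguagePair(target_class=1, source_class=4, uriel_distance=0.19)
    print(cognate_augmentation_recommended(aromanian_from_romanian))
```

Passing the gate only says the pair is worth trying. As the caveat above notes, the resulting lexicon still needs checking before it is used for augmentation.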

Inputs & Outputs

| Input | Description |
|---|---|
| Target language + related language | For cognate set lookup |
| URIEL distance from scope | Confirms relationship strength |

| Output | Description |
|---|---|
| Cognate set | Core vocabulary pairs with sound correspondences |
| Swadesh bootstrap lexicon | 100–200 word pairs |
| Sound-correspondence rules | Rewrite rules for augmentation |
| Augmentation estimate | Expected lexicon coverage gain |

Example Usage

Target: Aromanian (rup, Class 1), related to Romanian (ron, Class 4), URIEL distance 0.19

Historical Bootstrap: Aromanian from Romanian
- URIEL distance: 0.19 (close; cognate augmentation high-value)
- Cognate source: Romanian (ron) Class 4 — ample data
- Swadesh bootstrap: 200-word list (Aromanian ↔ Romanian cognate pairs)
    with sound-correspondence rules:
    rom /oa/ → arom /ua/ (diphthong shift)
    rom drops final -u; arom retains it
- LingPy cognate detection: configured with Romance sound-correspondence rules
- Expected coverage: ~180/200 Swadesh items have usable cognates
- Note: modern/technical vocabulary diverges; Swadesh bootstrap only
- Next: linguistic-bitext for synthetic bitext via cognate substitution