linguistic-morph
Morphological analysis for the target language: UniMorph paradigm lookup, SIGMORPHON segmenters, FST/HFST analyzer recommendations, and morphology-aware data augmentation via paradigm completion.
Overview
Tokenizer fertility tells you BPE produces too many tokens. Morphology tells you why — and the mechanism determines the fix. Concatenative agglutination (Turkish) needs a different approach than templatic morphology (Arabic root-and-pattern) or polysynthesis (Inuktitut word-sentences). linguistic-morph classifies the morphology tier and selects the right tooling before expensive training runs.
Pipeline Position
Phase: Analyze (Phase 2)
Before this skill: linguistic-scope (typology outliers flag morphological complexity), linguistic-tokenize (fertility audit may trigger morph analysis)
After this skill: linguistic-syntax (UD annotation needs morph-aware tokenization), linguistic-eval (morphology-aware augmentation improves downstream metrics)
When It Activates
- Target language is morphologically complex (agglutinative, polysynthetic, templatic, fusional with rich case)
- Tokenizer fertility audit shows morpheme-tokenization gap
- Building gold morphological annotations
- Augmenting small training data via paradigm completion
- Choosing between UniMorph paradigms vs ML segmenter vs FST analyzer
When NOT to use: lo- or mid-tier languages (for example, a Latin- or Cyrillic-script fusional language with light inflection) where a standard fine-tune is enough. For a pure tokenizer audit, use linguistic-tokenize instead.
Morphology Tier Classification
| Tier | Examples | Implications |
|---|---|---|
| lo (low) | English, Mandarin, Vietnamese, Indonesian | Standard tokenizer fine |
| mid (moderate) | Spanish, French, Russian, Modern Greek | Some inflection; BPE usually OK |
| hi (high) | Turkish, Finnish, Hungarian, Korean, Tamil, Arabic (templatic), Hebrew | Morphology-aware tokenization helps; UniMorph if available |
| extreme (polysynthetic) | Inuktitut, Navajo, Mohawk, West Greenlandic, Cherokee | Morpheme segmentation mandatory |
What It Does
Approach Selection
| Tier | UniMorph Available | Recommended Approach |
|---|---|---|
| lo / mid | any | Skip morph skill; BPE handles it |
| hi | YES (good coverage) | UniMorph paradigms + paradigm-completion augmentation |
| hi | NO | SIGMORPHON 2022/2023 segmenter + tokenizer audit |
| hi (templatic) | any | Root-pattern handler — special tooling needed |
| extreme | YES | UniMorph + FST (HFST/foma) + segmenter for OOV |
| extreme | NO | SIGMORPHON segmenter + community-collected paradigm samples |
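The selection table above can be read as a small decision function. The sketch below is only an illustration of that reading: the function name `select_approach` and its parameters are hypothetical, and `unimorph_available` is assumed to mean "available with good coverage", as in the table.

```python
def select_approach(tier: str, unimorph_available: bool, templatic: bool = False) -> str:
    """Map (tier, UniMorph availability) to the recommended approach.

    Hypothetical helper: it only mirrors the rows of the table above.
    """
    if tier in ("lo", "mid"):
        # UniMorph availability is irrelevant for these tiers.
        return "Skip morph skill; BPE handles it"
    if tier == "hi":
        if templatic:
            # Templatic languages take this branch regardless of UniMorph coverage.
            return "Root-pattern handler (special tooling needed)"
        if unimorph_available:
            return "UniMorph paradigms + paradigm-completion augmentation"
        return "SIGMORPHON 2022/2023 segmenter + tokenizer audit"
    if tier == "extreme":
        if unimorph_available:
            return "UniMorph + FST (HFST/foma) + segmenter for OOV"
        return "SIGMORPHON segmenter + community-collected paradigm samples"
    raise ValueError(f"unknown tier: {tier!r}")
```

Checking the templatic flag before UniMorph availability keeps the "any" row of the table from being shadowed by the two plain hi rows.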
Segmenter Selection
| Family | Recommended Segmenter |
|---|---|
| Agglutinative (Turkic, Uralic, Bantu, Korean) | SIGMORPHON 2023 winners; Morfessor as fallback |
| Polysynthetic (Eskimo-Aleut, Athabaskan, Iroquoian) | UniMorph + FST if available; else SIGMORPHON 2023 |
| Templatic (Semitic) | Custom root-pattern — do NOT use concatenative segmenters |
| Fusional with case (Slavic, Greek, Sanskrit) | Morfessor or stanza morphology |
Never use BPE-as-segmenter for morphological analysis — BPE is a compression algorithm; it does not respect morpheme boundaries.
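One way to make the boundary mismatch concrete is to score a tokenizer's split points against gold morpheme boundaries. The following is a minimal sketch, not a method prescribed by this skill: the helper names `boundaries` and `boundary_f1` are hypothetical, and the subword splits in the usage note are made up for illustration rather than taken from a real tokenizer.

```python
def boundaries(segments):
    """Return the set of internal character offsets where a segment ends."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:  # the end of the last segment is the word end, not a boundary
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_f1(tokens, morphemes):
    """F1 between subword split points and gold morpheme boundaries for one word."""
    if "".join(tokens) != "".join(morphemes):
        raise ValueError("both segmentations must spell the same word")
    pred, gold = boundaries(tokens), boundaries(morphemes)
    if not pred and not gold:
        return 1.0  # neither side splits the word: trivially in agreement
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For Turkish evlerinizden ("from your houses"), the gold segmentation is ev + ler + iniz + den. A hypothetical subword split evl + erin + izden shares no boundary with it and scores 0.0, while ev + leriniz + den recovers two of the three gold boundaries and scores 0.8.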
Paradigm Completion Augmentation
For Joshi Class 1–2 agglutinative targets: pull UniMorph paradigms, generate 10–30 inflected forms for each lemma that occurs in the training corpus, and add them as augmented training data. Typical gain: 1–3 BLEU on MT, 2–5% on POS/NER.
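The augmentation step can be sketched as follows, assuming the UniMorph data is in its usual three-column TSV layout (lemma, inflected form, feature bundle). The function names, the per-lemma cap `max_forms`, and the output line format are assumptions made for this illustration; only the `<morph_aug>` tag comes from the example plan in this document.

```python
import random
from collections import defaultdict


def load_unimorph(lines):
    """Parse UniMorph-style TSV rows into {lemma: [(form, features), ...]}."""
    paradigms = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        lemma, form, feats = parts[:3]
        paradigms[lemma].append((form, feats))
    return paradigms


def complete_paradigms(corpus_lemmas, paradigms, max_forms=15, seed=0):
    """Emit tagged inflected forms for every corpus lemma that has a paradigm.

    max_forms caps the forms per lemma (hypothetical default of 15);
    the output line layout is an assumption, not a fixed format.
    """
    rng = random.Random(seed)  # fixed seed keeps the sampled subset reproducible
    augmented = []
    for lemma in sorted(set(corpus_lemmas)):
        entries = paradigms.get(lemma, [])
        if len(entries) > max_forms:
            entries = rng.sample(entries, max_forms)
        for form, feats in entries:
            augmented.append(f"<morph_aug> {lemma} {feats} {form}")
    return augmented
```

Lemmas without a UniMorph paradigm are skipped silently, so coverage is worth reporting alongside the augmented set. Sampling rather than taking the first rows avoids favoring whichever cells happen to appear first in the file.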
FST Analyzers
For languages with documented FSTs: HFST (~50 languages, open-source) or foma. Check Apertium repos for many under-resourced languages. When no FST exists: SIGMORPHON segmenter is a faster path than building an FST from scratch.
Inputs & Outputs
| Input | Description |
|---|---|
| Target language + typology from scope | For tier classification |
| Fertility verdict from tokenize | May trigger morphology deep-dive |
| Output | Description |
|---|---|
| Morphology tier | lo / mid / hi / extreme |
| UniMorph coverage | Available paradigms / total estimate |
| Segmenter recommendation | SIGMORPHON / Morfessor / FST |
| Augmentation strategy | Paradigm completion × multiplier |
| workspace_state.md entry | Morphology plan |
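The outputs above could be carried as a small record and rendered into the workspace_state.md entry. This is a sketch under assumptions: the class name `MorphologyPlan`, its field names, and the rendered layout are hypothetical and only follow the rows of the output table.

```python
from dataclasses import dataclass

VALID_TIERS = ("lo", "mid", "hi", "extreme")


@dataclass
class MorphologyPlan:
    """Hypothetical container for the outputs listed in the table above."""
    language: str
    tier: str
    unimorph_coverage: str
    segmenter: str
    augmentation: str

    def __post_init__(self):
        # Reject tiers outside the four-level classification.
        if self.tier not in VALID_TIERS:
            raise ValueError(f"tier must be one of {VALID_TIERS}, got {self.tier!r}")

    def to_markdown(self) -> str:
        """Render the plan as a bullet list for workspace_state.md."""
        lines = [
            f"Morphology Plan: {self.language}",
            f"- Tier: {self.tier}",
            f"- UniMorph coverage: {self.unimorph_coverage}",
            f"- Segmenter: {self.segmenter}",
            f"- Augmentation: {self.augmentation}",
        ]
        return "\n".join(lines)
```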
Example Usage
Language: Turkish (tur), Joshi Class 4, agglutinative-hi tier
Morphology Plan: Turkish
- Tier: hi (agglutinative)
- UniMorph coverage: good (Turkish well-covered)
- Approach: UniMorph paradigms + paradigm-completion augmentation
- Segmenter: SIGMORPHON 2023 (best agglutinative)
- FST: HFST Turkish available (Apertium)
- Augmentation: paradigm completion 15×, add <morph_aug> tag
- Expected gain: +2 BLEU MT, +3% NER F1
Related Skills
- linguistic-scope: typology outliers flag polysynthesis, agglutination, templatic morphology
- linguistic-tokenize: fertility audit triggers morphology analysis for hi/extreme tiers
- linguistic-syntax: UD annotation benefits from morph-aware tokenization
- linguistic-annotate: gold morphological annotation methodology