linguistic-morph

Morphological analysis for the target language: UniMorph paradigm lookup, SIGMORPHON segmenters, FST/HFST analyzer recommendations, and morphology-aware data augmentation via paradigm completion.

Overview

A tokenizer fertility audit tells you that BPE produces too many tokens per word. Morphology tells you why, and the mechanism determines the fix. Concatenative agglutination (Turkish) needs a different approach than templatic morphology (Arabic root-and-pattern) or polysynthesis (Inuktitut word-sentences). linguistic-morph classifies the morphology tier and selects the right tooling before you commit to expensive training runs.

Pipeline Position

Phase: Analyze (Phase 2)

Before this skill: linguistic-scope (typology outliers flag morphological complexity), linguistic-tokenize (fertility audit may trigger morph analysis)

After this skill: linguistic-syntax (UD annotation needs morph-aware tokenization), linguistic-eval (morphology-aware augmentation improves downstream metrics)

When It Activates

- Target language is morphologically complex (agglutinative, polysynthetic, templatic, fusional with rich case)
- Tokenizer fertility audit shows morpheme-tokenization gap
- Building gold morphological annotations
- Augmenting small training data via paradigm completion
- Choosing between UniMorph paradigms vs ML segmenter vs FST analyzer

When NOT to use: the target is a Latin- or Cyrillic-script fusional language with low morphological complexity, where a standard fine-tune is enough. For a pure tokenizer audit, use linguistic-tokenize instead.

Morphology Tier Classification

Tier	Examples	Implications
lo (low)	English, Mandarin, Vietnamese, Indonesian	Standard tokenizer fine
mid (moderate)	Spanish, French, Russian, Modern Greek	Some inflection; BPE usually OK
hi (high)	Turkish, Finnish, Hungarian, Korean, Tamil, Arabic (templatic), Hebrew	Morphology-aware tokenization helps; UniMorph if available
extreme (polysynthetic)	Inuktitut, Navajo, Mohawk, West Greenlandic, Cherokee	Morpheme segmentation mandatory

What It Does

Approach Selection

Tier	UniMorph Available	Recommended Approach
lo / mid	any	Skip morph skill; BPE handles it
hi	YES (good coverage)	UniMorph paradigms + paradigm-completion augmentation
hi	NO	SIGMORPHON 2022/2023 segmenter + tokenizer audit
hi (templatic)	any	Root-pattern handler — special tooling needed
extreme	YES	UniMorph + FST (HFST/foma) + segmenter for OOV
extreme	NO	SIGMORPHON segmenter + community-collected paradigm samples
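
The table above can be read as a small decision function. The following is a minimal sketch, not the skill's actual implementation: the function name `select_approach` and the `templatic` flag are hypothetical, and checking the templatic case before UniMorph availability is an assumption based on the "any" entry in that row.

```python
from typing import Optional


def select_approach(tier: str, unimorph_available: bool,
                    templatic: bool = False) -> Optional[str]:
    """Return the recommended approach from the table, or None to skip the skill.

    Hypothetical helper: names and argument order are illustrative only.
    """
    if tier in ("lo", "mid"):
        return None  # skip the morph skill; BPE handles it
    if tier == "hi":
        if templatic:
            # The table lists "any" for UniMorph here, so this check comes first.
            return "Root-pattern handler (special tooling needed)"
        if unimorph_available:
            return "UniMorph paradigms + paradigm-completion augmentation"
        return "SIGMORPHON 2022/2023 segmenter + tokenizer audit"
    if tier == "extreme":
        if unimorph_available:
            return "UniMorph + FST (HFST/foma) + segmenter for OOV"
        return "SIGMORPHON segmenter + community-collected paradigm samples"
    raise ValueError(f"unknown tier: {tier!r}")


# Turkish: hi tier with good UniMorph coverage.
print(select_approach("hi", unimorph_available=True))
```

Returning `None` for the lo and mid tiers keeps the "skip" case explicit, so a caller cannot mistake it for a tooling recommendation.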

Segmenter Selection

Family	Recommended Segmenter
Agglutinative (Turkic, Uralic, Bantu, Korean)	SIGMORPHON 2023 winners; Morfessor as fallback
Polysynthetic (Eskimo-Aleut, Athabaskan, Iroquoian)	UniMorph + FST if available; else SIGMORPHON 2023
Templatic (Semitic)	Custom root-pattern — do NOT use concatenative segmenters
Fusional with case (Slavic, Greek, Sanskrit)	Morfessor or stanza morphology

Never use BPE-as-segmenter for morphological analysis — BPE is a compression algorithm; it does not respect morpheme boundaries.
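
One way to make the boundary mismatch concrete is to compare where a subword split cuts a word against where the gold morpheme boundaries fall. The sketch below is illustrative only: the helper names are hypothetical, and the subword split is invented for the example rather than taken from a real tokenizer.

```python
def boundaries(pieces):
    """Character offsets of the internal boundaries in a segmentation."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:  # no boundary after the final piece
        pos += len(piece)
        offsets.add(pos)
    return offsets


def boundary_scores(gold, predicted):
    """Precision and recall of predicted boundaries against gold boundaries."""
    if "".join(gold) != "".join(predicted):
        raise ValueError("segmentations spell different words")
    g, p = boundaries(gold), boundaries(predicted)
    hits = len(g & p)
    precision = hits / len(p) if p else 1.0
    recall = hits / len(g) if g else 1.0
    return precision, recall


# Turkish "evlerinizden" = ev-ler-iniz-den ("from your houses").
gold = ["ev", "ler", "iniz", "den"]
# Hypothetical subword split, NOT the output of a real tokenizer.
subwords = ["evl", "erin", "izden"]
print(boundary_scores(gold, subwords))
```

A split can have perfectly reasonable fertility (three tokens here) and still miss every morpheme boundary, which is why fertility alone does not substitute for a morphological segmenter.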

Inputs & Outputs

Input	Description
Target language + typology from scope	For tier classification
Fertility verdict from tokenize	May trigger morphology deep-dive

Output	Description
Morphology tier	lo / mid / hi / extreme
UniMorph coverage	Available paradigms / total estimate
Segmenter recommendation	SIGMORPHON / Morfessor / FST
Augmentation strategy	Paradigm completion × multiplier
`workspace_state.md` entry	Morphology plan
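
The UniMorph coverage figure and the cells that paradigm completion would need to fill can both be estimated from UniMorph's three-column format (lemma, inflected form, feature bundle, tab-separated). The following is a rough sketch under stated assumptions: the function names are hypothetical, the sample rows are toy data written for this example rather than copied from a UniMorph release, and all lemmas are assumed to share one part of speech so that they can share one set of paradigm slots.

```python
from collections import defaultdict


def parse_unimorph(lines):
    """Group (lemma, form, features) triples into {lemma: {features: form}}."""
    paradigms = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        lemma, form, feats = line.split("\t")
        paradigms[lemma][feats] = form
    return dict(paradigms)


def missing_cells(paradigms):
    """Per lemma, feature bundles attested for some other lemma but not this one."""
    all_slots = set()
    for cells in paradigms.values():
        all_slots.update(cells)
    return {lemma: sorted(all_slots - set(cells))
            for lemma, cells in paradigms.items()}


def coverage(paradigms):
    """Filled cells divided by (number of lemmas x number of distinct slots)."""
    filled = sum(len(cells) for cells in paradigms.values())
    gaps = sum(len(g) for g in missing_cells(paradigms).values())
    total = filled + gaps
    return filled / total if total else 0.0


# Toy rows for illustration only (assumed feature labels).
TOY_LINES = [
    "ev\tev\tN;NOM;SG",
    "ev\tevler\tN;NOM;PL",
    "ev\tevi\tN;ACC;SG",
    "kitap\tkitap\tN;NOM;SG",
    "kitap\tkitaplar\tN;NOM;PL",
]

toy = parse_unimorph(TOY_LINES)
print(missing_cells(toy))
print(round(coverage(toy), 3))
```

This only locates the gaps. Filling them requires an inflection model or an FST generator, and generated forms should be reviewed before they enter training data.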

Example Usage

Language: Turkish (tur), Joshi Class 4, agglutinative-hi tier

Morphology Plan: Turkish
- Tier: hi (agglutinative)
- UniMorph coverage: good (Turkish well-covered)
- Approach: UniMorph paradigms + paradigm-completion augmentation
- Segmenter: SIGMORPHON 2023 (best agglutinative)
- FST: HFST Turkish available (Apertium)
- Augmentation: paradigm completion 15×, add <morph_aug> tag
- Expected gain: +2 BLEU MT, +3% NER F1
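
The augmentation line in the plan can be sketched as follows. The page does not define the format of an augmented example or exactly what the 15× multiplier counts, so this sketch assumes a cap of 15 synthetic lines per lemma and a hypothetical "lemma features -> form" line format. The function name and the toy paradigm are also illustrative.

```python
import random


def augment_from_paradigm(lemma, cells, multiplier=15,
                          tag="<morph_aug>", seed=0):
    """Emit at most `multiplier` tagged lines, one per paradigm cell.

    `cells` maps a feature bundle to an inflected form. The output line
    format is an assumption made for this sketch.
    """
    items = sorted(cells.items())
    if len(items) > multiplier:
        # Sample reproducibly when the paradigm has more cells than the cap.
        items = random.Random(seed).sample(items, multiplier)
    return [f"{tag} {lemma} {feats} -> {form}" for feats, form in items]


# Toy paradigm for illustration only (assumed feature labels).
ev_cells = {"N;NOM;SG": "ev", "N;NOM;PL": "evler", "N;ACC;SG": "evi"}
for line in augment_from_paradigm("ev", ev_cells):
    print(line)
```

Tagging every synthetic line makes it possible to filter or down-weight augmented data later, for example when measuring whether the augmentation actually helps.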

Related Skills

- linguistic-scope: typology outliers flag polysynthesis, agglutination, templatic morphology
- linguistic-tokenize: fertility audit triggers morphology analysis for hi/extreme tiers
- linguistic-syntax: UD annotation benefits from morph-aware tokenization
- linguistic-annotate: gold morphological annotation methodology
