
linguistic-morph

Morphological analysis for the target language: UniMorph paradigm lookup, SIGMORPHON segmenters, FST/HFST analyzer recommendations, and morphology-aware data augmentation via paradigm completion.

Overview

Tokenizer fertility tells you that BPE produces too many tokens. Morphology tells you why, and the mechanism determines the fix. Concatenative agglutination (Turkish) needs a different approach than templatic morphology (Arabic root-and-pattern) or polysynthesis (Inuktitut word-sentences). linguistic-morph classifies the morphology tier and selects the right tooling before you commit to expensive training runs.

Pipeline Position

Phase: Analyze (Phase 2)

Before this skill: linguistic-scope (typology outliers flag morphological complexity), linguistic-tokenize (fertility audit may trigger morph analysis)

After this skill: linguistic-syntax (UD annotation needs morph-aware tokenization), linguistic-eval (morphology-aware augmentation improves downstream metrics)

When It Activates

  • Target language is morphologically complex (agglutinative, polysynthetic, templatic, fusional with rich case)
  • Tokenizer fertility audit shows morpheme-tokenization gap
  • Building gold morphological annotations
  • Augmenting small training data via paradigm completion
  • Choosing between UniMorph paradigms vs ML segmenter vs FST analyzer

When NOT to use: a Latin- or Cyrillic-script fusional language with light morphology, where a standard fine-tune is enough. For a pure tokenizer audit, use linguistic-tokenize instead.

Morphology Tier Classification

| Tier | Examples | Implications |
|---|---|---|
| lo (low) | English, Mandarin, Vietnamese, Indonesian | Standard tokenizer fine |
| mid (moderate) | Spanish, French, Russian, Modern Greek | Some inflection; BPE usually OK |
| hi (high) | Turkish, Finnish, Hungarian, Korean, Tamil, Arabic (templatic), Hebrew | Morphology-aware tokenization helps; UniMorph if available |
| extreme (polysynthetic) | Inuktitut, Navajo, Mohawk, West Greenlandic, Cherokee | Morpheme segmentation mandatory |

What It Does

Approach Selection

| Tier | UniMorph Available | Recommended Approach |
|---|---|---|
| lo / mid | any | Skip morph skill; BPE handles it |
| hi | YES (good coverage) | UniMorph paradigms + paradigm-completion augmentation |
| hi | NO | SIGMORPHON 2022/2023 segmenter + tokenizer audit |
| hi (templatic) | any | Root-pattern handler; special tooling needed |
| extreme | YES | UniMorph + FST (HFST/foma) + segmenter for OOV |
| extreme | NO | SIGMORPHON segmenter + community-collected paradigm samples |
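
The approach-selection table can be read as a small decision function. The sketch below is one possible encoding, not the skill's actual implementation; the function name `select_approach` and its parameters are hypothetical, and the returned strings are copied from the table.

```python
def select_approach(tier: str, unimorph_available: bool, templatic: bool = False) -> str:
    """Map (tier, UniMorph availability, templatic flag) to a recommended approach.

    Hypothetical helper: the branches mirror the rows of the table above.
    """
    if tier in ("lo", "mid"):
        # UniMorph availability is irrelevant for low/moderate morphology.
        return "Skip morph skill; BPE handles it"
    if tier == "hi":
        if templatic:
            # Templatic languages get the root-pattern row regardless of UniMorph.
            return "Root-pattern handler; special tooling needed"
        if unimorph_available:
            return "UniMorph paradigms + paradigm-completion augmentation"
        return "SIGMORPHON 2022/2023 segmenter + tokenizer audit"
    if tier == "extreme":
        if unimorph_available:
            return "UniMorph + FST (HFST/foma) + segmenter for OOV"
        return "SIGMORPHON segmenter + community-collected paradigm samples"
    raise ValueError(f"unknown tier: {tier!r}")


if __name__ == "__main__":
    print(select_approach("hi", unimorph_available=True))
    print(select_approach("extreme", unimorph_available=False))
```

Checking the templatic flag before UniMorph availability matters: the table lists "any" for templatic languages, so that row has to take priority over the two plain hi rows.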

Segmenter Selection

| Family | Recommended Segmenter |
|---|---|
| Agglutinative (Turkic, Uralic, Bantu, Korean) | SIGMORPHON 2023 winners; Morfessor as fallback |
| Polysynthetic (Eskimo, Athabaskan, Iroquoian) | UniMorph + FST if available; else SIGMORPHON 2023 |
| Templatic (Semitic) | Custom root-pattern; do NOT use concatenative segmenters |
| Fusional with case (Slavic, Greek, Sanskrit) | Morfessor or stanza morphology |

Never use BPE as a segmenter for morphological analysis. BPE is a compression algorithm and does not respect morpheme boundaries.
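
One way to make the mismatch between subword and morpheme boundaries concrete is to score a subword split against a gold morpheme segmentation. The following is a minimal sketch; the helper names are hypothetical, and the subword split in the example is invented for illustration rather than taken from a real tokenizer.

```python
def boundaries(segments: list[str]) -> set[int]:
    """Character offsets at which one segment ends and the next begins."""
    offsets = set()
    pos = 0
    for seg in segments[:-1]:  # the end of the word is not an internal boundary
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_scores(gold: list[str], predicted: list[str]) -> tuple[float, float]:
    """Return (precision, recall) of predicted boundaries against gold boundaries."""
    if "".join(gold) != "".join(predicted):
        raise ValueError("segmentations must spell the same word")
    g, p = boundaries(gold), boundaries(predicted)
    hits = len(g & p)
    precision = hits / len(p) if p else 1.0
    recall = hits / len(g) if g else 1.0
    return precision, recall


if __name__ == "__main__":
    # Turkish "evlerinizden" ("from your houses"): ev + ler + iniz + den.
    gold = ["ev", "ler", "iniz", "den"]
    # Invented subword split, for illustration only.
    subword = ["evl", "erin", "iz", "den"]
    precision, recall = boundary_scores(gold, subword)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this invented case only one of the three internal boundaries lines up, so both precision and recall come out at one third. Low scores across a sample of words are a sign that the tokenizer's splits cannot stand in for morphological analysis.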

Paradigm Completion Augmentation

For Class 1–2 agglutinative targets, pull the UniMorph paradigms, generate 10–30 inflected forms for each lemma in the training corpus, and add them as augmented training data. Typical gain: 1–3 BLEU on MT and 2–5% on POS/NER.
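
The augmentation step can be sketched as follows, assuming the standard UniMorph layout of one tab-separated `lemma`, `form`, `features` triple per line. The function names, the `max_forms` cap, and the sample lines are assumptions made for illustration; the sample lines are not copied from a UniMorph release.

```python
import random
from collections import defaultdict

# Illustrative lines in UniMorph's lemma<TAB>form<TAB>features layout.
SAMPLE_LINES = [
    "ev\tevler\tN;NOM;PL",
    "ev\tevde\tN;LOC;SG",
    "ev\tevden\tN;ABL;SG",
    "kitap\tkitaplar\tN;NOM;PL",
]


def load_paradigms(lines):
    """Group (form, features) pairs by lemma, skipping blank or malformed lines."""
    paradigms = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        lemma, form, feats = parts
        paradigms[lemma].append((form, feats))
    return paradigms


def augment(corpus_lemmas, paradigms, max_forms=15, seed=0):
    """Return up to max_forms (lemma, form, features) triples per corpus lemma."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    augmented = []
    for lemma in sorted(set(corpus_lemmas)):
        forms = paradigms.get(lemma, [])
        if len(forms) > max_forms:
            forms = rng.sample(forms, max_forms)
        augmented.extend((lemma, form, feats) for form, feats in forms)
    return augmented


if __name__ == "__main__":
    paradigms = load_paradigms(SAMPLE_LINES)
    for triple in augment(["ev", "kitap", "ev"], paradigms, max_forms=2):
        print(triple)
```

Capping the number of forms per lemma keeps high-frequency lemmas with very large paradigms from dominating the augmented set. How the generated forms are embedded into training examples (for instance with a marker tag) depends on the downstream task and is left out here.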

FST Analyzers

For languages with documented FSTs, use HFST (~50 languages, open source) or foma. The Apertium repositories cover many under-resourced languages. When no FST exists, a SIGMORPHON segmenter is a faster path than building an FST from scratch.

Inputs & Outputs

| Input | Description |
|---|---|
| Target language + typology from scope | For tier classification |
| Fertility verdict from tokenize | May trigger morphology deep-dive |

| Output | Description |
|---|---|
| Morphology tier | lo / mid / hi / extreme |
| UniMorph coverage | Available paradigms / total estimate |
| Segmenter recommendation | SIGMORPHON / Morfessor / FST |
| Augmentation strategy | Paradigm completion × multiplier |
| workspace_state.md entry | Morphology plan |
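
The outputs can be pictured as a small record that renders into a workspace_state.md entry. This is a sketch under assumptions: the class name, field names, and rendered layout are hypothetical and only loosely follow the plan format shown under Example Usage.

```python
from dataclasses import dataclass

VALID_TIERS = ("lo", "mid", "hi", "extreme")


@dataclass
class MorphologyPlan:
    """Hypothetical container for the skill's outputs."""
    language: str
    tier: str
    unimorph_coverage: str
    segmenter: str
    augmentation: str

    def __post_init__(self):
        # Reject tiers outside the four documented values.
        if self.tier not in VALID_TIERS:
            raise ValueError(f"tier must be one of {VALID_TIERS}, got {self.tier!r}")

    def to_markdown(self) -> str:
        """Render the plan as a bullet list suitable for workspace_state.md."""
        lines = [
            f"Morphology Plan: {self.language}",
            f"- Tier: {self.tier}",
            f"- UniMorph coverage: {self.unimorph_coverage}",
            f"- Segmenter: {self.segmenter}",
            f"- Augmentation: {self.augmentation}",
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    plan = MorphologyPlan(
        language="Turkish",
        tier="hi",
        unimorph_coverage="good",
        segmenter="SIGMORPHON 2023",
        augmentation="paradigm completion 15x",
    )
    print(plan.to_markdown())
```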

Example Usage

Language: Turkish (tur), Joshi Class 4, agglutinative-hi tier

Morphology Plan: Turkish
- Tier: hi (agglutinative)
- UniMorph coverage: good (Turkish well-covered)
- Approach: UniMorph paradigms + paradigm-completion augmentation
- Segmenter: SIGMORPHON 2023 (best agglutinative)
- FST: HFST Turkish available (Apertium)
- Augmentation: paradigm completion 15×, add <morph_aug> tag
- Expected gain: +2 BLEU MT, +3% NER F1