Pipeline Architecture
The Linguistic Agent Skills suite is organized around a 5-phase pipeline that reflects how experienced computational linguists approach low-resource language work.
The Five Phases
Scope → Acquire → Analyze → Evaluate → Release
  ↑        ↑         ↑          ↑          ↑
  └────────┴─────────┴──────────┴──────────┘
            (refinement loops)

Phases are not strictly sequential; they overlap and loop back. The orchestrator provides the skeleton; specialists own the content.
Phase 0 — Scope
Purpose: Identify the target language precisely and set strategic direction before touching any data.
Exit criterion: workspace_state.md has ISO 639-3 code, Joshi class (0–5), typology vector, script policy, and ethics seed.
Skills: linguistic-scope, linguistic-scripts, linguistic-tokenize (initial fertility estimate), linguistic-ethics (early gate)
Key decisions made:
- Language identity (ISO 639-3 + Glottolog) — prevents macrolanguage mistakes
- Resource class (Joshi 0–5) — determines every downstream strategy
- Typological outliers — flags polysynthesis, tone, agglutination before data decisions
- Best transfer source — via URIEL distance, not by intuition
- Script/normalization policy — protects diacritics for tone languages
- Ethics depth — community engagement requirements from vitality status
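The script/normalization decision can be made concrete with a small sketch. It uses Python's standard `unicodedata` module to show that NFC unifies different byte-level spellings of the same tone-marked letter, and what a diacritic-stripping step would destroy. The helper names are hypothetical and are not part of the skills suite.

```python
import unicodedata


def normalize_nfc(text: str) -> str:
    """Canonical composition (NFC): unifies spellings, keeps tone marks."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text: str) -> str:
    """Lossy alternative that a PRESERVE policy is meant to rule out."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Two spellings of "e with dot below and acute accent":
spelling_a = "e\u0323\u0301"   # base letter + dot below + acute
spelling_b = "\u00e9\u0323"    # precomposed e-acute + dot below

# After NFC both spellings are the same code point sequence.
same_after_nfc = normalize_nfc(spelling_a) == normalize_nfc(spelling_b)

# Stripping collapses the tone-marked letter to a bare "e".
collapsed = strip_diacritics(spelling_a)
```

Under these assumptions, normalization makes deduplication and tokenization see one form per letter, while the tone and vowel-quality marks survive.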
Phase 1 — Acquire
Purpose: Gather monolingual and parallel data ethically and reproducibly.
Exit criterion: Reproducible data manifest (sources, licenses, sizes, dedup stats) + tokenizer plan + adapter strategy.
Skills: linguistic-corpus, linguistic-bitext, linguistic-transfer, linguistic-tokenize, linguistic-scripts (normalization), linguistic-ethics (per-dataset gate)
Key decisions made:
- Corpus sources (with register balance and contamination audit)
- Bitext embedding model (LASER3 vs SONAR by language family)
- Alignment threshold (1.03–1.06 by resource class)
- Synthetic bitext strategy (back-translation, pivot, dictionary substitution)
- Vocab extension method (FOCUS/OFA/HyperOfa/full retrain)
- LoRA rank and adapter configuration
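As an illustration of the Phase 1 exit criterion, the following sketch builds one manifest entry recording source, license, sizes, and dedup statistics. The `ManifestEntry` fields and the exact-match dedup rule are assumptions made for this example; the suite's actual manifest schema is not specified here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class ManifestEntry:
    source: str
    license_id: str
    lines_in: int       # lines read from the raw source
    lines_kept: int     # unique, non-empty lines after dedup
    lines_dropped: int  # exact duplicates and blank lines
    sha256: str         # hash of the kept lines, for reproducibility


def build_entry(source: str, license_id: str, lines: list[str]) -> ManifestEntry:
    """Exact-match dedup after whitespace collapsing (hypothetical rule)."""
    seen: set[str] = set()
    kept: list[str] = []
    for line in lines:
        key = " ".join(line.split())  # collapse runs of whitespace
        if key and key not in seen:
            seen.add(key)
            kept.append(key)
    digest = hashlib.sha256("\n".join(kept).encode("utf-8")).hexdigest()
    return ManifestEntry(
        source=source,
        license_id=license_id,
        lines_in=len(lines),
        lines_kept=len(kept),
        lines_dropped=len(lines) - len(kept),
        sha256=digest,
    )


# Toy input; the source name and license are placeholders.
entry = build_entry("toy-corpus", "CC0-1.0", ["a b", "a  b", "", "c"])
manifest_json = json.dumps(asdict(entry), indent=2)
```

Hashing the deduplicated content lets a later run confirm that it reproduced the same data, not just the same line counts.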
Phase 2 — Analyze
Purpose: Run linguistic analysis layers needed for the specific project.
Exit criterion: Required analysis artifacts produced. Not all skills run for every project; the orchestrator routes each project only to the skills it needs.
Skills: linguistic-morph, linguistic-syntax, linguistic-annotate, linguistic-semantics, linguistic-discourse, linguistic-speech
Key decisions made (as needed):
- Morphology tier (lo/mid/hi/extreme) and segmenter selection
- UD treebank strategy (fine-tune vs cross-lingual transfer)
- Agreement-probe construction for grammatical eval
- OMW coverage gaps and MWE catalog needs
- Discourse framework (RST/PDTB/GUM) for long-context eval
- Audio pipeline (ELAN/FLEx → Lhotse) for spoken data
Phase 3 — Evaluate
Purpose: Honestly measure performance with metrics fit for the target language.
Exit criterion: Eval report with benchmark selection, metric selection, contamination flags, and per-stratum breakdown.
Skills: linguistic-eval
Key decisions made:
- Benchmark (FLORES+, NTREX-128, Belebele, AfroBench, IndicXTREME, SEACrowd)
- Metrics (chrF++/COMET/GEMBA-MQM; never BLEU as the primary metric for morphologically rich languages)
- Contamination handling (FLORES in pretrain mix → lower bound only)
- Per-dialect, per-register, per-direction stratification
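The stratified breakdown can be sketched as a simple group-and-average step. The record layout (`dialect`, `register`, `direction`, `score`) and the scores are invented for illustration; the metric behind `score` could be any segment-level or document-level value.

```python
from collections import defaultdict
from statistics import mean


def per_stratum_means(records, keys=("dialect", "register", "direction")):
    """Average scores within each (dialect, register, direction) stratum."""
    buckets = defaultdict(list)
    for rec in records:
        stratum = tuple(rec[k] for k in keys)
        buckets[stratum].append(rec["score"])
    return {stratum: mean(scores) for stratum, scores in buckets.items()}


# Made-up scores, used only to show the shape of the output.
records = [
    {"dialect": "A", "register": "news", "direction": "en-xx", "score": 40.0},
    {"dialect": "A", "register": "news", "direction": "en-xx", "score": 50.0},
    {"dialect": "B", "register": "web", "direction": "xx-en", "score": 30.0},
]
breakdown = per_stratum_means(records)
```

Reporting each stratum separately keeps a strong majority dialect or register from hiding weak performance elsewhere, which a single pooled average would do.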
Phase 4 — Release
Purpose: Final ethics gate, attribution completeness, and model card.
Exit criterion: Release mode decision (Open/Community-gated/Restricted) with complete model card.
Skills: linguistic-ethics (final gate)
Key decisions made:
- License compatibility of combined training data
- Attribution registry completeness
- Community sign-off requirements
- Release mode (Open/Community-gated/Restricted)
- Model card completeness
Workspace State
All phase outputs flow through workspace_state.md — the shared memory between specialist skills. The orchestrator reads this file on every invocation to resume seamlessly across sessions.
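Reading the file back can be sketched as follows, assuming the layout used in this section: `## Section` headings followed by `- key: value` bullets. The function is a hypothetical illustration, not the orchestrator's actual parser.

```python
def parse_workspace_state(text: str) -> dict[str, dict[str, str]]:
    """Parse '## Section' headings and '- key: value' bullets into nested dicts."""
    sections: dict[str, dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = {}
        elif line.startswith("- ") and current is not None and ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line[2:].partition(":")
            sections[current][key.strip()] = value.strip()
    return sections


sample = """\
## Targets
- Language: Yoruba (yor) | Glottolog: yoru1245
- Resource class (Joshi 0-5): 2
## Script Policy
- Normalization: NFC
- Diacritics: PRESERVE
"""
state = parse_workspace_state(sample)
```

With the state in a dictionary, a skill can check a phase's exit criterion by testing that the required keys are present before proceeding.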
Structure:
## Targets
- Language: Yoruba (yor) | Glottolog: yoru1245
- Resource class (Joshi 0-5): 2
- Vitality (EGIDS): 2
## Script Policy
- Normalization: NFC
- Diacritics: PRESERVE
## Tokenizer Plan
- Fertility: 2.43×
- Method: OFA vocab extension
## Transfer Plan
- Source: Igbo (URIEL=0.18)
- LoRA rank: 16, alpha: 32
## Ethics Status
- Seed: COMPLETE (2026-05-22)
- Datasets cleared: Bible-NLP, MADLAD-400, Wikipedia
## Open Questions
- Q1: Register mix target (web% vs news%)

Optional Phase 4 Skills
Three Mindset stubs activate when specific scenarios apply:
- linguistic-codeswitch: when the target community uses code-switching extensively
- linguistic-historical: when bootstrapping a Class 0–1 language via cognate sets from a related Class 3+ language
- linguistic-lexicon: when building a domain lexicon for RAG or MT post-edit