How It Works
The Linguistic Agent Skills suite is organized around a 5-phase pipeline that reflects how experienced computational linguists actually approach low-resource language work. The linguistic-orchestrator skill coordinates routing between specialist skills and tracks workspace state in workspace_state.md.
The 5-Phase Pipeline
Scope → Acquire → Analyze → Evaluate → Release
  ↑        ↑         ↑          ↑          ↑
  └────────┴─────────┴──────────┴──────────┘
              (refinement loops)
Phases overlap and loop back. The orchestrator provides the skeleton; specialists own the content.
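One way to picture the refinement loops is as a small transition rule over the five phases. The sketch below is illustrative only: the `Phase` enum and `allowed_next` helper are hypothetical names, and the rule that any earlier phase may be revisited is an assumption read off the diagram, not a documented orchestrator behavior.

```python
from enum import IntEnum
from typing import List


class Phase(IntEnum):
    """The five pipeline phases, numbered as in this document."""
    SCOPE = 0
    ACQUIRE = 1
    ANALYZE = 2
    EVALUATE = 3
    RELEASE = 4


def allowed_next(current: Phase) -> List[Phase]:
    """Return phases reachable from `current`.

    Assumption: a refinement loop may return to any earlier phase,
    and forward progress moves exactly one phase at a time.
    """
    earlier = [p for p in Phase if p < current]
    forward = [Phase(current + 1)] if current < Phase.RELEASE else []
    return earlier + forward


if __name__ == "__main__":
    for phase in Phase:
        print(phase.name, "->", [p.name for p in allowed_next(phase)])
```

Under this assumption, a weak evaluation result can send work back to Acquire or Analyze without restarting the pipeline.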
Phase 0 — Scope
Goal: Identify the target language precisely and set the strategic direction before touching any data.
| Step | Specialist | What It Does |
|---|---|---|
| Language disambiguation | linguistic-scope | ISO 639-3 + Glottolog resolution; macrolanguage disambiguation |
| Resource classification | linguistic-scope | Joshi 0–5 classification; data availability scan |
| Typological profiling | linguistic-scope | WALS/Grambank/URIEL features; transfer-source recommendation |
| Script policy | linguistic-scripts | Unicode block(s), NFC/NFKC decision, diacritic preservation |
| Ethics seed | linguistic-ethics | FPIC awareness, vitality-driven community engagement depth |
Phase exit: workspace_state.md has ISO code, Joshi class, typology vector, and script policy.
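The NFC/NFKC decision in the script policy step matters because the two forms treat compatibility characters differently. The following is a minimal sketch using Python's standard `unicodedata` module; the `normalize` wrapper is a hypothetical helper for illustration, not part of the linguistic-scripts skill.

```python
import unicodedata


def normalize(text: str, form: str = "NFC") -> str:
    """Apply a Unicode normalization form (e.g. "NFC" or "NFKC")."""
    return unicodedata.normalize(form, text)


# "e" followed by COMBINING ACUTE ACCENT (two code points).
decomposed = "e\u0301"
# LATIN SMALL LIGATURE FI, a compatibility character.
ligature = "\ufb01"

if __name__ == "__main__":
    # NFC composes the base letter and the accent into one code point.
    print(len(decomposed), len(normalize(decomposed, "NFC")))
    # NFC leaves the ligature alone; NFKC folds it into plain "fi".
    print(normalize(ligature, "NFC") == ligature)
    print(normalize(ligature, "NFKC"))
```

NFC is usually the conservative choice when the orthography must be preserved exactly, while NFKC trades some fidelity for fewer distinct code points. Which one fits depends on the language and the downstream tokenizer.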
Phase 1 — Acquire
Goal: Gather monolingual and parallel data ethically and reproducibly.
| Step | Specialist | What It Does |
|---|---|---|
| Monolingual corpora | linguistic-corpus | OLDI/CulturaX/MADLAD-400/Glot500/Wikipedia catalog; LID; MinHash dedup |
| Parallel data | linguistic-bitext | LASER3/SONAR mining; Vecalign alignment; synthetic bitext |
| Tokenizer audit | linguistic-tokenize | Fertility ratio; vocab extension method (FOCUS/OFA/HyperOfa) |
| Adapter strategy | linguistic-transfer | LoRA rank by URIEL distance; MAD-X; catastrophic-forgetting plan |
| Per-dataset ethics | linguistic-ethics | License audit; attribution registry; sacred-text gating |
Phase exit: Reproducible data manifest (sources, licenses, sizes, dedup stats) + tokenizer plan.
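The fertility ratio in the tokenizer audit is commonly computed as the average number of subword tokens per whitespace-delimited word. Here is a rough sketch under that definition; the `fertility` function and the toy tokenizer are hypothetical stand-ins, and a real audit would pass the actual model tokenizer and may need a different word segmentation for scripts that do not use spaces.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str],
              tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    n_words = 0
    n_tokens = 0
    for sentence in sentences:
        words = sentence.split()
        n_words += len(words)
        n_tokens += sum(len(tokenize(word)) for word in words)
    if n_words == 0:
        raise ValueError("cannot compute fertility on an empty corpus")
    return n_tokens / n_words


def toy_tokenizer(word: str) -> List[str]:
    """Illustrative stand-in: split a word into chunks of two characters."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


if __name__ == "__main__":
    print(fertility(["abcd ef"], toy_tokenizer))
```

A fertility close to 1 means most words survive as single tokens; much higher values suggest the vocabulary fits the language poorly and that vocabulary extension may be worth considering.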
Phase 2 — Analyze
Goal: Run linguistic analysis layers needed for evaluation, augmentation, or downstream training.
| Step | Specialist | What It Does |
|---|---|---|
| Morphology | linguistic-morph | UniMorph paradigms; SIGMORPHON segmenters; FST/HFST |
| Syntax | linguistic-syntax | UD treebank ingestion; cross-lingual parser transfer; agreement probes |
| Semantics | linguistic-semantics | WordNet/OMW; FrameNet; PropBank SRL; MWE/PARSEME |
| Discourse | linguistic-discourse | RST/PDTB/GUM; coreference; coherence-aware eval |
| Speech | linguistic-speech | ELAN/Praat/FLEx → Lhotse; G2P/IPA; MMS/Whisper ASR |
| Annotation | linguistic-annotate | IAA metric selection; guideline authoring; adjudication |
Phase exit: Required analysis artifacts produced.
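For the annotation step, Cohen's kappa is one commonly used inter-annotator agreement (IAA) metric for two annotators assigning categorical labels. The sketch below is a plain-Python illustration with a hypothetical `cohen_kappa` helper; it is not necessarily the metric linguistic-annotate would select for a given task.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    ann_1 = ["N", "N", "V", "V"]
    ann_2 = ["N", "V", "V", "V"]
    print(cohen_kappa(ann_1, ann_2))
```

Kappa corrects raw agreement for chance, which matters when one label dominates. With more than two annotators or missing labels, a different metric such as Krippendorff's alpha is typically a better fit.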
Phase 3 — Evaluate
Goal: Honestly measure performance with metrics fit for the language.
The linguistic-eval skill is A-tier because eval results drive release decisions. It enforces:
- chrF++/COMET/GEMBA-MQM over BLEU for morphologically rich languages
- Per-dialect and per-register breakdowns — aggregate scores hide systematic failures
- Contamination-aware reporting — FLORES-200 appears in many pretraining mixes, so scores on it may be inflated; report them as an upper bound on true performance, not a clean estimate
- BLiMP-style grammatical-knowledge probes per language
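To see why aggregate scores can hide systematic failures, it helps to group segment-level scores by dialect or register before averaging. The following sketch assumes each evaluation record is a dict with a `score` field plus grouping fields; the record layout, the `breakdown` helper, and the numbers are all invented for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def breakdown(records: List[dict], key: str) -> Dict[str, float]:
    """Mean score per group, where `key` names the grouping field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["score"])
    return {group: mean(scores) for group, scores in groups.items()}


# Toy data (hypothetical scores, not real results).
records = [
    {"dialect": "dialect_a", "score": 60.0},
    {"dialect": "dialect_a", "score": 62.0},
    {"dialect": "dialect_b", "score": 20.0},
]

if __name__ == "__main__":
    print("aggregate:", mean(r["score"] for r in records))
    print("per dialect:", breakdown(records, "dialect"))
```

In this toy data the aggregate looks mediocre but passable, while the per-dialect view shows that one dialect is failing badly.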
Phase 4 — Release
linguistic-ethics serves as the release gate — final license compatibility check, attribution registry completeness, community sign-off, and model card authoring. Release modes: Open, Community-gated, or Restricted.
Workspace State
Every session writes structured state to workspace_state.md in the current working directory. This file is the shared memory between specialist skills — scope writes language identity, scripts writes normalization policy, corpus writes the data manifest, and so on. The orchestrator reads it on every invocation to resume seamlessly.
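The shared-memory idea can be sketched as a small read/update cycle over workspace_state.md. The real file layout is not specified here, so the example assumes a hypothetical format of `- key: value` bullet lines, and the field names in `PHASE0_REQUIRED` are invented labels for the Phase 0 exit items (ISO code, Joshi class, typology vector, script policy).

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical field names for the Phase 0 exit criteria.
PHASE0_REQUIRED = ["iso_639_3", "joshi_class", "typology_vector", "script_policy"]


def read_state(path: Path) -> Dict[str, str]:
    """Parse '- key: value' lines; a missing file yields an empty state."""
    state: Dict[str, str] = {}
    if not path.exists():
        return state
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            state[key.strip()] = value.strip()
    return state


def write_field(path: Path, key: str, value: str) -> None:
    """Update one field and rewrite the file, keeping the other fields."""
    state = read_state(path)
    state[key] = value
    body = "\n".join(f"- {k}: {v}" for k, v in state.items())
    path.write_text("# Workspace State\n\n" + body + "\n", encoding="utf-8")


def missing_for_phase0(state: Dict[str, str]) -> List[str]:
    """List the Phase 0 exit fields that are absent or empty."""
    return [k for k in PHASE0_REQUIRED if not state.get(k)]


if __name__ == "__main__":
    path = Path("workspace_state.md")
    write_field(path, "iso_639_3", "yor")
    print(missing_for_phase0(read_state(path)))
```

Because every specialist reads the whole state before writing its own field, a later session can pick up where an earlier one stopped, which is the resume behavior described above.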
Natural Language vs Slash Commands
Both trigger identical behavior:
- "help me build an LLM for Yoruba" → orchestrator routes to scope → ethics → corpus → ...
- /linguistic:lifecycle → same entry point
Slash commands are explicit shortcuts. Natural language works equally well for every operation.