How It Works

The Linguistic Agent Skills suite is organized around a 5-phase pipeline that reflects how experienced computational linguists actually approach low-resource language work. The linguistic-orchestrator skill coordinates routing between specialist skills and tracks workspace state in workspace_state.md.

The 5-Phase Pipeline

Scope → Acquire → Analyze → Evaluate → Release
  ↑        ↑         ↑          ↑          ↑
  └────────┴─────────┴──────────┴──────────┘
               (refinement loops)

Phases overlap and loop back. The orchestrator provides the skeleton; specialists own the content.

Phase 0 — Scope

Goal: Identify the target language precisely and set the strategic direction before touching any data.

Step	Specialist	What It Does
Language disambiguation	`linguistic-scope`	ISO 639-3 + Glottolog resolution; macrolanguage disambiguation
Resource classification	`linguistic-scope`	Joshi 0–5 classification; data availability scan
Typological profiling	`linguistic-scope`	WALS/Grambank/URIEL features; transfer-source recommendation
Script policy	`linguistic-scripts`	Unicode block(s), NFC/NFKC decision, diacritic preservation
Ethics seed	`linguistic-ethics`	FPIC awareness, vitality-driven community engagement depth

Phase exit: workspace_state.md has ISO code, Joshi class, typology vector, and script policy.
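
The NFC/NFKC decision in the script policy step can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch, not the `linguistic-scripts` implementation; the helper name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFC") -> str:
    """Apply the normalization form chosen in the script policy (hypothetical helper)."""
    if form not in {"NFC", "NFKC"}:
        raise ValueError(f"unsupported normalization form: {form}")
    return unicodedata.normalize(form, text)


decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI

# NFC composes canonical sequences and leaves compatibility characters alone.
nfc_accent = normalize_text(decomposed, "NFC")    # single code point U+00E9
nfc_ligature = normalize_text(ligature, "NFC")    # unchanged

# NFKC also folds compatibility characters, which can erase distinctions
# that matter for some orthographies.
nfkc_ligature = normalize_text(ligature, "NFKC")  # 'f' followed by 'i'
```

NFC is the more conservative choice when diacritic preservation matters, since it only recomposes canonically equivalent sequences.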

Phase 1 — Acquire

Goal: Gather monolingual and parallel data ethically and reproducibly.

Step	Specialist	What It Does
Monolingual corpora	`linguistic-corpus`	OLDI/CulturaX/MADLAD-400/Glot500/Wikipedia catalog; LID; MinHash dedup
Parallel data	`linguistic-bitext`	LASER3/SONAR mining; Vecalign alignment; synthetic bitext
Tokenizer audit	`linguistic-tokenize`	Fertility ratio; vocab extension method (FOCUS/OFA/HyperOfa)
Adapter strategy	`linguistic-transfer`	LoRA rank by URIEL distance; MAD-X; catastrophic-forgetting plan
Per-dataset ethics	`linguistic-ethics`	License audit; attribution registry; sacred-text gating
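
The MinHash dedup mentioned in the monolingual corpora row can be sketched in pure Python as follows. This illustrates the general technique only and assumes character 5-gram shingles and 64 hash functions; the function names are hypothetical, and a real pipeline would more likely rely on a dedicated library with locality-sensitive hashing.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Character k-gram shingles over whitespace-normalized text."""
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_signature(text: str, num_perm: int = 64, k: int = 5) -> List[int]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates; the threshold itself is a project decision that the source does not specify.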

Phase exit: Reproducible data manifest (sources, licenses, sizes, dedup stats) + tokenizer plan.
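
The fertility ratio from the tokenizer audit step is commonly computed as the average number of subword tokens per word. The sketch below assumes whitespace-delimited words and takes the tokenizer as a plain callable; `fertility` and `toy_tokenize` are hypothetical names, and the toy tokenizer only stands in for a real subword model.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average subword tokens per whitespace-delimited word."""
    total_words = 0
    total_tokens = 0
    for sentence in sentences:
        words = sentence.split()
        total_words += len(words)
        total_tokens += sum(len(tokenize(word)) for word in words)
    if total_words == 0:
        raise ValueError("no words found in input")
    return total_tokens / total_words


def toy_tokenize(word: str) -> List[str]:
    """Stand-in tokenizer: splits a word into chunks of up to 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


ratio = fertility(["abcdef gh"], toy_tokenize)  # 3 tokens over 2 words
```

A high fertility on the target language relative to a well-supported language suggests that the vocabulary fragments words heavily, which is the situation the vocab extension methods in the table are meant to address.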

Phase 2 — Analyze

Goal: Run linguistic analysis layers needed for evaluation, augmentation, or downstream training.

Step	Specialist	What It Does
Morphology	`linguistic-morph`	UniMorph paradigms; SIGMORPHON segmenters; FST/HFST
Syntax	`linguistic-syntax`	UD treebank ingestion; cross-lingual parser transfer; agreement probes
Semantics	`linguistic-semantics`	WordNet/OMW; FrameNet; PropBank SRL; MWE/PARSEME
Discourse	`linguistic-discourse`	RST/PDTB/GUM; coreference; coherence-aware eval
Speech	`linguistic-speech`	ELAN/Praat/FLEx → Lhotse; G2P/IPA; MMS/Whisper ASR
Annotation	`linguistic-annotate`	IAA metric selection; guideline authoring; adjudication

Phase exit: Required analysis artifacts produced.
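
As one example of an IAA metric from the annotation step, Cohen's kappa for two annotators can be sketched as below. The source does not say which metrics `linguistic-annotate` selects, so the choice of kappa here is an assumption, and the labels are toy values.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label distribution.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy part-of-speech labels from two annotators (illustrative only).
annotator_1 = ["N", "N", "V", "V"]
annotator_2 = ["N", "N", "V", "N"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa suits two annotators with categorical labels; other setups, such as more annotators or missing annotations, generally call for a different metric.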

Phase 3 — Evaluate

Goal: Honestly measure performance with metrics fit for the language.

The linguistic-eval skill is A-tier because eval results drive release decisions. It enforces:

chrF++/COMET/GEMBA-MQM over BLEU for morphologically rich languages
Per-dialect and per-register breakdowns, because aggregate scores hide systematic failures
Contamination-aware reporting: FLORES-200 is in many pretrain mixes, so treat scores on it as an upper bound on true performance
BLiMP-style grammatical-knowledge probes per language
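
The per-dialect and per-register breakdown can be sketched as a simple group-by over segment-level scores. The record layout (`dialect`, `register`, `score`) and the numbers are made up for illustration and are not the `linguistic-eval` output format.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping


def breakdown(records: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Mean score per group, where `key` names the grouping field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["score"])
    return {group: mean(scores) for group, scores in groups.items()}


# Toy segment-level scores (illustrative values only).
records = [
    {"dialect": "A", "register": "news", "score": 0.8},
    {"dialect": "A", "register": "chat", "score": 0.6},
    {"dialect": "B", "register": "chat", "score": 0.2},
]

overall = mean(r["score"] for r in records)
by_dialect = breakdown(records, "dialect")
by_register = breakdown(records, "register")
```

In this toy data the overall mean sits well above the score for dialect B, which is the kind of gap an aggregate number conceals.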

Phase 4 — Release

linguistic-ethics serves as the release gate — final license compatibility check, attribution registry completeness, community sign-off, and model card authoring. Release modes: Open, Community-gated, or Restricted.

Workspace State

Every session writes structured state to workspace_state.md in the current working directory. This file is the shared memory between specialist skills — scope writes language identity, scripts writes normalization policy, corpus writes the data manifest, and so on. The orchestrator reads it on every invocation to resume seamlessly.
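
How a skill might read this shared state and check the Phase 0 exit criteria can be sketched as follows. The on-disk layout of workspace_state.md is not documented here, so the `key: value` lines and the field names below are assumptions made purely for illustration.

```python
from typing import Dict, List

# Hypothetical field names. The source only says that after Phase 0 the file
# holds the ISO code, Joshi class, typology vector, and script policy.
PHASE0_REQUIRED = ["iso_639_3", "joshi_class", "typology_vector", "script_policy"]


def parse_state(text: str) -> Dict[str, str]:
    """Parse assumed 'key: value' lines, skipping blanks, comments, and empty values."""
    state: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.strip():
            state[key.strip()] = value.strip()
    return state


def missing_phase0_fields(state: Dict[str, str]) -> List[str]:
    """Return the Phase 0 fields that are still absent from the state."""
    return [field for field in PHASE0_REQUIRED if field not in state]


example = """\
# workspace_state.md (assumed layout)
iso_639_3: yor
script_policy: NFC
"""

missing = missing_phase0_fields(parse_state(example))
```

Under these assumptions, an empty `missing` list would mean the Phase 0 exit condition is met and the orchestrator can move on to Acquire.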

Natural Language vs Slash Commands

Both trigger identical behavior:

"help me build an LLM for Yoruba" → orchestrator routes to scope → ethics → corpus → ...
/linguistic:lifecycle → same entry point

Slash commands are explicit shortcuts. Natural language works equally well for every operation.
