Pipeline Architecture

The Linguistic Agent Skills suite is organized around a 5-phase pipeline that reflects how experienced computational linguists approach low-resource language work.

The Five Phases

Scope → Acquire → Analyze → Evaluate → Release
  ↑        ↑         ↑          ↑          ↑
  └────────┴─────────┴──────────┴──────────┘
               (refinement loops)

Phases are not strictly sequential — they overlap and loop back. The orchestrator provides the skeleton; specialists own the content.
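
To make the loop structure concrete, here is a minimal sketch of the phase order as a small state model. The `Phase` enum and `can_transition` helper are hypothetical names, and the rule itself (one step forward, or back to any earlier phase) is an assumption drawn from the diagram above, not a documented orchestrator policy.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The five pipeline phases, numbered as in this document."""
    SCOPE = 0
    ACQUIRE = 1
    ANALYZE = 2
    EVALUATE = 3
    RELEASE = 4


def can_transition(current: Phase, target: Phase) -> bool:
    """Allow one step forward, or a refinement loop back to any earlier phase."""
    if target == current + 1:
        return True
    return target < current


print(can_transition(Phase.ACQUIRE, Phase.ANALYZE))   # True: forward step
print(can_transition(Phase.EVALUATE, Phase.ACQUIRE))  # True: refinement loop
print(can_transition(Phase.SCOPE, Phase.EVALUATE))    # False: skips ahead
```

A stricter or looser rule may fit a given project better, since the phases are described as overlapping rather than gated.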

Phase 0 — Scope

Purpose: Identify the target language precisely and set strategic direction before touching any data.

Exit criterion: workspace_state.md records the ISO 639-3 code, Joshi class (0–5), typology vector, script policy, and ethics seed.

Skills: linguistic-scope, linguistic-scripts, linguistic-tokenize (initial fertility estimate), linguistic-ethics (early gate)

Key decisions made:

Language identity (ISO 639-3 + Glottolog) — prevents macrolanguage mistakes
Resource class (Joshi 0–5) — determines every downstream strategy
Typological outliers — flags polysynthesis, tone, agglutination before data decisions
Best transfer source — via URIEL distance, not by intuition
Script/normalization policy — protects diacritics for tone languages
Ethics depth — community engagement requirements from vitality status
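
The Phase 0 exit criterion can be read as a completeness check over the scoped fields. The sketch below assumes the state has already been loaded into a dictionary; the field names and the `missing_scope_fields` helper are hypothetical, since the criterion only lists what must be recorded, not how it is keyed.

```python
# Hypothetical field names for the five items in the Phase 0 exit criterion.
REQUIRED_SCOPE_FIELDS = (
    "iso_639_3",
    "joshi_class",
    "typology_vector",
    "script_policy",
    "ethics_seed",
)


def missing_scope_fields(state: dict) -> list:
    """Return the Phase 0 fields that are absent, empty, or out of range."""
    missing = []
    for name in REQUIRED_SCOPE_FIELDS:
        value = state.get(name)
        if value is None or value == "" or value == [] or value == {}:
            missing.append(name)
        elif name == "joshi_class" and not (isinstance(value, int) and 0 <= value <= 5):
            # Joshi classes run from 0 to 5; class 0 is a valid value.
            missing.append(name)
    return missing


state = {
    "iso_639_3": "yor",
    "joshi_class": 2,
    "script_policy": {"normalization": "NFC", "diacritics": "PRESERVE"},
    "ethics_seed": "COMPLETE",
}
print(missing_scope_fields(state))  # ['typology_vector']
```

An empty result would mean the scope is complete enough to move on to Acquire.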

Phase 1 — Acquire

Purpose: Gather monolingual and parallel data ethically and reproducibly.

Exit criterion: Reproducible data manifest (sources, licenses, sizes, dedup stats) + tokenizer plan + adapter strategy.

Skills: linguistic-corpus, linguistic-bitext, linguistic-transfer, linguistic-tokenize, linguistic-scripts (normalization), linguistic-ethics (per-dataset gate)

Key decisions made:

Corpus sources (with register balance and contamination audit)
Bitext embedding model (LASER3 vs SONAR by language family)
Alignment threshold (1.03–1.06 by resource class)
Synthetic bitext strategy (back-translation, pivot, dictionary substitution)
Vocab extension method (FOCUS/OFA/HyperOfa/full retrain)
LoRA rank and adapter configuration
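
One way to keep the data manifest reproducible is to serialize it deterministically. The sketch below is an illustration under assumptions: `ManifestEntry`, its field names, and `manifest_json` are hypothetical, and the two entries are placeholders rather than real corpora or real statistics.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ManifestEntry:
    """One data source, with the items the exit criterion asks for."""
    source: str
    license: str
    size_sentences: int
    deduped_sentences: int

    @property
    def dedup_rate(self) -> float:
        """Fraction of sentences removed by deduplication."""
        if self.size_sentences == 0:
            return 0.0
        return 1 - self.deduped_sentences / self.size_sentences


def manifest_json(entries: list) -> str:
    """Serialize entries sorted by source so reruns produce identical output."""
    rows = [asdict(e) for e in sorted(entries, key=lambda e: e.source)]
    return json.dumps(rows, indent=2, sort_keys=True, ensure_ascii=False)


# Placeholder entries for illustration only.
entries = [
    ManifestEntry("example-corpus-b", "CC-BY-4.0", 1000, 900),
    ManifestEntry("example-corpus-a", "CC0-1.0", 500, 500),
]
print(manifest_json(entries))
```

Sorting by source and fixing the key order means the manifest diffs cleanly between runs, which makes changes in sources or dedup statistics easy to spot.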

Phase 2 — Analyze

Purpose: Run linguistic analysis layers needed for the specific project.

Exit criterion: Required analysis artifacts produced. Not every skill runs for every project; the orchestrator routes based on need.

Skills: linguistic-morph, linguistic-syntax, linguistic-annotate, linguistic-semantics, linguistic-discourse, linguistic-speech

Key decisions made (as needed):

Morphology tier (lo/mid/hi/extreme) and segmenter selection
UD treebank strategy (fine-tune vs cross-lingual transfer)
Agreement-probe construction for grammatical eval
OMW coverage gaps and MWE catalog needs
Discourse framework (RST/PDTB/GUM) for long-context eval
Audio pipeline (ELAN/FLEx → Lhotse) for spoken data
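
Need-based routing can be pictured as a lookup from project needs to Phase 2 skills. In the sketch below the skill names come from the list above, but the need labels, their pairing with skills, and the `route_analysis` function are assumptions made for illustration; the actual orchestrator logic is not described here.

```python
# Hypothetical need labels mapped to the Phase 2 skills listed above.
SKILLS_BY_NEED = {
    "morphology": "linguistic-morph",
    "treebank": "linguistic-syntax",
    "annotation": "linguistic-annotate",
    "lexical_semantics": "linguistic-semantics",
    "long_context_eval": "linguistic-discourse",
    "spoken_data": "linguistic-speech",
}


def route_analysis(needs: set) -> list:
    """Return only the skills a project needs, in a stable order."""
    unknown = needs - set(SKILLS_BY_NEED)
    if unknown:
        raise ValueError(f"unknown needs: {sorted(unknown)}")
    return [skill for need, skill in SKILLS_BY_NEED.items() if need in needs]


print(route_analysis({"spoken_data", "morphology"}))
# ['linguistic-morph', 'linguistic-speech']
```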

Phase 3 — Evaluate

Purpose: Honestly measure performance with metrics fit for the target language.

Exit criterion: Eval report with benchmark selection, metric selection, contamination flags, and per-stratum breakdown.

Skills: linguistic-eval

Key decisions made:

Benchmark (FLORES+, NTREX-128, Belebele, AfroBench, IndicXTREME, SEACrowd)
Metrics (chrF++/COMET/GEMBA-MQM; never BLEU as the primary metric for morphologically rich languages)
Contamination handling (FLORES in the pretraining mix → report scores as an upper bound only, since contamination inflates them)
Per-dialect, per-register, per-direction stratification
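
A per-stratum breakdown amounts to grouping segment-level scores by dialect, register, and direction before averaging. The sketch below assumes each scored segment is a dictionary with those keys plus a `score`; the row layout, the `per_stratum_means` helper, and the numbers are placeholders, not results.

```python
from collections import defaultdict
from statistics import mean


def per_stratum_means(rows, keys=("dialect", "register", "direction")):
    """Average segment scores within each (dialect, register, direction) stratum."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[tuple(row[k] for k in keys)].append(row["score"])
    return {stratum: mean(scores) for stratum, scores in buckets.items()}


# Placeholder rows for illustration only.
rows = [
    {"dialect": "standard", "register": "news", "direction": "eng-yor", "score": 50.0},
    {"dialect": "standard", "register": "news", "direction": "eng-yor", "score": 40.0},
    {"dialect": "standard", "register": "web", "direction": "yor-eng", "score": 30.0},
]
for stratum, avg in per_stratum_means(rows).items():
    print(stratum, avg)
```

Reporting each stratum separately keeps a strong result in one direction or register from hiding a weak one elsewhere.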

Phase 4 — Release

Purpose: Pass the final ethics gate, confirm attribution completeness, and finalize the model card.

Exit criterion: Release mode decision (Open/Community-gated/Restricted) with complete model card.

Skills: linguistic-ethics (final gate)

Key decisions made:

License compatibility of combined training data
Attribution registry completeness
Community sign-off requirements
Release mode (Open/Community-gated/Restricted)
Model card completeness

Workspace State

All phase outputs flow through workspace_state.md — the shared memory between specialist skills. The orchestrator reads this file on every invocation to resume seamlessly across sessions.

Structure:

## Targets
- Language: Yoruba (yor) | Glottolog: yoru1245
- Resource class (Joshi 0-5): 2
- Vitality (EGIDS): 2

## Script Policy
- Normalization: NFC
- Diacritics: PRESERVE

## Tokenizer Plan
- Fertility: 2.43×
- Method: OFA vocab extension

## Transfer Plan
- Source: Igbo (URIEL=0.18)
- LoRA rank: 16, alpha: 32

## Ethics Status
- Seed: COMPLETE (2026-05-22)
- Datasets cleared: Bible-NLP, MADLAD-400, Wikipedia

## Open Questions
- Q1: Register mix target (web% vs news%)
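
A file in this shape is straightforward to read back into nested dictionaries. The parser below is a minimal sketch that assumes only the layout shown above (`## ` section headers followed by `- key: value` lines); `parse_workspace_state` is a hypothetical helper, not the orchestrator's actual reader.

```python
def parse_workspace_state(text: str) -> dict:
    """Parse '## Section' headers and '- key: value' lines into nested dicts."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = sections.setdefault(line[3:].strip(), {})
        elif line.startswith("- ") and current is not None and ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line[2:].partition(":")
            current[key.strip()] = value.strip()
    return sections


sample = """\
## Targets
- Language: Yoruba (yor) | Glottolog: yoru1245
- Resource class (Joshi 0-5): 2

## Script Policy
- Normalization: NFC
- Diacritics: PRESERVE
"""
state = parse_workspace_state(sample)
print(state["Targets"]["Resource class (Joshi 0-5)"])  # 2
print(state["Script Policy"]["Diacritics"])            # PRESERVE
```

Values stay as strings here; compound lines such as the language entry would need a second pass to split out the Glottolog code.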

Optional Phase 4 Skills

Three Mindset stubs activate when specific scenarios apply:

linguistic-codeswitch — when the target community uses code-switching extensively
linguistic-historical — when bootstrapping a Class 0–1 language via cognate sets from a related Class 3+ language
linguistic-lexicon — when building a domain lexicon for RAG or MT post-edit
