Pipeline Workflow Guide
This guide walks through a complete linguistic pipeline run from target language selection to model release. The example uses Khmer (khm), a Joshi Class 2 language with an abugida script, to illustrate decisions at each phase.
Before You Start
Ensure the suite is installed and workspace_state.md does not exist in your project directory (or delete it to start fresh). Then open a Claude Code session.
Phase 0 — Scope
Goal: Identify the language precisely and set strategic direction.
Step 1: Enter the pipeline
help me build an LLM for Khmer
The orchestrator creates workspace_state.md and routes to linguistic-scope.
Step 2: Language disambiguation
Khmer (ISO: khm) is unambiguous — no macrolanguage disambiguation needed. Scope resolves:
- ISO 639-3: khm
- Glottolog: khmr1253
- Family: Austroasiatic > Khmer
- Script: Khmer abugida (U+1780–U+17FF)
- Resource class: Joshi 2 ("Hopefuls")
- Vitality: EGIDS 1 (National language of Cambodia — standard FPIC)
Step 3: Typological profile
Key outliers for Khmer:
- Abugida script — character_coverage must be 0.99999+ in SentencePiece
- Analytic morphology — fertility tier "lo-mid"; standard BPE adequate
- SVO word order: agreement probe needed for word-order violations
- No grammatical gender or tonal contrasts
Transfer source: Vietnamese (vie, URIEL=0.31) — same Austroasiatic family, Class 3+ data available.
Step 4: Script policy
linguistic-scripts sets:
- Normalization: NFC (NEVER NFKC for Khmer — destroys conjuncts)
- Diacritics: PRESERVE (not tonal, but vowel-diacritics are mandatory)
- No romanization needed for training pipeline
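The normalization rule above can be sketched with Python's standard library. This is a minimal illustration, not the linguistic-scripts implementation: the helper names `normalize_khmer` and `khmer_ratio` are hypothetical, and the block range is the one listed in the scope step.

```python
import unicodedata

# Khmer block boundaries, as listed in the scope step (U+1780–U+17FF).
KHMER_START, KHMER_END = 0x1780, 0x17FF


def normalize_khmer(text: str) -> str:
    """Apply canonical composition (NFC) only, never the compatibility form NFKC."""
    return unicodedata.normalize("NFC", text)


def khmer_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall inside the Khmer block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(KHMER_START <= ord(c) <= KHMER_END for c in chars)
    return in_block / len(chars)


# "Khmer" written in Khmer script: KHA + COENG + MO + vowel sign AE + RO.
sample = "\u1781\u17d2\u1798\u17c2\u179a"
clean = normalize_khmer(sample)
print(khmer_ratio(clean))
```

A ratio well below 1.0 on text that is supposed to be Khmer is a cheap signal that a source mixes scripts or was romanized upstream.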
Step 5: Ethics seed
EGIDS 1 (National language) — standard FPIC. No sacred-text flags. Community engagement at standard depth.
Phase 0 complete. workspace_state.md has ISO code, Joshi class, typology, script policy, ethics seed.
Phase 1 — Acquire
Goal: Gather monolingual and parallel data ethically and reproducibly.
Corpus identification
linguistic-corpus catalogs:
| Source | Size | License | Register | Notes |
|---|---|---|---|---|
| MADLAD-400 khm | 800MB | CC-BY-4.0 | web 75%, wiki 20%, news 5% | Good quality |
| Wikipedia (khm) | 40MB | CC-BY-SA-3.0 | encyclopedic 100% | SA propagation note |
| Bible-NLP (khm) | 3MB | CC-BY-4.0 | liturgical 100% | Flag: archaic register |
Ethics check: Bible-NLP passes (CC-BY; EGIDS 1; no community restrictions). Wikipedia SA propagation noted.
Post-dedup (MinHash threshold=0.9, shingle=3 for abugida): 180M tokens, 15% dedup rate, register: web 72% / wiki 22% / liturgical 6% — within acceptable bounds.
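To make the dedup parameters concrete, here is a small pure-Python MinHash sketch. It assumes that "shingle=3" means character 3-grams (a natural choice for a script without whitespace word boundaries) and that "threshold=0.9" is an estimated-Jaccard cutoff. The function names and the 64-hash signature size are illustrative choices, and a production run would use an LSH index rather than this quadratic comparison.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Character k-grams of the text (the whole text if it is shorter than k)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash(text: str, num_perm: int = 64, k: int = 3) -> list[int]:
    """One minimum per seeded hash function over the document's shingle set."""
    grams = [g.encode("utf-8") for g in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Share of signature positions that agree; estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs: list[str], threshold: float = 0.9) -> list[str]:
    """Greedy near-duplicate removal: keep a doc only if no kept doc is too similar."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept
```

The dedup rate is then `1 - len(dedup(docs)) / len(docs)`, computed per source so the register mix can be re-checked afterwards.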
Bitext mining
linguistic-bitext mines English-Khmer parallel data:
- Embedding: LASER3 (adequate for Austroasiatic)
- Aligner: Vecalign
- Margin threshold: 1.04 (Class 2 target) + 50-pair spot-check
- Result: ~62K real pairs (OPUS + FLORES training split)
- Synthetic: back-translation T=0.8 → 80K additional pairs
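A threshold near 1.04 reads like a ratio-margin cutoff of the kind commonly used with LASER embeddings; that interpretation is an assumption here. Under it, a candidate pair's cosine similarity is divided by the average similarity of each side to its k nearest neighbours. The sketch below uses plain lists in place of real sentence embeddings, and the function names and `k=4` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ratio_margin_scores(src_vecs, tgt_vecs, k=4):
    """score(x, y) = cos(x, y) / mean of both sides' average top-k similarity."""
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]

    def mean_top_k(values):
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top) if top else 0.0

    src_nbr = [mean_top_k(row) for row in sims]
    tgt_nbr = [mean_top_k([row[j] for row in sims]) for j in range(len(tgt_vecs))]

    scores = []
    for i, row in enumerate(sims):
        scored_row = []
        for j, sim in enumerate(row):
            denom = (src_nbr[i] + tgt_nbr[j]) / 2
            scored_row.append(sim / denom if denom > 0 else 0.0)
        scores.append(scored_row)
    return scores


def mine_pairs(src_vecs, tgt_vecs, threshold=1.04, k=4):
    """Keep each source sentence's best target if its margin clears the threshold."""
    pairs = []
    for i, row in enumerate(ratio_margin_scores(src_vecs, tgt_vecs, k)):
        if not row:
            continue
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs
```

Pairs that clear the threshold still go through the 50-pair manual spot-check; the margin score only ranks candidates.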
Tokenizer audit
linguistic-tokenize on khm with tiktoken-cl100k_base:
- Fertility: 4.1× — EXTEND MANDATORY (abugida; ideographic-like density)
- Method: OFA vocab extension (parallel data available)
- SentencePiece: character_coverage=0.99999, byte_fallback=true, vocab_size=48K
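The audit numbers can be sketched as follows. Fertility is taken here to mean tokens per word, which is an assumption; because Khmer is written without spaces between words, the word counter is a caller-supplied segmenter (hypothetical in this sketch). The training arguments are standard SentencePiece trainer options. `model_type="bpe"` and `normalization_rule_name="identity"` are choices made for this illustration: the first follows the "standard BPE adequate" note, and the second avoids SentencePiece's default NFKC-based normalization so that the NFC-only script policy holds.

```python
def fertility(texts, tokenize, count_words):
    """Average number of tokens per word across a sample of texts."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(count_words(t) for t in texts)
    return total_tokens / total_words if total_words else float("inf")


def spm_train_args(corpus_path, model_prefix):
    """Keyword arguments for SentencePieceTrainer.train, mirroring the audit."""
    return {
        "input": corpus_path,
        "model_prefix": model_prefix,
        "model_type": "bpe",                    # assumption, see lead-in
        "vocab_size": 48000,
        "character_coverage": 0.99999,
        "byte_fallback": True,
        "normalization_rule_name": "identity",  # text is pre-normalized to NFC
    }


def train_tokenizer(corpus_path, model_prefix):
    """Train the tokenizer; needs the third-party sentencepiece package."""
    import sentencepiece as spm  # imported lazily so the helpers above run without it

    spm.SentencePieceTrainer.train(**spm_train_args(corpus_path, model_prefix))
```

Re-running `fertility` with the new tokenizer on the same sample shows whether the extension actually brought the ratio down from 4.1×.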
Transfer plan
linguistic-transfer:
- URIEL distance to Vietnamese (best source): 0.31 → LoRA r=16, alpha=32, all-linear modules
- Forgetting mitigation: 15% Vietnamese in training mix
- Tool: Unsloth (single-GPU QLoRA)
- Base: mBART-large-50 (good Khmer seed)
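One way to read the forgetting-mitigation line is that Vietnamese examples make up 15% of the final training mix by example count; counting by tokens would be an equally plausible reading. The sketch below builds such a mix. `build_training_mix` is a hypothetical helper, and `LORA_KWARGS` simply records the plan's hyperparameters using the field names that PEFT's `LoraConfig` accepts.

```python
import random

# Hyperparameters from the transfer plan, keyed by PEFT LoraConfig field names.
LORA_KWARGS = {"r": 16, "lora_alpha": 32, "target_modules": "all-linear"}


def build_training_mix(target_examples, replay_examples,
                       replay_fraction=0.15, seed=0):
    """Return a shuffled list of (lang, example) with the given replay share."""
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_target + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(target_examples) * replay_fraction / (1 - replay_fraction))
    if n_replay and not replay_examples:
        raise ValueError("replay examples are required for a non-zero share")
    if n_replay <= len(replay_examples):
        replay = rng.sample(replay_examples, n_replay)     # without replacement
    else:
        replay = rng.choices(replay_examples, k=n_replay)  # oversample if short
    mix = [("khm", ex) for ex in target_examples]
    mix += [("vie", ex) for ex in replay]
    rng.shuffle(mix)
    return mix
```

Fixing the seed keeps the mix reproducible across sessions, which matters when the plan is recorded in workspace_state.md and re-run later.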
Phase 1 complete. Data manifest, tokenizer plan, transfer plan all in workspace_state.md.
Phase 2 — Analyze
Goal: Run linguistic analysis layers needed for the project.
For Khmer (analytic morphology, abugida, Class 2):
- morph: Tier "lo-mid" — no morpheme segmentation needed; BPE handles it. Skip deep morph analysis.
- syntax: No Khmer UD treebank; cross-lingual transfer from Vietnamese (URIEL=0.31). Trankit. Agreement probes: word-order (100 pairs), numeral-classifier (80 pairs).
- semantics: OMW coverage ~4K synsets. MWE catalog needed for idiom-heavy text. COMET-22 has Khmer coverage — use for MT eval.
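The agreement probes can be scored as minimal pairs: the model passes a pair when it scores the grammatical sentence above its ungrammatical twin. The sketch below assumes that setup; `probe_accuracy` is a hypothetical helper and `score` stands in for whatever sentence-level score the model exposes (for example, a log-probability).

```python
def probe_accuracy(pairs, score):
    """Per-probe accuracy over (probe_name, grammatical, ungrammatical) triples."""
    hits, totals = {}, {}
    for name, good, bad in pairs:
        totals[name] = totals.get(name, 0) + 1
        # A pair counts as a hit when the grammatical variant scores strictly higher.
        hits[name] = hits.get(name, 0) + (score(good) > score(bad))
    return {name: hits[name] / totals[name] for name in totals}


# Placeholder strings and a toy scorer, only to show the call shape.
toy_pairs = [
    ("word-order", "ab", "abc"),
    ("word-order", "abcd", "a"),
    ("numeral-classifier", "x", "xy"),
]
print(probe_accuracy(toy_pairs, score=lambda s: -len(s)))
```

Reporting accuracy per probe name keeps the 100 word-order pairs and the 80 numeral-classifier pairs from masking each other in a single average.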
Phase 2 complete (targeted — not all analysis skills needed for every project).
Phase 3 — Evaluate
Goal: Honest quality measurement.
linguistic-eval selects:
- Benchmark: Belebele (122 languages, includes Khmer) for reading comprehension; FLORES+ MT (flag contamination risk)
- MT metric: chrF++ + COMET-22 (Khmer coverage confirmed). BLEU supplementary only.
- Probes: word-order (100 pairs), numeral-classifier (80 pairs)
- Stratification: per-register (Bible vs web vs news — 3 slices)
- Contamination: FLORES contamination confirmed → report as lower bound; use Belebele as primary
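A contamination flag like the one above typically comes from an overlap check between benchmark items and the training corpus. The following is a minimal sketch of one such check, using character n-grams because Khmer has no whitespace word boundaries. The function names, the n-gram length, and the overlap cutoff are illustrative assumptions, not the linguistic-eval defaults.

```python
def char_ngrams(text, n):
    """Character n-grams of the text with all whitespace removed."""
    text = "".join(text.split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def contamination_rate(benchmark, corpus, n=13, min_overlap=0.5):
    """Share of benchmark items whose n-grams mostly appear in the corpus."""
    corpus_grams = set()
    for doc in corpus:
        corpus_grams |= char_ngrams(doc, n)

    flagged = 0
    for item in benchmark:
        grams = char_ngrams(item, n)
        # Items shorter than n characters yield no n-grams and are never flagged.
        if grams and len(grams & corpus_grams) / len(grams) >= min_overlap:
            flagged += 1
    return flagged / len(benchmark) if benchmark else 0.0
```

Any non-trivial rate on a benchmark is a reason to treat scores on it as compromised and to lean on a benchmark that shows no overlap.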
Phase 3 complete. Eval report in workspace_state.md.
Phase 4 — Release
Goal: Final ethics gate and model card.
linguistic-ethics final check:
- All sources CC-BY-4.0 or CC-BY-SA-3.0 (SA propagation noted in model card)
- Attribution registry complete
- Community sign-off: standard FPIC (national language, no restricted-access data)
- Release mode: OPEN
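The license part of this gate is mechanical and can be sketched in a few lines. The source list mirrors the corpus table from Phase 1; the `release_gate` helper and the "HOLD" mode name are hypothetical, and the community sign-off is a human step this sketch does not cover.

```python
# Sources and licenses as cataloged in the Acquire phase.
SOURCES = [
    {"name": "MADLAD-400 khm", "license": "CC-BY-4.0"},
    {"name": "Wikipedia (khm)", "license": "CC-BY-SA-3.0"},
    {"name": "Bible-NLP (khm)", "license": "CC-BY-4.0"},
]
ALLOWED_LICENSES = {"CC-BY-4.0", "CC-BY-SA-3.0"}


def release_gate(sources, allowed=ALLOWED_LICENSES):
    """Check licenses and list the share-alike sources the model card must note."""
    blocked = [s["name"] for s in sources if s["license"] not in allowed]
    share_alike = [s["name"] for s in sources if "-SA-" in s["license"]]
    return {
        "mode": "OPEN" if not blocked else "HOLD",  # "HOLD" is a made-up label
        "blocked": blocked,
        "share_alike_notes": share_alike,
    }


print(release_gate(SOURCES))
```

The `share_alike_notes` list is what feeds the SA propagation note in the model card's licenses section.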
Model card sections written: datasets, licenses, lineage, ethics statement, intended uses, limitations, contact.
Summary
| Phase | Duration (estimated) | Key Outputs |
|---|---|---|
| Scope | 15–30 min | Language profile, script policy, ethics seed |
| Acquire | 1–3 days | 180M-token corpus (post-dedup), 142K bitext pairs, tokenizer plan |
| Analyze | 2–4 hours | UD cross-lingual plan, 180 agreement probes |
| Evaluate | 1–2 hours | Eval report, contamination flags |
| Release | 30–60 min | Model card, release decision |
The pipeline produces fully documented, reproducible artifacts at each phase. workspace_state.md carries the state forward across sessions.