Pipeline Workflow Guide

This guide walks through a complete linguistic pipeline run from target language selection to model release. The example uses Khmer (khm), a Joshi Class 2 language with an abugida script, to illustrate decisions at each phase.

Before You Start

Ensure the suite is installed and that no workspace_state.md exists in your project directory; if one is left over from an earlier run, delete it to start fresh. Then open a Claude Code session.

Phase 0 — Scope

Goal: Identify the language precisely and set strategic direction.

Step 1: Enter the pipeline

help me build an LLM for Khmer

The orchestrator creates workspace_state.md and routes to linguistic-scope.

Step 2: Language disambiguation

Khmer (ISO: khm) is unambiguous — no macrolanguage disambiguation needed. Scope resolves:

ISO 639-3: khm
Glottolog: khmr1253
Family: Austroasiatic > Khmer
Script: Khmer abugida (U+1780–U+17FF)
Resource class: Joshi 2 ("Hopefuls")
Vitality: EGIDS 1 (National language of Cambodia — standard FPIC)

Step 3: Typological profile

Key outliers for Khmer:

Abugida script: character_coverage must be 0.99999+ in SentencePiece
Analytic morphology: fertility tier "lo-mid"; standard BPE adequate
SVO word order: agreement probe needed for word-order violations
No grammatical gender or tonal contrasts

Transfer source: Vietnamese (vie, URIEL=0.31) — same Austroasiatic family, Class 3+ data available.

Step 4: Script policy

linguistic-scripts sets:

Normalization: NFC (NEVER NFKC for Khmer — destroys conjuncts)
Diacritics: PRESERVE (not tonal, but vowel-diacritics are mandatory)
No romanization needed for training pipeline
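
The normalization rule above can be sketched with Python's standard library. This is a minimal illustration, not the linguistic-scripts implementation; the helper names `normalize_khmer` and `khmer_ratio` are hypothetical, and the code-point range is the one listed in the scope profile.

```python
import unicodedata

# U+1780–U+17FF, the Khmer block named in the scope profile.
KHMER_BLOCK = range(0x1780, 0x1800)

def normalize_khmer(text: str) -> str:
    """Apply NFC only; NFKC is deliberately not used, per the script policy."""
    return unicodedata.normalize("NFC", text)

def khmer_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall inside the Khmer block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) in KHMER_BLOCK for c in chars) / len(chars)

# "Khmer language" written in Khmer script, given as escapes to avoid font issues.
sample = "\u1797\u17b6\u179f\u17b6\u1781\u17d2\u1798\u17c2\u179a"
print(khmer_ratio(normalize_khmer(sample)))
```

A ratio check like this is one simple way to spot lines that are mostly Latin or mixed-script before they reach the training corpus.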

Step 5: Ethics seed

EGIDS 1 (National language) — standard FPIC. No sacred-text flags. Community engagement at standard depth.

Phase 0 complete. workspace_state.md has ISO code, Joshi class, typology, script policy, ethics seed.

Phase 1 — Acquire

Goal: Gather monolingual and parallel data ethically and reproducibly.

Corpus identification

linguistic-corpus catalogs:

Source	Size	License	Register	Notes
MADLAD-400 khm	800MB	CC-BY-4.0	web 75%, wiki 20%, news 5%	Good quality
Wikipedia (khm)	40MB	CC-BY-SA-3.0	encyclopedic 100%	SA propagation note
Bible-NLP (khm)	3MB	CC-BY-4.0	liturgical 100%	Flag: archaic register

Ethics check: Bible-NLP passes (CC-BY; EGIDS 1; no community restrictions). Wikipedia SA propagation noted.

After deduplication (MinHash, threshold=0.9, shingle=3 for abugida), the corpus holds 180M tokens at a 15% dedup rate. Register mix: web 72% / wiki 22% / liturgical 6%, which is within acceptable bounds.
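
For readers unfamiliar with MinHash, the deduplication step can be approximated in pure Python as follows. This is a sketch under assumptions: it treats "shingle=3" as character 3-grams (convenient for Khmer, which does not put spaces between words), and `num_perm=128` and all function names are illustrative rather than the suite's actual settings.

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Character k-grams over whitespace-stripped text (assumed shingle unit)."""
    compact = "".join(text.split())
    return {compact[i:i + k] for i in range(max(len(compact) - k + 1, 1))}

def minhash_signature(text: str, num_perm: int = 128, k: int = 3) -> list:
    """Keep the minimum hash per seeded hash function; num_perm is illustrative."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + g.encode("utf-8")).digest()[:8], "big")
            for g in grams
        ))
    return signature

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Share of matching signature slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag a pair as duplicate at the 0.9 threshold used in this walkthrough."""
    return estimated_jaccard(minhash_signature(a), minhash_signature(b)) >= threshold
```

At corpus scale the pairwise comparison would normally be replaced by locality-sensitive hashing over the signatures; the sketch only shows the similarity estimate itself.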

Bitext mining

linguistic-bitext mines English-Khmer parallel data:

Embedding: LASER3 (adequate for Austroasiatic)
Aligner: Vecalign
Margin threshold: 1.04 (Class 2 target) + 50-pair spot-check
Result: ~62K real pairs (OPUS + FLORES training split)
Synthetic: back-translation T=0.8 → 80K additional pairs
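
The margin threshold can be illustrated with a small ratio-margin scorer. This sketch assumes the 1.04 value refers to ratio-margin scoring, as commonly used when mining with LASER embeddings; the guide does not state the formula. The vectors are toy data, and `ratio_margin` and `mine_pairs` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def _top_k_mean(values, k):
    """Mean similarity to the k nearest neighbours."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)

def ratio_margin(src, tgt, k=4):
    """cos(x, y) divided by the average neighbourhood similarity of x and y.

    Assumes neighbourhood averages are positive; real data would need a guard.
    """
    sims = [[cosine(s, t) for t in tgt] for s in src]
    src_avg = [_top_k_mean(row, k) for row in sims]
    tgt_avg = [_top_k_mean([row[j] for row in sims], k) for j in range(len(tgt))]
    return [
        [sims[i][j] / ((src_avg[i] + tgt_avg[j]) / 2) for j in range(len(tgt))]
        for i in range(len(src))
    ]

def mine_pairs(src, tgt, threshold=1.04, k=4):
    """Keep each source sentence's best target if its margin clears the threshold."""
    scores = ratio_margin(src, tgt, k)
    pairs = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            pairs.append((i, j))
    return pairs

# Toy 2-dimensional "embeddings"; real LASER3 vectors are much larger.
src_vecs = [[1.0, 0.0], [0.0, 1.0]]
tgt_vecs = [[0.9, 0.1], [0.1, 0.9]]
print(mine_pairs(src_vecs, tgt_vecs, k=2))
```

Pairs that clear the threshold would still go through the 50-pair manual spot-check before being counted as real pairs.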

Tokenizer audit

linguistic-tokenize on khm with tiktoken-cl100k_base:

Fertility: 4.1× — EXTEND MANDATORY (abugida; ideographic-like density)
Method: OFA vocab extension (parallel data available)
SentencePiece: character_coverage=0.99999, byte_fallback=true, vocab_size=48K
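
Fertility is assumed here to mean the average number of subword tokens per word. A minimal sketch of that calculation follows; because Khmer is written without spaces between words, the input is assumed to be pre-segmented, and the byte-level tokenizer is only a stand-in so the example runs without extra dependencies.

```python
from typing import Callable, Iterable, List

def fertility(words: Iterable[str], tokenize: Callable[[str], List]) -> float:
    """Average number of tokens the tokenizer emits per (pre-segmented) word."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word to compute fertility")
    return sum(len(tokenize(w)) for w in words) / len(words)

def utf8_byte_tokenizer(word: str) -> List[int]:
    """Stand-in tokenizer: one token per UTF-8 byte (worst case for Khmer)."""
    return list(word.encode("utf-8"))

# Two hypothetical pre-segmented Khmer words, written as escapes.
words = ["\u1797\u17b6\u179f\u17b6", "\u1781\u17d2\u1798\u17c2\u179a"]
print(fertility(words, utf8_byte_tokenizer))
```

With tiktoken installed, `tiktoken.get_encoding("cl100k_base").encode` could be passed as `tokenize` to reproduce the kind of audit described above.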

Transfer plan

linguistic-transfer:

URIEL distance to Vietnamese (best source): 0.31 → LoRA r=16, alpha=32, all-linear modules
Forgetting mitigation: 15% Vietnamese in training mix
Tool: Unsloth (single-GPU QLoRA)
Base: mBART-large-50 (good Khmer seed)
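
The forgetting-mitigation ratio implies a concrete amount of Vietnamese data. The arithmetic below assumes "15% Vietnamese in training mix" means 15% of total training tokens and uses the 180M post-dedup Khmer figure from the corpus step; `replay_tokens` is a hypothetical helper.

```python
def replay_tokens(target_tokens: int, replay_fraction: float) -> int:
    """Source-language tokens needed so they form replay_fraction of the full mix.

    Solves v / (target_tokens + v) = replay_fraction for v.
    """
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    return round(target_tokens * replay_fraction / (1 - replay_fraction))

khmer_tokens = 180_000_000
vietnamese_tokens = replay_tokens(khmer_tokens, 0.15)
print(vietnamese_tokens, vietnamese_tokens / (khmer_tokens + vietnamese_tokens))
```

Under these assumptions the mix needs roughly 31.8M Vietnamese tokens alongside the Khmer corpus.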

Phase 1 complete. Data manifest, tokenizer plan, transfer plan all in workspace_state.md.

Phase 2 — Analyze

Goal: Run linguistic analysis layers needed for the project.

For Khmer (analytic morphology, abugida, Class 2):

morph: Tier "lo-mid" — no morpheme segmentation needed; BPE handles it. Skip deep morph analysis.
syntax: No Khmer UD treebank is available, so parsing relies on cross-lingual transfer from Vietnamese (URIEL=0.31) with Trankit. Agreement probes: word-order (100 pairs), numeral-classifier (80 pairs).
semantics: OMW coverage ~4K synsets. MWE catalog needed for idiom-heavy text. COMET-22 has Khmer coverage — use for MT eval.
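
One common way to score probes of this kind is minimal-pair accuracy: the model should prefer the acceptable sentence over its violated counterpart. The sketch below assumes each probe is an (acceptable, violated) pair and that `score` returns something like a sentence log-likelihood; both are assumptions, since the guide does not specify the probe format.

```python
from typing import Callable, Sequence, Tuple

def probe_accuracy(
    pairs: Sequence[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Share of pairs where the acceptable sentence outscores the violated one."""
    if not pairs:
        raise ValueError("need at least one probe pair")
    hits = sum(score(good) > score(bad) for good, bad in pairs)
    return hits / len(pairs)

def toy_score(sentence: str) -> float:
    """Placeholder scorer that prefers shorter strings; a real run would call the model."""
    return -float(len(sentence))

toy_pairs = [("ab", "abc"), ("abcd", "a")]
print(probe_accuracy(toy_pairs, toy_score))
```

Reporting accuracy separately for the word-order and numeral-classifier sets keeps a weakness in one phenomenon from being averaged away by the other.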

Phase 2 complete (targeted — not all analysis skills needed for every project).

Phase 3 — Evaluate

Goal: Honest quality measurement.

linguistic-eval selects:

Benchmark: Belebele (122 languages, includes Khmer) for reading comprehension; FLORES+ MT (flag contamination risk)
MT metric: chrF++ + COMET-22 (Khmer coverage confirmed). BLEU supplementary only.
Probes: word-order (100 pairs), numeral-classifier (80 pairs)
Stratification: per-register (Bible vs web vs news — 3 slices)
Contamination: FLORES contamination confirmed → report as lower bound; use Belebele as primary
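
Per-register stratification amounts to grouping sentence-level scores by register before averaging. The following is a minimal sketch with made-up scores; `stratify` is a hypothetical helper, and the values stand in for whatever sentence-level metric is chosen.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

def stratify(scores: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average a sentence-level metric separately for each register slice."""
    by_register = defaultdict(list)
    for register, value in scores:
        by_register[register].append(value)
    return {register: mean(values) for register, values in by_register.items()}

# Illustrative numbers only, not results from this pipeline.
example = [("web", 0.5), ("web", 1.0), ("news", 0.25), ("bible", 0.75)]
print(stratify(example))
```

Keeping the slices separate shows whether a headline score is being carried by one register, such as the archaic liturgical text flagged during corpus collection.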

Phase 3 complete. Eval report in workspace_state.md.

Phase 4 — Release

Goal: Final ethics gate and model card.

linguistic-ethics final check:

All sources CC-BY-4.0 or CC-BY-SA-3.0 (SA propagation noted in model card)
Attribution registry complete
Community sign-off: standard FPIC (national language, no restricted-access data)
Release mode: OPEN
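
The license part of this gate can be expressed as a small check over the data manifest. This is a sketch only: the allowed-license set mirrors the two licenses named above, the source names come from the corpus table, and `license_gate` is a hypothetical function rather than the linguistic-ethics implementation.

```python
from typing import Dict, List, Union

ALLOWED_LICENSES = {"CC-BY-4.0", "CC-BY-SA-3.0"}
SHARE_ALIKE_LICENSES = {"CC-BY-SA-3.0"}

def license_gate(sources: Dict[str, str]) -> Dict[str, Union[bool, List[str]]]:
    """Pass only if every source is allowed; list share-alike sources for the model card."""
    blocked = sorted(name for name, lic in sources.items() if lic not in ALLOWED_LICENSES)
    share_alike = sorted(name for name, lic in sources.items() if lic in SHARE_ALIKE_LICENSES)
    return {"pass": not blocked, "blocked": blocked, "share_alike": share_alike}

manifest = {
    "MADLAD-400 khm": "CC-BY-4.0",
    "Wikipedia (khm)": "CC-BY-SA-3.0",
    "Bible-NLP (khm)": "CC-BY-4.0",
}
print(license_gate(manifest))
```

A source with any other license would appear under `blocked` and hold the release until it is cleared or removed.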

Model card sections written: datasets, licenses, lineage, ethics statement, intended uses, limitations, contact.

Summary

Phase	Duration (estimated)	Key Outputs
Scope	15–30 min	Language profile, script policy, ethics seed
Acquire	1–3 days	180M-token corpus (post-dedup), 142K bitext pairs, tokenizer plan
Analyze	2–4 hours	UD cross-lingual plan, 180 agreement probes
Evaluate	1–2 hours	Eval report, contamination flags
Release	30–60 min	Model card, release decision

The pipeline produces fully documented, reproducible artifacts at each phase. workspace_state.md carries the state forward across sessions.
