Pipeline Workflow Guide

This guide walks through a complete linguistic pipeline run from target language selection to model release. The example uses Khmer (khm), a Joshi Class 2 language with an abugida script, to illustrate decisions at each phase.

Before You Start

Ensure the suite is installed and workspace_state.md does not exist in your project directory (or delete it to start fresh). Then open a Claude Code session.
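The "start fresh" step can be scripted. The following is a minimal sketch, assuming the state file sits directly in the project directory; the helper name `start_fresh` is hypothetical and not part of the suite.

```python
from pathlib import Path

STATE_FILE = "workspace_state.md"  # file name taken from this guide


def start_fresh(project_dir: str) -> bool:
    """Delete a stale state file if present; return True if one was removed.

    Hypothetical helper: the suite itself may offer its own reset mechanism.
    """
    state_path = Path(project_dir) / STATE_FILE
    if state_path.exists():
        state_path.unlink()
        return True
    return False
```

Calling `start_fresh(".")` before opening the session leaves the directory without a state file, so the orchestrator creates a new one.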

Phase 0 — Scope

Goal: Identify the language precisely and set strategic direction.

Step 1: Enter the pipeline

help me build an LLM for Khmer

The orchestrator creates workspace_state.md and routes to linguistic-scope.

Step 2: Language disambiguation

Khmer (ISO: khm) is unambiguous — no macrolanguage disambiguation needed. Scope resolves:

  • ISO 639-3: khm
  • Glottolog: khmr1253
  • Family: Austroasiatic > Khmer
  • Script: Khmer abugida (U+1780–U+17FF)
  • Resource class: Joshi 2 ("Hopefuls")
  • Vitality: EGIDS 1 (National language of Cambodia — standard FPIC)
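The script entry in the profile can be used as a cheap sanity check on incoming text. The sketch below, which is an illustration rather than anything the suite is known to do, measures what share of non-whitespace characters falls in the Khmer block listed above; the function name `khmer_ratio` is an assumption.

```python
KHMER_BLOCK = range(0x1780, 0x1800)  # U+1780..U+17FF, as listed in the profile


def khmer_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that lie in the Khmer block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(ord(c) in KHMER_BLOCK for c in chars)
    return in_block / len(chars)
```

A document with a low ratio is probably mislabeled or mostly foreign-script text; the cut-off to use is a project decision this guide does not specify.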

Step 3: Typological profile

Key outliers for Khmer:

  • Abugida script: character_coverage must be 0.99999+ in SentencePiece
  • Analytic morphology: fertility tier "lo-mid"; standard BPE adequate
  • SVO word order: agreement probe needed for word-order violations
  • No grammatical gender or tonal contrasts

Transfer source: Vietnamese (vie, URIEL=0.31) — same Austroasiatic family, Class 3+ data available.

Step 4: Script policy

linguistic-scripts sets:

  • Normalization: NFC (NEVER NFKC for Khmer — destroys conjuncts)
  • Diacritics: PRESERVE (not tonal, but vowel-diacritics are mandatory)
  • No romanization needed for training pipeline
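In code, the normalization policy amounts to one standard-library call. This is a minimal sketch; the wrapper name `normalize_khmer` is hypothetical, and the claim that NFKC damages Khmer conjuncts is the guide's, not something the sketch verifies.

```python
import unicodedata


def normalize_khmer(text: str) -> str:
    """Apply NFC only (canonical composition).

    No compatibility folding (NFKC) and no diacritic stripping,
    in line with the script policy above.
    """
    return unicodedata.normalize("NFC", text)
```

NFC composes canonically equivalent sequences (for example, `e` followed by U+0301 becomes `é`) but leaves compatibility characters alone, whereas NFKC would also rewrite them.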

Step 5: Ethics seed

EGIDS 1 (National language) — standard FPIC. No sacred-text flags. Community engagement at standard depth.

Phase 0 complete. workspace_state.md has ISO code, Joshi class, typology, script policy, ethics seed.

Phase 1 — Acquire

Goal: Gather monolingual and parallel data ethically and reproducibly.

Corpus identification

linguistic-corpus catalogs:

| Source | Size | License | Register | Notes |
| --- | --- | --- | --- | --- |
| MADLAD-400 khm | 800 MB | CC-BY-4.0 | web 75%, wiki 20%, news 5% | Good quality |
| Wikipedia (khm) | 40 MB | CC-BY-SA-3.0 | encyclopedic 100% | SA propagation note |
| Bible-NLP (khm) | 3 MB | CC-BY-4.0 | liturgical 100% | Flag: archaic register |

Ethics check: Bible-NLP passes (CC-BY; EGIDS 1; no community restrictions). Wikipedia SA propagation noted.

After deduplication (MinHash, threshold 0.9, shingle size 3 for the abugida script): 180M tokens at a 15% dedup rate. Register mix: web 72%, wiki 22%, liturgical 6%, which is within acceptable bounds.
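To make the dedup parameters concrete, here is a small sketch of near-duplicate filtering over character 3-shingles. It computes exact Jaccard similarity, which is the quantity MinHash estimates; a real run over a corpus of this size would use MinHash signatures with LSH instead. All function names are hypothetical.

```python
def shingles(text: str, k: int = 3) -> set:
    """Character k-shingles; texts shorter than k yield one shingle."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity (MinHash approximates this value)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedup(docs: list, threshold: float = 0.9, k: int = 3) -> list:
    """Keep a document only if it is below the threshold against all kept ones.

    Quadratic in the number of documents: fine for a spot-check sample,
    not for the full corpus.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def dedup_rate(n_before: int, n_after: int) -> float:
    """Share of documents removed by deduplication."""
    return 1 - n_after / n_before
```

Character shingles are used here because Khmer text is not space-delimited, so word shingles would need a segmenter first.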

Bitext mining

linguistic-bitext mines English-Khmer parallel data:

  • Embedding: LASER3 (adequate for Austroasiatic)
  • Aligner: Vecalign
  • Margin threshold: 1.04 (Class 2 target) + 50-pair spot-check
  • Result: ~62K real pairs (OPUS + FLORES training split)
  • Synthetic: back-translation T=0.8 → 80K additional pairs
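The margin threshold is applied to a margin score, not a raw cosine. The sketch below assumes the ratio-margin formulation commonly used with LASER-style mining (cosine of the pair divided by the average cosine to each side's k nearest neighbours); the guide does not say which margin variant linguistic-bitext uses, and the function names and default `k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ratio_margin(x, y, src_vecs, tgt_vecs, k=4):
    """Ratio margin for a candidate pair (x from source, y from target)."""
    # k nearest target-side neighbours of x, and source-side neighbours of y
    nn_x = sorted((cosine(x, t) for t in tgt_vecs), reverse=True)[:k]
    nn_y = sorted((cosine(y, s) for s in src_vecs), reverse=True)[:k]
    denom = sum(nn_x) / (2 * len(nn_x)) + sum(nn_y) / (2 * len(nn_y))
    return cosine(x, y) / denom


def keep_pair(score: float, threshold: float = 1.04) -> bool:
    """Threshold from the guide; pairs at or above it are retained."""
    return score >= threshold
```

The margin rewards pairs that are much closer to each other than to their other neighbours, which is why the threshold sits just above 1 instead of near a fixed cosine value.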

Tokenizer audit

linguistic-tokenize on khm with tiktoken-cl100k_base:

  • Fertility: 4.1× — EXTEND MANDATORY (abugida; ideographic-like density)
  • Method: OFA vocab extension (parallel data available)
  • SentencePiece: character_coverage=0.99999, byte_fallback=true, vocab_size=48K
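Fertility here means tokens per word. A minimal sketch of the measurement follows, with the tokenizer and the word segmenter passed in as callables; both are assumptions, and because Khmer is written without spaces between words, the segmenter has to be a real word segmenter, not `str.split`.

```python
from typing import Callable, Iterable, List


def fertility(
    sentences: Iterable[str],
    tokenize: Callable[[str], List],
    segment_words: Callable[[str], List],
) -> float:
    """Average number of tokenizer tokens per word over a sample."""
    sample = list(sentences)  # materialize so the sample can be read twice
    n_tokens = sum(len(tokenize(s)) for s in sample)
    n_words = sum(len(segment_words(s)) for s in sample)
    if n_words == 0:
        raise ValueError("sample contains no words")
    return n_tokens / n_words
```

With a real setup, `tokenize` would wrap the tokenizer under audit and `segment_words` a Khmer segmenter of your choice; the value reported above (4.1×) comes from the guide's own run, not from this sketch.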

Transfer plan

linguistic-transfer:

  • URIEL distance to Vietnamese (best source): 0.31 → LoRA r=16, alpha=32, all-linear modules
  • Forgetting mitigation: 15% Vietnamese in training mix
  • Tool: Unsloth (single-GPU QLoRA)
  • Base: mBART-large-50 (good Khmer seed)
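Two numbers in the plan are easy to get wrong by hand: how many Vietnamese examples give a 15% share of the final mix, and the effective LoRA scaling. The sketch below works both out. The dictionary keys follow common LoRA configuration naming and are shown only as an illustration, and the helper name is hypothetical.

```python
# Hyperparameters copied from the transfer plan; key names are illustrative.
LORA_PLAN = {"r": 16, "lora_alpha": 32, "target_modules": "all-linear"}


def lora_scaling(plan: dict) -> float:
    """Standard LoRA scaling factor alpha / r applied to the low-rank update."""
    return plan["lora_alpha"] / plan["r"]


def source_examples_for_mix(n_target: int, source_share: float = 0.15) -> int:
    """Source-language examples needed so they form `source_share` of the mix.

    Solves n_source / (n_source + n_target) = source_share for n_source.
    """
    if not 0 <= source_share < 1:
        raise ValueError("source_share must be in [0, 1)")
    return round(n_target * source_share / (1 - source_share))
```

Note that 15% of the mix is not 15% of the Khmer count: for 850 Khmer examples the mix needs 150 Vietnamese examples, not 128.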

Phase 1 complete. Data manifest, tokenizer plan, transfer plan all in workspace_state.md.

Phase 2 — Analyze

Goal: Run linguistic analysis layers needed for the project.

For Khmer (analytic morphology, abugida, Class 2):

  • morph: Tier "lo-mid" — no morpheme segmentation needed; BPE handles it. Skip deep morph analysis.
  • syntax: No Khmer UD treebank, so use cross-lingual transfer from Vietnamese (URIEL=0.31). Tool: Trankit. Agreement probes: word-order (100 pairs), numeral-classifier (80 pairs).
  • semantics: OMW coverage ~4K synsets. MWE catalog needed for idiom-heavy text. COMET-22 has Khmer coverage — use for MT eval.
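The agreement probes are minimal pairs: one acceptable sentence and one violation. A common way to score them, sketched here as an assumption about how the probes are used, is to count how often the model assigns the acceptable sentence a higher score. The `score` callable (for example, a sentence log-probability) is supplied by the caller and is hypothetical.

```python
from typing import Callable, List, Tuple


def probe_accuracy(
    pairs: List[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Share of (acceptable, violation) pairs where the acceptable one wins.

    `score` should return higher values for sentences the model prefers.
    Ties count as failures.
    """
    if not pairs:
        raise ValueError("no probe pairs given")
    hits = sum(score(good) > score(bad) for good, bad in pairs)
    return hits / len(pairs)
```

Running this separately on the 100 word-order pairs and the 80 numeral-classifier pairs keeps the two phenomena from masking each other in a single number.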

Phase 2 complete (targeted — not all analysis skills needed for every project).

Phase 3 — Evaluate

Goal: Honest quality measurement.

linguistic-eval selects:

  • Benchmark: Belebele (122 languages, includes Khmer) for reading comprehension; FLORES+ MT (flag contamination risk)
  • MT metric: chrF++ + COMET-22 (Khmer coverage confirmed). BLEU supplementary only.
  • Probes: word-order (100 pairs), numeral-classifier (80 pairs)
  • Stratification: per-register (Bible vs web vs news — 3 slices)
  • Contamination: FLORES contamination confirmed → report as lower bound; use Belebele as primary
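Per-register stratification means reporting one score per slice instead of a single pooled average. A minimal sketch, assuming each evaluated segment carries a register label and a segment-level metric score (the row layout and function name are hypothetical):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_register_means(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Mean score per register from (register, score) rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for register, score in rows:
        sums[register] += score
        counts[register] += 1
    return {register: sums[register] / counts[register] for register in sums}
```

A pooled mean would be dominated by the web slice, so a weak liturgical score could go unnoticed; the per-slice view makes it visible.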

Phase 3 complete. Eval report in workspace_state.md.

Phase 4 — Release

Goal: Final ethics gate and model card.

linguistic-ethics final check:

  • All sources CC-BY-4.0 or CC-BY-SA-3.0 (SA propagation noted in model card)
  • Attribution registry complete
  • Community sign-off: standard FPIC (national language, no restricted-access data)
  • Release mode: OPEN
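The license part of the gate can be expressed as a simple check over the data manifest. This is a sketch under assumptions: the allowed set mirrors the two licenses named above, the manifest is modelled as a plain source-to-license mapping, and `license_gate` is a hypothetical name, not the linguistic-ethics implementation.

```python
ALLOWED = {"CC-BY-4.0", "CC-BY-SA-3.0"}  # licenses named in the final check
SHARE_ALIKE = {"CC-BY-SA-3.0"}           # must be propagated in the model card


def license_gate(sources: dict) -> dict:
    """Check a {source name: license} mapping against the allowed set."""
    blocked = sorted(n for n, lic in sources.items() if lic not in ALLOWED)
    share_alike = sorted(n for n, lic in sources.items() if lic in SHARE_ALIKE)
    return {
        "release_ok": not blocked,   # any unlisted license blocks release
        "blocked": blocked,
        "share_alike": share_alike,  # sources needing an SA note
    }
```

For the three sources cataloged in Phase 1 the gate passes and flags only Wikipedia for the share-alike note. Attribution and community sign-off are separate checks that this sketch does not cover.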

Model card sections written: datasets, licenses, lineage, ethics statement, intended uses, limitations, contact.

Summary

| Phase | Duration (estimated) | Key outputs |
| --- | --- | --- |
| Scope | 15–30 min | Language profile, script policy, ethics seed |
| Acquire | 1–3 days | 180M-token corpus (post-dedup), 142K bitext pairs, tokenizer plan |
| Analyze | 2–4 hours | UD cross-lingual plan, 180 agreement probes |
| Evaluate | 1–2 hours | Eval report, contamination flags |
| Release | 30–60 min | Model card, release decision |

The pipeline produces fully documented, reproducible artifacts at each phase. workspace_state.md carries the state forward across sessions.
