Tutorial: Full Pipeline
Difficulty: Advanced
Time: 2–3 hours
Skills used: All phases — orchestrator, scope, scripts, ethics, corpus, bitext, tokenize, transfer, morph, syntax, eval
Language: Igbo (ibo) — Nigeria, Joshi Class 1, Niger-Congo
This tutorial runs all five pipeline phases for Igbo, a Class 1 language with limited data but an active NLP research community. By the end you'll have a complete, documented pipeline ready for training.
Prerequisites
- Linguistic Agent Skills installed (see Installation)
- Familiarity with Explore a Language Profile and Build a Corpus
- New project directory (empty)
Phase 0 — Scope
Trigger
help me build an LLM for Igbo
Language disambiguation
Igbo (ISO: ibo) is unambiguous — no macrolanguage disambiguation needed.
ISO 639-3: ibo
Glottolog: igbo1259
Family: Niger-Congo > Atlantic-Congo > Volta-Congo > Igboid
Script: Latin (with tone diacritics: dot-below ọ, ụ; accent marks)
Resource class: Joshi 1 — "The Scraping-Bys"
Vitality (EGIDS): 2 — Provincial (widely spoken in Nigeria)
Joshi Class 1 implications: Wikipedia exists (~4,200 articles). Minimal labeled data. No comprehensive benchmarks. Strategy: continued pretraining + OFA vocab extension + LoRA.
Typological profile
Typological outliers for Igbo:
• Tone: HIGH/LOW/DOWNSTEP — diacritic preservation MANDATORY
• SVO word order
• Isolating morphology (low fertility expected: ~1.9×)
• Serial verb constructions
• Pro-drop: partial
• No grammatical gender
Transfer source candidates:
1. Yoruba (yor) — URIEL=0.18 — same branch (Volta-Congo) + tone + Latin + Class 2
2. Twi (twi) — URIEL=0.29 — Niger-Congo + tone + Class 2
3. Hausa (hau) — URIEL=0.41 — West African contact + Class 2 (different family)
English: URIEL=0.63 — NOT recommended
Script policy
Script Policy: Igbo
- Normalization: NFC
- Diacritics: PRESERVE (ọ, ụ with dot-below carry meaning; tone marks á/à carry meaning)
- Confusable risk: MEDIUM (dot-below characters may confuse with base characters in noisy web data)
- Action: TR39 + dot-below confusable detection before dedup
Ethics seed
EGIDS 2 (Provincial), standard FPIC. Note: Igbo has active community NLP organizations (Masakhane community). Engage with community norms for commercial use.
Phase 0 complete. Workspace state written.
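The script policy from this phase (NFC normalization, diacritics preserved) can be illustrated with the standard library. This is a minimal sketch, not the skill's actual implementation; the function names `normalize_igbo` and `has_dot_below` are hypothetical.

```python
import unicodedata


def normalize_igbo(text: str) -> str:
    """Apply NFC so decomposed and precomposed forms compare equal.

    Nothing is stripped: dot-below letters and tone marks survive.
    """
    return unicodedata.normalize("NFC", text)


def has_dot_below(text: str) -> bool:
    """True if any character carries U+0323 COMBINING DOT BELOW.

    Decomposing first makes the check work for both input forms.
    """
    return "\u0323" in unicodedata.normalize("NFD", text)


# 'o' followed by a combining dot below, as often found in noisy web text.
decomposed = "o\u0323"
composed = normalize_igbo(decomposed)

# After NFC the two spellings of the same letter are one code point.
same_letter = composed == "\u1ecd"
```

Running the normalizer before deduplication matters because two copies of the same sentence can otherwise differ only in how the dot-below letters are encoded.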
Phase 1 — Acquire
Corpus sources
Corpus Catalog: Igbo (ibo)
Source | Size | License | Register % | Notes
Wikipedia (ibo) | 12MB | CC-BY-SA | encyclopedic 100% | Quality; SA propagation
Bible-NLP (ibo) | 3.5MB | CC-BY | liturgical 100% | Tone-marked; archaic register
CulturaX (ibo) | 28MB | CC-BY | web 85%, social 15% | Mixed quality; LID needed
MasakhaNER | 0.3MB | CC-BY | news 100% | Gold-labeled NER; high value
Post-LID (GlotLID paragraph-level) + NFC + dedup (threshold=0.9):
- Total: 8.8M tokens
- Register: web 58%, encyclopedic 19%, liturgical 8%, news 15% — balanced
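The catalog gives a dedup threshold of 0.9 but does not say which similarity measure it applies to. The sketch below assumes Jaccard similarity over character 5-gram shingles, computed after NFC and whitespace normalization; the measure, the shingle size, and all function names are assumptions made for illustration.

```python
import unicodedata


def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-gram shingles of NFC-normalized, whitespace-collapsed text."""
    norm = unicodedata.normalize("NFC", " ".join(text.split()))
    if len(norm) <= n:
        return {norm}
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(paragraphs: list[str], threshold: float = 0.9) -> list[str]:
    """Greedy near-duplicate filter: keep a paragraph only if it is below
    the threshold against every paragraph already kept."""
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for paragraph in paragraphs:
        current = shingles(paragraph)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(paragraph)
            kept_shingles.append(current)
    return kept


docs = [
    "Ndewo, kedu ka ị mere?",
    "Ndewo,  kedu ka ị mere?",   # same text with a doubled space
    "Daalụ nke ukwuu.",
]
unique_docs = dedup(docs)
```

The pairwise comparison is quadratic, which is fine for a sketch; at corpus scale an approximate method such as MinHash with locality-sensitive hashing is the usual substitute.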
Tokenizer audit
Tokenizer: tiktoken-cl100k_base on Igbo
Fertility: 2.8 / 1.4 = 2.0× — at the EXTEND RECOMMENDED threshold
Dot-below characters (ọ, ụ): covered via combining characters; fertility bump confirmed
Recommendation: OFA vocab extension (parallel data available via Bible-NLP)
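The fertility figure in the audit is a ratio of two rates. The sketch below reproduces that arithmetic, assuming that 2.8 is tokens per whitespace-delimited word on Igbo text and 1.4 is the same rate on an English reference; the audit does not define either number, so treat both readings as assumptions. The toy tokenizer and the `>=` comparison against the threshold are likewise illustrative.

```python
from typing import Callable, Iterable, List

# Threshold value taken from the audit line; the >= comparison is assumed.
EXTEND_THRESHOLD = 2.0


def tokens_per_word(lines: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = 0
    tokens = 0
    for line in lines:
        words += len(line.split())
        tokens += len(tokenize(line))
    return tokens / words if words else 0.0


def relative_fertility(target_rate: float, reference_rate: float) -> float:
    """Target-language rate divided by the reference-language rate."""
    return target_rate / reference_rate


def toy_tokenize(line: str) -> List[str]:
    """Stand-in tokenizer: cuts every word into chunks of up to 3 characters."""
    return [word[i:i + 3] for word in line.split()
            for i in range(0, len(word), 3)]


toy_rate = tokens_per_word(["abcdef gh"], toy_tokenize)   # 3 tokens / 2 words
ratio = relative_fertility(2.8, 1.4)                      # figures from the audit
extend_recommended = ratio >= EXTEND_THRESHOLD
```

To audit a real tokenizer, pass its encode function as `tokenize`, for example `tiktoken.get_encoding("cl100k_base").encode`, and run both the Igbo sample and the reference sample through `tokens_per_word`.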
SentencePiece config:
model_type=unigram, vocab_size=32K, character_coverage=0.9999,
byte_fallback=true, split_digits=true
Bitext mining
Bitext: English ↔ Igbo
Embedding: SONAR (Niger-Congo; LASER3 gaps for Igboid)
Threshold: 1.04 + 50-pair spot-check (Class 1)
Sources: Bible-NLP parallel (28K pairs) + OPUS (4K pairs)
Synthetic: back-translation T=0.8 → 45K pairs
Total: ~77K pairs
Transfer plan
Transfer Plan: Igbo (Class 1)
- Approach: OFA vocab extension + LoRA
- Source: Yoruba (URIEL=0.18, Class 2, data available)
- LoRA: rank=16, alpha=32, all-linear modules
- embed_tokens: YES (vocab extended)
- Forgetting mitigation: 15% Yoruba in training mix
- Tool: Unsloth (single GPU)
- Base: NLLB-200 distilled (covers Igbo; good seed for Niger-Congo)
Phase 1 complete. Manifest: 8.8M tokens, 77K bitext pairs, tokenizer plan, transfer plan.
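The transfer plan's forgetting mitigation puts 15% Yoruba in the training mix. One way to build such a mix is sketched below, assuming the share is measured in examples rather than tokens and that Yoruba examples are sampled with replacement; the plan specifies neither, and `build_mix` is a hypothetical helper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_mix(target: Sequence[T], replay_pool: Sequence[T],
              replay_share: float = 0.15, seed: int = 0) -> List[T]:
    """Return all target examples plus enough replay examples that the
    replay data makes up `replay_share` of the final mix."""
    if not 0 <= replay_share < 1:
        raise ValueError("replay_share must be in [0, 1)")
    # Solve n_replay / (n_target + n_replay) = replay_share for n_replay.
    n_replay = round(len(target) * replay_share / (1 - replay_share))
    rng = random.Random(seed)
    replay = [rng.choice(replay_pool) for _ in range(n_replay)] if replay_pool else []
    mix = list(target) + replay
    rng.shuffle(mix)
    return mix


igbo = [("ibo", i) for i in range(85)]
yoruba = [("yor", i) for i in range(10)]
mix = build_mix(igbo, yoruba)
yoruba_count = sum(1 for lang, _ in mix if lang == "yor")
```

With 85 Igbo examples the helper adds 15 Yoruba examples, so Yoruba is 15 of 100. Measuring the share in tokens instead would need per-example lengths.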
Phase 2 — Analyze
For Igbo (isolating morphology, Class 1), only targeted analysis is needed:
Morphology
Morphology tier: lo-mid (isolating; standard BPE handles it)
UniMorph coverage: sparse for Igbo
Action: SKIP deep morph analysis — fertility audit handled at tokenize phase
Syntax
UD treebank: IgboNLP treebank (small; ~2,000 sentences — PUD-style)
Approach: cross-lingual transfer (treebank too small for training)
Transfer source: Yoruba UD (URIEL=0.18; training-size available)
Parser: Trankit (XLM-R based; best low-resource)
Agreement probes:
• Tone preservation in minimal pairs (150 pairs — MANDATORY for tone language)
• SVO word-order violations (120 pairs)
• Serial verb construction (80 pairs — Igbo-specific)
Ethics check on MasakhaNER
MasakhaNER Igbo: CC-BY-4.0
CARE check: PASS — developed by Masakhane (community-driven NLP)
Attribution: COMPLETE (cite Adelani et al. 2021)
Sacred-text: NONE
Release mode: OPEN
Phase 2 complete. Syntax plan + 350 agreement probes in workspace state.
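The agreement probes from the syntax plan are minimal pairs: an acceptable sentence and a minimally different unacceptable one. A common way to score them, assumed here rather than taken from the skill, is to count how often a model assigns the acceptable member the higher score. The placeholder sentences and the toy scorer below are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

# Probe counts as listed in the syntax plan.
PROBE_COUNTS: Dict[str, int] = {
    "tone_preservation": 150,
    "svo_word_order": 120,
    "serial_verb_construction": 80,
}


def probe_accuracy(pairs: Sequence[Tuple[str, str]],
                   score: Callable[[str], float]) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the acceptable
    sentence gets the strictly higher score."""
    if not pairs:
        return 0.0
    hits = sum(1 for good, bad in pairs if score(good) > score(bad))
    return hits / len(pairs)


# Placeholder strings stand in for real Igbo minimal pairs.
toy_pairs = [("good_1", "bad_1"), ("good_2", "bad_2")]
# Toy scores stand in for model log-probabilities.
toy_scores = {"good_1": -3.0, "bad_1": -5.0, "good_2": -6.0, "bad_2": -4.0}

accuracy = probe_accuracy(toy_pairs, toy_scores.__getitem__)
total_probes = sum(PROBE_COUNTS.values())
```

In a real run, `score` would return the summed token log-probability of the sentence under the model being evaluated, and accuracy would be reported per phenomenon.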
Phase 3 — Evaluate
Eval Plan: Igbo
Benchmark:
- FLORES-200 (flag: in pretrain mix → lower bound only)
- MasakhaNER F1 (NER; gold-labeled; not in pretrain)
- AfriSenti-Igbo (sentiment; Igbo-specific)
- Custom agreement probes (350 pairs across 3 phenomena)
Metrics:
- MT: chrF++ primary + GEMBA-MQM (COMET-22 Igbo coverage: PARTIAL — verify)
- NER: F1 per-tag (PER, ORG, LOC, DATE)
- Sentiment: macro-F1 (pos/neg/neutral)
Contamination:
- FLORES confirmed in NLLB pretrain → lower bound only
- MasakhaNER: CLEAN (not in major pretrain mixes)
- AfriSenti: CLEAN
Stratification:
- Per-register (liturgical vs news vs web)
- Per-direction for MT (En→Ibo vs Ibo→En)
Phase 3 complete. Eval report in workspace state.
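The sentiment metric in the eval plan is macro-F1 over three classes. A dependency-free sketch of that computation follows; the label strings and the convention of scoring an absent class as 0.0 are assumptions, and the toy predictions are not results from any model.

```python
from typing import List, Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores: List[float] = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no gold and no predicted instances scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["pos", "neg", "neutral"]
gold = ["pos", "neg", "neutral", "pos"]
pred = ["pos", "neg", "pos", "pos"]

score = macro_f1(gold, pred, labels)  # per-class F1: 0.8, 1.0, 0.0
```

Because every class counts equally, the missed neutral example pulls the score to 0.6 even though three of four predictions are correct. For per-register stratification, call the function once per register subset.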
Phase 4 — Release
Ethics Final Gate: Igbo
License mix:
Wikipedia: CC-BY-SA (SA propagates → model outputs CC-BY-SA compatible)
CulturaX: CC-BY (OK)
Bible-NLP: CC-BY (OK; flag register in model card)
MasakhaNER: CC-BY (OK; cite Adelani et al.)
CARE check: PASS
- CulturaX: PASS
- MasakhaNER: PASS (Masakhane community-driven; serves source community)
- Bible-NLP: NEEDS-WORK → flag in model card (liturgical register; community use norms)
FPIC: COMPLETE (EGIDS 2; standard process)
Sacred-text: Bible-NLP flagged in model card
Release mode: OPEN
Condition: model card must note CC-BY-SA from Wikipedia; Bible register note; cite Masakhane
Model card sections written
- Datasets + licenses + lineage (4 sources, all documented)
- Ethics statement (CARE alignment; Masakhane community acknowledgment)
- Intended uses: NER, sentiment, MT En↔Ibo — open research and commercial
- Limitations: Class 1 language; limited parallel data; FLORES contamination; liturgical register in bitext
- Contact for community concerns
Phase 4 complete. Pipeline complete.
Summary
| Phase | Duration | Key Outputs |
|---|---|---|
| Scope | 20 min | Language profile, script policy, ethics seed |
| Acquire | 90 min | 8.8M tokens, 77K pairs, tokenizer + transfer plan |
| Analyze | 30 min | Syntax plan, 350 agreement probes |
| Evaluate | 20 min | Multi-benchmark eval plan, contamination flags |
| Release | 20 min | Model card, Open release decision |
Open Findings at Completion
/linguistic:findings
HIGH (0): none
MEDIUM (2):
[linguistic-bitext] Bitext liturgical % = 36% — flag in MT eval
[linguistic-eval] FLORES-200 confirmed in pretrain — report as lower bound
LOW (1):
[linguistic-ethics] Wikipedia CC-BY-SA propagation — noted in model card
All findings documented in workspace_state.md and model card. Pipeline is complete and release-ready.