Tutorial: Full Pipeline

Difficulty: Advanced
Time: 2–3 hours
Skills used: All phases — orchestrator, scope, scripts, ethics, corpus, bitext, tokenize, transfer, morph, syntax, eval
Language: Igbo (ibo) — Nigeria, Joshi Class 1, Niger-Congo

This tutorial runs all five pipeline phases for Igbo — a Class 1 language with limited data but active NLP research community. By the end you'll have a complete, documented pipeline ready for training.

Prerequisites

Linguistic Agent Skills installed (see Installation)
Familiarity with Explore a Language Profile and Build a Corpus
New project directory (empty)

Phase 0 — Scope

Trigger

help me build an LLM for Igbo

Language disambiguation

Igbo (ISO: ibo) is unambiguous — no macrolanguage disambiguation needed.

ISO 639-3: ibo
Glottolog: igbo1259
Family: Niger-Congo > Atlantic-Congo > Volta-Congo > Igboid
Script: Latin (with tone diacritics: dot-below ọ, ụ; accent marks)
Resource class: Joshi 1 — "The Scraping-Bys"
Vitality (EGIDS): 2 — Provincial (widely spoken in Nigeria)

Joshi Class 1 implications: Wikipedia exists (~4,200 articles). Minimal labeled data. No comprehensive benchmarks. Strategy: continued pretraining + OFA vocab extension + LoRA.

Typological profile

Typological outliers for Igbo:
  • Tone: HIGH/LOW/DOWNSTEP — diacritic preservation MANDATORY
  • SVO word order
  • Isolating morphology (low fertility expected: ~1.9×)
  • Serial verb constructions
  • Pro-drop: partial
  • No grammatical gender

Transfer source candidates:
  1. Yoruba (yor) — URIEL=0.18 — same branch (Volta-Congo) + tone + Latin + Class 2
  2. Twi (twi) — URIEL=0.29 — Niger-Congo + tone + Class 2
  3. Hausa (hau) — URIEL=0.41 — West African contact + Class 2 (different family)
  English: URIEL=0.63 — NOT recommended

Script policy

Script Policy: Igbo
- Normalization: NFC
- Diacritics: PRESERVE (ọ, ụ with dot-below carry meaning; tone marks á/à carry meaning)
- Confusable risk: MEDIUM (dot-below characters may confuse with base characters in noisy web data)
- Action: TR39 + dot-below confusable detection before dedup

Corpus Catalog: Igbo (ibo)

Source          | Size  | License   | Register %             | Notes
Wikipedia (ibo) | 12MB  | CC-BY-SA  | encyclopedic 100%      | Quality; SA propagation
Bible-NLP (ibo) | 3.5MB | CC-BY     | liturgical 100%        | Tone-marked; archaic register
CulturaX (ibo)  | 28MB  | CC-BY     | web 85%, social 15%    | Mixed quality; LID needed
MasakhaNER      | 0.3MB | CC-BY     | news 100%              | Gold-labeled NER; high value

Post-LID (GlotLID paragraph-level) + NFC + dedup (threshold=0.9):

Total: 8.8M tokens
Register: web 58%, encyclopedic 19%, liturgical 8%, news 15% — balanced

Tokenizer audit

Tokenizer: tiktoken-cl100k_base on Igbo
Fertility: 2.8 / 1.4 = 2.0× — at the EXTEND RECOMMENDED threshold
Dot-below characters (ọ, ụ): covered via combining characters; fertility bump confirmed

Recommendation: OFA vocab extension (parallel data available via Bible-NLP)
SentencePiece config:
  model_type=unigram, vocab_size=32K, character_coverage=0.9999,
  byte_fallback=true, split_digits=true

Bitext mining

Bitext: English ↔ Igbo
Embedding: SONAR (Niger-Congo; LASER3 gaps for Igboid)
Threshold: 1.04 + 50-pair spot-check (Class 1)
Sources: Bible-NLP parallel (28K pairs) + OPUS (4K pairs)
Synthetic: back-translation T=0.8 → 45K pairs
Total: ~77K pairs

Transfer plan

Transfer Plan: Igbo (Class 1)
- Approach: OFA vocab extension + LoRA
- Source: Yoruba (URIEL=0.18, Class 2, data available)
- LoRA: rank=16, alpha=32, all-linear modules
- embed_tokens: YES (vocab extended)
- Forgetting mitigation: 15% Yoruba in training mix
- Tool: Unsloth (single GPU)
- Base: NLLB-200 distilled (covers Igbo; good seed for Niger-Congo)

Phase 1 complete. Manifest: 8.8M tokens, 77K bitext pairs, tokenizer plan, transfer plan.

Phase 2 — Analyze

For Igbo (isolating morphology, Class 1), only targeted analysis is needed:

Morphology

Morphology tier: lo-mid (isolating; standard BPE handles it)
UniMorph coverage: sparse for Igbo
Action: SKIP deep morph analysis — fertility audit handled at tokenize phase

Syntax

UD treebank: IgboNLP treebank (small; ~2,000 sentences — PUD-style)
Approach: cross-lingual transfer (treebank too small for training)
Transfer source: Yoruba UD (URIEL=0.18; training-size available)
Parser: Trankit (XLM-R based; best low-resource)

Agreement probes:
  • Tone preservation in minimal pairs (150 pairs — MANDATORY for tone language)
  • SVO word-order violations (120 pairs)
  • Serial verb construction (80 pairs — Igbo-specific)

Ethics check on MasakhaNER

MasakhaNER Igbo: CC-BY-4.0
CARE check: PASS — developed by Masakhane (community-driven NLP)
Attribution: COMPLETE (cite Adelani et al. 2021)
Sacred-text: NONE
Release mode: OPEN

Phase 2 complete. Syntax plan + 350 agreement probes in workspace state.

Phase 3 — Evaluate

Eval Plan: Igbo
Benchmark:
  - FLORES-200 (flag: in pretrain mix → lower bound only)
  - MasakhaNER F1 (NER; gold-labeled; not in pretrain)
  - AfriSenti-Igbo (sentiment; Igbo-specific)
  - Custom agreement probes (350 pairs across 3 phenomena)

Metrics:
  - MT: chrF++ primary + GEMBA-MQM (COMET-22 Igbo coverage: PARTIAL — verify)
  - NER: F1 per-tag (PER, ORG, LOC, DATE)
  - Sentiment: macro-F1 (pos/neg/neutral)

Contamination:
  - FLORES confirmed in NLLB pretrain → lower bound only
  - MasakhaNER: CLEAN (not in major pretrain mixes)
  - AfriSenti: CLEAN

Stratification:
  - Per-register (liturgical vs news vs web)
  - Per-direction for MT (En→Ibo vs Ibo→En)

Phase 3 complete. Eval report in workspace state.

Phase 4 — Release

Ethics Final Gate: Igbo

License mix:
  Wikipedia: CC-BY-SA (SA propagates → model outputs CC-BY-SA compatible)
  CulturaX: CC-BY (OK)
  Bible-NLP: CC-BY (OK; flag register in model card)
  MasakhaNER: CC-BY (OK; cite Adelani et al.)

CARE check: PASS
  - CulturaX: PASS
  - MasakhaNER: PASS (Masakhane community-driven; serves source community)
  - Bible-NLP: NEEDS-WORK → flag in model card (liturgical register; community use norms)

FPIC: COMPLETE (EGIDS 2; standard process)
Sacred-text: Bible-NLP flagged in model card

Release mode: OPEN
  Condition: model card must note CC-BY-SA from Wikipedia; Bible register note; cite Masakhane

Model card sections written

Datasets + licenses + lineage (4 sources, all documented)
Ethics statement (CARE alignment; Masakhane community acknowledgment)
Intended uses: NER, sentiment, MT En↔Ibo — open research and commercial
Limitations: Class 1 language; limited parallel data; FLORES contamination; liturgical register in bitext
Contact for community concerns

Phase 4 complete. Pipeline complete.

Summary

Phase	Duration	Key Outputs
Scope	20 min	Language profile, script policy, ethics seed
Acquire	90 min	8.8M tokens, 77K pairs, tokenizer + transfer plan
Analyze	30 min	Syntax plan, 350 agreement probes
Evaluate	20 min	Multi-benchmark eval plan, contamination flags
Release	20 min	Model card, Open release decision

Open Findings at Completion

/linguistic:findings

HIGH (0): none

MEDIUM (2):
  [linguistic-bitext] Bitext liturgical % = 36% — flag in MT eval
  [linguistic-eval] FLORES-200 confirmed in pretrain — report as lower bound

LOW (1):
  [linguistic-ethics] Wikipedia CC-BY-SA propagation — noted in model card

All findings documented in workspace_state.md and model card. Pipeline is complete and release-ready.

Was this page helpful?