Tutorial: Build a Corpus
Difficulty: Medium
Time: 45–60 minutes
Skills used: linguistic-corpus, linguistic-bitext, linguistic-ethics
Language: Twi (twi) — continuing from Explore a Language Profile
This tutorial covers the Acquire phase: finding and vetting monolingual corpus sources, running language-ID and dedup, auditing for eval contamination, and mining parallel data for MT.
Prerequisites
- Completed Explore a Language Profile (or have workspace_state.md with Twi scope complete)
- Linguistic Agent Skills installed
Step 1 — Resume the Session
Open a Claude Code session in your Twi project directory:
/linguistic:status
Expected:
[Phase: Scope | Language: Twi (twi) | Resource Class: 2 | Last skill: linguistic-ethics | Open findings: 0]
Advance to the Acquire phase:
what data is available for Twi?
The orchestrator routes to linguistic-corpus.
Step 2 — Corpus Catalog
linguistic-corpus enumerates candidate sources:
Corpus Catalog: Twi (twi)
Source          | Size  | License      | Register %                   | Notes
----------------|-------|--------------|------------------------------|---------------------------------
CulturaX (twi)  | 45MB  | CC-BY-4.0    | web 80%, social 15%, news 5% | Quality varies; web crawl
Wikipedia (twi) | 8MB   | CC-BY-SA-3.0 | encyclopedic 100%            | Small but quality; SA note
Bible-NLP (twi) | 3.5MB | CC-BY-4.0    | liturgical 100%              | Archaic register; tone-marked
OPUS (twi)      | 12MB  | various      | religious 60%, news 40%      | Mixed; check per-source licenses
Finding generated: Bible-NLP + OPUS religious data = 48% of catalog — HIGH finding (>30% threshold).
Step 3 — Ethics Check
Before proceeding, linguistic-ethics reviews each candidate:
Ethics Assessment: Twi Corpus Sources
CulturaX (twi): CC-BY-4.0 — PASS (standard attribution)
Wikipedia (twi): CC-BY-SA-3.0 — PASS (note ShareAlike propagation)
Bible-NLP (twi): CC-BY-4.0 — NEEDS-WORK
- Archaic/liturgical register (tone language: diacritics present, good)
- Community use norms: flag for model card; non-commercial generative use preferred
OPUS religious slice: CC-BY-4.0 — PASS (note register)
Mix compatibility: OK for open release (no NC in mix; SA from Wikipedia)
SA propagation note: Wikipedia slice → model outputs should be CC-BY-SA compatible
Recommendation: Proceed with all sources; flag Bible% in manifest; limit to ≤30% of final mix
Step 4 — Language-ID and Normalization
linguistic-corpus applies paragraph-level LID with GlotLID:
Language-ID Audit: CulturaX Twi slice
Per-paragraph LID results:
twi (confirmed): 78.4%
en (English inclusions): 12.1%
ak (generic Akan): 6.3%
other: 3.2%
Action: filter to twi-confirmed paragraphs (78.4%); remove English and other-language inclusions
Effective size after LID filter: 35MB (from 45MB)
Normalization applied: NFC (from workspace script policy), diacritics validated present.
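The filter-then-normalize pass above can be pictured with a minimal sketch. It assumes a hypothetical `predict_lang` callable standing in for a real GlotLID model (GlotLID itself is not loaded here), splits a document into paragraphs on blank lines, applies NFC with Python's standard `unicodedata` module, and keeps only paragraphs labelled `twi`. The toy classifier and sample text are illustrative only, not how linguistic-corpus is implemented.

```python
import unicodedata
from typing import Callable, List


def filter_and_normalize(
    document: str,
    predict_lang: Callable[[str], str],
    keep: str = "twi",
) -> List[str]:
    """Keep paragraphs whose predicted label equals `keep`, NFC-normalized."""
    kept = []
    for para in document.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Normalize before classification so composed and decomposed
        # forms of the same character are treated identically.
        para = unicodedata.normalize("NFC", para)
        if predict_lang(para) == keep:
            kept.append(para)
    return kept


def toy_predict_lang(paragraph: str) -> str:
    """Hypothetical stand-in for a GlotLID call: labels text containing
    the letters ɛ or ɔ as 'twi' and everything else as 'en'."""
    return "twi" if any(ch in "ɛɔƐƆ" for ch in paragraph) else "en"


sample = "Ɛte sɛn?\n\nClick here to subscribe.\n\nYɛda mo ase."
paragraphs = filter_and_normalize(sample, toy_predict_lang)
print(paragraphs)
```

Passing the classifier in as a parameter keeps the filtering logic testable without the model; in a real run the callable would wrap whatever LID model the project uses.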
Step 5 — Deduplication
MinHash deduplication with low-resource settings:
Deduplication: Twi Corpus
Config: num_perm=256, threshold=0.9 (low-resource setting), shingle_size=5
Applied after: NFC normalization + TR39 confusable fold (dedup key only)
Results:
CulturaX (post-LID): 35MB → 29MB (17% removed)
Wikipedia: 8MB → 7.8MB (2.5% removed — mostly templates)
Bible-NLP: 3.5MB → 3.5MB (0% — already clean; liturgical text)
OPUS (twi): 12MB → 10.2MB (15% removed)
Total post-dedup: 50.5MB / ~12M tokens
Note: the threshold is 0.9 rather than the standard 0.8, which would have over-merged short texts and lost ~25% of valid Twi entries.
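To make the configuration above concrete, here is a minimal pure-Python sketch of MinHash near-duplicate removal with `num_perm=256`, `threshold=0.9`, and 5-shingles. Several details are assumptions rather than facts about linguistic-corpus: shingles are taken to be character 5-grams, the dedup key uses NFC plus casefolding and whitespace collapsing (the TR39 confusable fold is omitted because it needs external data), and candidates are compared pairwise instead of through LSH banding.

```python
import hashlib
import random
import unicodedata
from typing import List, Set

NUM_PERM = 256          # matches num_perm in the config above
THRESHOLD = 0.9         # low-resource setting from the config above
SHINGLE_SIZE = 5        # assumed to mean character 5-grams
_PRIME = (1 << 61) - 1  # Mersenne prime for the universal hash family

_rng = random.Random(0)  # fixed seed so signatures are reproducible
_COEFFS = [(_rng.randrange(1, _PRIME), _rng.randrange(0, _PRIME))
           for _ in range(NUM_PERM)]


def dedup_key(text: str) -> str:
    """NFC + casefold + whitespace collapse (the TR39 fold is omitted here)."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def shingles(text: str, size: int = SHINGLE_SIZE) -> Set[int]:
    """Hash every character n-gram of the dedup key to a 64-bit integer."""
    key = dedup_key(text)
    grams = {key[i:i + size] for i in range(max(1, len(key) - size + 1))}
    return {
        int.from_bytes(
            hashlib.blake2b(g.encode("utf-8"), digest_size=8).digest(), "big")
        for g in grams
    }


def signature(text: str) -> List[int]:
    """MinHash signature: the minimum of each hash function over the shingles."""
    hashed = shingles(text)
    return [min((a * h + b) % _PRIME for h in hashed) for a, b in _COEFFS]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts: List[str], threshold: float = THRESHOLD) -> List[str]:
    """Keep a text only if it stays below the threshold against every kept text.
    Quadratic in the number of texts; a real pipeline would add LSH banding."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept


texts = ["Yɛda mo ase.", "Yɛda  mo ASE.", "Ɛte sɛn?"]
unique = deduplicate(texts)
print(unique)
```

The second string differs from the first only in case and spacing, so it collapses to the same dedup key and is dropped, while the original surface form of the first string is what gets kept.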
Step 6 — Contamination Audit
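Side (a) of the audit in this step, train mix versus project eval set, is essentially an overlap test. A minimal sketch follows, assuming word-level n-gram matching on NFC-normalized, lowercased text. The window of 8 words, the function names, and the English placeholder strings are illustrative choices, not settings taken from linguistic-corpus. Side (b) generally cannot be checked this way, since base-model pretraining data is rarely available for direct comparison.

```python
import unicodedata
from typing import Iterable, List, Set, Tuple

NGRAM = 8  # hypothetical window size; tune to the eval set's sentence length


def ngrams(text: str, n: int = NGRAM) -> Set[Tuple[str, ...]]:
    """Word n-grams of the NFC-normalized, lowercased text."""
    words = unicodedata.normalize("NFC", text).lower().split()
    if len(words) < n:
        # Short texts become a single whole-text tuple, so they only
        # match other short texts that are identical word for word.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated_eval_items(train_docs: Iterable[str],
                            eval_items: List[str],
                            n: int = NGRAM) -> List[int]:
    """Indices of eval items sharing at least one n-gram with the train mix."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(eval_items)
            if ngrams(item, n) & train_grams]


# Placeholder strings standing in for real train and eval data.
train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
eval_items = [
    "A sentence that never appears anywhere in the training mix at all",
    "He said the quick brown fox jumps over the lazy dog near home",
]
flagged = contaminated_eval_items(train, eval_items)
print(flagged)
```

An empty result corresponds to the PASS verdict for side (a); any returned index points at an eval item to remove or report.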
Two-sided contamination check:
Contamination Audit: Twi
(a) Train mix vs project eval set: PASS — no overlap detected
(b) Eval set vs base-model pretrain:
FLORES-200 (twi): CONFIRMED in NLLB pretrain + likely in Llama-3 cutoff
→ Report FLORES-200 as lower bound on quality; use NTREX or custom held-out as primary
Finding (MEDIUM): FLORES-200 contamination — report as lower bound
Step 7 — Register Balance
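The register shares reported in this step are each register's token count divided by the corpus total, with the liturgical share compared against the 30% cap recommended in the ethics check. A minimal sketch of that arithmetic, using the token counts from this step's output (the function name `register_shares` and the dictionary layout are illustrative assumptions):

```python
from typing import Dict

LITURGICAL_CAP = 30.0  # percent; the <=30% limit from the ethics recommendation


def register_shares(tokens_by_register: Dict[str, float]) -> Dict[str, int]:
    """Each register's share of total tokens, rounded to a whole percent."""
    total = sum(tokens_by_register.values())
    return {name: round(100 * count / total)
            for name, count in tokens_by_register.items()}


# Token counts in millions, copied from the register table in this step.
tokens = {
    "web/social": 6.9,
    "encyclopedic": 1.9,
    "liturgical": 0.8,
    "news": 2.3,
}

shares = register_shares(tokens)
print(shares)
print("liturgical within cap:", shares["liturgical"] <= LITURGICAL_CAP)
```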
Final register composition:
Register Balance: Twi Corpus
Source            | Tokens | Register     | %
CulturaX filtered | 6.9M   | web/social   | 58%
Wikipedia         | 1.9M   | encyclopedic | 16%
Bible-NLP         | 0.8M   | liturgical   | 7%  ← within ≤30% threshold
OPUS (news)       | 2.3M   | news         | 19%
Total: 11.9M tokens
Register: web 58%, news 19%, encyclopedic 16%, liturgical 7% — BALANCED ✓
Bible% = 7% — within acceptable range
Step 8 — Parallel Data (Bitext)
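Two of the filters in this step's bitext plan, the margin threshold and the length-ratio window, are simple per-pair checks once each candidate pair has a margin score. A minimal sketch, assuming the margin scores were already computed by the mining stage and that the length ratio is measured in characters as source length over target length (the plan does not say which unit is used). `Pair`, `keep_pair`, and the example scores are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MARGIN_THRESHOLD = 1.04            # from the bitext plan
MIN_RATIO, MAX_RATIO = 0.45, 2.5   # length-ratio window from the bitext plan


@dataclass
class Pair:
    src: str       # English side
    tgt: str       # Twi side
    margin: float  # precomputed margin score from the embedding-based miner


def keep_pair(pair: Pair) -> bool:
    """True if the pair clears both the margin and the length-ratio filter."""
    if not pair.src or not pair.tgt:
        return False
    ratio = len(pair.src) / len(pair.tgt)  # assumed: character-length ratio
    return pair.margin >= MARGIN_THRESHOLD and MIN_RATIO <= ratio <= MAX_RATIO


def filter_pairs(pairs: List[Pair]) -> List[Pair]:
    """Apply both filters to every candidate pair."""
    return [p for p in pairs if keep_pair(p)]


candidates = [
    Pair("Thank you.", "Medaase.", 1.12),  # passes both filters
    Pair("Thank you.", "Medaase.", 1.01),  # margin below 1.04
    Pair("Yes.", "x" * 40, 1.20),          # ratio 0.1, below 0.45
]
kept = filter_pairs(candidates)
print(len(kept))
```

Pairs that survive these checks would still go through the 50-pair manual spot-check listed in the plan, which no automated filter replaces.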
For MT work, linguistic-bitext mines English-Twi parallel data:
Bitext Plan: English ↔ Twi
Embedding: SONAR (Niger-Congo languages; LASER3 coverage gap for Kwa)
Aligner: Vecalign
Margin threshold: 1.04 + 50-pair manual spot-check (Class 2 target)
Length-ratio filter: [0.45, 2.5]
Sources mined:
Bible-NLP parallel: 31,000 pairs (after filter)
OPUS (news + subtitles): 8,400 pairs
Synthetic (back-translation T=0.8): 40,000 pairs
Total bitext: ~79,000 pairs
Register: liturgical 39%, news 11%, synthetic 50%
Finding (MEDIUM): liturgical % = 39% — above 30% threshold for archaic-register MT risk
Action: supplement with contemporary web parallel if available; flag in eval
Step 9 — Update Workspace State
/linguistic:status
Expected:
[Phase: Acquire | Language: Twi (twi) | Resource Class: 2 | Last skill: linguistic-bitext | Open findings: 3]
Check findings:
/linguistic:findings
Expected output:
HIGH (1):
[linguistic-corpus] Initial catalog: Bible+religious data = 48% of raw catalog
Action: completed — reduced to 7% in final mix ✓ (resolved)
MEDIUM (2):
[linguistic-eval] FLORES-200 contamination confirmed
Action: report as lower bound; use NTREX as primary
[linguistic-bitext] Bitext liturgical % = 39% (threshold: 30%)
Action: flag in MT eval; supplement if contemporary parallel found
LOW (0): none
Step 10 — Next Steps
Your Acquire phase artifacts:
- workspace_state.md: updated with corpus manifest, dedup stats, and bitext plan
- 11.9M token monolingual corpus (balanced register)
- 79K sentence pairs (with liturgical-register flag)
- 2 open MEDIUM findings (documented)
Next steps:
- Run linguistic-tokenize for a fertility audit on the Twi corpus
- Run linguistic-transfer for LoRA adapter configuration
- Full pipeline tutorial: all phases end-to-end