Tutorial: Build a Corpus

Difficulty: Medium
Time: 45–60 minutes
Skills used: linguistic-corpus, linguistic-bitext, linguistic-ethics
Language: Twi (twi) — continuing from Explore a Language Profile

This tutorial covers the Acquire phase: finding and vetting monolingual corpus sources, running language identification (LID) and deduplication, auditing for eval contamination, and mining parallel data for MT.

Prerequisites

Completed the Explore a Language Profile tutorial (or have a workspace_state.md in which the Twi Scope phase is complete)
Linguistic Agent Skills installed

Step 1 — Resume the Session

Open a Claude Code session in your Twi project directory:

/linguistic:status

Expected:

[Phase: Scope | Language: Twi (twi) | Resource Class: 2 | Last skill: linguistic-ethics | Open findings: 0]

Advance to the Acquire phase:

what data is available for Twi?

The orchestrator routes to linguistic-corpus.

Step 2 — Corpus Catalog

linguistic-corpus enumerates candidate sources:

Corpus Catalog: Twi (twi)

Source          | Size  | License      | Register %                  | Notes
----------------|-------|--------------|-----------------------------|-----------------------
CulturaX (twi)  | 45MB  | CC-BY-4.0    | web 80%, social 15%, news 5%| Quality varies; web crawl
Wikipedia (twi) | 8MB   | CC-BY-SA-3.0 | encyclopedic 100%           | Small but quality; SA note
Bible-NLP (twi) | 3.5MB | CC-BY-4.0    | liturgical 100%             | Archaic register; tone-marked
OPUS (twi)      | 12MB  | various      | religious 60%, news 40%     | Mixed; check per-source licenses

Finding generated: Bible-NLP + OPUS religious data = 48% of catalog — HIGH finding (>30% threshold).

Step 3 — Ethics Check

Before proceeding, linguistic-ethics reviews each candidate:

Ethics Assessment: Twi Corpus Sources

CulturaX (twi): CC-BY-4.0 — PASS (standard attribution)
Wikipedia (twi): CC-BY-SA-3.0 — PASS (note ShareAlike propagation)
Bible-NLP (twi): CC-BY-4.0 — NEEDS-WORK
  - Archaic/liturgical register (tone language: diacritics present, good)
  - Community use norms: flag for model card; non-commercial generative use preferred
OPUS religious slice: CC-BY-4.0 — PASS (note register)

Mix compatibility: OK for open release (no NC in mix; SA from Wikipedia)
SA propagation note: Wikipedia slice → model outputs should be CC-BY-SA compatible

Recommendation: Proceed with all sources; flag Bible% in manifest; limit to ≤30% of final mix
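
The "Mix compatibility" line above can be illustrated with a small check over the license identifiers listed in the assessment. This is a minimal sketch, not the skill's actual logic: the function name `mix_compatibility` is hypothetical, and it assumes Creative Commons style identifiers where NonCommercial and ShareAlike appear as `NC` and `SA` components. A real license review needs more than string matching.

```python
def mix_compatibility(licenses):
    """Summarise a license mix; `licenses` maps source name -> license identifier.

    Hypothetical helper: it only looks for NonCommercial (NC) and ShareAlike (SA)
    components in identifiers such as "CC-BY-SA-3.0".
    """
    nc_sources = []
    sa_sources = []
    for source, license_id in licenses.items():
        parts = license_id.upper().split("-")
        if "NC" in parts:
            nc_sources.append(source)
        if "SA" in parts:
            sa_sources.append(source)
    return {
        "open_release_ok": not nc_sources,  # any NC source blocks an open release
        "nc_sources": nc_sources,
        "sa_sources": sa_sources,           # ShareAlike terms propagate downstream
    }


# License identifiers copied from the ethics assessment above.
twi_sources = {
    "CulturaX (twi)": "CC-BY-4.0",
    "Wikipedia (twi)": "CC-BY-SA-3.0",
    "Bible-NLP (twi)": "CC-BY-4.0",
    "OPUS religious slice": "CC-BY-4.0",
}
report = mix_compatibility(twi_sources)
print(report)
```

With these inputs the sketch reports no NC sources and flags only the Wikipedia slice for ShareAlike propagation, which matches the assessment.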

Step 4 — Language-ID and Normalization

linguistic-corpus applies paragraph-level LID with GlotLID:

Language-ID Audit: CulturaX Twi slice

Per-paragraph LID results:
  twi (confirmed): 78.4%
  en (English inclusions): 12.1%
  ak (generic Akan): 6.3%
  other: 3.2%

Action: filter to twi-confirmed paragraphs (78.4%); remove English and other-language inclusions
Effective size after LID filter: 35MB (from 45MB)

Normalization applied: NFC (from workspace script policy); diacritics validated as present.
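
The filter-then-normalize step can be sketched as follows. This is a minimal illustration under stated assumptions, not the skill's implementation. The classifier is injected as a `predict` callable so the example runs without downloading a model; in a real run it would wrap GlotLID. The `filter_paragraphs` and `toy_predict` names, the `"twi"` label string, and the `min_confidence` default are all hypothetical.

```python
import unicodedata


def filter_paragraphs(paragraphs, predict, keep_label="twi", min_confidence=0.5):
    """Keep NFC-normalised paragraphs whose predicted language is `keep_label`.

    `predict` is any callable returning (label, confidence) for a string.
    The label format and the confidence cut-off are assumptions of this sketch.
    """
    kept = []
    for paragraph in paragraphs:
        text = unicodedata.normalize("NFC", paragraph.strip())
        if not text:
            continue  # skip empty paragraphs
        label, confidence = predict(text)
        if label == keep_label and confidence >= min_confidence:
            kept.append(text)
    return kept


def toy_predict(text):
    """Stand-in classifier for the demo only: flags a few English function words."""
    english_markers = {"the", "and", "is", "of"}
    words = set(text.lower().split())
    return ("en", 0.9) if words & english_markers else ("twi", 0.9)


sample = [
    "Me din de Kofi.",
    "The weather is nice today.",
    "Ɛte sɛn?",
]
kept = filter_paragraphs(sample, toy_predict)
print(kept)
```

Normalizing before classification means the same NFC text is both classified and stored, so later steps such as deduplication see one canonical form of each diacritic.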

Step 5 — Deduplication

MinHash deduplication with low-resource settings:

Deduplication: Twi Corpus

Config: num_perm=256, threshold=0.9 (low-resource setting), shingle_size=5
Applied after: NFC normalization + TR39 confusable fold (dedup key only)

Results:
  CulturaX (post-LID): 35MB → 29MB (17% removed)
  Wikipedia: 8MB → 7.8MB (2.5% removed — mostly templates)
  Bible-NLP: 3.5MB → 3.5MB (0% — already clean; liturgical text)
  OPUS (twi): 12MB → 10.2MB (15% removed)

Total post-dedup: 50.5MB / ~12M tokens

Note: the threshold is 0.9 rather than the standard 0.8, which would have over-merged short texts and lost ~25% of valid Twi entries.
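
To make the configuration concrete, here is a pure-Python MinHash sketch using the same `num_perm`, `threshold`, and `shingle_size` values. It is an illustration only, with several assumptions. Shingles are taken over characters, because the tutorial does not say whether `shingle_size` counts characters or words. The TR39 confusable fold is omitted. Documents are compared greedily pair by pair, whereas a production run would normally use an LSH index. All function names are hypothetical.

```python
import hashlib
import random
import unicodedata

NUM_PERM = 256        # values taken from the config above
THRESHOLD = 0.9
SHINGLE_SIZE = 5      # assumed to mean character 5-grams
_PRIME = (1 << 61) - 1

# One (a, b) pair per simulated permutation: h -> (a * h + b) mod p.
_rng = random.Random(0)
_COEFFS = [(_rng.randrange(1, _PRIME), _rng.randrange(0, _PRIME))
           for _ in range(NUM_PERM)]


def shingles(text, k=SHINGLE_SIZE):
    """Set of character k-grams of the NFC-normalised text."""
    text = unicodedata.normalize("NFC", text)
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def base_hash(shingle):
    """Stable 64-bit hash of one shingle."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def signature(text):
    """MinHash signature: the minimum permuted hash for each permutation."""
    hashes = [base_hash(s) for s in shingles(text)]
    return [min((a * h + b) % _PRIME for h in hashes) for a, b in _COEFFS]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs, threshold=THRESHOLD):
    """Keep the first document of each near-duplicate group (greedy, O(n^2))."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = signature(doc)
        if any(estimated_jaccard(sig, s) >= threshold for s in kept_sigs):
            continue
        kept.append(doc)
        kept_sigs.append(sig)
    return kept


docs = ["Me din de Kofi.", "Me din de Kofi.", "Ɛte sɛn? Me ho yɛ."]
unique_docs = dedup(docs)
print(unique_docs)
```

Short texts have few shingles, so a single changed word moves their Jaccard similarity a long way. That is one reason a stricter threshold can matter more for short entries than for long documents.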

Step 6 — Contamination Audit

Two-sided contamination check:

Contamination Audit: Twi

(a) Train mix vs project eval set: PASS — no overlap detected
(b) Eval set vs base-model pretrain:
  FLORES-200 (twi): CONFIRMED in NLLB pretrain + likely in Llama-3 cutoff
  → Report FLORES-200 as lower bound on quality; use NTREX or custom held-out as primary

Finding (MEDIUM): FLORES-200 contamination — report as lower bound
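
Check (a) can be approximated with exact word n-gram matching between the training mix and the eval set. The tutorial does not state which method the skill uses, so the sketch below is one common approach rather than a description of it. The function names and the default `n=8` are assumptions. Check (b) cannot be reproduced this way unless the base model's pretraining data is available.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in `text` (empty if fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(train_docs, eval_items, n=8):
    """Return eval items that share at least one word n-gram with the training docs."""
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= word_ngrams(doc, n)
    return [item for item in eval_items if word_ngrams(item, n) & train_ngrams]


# Toy data; n=4 keeps the example short.
train = ["one two three four five six", "alpha beta gamma delta"]
eval_items = [
    "zero one two three four nine",            # shares "one two three four"
    "completely unrelated held out sentence",  # no shared 4-gram
]
hits = contaminated_items(train, eval_items, n=4)
print(hits)
```

Eval items shorter than `n` words can never match under this scheme, so a smaller `n` or a character-level variant may be needed for short sentences.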

Step 7 — Register Balance

Final register composition:

Register Balance: Twi Corpus

Source             | Tokens | Register      | %
CulturaX filtered  | 6.9M   | web/social    | 58%
Wikipedia          | 1.9M   | encyclopedic  | 16%
Bible-NLP          | 0.8M   | liturgical    | 7%   ← within ≤30% threshold
OPUS (news)        | 2.3M   | news          | 19%

Total: 11.9M tokens
Register: web 58%, news 19%, encyclopedic 16%, liturgical 7% — BALANCED ✓
Bible% = 7% — within acceptable range
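
The percentages in the table follow directly from the token counts. The short sketch below recomputes them and checks the liturgical share against the ≤30% cap from the ethics recommendation; the variable and function names are illustrative.

```python
BIBLE_CAP = 0.30  # from the ethics recommendation: liturgical share <= 30%

# Token counts from the register balance table above.
token_counts = {
    "web/social": 6.9e6,
    "encyclopedic": 1.9e6,
    "liturgical": 0.8e6,
    "news": 2.3e6,
}


def register_shares(counts):
    """Return each register's fraction of the total token count."""
    total = sum(counts.values())
    return {register: tokens / total for register, tokens in counts.items()}


shares = register_shares(token_counts)
for register, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{register:13s} {share:5.1%}")
print("liturgical within cap:", shares["liturgical"] <= BIBLE_CAP)
```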

Step 8 — Parallel Data (Bitext)

For MT work, linguistic-bitext mines English-Twi parallel data:

Bitext Plan: English ↔ Twi

Embedding: SONAR (Niger-Congo languages; LASER3 coverage gap for Kwa)
Aligner: Vecalign
Margin threshold: 1.04 + 50-pair manual spot-check (Class 2 target)
Length-ratio filter: [0.45, 2.5]

Sources mined:
  Bible-NLP parallel: 31,000 pairs (after filter)
  OPUS (news + subtitles): 8,400 pairs
  Synthetic (back-translation T=0.8): 40,000 pairs

Total bitext: ~79,000 pairs
Register: liturgical 39%, news 11%, synthetic 50%

Finding (MEDIUM): liturgical % = 39% — above 30% threshold for archaic-register MT risk
Action: supplement with contemporary web parallel if available; flag in eval
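
The two numeric filters in the bitext plan can be sketched as a single predicate applied to candidate pairs that already carry a margin score. This is a hedged illustration: the `keep_pair` name is hypothetical, the margin scores are invented for the example, and the length ratio is assumed to be target characters divided by source characters, which the plan does not specify.

```python
MARGIN_THRESHOLD = 1.04            # from the bitext plan
LENGTH_RATIO_RANGE = (0.45, 2.5)   # from the bitext plan


def keep_pair(source, target, margin,
              margin_threshold=MARGIN_THRESHOLD,
              ratio_range=LENGTH_RATIO_RANGE):
    """Apply the margin and length-ratio filters to one candidate pair.

    Assumption: length ratio = len(target) / len(source), measured in characters.
    """
    if not source or not target:
        return False
    ratio = len(target) / len(source)
    low, high = ratio_range
    return margin >= margin_threshold and low <= ratio <= high


# Illustrative candidates with invented margin scores.
candidates = [
    ("What is your name?", "Wo din de sɛn?", 1.12),  # passes both filters
    ("Thank you.", "Medaase.", 1.01),                # margin below threshold
    ("Yes.", "Aane, medaase paa.", 1.20),            # target too long for the source
]
kept_pairs = [(s, t) for s, t, m in candidates if keep_pair(s, t, m)]
print(kept_pairs)
```

The third candidate shows why the length-ratio filter is applied alongside the margin score: a high-scoring pair can still be a partial or padded translation. Pairs that pass both filters would then go to the 50-pair manual spot-check.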

Step 9 — Update Workspace State

/linguistic:status

Expected:

[Phase: Acquire | Language: Twi (twi) | Resource Class: 2 | Last skill: linguistic-bitext | Open findings: 3]

Check findings:

/linguistic:findings

Expected output:

HIGH (1):
  [linguistic-corpus] Initial catalog: Bible+religious data = 48% of raw catalog
  Action: completed — reduced to 7% in final mix ✓ (resolved)

MEDIUM (2):
  [linguistic-eval] FLORES-200 contamination confirmed
  Action: report as lower bound; use NTREX as primary

  [linguistic-bitext] Bitext liturgical % = 39% (threshold: 30%)
  Action: flag in MT eval; supplement if contemporary parallel found

LOW (0): none

Step 10 — Next Steps

Your Acquire phase artifacts:

workspace_state.md — updated with corpus manifest, dedup stats, bitext plan
11.9M token monolingual corpus (balanced register)
79K sentence pairs (with liturgical-register flag)
2 open MEDIUM findings (documented); the HIGH finding is resolved

Next steps:

Run linguistic-tokenize for fertility audit on Twi corpus
Run linguistic-transfer for LoRA adapter configuration
Full pipeline tutorial — all phases end-to-end
