Overview
linguistic-corpus is the Acquire-phase specialist for monolingual data. It catalogs available sources for the target language (OLDI, CulturaX, MADLAD-400, Glot500, Wikipedia, Common Crawl), applies paragraph-level language identification, runs MinHash deduplication with low-resource-appropriate thresholds, and produces a reproducible corpus manifest.
Corpus quality problems are vastly cheaper to fix before training than after. A contaminated eval set discovered post-training means either transparent disclosure in the model card or an expensive retrain. A Bible-dominated corpus produces a model that sounds archaic in everyday use. Article count is not a quality signal either: Cebuano's Wikipedia has about 6M articles, but most are bot-generated near-duplicates, so its size does not imply Joshi class 5 data quality.
Every dataset identified by this skill must pass through linguistic-ethics before entering the mix.
Pipeline Position
This skill operates in Phase 1 — Acquire of the linguistic pipeline.
Preceding skills: linguistic-scope (Joshi class, language identity), linguistic-scripts (normalization policy), linguistic-ethics (per-dataset gate)
Following skills: linguistic-tokenize (fertility audit on corpus), linguistic-bitext (if parallel data also needed), linguistic-transfer (training data planning)
When It Activates
- User asks "where do I get data for [language]"
- Building a monolingual training corpus from heterogeneous sources
- Diagnosing model behavior suggesting corpus problems (register-collapse, eval-set memorization, domain bias)
- Auditing an existing corpus before training (dedup stats, contamination, register balance)
- Routed by linguistic-orchestrator at the start of the Acquire phase
When NOT to use: For parallel/bitext data → linguistic-bitext. For tokenizer-level audit → linguistic-tokenize. For per-dataset ethics → always route through linguistic-ethics first.
What It Does
- Enumerates candidate corpora with source URL, size estimate, license, register distribution, and known issues
- Applies paragraph-level language-ID: GlotLID (low-resource), FastText 176-lang (high-resource speed), CLD3 as fallback — never document-level LID on mixed-script or code-switched corpora
- Routes to linguistic-scripts for NFC + confusable-fold normalization before MinHash; otherwise look-alike duplicates survive
- Runs MinHash deduplication with low-resource defaults: num_perm=256, threshold=0.9 for low-resource (not 0.8; the standard threshold over-merges short texts, losing 20–30% of valid distinct entries), shingle_size=5 chars (Latin/Cyrillic) or 3 (Han/Indic)
- Runs two-sided contamination audit: (a) train mix vs project eval set; (b) eval set vs base-model pretrain proxies (FLORES-200 is in many pretrain mixes, so report scores on it as a lower bound, not a fair eval)
- Reports register balance and flags: Bible >30% (archaic register risk), news >70% (event-bias risk), web-only (no register diversity)
- Produces a reproducible corpus manifest
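The deduplication step above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `dedup`) are hypothetical, seeded BLAKE2b hashes stand in for MinHash permutations, and paragraphs are compared pairwise rather than through an LSH index, so it only suits small samples. Only NFC normalization and whitespace collapsing are applied here; confusable folding is assumed to have been done upstream by linguistic-scripts.

```python
import hashlib
import unicodedata


def shingles(text: str, size: int = 5) -> set[str]:
    """Character shingles of NFC-normalized, whitespace-collapsed text."""
    norm = " ".join(unicodedata.normalize("NFC", text).split())
    if len(norm) <= size:
        return {norm}  # very short text: treat the whole string as one shingle
    return {norm[i:i + size] for i in range(len(norm) - size + 1)}


def minhash_signature(text: str, num_perm: int = 256, size: int = 5) -> list[int]:
    """One minimum per seeded hash; each seed stands in for one permutation."""
    grams = [g.encode("utf-8") for g in shingles(text, size)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big"
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(paragraphs: list[str], threshold: float = 0.9,
          num_perm: int = 256, size: int = 5) -> list[str]:
    """Keep the first paragraph of each near-duplicate group (brute force)."""
    kept: list[str] = []
    kept_sigs: list[list[int]] = []
    for paragraph in paragraphs:
        sig = minhash_signature(paragraph, num_perm, size)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(paragraph)
            kept_sigs.append(sig)
    return kept
```

Calling `dedup(paragraphs)` uses the low-resource defaults listed above (num_perm=256, threshold=0.9, 5-character shingles); passing `size=3` would match the Han/Indic setting. At corpus scale, a library with LSH banding would replace the pairwise loop.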
Example Usage
Target: Swahili (swa), Joshi Class 3
## Corpus Manifest: Swahili (build 2026-05-22)
| Source | Size | License | Register % | Notes |
|-----------------|-------|--------------|----------------------|--------------------------|
| MADLAD-400 swa | 2.1GB | CC-BY-4.0 | web 70%, wiki 20% | Overlaps CulturaX; dedup |
| Wikipedia (swa) | 180MB | CC-BY-SA-4.0 | encyclopedic 100%    | SA propagation note      |
| OPUS Bible | 4MB | CC-BY-4.0 | liturgical 100% | Flag: archaic register |
**Total tokens (post-dedup):** 420M
**Dedup ratio:** 18% removed (threshold=0.88, num_perm=256)
**Contamination check:** PASS — no FLORES-200 overlap detected
**Register balance:** web 62% / wiki 25% / liturgical 5% / news 8% — acceptable
**Recommended next step:** linguistic-tokenize for fertility audit
Related Skills
- linguistic-scope — provides Joshi class and language identity before catalog enumeration
- linguistic-ethics — per-dataset ethics check before any source enters the mix
- linguistic-scripts — Unicode normalization + confusable fold before MinHash
- linguistic-tokenize — fertility audit after corpus is curated
- linguistic-bitext — if parallel data is also needed
Related Skills from Other Suites
- Data Loading — complementary data ingestion for multilingual datasets