linguistic-corpus

Overview

linguistic-corpus is the Acquire-phase specialist for monolingual data. It catalogs available sources for the target language (OLDI, CulturaX, MADLAD-400, Glot500, Wikipedia, Common Crawl), applies paragraph-level language identification, runs MinHash deduplication with low-resource-appropriate thresholds, and produces a reproducible corpus manifest.

Corpus quality problems are far cheaper to fix before training than after. A contaminated eval set discovered post-training means either transparent disclosure in the model card or an expensive retrain. A Bible-dominated corpus produces a model that sounds archaic in everyday use. Size is not quality either: Cebuano's 6M-article Wikipedia is mostly bot-generated near-duplicates, so it does not imply Joshi class 5 data quality.

Every dataset identified by this skill must pass through linguistic-ethics before entering the mix.

Pipeline Position

This skill operates in Phase 1 — Acquire of the linguistic pipeline.

Preceding skills: linguistic-scope (Joshi class, language identity), linguistic-scripts (normalization policy), linguistic-ethics (per-dataset gate)

Following skills: linguistic-tokenize (fertility audit on corpus), linguistic-bitext (if parallel data also needed), linguistic-transfer (training data planning)

When It Activates

User asks "where do I get data for [language]"
Building a monolingual training corpus from heterogeneous sources
Diagnosing model behavior suggesting corpus problems (register-collapse, eval-set memorization, domain bias)
Auditing an existing corpus before training (dedup stats, contamination, register balance)
Routed by linguistic-orchestrator at the start of Acquire phase

When NOT to use: For parallel/bitext data → linguistic-bitext. For tokenizer-level audit → linguistic-tokenize. For per-dataset ethics → always route through linguistic-ethics first.

What It Does

Enumerates candidate corpora with source URL, size estimate, license, register distribution, and known issues
Applies paragraph-level language-ID: GlotLID (low-resource), FastText 176-lang (high-resource speed), CLD3 as fallback — never document-level LID on mixed-script or code-switched corpora
Routes to linguistic-scripts for NFC + confusable-fold normalization before MinHash — otherwise look-alike duplicates survive
Runs MinHash deduplication with low-resource defaults: num_perm=256; threshold=0.9 (not the standard 0.8, which over-merges short texts and loses 20–30% of valid distinct entries); shingle_size=5 characters for Latin/Cyrillic or 3 for Han/Indic
Runs two-sided contamination audit: (a) train mix vs project eval set; (b) eval set vs base-model pretrain proxies (FLORES-200 is in many pretrain mixes — report as lower bound, not fair eval)
Reports register balance and flags: Bible >30% (archaic register risk), news >70% (event-bias risk), web-only (no register diversity)
Produces a reproducible corpus manifest
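
To make the MinHash step above concrete, here is a minimal pure-Python sketch using the stated defaults (num_perm=256, threshold=0.9, 5-character shingles). The helper names, the blake2b base hash, the fixed seed, and the all-pairs comparison are assumptions made for this illustration, not the skill's actual implementation. Only NFC normalization is applied here; confusable folding is left to linguistic-scripts.

```python
import hashlib
import random
import unicodedata
from itertools import combinations

NUM_PERM = 256          # signature length (default stated above)
THRESHOLD = 0.9         # low-resource default stated above
PRIME = (1 << 61) - 1   # modulus for the hash family (assumption)


def char_shingles(text, size=5):
    """Return the set of character n-grams after NFC and whitespace cleanup."""
    text = " ".join(unicodedata.normalize("NFC", text).split())
    if len(text) <= size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def base_hash(shingle):
    """Map a shingle to a stable 64-bit integer."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_family(num_perm=NUM_PERM, seed=0):
    """Seeded (a, b) pairs defining h -> (a * h + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_perm)]


def signature(shingles, family):
    """MinHash signature: the minimum hashed value under each (a, b) pair."""
    hashes = [base_hash(s) for s in shingles]
    return [min((a * h + b) % PRIME for h in hashes) for a, b in family]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicate_pairs(docs, threshold=THRESHOLD, shingle_size=5):
    """Index pairs whose estimated Jaccard similarity meets the threshold."""
    family = make_hash_family()
    sigs = [signature(char_shingles(d, shingle_size), family) for d in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if estimated_jaccard(sigs[i], sigs[j]) >= threshold
    ]


docs = [
    "Habari ya asubuhi, karibu sana nyumbani kwetu.",
    "Habari  ya asubuhi,   karibu sana nyumbani kwetu.",  # whitespace variant
    "Mvua kubwa ilinyesha jana usiku katika mji wa Mombasa.",
]
print(near_duplicate_pairs(docs))
```

Comparing every pair is quadratic in the number of paragraphs, so this form only suits small samples; at corpus scale the same signatures would normally be bucketed with locality-sensitive hashing so that only candidate pairs are compared.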

Example Usage

Target: Swahili (swa), Joshi Class 3

## Corpus Manifest: Swahili (build 2026-05-22)

| Source          | Size  | License      | Register %           | Notes                    |
|-----------------|-------|--------------|----------------------|--------------------------|
| MADLAD-400 swa  | 2.1GB | CC-BY-4.0    | web 70%, wiki 20%    | Overlaps CulturaX; dedup |
| Wikipedia (swa) | 180MB | CC-BY-SA-4.0 | encyclopedic 100%    | SA propagation note      |
| OPUS Bible      | 4MB   | CC-BY-4.0    | liturgical 100%      | Flag: archaic register   |

**Total tokens (post-dedup):** 420M
**Dedup ratio:** 18% removed (threshold=0.88, num_perm=256)
**Contamination check:** PASS — no FLORES-200 overlap detected
**Register balance:** web 62% / wiki 25% / liturgical 5% / news 8% — acceptable
**Recommended next step:** linguistic-tokenize for fertility audit
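
The register-balance line in a manifest like this one can be checked mechanically against the flag thresholds listed under What It Does. The sketch below is one possible way to do that; the function name, the dictionary layout, and the choice to treat the "liturgical" label as the Bible share are assumptions made for illustration.

```python
BIBLE_MAX = 0.30  # Bible share above this: archaic register risk
NEWS_MAX = 0.70   # news share above this: event-bias risk


def register_flags(shares):
    """Return warning flags for a mapping of register name to corpus share (0..1)."""
    total = sum(shares.values())
    if abs(total - 1.0) > 0.01:
        raise ValueError(f"register shares sum to {total:.2f}, expected 1.00")

    flags = []
    # Assumption: the "liturgical" label stands in for Bible-derived text.
    if shares.get("liturgical", 0.0) > BIBLE_MAX:
        flags.append("archaic register risk")
    if shares.get("news", 0.0) > NEWS_MAX:
        flags.append("event-bias risk")
    # Web-only means no other register has a non-zero share.
    if {name for name, share in shares.items() if share > 0} == {"web"}:
        flags.append("no register diversity")
    return flags


# Shares taken from the example manifest's register balance line.
swahili = {"web": 0.62, "wiki": 0.25, "liturgical": 0.05, "news": 0.08}
print(register_flags(swahili) or "acceptable")
```

An empty flag list corresponds to the "acceptable" verdict in the manifest; any returned flag would be reported alongside the balance figures.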

Related Skills

linguistic-scope: provides Joshi class and language identity before catalog enumeration
linguistic-ethics: per-dataset ethics check before any source enters the mix
linguistic-scripts: Unicode normalization + confusable fold before MinHash
linguistic-tokenize: fertility audit after corpus is curated
linguistic-bitext: if parallel data is also needed

Related Skills from Other Suites

Data Loading: complementary data ingestion for multilingual datasets
