Overview

linguistic-corpus is the Acquire-phase specialist for monolingual data. It catalogs available sources for the target language (OLDI, CulturaX, MADLAD-400, Glot500, Wikipedia, Common Crawl), applies paragraph-level language identification, runs MinHash deduplication with low-resource-appropriate thresholds, and produces a reproducible corpus manifest.

Corpus quality problems are far cheaper to fix before training than after. A contaminated eval set discovered post-training leaves two options: transparent disclosure in the model card or an expensive retrain. A Bible-dominated corpus produces a model that sounds archaic in everyday use. Article counts are not a quality signal either: Cebuano's 6M-article Wikipedia is mostly bot-generated near-duplicates, so its size does not imply Joshi class 5 data quality.

Every dataset identified by this skill must pass through linguistic-ethics before entering the mix.

Pipeline Position

This skill operates in Phase 1 — Acquire of the linguistic pipeline.

Preceding skills: linguistic-scope (Joshi class, language identity), linguistic-scripts (normalization policy), linguistic-ethics (per-dataset gate)

Following skills: linguistic-tokenize (fertility audit on corpus), linguistic-bitext (if parallel data also needed), linguistic-transfer (training data planning)

When It Activates

  • User asks "where do I get data for [language]"
  • Building a monolingual training corpus from heterogeneous sources
  • Diagnosing model behavior suggesting corpus problems (register-collapse, eval-set memorization, domain bias)
  • Auditing an existing corpus before training (dedup stats, contamination, register balance)
  • Routed by linguistic-orchestrator at the start of Acquire phase

When NOT to use: For parallel/bitext data → linguistic-bitext. For tokenizer-level audit → linguistic-tokenize. For per-dataset ethics → always route through linguistic-ethics first.

What It Does

  • Enumerates candidate corpora with source URL, size estimate, license, register distribution, and known issues
  • Applies paragraph-level language-ID: GlotLID (low-resource), FastText 176-lang (high-resource speed), CLD3 as fallback — never document-level LID on mixed-script or code-switched corpora
  • Routes to linguistic-scripts for NFC + confusable-fold normalization before MinHash — otherwise look-alike duplicates survive
  • Runs MinHash deduplication with low-resource defaults: num_perm=256, threshold=0.9 (not the standard 0.8, which over-merges short texts and loses 20–30% of valid distinct entries), shingle_size=5 chars (Latin/Cyrillic) or 3 (Han/Indic)
  • Runs two-sided contamination audit: (a) train mix vs project eval set; (b) eval set vs base-model pretrain proxies (FLORES-200 is in many pretrain mixes — report as lower bound, not fair eval)
  • Reports register balance and flags: Bible >30% (archaic register risk), news >70% (event-bias risk), web-only (no register diversity)
  • Produces a reproducible corpus manifest
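
The normalization and MinHash steps listed above can be sketched in plain Python. This is a minimal illustration, not the skill's actual implementation: the function names are hypothetical, only NFC normalization is shown (confusable folding is left to linguistic-scripts), and the pairwise scan is quadratic, so a real corpus would need an LSH index such as the one in the `datasketch` library. The parameter values (256 permutations, threshold 0.9, 5-character shingles) come from the defaults in the list.

```python
import hashlib
import unicodedata

NUM_PERM = 256    # number of hash functions (default from the list above)
THRESHOLD = 0.9   # low-resource near-duplicate threshold
SHINGLE_SIZE = 5  # characters; the list suggests 3 for Han/Indic scripts


def shingles(text: str, size: int = SHINGLE_SIZE) -> set[str]:
    """NFC-normalize, collapse whitespace, and return character shingles."""
    norm = " ".join(unicodedata.normalize("NFC", text).split())
    if len(norm) <= size:
        return {norm}
    return {norm[i:i + size] for i in range(len(norm) - size + 1)}


def minhash(text: str, num_perm: int = NUM_PERM) -> list[int]:
    """Signature: the minimum hash value per seeded hash function."""
    encoded = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")  # one salted hash per "permutation"
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "little"
            )
            for s in encoded
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def dedup(paragraphs: list[str], threshold: float = THRESHOLD) -> list[str]:
    """Keep the first paragraph of each near-duplicate group (O(n^2) scan)."""
    kept: list[tuple[str, list[int]]] = []
    for paragraph in paragraphs:
        sig = minhash(paragraph)
        if all(estimated_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((paragraph, sig))
    return [paragraph for paragraph, _ in kept]


if __name__ == "__main__":
    sample = [
        "Habari ya asubuhi, karibu sana nyumbani kwetu.",
        "  Habari ya  asubuhi, karibu sana nyumbani kwetu. ",  # whitespace variant
        "Soko la samaki liko karibu na bandari ya zamani.",
    ]
    print(len(dedup(sample)))  # prints 2: the whitespace variant is dropped
```

Normalizing before shingling is what lets the composed and decomposed forms of the same string hash identically; without it, look-alike duplicates would produce different shingles and survive the pass.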

Example Usage

Target: Swahili (swa), Joshi Class 3

## Corpus Manifest: Swahili (build 2026-05-22)

| Source          | Size  | License      | Register %           | Notes                    |
|-----------------|-------|--------------|----------------------|--------------------------|
| MADLAD-400 swa  | 2.1GB | CC-BY-4.0    | web 70%, wiki 20%    | Overlaps CulturaX; dedup |
| Wikipedia (swa) | 180MB | CC-BY-SA-4.0 | encyclopedic 100%    | SA propagation note      |
| OPUS Bible      | 4MB   | CC-BY-4.0    | liturgical 100%      | Flag: archaic register   |

**Total tokens (post-dedup):** 420M
**Dedup ratio:** 18% removed (threshold=0.88, num_perm=256)
**Contamination check:** PASS — no FLORES-200 overlap detected
**Register balance:** web 62% / wiki 25% / liturgical 5% / news 8% — acceptable
**Recommended next step:** linguistic-tokenize for fertility audit
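
The register-balance verdict in the manifest above can be reproduced with a small check against the flag thresholds from "What It Does" (Bible over 30%, news over 70%, web-only). This is a hedged sketch: the function name and the dictionary layout are hypothetical, and it assumes Bible text is counted under the "liturgical" register label, as the OPUS Bible row suggests.

```python
def register_flags(shares: dict[str, float]) -> list[str]:
    """Return register-balance flags for a corpus.

    `shares` maps a register label to its percentage of the corpus (0-100).
    Assumption: Bible text is reported under the "liturgical" label.
    """
    flags = []
    if shares.get("liturgical", 0.0) > 30:
        flags.append("archaic register risk: Bible/liturgical > 30%")
    if shares.get("news", 0.0) > 70:
        flags.append("event-bias risk: news > 70%")
    present = {name for name, pct in shares.items() if pct > 0}
    if present == {"web"}:
        flags.append("no register diversity: web-only")
    return flags


if __name__ == "__main__":
    # Register balance from the Swahili manifest above.
    swahili = {"web": 62, "wiki": 25, "liturgical": 5, "news": 8}
    print(register_flags(swahili) or "acceptable")  # prints acceptable
```

The thresholds are strict inequalities, so a corpus at exactly 30% liturgical text is not flagged in this sketch.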
  • Data Loading — complementary data ingestion for multilingual datasets