Linguistic Agent Skills
A suite of 18 Claude Code Skills bringing computational-linguistics expertise to AI engineers building LLMs for low-resource languages — the ~7,000 languages outside the English/Mandarin/Spanish frontier.
What This Suite Does
Building an LLM for Yoruba, Khmer, Quechua, Cantonese, or Twi requires linguistic decisions that pure-ML engineers routinely miss. The Linguistic Agent Skills suite captures those non-obvious decision points as agent skills: not introductions to linguistics (Claude already knows the textbooks), but the integration knowledge an experienced computational linguist applies when bridging theory to ML practice.
Every skill activates on natural-language triggers: mention a target language, ask about tokenizer fertility, or say "help me build an LLM for Yoruba", and the orchestrator routes the request to the right specialist automatically.
The 18 Skills
The suite consists of 14 specialist skills, 1 orchestrator, and 3 optional Mindset stubs organized across a 5-phase pipeline:
| Phase | Skills | Purpose |
|---|---|---|
| Scope | scope, scripts, tokenize, ethics | Language ID, resource class, typology, script policy, ethics seed |
| Acquire | corpus, bitext, transfer | Monolingual + parallel data, vocab/adapter strategy |
| Analyze | morph, syntax, annotate, semantics, discourse, speech | Linguistic-layer analysis as needed |
| Evaluate | eval | Honest metrics for the target language |
| Optional | codeswitch, historical, lexicon | Mindset stubs for Phase 4 decisions |
The orchestrator (linguistic-orchestrator) is the entry point — it coordinates routing and tracks workspace state across sessions.
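The phase table above can be restated as a plain mapping, which makes the 14 + 1 + 3 = 18 count easy to check. This is only an illustrative sketch: the names `PHASES`, `ORCHESTRATOR`, and `phase_of` are hypothetical and are not taken from the suite's own code or configuration.

```python
from typing import Dict, List

# Phase -> skills, copied from the table above (hypothetical representation).
PHASES: Dict[str, List[str]] = {
    "Scope": ["scope", "scripts", "tokenize", "ethics"],
    "Acquire": ["corpus", "bitext", "transfer"],
    "Analyze": ["morph", "syntax", "annotate", "semantics", "discourse", "speech"],
    "Evaluate": ["eval"],
    "Optional": ["codeswitch", "historical", "lexicon"],
}
ORCHESTRATOR = "linguistic-orchestrator"


def phase_of(skill: str) -> str:
    """Return the phase a skill belongs to, or raise KeyError if unknown."""
    for phase, skills in PHASES.items():
        if skill in skills:
            return phase
    raise KeyError(skill)


# Specialists are every skill outside the Optional row; stubs are the Optional row.
specialists = [s for phase, skills in PHASES.items() if phase != "Optional" for s in skills]
stubs = PHASES["Optional"]
total = len(specialists) + len(stubs) + 1  # +1 for the orchestrator

if __name__ == "__main__":
    print(len(specialists), len(stubs), total)
    print(phase_of("morph"))
```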
Low-Resource Language Focus
The suite is purpose-built for Joshi classes 0–4 — languages ranging from fully undocumented (Class 0) to benchmark-covered but resource-limited (Class 4). Every skill surfaces the resource-class implications of each decision, because the right strategy for Yoruba (Class 2) is fundamentally different from the right strategy for Turkish (Class 4) or English (Class 5).
Key capabilities:
- Language identification — ISO 639-3 + Glottolog code resolution, including macrolanguage disambiguation (Chinese → Mandarin/Cantonese/Wu)
- Script + Unicode handling — normalization policy, diacritic preservation for tone languages, TR39 confusable folding
- Tokenizer audits — fertility ratio analysis, vocab extension method selection (FOCUS/OFA/HyperOfa)
- Ethical compliance — FPIC/CARE principles, sacred-text gating, license compatibility, attribution lineage
- Data curation — corpus catalog, paragraph-level language-ID, MinHash dedup, contamination audit
- Bitext mining — LASER3/SONAR embeddings, Vecalign alignment, synthetic bitext generation
- Transfer learning — LoRA rank by URIEL typological distance, MAD-X adapters, catastrophic-forgetting mitigation
- Evaluation — chrF++/COMET/GEMBA-MQM, BLiMP-style grammatical probes, contamination-aware reporting
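To make two of the capabilities above concrete, here is a minimal sketch of a fertility measurement (average tokens per word) that normalizes to NFC first so tone diacritics are kept rather than stripped. It rests on several assumptions: words are whitespace-delimited (which does not hold for unsegmented scripts such as Khmer), and `byte_tokenize` is a hypothetical stand-in that emits one token per UTF-8 byte. A real audit would pass in the tokenizer of the model under study. None of these function names come from the suite itself.

```python
import unicodedata
from typing import Callable, List


def normalize(text: str) -> str:
    """Compose to NFC; combining marks with no precomposed form are kept as-is."""
    return unicodedata.normalize("NFC", text)


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = normalize(text).split()
    if not words:
        raise ValueError("text contains no words")
    n_tokens = sum(len(tokenize(word)) for word in words)
    return n_tokens / len(words)


def byte_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""
    return [f"{b:02x}" for b in word.encode("utf-8")]


if __name__ == "__main__":
    english = "good morning"
    # A Yoruba-style word written with decomposed dot-below and acute tone marks.
    yoruba = "o\u0323jo\u0323\u0301"
    print(fertility(english, byte_tokenize))
    print(fertility(yoruba, byte_tokenize))
```

Comparing the ratio on parallel text in the target language and in English gives a rough sense of how much a tokenizer over-segments the target language. Note that NFC leaves the acute tone mark in place as a combining character here, because no precomposed code point exists for that base letter plus tone combination.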
Integration with the MAGIC Ecosystem
The Linguistic Agent Skills suite is a sibling repo to the Data Agent Skills suite. They share structural patterns and can be used together: the data suite handles general-purpose tabular and text data pipelines; the linguistic suite handles language-specific decisions within those pipelines.
See Cross-Suite Integration for patterns on using both suites together.
Suite Status
The suite is complete as of 2026-04-23: all 18 skills have shipped, 226/226 tests pass, and all quality gates are met (skill-judge rubric: 8 dimensions, 120 points). Entry-point skills (orchestrator, scope, ethics, eval) scored A− (102+/120); specialist skills also scored A−; Mindset stubs scored B+ (96+/120).