Linguistic Agent Skills

A suite of 18 Claude Code Skills bringing computational-linguistics expertise to AI engineers building LLMs for low-resource languages — the ~7,000 languages outside the English/Mandarin/Spanish frontier.

What This Suite Does

Building an LLM for Yoruba, Khmer, Quechua, Cantonese, or Twi requires linguistic decisions that pure-ML engineers routinely miss. The Linguistic Agent Skills suite captures those non-obvious decision points as agent skills: not introductions to linguistics (Claude already knows the textbooks), but the integration knowledge an experienced computational linguist applies when bridging theory to ML practice.

Every skill activates on natural-language triggers. Mention a target language, ask about tokenizer fertility, say "help me build an LLM for Yoruba" — and the orchestrator routes to the right specialist automatically.

The 18 Skills

The suite consists of 14 specialist skills, 1 orchestrator, and 3 optional Mindset stubs organized across a 5-phase pipeline:

Phase	Skills	Purpose
Scope	scope, scripts, tokenize, ethics	Language ID, resource class, typology, script policy, ethics seed
Acquire	corpus, bitext, transfer	Monolingual + parallel data, vocab/adapter strategy
Analyze	morph, syntax, annotate, semantics, discourse, speech	Linguistic-layer analysis as needed
Evaluate	eval	Honest metrics for the target language
Optional	codeswitch, historical, lexicon	Mindset stubs for Phase 4 decisions

The orchestrator (linguistic-orchestrator) is the entry point — it coordinates routing and tracks workspace state across sessions.
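The phase table above can be written down as a small data structure, which makes the 14 + 1 + 3 = 18 breakdown easy to check. This is only an illustrative sketch: the short skill names are copied from the table, while the `PIPELINE` dict and the `count_skills` helper are hypothetical and not part of the suite itself.

```python
# Phase -> skills, copied from the table above. The dict layout is an
# assumption made for illustration, not the suite's actual config format.
PIPELINE = {
    "Scope": ["scope", "scripts", "tokenize", "ethics"],
    "Acquire": ["corpus", "bitext", "transfer"],
    "Analyze": ["morph", "syntax", "annotate", "semantics", "discourse", "speech"],
    "Evaluate": ["eval"],
    "Optional": ["codeswitch", "historical", "lexicon"],
}
ORCHESTRATOR = "linguistic-orchestrator"


def count_skills(pipeline: dict) -> dict:
    """Count specialists, optional stubs, and the single orchestrator."""
    stubs = len(pipeline["Optional"])
    specialists = sum(
        len(skills) for phase, skills in pipeline.items() if phase != "Optional"
    )
    return {
        "specialists": specialists,
        "stubs": stubs,
        "orchestrator": 1,
        "total": specialists + stubs + 1,
    }


if __name__ == "__main__":
    print(count_skills(PIPELINE))
```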

Low-Resource Language Focus

The suite is purpose-built for Joshi classes 0–4: languages ranging from those with virtually no usable digital data (Class 0) to benchmark-covered but resource-limited ones (Class 4). Every skill surfaces the resource-class implications of each decision, because the right strategy for Yoruba (Class 2) is fundamentally different from the right strategy for Turkish (Class 4) or English (Class 5).

Key capabilities:

Language identification — ISO 639-3 + Glottolog disambiguation, including macrolanguage disambiguation (Chinese → Mandarin/Cantonese/Wu)
Script + Unicode handling — normalization policy, diacritic preservation for tone languages, TR39 confusable folding
Tokenizer audits — fertility ratio analysis, vocab extension method selection (FOCUS/OFA/HyperOfa)
Ethical compliance — FPIC/CARE principles, sacred-text gating, license compatibility, attribution lineage
Data curation — corpus catalog, paragraph-level language-ID, MinHash dedup, contamination audit
Bitext mining — LASER3/SONAR embeddings, Vecalign alignment, synthetic bitext generation
Transfer learning — LoRA rank by URIEL typological distance, MAD-X adapters, catastrophic-forgetting mitigation
Evaluation — chrF++/COMET/GEMBA-MQM, BLiMP-style grammatical probes, contamination-aware reporting
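Two of the capabilities listed above, diacritic-preserving normalization and tokenizer fertility, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the suite's implementation: fertility is taken here as tokens per whitespace-delimited word, `toy_tokenizer` is a hypothetical stand-in for a real subword tokenizer, and NFC is assumed as the normalization form.

```python
import unicodedata
from typing import Callable, Iterable, List


def normalize_keep_diacritics(text: str) -> str:
    """Apply NFC so combining marks (e.g. tone marks) are composed, not stripped."""
    return unicodedata.normalize("NFC", text)


def fertility(tokenize: Callable[[str], List[str]], sentences: Iterable[str]) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        sentence = normalize_keep_diacritics(sentence)
        n_words += len(sentence.split())
        n_tokens += len(tokenize(sentence))
    if n_words == 0:
        raise ValueError("no words found in input")
    return n_tokens / n_words


def toy_tokenizer(text: str) -> List[str]:
    """Hypothetical stand-in: cut each word into chunks of at most 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


if __name__ == "__main__":
    # "abcdef" -> 2 chunks, "gh" -> 1 chunk: 3 tokens over 2 words.
    print(fertility(toy_tokenizer, ["abcdef gh"]))
```

In practice the `tokenize` argument would be the tokenizer under audit. Whitespace word counts are themselves an assumption that fails for scripts written without spaces between words, such as Khmer, where a segmenter or a per-character baseline would be needed instead.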

Integration with the MAGIC Ecosystem

The Linguistic Agent Skills suite is a sibling repo to the Data Agent Skills suite. They share structural patterns and can be used together: the data suite handles general-purpose tabular and text data pipelines; the linguistic suite handles language-specific decisions within those pipelines.

See Cross-Suite Integration for patterns on using both suites together.

Suite Status

The suite is complete as of 2026-04-23: all 18 skills shipped, 226/226 tests passing, all quality gates met (skill-judge 8-dimension 120-point rubric). Entry-point skills (orchestrator, scope, ethics, eval) scored A− (102+/120); specialist skills scored A−; Mindset stubs scored B+ (96+/120).
