
linguistic-scope

Identify a target language precisely and set the strategic direction for any LLM/NLP project. This is the mandatory first step — getting the language identity wrong wastes weeks of downstream work.

Overview

linguistic-scope is the foundation of every linguistic pipeline run. It resolves ambiguous language names to canonical identifiers, classifies the language's resource availability, extracts the typological features that matter for ML decisions, and recommends the best transfer source. None of the downstream skills can be used responsibly without the outputs this skill produces.

The most costly engineering mistakes in low-resource NLP — training on the wrong Quechua variant, treating Arabic as monolithic, assuming English is the best transfer source for Yoruba — are all prevented by running scope first and making its reasoning explicit.

Pipeline Position

Phase: Scope (Phase 0) — runs first in every pipeline

Before this skill: Nothing. This is the mandatory entry point for any new language target.

After this skill: linguistic-scripts (normalization policy), linguistic-ethics (an early gate for free, prior and informed consent, FPIC), then linguistic-corpus or linguistic-bitext in the Acquire phase.

When It Activates

  • User names a target language for any NLP/LLM project
  • Workflow needs ISO 639-3 + Glottolog identity before touching data
  • Resource-class assessment (Joshi 0–5) is required to pick a strategy
  • Choosing a transfer source (which related language to bootstrap from)
  • User names a language that is potentially a macrolanguage: Chinese, Arabic, Pashto, Quechua, Sami
  • Determining vitality status (does this need community engagement?)

When NOT to use: The language is already disambiguated and classified, and its typology vector is in workspace_state.md. In that case, proceed to the next specialist.

What It Does

1. Language Disambiguation

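Resolves any input string to {ISO 639-3, Glottolog ID, family, default script}. If the result is a macrolanguage or collective code (zho, ara, pus, que, smi), scope stops and presents the member languages as numbered options. It never proceeds past such a code without user confirmation. Note that smi is an ISO 639-2 collective code for the Sami languages rather than an ISO 639-3 macrolanguage; it is handled the same way.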
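
The stop-at-macrolanguage rule can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the `LanguageId` type, the `NeedsConfirmation` exception, and the tiny in-memory tables are all hypothetical stand-ins for a real lookup against the ISO 639-3 and Glottolog registries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageId:
    """Hypothetical record type for a resolved language."""
    iso639_3: str
    glottocode: str
    family: str
    default_script: str


_YORUBA = LanguageId("yor", "yoru1245",
                     "Niger-Congo > Atlantic-Congo > Volta-Congo", "Latin")

# Illustrative tables only; a real resolver would query the registries.
KNOWN = {"yoruba": _YORUBA, "yor": _YORUBA}
ALIASES = {"quechua": "que", "chinese": "zho"}
MACRO_MEMBERS = {  # a few member languages per code, not exhaustive
    "que": ["quz (Cusco Quechua)", "quy (Ayacucho Quechua)"],
    "zho": ["cmn (Mandarin Chinese)", "yue (Yue Chinese)"],
}


class NeedsConfirmation(Exception):
    """Raised when the input resolves to a macrolanguage."""

    def __init__(self, code, options):
        super().__init__(f"{code} is a macrolanguage; choose a member language")
        self.code = code
        self.options = options


def resolve(name: str) -> LanguageId:
    key = name.strip().lower()
    code = ALIASES.get(key, key)
    if code in MACRO_MEMBERS:
        # Never proceed past a macrolanguage: hand the options to the user.
        raise NeedsConfirmation(code, MACRO_MEMBERS[code])
    if key in KNOWN:
        return KNOWN[key]
    raise KeyError(f"unknown language: {name!r}")
```

Raising an exception rather than returning a default member language keeps the confirmation step impossible to skip by accident.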

2. Joshi Resource Classification

Computes the Joshi 0–5 class from cached signals: Wikipedia presence, OPUS presence, FLORES-200 inclusion, NLLB inclusion, and dataset count on Hugging Face.

| Class | Label | Examples | Typical Strategy |
|---|---|---|---|
| 0 | "The Left-Behinds" | Dahalo, Yagua | Bootstrap from related language; field documentation |
| 1 | "The Scraping-Bys" | Igbo, Marathi (low end) | Continued pretraining + adapter |
| 2 | "Hopefuls" | Yoruba, Khmer, Twi | Vocab extension + LoRA |
| 3 | "Rising Stars" | Swahili, Indonesian (low end) | Standard fine-tune + careful eval |
| 4 | "Underdogs" | Vietnamese, Turkish, Tamil | Standard fine-tune |
| 5 | "Winners" | English, Mandarin, Spanish | Standard everything |
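
Once the class is known, picking the typical strategy is a plain table lookup. The sketch below encodes the labels and strategies from the table above; the names `JOSHI_STRATEGY` and `strategy_for` are hypothetical, and the sketch deliberately does not compute the class from the signals, since the thresholds are not given here.

```python
# Labels and strategies copied from the table in this section.
JOSHI_STRATEGY = {
    0: ("The Left-Behinds", "Bootstrap from related language; field documentation"),
    1: ("The Scraping-Bys", "Continued pretraining + adapter"),
    2: ("Hopefuls", "Vocab extension + LoRA"),
    3: ("Rising Stars", "Standard fine-tune + careful eval"),
    4: ("Underdogs", "Standard fine-tune"),
    5: ("Winners", "Standard everything"),
}


def strategy_for(joshi_class: int) -> str:
    """Return a one-line summary for a Joshi class (0-5)."""
    if joshi_class not in JOSHI_STRATEGY:
        raise ValueError(f"Joshi class must be 0-5, got {joshi_class!r}")
    label, strategy = JOSHI_STRATEGY[joshi_class]
    return f'Class {joshi_class} ("{label}"): {strategy}'


print(strategy_for(2))
```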

3. Typological Profile

Surfaces outlier features from the URIEL vector that require targeted handling: polysynthesis, tone, agglutination, root-and-pattern morphology, evidentiality, classifier systems, switch reference.

4. Transfer Source Selection

Computes the URIEL typological distance to the top 100 candidate sources and recommends the top 3 with justifications. English is often not the best choice: for Yoruba, Igbo (distance 0.18) outperforms English (distance 0.62) by 2–5× on transfer tasks.
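
The ranking step itself amounts to sorting candidates by distance and keeping the closest few. A minimal sketch, assuming the distances have already been computed and are passed in as a plain dictionary; the function name `rank_sources` is hypothetical, and the numbers are the ones quoted for Yoruba on this page.

```python
def rank_sources(distances: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k candidate sources with the smallest distance to the target."""
    return sorted(distances.items(), key=lambda item: item[1])[:k]


# Distances to Yoruba as quoted on this page (ISO 639-3 code -> distance).
yoruba_distances = {"eng": 0.62, "ibo": 0.18, "swa": 0.41, "hau": 0.34}

for code, dist in rank_sources(yoruba_distances):
    print(f"{code}: {dist:.2f}")
```

With these inputs English falls outside the top 3, which matches the recommendation above. Distance is only one part of the justification; data availability for each candidate still has to be weighed separately.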

5. Vitality Assessment

Maps UNESCO/EGIDS vitality to the required depth of ethics engagement. Languages at EGIDS 6b–7 (threatened/shifting) require community pre-engagement before any data acquisition.
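
The 6b–7 rule can be expressed as a small gate function. This is a sketch under stated assumptions: the function name is hypothetical, levels safer than 6b are assumed not to need mandatory pre-engagement, and levels beyond 7 (8a through 10) return `None` because this page does not say how they are handled.

```python
from typing import Optional

# EGIDS levels in order from most to least vital.
EGIDS_ORDER = ["0", "1", "2", "3", "4", "5", "6a", "6b", "7", "8a", "8b", "9", "10"]


def requires_pre_engagement(egids: str) -> Optional[bool]:
    """True for 6b-7, False for safer levels, None where this page is silent."""
    level = egids.strip().lower()
    if level not in EGIDS_ORDER:
        raise ValueError(f"not an EGIDS level: {egids!r}")
    rank = EGIDS_ORDER.index(level)
    if rank < EGIDS_ORDER.index("6b"):
        return False  # assumption: standard process applies
    if rank <= EGIDS_ORDER.index("7"):
        return True   # stated rule: pre-engagement before any data acquisition
    return None       # 8a-10: not specified on this page
```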

Inputs & Outputs

| Input | Description |
|---|---|
| Language name / ISO code | Any string identifying the target language |

| Output | Description |
|---|---|
| ISO 639-3 + Glottolog ID | Canonical language identifier |
| Joshi class (0–5) | Resource availability classification |
| Typological outliers | Features requiring special handling |
| Top-3 transfer sources | URIEL-distance-ranked candidates |
| Vitality status | Ethics engagement depth required |
| workspace_state.md entry | Structured scope record for downstream skills |

Example Usage

Input: "help me build an LLM for Yoruba"

Output:

## Scope: Yoruba

- **ISO 639-3**: yor
- **Glottolog**: yoru1245
- **Family**: Niger-Congo > Atlantic-Congo > Volta-Congo
- **Default script(s)**: Latin (with tonal diacritics)
- **Resource class (Joshi 0-5)**: 2 — "Hopefuls"
- **Vitality (EGIDS)**: 2 — Standard FPIC + license check
- **Typological outliers**: tone (lexical), SVO, isolating
- **Top transfer source(s)**:
  1. Igbo (ibo) — URIEL=0.18 — same family + tone + Latin script + Class 1 data
  2. Hausa (hau) — URIEL=0.34 — regional contact + tone + Class 2 data
  3. Swahili (swa) — URIEL=0.41 — same family branch + Class 3 data
  English distance 0.62 — NOT recommended as primary source.
- **Strategic recommendation**: Use Igbo as transfer source; apply vocab extension + LoRA (Class 2); FLORES-200 eval with chrF++ primary metric.
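
A record like the one above could be held in a small data structure and rendered to markdown for workspace_state.md. The sketch below is illustrative only: the `ScopeRecord` class and its field names are hypothetical, the actual workspace_state.md schema is not shown on this page, and only a subset of the fields is rendered.

```python
from dataclasses import dataclass, field


@dataclass
class ScopeRecord:
    """Hypothetical container for the outputs of the scope step."""
    name: str
    iso639_3: str
    glottocode: str
    joshi_class: int
    egids: str
    outliers: list[str] = field(default_factory=list)
    transfer_sources: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        lines = [
            f"## Scope: {self.name}",
            "",
            f"- **ISO 639-3**: {self.iso639_3}",
            f"- **Glottolog**: {self.glottocode}",
            f"- **Resource class (Joshi 0-5)**: {self.joshi_class}",
            f"- **Vitality (EGIDS)**: {self.egids}",
            f"- **Typological outliers**: {', '.join(self.outliers)}",
            f"- **Top transfer source(s)**: {', '.join(self.transfer_sources)}",
        ]
        return "\n".join(lines)


record = ScopeRecord(
    name="Yoruba", iso639_3="yor", glottocode="yoru1245",
    joshi_class=2, egids="2",
    outliers=["tone (lexical)", "SVO", "isolating"],
    transfer_sources=["ibo", "hau", "swa"],
)
print(record.to_markdown())
```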