linguistic-scope
Identify a target language precisely and set the strategic direction for any LLM/NLP project. This is the mandatory first step — getting the language identity wrong wastes weeks of downstream work.
Overview
linguistic-scope is the foundation of every linguistic pipeline run. It resolves ambiguous language names to canonical identifiers, classifies the language's resource availability, extracts the typological features that matter for ML decisions, and recommends the best transfer source. None of the downstream skills can be used responsibly without the outputs this skill produces.
The most costly engineering mistakes in low-resource NLP — training on the wrong Quechua variant, treating Arabic as monolithic, assuming English is the best transfer source for Yoruba — are all prevented by running scope first and making its reasoning explicit.
Pipeline Position
Phase: Scope (Phase 0) — runs first in every pipeline
Before this skill: Nothing. This is the mandatory entry point for any new language target.
After this skill: linguistic-scripts (normalization policy), linguistic-ethics (early FPIC gate), then linguistic-corpus or linguistic-bitext in the Acquire phase.
When It Activates
- User names a target language for any NLP/LLM project
- Workflow needs ISO 639-3 + Glottolog identity before touching data
- Resource-class assessment (Joshi 0–5) is required to pick a strategy
- Choosing a transfer source (which related language to bootstrap from)
- User names a language that may be a macrolanguage or a language group: Chinese, Arabic, Pashto, Quechua, Sami
- Determining vitality status (does this need community engagement?)
When NOT to use: The language is already disambiguated and classified, and its typology vector is recorded in workspace_state.md. In that case, proceed to the next specialist.
What It Does
1. Language Disambiguation
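Resolves any input string to {ISO 639-3, Glottolog ID, family, default script}. If the result is an ISO 639-3 macrolanguage (zho, ara, pus, que) or a collective code that covers several languages (smi, which is an ISO 639-2/639-5 group code rather than a macrolanguage), scope stops and presents the member languages as numbered options. It never proceeds past such a code without user confirmation.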
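The stop-and-confirm behaviour can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: `NAME_TO_CODE`, `MACROLANGUAGE_MEMBERS`, `Resolution`, and `resolve` are hypothetical names, and the lookup tables are small hand-written samples standing in for the ISO 639-3 and Glottolog registries.

```python
from dataclasses import dataclass

# Hypothetical, hand-written sample data for illustration only.
# A real run would query the ISO 639-3 and Glottolog registries.
NAME_TO_CODE = {"chinese": "zho", "arabic": "ara", "quechua": "que", "yoruba": "yor"}
MACROLANGUAGE_MEMBERS = {
    "zho": ["cmn", "yue", "wuu"],  # partial member list
    "ara": ["arb", "arz", "ary"],  # partial member list
    "que": ["quz", "quy"],         # partial member list
}


@dataclass
class Resolution:
    code: str
    needs_confirmation: bool
    options: list


def resolve(name: str) -> Resolution:
    """Map a free-text language name to a code, stopping at macrolanguages."""
    code = NAME_TO_CODE.get(name.strip().lower())
    if code is None:
        raise KeyError(f"unknown language name: {name!r}")
    members = MACROLANGUAGE_MEMBERS.get(code, [])
    # Never pick a member language silently: return numbered options instead.
    options = [f"{i}. {member}" for i, member in enumerate(members, start=1)]
    return Resolution(code=code, needs_confirmation=bool(members), options=options)
```

With this shape, a caller checks `needs_confirmation` and shows `options` to the user before any downstream step runs.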
2. Joshi Resource Classification
Computes the Joshi 0–5 class from cached signals: Wikipedia presence, OPUS presence, FLORES-200 inclusion, NLLB inclusion, dataset count on HuggingFace.
| Class | Label | Examples | Typical Strategy |
|---|---|---|---|
| 0 | "The Left-Behinds" | Dahalo, Yagua | Bootstrap from related language; field documentation |
| 1 | "The Scraping-Bys" | Igbo, Marathi (low end) | Continued pretraining + adapter |
| 2 | "The Hopefuls" | Yoruba, Khmer, Twi | Vocab extension + LoRA |
| 3 | "The Rising Stars" | Swahili, Indonesian (low end) | Standard fine-tune + careful eval |
| 4 | "The Underdogs" | Vietnamese, Turkish, Tamil | Standard fine-tune |
| 5 | "The Winners" | English, Mandarin, Spanish | Standard everything |
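Once a class has been assigned, the "Typical Strategy" column of the table above reduces to a lookup. The sketch below assumes the class is already known; it does not attempt to reproduce how the class is computed from the cached signals, and `JOSHI_STRATEGY` and `strategy_for` are hypothetical names.

```python
# Strategy text copied from the table above; the names are illustrative.
JOSHI_STRATEGY = {
    0: "Bootstrap from related language; field documentation",
    1: "Continued pretraining + adapter",
    2: "Vocab extension + LoRA",
    3: "Standard fine-tune + careful eval",
    4: "Standard fine-tune",
    5: "Standard everything",
}


def strategy_for(joshi_class: int) -> str:
    """Return the typical strategy for a Joshi class, rejecting invalid classes."""
    if joshi_class not in JOSHI_STRATEGY:
        raise ValueError(f"Joshi class must be 0-5, got {joshi_class!r}")
    return JOSHI_STRATEGY[joshi_class]
```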
3. Typological Profile
Surfaces outlier features from the URIEL vector that require targeted handling: polysynthesis, tone, agglutination, root-and-pattern morphology, evidentiality, classifier systems, switch reference.
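As a rough illustration of "surfacing outliers", the sketch below filters a language profile down to the features named above. The boolean dictionary format and the names `OUTLIER_FEATURES` and `typological_outliers` are assumptions made for this example; the real URIEL vector is a much larger feature set.

```python
# Feature names follow the list above; the dict-of-booleans profile is an assumption.
OUTLIER_FEATURES = (
    "polysynthesis",
    "tone",
    "agglutination",
    "root_and_pattern",
    "evidentiality",
    "classifiers",
    "switch_reference",
)


def typological_outliers(profile: dict) -> list:
    """Return the flagged features present in the profile, in a stable order."""
    return [name for name in OUTLIER_FEATURES if profile.get(name, False)]
```

Features outside the flagged set (word order, for example) are ignored by this filter, so a profile can carry extra keys without affecting the result.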
4. Transfer Source Selection
Computes URIEL typological distance to top-100 candidate sources and recommends top-3 with justifications. English is often not the best choice — for Yoruba, Igbo (distance 0.18) outperforms English (distance 0.62) by 2–5× on transfer tasks.
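The ranking step itself is a sort by distance. The sketch below assumes the distances have already been computed and are passed in as a plain dictionary; `top_transfer_sources` is a hypothetical helper, and the values are the ones quoted for Yoruba in this document.

```python
def top_transfer_sources(distances: dict, k: int = 3) -> list:
    """Return the k candidates with the smallest distance, closest first."""
    return sorted(distances.items(), key=lambda item: item[1])[:k]


# Distances to Yoruba as quoted in this document.
yoruba_distances = {"eng": 0.62, "swa": 0.41, "ibo": 0.18, "hau": 0.34}
print(top_transfer_sources(yoruba_distances))
# [('ibo', 0.18), ('hau', 0.34), ('swa', 0.41)]
```

English falls outside the top three here purely because its distance is the largest, which matches the recommendation in the text.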
5. Vitality Assessment
Maps UNESCO/EGIDS vitality to the required depth of ethics engagement. Languages at EGIDS 6b–7 (threatened/shifting) require community pre-engagement before any data acquisition.
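The one rule stated above can be written as a small gate. This sketch encodes only that rule: how levels outside 6b–7 are handled is not specified here, so a `False` result means "this rule does not fire", not "no ethics review needed". The names `EGIDS_LEVELS` and `in_pre_engagement_band` are hypothetical.

```python
# EGIDS levels in order from most to least vital.
EGIDS_LEVELS = ("0", "1", "2", "3", "4", "5", "6a", "6b", "7", "8a", "8b", "9", "10")
PRE_ENGAGEMENT_BAND = {"6b", "7"}  # threatened / shifting, per the rule above


def in_pre_engagement_band(egids: str) -> bool:
    """True if the EGIDS level falls in the 6b-7 band that gates data acquisition."""
    level = egids.strip().lower()
    if level not in EGIDS_LEVELS:
        raise ValueError(f"not an EGIDS level: {egids!r}")
    return level in PRE_ENGAGEMENT_BAND
```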
Inputs & Outputs
| Input | Description |
|---|---|
| Language name / ISO code | Any string identifying the target language |

| Output | Description |
|---|---|
| ISO 639-3 + Glottolog ID | Canonical language identifier |
| Joshi class (0–5) | Resource availability classification |
| Typological outliers | Features requiring special handling |
| Top-3 transfer sources | URIEL-distance-ranked candidates |
| Vitality status | Ethics engagement depth required |
| workspace_state.md entry | Structured scope record for downstream skills |
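One way to carry these outputs between skills is a small record type that renders itself as markdown. The sketch below is an assumption about shape only: `ScopeRecord` and its field names are hypothetical, and the actual workspace_state.md schema is not defined in this document.

```python
from dataclasses import dataclass, field


@dataclass
class ScopeRecord:
    """Hypothetical container for the scope outputs listed above."""
    name: str
    iso639_3: str
    glottolog: str
    joshi_class: int
    egids: str
    outliers: list = field(default_factory=list)
    transfer_sources: list = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the record as a markdown block for workspace_state.md."""
        lines = [
            f"## Scope: {self.name}",
            f"- **ISO 639-3**: {self.iso639_3}",
            f"- **Glottolog**: {self.glottolog}",
            f"- **Resource class (Joshi 0-5)**: {self.joshi_class}",
            f"- **Vitality (EGIDS)**: {self.egids}",
            f"- **Typological outliers**: {', '.join(self.outliers)}",
            f"- **Top transfer source(s)**: {', '.join(self.transfer_sources)}",
        ]
        return "\n".join(lines)
```

For example, `ScopeRecord("Yoruba", "yor", "yoru1245", 2, "2", ["tone (lexical)"], ["ibo", "hau", "swa"]).to_markdown()` produces a block whose first line is `## Scope: Yoruba`.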
Example Usage
Input: "help me build an LLM for Yoruba"
Output:
## Scope: Yoruba
- **ISO 639-3**: yor
- **Glottolog**: yoru1245
- **Family**: Niger-Congo > Atlantic-Congo > Volta-Congo
- **Default script(s)**: Latin (with tonal diacritics)
- **Resource class (Joshi 0-5)**: 2 — "Hopefuls"
- **Vitality (EGIDS)**: 2 — Standard FPIC + license check
- **Typological outliers**: tone (lexical), SVO, isolating
- **Top transfer source(s)**:
1. Igbo (ibo) — URIEL=0.18 — same family + tone + Latin script + Class 1 data
2. Hausa (hau) — URIEL=0.34 — regional contact + tone + Class 2 data
3. Swahili (swa) — URIEL=0.41 — same family branch + Class 3 data
English distance 0.62 — NOT recommended as primary source.
- **Strategic recommendation**: Use Igbo as transfer source; apply vocab extension + LoRA (Class 2); FLORES-200 eval with chrF++ primary metric.
Related Skills
- linguistic-scripts: sets normalization policy based on scope output
- linguistic-ethics: uses vitality status to set ethics depth
- linguistic-transfer: uses URIEL distance for LoRA rank selection
- linguistic-eval: uses Joshi class for benchmark selection