linguistic-scope

Identify a target language precisely and set the strategic direction for any LLM/NLP project. This is the mandatory first step — getting the language identity wrong wastes weeks of downstream work.

Overview

linguistic-scope is the foundation of every linguistic pipeline run. It resolves ambiguous language names to canonical identifiers, classifies the language's resource availability, extracts the typological features that matter for ML decisions, and recommends the best transfer source. None of the downstream skills can be used responsibly without the outputs this skill produces.

The most costly engineering mistakes in low-resource NLP — training on the wrong Quechua variant, treating Arabic as monolithic, assuming English is the best transfer source for Yoruba — are all prevented by running scope first and making its reasoning explicit.

Pipeline Position

Phase: Scope (Phase 0) — runs first in every pipeline

Before this skill: Nothing. This is the mandatory entry point for any new language target.

After this skill: linguistic-scripts (normalization policy), linguistic-ethics (early FPIC gate), then linguistic-corpus or linguistic-bitext in the Acquire phase.

When It Activates

User names a target language for any NLP/LLM project
Workflow needs ISO 639-3 + Glottolog identity before touching data
Resource-class assessment (Joshi 0–5) is required to pick a strategy
Choosing a transfer source (which related language to bootstrap from)
User names a language that may be a macrolanguage or language group: Chinese, Arabic, Pashto, Quechua, Sami
Determining vitality status (does this need community engagement?)

When NOT to use: The language is already disambiguated, classified, and the typology vector is in workspace_state.md — proceed to the next specialist.
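The skip condition above can be checked mechanically. The following is a minimal sketch, assuming workspace_state.md stores each scope record under a `## Scope: <language>` heading with the bolded field labels shown in the Example Usage output. The `REQUIRED_FIELDS` tuple and the `scope_is_complete` helper are hypothetical names, not part of the skill.

```python
import re

# Field labels taken from the example scope record. Treating them as the
# required set is an assumption, not a documented schema.
REQUIRED_FIELDS = (
    "ISO 639-3",
    "Glottolog",
    "Resource class",
    "Vitality",
    "Typological outliers",
    "Top transfer source",
)


def scope_is_complete(state_text: str, language: str) -> bool:
    """Return True if state_text holds a full scope record for `language`."""
    header = re.search(
        rf"^## Scope: {re.escape(language)}[ \t]*$", state_text, flags=re.MULTILINE
    )
    if header is None:
        return False
    # The record runs until the next level-2 heading or the end of the file.
    rest = state_text[header.end():]
    next_heading = re.search(r"^## ", rest, flags=re.MULTILINE)
    record = rest[: next_heading.start()] if next_heading else rest
    return all(f"**{field}" in record for field in REQUIRED_FIELDS)
```

A caller would read workspace_state.md, pass its text to `scope_is_complete`, and run scope only when the result is False.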

What It Does

1. Language Disambiguation

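Resolves any input string to {ISO 639-3, Glottolog ID, family, default script}. If the result is a macrolanguage (zho, ara, pus, que) or a collective code such as smi (Sami languages, an ISO 639-2 collective code rather than an ISO 639-3 macrolanguage), scope stops and presents the member languages as numbered options. It never proceeds past a macrolanguage or collective code without user confirmation.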
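The gate described above can be sketched as a resolver that either returns a canonical record or raises with the options to present. This is an illustration under stated assumptions: `LanguageId`, `NeedsConfirmation`, `resolve`, and the small lookup tables are hypothetical, and a real resolver would query ISO 639-3 and Glottolog data instead of hard-coded dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageId:
    iso639_3: str
    glottolog_id: str
    family: str
    default_script: str


class NeedsConfirmation(Exception):
    """Raised when the input resolves to a code that covers several languages."""

    def __init__(self, code: str, options: list[str]):
        super().__init__(f"'{code}' is ambiguous; choose one of: {options}")
        self.code = code
        self.options = options


# Codes the text above names as requiring user confirmation.
GATED_CODES = {"zho", "ara", "pus", "que", "smi"}

# Hypothetical toy tables, for illustration only.
NAME_TO_CODE = {"yoruba": "yor", "quechua": "que", "arabic": "ara"}
RECORDS = {
    "yor": LanguageId(
        "yor", "yoru1245", "Niger-Congo > Atlantic-Congo > Volta-Congo", "Latin"
    ),
}
# Deliberately incomplete samples of member languages.
MEMBERS = {
    "que": ["quz (Cusco Quechua)", "quy (Ayacucho Quechua)"],
    "ara": ["arb (Standard Arabic)", "arz (Egyptian Arabic)"],
}


def resolve(query: str) -> LanguageId:
    """Map a language name or code to a LanguageId, or stop at the gate."""
    key = query.strip().lower()
    code = NAME_TO_CODE.get(key, key)
    if code in GATED_CODES:
        raise NeedsConfirmation(code, MEMBERS.get(code, []))
    if code not in RECORDS:
        raise LookupError(f"unknown language: {query!r}")
    return RECORDS[code]
```

Raising an exception, rather than returning a best guess, keeps the caller from continuing silently: the only way past `NeedsConfirmation` is to call `resolve` again with one of the listed member codes.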

2. Joshi Resource Classification

Computes the Joshi 0–5 resource class from cached signals: Wikipedia presence, OPUS presence, FLORES-200 inclusion, NLLB inclusion, and the number of datasets on Hugging Face.

Class	Label	Examples	Typical Strategy
0	"The Left-Behinds"	Dahalo, Yagua	Bootstrap from related language; field documentation
1	"The Scraping-Bys"	Igbo, Marathi (low end)	Continued pretraining + adapter
2	"The Hopefuls"	Yoruba, Khmer, Twi	Vocab extension + LoRA
3	"The Rising Stars"	Swahili, Indonesian (low end)	Standard fine-tune + careful eval
4	"The Underdogs"	Vietnamese, Turkish, Tamil	Standard fine-tune
5	"The Winners"	English, Mandarin, Spanish	Standard everything
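Once a class is known, the strategy column of the table reduces to a lookup. The sketch below copies the strategies from the table and takes the class as given, since the rule that turns the cached signals into a class is not specified here. `TYPICAL_STRATEGY` and `strategy_for` are hypothetical names.

```python
# Typical strategy per Joshi class, copied from the table above.
TYPICAL_STRATEGY = {
    0: "Bootstrap from related language; field documentation",
    1: "Continued pretraining + adapter",
    2: "Vocab extension + LoRA",
    3: "Standard fine-tune + careful eval",
    4: "Standard fine-tune",
    5: "Standard everything",
}


def strategy_for(joshi_class: int) -> str:
    """Return the typical strategy for a Joshi class in the range 0-5."""
    if joshi_class not in TYPICAL_STRATEGY:
        raise ValueError(f"Joshi class must be 0-5, got {joshi_class!r}")
    return TYPICAL_STRATEGY[joshi_class]


print(strategy_for(2))  # Vocab extension + LoRA
```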

Inputs & Outputs

Input	Description
Language name / ISO code	Any string identifying the target language

Output	Description
ISO 639-3 + Glottolog ID	Canonical language identifier
Joshi class (0–5)	Resource availability classification
Typological outliers	Features requiring special handling
Top-3 transfer sources	URIEL-distance-ranked candidates
Vitality status	Ethics engagement depth required
`workspace_state.md` entry	Structured scope record for downstream skills
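One way to hold these outputs together is a single record type that can render itself as a workspace_state.md entry. This is a sketch, not the skill's actual schema: the `ScopeRecord` class and its field names are assumptions, and the rendered layout simply follows the bullet format of the Example Usage output.

```python
from dataclasses import dataclass


@dataclass
class ScopeRecord:
    name: str
    iso639_3: str
    glottolog_id: str
    joshi_class: int                 # 0-5
    vitality: str                    # e.g. an EGIDS level plus ethics depth
    typological_outliers: list[str]
    transfer_sources: list[str]      # up to three, nearest first

    def to_markdown(self) -> str:
        """Render the record in the bullet layout used by the example output."""
        lines = [
            f"## Scope: {self.name}",
            "",
            f"- **ISO 639-3**: {self.iso639_3}",
            f"- **Glottolog**: {self.glottolog_id}",
            f"- **Resource class (Joshi 0-5)**: {self.joshi_class}",
            f"- **Vitality (EGIDS)**: {self.vitality}",
            f"- **Typological outliers**: {', '.join(self.typological_outliers)}",
            "- **Top transfer source(s)**:",
        ]
        # Number the transfer sources and cap the list at three entries.
        lines += [
            f"  {i}. {src}" for i, src in enumerate(self.transfer_sources[:3], 1)
        ]
        return "\n".join(lines) + "\n"
```

Keeping the record typed and rendering it in one place means downstream skills can rely on a stable set of field labels when they read the file.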

Example Usage

Input: "help me build an LLM for Yoruba"

Output:

## Scope: Yoruba

- **ISO 639-3**: yor
- **Glottolog**: yoru1245
- **Family**: Niger-Congo > Atlantic-Congo > Volta-Congo
- **Default script(s)**: Latin (with tonal diacritics)
- **Resource class (Joshi 0-5)**: 2 — "Hopefuls"
- **Vitality (EGIDS)**: 2 — Standard FPIC + license check
- **Typological outliers**: tone (lexical), SVO, isolating
- **Top transfer source(s)**:
  1. Igbo (ibo) — URIEL=0.18 — same family + tone + Latin script + Class 1 data
  2. Hausa (hau) — URIEL=0.34 — regional contact + tone + Class 2 data
  3. Swahili (swa) — URIEL=0.41 — same family branch + Class 3 data
  English distance 0.62 — NOT recommended as primary source.
- **Strategic recommendation**: Use Igbo as transfer source; apply vocab extension + LoRA (Class 2); FLORES-200 eval with chrF++ primary metric.
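The ranking in this example can be reproduced by sorting candidates on their distance to the target. The sketch below uses the four distances quoted in the example output. It ranks on distance alone, which is a simplification: the example's rationale also weighs tone, script, and available data. `top_transfer_sources` is a hypothetical helper.

```python
# Distances to Yoruba, copied from the example output above.
DISTANCE_TO_YORUBA = {"eng": 0.62, "swa": 0.41, "ibo": 0.18, "hau": 0.34}


def top_transfer_sources(distances: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k candidates with the smallest distance, nearest first."""
    return sorted(distances.items(), key=lambda item: item[1])[:k]


print(top_transfer_sources(DISTANCE_TO_YORUBA))
# [('ibo', 0.18), ('hau', 0.34), ('swa', 0.41)]
```

English falls outside the top three at 0.62, which matches the example's note that it is not recommended as the primary source.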

Downstream skills consume the scope output as follows:

linguistic-scripts: sets normalization policy based on scope output
linguistic-ethics: uses vitality status to set ethics depth
linguistic-transfer: uses URIEL distance for LoRA rank selection
linguistic-eval: uses Joshi class for benchmark selection
