Quick Start

This walkthrough guides you through your first language analysis with the Linguistic Agent Skills suite. You'll scope a low-resource language, get its typological profile, set a script policy, and run an ethics seed: the foundation for any LLM project.

Prerequisites

Complete Installation first. You need Claude Code running with the skills and commands linked.

Step 1 — Start a Pipeline Session

Open a Claude Code session in your project directory and type:

help me build an LLM for Yoruba

The linguistic-orchestrator activates automatically. It creates workspace_state.md in your current directory and routes to linguistic-scope.

Alternatively, use the explicit slash command:

/linguistic:lifecycle

Step 2 — Language Disambiguation

linguistic-scope runs disambiguation first. Yoruba resolves unambiguously to the ISO 639-3 code yor, so the skill proceeds directly. If you had said "Chinese" or "Arabic", it would pause and present the candidate subtags as numbered options, because macrolanguage disambiguation is mandatory before any data work begins.

Expected output:

ISO 639-3: yor
Glottolog: yoru1245
Family: Niger-Congo > Atlantic-Congo > Volta-Congo > Benue-Congo > Yoruboid
Default script: Latin (with tonal diacritics)
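
To make the pause-and-choose behaviour concrete, here is a minimal sketch of the disambiguation logic described above. It is not the skill's actual implementation: the lookup tables, the disambiguate function, and the returned dictionary shape are all assumptions made for illustration, and only a few member codes of each macrolanguage are listed.

```python
# Hypothetical lookup tables; the suite's real data source is not shown in this guide.
MACROLANGUAGES = {
    "chinese": ("zho", ["cmn", "yue", "wuu"]),  # a few members of ISO 639-3 zho
    "arabic": ("ara", ["arb", "arz", "ary"]),   # a few members of ISO 639-3 ara
}
INDIVIDUAL = {"yoruba": "yor"}


def disambiguate(name: str) -> dict:
    """Return a resolved code, or numbered options when the name is a macrolanguage."""
    key = name.strip().lower()
    if key in INDIVIDUAL:
        return {"status": "resolved", "code": INDIVIDUAL[key]}
    if key in MACROLANGUAGES:
        macro, members = MACROLANGUAGES[key]
        # Number the options from 1, as a user-facing menu would.
        options = {i: code for i, code in enumerate(members, start=1)}
        return {"status": "needs_choice", "macro": macro, "options": options}
    return {"status": "unknown"}


print(disambiguate("Yoruba"))
print(disambiguate("Chinese"))
```

In this sketch a "needs_choice" result is the point at which a pipeline would stop and wait for the user to pick a number before touching any data.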

Step 3 — Resource Classification

The skill computes the Joshi 0–5 resource class from cached signals: Wikipedia presence, OPUS presence, FLORES-200 inclusion, NLLB inclusion, and HuggingFace dataset count.

Expected output for Yoruba:

Resource class (Joshi): 2 — "Hopefuls"
  Some labelled data, no full benchmark coverage.
  Recommended strategy: Vocab extension + LoRA.
  Transfer source candidates: Igbo (0.18), Hausa (0.34), Swahili (0.41)
  English distance: 0.62 — NOT recommended as primary transfer source.

This class drives the downstream decisions: tokenizer strategy, eval suite, adapter rank, and ethics depth. The orchestrator records it in workspace_state.md.
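
The transfer-source ranking in the expected output can be read as a simple "smallest distance wins" comparison. The sketch below uses the distances printed above; the pick_transfer_source helper is hypothetical, and the suite may well weigh other factors (data availability, script overlap) that this guide does not describe.

```python
# Distances copied from the expected output above; lower means typologically closer.
candidates = {"Igbo": 0.18, "Hausa": 0.34, "Swahili": 0.41}
english_distance = 0.62


def pick_transfer_source(distances: dict[str, float]) -> str:
    """Return the candidate with the smallest distance (hypothetical helper)."""
    return min(distances, key=distances.get)


primary = pick_transfer_source(candidates)
print(primary)  # Igbo

# English is farther away than every listed candidate.
print(english_distance > max(candidates.values()))  # True
```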

Step 4 — Typological Profile

The skill surfaces outlier features from the URIEL vector that require targeted handling:

Typological outliers for Yoruba (yor):
  • Tone: HIGH/LOW tone (á/à), nasal (ọ̀) — diacritic preservation MANDATORY
  • SVO word order: standard
  • Isolating morphology: low fertility expected (~1.8-2.2×)
  • No grammatical gender
  • Subject-verb agreement: limited
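
The fertility figure in the profile can be checked against your own tokenizer. The following is a rough sketch, assuming "fertility" here means average tokens per whitespace-delimited word (the usual meaning for tokenizers). The toy_tokenize function is a stand-in invented for this example; replace it with the encode call of whatever tokenizer you actually use.

```python
def fertility(text: str, tokenize) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


# Toy stand-in tokenizer (an assumption): splits each word into
# chunks of at most 3 characters.
def toy_tokenize(text: str) -> list[str]:
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


sample = "abcdef ghi jk"
print(fertility(sample, toy_tokenize))
```

Measuring on a representative sample of real Yoruba text, rather than a toy string, is what would tell you whether a tokenizer falls inside the expected range.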

Step 5 — Script Policy (linguistic-scripts)

The orchestrator routes to linguistic-scripts to set normalization policy:

Script Policy for Yoruba:
  Primary script: Latin (Unicode block U+0000-U+007F + combining diacritics)
  Normalization: NFC (default)
  Diacritics: PRESERVE — tone language (stripping = data corruption)
  Romanization: N/A (already Latin-script)
  Confusable risk: LOW
  ZWJ/ZWNJ: NORMALIZE

This policy is recorded in workspace_state.md and applies to every downstream step.
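
The NFC and PRESERVE lines of the policy can be illustrated with Python's standard unicodedata module. This is a sketch of the idea, not the skill's own code: apply_policy and strip_diacritics are names invented here, and the ZWJ/ZWNJ rule is not covered. The four sample words are a commonly cited Yoruba minimal set, written with escapes so the encoding is unambiguous.

```python
import unicodedata


def apply_policy(text: str) -> str:
    """NFC-normalize and keep all diacritics, per the policy above."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text: str) -> str:
    """What the policy forbids: drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Three encodings of the same letter (o + dot below + grave accent).
variants = ["o\u0323\u0300", "o\u0300\u0323", "\u00f2\u0323"]
print({apply_policy(v) for v in variants} == {"\u1ecd\u0300"})  # True

words = [
    "\u1ecdk\u1ecd",              # husband
    "\u1ecdk\u1ecd\u0301",        # hoe
    "\u1ecdk\u1ecd\u0300",        # vehicle
    "\u1ecd\u0300k\u1ecd\u0300",  # spear
]
# Under NFC the four words stay distinct; stripped, they collapse into one.
print(len({apply_policy(w) for w in words}))  # 4
print({strip_diacritics(w) for w in words})   # {'oko'}
```

NFC makes differently encoded copies of the same text compare equal, while stripping marks erases the distinctions that carry meaning, which is why the policy treats it as data corruption.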

Step 6 — Ethics Seed (linguistic-ethics)

Before any data recommendation, linguistic-ethics runs an early gate:

Ethics seed for Yoruba:
  Vitality (EGIDS): 2 — Provincial (widely spoken in Nigeria)
  Ethics depth: Standard FPIC + license check
  Community engagement: Standard (no mandatory pre-engagement for this vitality level)
  Sacred-text flag: None identified

Step 7 — Review the Workspace State

At any point, check where you are:

/linguistic:status

Output:

[Phase: Scope | Language: Yoruba (yor) | Resource Class: 2 | Skills routed: scope, scripts, ethics | Open findings: 1]
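
If you want to consume this status line from a script, it splits cleanly on " | " and ": ". The parser below is a hypothetical helper written against the single example line shown above; it assumes that format is stable, which this guide does not promise.

```python
def parse_status(line: str) -> dict[str, str]:
    """Split a '[Key: value | Key: value]' status line into a dict."""
    body = line.strip().removeprefix("[").removesuffix("]")
    fields = {}
    for part in body.split(" | "):
        # Split on the first ': ' only, so values may contain colons.
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()
    return fields


status = parse_status(
    "[Phase: Scope | Language: Yoruba (yor) | Resource Class: 2 | "
    "Skills routed: scope, scripts, ethics | Open findings: 1]"
)
print(status["Resource Class"])
print(status["Skills routed"].split(", "))
```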

For a full review:

/linguistic:review

Step 8 — Next Steps

With Scope complete, you're ready for the Acquire phase. The orchestrator will suggest:

/linguistic:propose

This generates a full plan.md covering all 5 phases: corpus sources (with ethics flags), tokenizer strategy, adapter plan, eval suite, and release gating, each tailored to Yoruba Class 2.

What Just Happened

In these 8 steps, the suite automatically:

Resolved the language to a canonical ISO 639-3 + Glottolog identifier
Classified it as Joshi Class 2 with typology-informed strategy
Identified that English is a poor transfer source (URIEL distance 0.62) and surfaced Igbo (0.18) as the closest candidate
Set a script policy that protects tone diacritics from being stripped
Ran an ethics seed that gates all future data recommendations
Wrote structured state to workspace_state.md for cross-session continuity

All of this from a single natural-language prompt — "help me build an LLM for Yoruba".
