linguistic-tokenize
Audit tokenizer fertility for a target language and recommend the right SentencePiece configuration, vocab-extension strategy, and byte-fallback policy. A fertility ratio above 2.0× versus English means every downstream cost — latency, context window, training compute — is multiplied proportionally.
Overview
Tokenizer fertility is the single most impactful early decision in a low-resource LLM project. A Class 2 language like Yoruba on tiktoken-cl100k_base hits a fertility ratio of ~2.4×: the same text needs about 2.4× as many tokens as English, so context usage, generation latency, and training compute all grow by roughly that factor. linguistic-tokenize makes this cost explicit and recommends the lowest-cost way to fix it before any training begins.
Pipeline Position
Phase: Scope (Phase 0) / Acquire (Phase 1)
Before this skill: linguistic-scope (resource class, typology), linguistic-scripts (normalization policy)
After this skill: linguistic-corpus (data prep), linguistic-transfer (vocab extension aligns with adapter choice), linguistic-eval (measuring on a benchmark)
When It Activates
- Any new language being added to an existing pretrained model
- Symptoms of high tokenizer fertility: slow generation, context-window blowup, OOV
- Choosing between training a new tokenizer or extending an existing one
- Selecting SentencePiece config (BPE vs Unigram, coverage threshold, byte fallback)
- Picking a vocab-extension method (FOCUS / OFA / HyperOfa / full retrain)
When NOT to use: The language is well-covered by the existing tokenizer (fertility ≤ 2.0× vs English baseline) and scope/scripts have already validated the approach.
What It Does
Fertility Audit
Computes tokens-per-word for the target language vs the English baseline (~1.4 for tiktoken-cl100k_base on Wikipedia English).
| Ratio | Verdict |
|---|---|
| ≤ 1.5× | No action needed |
| 1.5–2.0× | Optional vocab extension |
| 2.0–3.0× | Vocab extension recommended |
| ≥ 3.0× | Vocab extension mandatory |
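The audit above can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: the function names are hypothetical, words are counted by whitespace splitting (an assumption that does not hold for unsegmented scripts such as Chinese or Thai), and boundary values shared by two table rows (2.0× and 3.0×) are assigned to the stricter verdict. The `encode` argument is any callable that returns a token sequence, for example the `encode` method of a tiktoken or SentencePiece tokenizer.

```python
from typing import Callable, Iterable, Sequence


def tokens_per_word(texts: Iterable[str], encode: Callable[[str], Sequence]) -> float:
    """Average token count per whitespace-delimited word over a text sample."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_words += len(text.split())  # assumption: whitespace marks word boundaries
    if n_words == 0:
        raise ValueError("no words found in the sample")
    return n_tokens / n_words


def fertility_ratio(target_tpw: float, english_tpw: float = 1.4) -> float:
    """Target tokens-per-word divided by the English baseline (~1.4 per the text)."""
    return target_tpw / english_tpw


def verdict(ratio: float) -> str:
    """Map a fertility ratio onto the verdict table.

    Assumption: a ratio exactly on a shared boundary takes the stricter verdict,
    except 1.5, which the table explicitly lists under "no action".
    """
    if ratio >= 3.0:
        return "MANDATORY"
    if ratio >= 2.0:
        return "RECOMMENDED"
    if ratio > 1.5:
        return "OPTIONAL"
    return "NO ACTION"
```

With the Yoruba figures used on this page, `fertility_ratio(3.4)` is about 2.43, which the table places in the 2.0–3.0× band.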
Vocab Extension Method Selection
| Situation | Recommendation |
|---|---|
| Class 3–4 + fertility 2.0–3.0× + same-family source | Continued pretraining + Unigram coverage tuning |
| Class 2–3 + fertility ≥ 3.0× + parallel data exists | OFA vocab extension + LoRA |
| Class 1–2 + fertility ≥ 3.0× + minimal target data | HyperOfa + LoRA + multilingual base |
| Class 0–1 + no usable base | Full retrain (last resort) |
FOCUS works when source and target share script and high token overlap. OFA works when parallel data exists. HyperOfa works when data is minimal and the pair is typologically distant.
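The selection table can be read as a rule list checked top to bottom. The sketch below is one hypothetical encoding of it: the function name, the boolean flags, and the fall-through message are all assumptions made for illustration. It deliberately returns a "no matching row" result for combinations the table does not cover rather than guessing.

```python
def recommend_method(
    joshi_class: int,
    fertility: float,
    *,
    same_family_source: bool = False,
    parallel_data: bool = False,
    minimal_target_data: bool = False,
    usable_base: bool = True,
) -> str:
    """Return the table row that matches, checking rows in table order."""
    # Row 1: Class 3-4, fertility 2.0-3.0x, same-family source model.
    if joshi_class in (3, 4) and 2.0 <= fertility < 3.0 and same_family_source:
        return "Continued pretraining + Unigram coverage tuning"
    # Row 2: Class 2-3, fertility >= 3.0x, parallel data exists.
    if joshi_class in (2, 3) and fertility >= 3.0 and parallel_data:
        return "OFA vocab extension + LoRA"
    # Row 3: Class 1-2, fertility >= 3.0x, minimal target data.
    if joshi_class in (1, 2) and fertility >= 3.0 and minimal_target_data:
        return "HyperOfa + LoRA + multilingual base"
    # Row 4: Class 0-1 with no usable base model.
    if joshi_class in (0, 1) and not usable_base:
        return "Full retrain (last resort)"
    # The table does not cover every combination; flag it for manual review.
    return "No matching row; review manually"
```

Keeping the fall-through explicit makes the gaps in the table visible, for example a Class 2 language with fertility between 2.0× and 3.0×.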
SentencePiece Configuration
| Parameter | Latin/Cyrillic | Indic/Abugida | Han/CJK |
|---|---|---|---|
| model_type | unigram | unigram | unigram |
| vocab_size | 32–64K | 32–64K | 64–128K |
| character_coverage | 0.9999 | 0.99999 | 0.99999 |
| byte_fallback | true (Class 0–2: mandatory) | true | true |
| split_digits | true | true | true |
character_coverage=0.9995 (the SentencePiece default) is too low for ideographic and abugida scripts: it drops rare but semantically significant characters. Use 0.99999 or higher for Han, the Devanagari family, Arabic, Khmer, Myanmar, and Tibetan.
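As a rough sketch, the configuration table can be turned into keyword arguments for the SentencePiece Python trainer. The preset names, helper functions, and file paths here are hypothetical, and the vocab sizes take the lower end of each range in the table (with 32K read as 32000, which is an assumption). The `sentencepiece` package is imported only inside `train_tokenizer`, so the presets can be inspected without it installed.

```python
# Per-script presets taken from the configuration table (lower end of each range).
SCRIPT_PRESETS = {
    "latin_cyrillic": {"vocab_size": 32000, "character_coverage": 0.9999},
    "indic_abugida": {"vocab_size": 32000, "character_coverage": 0.99999},
    "han_cjk": {"vocab_size": 64000, "character_coverage": 0.99999},
}


def sentencepiece_kwargs(script_family: str, corpus_path: str, model_prefix: str) -> dict:
    """Build trainer arguments for one script family (hypothetical helper)."""
    preset = SCRIPT_PRESETS[script_family]
    return {
        "input": corpus_path,          # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",
        "byte_fallback": True,
        "split_digits": True,
        **preset,
    }


def train_tokenizer(kwargs: dict) -> None:
    """Run SentencePiece training; requires the third-party sentencepiece package."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**kwargs)
```

A call such as `train_tokenizer(sentencepiece_kwargs("latin_cyrillic", "corpus.txt", "yor_unigram"))` would train a model, assuming `corpus.txt` exists and is large enough to support the requested vocabulary size.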
Byte Fallback Policy
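Byte fallback is mandatory for Class 0–2 languages. Without it, the first out-of-domain input containing characters absent from the vocabulary produces cascades of <unk> tokens. Most modern tokenizers support it (in SentencePiece, byte_fallback=true), so always turn it on for low-resource work.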
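To make the effect concrete, here is a toy character-level illustration of what byte fallback does. It is not how a real subword tokenizer segments text; the function names and the tiny vocabulary are hypothetical. The `<0xNN>` piece format follows the naming SentencePiece uses for its byte pieces.

```python
from typing import List, Set


def byte_fallback_pieces(char: str) -> List[str]:
    """Decompose one character into UTF-8 byte pieces such as <0xE1>."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


def encode_with_fallback(text: str, vocab: Set[str], byte_fallback: bool = True) -> List[str]:
    """Toy character-level encoder: known characters pass through unchanged,
    unknown ones become byte pieces (fallback on) or <unk> (fallback off)."""
    pieces: List[str] = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        elif byte_fallback:
            pieces.extend(byte_fallback_pieces(ch))
        else:
            pieces.append("<unk>")
    return pieces


vocab = set("abc")
text = "a\u1ecd"  # "a" followed by U+1ECD (o with dot below, used in Yoruba)
print(encode_with_fallback(text, vocab))                       # byte pieces, lossless
print(encode_with_fallback(text, vocab, byte_fallback=False))  # information lost
```

With fallback on, the unknown character costs three tokens but can be decoded back exactly; with fallback off, it collapses to a single `<unk>` and the original character is unrecoverable.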
Inputs & Outputs
| Input | Description |
|---|---|
| Target language + tokenizer name | For fertility computation |
| Joshi class from scope | For approach selection |
| URIEL distance from scope | For source overlap estimation |
| Output | Description |
|---|---|
| Fertility ratio | Target / English baseline |
| Extension verdict | NO ACTION / OPTIONAL / RECOMMENDED / MANDATORY |
| Recommended method | FOCUS / OFA / HyperOfa / Full retrain |
| SentencePiece config | Key parameters |
| Byte fallback decision | ON / OFF with rationale |
| workspace_state.md entry | Tokenizer plan for downstream skills |
Example Usage
Language: Yoruba (yor), tokenizer: tiktoken-cl100k_base
Tokenizer Plan: Yoruba
- Fertility: 3.4 / 1.4 = 2.43×
- Verdict: RECOMMENDED (2.43× falls in the 2.0–3.0× band)
- Method: OFA vocab extension (parallel data via Bible-NLP + OPUS)
- SentencePiece: model_type=unigram, vocab_size=32K, character_coverage=0.9999, byte_fallback=true, split_digits=true
- Byte fallback: ON — Class 2
- BPE-dropout: 0.1 for general generation; 0.0 for structured output
- Estimated cost reduction: latency ×0.55, training-tokens ×0.58
Related Skills
- linguistic-scope: provides Joshi class and typology for approach selection
- linguistic-scripts: normalization policy applied pre-tokenize
- linguistic-corpus: data prep in flight alongside tokenizer work
- linguistic-transfer: vocab extension method must align with adapter choice
Related Skills from Other Suites
- Data Cleaning — text normalization patterns