linguistic-tokenize

Audit tokenizer fertility for a target language and recommend the right SentencePiece configuration, vocab-extension strategy, and byte-fallback policy. A fertility ratio above 2.0× versus English means every downstream cost — latency, context window, training compute — is multiplied proportionally.

Overview

Tokenizer fertility is the single most impactful early decision in a low-resource LLM project. A Class 2 language like Yoruba on tiktoken-cl100k_base hits a fertility ratio of ~2.4× — every token costs 2.4× more context, generation, and training compute than English. linguistic-tokenize makes this cost explicit and recommends the lowest-cost path to fix it before any training begins.

Pipeline Position

Phase: Scope (Phase 0) / Acquire (Phase 1)

Before this skill: linguistic-scope (resource class, typology), linguistic-scripts (normalization policy)

After this skill: linguistic-corpus (data prep), linguistic-transfer (vocab extension aligns with adapter choice), linguistic-eval (measuring on a benchmark)

When It Activates

Any new language being added to an existing pretrained model
Symptoms of high tokenizer fertility: slow generation, context-window blowup, OOV
Choosing between training a new tokenizer or extending an existing one
Selecting SentencePiece config (BPE vs Unigram, coverage threshold, byte fallback)
Picking a vocab-extension method (FOCUS / OFA / HyperOfa / full retrain)

When NOT to use: The language is well-covered by the existing tokenizer (fertility ≤ 2.0× vs English baseline) and scope/scripts have already validated the approach.

What It Does

Fertility Audit

Computes tokens-per-word for the target language vs the English baseline (~1.4 for tiktoken-cl100k_base on Wikipedia English).

Ratio	Verdict
≤ 1.5×	No action needed
1.5–2.0×	Optional vocab extension
2.0–3.0×	Vocab extension recommended
≥ 3.0×	Vocab extension mandatory

Vocab Extension Method Selection

Situation	Recommendation
Class 3–4 + fertility 2.0–3.0× + same-family source	Continued pretraining + Unigram coverage tuning
Class 2–3 + fertility ≥ 3.0× + parallel data exists	OFA vocab extension + LoRA
Class 1–2 + fertility ≥ 3.0× + minimal target data	HyperOfa + LoRA + multilingual base
Class 0–1 + no usable base	Full retrain (last resort)

FOCUS works when source and target share script and high token overlap. OFA works when parallel data exists. HyperOfa works when data is minimal and the pair is typologically distant.

SentencePiece Configuration

Parameter	Latin/Cyrillic	Indic/Abugida	Han/CJK
`model_type`	unigram	unigram	unigram
`vocab_size`	32–64K	32–64K	64–128K
`character_coverage`	0.9999	0.99999	0.99999
`byte_fallback`	true (Class 0–2: mandatory)	TRUE	TRUE
`split_digits`	true	true	true

character_coverage=0.9995 is too low for ideographic/abugida scripts — it drops semantically-significant rare characters. Use 0.99999+ for Han, Devanagari family, Arabic, Khmer, Myanmar, Tibetan.

Byte Fallback Policy

Byte fallback is mandatory for Class 0–2 languages. Without it, the first OOD test produces <unk> cascades. Every modern tokenizer supports it — always turn it on for low-resource work.

Inputs & Outputs

Input	Description
Target language + tokenizer name	For fertility computation
Joshi class from scope	For approach selection
URIEL distance from scope	For source overlap estimation

Output	Description
Fertility ratio	Target / English baseline
Extension verdict	NO ACTION / OPTIONAL / RECOMMENDED / MANDATORY
Recommended method	FOCUS / OFA / HyperOfa / Full retrain
SentencePiece config	Key parameters
Byte fallback decision	ON / OFF with rationale
`workspace_state.md` entry	Tokenizer plan for downstream skills

Example Usage

Language: Yoruba (yor), tokenizer: tiktoken-cl100k_base

Tokenizer Plan: Yoruba
- Fertility: 3.4 / 1.4 = 2.43×
- Verdict: EXTEND MANDATORY
- Method: OFA vocab extension (parallel data via Bible-NLP + OPUS)
- SentencePiece: model_type=unigram, vocab_size=32K,
    character_coverage=0.9999, byte_fallback=true, split_digits=true
- Byte fallback: ON — Class 2
- BPE-dropout: 0.1 for general generation; 0.0 for structured output
- Estimated cost reduction: latency ×0.55, training-tokens ×0.58

linguistic-scope — provides Joshi class and typology for approach selection
linguistic-scripts — normalization policy applied pre-tokenize
linguistic-corpus — data prep in flight alongside tokenizer work
linguistic-transfer — vocab extension method must align with adapter choice

Data Cleaning — text normalization patterns

Was this page helpful?

On this page