MAGIC Agent Skills is now open source! Star on GitHub
MAGIC Agent SkillsMAGIC Agent Skills
Skills

linguistic-tokenize

Audit tokenizer fertility for a target language and recommend the right SentencePiece configuration, vocab-extension strategy, and byte-fallback policy. A fertility ratio above 2.0× versus English means every downstream cost — latency, context window, training compute — is multiplied proportionally.

Overview

Tokenizer fertility is the single most impactful early decision in a low-resource LLM project. A Class 2 language like Yoruba on tiktoken-cl100k_base hits a fertility ratio of ~2.4× — every token costs 2.4× more context, generation, and training compute than English. linguistic-tokenize makes this cost explicit and recommends the lowest-cost path to fix it before any training begins.

Pipeline Position

Phase: Scope (Phase 0) / Acquire (Phase 1)

Before this skill: linguistic-scope (resource class, typology), linguistic-scripts (normalization policy)

After this skill: linguistic-corpus (data prep), linguistic-transfer (vocab extension aligns with adapter choice), linguistic-eval (measuring on a benchmark)

When It Activates

  • Any new language being added to an existing pretrained model
  • Symptoms of high tokenizer fertility: slow generation, context-window blowup, OOV
  • Choosing between training a new tokenizer or extending an existing one
  • Selecting SentencePiece config (BPE vs Unigram, coverage threshold, byte fallback)
  • Picking a vocab-extension method (FOCUS / OFA / HyperOfa / full retrain)

When NOT to use: The language is well-covered by the existing tokenizer (fertility ≤ 2.0× vs English baseline) and scope/scripts have already validated the approach.

What It Does

Fertility Audit

Computes tokens-per-word for the target language vs the English baseline (~1.4 for tiktoken-cl100k_base on Wikipedia English).

RatioVerdict
≤ 1.5×No action needed
1.5–2.0×Optional vocab extension
2.0–3.0×Vocab extension recommended
≥ 3.0×Vocab extension mandatory

Vocab Extension Method Selection

SituationRecommendation
Class 3–4 + fertility 2.0–3.0× + same-family sourceContinued pretraining + Unigram coverage tuning
Class 2–3 + fertility ≥ 3.0× + parallel data existsOFA vocab extension + LoRA
Class 1–2 + fertility ≥ 3.0× + minimal target dataHyperOfa + LoRA + multilingual base
Class 0–1 + no usable baseFull retrain (last resort)

FOCUS works when source and target share script and high token overlap. OFA works when parallel data exists. HyperOfa works when data is minimal and the pair is typologically distant.

SentencePiece Configuration

ParameterLatin/CyrillicIndic/AbugidaHan/CJK
model_typeunigramunigramunigram
vocab_size32–64K32–64K64–128K
character_coverage0.99990.999990.99999
byte_fallbacktrue (Class 0–2: mandatory)TRUETRUE
split_digitstruetruetrue

character_coverage=0.9995 is too low for ideographic/abugida scripts — it drops semantically-significant rare characters. Use 0.99999+ for Han, Devanagari family, Arabic, Khmer, Myanmar, Tibetan.

Byte Fallback Policy

Byte fallback is mandatory for Class 0–2 languages. Without it, the first OOD test produces <unk> cascades. Every modern tokenizer supports it — always turn it on for low-resource work.

Inputs & Outputs

InputDescription
Target language + tokenizer nameFor fertility computation
Joshi class from scopeFor approach selection
URIEL distance from scopeFor source overlap estimation
OutputDescription
Fertility ratioTarget / English baseline
Extension verdictNO ACTION / OPTIONAL / RECOMMENDED / MANDATORY
Recommended methodFOCUS / OFA / HyperOfa / Full retrain
SentencePiece configKey parameters
Byte fallback decisionON / OFF with rationale
workspace_state.md entryTokenizer plan for downstream skills

Example Usage

Language: Yoruba (yor), tokenizer: tiktoken-cl100k_base

Tokenizer Plan: Yoruba
- Fertility: 3.4 / 1.4 = 2.43×
- Verdict: EXTEND MANDATORY
- Method: OFA vocab extension (parallel data via Bible-NLP + OPUS)
- SentencePiece: model_type=unigram, vocab_size=32K,
    character_coverage=0.9999, byte_fallback=true, split_digits=true
- Byte fallback: ON — Class 2
- BPE-dropout: 0.1 for general generation; 0.0 for structured output
- Estimated cost reduction: latency ×0.55, training-tokens ×0.58
Was this page helpful?
Edit on GitHub

Last updated on

On this page