linguistic-scripts
Decide Unicode normalization policy, detect script confusables, protect diacritics, and recommend romanization for the target language. Silent script issues compound through every downstream step — this skill must run before tokenizer training, deduplication, or bitext mining.
Overview
Script policy is the invisible infrastructure of any NLP pipeline. Choose the wrong normalization and 5–15% of your corpus silently duplicates. Strip diacritics from a tone language and word-level meaning collapses. Apply NFKC to Arabic presentation forms and rendering breaks downstream. linguistic-scripts makes the right policy explicit and records it in workspace_state.md so every subsequent skill applies it consistently.
Pipeline Position
Phase: Scope (Phase 0) — runs immediately after linguistic-scope
Before this skill: linguistic-scope (to know the target script)
After this skill: linguistic-corpus (normalization at Acquire), linguistic-tokenize (pre-tokenize NFC), linguistic-bitext (pre-dedup confusable folding)
When It Activates
- Setting normalization policy (NFC vs NFKC) for a target script
- Diagnosing garbled output, unknown glyphs, mojibake
- Pre-deduplication confusable folding for mixed-script corpora
- Romanization / transliteration table selection
- Diacritic restoration or preservation for tone languages
- Multi-script corpus handling (mixed Devanagari + Latin, Kazakh Cyrillic + Latin)
When NOT to use: The script policy is already in workspace_state.md and downstream specialists are following it.
What It Does
NFC vs NFKC Policy
NFC (Canonical Composition) is the safe default — it collapses canonically-equivalent forms with no semantic loss. NFKC additionally collapses compatibility characters (ligatures, superscripts, full-width forms) and is destructive for many scripts.
| Script | Recommended | Rationale |
|---|---|---|
| Modern Latin, Cyrillic | NFC | Safe; no compatibility chars expected |
| Devanagari, Bengali, Tamil, other Indic scripts | NFC | NEVER NFKC; it can destroy some conjuncts |
| Arabic, Hebrew | NFC | NEVER NFKC — Arabic presentation forms carry rendering info |
| CJK ideographic | NFC | NFKC collapses full-width ASCII; usually not desired |
| Historical text (long-s, fi/fl ligatures) | NFC | NFKC destroys long-s distinction |
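As a quick illustration of the distinction in the table above, the following sketch uses Python's standard `unicodedata` module to compare the two forms on a few hand-picked code points. The sample names and the `compare` helper are illustrative assumptions, not part of the skill itself.

```python
import unicodedata

# Hand-picked samples (illustrative, not exhaustive).
SAMPLES = {
    "decomposed e-acute": "e\u0301",       # 'e' + COMBINING ACUTE ACCENT
    "fi ligature": "\ufb01",               # LATIN SMALL LIGATURE FI
    "long s": "\u017f",                    # LATIN SMALL LETTER LONG S
    "full-width A": "\uff21",              # FULLWIDTH LATIN CAPITAL LETTER A
    "Arabic beh, initial form": "\ufe91",  # ARABIC LETTER BEH INITIAL FORM
}

def compare(text: str) -> dict:
    """Return the NFC and NFKC forms of `text` and whether each one changed it."""
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    return {
        "nfc": nfc,
        "nfkc": nfkc,
        "nfc_changed": nfc != text,
        "nfkc_changed": nfkc != text,
    }

if __name__ == "__main__":
    for name, text in SAMPLES.items():
        result = compare(text)
        print(f"{name}: NFC changed={result['nfc_changed']}, "
              f"NFKC changed={result['nfkc_changed']}")
```

On these samples, NFC only touches the first one (it composes to U+00E9), while NFKC rewrites all five: the ligature becomes `fi`, long s becomes `s`, the full-width letter becomes ASCII `A`, and the presentation form becomes the base letter U+0628.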
TR39 Confusable Folding
Apply Unicode TR39 skeleton mapping before deduplication and bitext alignment. Without it, Cyrillic "а" (U+0430) and Latin "a" (U+0061) are treated as distinct characters, so the same text can appear twice under different scripts and 5–15% of the corpus may end up duplicated. Confusable folding produces a dedup key only; never store the fold output as canonical text.
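The "dedup key only" rule can be sketched as follows. This is a minimal illustration, not a TR39 implementation: `TOY_CONFUSABLES` is a tiny hand-written subset standing in for the full `confusables.txt` data, and `dedup_key` and `dedup` are hypothetical helper names. The NFD, map, NFD sequence follows the shape of the TR39 skeleton operation.

```python
import unicodedata

# Illustrative subset only; real TR39 folding uses the full confusables.txt data.
TOY_CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}

def dedup_key(text: str) -> str:
    """Skeleton-style key: NFD, map confusables to a prototype, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    folded = "".join(TOY_CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", folded)

def dedup(lines):
    """Drop lines whose key was already seen; keep the ORIGINAL text of the first hit."""
    seen = set()
    kept = []
    for line in lines:
        key = dedup_key(line)
        if key not in seen:
            seen.add(key)
            kept.append(line)  # store the original text, never the folded key
    return kept
```

Here the Latin string `pace` and the all-Cyrillic look-alike `расе` share one key, so `dedup` keeps only whichever arrives first, and what it keeps is the unmodified input line.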
Diacritic Preservation for Tone Languages
For Yoruba, Vietnamese, Hausa, Mandarin pinyin, Igbo, and Twi, diacritic stripping is catastrophic data corruption. The skill blocks any pipeline calling unidecode() or stripping combining marks on these languages.
| Language | Diacritic Role | Stripping Cost |
|---|---|---|
| Yoruba | High/low tone (á/à), vowel quality via underdot (ẹ/ọ) | Word-level meaning loss |
| Vietnamese | 6 tones (á/à/ả/ã/ạ + base) | Random meaning soup |
| Hausa | High/low tone marking | Verb/noun ambiguity |
| Mandarin pinyin | 4 tones (mā/má/mǎ/mà) | Homophone collapse: all four reduce to "ma" |
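To make the stripping cost concrete, the sketch below shows what removing combining marks does to tone-marked text, along with one possible guard that rejects a transformation when marks have been lost. The `TONE_LANGS` code set, the function names, and the mark-counting heuristic are all assumptions made for illustration; the skill's actual blocking mechanism is not specified here.

```python
import unicodedata

# Hypothetical set of language codes to protect (an assumption for this sketch).
TONE_LANGS = {"yor", "vie", "hau", "ibo", "twi"}

def count_marks(text: str) -> int:
    """Count combining marks after canonical decomposition (NFD)."""
    nfd = unicodedata.normalize("NFD", text)
    return sum(1 for ch in nfd if unicodedata.combining(ch))

def strip_marks(text: str) -> str:
    """The destructive operation the guard is meant to catch."""
    nfd = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in nfd if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

def check_diacritics_preserved(lang: str, before: str, after: str) -> str:
    """Raise if a transformation lost combining marks for a protected language."""
    if lang in TONE_LANGS and count_marks(after) < count_marks(before):
        raise ValueError(f"diacritics lost for tone language {lang!r}")
    return after

if __name__ == "__main__":
    # Vietnamese 'má' and 'mà' become indistinguishable once marks are stripped.
    print(strip_marks("m\u00e1") == strip_marks("m\u00e0"))
```

A pipeline step could wrap its output in `check_diacritics_preserved(lang, raw, cleaned)` so that a stray diacritic-stripping call fails loudly instead of corrupting the corpus silently.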
Romanization Scheme Selection
| Script | Recommended | Reversible? |
|---|---|---|
| Devanagari | IAST | Yes |
| Cyrillic (Russian) | ISO 9 / GOST 7.79 | Yes |
| Arabic | ALA-LC or DIN 31635 | Approx |
| Han (Mandarin) | Pinyin (with tones) | No |
| Hangul | Revised Romanization (2000) | Approx (pronunciation-based, not letter-for-letter) |
| Kana | Hepburn | Approx (じ and ぢ both become "ji") |
Inputs & Outputs
| Input | Description |
|---|---|
| Target language + script from scope | Used to determine block(s) and policy |
| Corpus samples (optional) | For confusable detection and encoding diagnosis |

| Output | Description |
|---|---|
| Normalization policy | NFC/NFKC/NFD decision with rationale |
| Diacritic policy | PRESERVE / OPTIONAL / STRIPPABLE |
| Romanization scheme | Name + reversibility flag |
| Confusable risk level | LOW / MEDIUM / HIGH |
| ZWJ/ZWNJ policy | NORMALIZE / PRESERVE |
| workspace_state.md entry | Script policy record applied by all downstream skills |
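One way to picture the outputs listed above is as a small validated record. The sketch below is an assumption about shape only: the class name `ScriptPolicy`, its field names, and the rendered layout (modelled on the Yoruba example in this document) are hypothetical, and the real format of the workspace_state.md entry may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values come from the outputs table; the field names are assumptions.
ALLOWED = {
    "normalization": {"NFC", "NFKC", "NFD"},
    "diacritics": {"PRESERVE", "OPTIONAL", "STRIPPABLE"},
    "confusable_risk": {"LOW", "MEDIUM", "HIGH"},
    "zwj_zwnj": {"NORMALIZE", "PRESERVE"},
}

@dataclass(frozen=True)
class ScriptPolicy:
    language: str
    primary_script: str
    normalization: str
    diacritics: str
    romanization: str
    confusable_risk: str
    zwj_zwnj: str
    romanization_reversible: Optional[bool] = None  # None when not applicable

    def __post_init__(self):
        # Reject values outside the enumerations in the outputs table.
        for field_name, choices in ALLOWED.items():
            value = getattr(self, field_name)
            if value not in choices:
                raise ValueError(f"{field_name}={value!r} not in {sorted(choices)}")

    def to_markdown(self) -> str:
        """Render the record as a bullet list for a state file."""
        roman = self.romanization
        if self.romanization_reversible is not None:
            flag = "yes" if self.romanization_reversible else "no"
            roman += f" (reversible: {flag})"
        return "\n".join([
            f"Script Policy: {self.language}",
            f"- Primary script: {self.primary_script}",
            f"- Normalization: {self.normalization}",
            f"- Diacritics: {self.diacritics}",
            f"- Romanization: {roman}",
            f"- Confusable risk: {self.confusable_risk}",
            f"- ZWJ/ZWNJ: {self.zwj_zwnj}",
        ])
```

Validating at construction time means a typo such as `"NFCK"` is caught when the policy is recorded rather than when a downstream skill tries to apply it.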
Example Usage
Language: Yoruba (yor)
Script Policy: Yoruba
- Primary script: Latin (U+0000-U+007F + combining diacritics)
- Normalization: NFC
- Diacritics: PRESERVE (tone language — á/à/ọ̀ carry word meaning)
- Romanization: N/A (already Latin-script)
- Confusable risk: LOW
- ZWJ/ZWNJ: NORMALIZE
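The order in which such a policy is applied (BOM strip, then NFC, then diacritic validation, then a TR39 fold used only as a dedup key) might look roughly like the sketch below. The `prepare` and `has_orphan_mark` names are hypothetical, the "diacritic validation" step is assumed here to mean rejecting combining marks that have no base letter, and `fold` is a placeholder parameter for whatever TR39 skeleton function the pipeline provides (it defaults to the identity).

```python
import unicodedata
from typing import Callable, Tuple

BOM = "\ufeff"

def has_orphan_mark(text: str) -> bool:
    """True if a combining mark starts the text or directly follows whitespace."""
    previous = " "
    for ch in text:
        if unicodedata.combining(ch) and previous.isspace():
            return True
        previous = ch
    return False

def prepare(raw: str, fold: Callable[[str], str] = lambda s: s) -> Tuple[str, str]:
    """Return (canonical_text, dedup_key) following the stated apply order."""
    # 1. BOM strip
    text = raw[1:] if raw.startswith(BOM) else raw
    # 2. NFC
    canonical = unicodedata.normalize("NFC", text)
    # 3. Diacritic validation (assumed check: no marks without a base letter)
    if has_orphan_mark(canonical):
        raise ValueError("combining mark without a base character")
    # 4. Fold for the dedup key only; the canonical text is returned unfolded
    return canonical, fold(canonical)
```

Returning the canonical text and the key as separate values keeps the folded form from being written back into the corpus by accident.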
Apply order: BOM strip → NFC → diacritic validation → TR39 fold (dedup key only)
Related Skills
- linguistic-scope: provides the target script identity
- linguistic-corpus: applies normalization at acquire time
- linguistic-tokenize: uses NFC as pre-tokenize baseline
- linguistic-bitext: applies confusable folding before alignment