linguistic-scripts

Decide Unicode normalization policy, detect script confusables, protect diacritics, and recommend romanization for the target language. Silent script issues compound through every downstream step — this skill must run before tokenizer training, deduplication, or bitext mining.

Overview

Script policy is the invisible infrastructure of any NLP pipeline. Choose the wrong normalization and 5–15% of your corpus silently duplicates. Strip diacritics from a tone language and word-level meaning collapses. Apply NFKC to Arabic presentation forms and rendering breaks downstream. linguistic-scripts makes the right policy explicit and records it in workspace_state.md so every subsequent skill applies it consistently.

Pipeline Position

Phase: Scope (Phase 0) — runs immediately after linguistic-scope

Before this skill: linguistic-scope (to know the target script)

After this skill: linguistic-corpus (normalization at Acquire), linguistic-tokenize (pre-tokenize NFC), linguistic-bitext (pre-dedup confusable folding)

When It Activates

Setting normalization policy (NFC vs NFKC) for a target script
Diagnosing garbled output, unknown glyphs, mojibake
Pre-deduplication confusable folding for mixed-script corpora
Romanization / transliteration table selection
Diacritic restoration or preservation for tone languages
Multi-script corpus handling (mixed Devanagari + Latin, Kazakh Cyrillic + Latin)

When NOT to use: The script policy is already in workspace_state.md and downstream specialists are following it.

What It Does

NFC vs NFKC Policy

NFC (canonical decomposition followed by canonical composition) is the safe default: it unifies canonically equivalent sequences, such as e + U+0301 and precomposed é, with no semantic loss. NFKC additionally folds compatibility characters (ligatures, superscripts, full-width forms) into their plain counterparts, which is lossy and destructive for many scripts.

Script	Recommended	Rationale
Modern Latin, Cyrillic	NFC	Safe; no compatibility chars expected
Devanagari, Bengali, Tamil, Indic	NFC	NEVER NFKC — destroys some conjuncts
Arabic, Hebrew	NFC	NEVER NFKC — Arabic presentation forms carry rendering info
CJK ideographic	NFC	NFKC collapses full-width ASCII; usually not desired
Historical text (long-s, fi/fl ligatures)	NFC	NFKC destroys long-s distinction
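The difference between the two forms can be checked directly with Python's standard `unicodedata` module. The sketch below is illustrative only; the sample characters and the helper names `compare` and `codepoints` are assumptions chosen for this example, not part of the skill.

```python
import unicodedata

# Sample inputs (chosen for illustration) covering rows of the table above.
SAMPLES = {
    "decomposed e-acute": "e\u0301",       # 'e' + COMBINING ACUTE ACCENT
    "fi ligature": "\ufb01",               # LATIN SMALL LIGATURE FI
    "long s": "\u017f",                    # LATIN SMALL LETTER LONG S
    "full-width A": "\uff21",              # FULLWIDTH LATIN CAPITAL LETTER A
    "Arabic presentation form": "\ufe8d",  # ARABIC LETTER ALEF ISOLATED FORM
}


def codepoints(text: str) -> str:
    """Render a string as a space-separated list of code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def compare(raw: str) -> tuple[str, str]:
    """Return the (NFC, NFKC) normalizations of the same input."""
    return (
        unicodedata.normalize("NFC", raw),
        unicodedata.normalize("NFKC", raw),
    )


for label, raw in SAMPLES.items():
    nfc, nfkc = compare(raw)
    print(f"{label:26} NFC={codepoints(nfc):8} NFKC={codepoints(nfkc)}")
```

Only the first sample changes under NFC (it composes to U+00E9). The other four pass through NFC untouched and are rewritten by NFKC, which is the information loss the table warns about.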

TR39 Confusable Folding

Apply the Unicode TR39 (UTS #39) skeleton mapping before deduplication and bitext alignment. Without it, Cyrillic "а" (U+0430) and Latin "a" (U+0061) are treated as distinct characters, so visually identical lines survive deduplication; in mixed-script corpora this can leave 5–15% of the corpus duplicated across scripts. The folded skeleton is a dedup key only; never store it as canonical text, because folding is lossy and rewrites legitimate Cyrillic.
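A minimal sketch of skeleton-based grouping follows. It assumes a tiny hand-picked mapping (`TOY_CONFUSABLES`) in place of the full confusables data file that a real pipeline would load, and the function names are hypothetical. The shape follows the TR39 skeleton idea: decompose, map each character to its prototype, decompose again.

```python
import unicodedata
from collections import defaultdict

# Toy subset for illustration only: a few Cyrillic letters and their Latin
# look-alikes. A real pipeline would load the complete confusables data.
TOY_CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def skeleton(text: str) -> str:
    """Fold look-alike characters to one prototype (dedup key only)."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(TOY_CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def group_by_skeleton(lines: list[str]) -> dict[str, list[str]]:
    """Group lines by skeleton while keeping every original string intact."""
    groups: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        groups[skeleton(line)].append(line)
    return dict(groups)


# "p\u0430ce" contains a Cyrillic 'а' and renders identically to "pace".
corpus = ["pace", "p\u0430ce", "peace"]
groups = group_by_skeleton(corpus)
for key, members in groups.items():
    print(key, [m.encode("unicode_escape").decode("ascii") for m in members])
```

The skeleton appears only as the dictionary key; the stored values are the untouched originals, which matches the rule that fold output is never written back as canonical text.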

Diacritic Preservation for Tone Languages

For Yoruba, Vietnamese, Hausa, Mandarin pinyin, Igbo, and Twi, diacritic stripping is catastrophic data corruption. The skill blocks any pipeline calling unidecode() or stripping combining marks on these languages.

Language	Diacritic Role	Stripping Cost
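Yoruba	High/low tone (á/à); underdot marks distinct letters (ẹ, ọ, ṣ)	Word-level meaning loss
Vietnamese	6 tones (á/à/ả/ã/ạ + unmarked), plus vowel-quality marks (â, ê, ô, ơ, ư, ă)	Distinct words collapse into one spelling
Hausa	High/low tone marking	Verb/noun ambiguity
Mandarin pinyin	4 tones (mā/má/mǎ/mà) plus neutral	Distinct syllables collapse into one spelling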
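To make the stripping cost concrete, the sketch below applies the forbidden operation to six Vietnamese syllables and shows a simple guard a pipeline could run around any transformation step. The names `strip_marks`, `count_marks`, and `check_diacritics_preserved` are hypothetical, and the guard is one possible reading of "blocks stripping", not the skill's actual mechanism.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """The destructive operation the policy forbids: drop nonspacing marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def count_marks(text: str) -> int:
    """Count nonspacing marks in the fully decomposed form of the text."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.category(ch) == "Mn")


def check_diacritics_preserved(before: str, after: str) -> None:
    """Raise if a pipeline step removed marks from a PRESERVE-policy text."""
    lost = count_marks(before) - count_marks(after)
    if lost > 0:
        raise ValueError(f"{lost} combining mark(s) lost; diacritic policy is PRESERVE")


# Six different Vietnamese words that differ only in tone.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
collapsed = {strip_marks(w) for w in words}
print(collapsed)  # all six words fold into the single spelling 'ma'

# NFC is safe (no marks are lost), so this call passes silently.
check_diacritics_preserved("má", unicodedata.normalize("NFC", "má"))
```

Counting marks on the NFD form makes the guard insensitive to whether the input was precomposed or decomposed, so it can wrap a normalization step without false alarms.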

Romanization Scheme Selection

Script	Recommended	Reversible?
Devanagari	IAST	Yes
Cyrillic (Russian)	ISO 9 / GOST 7.79	Yes
Arabic	ALA-LC or DIN 31635	Approx
Han (Mandarin)	Pinyin (with tones)	No
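Hangul	Revised Romanization (2000)	Approx (only its letter-by-letter transliteration variant round-trips)
Kana	Hepburn	Approx (じ/ぢ both become ji; hiragana vs katakana is lost)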
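The reversibility flag can be sanity-checked mechanically. The sketch below uses a small subset of the ISO 9 table for Russian and two Hepburn entries; both tables and all function names are illustrative assumptions, and a real check would run against the complete scheme. An injective character table is a necessary condition for a lossless round trip, so a collision is enough to rule a scheme out.

```python
# Subset of ISO 9 (one Cyrillic letter to one Latin letter), for illustration.
ISO9_SUBSET = {
    "\u043c": "m",       # м
    "\u0438": "i",       # и
    "\u0440": "r",       # р
    "\u0436": "\u017e",  # ж -> ž
    "\u0448": "\u0161",  # ш -> š
    "\u0449": "\u015d",  # щ -> ŝ
    "\u044f": "\u00e2",  # я -> â
}

# Two Hepburn entries that collide: both kana romanize to "ji".
HEPBURN_SUBSET = {
    "\u3058": "ji",  # じ
    "\u3062": "ji",  # ぢ
}


def is_injective(table: dict[str, str]) -> bool:
    """True when no two source characters share a romanized form."""
    return len(set(table.values())) == len(table)


def romanize(text: str, table: dict[str, str]) -> str:
    return "".join(table.get(ch, ch) for ch in text)


def deromanize(text: str, table: dict[str, str]) -> str:
    """Invert a one-to-one, single-character table."""
    reverse = {latin: source for source, latin in table.items()}
    return "".join(reverse.get(ch, ch) for ch in text)


def round_trips(text: str, table: dict[str, str]) -> bool:
    return deromanize(romanize(text, table), table) == text


word = "\u043c\u0438\u0440"  # мир
print(romanize(word, ISO9_SUBSET), round_trips(word, ISO9_SUBSET))
print(is_injective(ISO9_SUBSET), is_injective(HEPBURN_SUBSET))
```

The per-character inversion only works because every entry in the ISO 9 subset maps one letter to one letter; schemes with multi-letter outputs would need a tokenizing inverse.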

Inputs & Outputs

Input	Description
Target language + script from scope	Used to determine block(s) and policy
Corpus samples (optional)	For confusable detection and encoding diagnosis

Output	Description
Normalization policy	NFC/NFKC/NFD decision with rationale
Diacritic policy	PRESERVE / OPTIONAL / STRIPPABLE
Romanization scheme	Name + reversibility flag
Confusable risk level	LOW / MEDIUM / HIGH
ZWJ/ZWNJ policy	NORMALIZE / PRESERVE
`workspace_state.md` entry	Script policy record applied by all downstream skills

Example Usage

Language: Yoruba (yor)

Script Policy: Yoruba
- Primary script: Latin (Basic Latin U+0000-U+007F, precomposed letters such as á U+00E1 and ọ U+1ECD, plus combining marks U+0300-U+036F)
- Normalization: NFC
- Diacritics: PRESERVE (tone language — á/à/ọ̀ carry word meaning)
- Romanization: N/A (already Latin-script)
- Confusable risk: LOW
- ZWJ/ZWNJ: NORMALIZE
Apply order: BOM strip → NFC → diacritic validation → TR39 fold (dedup key only)
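The apply order can be sketched as one function. This is an illustration under stated assumptions: "diacritic validation" is interpreted here as rejecting combining marks that have no base letter, the TR39 fold is passed in as a caller-supplied `fold` function rather than implemented, and `Prepared`, `prepare`, and `has_orphan_mark` are hypothetical names.

```python
import unicodedata
from typing import Callable, NamedTuple


class Prepared(NamedTuple):
    canonical: str  # the text that gets stored
    dedup_key: str  # used only for duplicate detection, never stored as text


def has_orphan_mark(text: str) -> bool:
    """True if a nonspacing mark starts the text or follows whitespace."""
    prev = " "
    for ch in text:
        if unicodedata.category(ch) == "Mn" and prev.isspace():
            return True
        prev = ch
    return False


def prepare(raw: str, fold: Callable[[str], str]) -> Prepared:
    # 1. BOM strip
    text = raw[1:] if raw.startswith("\ufeff") else raw
    # 2. NFC
    text = unicodedata.normalize("NFC", text)
    # 3. Diacritic validation (assumed meaning: no mark without a base letter)
    if has_orphan_mark(text):
        raise ValueError("combining mark without a base letter")
    # 4. Confusable fold, producing the dedup key only
    return Prepared(canonical=text, dedup_key=fold(text))


# Usage with a placeholder fold that only maps Cyrillic 'а' to Latin 'a'.
toy_fold = lambda s: s.replace("\u0430", "a")
result = prepare("\ufeffo\u0323\u0300ko\u0323\u0300", toy_fold)
print(result.canonical.encode("unicode_escape").decode("ascii"))
```

Returning the canonical text and the dedup key as separate fields keeps the fold output from being written back over the stored text by accident.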

Related Skills

linguistic-scope: provides the target script identity
linguistic-corpus: applies normalization at acquire time
linguistic-tokenize: uses NFC as pre-tokenize baseline
linguistic-bitext: applies confusable folding before alignment
