MAGIC Agent Skills

linguistic-scripts

Decide Unicode normalization policy, detect script confusables, protect diacritics, and recommend romanization for the target language. Silent script issues compound through every downstream step — this skill must run before tokenizer training, deduplication, or bitext mining.

Overview

Script policy is the invisible infrastructure of any NLP pipeline. Choose the wrong normalization and 5–15% of your corpus can silently duplicate. Strip diacritics from a tone language and word-level meaning collapses. Apply NFKC to Arabic presentation forms and they are replaced by their base letters, which breaks rendering downstream. linguistic-scripts makes the right policy explicit and records it in workspace_state.md so every subsequent skill applies it consistently.

Pipeline Position

Phase: Scope (Phase 0) — runs immediately after linguistic-scope

Before this skill: linguistic-scope (to know the target script)

After this skill: linguistic-corpus (normalization at Acquire), linguistic-tokenize (pre-tokenize NFC), linguistic-bitext (pre-dedup confusable folding)

When It Activates

  • Setting normalization policy (NFC vs NFKC) for a target script
  • Diagnosing garbled output, unknown glyphs, mojibake
  • Pre-deduplication confusable folding for mixed-script corpora
  • Romanization / transliteration table selection
  • Diacritic restoration or preservation for tone languages
  • Multi-script corpus handling (mixed Devanagari + Latin, Kazakh Cyrillic + Latin)
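For the mojibake case, one common cause is UTF-8 bytes that were decoded as Latin-1 somewhere upstream. The sketch below shows one way to test for and undo that single failure mode; the function name `repair_mojibake` is hypothetical and is not part of the skill, and other encoding mix-ups need different handling.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mis-decoded as Latin-1 (assumed failure mode).

    Returns the input unchanged when the round trip is not possible, which
    is the case for text that was decoded correctly in the first place.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake("caf\u00c3\u00a9"))  # mis-decoded "café"
print(repair_mojibake("caf\u00e9"))        # already correct, left alone
```

Running the repair before normalization matters: NFC cannot fix text whose code points are already wrong.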

When NOT to use: The script policy is already in workspace_state.md and downstream specialists are following it.

What It Does

NFC vs NFKC Policy

NFC (Canonical Composition) is the safe default — it collapses canonically-equivalent forms with no semantic loss. NFKC additionally collapses compatibility characters (ligatures, superscripts, full-width forms) and is destructive for many scripts.

| Script | Recommended | Rationale |
| --- | --- | --- |
| Modern Latin, Cyrillic | NFC | Safe; no compatibility characters expected |
| Devanagari, Bengali, Tamil, other Indic | NFC | NEVER NFKC: destroys some conjuncts |
| Arabic, Hebrew | NFC | NEVER NFKC: Arabic presentation forms carry rendering info |
| CJK ideographic | NFC | NFKC collapses full-width ASCII; usually not desired |
| Historical text (long-s, fi/fl ligatures) | NFC | NFKC destroys the long-s distinction |
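The difference between the two forms is easy to check with Python's standard `unicodedata` module. The following is a minimal sketch, not the skill's own implementation; the sample strings and the helper name `compare` are chosen here for illustration.

```python
import unicodedata

# Illustrative samples: one canonical case and four compatibility cases.
SAMPLES = {
    "decomposed e-acute": "e\u0301",       # e + COMBINING ACUTE ACCENT
    "fi ligature": "\ufb01",               # LATIN SMALL LIGATURE FI
    "long s": "\u017f",                    # LATIN SMALL LETTER LONG S
    "full-width A": "\uff21",              # FULLWIDTH LATIN CAPITAL LETTER A
    "Arabic isolated alef": "\ufe8d",      # ARABIC LETTER ALEF ISOLATED FORM
}


def compare(text: str) -> tuple:
    """Return (NFC form, NFKC form) of the same input."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )


for label, text in SAMPLES.items():
    nfc, nfkc = compare(text)
    status = "unchanged by NFKC" if nfc == nfkc else "rewritten by NFKC"
    print(f"{label}: NFC={nfc!a} NFKC={nfkc!a} ({status})")
```

Only the first sample is a canonical equivalence, so NFC and NFKC agree on it. The other four are left alone by NFC and rewritten by NFKC, which is the loss the table warns about.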

TR39 Confusable Folding

Apply Unicode TR39 skeleton mapping before deduplication and bitext alignment. Without it, Cyrillic "а" (U+0430) and Latin "a" (U+0061) render identically but remain distinct code points, so lines that look the same survive deduplication; this can leave 5–15% of the corpus duplicated under different scripts. The folded skeleton is a dedup key only; never store fold output as canonical text.
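A rough sketch of skeleton-keyed deduplication follows. Python's standard library does not ship the TR39 confusables data, so `CONFUSABLE_PROTOTYPES` below is a tiny hand-picked subset used only for illustration; a real pipeline would load the full mapping from the Unicode `confusables.txt` file. The function names are hypothetical.

```python
import unicodedata

# ASSUMPTION: a six-entry subset standing in for the full TR39 data file.
CONFUSABLE_PROTOTYPES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def skeleton(text: str) -> str:
    """TR39-style skeleton: NFD, map to prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(CONFUSABLE_PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def dedup(lines: list) -> list:
    """Keep the first line seen for each skeleton; store the original text."""
    seen = set()
    kept = []
    for line in lines:
        key = skeleton(line)
        if key not in seen:
            seen.add(key)
            kept.append(line)  # the skeleton is never written to the corpus
    return kept


corpus = ["paper", "p\u0430per", "other"]  # second item has a Cyrillic letter
print(dedup(corpus))
```

The skeleton lives only in the `seen` set, so the corpus keeps whichever original spelling was encountered first.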

Diacritic Preservation for Tone Languages

For Yoruba, Vietnamese, Hausa, Mandarin pinyin, Igbo, and Twi, diacritic stripping is catastrophic data corruption. The skill blocks any pipeline calling unidecode() or stripping combining marks on these languages.

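| Language | Diacritic Role | Stripping Cost |
| --- | --- | --- |
| Yoruba | High/low tone (á/à), underdot vowel quality (ọ/ẹ) | Word-level meaning loss |
| Vietnamese | 6 tones (á/à/ả/ã/ạ + base) | Distinct words collapse into one spelling |
| Hausa | High/low tone marking | Verb/noun ambiguity |
| Mandarin pinyin | 4 tones (mā/má/mǎ/mà) | Same collapse: all four become "ma" |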
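To make the cost concrete, the sketch below strips combining marks from the four pinyin syllables in the table and shows a possible guard that rejects such a transform. The language codes, `TONE_LANGUAGES`, and the function names are assumptions made for this example and do not describe how the skill itself detects stripping.

```python
import unicodedata

# ASSUMPTION: ISO 639-3 codes for the languages listed above.
TONE_LANGUAGES = {"yor", "vie", "hau", "cmn", "ibo", "twi"}


def count_marks(text: str) -> int:
    """Count combining marks after full canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.category(ch).startswith("M"))


def strip_marks(text: str) -> str:
    """The destructive operation: drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )
    return unicodedata.normalize("NFC", kept)


def check_diacritics_preserved(before: str, after: str, lang: str) -> None:
    """Raise if a transform removed marks from tone-language text."""
    if lang in TONE_LANGUAGES and count_marks(after) < count_marks(before):
        raise ValueError(f"diacritics were stripped from {lang!r} text")


syllables = ["m\u0101", "m\u00e1", "m\u01ce", "m\u00e0"]  # mā, má, mǎ, mà
print({strip_marks(s) for s in syllables})  # four syllables collapse to one
```

Comparing mark counts before and after a transform is a cheap check that works regardless of which library did the stripping.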

Romanization Scheme Selection

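| Script | Recommended | Reversible? |
| --- | --- | --- |
| Devanagari | IAST | Yes |
| Cyrillic (Russian) | ISO 9 / GOST 7.79 | Yes |
| Arabic | ALA-LC or DIN 31635 | Approx |
| Han (Mandarin) | Pinyin (with tones) | No |
| Hangul | Revised Romanization (2000) | Approx (standard form transcribes pronunciation) |
| Kana | Hepburn | Approx (じ/ぢ both become "ji", ず/づ both become "zu") |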
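One necessary condition for reversibility is that no two source characters share a romanized form. The sketch below checks that condition on two small sample tables; the samples are a handful of entries picked for illustration, not complete ISO 9 or pinyin tables, and `collisions` is a hypothetical helper. Passing the check does not by itself prove a scheme is reversible.

```python
from collections import defaultdict


def collisions(table: dict) -> dict:
    """Map each romanized form to its sources when more than one source shares it."""
    by_target = defaultdict(list)
    for source, target in table.items():
        by_target[target].append(source)
    return {t: s for t, s in by_target.items() if len(s) > 1}


# Sample of ISO 9 letter mappings: each Cyrillic letter gets its own Latin letter.
ISO9_SAMPLE = {"ж": "ž", "ч": "č", "ш": "š", "щ": "ŝ", "ю": "û", "я": "â"}

# Sample of pinyin readings: two different characters share one syllable.
PINYIN_SAMPLE = {"马": "mǎ", "码": "mǎ", "妈": "mā"}

print(collisions(ISO9_SAMPLE))    # no shared targets in this sample
print(collisions(PINYIN_SAMPLE))  # 马 and 码 cannot be told apart from "mǎ"
```

A non-empty result is enough to mark a scheme as not reversible, which matches the "No" entry for pinyin above.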

Inputs & Outputs

| Input | Description |
| --- | --- |
| Target language + script from scope | Used to determine block(s) and policy |
| Corpus samples (optional) | For confusable detection and encoding diagnosis |

| Output | Description |
| --- | --- |
| Normalization policy | NFC/NFKC/NFD decision with rationale |
| Diacritic policy | PRESERVE / OPTIONAL / STRIPPABLE |
| Romanization scheme | Name + reversibility flag |
| Confusable risk level | LOW / MEDIUM / HIGH |
| ZWJ/ZWNJ policy | NORMALIZE / PRESERVE |
| workspace_state.md entry | Script policy record applied by all downstream skills |
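The outputs above could be held in a small validated record before being written out. This is only a sketch: the class `ScriptPolicy`, its field names, and the text layout produced by `to_entry` are assumptions made here, since the actual workspace_state.md format is not specified in this page.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values taken from the output table above.
ALLOWED = {
    "normalization": {"NFC", "NFKC", "NFD"},
    "diacritics": {"PRESERVE", "OPTIONAL", "STRIPPABLE"},
    "confusable_risk": {"LOW", "MEDIUM", "HIGH"},
    "zwj_zwnj": {"NORMALIZE", "PRESERVE"},
}


@dataclass(frozen=True)
class ScriptPolicy:
    language: str
    normalization: str
    diacritics: str
    confusable_risk: str
    zwj_zwnj: str
    romanization: Optional[str] = None  # None when no scheme is needed
    reversible: Optional[bool] = None

    def __post_init__(self) -> None:
        # Reject any value outside the documented vocabulary.
        for field_name, allowed in ALLOWED.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} not in {sorted(allowed)}")

    def to_entry(self) -> str:
        """Render the record as plain text (hypothetical layout)."""
        romanization = self.romanization or "N/A"
        if self.romanization and self.reversible is not None:
            romanization += f" (reversible: {'yes' if self.reversible else 'no'})"
        return "\n".join([
            f"Script Policy: {self.language}",
            f"- Normalization: {self.normalization}",
            f"- Diacritics: {self.diacritics}",
            f"- Romanization: {romanization}",
            f"- Confusable risk: {self.confusable_risk}",
            f"- ZWJ/ZWNJ: {self.zwj_zwnj}",
        ])


policy = ScriptPolicy("Yoruba", "NFC", "PRESERVE", "LOW", "NORMALIZE")
print(policy.to_entry())
```

Validating at construction time means a typo such as "NFCK" fails immediately instead of propagating to downstream skills.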

Example Usage

Language: Yoruba (yor)

Script Policy: Yoruba
- Primary script: Latin (Basic Latin U+0000-U+007F, precomposed accented letters, and combining diacritics U+0300-U+036F)
- Normalization: NFC
- Diacritics: PRESERVE (tone language — á/à/ọ̀ carry word meaning)
- Romanization: N/A (already Latin-script)
- Confusable risk: LOW
- ZWJ/ZWNJ: NORMALIZE
Apply order: BOM strip → NFC → diacritic validation → TR39 fold (dedup key only)
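The apply order can be sketched as a single function that returns the canonical text and the dedup key separately. Several parts are assumptions made for this example: "diacritic validation" is interpreted as rejecting combining marks that have no base letter, `CONFUSABLES` is a three-entry stand-in for the full TR39 data, and all function names are hypothetical.

```python
import unicodedata

# ASSUMPTION: tiny stand-in for the TR39 confusables mapping.
CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o"}


def strip_bom(text: str) -> str:
    """Step 1: drop a leading byte order mark if present."""
    return text[1:] if text.startswith("\ufeff") else text


def orphaned_marks(text: str) -> list:
    """Step 3 (assumed meaning): positions of marks not attached to a letter."""
    positions = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch).startswith("M"):
            prev = text[i - 1] if i > 0 else ""
            if not prev or unicodedata.category(prev)[0] not in ("L", "M"):
                positions.append(i)
    return positions


def fold(text: str) -> str:
    """Step 4: confusable fold, used as a dedup key only."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def process(raw: str) -> tuple:
    """Return (canonical_text, dedup_key) following the apply order."""
    text = unicodedata.normalize("NFC", strip_bom(raw))  # steps 1 and 2
    bad = orphaned_marks(text)                            # step 3
    if bad:
        raise ValueError(f"orphaned combining marks at positions {bad}")
    return text, fold(text)                               # step 4


latin = "\ufeffo\u0323\u0300ko\u0323\u0300"   # decomposed Latin input with BOM
mixed = "\u043e\u0323\u0300k\u043e\u0323\u0300"  # same shape, Cyrillic o
print(process(latin)[0] != process(mixed)[0])  # canonical texts stay distinct
print(process(latin)[1] == process(mixed)[1])  # dedup keys match
```

Returning the two values as a pair keeps the fold output from being mistaken for the stored text.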