linguistic-lexicon
Lexicography for ML: dictionary-building methodology, sense-splitting vs lumping decisions, MWE inventories for RAG glossary injection and MT post-edit, citation-form conventions, and variant handling. Optional Mindset specialist.
Overview
Lexicography is the unsexy specialty that quietly determines whether RAG, MT, and structured extraction work. A lexicon with wrong sense splits produces misleading word-sense disambiguation (WSD) eval scores. A lexicon without MWEs produces silent literal mistranslation ("kick the bucket" → three unrelated words). Treat lexicon construction as ML infrastructure, not an afterthought.
Pipeline Position
Phase: Optional (Phase 4) — activate when building a domain lexicon for RAG, MT post-edit, or WSD eval
When to activate: After linguistic-semantics identifies coverage gaps in OMW (Open Multilingual Wordnet), or when domain-specific terminology is needed for RAG grounding
When It Activates
- Building a target-language lexicon for MT post-edit, RAG glossary injection, or technical-domain control
- Deciding sense-splitting vs sense-lumping policy for WSD eval
- Designing MWE inventory for low-resource MT
- Setting citation-form conventions per script or language family
When NOT to use: Sense-disambiguation eval directly → linguistic-semantics. Pure morphological paradigms → linguistic-morph.
What It Does
Sense Splitting vs Lumping
A corpus-design decision, not just lexicographic preference:
- Granular splits: enable fine-grained WSD eval; harder to annotate consistently
- Lumped senses: easier to annotate consistently; can make WSD eval trivially easy
Choose by use case. Atkins & Rundell (The Oxford Guide to Practical Lexicography, 2008) is canonical on this decision.
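The trade-off can be made concrete with a minimal sketch, assuming a toy inventory in which every fine-grained sense carries a pointer to its lumped group. The `Sense` class, the sense IDs, and the `accuracy` helper are hypothetical illustrations, not part of any skill API. The same predictions are scored at both granularities:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    sense_id: str   # fine-grained ID (hypothetical naming scheme)
    gloss: str
    coarse_id: str  # lumped group this sense belongs to


# Hypothetical toy inventory for English "bank".
SENSES = {
    "bank.n.1": Sense("bank.n.1", "financial institution", "bank.FINANCE"),
    "bank.n.2": Sense("bank.n.2", "building housing a bank", "bank.FINANCE"),
    "bank.n.3": Sense("bank.n.3", "sloping land beside a river", "bank.RIVER"),
}


def accuracy(gold, pred, granular=True):
    """Score predictions against gold at fine or lumped granularity."""
    def key(sense_id):
        return sense_id if granular else SENSES[sense_id].coarse_id

    hits = sum(key(g) == key(p) for g, p in zip(gold, pred))
    return hits / len(gold)


gold = ["bank.n.1", "bank.n.2", "bank.n.3", "bank.n.1"]
pred = ["bank.n.2", "bank.n.2", "bank.n.3", "bank.n.1"]

granular_acc = accuracy(gold, pred, granular=True)   # 0.75
lumped_acc = accuracy(gold, pred, granular=False)    # 1.0
```

The single institution/building confusion costs a quarter of the granular score and vanishes entirely under lumping, which is why the policy has to be fixed before eval numbers are compared.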
MWE Inventories
Multi-word expressions (MWEs: idioms, light-verb constructions, phrasal verbs) drive RAG glossary injection and MT post-edit quality. Without MWE-aware processing, MT silently translates idioms such as "kick the bucket" or "let the cat out of the bag" word by word. Cross-reference linguistic-semantics/references/mwe_parseme.md.
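One way to make processing MWE-aware is a longest-match lookup against the inventory before any word-level step runs. The sketch below assumes whitespace tokenization and exact surface matching; `build_index`, `tag_mwes`, and the glosses are hypothetical names and data chosen for illustration.

```python
def build_index(mwes):
    """Map each MWE's first token to its (tokens, gloss) entries, longest first."""
    index = {}
    for phrase, gloss in mwes.items():
        tokens = tuple(phrase.lower().split())
        index.setdefault(tokens[0], []).append((tokens, gloss))
    for entries in index.values():
        entries.sort(key=lambda entry: len(entry[0]), reverse=True)
    return index


def tag_mwes(text, index):
    """Return (unit, gloss) pairs; gloss is None for tokens outside any MWE."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        match = None
        for cand, gloss in index.get(tokens[i], []):
            if tuple(tokens[i:i + len(cand)]) == cand:
                match = (cand, gloss)
                break  # entries are sorted longest first
        if match:
            out.append((" ".join(match[0]), match[1]))
            i += len(match[0])
        else:
            out.append((tokens[i], None))
            i += 1
    return out


# Hypothetical two-entry inventory.
MWES = {
    "kick the bucket": "die (idiom)",
    "let the cat out of the bag": "reveal a secret (idiom)",
}
index = build_index(MWES)
tagged = tag_mwes("He might kick the bucket soon", index)
# [('he', None), ('might', None), ('kick the bucket', 'die (idiom)'), ('soon', None)]
```

Exact matching misses inflected or discontinuous forms ("kicked the bucket", "let the cat right out of the bag"), so a production inventory would likely need lemma-level matching on top of this.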
Citation-Form Conventions
| Script/Family | Citation Form |
|---|---|
| Latin/Cyrillic (inflectional) | Nominative singular / infinitive |
| Semitic (Arabic, Hebrew) | Root as the index key; verbs conventionally cited in the 3rd-person masculine singular perfect |
| Isolating (Mandarin, Thai) | Character sequence |
| Other scripts/families | Follow the per-script convention; document it in lexicon metadata |
Variant Handling
Spelling variants, dialect variants, pre-reform vs post-reform spellings — document policy explicitly. Undocumented variant handling compounds noise across the lexicon.
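An explicit policy can be as small as a typed variant table plus a set of variant kinds that the lexicon normalizes. This is a minimal sketch under assumptions: the `Variant` class, the `normalize` helper, and the choice of "color" as the canonical English form are illustrative. The German pair reflects the 1996 orthography reform (daß → dass).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    canonical: str
    kind: str  # "spelling", "dialect", or "pre-reform"


# Hypothetical variant table; keys are lowercased surface forms.
VARIANTS = {
    "colour": Variant("color", "spelling"),
    "daß": Variant("dass", "pre-reform"),
}


def normalize(token, policy):
    """Map a token to its canonical form if its variant kind is covered by the policy."""
    variant = VARIANTS.get(token.lower())
    if variant is not None and variant.kind in policy:
        return variant.canonical
    return token


policy = {"spelling", "pre-reform"}  # the documented policy: which kinds get normalized
normalized = [normalize(t, policy) for t in ["colour", "daß", "Haus"]]
# ['color', 'dass', 'Haus']
```

Keeping the `kind` tag on every variant means the policy can change later (for example, stop normalizing dialect forms) without re-curating the table.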
Cross-Lingual Lexicon Sources
| Resource | Coverage | Use |
|---|---|---|
| Wiktionary | Multilingual, community-curated | Starting source; quality varies |
| DBnary | Wiktionary as RDF/structured | Programmatic access |
| PanLex | Meta-lexical resource, many bilingual dicts | Low-resource bridging |
| MUSE (Facebook) | 110 bilingual dictionaries | Cross-lingual embeddings |
Never treat Wiktionary as gold for production work: it is a useful starting source, but entry quality varies widely.
Low-Resource Lexicon Construction
Bootstrap via cognate sets (cross-reference linguistic-historical) → community refinement → curator pass.
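The first step of that chain can be sketched as proposing candidate pairs from concept-aligned wordlists and marking every one of them for review. The sketch assumes plain string similarity from Python's `difflib` as a crude stand-in for real cognate detection, which would rely on regular sound correspondences. The Spanish/Portuguese wordlists, the `propose_cognates` helper, and the 0.5 threshold are illustrative assumptions only.

```python
from difflib import SequenceMatcher

# Toy concept-aligned wordlists (Swadesh-style); not low-resource, illustration only.
RELATED = {"night": "noche", "water": "agua", "dog": "perro"}  # Spanish
TARGET = {"night": "noite", "water": "água", "dog": "cão"}     # Portuguese


def propose_cognates(related, target, threshold=0.5):
    """Propose candidate cognate pairs; every candidate still needs human review."""
    candidates = []
    for concept, source_form in related.items():
        target_form = target.get(concept)
        if target_form is None:
            continue
        score = SequenceMatcher(None, source_form, target_form).ratio()
        if score >= threshold:
            candidates.append({
                "concept": concept,
                "source": source_form,
                "target": target_form,
                "score": round(score, 2),
                "status": "needs_review",  # community refinement, then curator pass
            })
    return candidates


candidates = propose_cognates(RELATED, TARGET)
```

Here "noche"/"noite" and "agua"/"água" clear the threshold while "perro"/"cão" does not. The `status` field keeps bootstrapped entries distinguishable from reviewed ones throughout the later passes.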
Inputs & Outputs
| Input | Description |
|---|---|
| Target language + domain | For lexicon scope |
| OMW coverage from semantics | Identifies gaps to fill |
| Use case | MT post-edit / RAG / WSD eval |

| Output | Description |
|---|---|
| Sense-splitting policy | Granular / lumped + rationale |
| MWE inventory | Idioms + light verbs + phrasal verbs |
| Citation-form conventions | Per-script documentation |
| Variant-handling policy | Spelling/dialect variants |
| Lexicon entries | Bootstrapped from cognates / Wiktionary |
Example Usage
Language: Swahili (swa), use case: RAG glossary injection for legal-domain chatbot
Lexicon Plan: Swahili Legal Domain (RAG)
- Use case: RAG glossary injection (term-level precision critical)
- Sense policy: GRANULAR splits (WSD matters for legal terms;
"haki" = right/justice/truth — must distinguish for RAG)
- MWE strategy: PARSEME-aligned where available; 300-term custom legal MWE catalog
- Citation form: singular for nouns, verb stem for verbs (Swahili has no case marking; Latin script, standard)
- Variant handling: pre-1930s colonial spelling vs modern normalized — use modern
- Sources: Wiktionary baseline (curator review pass needed);
DBnary for structured export; community review for legal terms
- Coverage: Wiktionary covers ~60% of target; 40% needs community contribution
Related Skills
- linguistic-semantics: OMW coverage gaps identify what the lexicon needs to fill
- linguistic-historical: cognate-based lexicon bootstrap for Class 0–1 languages
- linguistic-morph: citation-form conventions depend on morphology type