linguistic-lexicon

Lexicography for ML: dictionary-building methodology, sense-splitting vs lumping decisions, MWE inventories for RAG glossary injection and MT post-edit, citation-form conventions, and variant handling. Optional Mindset specialist.

Overview

Lexicography is the unsexy specialty that quietly determines whether RAG, MT, and structured extraction work. A lexicon with wrong sense splits produces bad WSD eval scores. A lexicon without MWEs produces silent literal mistranslation ("kick the bucket" → three unrelated words). Treat lexicon construction as ML infrastructure, not an afterthought.

Pipeline Position

Phase: Optional (Phase 4) — activate when building a domain lexicon for RAG, MT post-edit, or WSD eval

When to activate: After linguistic-semantics identifies OMW coverage gaps or when domain-specific terminology is needed for RAG grounding

When It Activates

Building a target-language lexicon for MT post-edit, RAG glossary injection, or technical-domain control
Deciding sense-splitting vs sense-lumping policy for WSD eval
Designing MWE inventory for low-resource MT
Setting citation-form conventions per script or language family

When NOT to use: Sense-disambiguation eval directly → linguistic-semantics. Pure morphological paradigms → linguistic-morph.

What It Does

Sense Splitting vs Lumping

A corpus-design decision, not just lexicographic preference:

Granular splits: enable fine-grained WSD eval, but annotators agree less often on the finer distinctions
Lumped senses: easier to annotate consistently, but can make WSD eval trivially easy

Choose by use case. Atkins & Rundell (The Oxford Guide to Practical Lexicography, 2008) is canonical on this decision.
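To see how the policy changes what a WSD eval measures, the sketch below scores the same predictions under a granular inventory and under a lumped one. The sense IDs, the `LUMP_MAP` clustering, and the gold and predicted labels are invented for illustration and are not drawn from any real inventory.

```python
# Hypothetical fine-grained sense IDs for "bank", mapped to coarse clusters.
LUMP_MAP = {
    "bank.n.01": "bank.FINANCE",  # financial institution
    "bank.n.02": "bank.FINANCE",  # building housing that institution
    "bank.n.03": "bank.RIVER",    # sloping land beside water
}


def accuracy(gold, pred, lump=None):
    """Share of predictions matching gold, optionally after lumping senses."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if lump is not None:
        gold = [lump[s] for s in gold]
        pred = [lump[s] for s in pred]
    hits = sum(g == p for g, p in zip(gold, pred))
    return hits / len(gold)


# Illustrative labels only: one institution/building confusion at index 0.
gold = ["bank.n.01", "bank.n.02", "bank.n.03", "bank.n.01"]
pred = ["bank.n.02", "bank.n.02", "bank.n.03", "bank.n.01"]

granular = accuracy(gold, pred)          # the confusion counts as an error
lumped = accuracy(gold, pred, LUMP_MAP)  # the same confusion disappears
```

Under these assumed labels the granular score is lower than the lumped one, although the system's behaviour is identical. The inventory decides what counts as a mistake.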

MWE Inventories

MWEs (multi-word expressions such as idioms, light-verb constructions, and phrasal verbs) drive RAG glossary injection and MT post-edit quality. Without MWE-aware processing, MT silently renders expressions like "kick the bucket" or "let the cat out of the bag" word for word. Cross-reference linguistic-semantics/references/mwe_parseme.md.
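One simple way to make a pipeline MWE-aware is a greedy longest-match lookup against the inventory before translation or glossary injection. The sketch below assumes a flat set of fixed-form expressions. The function name `find_mwes` and the `max_len` parameter are hypothetical, and the two-entry inventory is only the examples from the text.

```python
def find_mwes(tokens, inventory, max_len=8):
    """Greedy longest-match lookup of MWEs over a token list.

    Returns (start, end, expression) spans, with end exclusive.
    Matching is case-insensitive and exact on surface form.
    """
    lowered = [t.lower() for t in tokens]
    spans = []
    i = 0
    while i < len(lowered):
        match = None
        # Try the longest window first so longer expressions win.
        for n in range(min(max_len, len(lowered) - i), 1, -1):
            candidate = " ".join(lowered[i:i + n])
            if candidate in inventory:
                match = (i, i + n, candidate)
                break
        if match:
            spans.append(match)
            i = match[1]  # skip past the matched expression
        else:
            i += 1
    return spans


MWE_INVENTORY = {"kick the bucket", "let the cat out of the bag"}
tokens = "He might kick the bucket soon".split()
spans = find_mwes(tokens, MWE_INVENTORY)
```

Exact surface matching misses inflected forms ("kicked the bucket") and discontinuous expressions, so a production inventory would likely match on lemmas and allow gaps.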

Citation-Form Conventions

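Script/Family	Citation Form
Latin/Cyrillic scripts (inflectional languages)	Nominative singular (nouns) / infinitive (verbs)
Semitic (Arabic, Hebrew)	Entries traditionally grouped by consonantal root; verbs cited in the 3rd person masculine singular perfect
Isolating (Mandarin, Thai)	Uninflected word form as written (character sequence)
Any other script or family	Follow the prevailing dictionary convention and document it in lexicon metadata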
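Documenting the convention in lexicon metadata can be as light as one record per language. The sketch below is one possible shape, not a standard schema: the `CitationFormPolicy` class and its field names are assumptions, and the forms listed are illustrative choices that a real project would confirm against its reference dictionaries.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CitationFormPolicy:
    """Hypothetical metadata record; field names are illustrative."""
    language: str   # ISO 639-3 code
    script: str     # ISO 15924 code
    noun_form: str
    verb_form: str
    notes: str = ""


policies = [
    CitationFormPolicy("rus", "Cyrl", "nominative singular", "infinitive"),
    CitationFormPolicy(
        "arb", "Arab", "singular",
        "3rd person masculine singular perfect",
        notes="entries grouped under consonantal root",
    ),
    CitationFormPolicy("cmn", "Hans", "invariant word form", "invariant word form"),
]

# Serialize alongside the lexicon so downstream tools can read the policy.
metadata = json.dumps([asdict(p) for p in policies], ensure_ascii=False, indent=2)
```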

Variant Handling

Spelling variants, dialect variants, pre-reform vs post-reform spellings — document policy explicitly. Undocumented variant handling compounds noise across the lexicon.
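An explicit policy can be encoded as a variant-to-preferred-form map that every ingestion step passes through. This is a minimal sketch: `VARIANT_MAP`, its two English entries, and `normalize_headword` are hypothetical. Case-folding is itself a policy choice and would be wrong for lexicons where capitalization is distinctive.

```python
import unicodedata

# Hypothetical policy table: each known variant maps to one preferred headword.
VARIANT_MAP = {
    "colour": "color",
    "organise": "organize",
}


def normalize_headword(form, variant_map=VARIANT_MAP):
    """Return (preferred_form, was_variant) under the documented policy."""
    # NFC first, so composed and decomposed spellings compare equal.
    key = unicodedata.normalize("NFC", form).casefold()
    if key in variant_map:
        return variant_map[key], True
    return key, False
```

Returning the `was_variant` flag lets the lexicon keep the original spelling as a recorded variant instead of discarding it.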

Cross-Lingual Lexicon Sources

Resource	Coverage	Use
Wiktionary	Multilingual, community-curated	Starting source; quality varies
DBnary	Wiktionary as RDF/structured	Programmatic access
PanLex	Meta-lexical resource, many bilingual dicts	Low-resource bridging
MUSE (Facebook)	110 ground-truth bilingual dictionaries	Cross-lingual embedding alignment and evaluation

Never treat Wiktionary as gold for production work. It is a useful starting source, but entry quality varies widely across languages.

Low-Resource Lexicon Construction

Bootstrap via cognate sets (cross-reference linguistic-historical) → community refinement → curator pass.
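The three-step flow above can be tracked as a per-entry review stage, so that only curated entries reach downstream systems. The sketch below is one way to model that. The `Stage`, `Entry`, `promote`, and `release_set` names are assumptions, and the headwords are placeholders, not real data.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    COGNATE_CANDIDATE = 1   # proposed from a cognate set
    COMMUNITY_REVIEWED = 2  # checked by speakers
    CURATED = 3             # approved by a curator


@dataclass
class Entry:
    headword: str
    gloss: str
    stage: Stage = Stage.COGNATE_CANDIDATE


def promote(entry):
    """Advance an entry exactly one stage; curated entries stay curated."""
    if entry.stage is not Stage.CURATED:
        entry.stage = Stage(entry.stage.value + 1)
    return entry


def release_set(entries):
    """Only curated entries are exported for downstream use."""
    return [e for e in entries if e.stage is Stage.CURATED]


# Placeholder entries, not real lexical data.
entries = [Entry("word_a", "gloss a"), Entry("word_b", "gloss b")]
promote(promote(entries[0]))  # community review, then curator pass
promote(entries[1])           # community review only
exported = release_set(entries)
```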

Inputs & Outputs

Input	Description
Target language + domain	For lexicon scope
OMW coverage from semantics	Identifies gaps to fill
Use case	MT post-edit / RAG / WSD eval

Output	Description
Sense-splitting policy	Granular / lumped + rationale
MWE inventory	Idioms + light verbs + phrasal verbs
Citation-form conventions	Per-script documentation
Variant-handling policy	Spelling/dialect variants
Lexicon entries	Bootstrapped from cognates / Wiktionary

Example Usage

Language: Swahili (swa), use case: RAG glossary injection for legal-domain chatbot

Lexicon Plan: Swahili Legal Domain (RAG)
- Use case: RAG glossary injection (term-level precision critical)
- Sense policy: GRANULAR splits (WSD matters for legal terms;
    "haki" = right/justice/truth — must distinguish for RAG)
- MWE strategy: PARSEME-aligned where available; 300-term custom legal MWE catalog
- Citation form: nouns in the singular with class prefix; verbs by stem or ku- infinitive, stated in metadata (Latin-script Swahili)
- Variant handling: pre-1930s colonial spelling vs modern normalized — use modern
- Sources: Wiktionary baseline (curator review pass needed);
    DBnary for structured export; community review for legal terms
- Coverage: Wiktionary covers ~60% of target; 40% needs community contribution

Cross-References

linguistic-semantics: OMW coverage gaps identify what the lexicon needs to fill
linguistic-historical: cognate-based lexicon bootstrap for Class 0–1
linguistic-morph: citation-form conventions depend on morphology type
