linguistic-lexicon

Lexicography for ML: dictionary-building methodology, sense-splitting vs lumping decisions, MWE inventories for RAG glossary injection and MT post-edit, citation-form conventions, and variant handling. Optional Mindset specialist.

Overview

Lexicography is the unsexy specialty that quietly determines whether RAG, MT, and structured extraction work. A lexicon with wrong sense splits produces bad WSD eval scores. A lexicon without MWEs produces silent literal mistranslation ("kick the bucket" → three unrelated words). Treat lexicon construction as ML infrastructure, not an afterthought.

Pipeline Position

Phase: Optional (Phase 4) — activate when building a domain lexicon for RAG, MT post-edit, or WSD eval

When to activate: After linguistic-semantics identifies OMW coverage gaps or when domain-specific terminology is needed for RAG grounding

When It Activates

  • Building a target-language lexicon for MT post-edit, RAG glossary injection, or technical-domain control
  • Deciding sense-splitting vs sense-lumping policy for WSD eval
  • Designing MWE inventory for low-resource MT
  • Setting citation-form conventions per script / language family

When NOT to use: Sense-disambiguation eval directly → linguistic-semantics. Pure morphological paradigms → linguistic-morph.

What It Does

Sense Splitting vs Lumping

A corpus-design decision, not just lexicographic preference:

  • Granular splits: enable fine-grained WSD eval, but are harder to annotate consistently
  • Lumped senses: easier to annotate consistently, but can trivialize WSD eval

Choose by use case. Atkins & Rundell (The Oxford Guide to Practical Lexicography, 2008) is canonical on this decision.
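To make the trade-off concrete, here is a minimal sketch of how the same WSD predictions score under a granular inventory and under a lumped one. The sense IDs and the lumping map are hypothetical and invented for illustration; they do not come from any real inventory.

```python
from typing import Dict, List, Tuple

# Hypothetical lumping policy: fine-grained sense ID -> coarse cluster.
LUMPING: Dict[str, str] = {
    "bank.n.river_edge": "bank.n.land",
    "bank.n.slope": "bank.n.land",
    "bank.n.institution": "bank.n.finance",
    "bank.n.building": "bank.n.finance",
}


def accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (gold, predicted) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(gold == pred for gold, pred in pairs) / len(pairs)


def lump(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map both gold and predicted senses onto the coarse clusters."""
    return [(LUMPING.get(g, g), LUMPING.get(p, p)) for g, p in pairs]


# (gold, predicted) pairs from an imaginary WSD system.
predictions = [
    ("bank.n.river_edge", "bank.n.slope"),         # wrong fine, right coarse
    ("bank.n.institution", "bank.n.building"),     # wrong fine, right coarse
    ("bank.n.institution", "bank.n.institution"),  # right at both levels
    ("bank.n.slope", "bank.n.institution"),        # wrong at both levels
]

fine_acc = accuracy(predictions)
coarse_acc = accuracy(lump(predictions))
print(f"granular: {fine_acc:.2f}, lumped: {coarse_acc:.2f}")
# prints "granular: 0.25, lumped: 0.75"
```

The jump from 0.25 to 0.75 comes entirely from the inventory, not from the system, which is why the policy and its rationale belong in the lexicon documentation.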

MWE Inventories

MWEs (multi-word expressions, including idioms, light-verb constructions, and phrasal verbs) drive RAG glossary injection and MT post-edit quality. Without MWE-aware processing, MT silently translates idioms such as "kick the bucket" or "let the cat out of the bag" word by word. Cross-reference linguistic-semantics/references/mwe_parseme.md.
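One way an MWE inventory might be applied before glossary injection or MT is a greedy longest-match lookup over the token stream. The sketch below assumes a hypothetical in-memory inventory (`MWE_INVENTORY`) keyed by lowercased token tuples; it matches exact surface forms only, so inflected variants such as "kicked the bucket" would need lemmatization first.

```python
from typing import Dict, List, Tuple

# Hypothetical MWE inventory: token tuple -> gloss to inject or protect.
MWE_INVENTORY: Dict[Tuple[str, ...], str] = {
    ("kick", "the", "bucket"): "die (idiom)",
    ("let", "the", "cat", "out", "of", "the", "bag"): "reveal a secret (idiom)",
}
MAX_LEN = max(len(key) for key in MWE_INVENTORY)


def find_mwes(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, gloss) spans using greedy longest match."""
    lowered = [t.lower() for t in tokens]
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(lowered):
        matched = False
        # Try the longest window first; an MWE has at least two tokens.
        for n in range(min(MAX_LEN, len(lowered) - i), 1, -1):
            key = tuple(lowered[i:i + n])
            if key in MWE_INVENTORY:
                spans.append((i, i + n, MWE_INVENTORY[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


spans = find_mwes("He might kick the bucket soon".split())
print(spans)  # prints [(2, 5, 'die (idiom)')]
```

The returned spans can then be treated as single units, for example by attaching the gloss to the RAG context or by flagging the span for post-edit review.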

Citation-Form Conventions

| Script/Family | Citation Form |
|---|---|
| Latin/Cyrillic (inflectional) | Nominative singular / infinitive |
| Semitic (Arabic, Hebrew) | Root form |
| Isolating (Mandarin, Thai) | Character sequence |
| Per-script convention applies | Document in lexicon metadata |

Variant Handling

Spelling variants, dialect variants, pre-reform vs post-reform spellings — document policy explicitly. Undocumented variant handling compounds noise across the lexicon.
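An explicit variant policy can be encoded directly in the entry structure. The following is a minimal sketch, assuming hypothetical `Entry` and `Variant` types; the choice of which form counts as the citation form (here the post-reform German spelling and the US English spelling) is itself a policy decision made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Variant:
    form: str
    kind: str  # e.g. "spelling", "dialect", "pre-reform"


@dataclass
class Entry:
    citation_form: str
    variants: List[Variant] = field(default_factory=list)


def build_variant_index(entries: List[Entry]) -> Dict[str, str]:
    """Map every known surface form (lowercased) to its citation form."""
    index: Dict[str, str] = {}
    for entry in entries:
        forms = [entry.citation_form] + [v.form for v in entry.variants]
        for form in forms:
            key = form.lower()
            # A form claimed by two entries signals an undocumented policy gap.
            if key in index and index[key] != entry.citation_form:
                raise ValueError(f"ambiguous variant: {form!r}")
            index[key] = entry.citation_form
    return index


def normalize(form: str, index: Dict[str, str]) -> Optional[str]:
    """Return the citation form for a surface form, or None if unknown."""
    return index.get(form.lower())


entries = [
    Entry("dass", [Variant("daß", "pre-reform")]),
    Entry("color", [Variant("colour", "dialect")]),
]
index = build_variant_index(entries)
print(normalize("Colour", index))  # prints "color"
print(normalize("daß", index))     # prints "dass"
```

Raising on ambiguous variants is a deliberate choice in this sketch: it surfaces conflicts at build time instead of letting them spread as noise through the lexicon.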

Cross-Lingual Lexicon Sources

| Resource | Coverage | Use |
|---|---|---|
| Wiktionary | Multilingual, community-curated | Starting source; quality varies |
| DBnary | Wiktionary as RDF/structured | Programmatic access |
| PanLex | Meta-lexical resource, many bilingual dicts | Low-resource bridging |
| MUSE (Facebook) | 110 ground-truth bilingual dictionaries | Cross-lingual embeddings |

Never treat Wiktionary as gold for production work — useful starting source, quality varies widely.

Low-Resource Lexicon Construction

Bootstrap via cognate sets (cross-reference linguistic-historical) → community refinement → curator pass.
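The first step, proposing cognate candidates, could be sketched with a normalized edit-distance filter over gloss-aligned wordlists. This is an assumption about one possible method, not the procedure defined in linguistic-historical; the 0.5 threshold is arbitrary, and the Spanish/Italian pairs stand in for real low-resource data only because they are well known.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Gloss -> (Spanish, Italian). Illustrative data only.
WORDLISTS: Dict[str, Tuple[str, str]] = {
    "night": ("noche", "notte"),
    "milk": ("leche", "latte"),
    "dog": ("perro", "cane"),
}
THRESHOLD = 0.5  # assumed cut-off; tune per language pair


def cognate_candidates(
    wordlists: Dict[str, Tuple[str, str]], threshold: float
) -> List[Tuple[str, str, str, float]]:
    """Return (gloss, source, target, score) for pairs at or above threshold."""
    out = []
    for gloss, (src, tgt) in wordlists.items():
        score = similarity(src, tgt)
        if score >= threshold:
            out.append((gloss, src, tgt, score))
    return out


candidates = cognate_candidates(WORDLISTS, THRESHOLD)
for gloss, src, tgt, score in candidates:
    print(gloss, src, tgt, round(score, 2))
```

With these inputs only the "night" pair clears the threshold. The "milk" pair is a genuine cognate pair that scores 0.4 and is missed, which shows why surface similarity is only a candidate generator and why the community refinement and curator passes remain necessary.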

Inputs & Outputs

| Input | Description |
|---|---|
| Target language + domain | For lexicon scope |
| OMW coverage from semantics | Identifies gaps to fill |
| Use case | MT post-edit / RAG / WSD eval |

| Output | Description |
|---|---|
| Sense-splitting policy | Granular / lumped + rationale |
| MWE inventory | Idioms + light verbs + phrasal verbs |
| Citation-form conventions | Per-script documentation |
| Variant-handling policy | Spelling/dialect variants |
| Lexicon entries | Bootstrapped from cognates / Wiktionary |

Example Usage

Language: Swahili (swa), use case: RAG glossary injection for legal-domain chatbot

Lexicon Plan: Swahili Legal Domain (RAG)
- Use case: RAG glossary injection (term-level precision critical)
- Sense policy: GRANULAR splits (WSD matters for legal terms;
    "haki" = right/justice/truth — must distinguish for RAG)
- MWE strategy: PARSEME-aligned where available; 300-term custom legal MWE catalog
- Citation form: singular noun with class prefix; verb stem (Latin-script Swahili; standard)
- Variant handling: pre-1930s colonial spelling vs modern normalized — use modern
- Sources: Wiktionary baseline (curator review pass needed);
    DBnary for structured export; community review for legal terms
- Coverage: Wiktionary covers ~60% of target; 40% needs community contribution
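A coverage figure like the one in the plan could be estimated by comparing the target term list against the headwords available from a source. The sketch below uses a handful of Swahili legal terms as the target list; the `source_headwords` set is a hypothetical placeholder, not real Wiktionary data.

```python
from typing import Iterable, List, Tuple


def coverage(
    target_terms: Iterable[str], source_headwords: Iterable[str]
) -> Tuple[float, List[str]]:
    """Return (covered fraction, sorted list of uncovered target terms)."""
    target = {t.lower() for t in target_terms}
    covered = target & {h.lower() for h in source_headwords}
    gaps = sorted(target - covered)
    ratio = len(covered) / len(target) if target else 0.0
    return ratio, gaps


# Target terms for the legal-domain glossary (small illustrative sample).
target_terms = ["haki", "sheria", "mahakama", "hakimu", "wakili"]
# Hypothetical headwords exported from the baseline source.
source_headwords = {"haki", "sheria", "mahakama", "nyumba"}

ratio, gaps = coverage(target_terms, source_headwords)
print(f"coverage: {ratio:.0%}, gaps: {gaps}")
# prints "coverage: 60%, gaps: ['hakimu', 'wakili']"
```

The gap list is the part that feeds the community contribution step; the ratio alone says nothing about whether the covered entries are correct.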