
linguistic-semantics

Lexical and frame semantics for the target language: WordNet/OMW coverage gaps, FrameNet/PropBank-style SRL guidance, MWE/PARSEME handling, and semantic-equivalence eval for cross-lingual RAG and retrieval.

Overview

Handling of multi-word expressions (MWEs) is a leading failure mode in low-resource MT: "kick the bucket" translated literally produces nonsense. WordNet/OMW coverage for many low-resource languages is 5–30K synsets versus English's 117K. PropBank frames are not 1:1 across languages. linguistic-semantics makes these gaps explicit and recommends targeted tools for each.

Pipeline Position

Phase: Analyze (Phase 2)

Before this skill: linguistic-scope (language identity for resource lookup), linguistic-morph / linguistic-syntax (prerequisite structure analysis)

After this skill: linguistic-eval (semantic-equivalence metrics), linguistic-annotate (sense annotation projects)

When It Activates

  • Need WordNet/OMW coverage for the target language
  • Building/evaluating SRL or frame-semantics annotation
  • Diagnosing MWE-related MT failures (idioms mistranslated literally)
  • Sense-equivalence eval for retrieval/RAG grounding
  • Adding semantic-grounded eval to an LLM-quality pipeline

When NOT to use: Purely surface-level eval (BLEU, chrF) → linguistic-eval. POS/dep parsing → linguistic-syntax. Pure annotation methodology → linguistic-annotate.

What It Does

WordNet / OMW Coverage

Per-language synset counts are far below English's: roughly 3× smaller for Spanish and more than 20× smaller for Yoruba (~5K synsets versus 117K for English). RAG grounding queries that depend on synset coverage fail silently in the target language when this gap is not accounted for.

| Language | OMW Synsets | Coverage vs English |
| --- | --- | --- |
| English (Princeton WN) | 117K | baseline |
| Spanish | ~38K | 32% |
| Arabic | ~28K | 24% |
| Swahili | ~10K | 9% |
| Yoruba | ~5K | 4% |
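
The coverage column is simple arithmetic against the Princeton WN baseline. A minimal sketch, assuming the approximate counts from the table above; the function and variable names are illustrative, not part of the skill:

```python
PRINCETON_WN_SYNSETS = 117_000  # English baseline from the table above

# Approximate synset counts copied from the table above.
OMW_SYNSETS = {
    "Spanish": 38_000,
    "Arabic": 28_000,
    "Swahili": 10_000,
    "Yoruba": 5_000,
}


def coverage_report(synsets, baseline=PRINCETON_WN_SYNSETS):
    """Return coverage and gap percentages relative to the baseline."""
    coverage = round(100 * synsets / baseline)
    return {"coverage_pct": coverage, "gap_pct": 100 - coverage}


for language, count in OMW_SYNSETS.items():
    report = coverage_report(count)
    print(f"{language}: {report['coverage_pct']}% covered, {report['gap_pct']}% gap")
```

The gap percentage is the figure worth surfacing in a RAG grounding report, since it estimates how many English synsets have no counterpart to look up.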

Frame Semantics / SRL

PropBank-style SRL frames are NOT 1:1 across languages. The English "GIVE" frame does not map exactly onto the structure of the Spanish "DAR" frame. Cross-lingual SRL therefore requires alignment via Predicate Matrix or MultiFrameNet.

FrameNet availability: Berkeley FrameNet for English; per-language FrameNet projects exist for ~20 languages. For other languages, Predicate Matrix can bridge via English, with coverage gaps.
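
One way to make "not 1:1" concrete is to check a proposed role mapping for gaps before trusting projected SRL labels. The sketch below is an assumption-laden illustration: the role inventories and the mapping are toy data invented for this example, not real PropBank or FrameNet entries, and `alignment_gaps` is a hypothetical helper.

```python
def alignment_gaps(source_roles, target_roles, mapping):
    """Report source roles with no target and target roles never reached."""
    unmapped_source = [r for r in source_roles if r not in mapping]
    reached = set(mapping.values())
    unreached_target = [r for r in target_roles if r not in reached]
    # One-to-one means: every role on both sides is used exactly once.
    is_one_to_one = (
        not unmapped_source
        and not unreached_target
        and len(reached) == len(mapping)
    )
    return unmapped_source, unreached_target, is_one_to_one


# Toy role inventories, invented for illustration only.
toy_source_roles = ["agent", "theme", "recipient"]
toy_target_roles = ["agent", "theme", "recipient", "beneficiary"]
toy_mapping = {"agent": "agent", "theme": "theme", "recipient": "recipient"}

unmapped, unreached, one_to_one = alignment_gaps(
    toy_source_roles, toy_target_roles, toy_mapping
)
print("unmapped source roles:", unmapped)
print("unreached target roles:", unreached)
print("one-to-one:", one_to_one)
```

Any role that shows up as unmapped or unreached is a place where labels projected through an English bridge would be missing or wrong, so it is worth listing these gaps alongside the FrameNet status.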

MWE / PARSEME Handling

MWEs (multi-word expressions such as idioms and light-verb constructions) must be treated as single semantic units. Where possible, merge each MWE into one unit before tokenization, using PARSEME-tagged corpora to identify them. If no such corpus exists for the target language, build a per-language MWE catalog; even 500 idioms helps.

Never treat "kick the bucket" as 3 unrelated content words.
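
A minimal sketch of catalog-based merging, assuming whitespace-split tokens and a small hand-built catalog of exact surface forms. The `merge_mwes` helper and the catalog contents are hypothetical; a real pipeline would load entries from a PARSEME-tagged corpus or the pilot catalog.

```python
def merge_mwes(tokens, catalog, joiner="_"):
    """Greedily merge the longest catalog match starting at each position."""
    max_len = max((len(entry) for entry in catalog), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for span in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + span])
            if candidate in catalog:
                merged.append(joiner.join(tokens[i:i + span]))
                i += span
                break
        else:
            # No catalog entry starts here; keep the token as-is.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical two-entry catalog; entries are lowercase token tuples.
catalog = {("kick", "the", "bucket"), ("take", "a", "walk")}
tokens = "He might kick the bucket soon".split()
print(merge_mwes(tokens, catalog))
# -> ['He', 'might', 'kick_the_bucket', 'soon']
```

This only catches contiguous, uninflected matches. Inflected forms ("kicked the bucket") and discontinuous MWEs would need lemma-based or gap-tolerant matching, which is out of scope for this sketch.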

Semantic-Equivalence Eval

For RAG/retrieval/cross-lingual semantic similarity:

  • Prefer a COMET-style learned metric over BLEU for semantic equivalence
  • Use LaBSE / SONAR cross-lingual embedding cosine for retrieval-style eval
  • COMET-22 coverage varies by language. Check it before reporting: for a language the metric does not cover, the scores are effectively noise
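
The embedding-cosine check for retrieval-style eval can be sketched as follows. This assumes the sentence vectors have already been produced by a cross-lingual encoder such as LaBSE or SONAR; the 3-dimensional vectors below are placeholders, and `retrieval_hit` is an illustrative helper rather than an API of either model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def retrieval_hit(query_vec, candidate_vecs, gold_index):
    """True if the gold candidate has the highest cosine with the query."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return scores.index(max(scores)) == gold_index


# Placeholder 3-d vectors standing in for real sentence embeddings.
query = [0.9, 0.1, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.8, 0.2, 0.1], [0.1, 0.0, 0.9]]
print(retrieval_hit(query, candidates, gold_index=1))
```

Averaging `retrieval_hit` over a set of source/target sentence pairs gives a top-1 retrieval accuracy, which can be reported next to the learned-metric score.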

Inputs & Outputs

| Input | Description |
| --- | --- |
| Target language ISO code | For WordNet/OMW/FrameNet lookup |
| Task type | MT / RAG / SRL / retrieval |

| Output | Description |
| --- | --- |
| WordNet/OMW coverage | Synset count + gap % vs Princeton WN |
| FrameNet status | Available / partial / absent |
| MWE strategy | PARSEME-aligned / custom catalog / none |
| Semantic-equivalence metric | COMET / LaBSE / SONAR recommendation |
| workspace_state.md entry | Semantics plan |

Example Usage

Language: Swahili (swa), task: cross-lingual MT eval

Semantics Analysis: Swahili
- WordNet/OMW: ~10K synsets (9% coverage vs Princeton WN)
    Gap: technical/abstract vocabulary largely absent
- FrameNet: absent for Swahili; use Predicate Matrix bridge via English
- PropBank-style SRL: stanza morph + custom frame alignment
- MWE strategy: no PARSEME corpus; build 500-idiom pilot catalog
- Semantic-equivalence eval: COMET-22 (check Swahili coverage);
    LaBSE cosine as supplementary
- RAG grounding: flag OMW gap — 91% of English synsets unavailable