linguistic-semantics
Lexical and frame semantics for the target language: WordNet/OMW coverage gaps, FrameNet/PropBank-style SRL guidance, MWE/PARSEME handling, and semantic-equivalence eval for cross-lingual RAG and retrieval.
Overview
MWE handling is a dominant failure mode in low-resource MT: "kick the bucket" translated literally produces nonsense. WordNet/OMW coverage for many low-resource languages is 5–30K synsets versus roughly 117K for English. PropBank frames do not map 1:1 across languages. linguistic-semantics makes these gaps explicit and recommends a targeted tool for each.
Pipeline Position
Phase: Analyze (Phase 2)
Before this skill: linguistic-scope (language identity for resource lookup), linguistic-morph / linguistic-syntax (prerequisite structure analysis)
After this skill: linguistic-eval (semantic-equivalence metrics), linguistic-annotate (sense annotation projects)
When It Activates
- Need WordNet/OMW coverage for the target language
- Building/evaluating SRL or frame-semantics annotation
- Diagnosing MWE-related MT failures (idioms mistranslated literally)
- Sense-equivalence eval for retrieval/RAG grounding
- Adding semantic-grounded eval to an LLM-quality pipeline
When NOT to use: Purely surface-level eval (BLEU, chrF) → linguistic-eval. POS/dep parsing → linguistic-syntax. Pure annotation methodology → linguistic-annotate.
What It Does
WordNet / OMW Coverage
Synset counts for low-resource languages are often 10× to more than 20× smaller than English's: Yoruba has ~5K synsets against 117K for English. RAG grounding queries that depend on synset coverage fail silently in the target language when this gap is not accounted for. The counts below are approximate.
| Language | OMW Synsets | Coverage vs English |
|---|---|---|
| English (Princeton WN) | 117K | baseline |
| Spanish | ~38K | 32% |
| Arabic | ~28K | 24% |
| Swahili | ~10K | 9% |
| Yoruba | ~5K | 4% |
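The coverage column is each language's synset count divided by the Princeton WordNet count. A minimal sketch using the approximate counts from the table above; the function names and the 10% flagging threshold are illustrative assumptions, not part of the skill:

```python
# Approximate synset counts copied from the table above.
SYNSET_COUNTS = {
    "English": 117_000,
    "Spanish": 38_000,
    "Arabic": 28_000,
    "Swahili": 10_000,
    "Yoruba": 5_000,
}


def coverage_vs_english(language: str, counts: dict[str, int] = SYNSET_COUNTS) -> float:
    """Return the language's synset count as a fraction of the English count."""
    return counts[language] / counts["English"]


def coverage_report(threshold: float = 0.10) -> list[str]:
    """Build one line per language; flag those below the (assumed) threshold."""
    lines = []
    for language in SYNSET_COUNTS:
        ratio = coverage_vs_english(language)
        flag = "  <- flag for RAG grounding" if ratio < threshold else ""
        lines.append(f"{language}: {ratio:.0%}{flag}")
    return lines


if __name__ == "__main__":
    print("\n".join(coverage_report()))
```

With these counts, Swahili and Yoruba fall below the assumed threshold and are flagged, while Spanish and Arabic are not.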
Frame Semantics / SRL
PropBank-style SRL frames are NOT 1:1 across languages: the English "GIVE" frame and the Spanish "DAR" frame do not share exactly the same structure. Cross-lingual SRL therefore requires alignment via Predicate Matrix or MultiFrameNet.
FrameNet availability: Berkeley FrameNet for English; per-language FrameNet projects exist for ~20 languages. For other languages, Predicate Matrix bridges through English, with coverage gaps.
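Bridging through English means some source-language roles will have no counterpart, and those gaps should be surfaced rather than silently dropped. A minimal sketch of that idea, assuming a bridge table keyed by (predicate, role) pairs. The table entries and role labels below are made up for illustration and do not reflect real PropBank, AnCora, or Predicate Matrix entries:

```python
from typing import Optional

Role = tuple[str, str]  # (predicate label, role label)

# Toy bridge table (hypothetical): source role -> English role.
TOY_BRIDGE: dict[Role, Role] = {
    ("dar", "giver"): ("give", "ARG0"),
    ("dar", "thing_given"): ("give", "ARG1"),
    # ("dar", "recipient") is deliberately left unmapped to show a coverage gap.
}


def project_roles(
    roles: list[Role], bridge: dict[Role, Role] = TOY_BRIDGE
) -> tuple[dict[Role, Role], list[Role]]:
    """Split source roles into those with an English counterpart and those without."""
    mapped: dict[Role, Role] = {}
    unmapped: list[Role] = []
    for role in roles:
        target: Optional[Role] = bridge.get(role)
        if target is None:
            unmapped.append(role)  # report the gap instead of guessing a mapping
        else:
            mapped[role] = target
    return mapped, unmapped


mapped, unmapped = project_roles(
    [("dar", "giver"), ("dar", "thing_given"), ("dar", "recipient")]
)
print("mapped:", mapped)
print("unmapped (coverage gap):", unmapped)
```

Returning the unmapped roles alongside the mapped ones lets a downstream report state the bridge's coverage explicitly.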
MWE / PARSEME Handling
MWEs (multi-word expressions, idioms, light verbs) must be treated as single semantic units. Where possible, merge MWEs into single units before tokenization, using PARSEME-tagged corpora to identify them. If no MWE catalog exists for the target language, build one: even 500 idioms helps.
Never treat "kick the bucket" as 3 unrelated content words.
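One way to merge MWEs ahead of tokenization is a greedy longest-match pass over a catalog. A minimal sketch, assuming whitespace-split tokens and a catalog of lowercased token tuples; `merge_mwes` and the two-entry catalog are hypothetical names and data, not a PARSEME API:

```python
def merge_mwes(
    tokens: list[str], catalog: set[tuple[str, ...]], joiner: str = "_"
) -> list[str]:
    """Greedily merge the longest catalog match starting at each position."""
    max_len = max((len(entry) for entry in catalog), default=1)
    merged: list[str] = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest window first so longer MWEs win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in catalog:
                match_len = n
                break
        if match_len:
            merged.append(joiner.join(tokens[i:i + match_len]))
            i += match_len
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Illustrative two-entry catalog; a real one would come from PARSEME data
# or a hand-built idiom list for the target language.
CATALOG = {("kick", "the", "bucket"), ("take", "a", "walk")}
print(merge_mwes("He will kick the bucket soon".split(), CATALOG))
# ['He', 'will', 'kick_the_bucket', 'soon']
```

This matches exact surface forms only. Inflected variants ("kicked the bucket") and discontinuous MWEs would need lemmatization or a parser-based matcher on top.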
Semantic-Equivalence Eval
For RAG/retrieval/cross-lingual semantic similarity:
- COMET-style learned metric > BLEU for semantic equivalence
- LaBSE / SONAR cross-lingual embedding cosine for retrieval-style eval
- COMET-22 coverage varies by language: check per-language support before reporting, because scores for an unsupported language are effectively random
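The retrieval-style check and the coverage guard above can be sketched as follows. This is a sketch under stated assumptions: `embed` is an injected callable standing in for a cross-lingual encoder such as LaBSE or SONAR, and `comet_supported` is a set the caller must fill from the metric's own documentation. Neither is a real library API, and no language list is implied here:

```python
import math
from typing import Callable, Iterable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / norm


def equivalence_scores(
    pairs: Iterable[tuple[str, str]], embed: Callable[[str], Vector]
) -> list[float]:
    """Score each (source, target) sentence pair by embedding cosine."""
    return [cosine(embed(src), embed(tgt)) for src, tgt in pairs]


def choose_metric(lang: str, comet_supported: set[str]) -> str:
    """Report COMET only for languages confirmed as covered; otherwise fall back."""
    return "COMET" if lang in comet_supported else "embedding-cosine"


# Toy embedding for demonstration only: real use would call an encoder.
toy_vectors = {"hello": [1.0, 0.0], "habari": [0.9, 0.1], "goodbye": [0.0, 1.0]}
scores = equivalence_scores(
    [("hello", "habari"), ("hello", "goodbye")], toy_vectors.__getitem__
)
print(scores)
```

Keeping the encoder behind a callable makes it easy to swap models and to test the scoring logic without downloading one.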
Inputs & Outputs
| Input | Description |
|---|---|
| Target language ISO code | For WordNet/OMW/FrameNet lookup |
| Task type | MT / RAG / SRL / retrieval |
| Output | Description |
|---|---|
| WordNet/OMW coverage | Synset count + gap % vs Princeton WN |
| FrameNet status | Available / partial / absent |
| MWE strategy | PARSEME-aligned / custom catalog / none |
| Semantic-equivalence metric | COMET / LaBSE / SONAR recommendation |
| workspace_state.md entry | Semantics plan |
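The outputs above could be carried as a single record. A minimal sketch, assuming a hypothetical `SemanticsPlan` dataclass whose allowed values mirror the table; the class name, field names, and validation are illustrative and not a format the skill defines:

```python
from dataclasses import dataclass

FRAMENET_STATUSES = {"available", "partial", "absent"}
MWE_STRATEGIES = {"PARSEME-aligned", "custom catalog", "none"}


@dataclass(frozen=True)
class SemanticsPlan:
    language: str              # target language ISO code
    task: str                  # MT / RAG / SRL / retrieval
    synset_count: int          # WordNet/OMW synsets for the target language
    english_synset_count: int  # Princeton WN baseline
    framenet_status: str       # one of FRAMENET_STATUSES
    mwe_strategy: str          # one of MWE_STRATEGIES
    equivalence_metric: str    # COMET / LaBSE / SONAR recommendation

    def __post_init__(self) -> None:
        if self.framenet_status not in FRAMENET_STATUSES:
            raise ValueError(f"unknown FrameNet status: {self.framenet_status!r}")
        if self.mwe_strategy not in MWE_STRATEGIES:
            raise ValueError(f"unknown MWE strategy: {self.mwe_strategy!r}")

    @property
    def gap_pct(self) -> float:
        """Percentage by which the synset count falls short of the English baseline."""
        return 100.0 * (1 - self.synset_count / self.english_synset_count)


plan = SemanticsPlan("swa", "MT", 10_000, 117_000, "absent", "custom catalog", "COMET")
print(f"{plan.language}: gap {plan.gap_pct:.0f}% vs Princeton WN")
```

Validating the status fields at construction keeps a typo from propagating into later phases.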
Example Usage
Language: Swahili (swa), task: cross-lingual MT eval
Semantics Analysis: Swahili
- WordNet/OMW: ~10K synsets (9% coverage vs Princeton WN)
  Gap: technical/abstract vocabulary largely absent
- FrameNet: absent for Swahili; use Predicate Matrix bridge via English
- PropBank-style SRL: stanza morph + custom frame alignment
- MWE strategy: no PARSEME corpus; build 500-idiom pilot catalog
- Semantic-equivalence eval: COMET-22 (check Swahili coverage);
  LaBSE cosine as supplementary
- RAG grounding: flag OMW gap (91% of English synsets unavailable)
Related Skills
- linguistic-scope: language identity for resource lookup
- linguistic-annotate: sense annotation project design
- linguistic-eval: COMET/LaBSE as eval metrics
- linguistic-bitext: MWE handling in parallel corpus