linguistic-semantics

Lexical and frame semantics for the target language: WordNet/OMW coverage gaps, FrameNet/PropBank-style SRL guidance, MWE/PARSEME handling, and semantic-equivalence eval for cross-lingual RAG and retrieval.

Overview

MWE handling is the dominant low-resource MT failure mode — "kick the bucket" translated literally produces nonsense. WordNet/OMW coverage for many low-resource languages is 5–30K synsets versus English's 117K. PropBank frames are not 1:1 across languages. linguistic-semantics makes these gaps explicit and recommends targeted tools for each.

Pipeline Position

Phase: Analyze (Phase 2)

Before this skill: linguistic-scope (language identity for resource lookup), linguistic-morph / linguistic-syntax (prerequisite structure analysis)

After this skill: linguistic-eval (semantic-equivalence metrics), linguistic-annotate (sense annotation projects)

When It Activates

Need WordNet/OMW coverage for the target language
Building/evaluating SRL or frame-semantics annotation
Diagnosing MWE-related MT failures (idioms mistranslated literally)
Sense-equivalence eval for retrieval/RAG grounding
Adding semantic-grounded eval to an LLM-quality pipeline

When NOT to use: Purely surface-level eval (BLEU, chrF) → linguistic-eval. POS/dep parsing → linguistic-syntax. Pure annotation methodology → linguistic-annotate.

What It Does

WordNet / OMW Coverage

Per-language synset counts in OMW range from roughly 3× to more than 20× smaller than English, as the table below shows. Yoruba, for example, has ~5K synsets against 117K in Princeton WordNet. RAG grounding queries that depend on synset coverage fail silently in the target language when this gap is not accounted for.

Language	OMW Synsets	Coverage vs English
English (Princeton WN)	117K	baseline
Spanish	~38K	32%
Arabic	~28K	24%
Swahili	~10K	9%
Yoruba	~5K	4%
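
The coverage column follows directly from the synset counts. A minimal sketch of that arithmetic, using the approximate counts from the table (the function and variable names are illustrative, not part of the skill):

```python
# Approximate synset counts in thousands, copied from the table above.
OMW_SYNSETS_K = {
    "English": 117,
    "Spanish": 38,
    "Arabic": 28,
    "Swahili": 10,
    "Yoruba": 5,
}


def coverage_pct(language: str, baseline: str = "English") -> int:
    """Synset coverage relative to the baseline, as a rounded percentage."""
    return round(100 * OMW_SYNSETS_K[language] / OMW_SYNSETS_K[baseline])


def gap_pct(language: str, baseline: str = "English") -> int:
    """Share of baseline synsets with no counterpart in the target language."""
    return 100 - coverage_pct(language, baseline)


for lang in OMW_SYNSETS_K:
    print(f"{lang}: {coverage_pct(lang)}% coverage, {gap_pct(lang)}% gap")
```

Because the counts are rounded estimates, the percentages should be read as order-of-magnitude guidance rather than exact figures.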

Semantic-Equivalence Eval

Prefer a COMET-style learned metric over BLEU for semantic equivalence
Use LaBSE / SONAR cross-lingual embedding cosine for retrieval-style eval
COMET-22 coverage varies by language: check it before reporting, because scores for a language outside the metric's coverage are effectively random
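
The two checks above can be sketched in a few lines. This is a hedged illustration, not the skill's implementation: `learned_metric_langs` is a hypothetical set you would fill in from the metric's own documentation, the fallback to embedding cosine is an assumed policy, and the vectors are toy stand-ins for LaBSE or SONAR embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_metric(lang, learned_metric_langs):
    """Report the learned metric only for covered languages (assumed policy)."""
    return "learned-metric" if lang in learned_metric_langs else "embedding-cosine"


def retrieval_hit(query_vec, candidate_vecs, gold_index):
    """True when the gold candidate has the highest cosine with the query."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best == gold_index


# Toy vectors standing in for cross-lingual sentence embeddings.
query = [1.0, 0.0, 0.2]
candidates = [[0.9, 0.1, 0.1], [0.0, 1.0, 0.0]]
print(retrieval_hit(query, candidates, gold_index=0))
print(pick_metric("yor", learned_metric_langs={"eng", "spa"}))
```

Gating on coverage before scoring keeps an unsupported language from producing a number that looks valid in a report.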

Inputs & Outputs

Input	Description
Target language ISO code	For WordNet/OMW/FrameNet lookup
Task type	MT / RAG / SRL / retrieval

Output	Description
WordNet/OMW coverage	Synset count + gap % vs Princeton WN
FrameNet status	Available / partial / absent
MWE strategy	PARSEME-aligned / custom catalog / none
Semantic-equivalence metric	COMET / LaBSE / SONAR recommendation
`workspace_state.md` entry	Semantics plan
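
One way to picture the outputs is as a single record that renders to a `workspace_state.md` entry. The sketch below is an assumption about shape only: the class name, field names, and markdown layout are hypothetical, and the 117K baseline is the Princeton WN figure quoted earlier.

```python
from dataclasses import dataclass

PRINCETON_WN_SYNSETS_K = 117  # baseline count in thousands, from this page
FRAMENET_STATUSES = {"available", "partial", "absent"}


@dataclass
class SemanticsPlan:
    language: str            # ISO code, e.g. "swa"
    task: str                # MT / RAG / SRL / retrieval
    omw_synsets_k: float     # approximate synset count in thousands
    framenet_status: str     # available / partial / absent
    mwe_strategy: str        # PARSEME-aligned / custom catalog / none
    equivalence_metric: str  # COMET / LaBSE / SONAR recommendation

    def __post_init__(self):
        if self.framenet_status not in FRAMENET_STATUSES:
            raise ValueError(f"unknown FrameNet status: {self.framenet_status}")

    def gap_pct(self) -> int:
        """Percentage of Princeton WN synsets missing for this language."""
        return 100 - round(100 * self.omw_synsets_k / PRINCETON_WN_SYNSETS_K)

    def to_markdown(self) -> str:
        """Render the plan as a list suitable for workspace_state.md."""
        return "\n".join([
            f"Semantics plan: {self.language} ({self.task})",
            f"- WordNet/OMW: ~{self.omw_synsets_k:g}K synsets, {self.gap_pct()}% gap",
            f"- FrameNet: {self.framenet_status}",
            f"- MWE strategy: {self.mwe_strategy}",
            f"- Semantic-equivalence metric: {self.equivalence_metric}",
        ])


plan = SemanticsPlan("swa", "MT", 10, "absent", "custom catalog", "COMET")
print(plan.to_markdown())
```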

Example Usage

Language: Swahili (swa), task: cross-lingual MT eval

Semantics Analysis: Swahili
- WordNet/OMW: ~10K synsets (9% coverage vs Princeton WN)
    Gap: technical/abstract vocabulary largely absent
- FrameNet: absent for Swahili; use Predicate Matrix bridge via English
- PropBank-style SRL: stanza morph + custom frame alignment
- MWE strategy: no PARSEME corpus; build 500-idiom pilot catalog
- Semantic-equivalence eval: COMET-22 (check Swahili coverage);
    LaBSE cosine as supplementary
- RAG grounding: flag OMW gap — 91% of English synsets unavailable
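
The pilot idiom catalog in the MWE line above could start as a simple lookup that flags known idioms in source text before translation. The sketch below is illustrative only: the catalog entries are English placeholders, the names are hypothetical, and matching is exact on contiguous lowercase tokens, so inflected or discontinuous forms are missed.

```python
import re

# Placeholder entries; a real pilot catalog would hold target-language idioms.
IDIOM_CATALOG = {
    "kick the bucket": "die",
    "spill the beans": "reveal a secret",
}


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def find_idioms(sentence, catalog=IDIOM_CATALOG):
    """Return (idiom, gloss) pairs whose tokens appear contiguously."""
    tokens = tokenize(sentence)
    hits = []
    for idiom, gloss in catalog.items():
        idiom_tokens = tokenize(idiom)
        n = len(idiom_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == idiom_tokens:
                hits.append((idiom, gloss))
                break
    return hits


print(find_idioms("He might kick the bucket soon."))
print(find_idioms("He kicked the bucket."))  # inflected form is not matched
```

Lemmatizing both the catalog and the input before matching would close the inflection gap, at the cost of depending on a morphological analyzer for the target language.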

Related Skills

linguistic-scope: language identity for resource lookup
linguistic-annotate: sense annotation project design
linguistic-eval: COMET/LaBSE as eval metrics
linguistic-bitext: MWE handling in parallel corpus
