linguistic-annotate
Design, run, and audit annotation projects: guideline authoring, IAA metric selection, adjudication workflow, and active learning for limited annotation budgets.
Overview
Annotation is where linguistic data quality is made or broken. Calibration rounds cost 10× less than re-annotating bulk. Guideline drift is invisible until IAA drops. Single-annotator gold is opinion, not data. linguistic-annotate enforces the process discipline that separates reliable gold-standard datasets from expensive noise.
Pipeline Position
Phase: Analyze (Phase 2)
Before this skill: Task definition (NER, POS, parsing, sentiment, MT eval); language analysis from scope/morph/syntax
After this skill: Training data pipeline; linguistic-eval (gold-standard eval set)
When It Activates
- Designing a new annotation project
- Selecting an IAA metric for the task
- Calculating IAA from given counts
- Running adjudication after multi-annotator pass
- Active-learning sample selection for limited annotation budget
- Deciding whether to ship a single-annotator gold dataset (usually NO)
When NOT to use: Purely synthetic-data generation with no human labels needed. Existing gold-standard reuse without modification — just use it.
What It Does
IAA Metric Selection
| Task Type | Annotators | Metric | Why |
|---|---|---|---|
| Binary / nominal categorical | 2 | Cohen κ | Standard; chance-adjusted |
| Same, skewed prevalence | 2 | PABAK or F1-complement | κ misleads on imbalance |
| Same, ≥ 3 annotators | ≥ 3 | Fleiss κ | Multi-annotator κ |
| Ordinal labels (Likert, severity) | any | Krippendorff α (ordinal) | Unweighted κ ignores label order |
| Missing values per annotator | any | Krippendorff α | Handles missingness |
| Span / unitized (NER, coref) | any | γ (gamma) | Models span boundaries |
| Free-text overlap | any | F1 / BLEU / ROUGE | No κ-style metric |
Threshold convention: ≥ 0.8 = "good"; 0.67–0.8 = "tentative"; < 0.67 = unreliable. Always report bootstrap CI — single-number agreement on small samples is unreliable.
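A percentile bootstrap is one simple way to attach a confidence interval to an agreement score. The sketch below is a minimal, standard-library illustration for two annotators and Cohen κ; the function names (`cohen_kappa`, `bootstrap_ci`), the number of resamples, and the toy labels are assumptions made for this example, not part of the skill itself.

```python
import random
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators used one identical label
    return (p_o - p_e) / (1 - p_e)

def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample items with replacement, recompute kappa."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy labels (illustrative only): 30 items, 2 disagreements.
ann_a = ["PER"] * 10 + ["ORG"] * 10 + ["LOC"] * 10
ann_b = ["PER"] * 9 + ["ORG"] * 10 + ["LOC"] * 11

kappa = cohen_kappa(ann_a, ann_b)
lo, hi = bootstrap_ci(ann_a, ann_b)
print(f"kappa={kappa:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

On a sample this small the interval is wide, which is the reason to report it next to the point estimate.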
Cohen κ is misleading on highly skewed classes. When 90% of items are "negative", expected chance agreement is already high, so κ can come out low even when raw agreement is high. Use PABAK or F1-complement for imbalanced tasks.
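The skew effect can be checked directly from a 2×2 agreement table. The sketch below assumes a binary task and uses made-up counts; `kappa_and_pabak` is a hypothetical helper name. For two categories, PABAK reduces to `2 * p_o - 1`.

```python
def kappa_and_pabak(both_pos, a_only, b_only, both_neg):
    """Return (raw agreement, Cohen kappa, PABAK) from 2x2 agreement counts."""
    n = both_pos + a_only + b_only + both_neg
    p_o = (both_pos + both_neg) / n                       # observed agreement
    a_pos = (both_pos + a_only) / n                       # annotator A's positive rate
    b_pos = (both_pos + b_only) / n                       # annotator B's positive rate
    p_e = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)       # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    pabak = 2 * p_o - 1                                   # binary-case PABAK
    return p_o, kappa, pabak

# Same raw agreement (92 of 100 items), different prevalence (illustrative counts).
skewed = kappa_and_pabak(both_pos=2, a_only=4, b_only=4, both_neg=90)
balanced = kappa_and_pabak(both_pos=46, a_only=4, b_only=4, both_neg=46)

print("skewed:   p_o=%.2f kappa=%.2f pabak=%.2f" % skewed)    # kappa ≈ 0.29
print("balanced: p_o=%.2f kappa=%.2f pabak=%.2f" % balanced)  # kappa ≈ 0.84
```

Both tables have identical raw agreement, yet κ drops from about 0.84 to about 0.29 once the positive class becomes rare, while PABAK stays at 0.84.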
Annotation Workflow
Iterative: Draft v0 (20–30 examples) → Pilot (50 items, 2–3 annotators, discuss disagreements) → Calibration (100 items, all annotators, per-decision discussion) → Bulk (full corpus, 10% double-annotated for IAA monitoring) → Adjudication (curator resolves all disagreements).
Skipping calibration → 10–20% of bulk needs re-annotation. The cost asymmetry is brutal.
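For the bulk stage, the double-annotated subset can be drawn at random before work is handed out. The following is a minimal sketch under assumed names (`assign_bulk`, the annotator IDs, and the round-robin assignment are illustrative choices, not a prescribed procedure); it needs at least two annotators.

```python
import random

def assign_bulk(item_ids, annotators, overlap=0.10, seed=0):
    """Assign each item to one annotator; a random `overlap` share gets a second one."""
    rng = random.Random(seed)
    ids = list(item_ids)
    rng.shuffle(ids)                                   # random order => random overlap subset
    n_double = round(len(ids) * overlap)
    assignments = {}
    for i, item in enumerate(ids):
        first = annotators[i % len(annotators)]        # round-robin primary annotator
        if i < n_double:
            second = annotators[(i + 1) % len(annotators)]  # a different annotator
            assignments[item] = [first, second]
        else:
            assignments[item] = [first]
    return assignments

# Hypothetical corpus of 5,000 sentence IDs and three annotators.
plan = assign_bulk(range(5000), ["ann1", "ann2", "ann3"])
double = [item for item, who in plan.items() if len(who) == 2]
print(len(double))  # 500 items carry two independent annotations
```

Fixing the seed keeps the overlap subset reproducible, so IAA monitoring always refers to the same items.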
Adjudication
For each disagreement: curator reviews → picks resolution or tags "ambiguous — exclude" → updates guidelines if new edge-case pattern emerges. Track per-annotator disagreement rates — high rates indicate annotator drift, training gap, or guideline ambiguity.
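Per-annotator disagreement rates can be tallied once the curator's resolutions are recorded. This sketch assumes a simple record format of `(annotator, label, adjudicated_label)`, with `None` standing in for items tagged "ambiguous, exclude"; the function name and the toy records are illustrative.

```python
from collections import defaultdict

def disagreement_rates(records):
    """Share of each annotator's labels that differ from the adjudicated label."""
    totals = defaultdict(int)
    misses = defaultdict(int)
    for annotator, label, adjudicated in records:
        if adjudicated is None:          # item excluded as ambiguous: skip it
            continue
        totals[annotator] += 1
        if label != adjudicated:
            misses[annotator] += 1
    return {a: misses[a] / totals[a] for a in totals}

# Toy adjudication log (illustrative only).
log = [
    ("ann1", "PER", "PER"), ("ann2", "ORG", "PER"),
    ("ann1", "LOC", "LOC"), ("ann2", "LOC", "LOC"),
    ("ann1", "ORG", None),  ("ann2", "LOC", None),
    ("ann1", "ORG", "ORG"), ("ann2", "PER", "ORG"),
]
print(disagreement_rates(log))  # {'ann1': 0.0, 'ann2': 0.6666666666666666}
```

A rate that is high for one annotator points to drift or a training gap; a rate that is high for everyone on the same label pair points to guideline ambiguity.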
Active Learning
| Approach | Best For | Tradeoff |
|---|---|---|
| Random sampling | Baseline | Simple; under-samples rare |
| Uncertainty sampling | Most cases | Targets model blind spots; can miss sparse regions |
| Diversity (cluster-based) | Very low resource | Best distribution coverage |
| Hybrid (uncertainty + diversity) | Production | Strongest in practice |
The default advice to use uncertainty sampling fails when the target distribution is itself sparse. In very-low-resource settings, use cluster-based sampling instead: diverse coverage matters more there than probing model blind spots.
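One way to combine the two signals is to pick the most uncertain items cluster by cluster. The sketch below is a simplified illustration, not a reference implementation: it assumes model probabilities and cluster IDs are already available per item, uses entropy as the uncertainty score, and `hybrid_select` is a hypothetical name.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy of a predicted label distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def hybrid_select(probs_by_item, cluster_by_item, budget):
    """Round-robin over clusters, taking the most uncertain remaining item from each."""
    by_cluster = defaultdict(list)
    for item, probs in probs_by_item.items():
        by_cluster[cluster_by_item[item]].append((entropy(probs), item))
    for c in by_cluster:
        by_cluster[c].sort(reverse=True)               # most uncertain first
    clusters = sorted(by_cluster)
    selected = []
    while len(selected) < budget and any(by_cluster[c] for c in clusters):
        for c in clusters:
            if by_cluster[c] and len(selected) < budget:
                selected.append(by_cluster[c].pop(0)[1])
    return selected

# Toy inputs (illustrative): cluster 0 holds the two most uncertain items.
probs = {"s1": [0.5, 0.5], "s2": [0.55, 0.45], "s3": [0.8, 0.2], "s4": [0.99, 0.01]}
clusters = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
print(hybrid_select(probs, clusters, budget=2))  # ['s1', 's3']
```

Pure uncertainty sampling would pick `s1` and `s2` here, both from the same cluster; the round-robin step forces coverage of cluster 1 as well.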
Inputs & Outputs
| Input | Description |
|---|---|
| Task definition | Unit, labels, boundary policy |
| Language + domain | For guideline design |
| Annotation budget | For active learning approach |
| Output | Description |
|---|---|
| Annotation plan | Unit, labels, workflow stages |
| IAA metric selection | With rationale |
| Guideline document | Draft → calibrated |
| IAA scores | With bootstrap CI |
| Gold dataset | Post-adjudication |
Example Usage
Task: Named Entity Recognition for Yoruba news text
Annotation Plan: Yoruba NER (news)
- Unit: token span
- Labels: PER, ORG, LOC, DATE (closed set)
- Boundary policy: include leading articles in ORG spans
- IAA metric: γ (gamma) — span boundaries require span metric
- Annotators: 3 (2 native speakers + 1 computational linguist)
- Calibration: 100 items, all annotators, 2-hour session
- Bulk: 5,000 sentences; 10% double-annotated (n=500)
- Active learning: diversity/cluster-based (low-resource; need coverage)
- Target IAA: γ ≥ 0.8 before bulk release
- Tool: Label Studio (self-hosted; span annotation support)
Related Skills
- linguistic-syntax — UD treebank annotation uses these methodology guidelines
- linguistic-semantics — sense annotation project design
- linguistic-eval — gold sets from annotation become eval suite items
Related Skills from Other Suites
- Data Profiling — data quality and characteristics analysis