linguistic-annotate

Design, run, and audit annotation projects: guideline authoring, IAA metric selection, adjudication workflow, and active learning for limited annotation budgets.

Overview

Annotation is where linguistic data quality is made or broken. Calibration rounds cost 10× less than re-annotating the bulk set. Guideline drift is invisible until IAA drops. Single-annotator gold is opinion, not data. linguistic-annotate enforces the process discipline that separates reliable gold-standard datasets from expensive noise.

Pipeline Position

Phase: Analyze (Phase 2)

Before this skill: Task definition (NER, POS, parsing, sentiment, MT eval); language analysis from scope/morph/syntax

After this skill: Training data pipeline; linguistic-eval (gold-standard eval set)

When It Activates

Designing a new annotation project
Selecting an IAA metric for the task
Calculating IAA from given counts
Running adjudication after multi-annotator pass
Active-learning sample selection for limited annotation budget
Deciding whether to ship a single-annotator gold dataset (usually NO)

When NOT to use: Purely synthetic-data generation with no human labels needed. Existing gold-standard reuse without modification — just use it.

What It Does

IAA Metric Selection

Task Type	Annotators	Metric	Why
Binary / nominal categorical	2	Cohen κ	Standard; chance-adjusted
Same, skewed prevalence	2	PABAK or F1-complement	κ misleads on imbalance
Same, ≥ 3 annotators	≥ 3	Fleiss κ	Multi-annotator κ
Ordinal labels (Likert, severity)	any	Krippendorff α (ordinal)	Unweighted κ ignores label order
Missing values per annotator	any	Krippendorff α	Handles missingness
Span / unitized (NER, coref)	any	γ (gamma)	Models span boundaries
Free-text overlap	any	F1 / BLEU / ROUGE	No κ-style metric
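
For the "≥ 3 annotators" row, Fleiss κ can be computed directly from per-item label counts. The following is a minimal sketch, assuming every item was rated by the same number of annotators; the function name and the example counts are hypothetical and only illustrate the arithmetic.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of annotators who gave item i the label j.

    Assumes every item received the same number of ratings (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Per-item agreement: share of annotator pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the pooled label proportions.
    totals = [sum(col) for col in zip(*counts)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical counts: 4 items, 3 annotators, 2 labels.
example = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(example), 3))  # 0.333
```

Two unanimous items and two 2-vs-1 splits give κ ≈ 0.33, well below the "tentative" band, even though no item is a complete split.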

Threshold convention: ≥ 0.8 = "good"; 0.67–0.8 = "tentative"; < 0.67 = unreliable. Always report a bootstrap confidence interval alongside the point estimate: a single agreement number computed on a small sample is itself unreliable.
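
One way to get that interval is a percentile bootstrap over items. The sketch below assumes two annotators and Cohen κ; `bootstrap_ci`, its defaults (2,000 resamples, 95% level), and the choice to score a single-label resample as κ = 1.0 are illustrative assumptions, not part of the skill.

```python
import random


def cohen_kappa(a, b):
    """Cohen kappa for two equal-length label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in set(a) | set(b))
    if p_e == 1.0:
        # Assumption: a resample where both annotators used one identical
        # label is scored as perfect agreement instead of dividing by zero.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def bootstrap_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi
```

Calling `bootstrap_ci(labels_a, labels_b)` on the double-annotated subset returns a `(lower, upper)` pair; a wide interval suggests the sample is too small to support a threshold decision.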

Cohen κ is misleading on highly skewed classes. When 90% of items are "negative", expected chance agreement is already high, so κ can come out low even when the annotators agree on almost every item. Use PABAK or F1-complement for imbalanced tasks.
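
The skew effect is easy to see from a 2×2 count table. This is a small sketch with made-up counts (100 items, roughly 90% negative); the function name and argument names are hypothetical.

```python
def kappa_and_pabak(both_pos, a_only, b_only, both_neg):
    """Cohen kappa and PABAK for two annotators on a binary task.

    a_only / b_only: items only annotator A / only annotator B marked positive.
    """
    n = both_pos + a_only + b_only + both_neg
    p_o = (both_pos + both_neg) / n          # observed agreement
    a_pos = (both_pos + a_only) / n          # A's positive rate
    b_pos = (both_pos + b_only) / n          # B's positive rate
    p_e = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    pabak = 2 * p_o - 1                      # binary case: prevalence and bias adjusted
    return kappa, pabak


# Made-up counts: 5 joint positives, 85 joint negatives, 10 disagreements.
kappa, pabak = kappa_and_pabak(both_pos=5, a_only=5, b_only=5, both_neg=85)
print(f"kappa={kappa:.2f} pabak={pabak:.2f}")  # kappa=0.44 pabak=0.80
```

Raw agreement here is 90%, yet κ lands at 0.44 because chance agreement is already 0.82; PABAK reports 0.80 for the same table.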

Annotation Workflow

Iterative: Draft v0 (20–30 examples) → Pilot (50 items, 2–3 annotators, discuss disagreements) → Calibration (100 items, all annotators, per-decision discussion) → Bulk (full corpus, 10% double-annotated for IAA monitoring) → Adjudication (curator resolves all disagreements).
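
The bulk stage's 10% double-annotation can be planned up front. The sketch below is one possible assignment scheme, assuming round-robin distribution and at least two annotators; `plan_bulk` and its parameters are hypothetical names.

```python
import random


def plan_bulk(item_ids, annotators, overlap=0.10, seed=0):
    """Assign each item to one annotator, and a random `overlap` share to two.

    Returns {item_id: [annotator, ...]}. Requires at least two annotators.
    """
    if len(annotators) < 2:
        raise ValueError("double annotation needs at least two annotators")
    rng = random.Random(seed)
    ids = list(item_ids)
    rng.shuffle(ids)  # so the double-annotated subset is a random sample
    n_double = round(len(ids) * overlap)

    assignments = {}
    for k, item in enumerate(ids):
        first = annotators[k % len(annotators)]
        if k < n_double:
            # Second pass goes to the next annotator in the rotation.
            second = annotators[(k + 1) % len(annotators)]
            assignments[item] = [first, second]
        else:
            assignments[item] = [first]
    return assignments
```

With 5,000 items and three annotators this yields 500 double-annotated items, which become the IAA monitoring sample.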

Skipping calibration → 10–20% of bulk needs re-annotation. The cost asymmetry is brutal.

Adjudication

For each disagreement: curator reviews → picks resolution or tags "ambiguous — exclude" → updates guidelines if new edge-case pattern emerges. Track per-annotator disagreement rates — high rates indicate annotator drift, training gap, or guideline ambiguity.
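
Per-annotator tracking can be as simple as counting how often each annotator's label differs from the adjudicated one. This sketch assumes that definition of "disagreement rate" and a hypothetical data layout (nested dicts, with `None` marking items tagged "ambiguous, exclude").

```python
from collections import Counter


def disagreement_rates(labels, gold):
    """Share of each annotator's labels that differ from the adjudicated label.

    labels: {item_id: {annotator: label}}
    gold:   {item_id: adjudicated label, or None if excluded as ambiguous}
    """
    seen = Counter()
    wrong = Counter()
    for item, by_annotator in labels.items():
        resolved = gold.get(item)
        if resolved is None:
            continue  # excluded items do not count for or against anyone
        for annotator, label in by_annotator.items():
            seen[annotator] += 1
            if label != resolved:
                wrong[annotator] += 1
    return {a: wrong[a] / seen[a] for a in seen}
```

A rate that climbs for one annotator points toward drift or a training gap; a rate that climbs for everyone at once points toward the guidelines.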

Active Learning

Approach	Best For	Tradeoff
Random sampling	Baseline	Simple; under-samples rare
Uncertainty sampling	Most cases	Targets model blind spots; weak when the target distribution is sparse
Diversity (cluster-based)	Very low resource	Best distribution coverage
Hybrid (uncertainty + diversity)	Production	Strongest in practice

The default "uncertainty sampling" advice fails when the target distribution is itself sparse. For very-low-resource settings, use cluster-based diversity sampling, where broad coverage matters more than probing model blind spots.
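
A hybrid selector can be sketched in two steps: shortlist by predictive entropy, then spread the picks out in feature space. Note that the diversity step here uses greedy farthest-point selection as a simple stand-in for clustering; `hybrid_select`, `pool_factor`, and the input layout are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def hybrid_select(probs, features, budget, pool_factor=3):
    """Pick `budget` item indices: uncertain first, then mutually distant.

    probs[i]    : model's predicted label distribution for item i
    features[i] : numeric feature vector (e.g. an embedding) for item i
    """
    # Step 1 (uncertainty): shortlist the highest-entropy items.
    ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
    pool = ranked[: budget * pool_factor]
    if not pool or budget <= 0:
        return []

    # Step 2 (diversity): repeatedly add the shortlisted item that is
    # farthest from everything already chosen.
    chosen = [pool[0]]
    while len(chosen) < min(budget, len(pool)):
        best = max(
            (i for i in pool if i not in chosen),
            key=lambda i: min(math.dist(features[i], features[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen
```

Setting `pool_factor=1` reduces this to plain uncertainty sampling, while a very large `pool_factor` makes it almost purely diversity-driven, so the parameter controls where the selector sits between the two rows of the table.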

Inputs & Outputs

Input	Description
Task definition	Unit, labels, boundary policy
Language + domain	For guideline design
Annotation budget	For active learning approach

Output	Description
Annotation plan	Unit, labels, workflow stages
IAA metric selection	With rationale
Guideline document	Draft → calibrated
IAA scores	With bootstrap CI
Gold dataset	Post-adjudication

Example Usage

Task: Named Entity Recognition for Yoruba news text

Annotation Plan: Yoruba NER (news)
- Unit: token span
- Labels: PER, ORG, LOC, DATE (closed set)
- Boundary policy: include leading articles in ORG spans
- IAA metric: γ (gamma) — span boundaries require span metric
- Annotators: 3 (2 native speakers + 1 computational linguist)
- Calibration: 100 items, all annotators, 2-hour session
- Bulk: 5,000 sentences; 10% double-annotated (n=500)
- Active learning: diversity/cluster-based (low-resource; need coverage)
- Target IAA: γ ≥ 0.8 before bulk release
- Tool: Label Studio (self-hosted; span annotation support)

Related Skills

linguistic-syntax — UD treebank annotation uses these methodology guidelines
linguistic-semantics — sense annotation project design
linguistic-eval — gold sets from annotation become eval suite items
