linguistic-annotate

Design, run, and audit annotation projects: guideline authoring, IAA metric selection, adjudication workflow, and active learning for limited annotation budgets.

Overview

Annotation is where linguistic data quality is made or broken. Calibration rounds cost 10× less than re-annotating bulk. Guideline drift is invisible until IAA drops. Single-annotator gold is opinion, not data. linguistic-annotate enforces the process discipline that separates reliable gold-standard datasets from expensive noise.

Pipeline Position

Phase: Analyze (Phase 2)

Before this skill: Task definition (NER, POS, parsing, sentiment, MT eval); language analysis from scope/morph/syntax

After this skill: Training data pipeline; linguistic-eval (gold-standard eval set)

When It Activates

  • Designing a new annotation project
  • Selecting an IAA metric for the task
  • Calculating IAA from given counts
  • Running adjudication after multi-annotator pass
  • Active-learning sample selection for limited annotation budget
  • Deciding whether to ship a single-annotator gold dataset (usually NO)

When NOT to use: purely synthetic data generation that needs no human labels, or reuse of an existing gold standard without modification (just use it).

What It Does

IAA Metric Selection

| Task type | Annotators | Metric | Why |
| --- | --- | --- | --- |
| Binary / nominal categorical | 2 | Cohen κ | Standard; chance-adjusted |
| Same, skewed prevalence | 2 | PABAK or F1-complement | κ misleads on imbalance |
| Same, ≥ 3 annotators | ≥ 3 | Fleiss κ | Multi-annotator κ |
| Ordinal labels (Likert, severity) | any | Krippendorff α (ordinal) | Unweighted κ ignores label order |
| Missing values per annotator | any | Krippendorff α | Handles missingness |
| Span / unitized (NER, coref) | any | γ (gamma) | Models span boundaries |
| Free-text overlap | any | F1 / BLEU / ROUGE | No κ-style metric |

Threshold convention: ≥ 0.8 = "good"; 0.67–0.8 = "tentative"; < 0.67 = unreliable. Always report bootstrap CI — single-number agreement on small samples is unreliable.
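To make the bootstrap recommendation concrete, here is a minimal sketch of a percentile bootstrap interval around Cohen κ for two annotators. The function names, the resample count, and the toy labels are illustrative assumptions, not part of the skill itself; treating a degenerate resample (both annotators using one shared label) as κ = 1 is also a convention chosen here.

```python
import random
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)        # chance agreement
    if p_e == 1.0:
        return 1.0  # assumption: both annotators used one shared label only
    return (p_o - p_e) / (1 - p_e)

def bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy data (hypothetical): 50 items labelled by two annotators.
ann_a = ["pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg", "neg", "neg"] * 5
ann_b = ["pos", "neg", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg"] * 5
print(cohen_kappa(ann_a, ann_b), bootstrap_ci(ann_a, ann_b))
```

A wide interval on a small pilot is a signal to annotate more overlap items before trusting the point estimate against the thresholds above.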

Cohen κ is misleading on highly skewed classes. When 90% of items are "negative", chance-expected agreement is already high, so κ can come out low even when annotators agree on most items. Use PABAK or F1-complement for imbalanced tasks.
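The skew effect can be checked from a 2×2 table of counts. The sketch below computes Cohen κ and PABAK (for the binary case, PABAK = 2·p_o − 1) from hypothetical counts; the function name and the numbers are illustrative assumptions.

```python
def kappa_and_pabak(both_pos, a_only, b_only, both_neg):
    """Cohen's kappa and PABAK from binary 2x2 agreement counts.

    a_only: items only annotator A marked positive; b_only: only B.
    Assumes chance agreement is below 1 (both labels actually occur).
    """
    n = both_pos + a_only + b_only + both_neg
    p_o = (both_pos + both_neg) / n                   # observed agreement
    a_pos = (both_pos + a_only) / n                   # A's positive rate
    b_pos = (both_pos + b_only) / n                   # B's positive rate
    p_e = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)   # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    pabak = 2 * p_o - 1
    return kappa, pabak

# Hypothetical skewed task: 100 items, 92 agreements, mostly "negative".
kappa, pabak = kappa_and_pabak(both_pos=2, a_only=4, b_only=4, both_neg=90)
print(f"kappa={kappa:.2f} pabak={pabak:.2f}")  # kappa=0.29 pabak=0.84
```

With 92% raw agreement, κ lands near 0.29 while PABAK is 0.84, which is why reporting κ alone on imbalanced labels can make a usable annotation round look unreliable.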

Annotation Workflow

Iterative, in this order:

  • Draft v0 (20–30 examples)
  • Pilot (50 items, 2–3 annotators, discuss disagreements)
  • Calibration (100 items, all annotators, per-decision discussion)
  • Bulk (full corpus, 10% double-annotated for IAA monitoring)
  • Adjudication (curator resolves all disagreements)

Skipping calibration typically means 10–20% of the bulk needs re-annotation. The cost asymmetry is brutal.
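As a rough sketch of the bulk stage, the 10% double-annotated subset can be drawn at random before assignment so that IAA monitoring covers the whole corpus rather than only the first batch. The function name, the fixed seed, and the integer item IDs are assumptions for illustration.

```python
import random

def split_for_iaa(item_ids, overlap=0.10, seed=0):
    """Pick a random overlap subset to be double-annotated.

    Returns (double, single): items that go to two annotators, and
    items that go to one. Seeded so the split is reproducible.
    """
    rng = random.Random(seed)
    items = list(item_ids)
    k = round(len(items) * overlap)
    double = set(rng.sample(items, k))
    single = [i for i in items if i not in double]
    return sorted(double), single

# Hypothetical corpus of 5,000 sentence IDs.
double, single = split_for_iaa(range(5000))
print(len(double), len(single))  # 500 4500
```

Drawing the overlap items up front, rather than letting annotators choose them, keeps the monitoring sample from drifting toward easy cases.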

Adjudication

For each disagreement: curator reviews → picks resolution or tags "ambiguous — exclude" → updates guidelines if new edge-case pattern emerges. Track per-annotator disagreement rates — high rates indicate annotator drift, training gap, or guideline ambiguity.
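One way to track per-annotator disagreement is to compare each annotator's labels against the curator's resolution, skipping items excluded as ambiguous. This is a minimal sketch; the record layout, the use of `None` for excluded items, and the function name are assumptions rather than a prescribed format.

```python
from collections import defaultdict

def disagreement_rates(annotations, adjudicated):
    """Share of each annotator's labels that differ from the curator's.

    annotations: iterable of (item_id, annotator, label).
    adjudicated: dict item_id -> curator label, or None if excluded.
    Items that are excluded or not yet adjudicated are skipped.
    """
    total = defaultdict(int)
    differs = defaultdict(int)
    for item_id, annotator, label in annotations:
        gold = adjudicated.get(item_id)
        if gold is None:
            continue
        total[annotator] += 1
        if label != gold:
            differs[annotator] += 1
    return {a: differs[a] / total[a] for a in total}

# Hypothetical adjudication log: item 3 was excluded as ambiguous.
annotations = [
    (1, "A", "PER"), (1, "B", "PER"),
    (2, "A", "ORG"), (2, "B", "LOC"),
    (3, "A", "LOC"), (3, "B", "ORG"),
    (4, "A", "PER"), (4, "B", "ORG"),
]
adjudicated = {1: "PER", 2: "ORG", 3: None, 4: "PER"}
rates = disagreement_rates(annotations, adjudicated)
print(rates)
```

A rate that is high for one annotator points toward drift or a training gap; rates that are high for everyone on the same label pair point toward a guideline ambiguity.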

Active Learning

| Approach | Best for | Tradeoff |
| --- | --- | --- |
| Random sampling | Baseline | Simple; under-samples rare cases |
| Uncertainty sampling | Most cases | Catches uncertain examples |
| Diversity (cluster-based) | Very low resource | Best distribution coverage |
| Hybrid (uncertainty + diversity) | Production | Strongest in practice |

The default "uncertainty sampling" advice fails when the target distribution is itself sparse. In very-low-resource settings, use cluster-based sampling, because diverse coverage matters more there than probing model blind spots.
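A hybrid selector can be sketched as round-robin over clusters, taking the most uncertain remaining item from each cluster in turn. The sketch assumes that predicted class probabilities and cluster IDs are already available (for example from a seed model and a clustering step run elsewhere); the function names and the toy pool are hypothetical.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def hybrid_select(pool, budget):
    """pool: list of (item_id, probs, cluster_id). Returns chosen item ids.

    Visits clusters round-robin (diversity) and, within each cluster,
    takes the highest-entropy item first (uncertainty).
    """
    by_cluster = defaultdict(list)
    for item_id, probs, cluster in pool:
        by_cluster[cluster].append((entropy(probs), item_id))
    for items in by_cluster.values():
        items.sort(reverse=True)  # most uncertain first
    selected = []
    while len(selected) < budget and any(by_cluster.values()):
        for cluster in sorted(by_cluster):
            if by_cluster[cluster] and len(selected) < budget:
                selected.append(by_cluster[cluster].pop(0)[1])
    return selected

# Hypothetical pool: cluster 0 holds the three most uncertain items.
pool = [
    ("s1", [0.50, 0.50], 0),
    ("s2", [0.65, 0.35], 0),
    ("s5", [0.70, 0.30], 0),
    ("s3", [0.99, 0.01], 1),
    ("s4", [0.80, 0.20], 1),
]
print(hybrid_select(pool, budget=2))  # ['s1', 's4']
```

Pure uncertainty sampling would spend this budget on s1 and s2, both from cluster 0; the round-robin step is what forces coverage of cluster 1.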

Inputs & Outputs

| Input | Description |
| --- | --- |
| Task definition | Unit, labels, boundary policy |
| Language + domain | For guideline design |
| Annotation budget | For active learning approach |

| Output | Description |
| --- | --- |
| Annotation plan | Unit, labels, workflow stages |
| IAA metric selection | With rationale |
| Guideline document | Draft → calibrated |
| IAA scores | With bootstrap CI |
| Gold dataset | Post-adjudication |

Example Usage

Task: Named Entity Recognition for Yoruba news text

Annotation Plan: Yoruba NER (news)
- Unit: token span
- Labels: PER, ORG, LOC, DATE (closed set)
- Boundary policy: include leading articles in ORG spans
- IAA metric: γ (gamma) — span boundaries require span metric
- Annotators: 3 (2 native speakers + 1 computational linguist)
- Calibration: 100 items, all annotators, 2-hour session
- Bulk: 5,000 sentences; 10% double-annotated (n=500)
- Active learning: diversity/cluster-based (low-resource; need coverage)
- Target IAA: γ ≥ 0.8 before bulk release
- Tool: Label Studio (self-hosted; span annotation support)