linguistic-speech

Bridge field-linguistics annotation (ELAN, Praat, FLEx, SayMore) and community-annotated audio into ML pipelines (Lhotse, ESPnet, k2/icefall, MMS, Whisper). G2P/IPA workflows and low-resource ASR/TTS recipe selection.

Overview

Field-recorded endangered-language audio is among the most valuable and least replaceable linguistic data in existence. Getting it into an ML pipeline means handling format diversity (ELAN EAF, Praat TextGrid, FLEx XML, SayMore IMDI), legacy encoding issues (SIL PUA characters from pre-Unicode fonts), and tone-preservation requirements. linguistic-speech produces a plan for each of these steps: pre-processing, G2P approach, ASR/TTS selection, and Lhotse recipe.

Pipeline Position

Phase: Analyze (Phase 2) — when spoken data is involved

Before this skill: linguistic-scope (language identity, vitality for ethics routing), linguistic-ethics (community-controlled audio requires ethics gate)

After this skill: linguistic-eval (ASR/TTS metrics), linguistic-ethics (final release gate for community-controlled audio)

When It Activates

  • Ingesting field-recorded audio + linguistic annotation into ML pipelines
  • Choosing G2P approach for the target language
  • Selecting low-resource ASR (MMS / Whisper / fine-tune)
  • IPA validation in transcription pipelines
  • Building Lhotse CutSet from heterogeneous community-annotated sources
  • Bridging endangered-language oral data into TTS/ASR research

When NOT to use: purely text data with no speech component. For a pure tokenizer audit, use linguistic-tokenize instead.

What It Does

Input Format Support

Format | Notes
ELAN EAF | Tier-naming conventions vary per project; normalize at ingest
Praat TextGrid | Standard phonetic annotation; Lhotse recipe available
FLEx FieldWorks XML | Warning: often uses SIL PUA characters from legacy fonts; convert to Unicode at ingest
SayMore IMDI | OLAC/IMDI metadata schema; requires mapping
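
Tier normalization at ingest can be sketched with the standard library alone. The example below is a minimal sketch, assuming a time-aligned EAF file with the usual `TIME_SLOT`, `TIER`, and `ALIGNABLE_ANNOTATION` elements; the function name `read_eaf_tiers` and the alias table are hypothetical, and reference (non-aligned) annotations are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical alias table: project-specific tier IDs -> canonical names.
TIER_ALIASES = {"tx@yor": "transcription"}

def read_eaf_tiers(eaf_xml, aliases):
    """Return {tier_name: [(start_ms, end_ms, text), ...]} for aligned annotations."""
    root = ET.fromstring(eaf_xml)
    # TIME_SLOT elements map slot IDs to millisecond offsets.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    tiers = {}
    for tier in root.iter("TIER"):
        raw_name = tier.get("TIER_ID")
        # Unknown tier names pass through unchanged so nothing is dropped.
        rows = tiers.setdefault(aliases.get(raw_name, raw_name), [])
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # slot without a time value: skipped in this sketch
            rows.append((start, end, ann.findtext("ANNOTATION_VALUE", default="")))
    return tiers
```

Passing unknown tier names through, rather than discarding them, keeps project-specific tiers visible for later review.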

FLEx PUA issue: FLEx FieldWorks XML from older projects uses SIL Private Use Area (PUA) characters that originate in pre-Unicode fonts. Convert these to standard Unicode at ingest; otherwise downstream tools can fail silently.
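
A PUA check at ingest might look like the following minimal sketch. It only covers the Basic Multilingual Plane PUA block (U+E000 to U+F8FF); the mapping table is a placeholder, not a real SIL assignment, and the actual table depends on the legacy font each project used.

```python
# Placeholder mapping for illustration only: NOT a real SIL PUA assignment.
PUA_TO_UNICODE = {
    "\uf171": "\u0254\u0301",  # hypothetical: PUA glyph -> open o + acute
}

def is_pua(ch):
    """True for characters in the BMP Private Use Area."""
    return 0xE000 <= ord(ch) <= 0xF8FF

def find_pua(text):
    """Return (index, code point label) for every PUA character in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_pua(ch)]

def convert_pua(text, mapping):
    """Replace mapped PUA characters; fail loudly on unmapped ones."""
    out = []
    for i, ch in enumerate(text):
        if is_pua(ch):
            if ch not in mapping:
                raise ValueError(f"unmapped PUA character U+{ord(ch):04X} at index {i}")
            out.append(mapping[ch])
        else:
            out.append(ch)
    return "".join(out)
```

Raising on unmapped characters is a deliberate choice here: it turns the silent failure described above into an explicit one.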

ASR / TTS Tool Selection

Joshi class | ASR Primary | ASR Fallback | TTS
0–2 | MMS (1,107 languages) | Whisper-large + fine-tune | VITS / Tacotron2 fine-tune (≥5 hr audio)
3–4 | Whisper-large fine-tune | MMS | VITS / FastSpeech2
5 | Whisper-large or commercial | n/a | XTTS / commercial

MMS (Meta Massively Multilingual Speech, 1,107 languages) is the floor for Class 0–2. Whisper covers ~99 languages, with varying quality. For Class 0–2 languages that Whisper does not cover, start with MMS, then fine-tune.

TTS viability: VITS or Tacotron2 fine-tune on ~5 hours can produce intelligible output. Below 1 hour: usually not viable.
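
The selection table and the viability thresholds can be expressed as a small lookup. This is a minimal sketch with hypothetical function names; the source gives no guidance for 1 to 5 hours of audio, so that range is reported as unspecified, and applying the hour thresholds to every class is an assumption.

```python
# (class range, ASR primary, ASR fallback, TTS) taken from the table above.
_TOOL_TABLE = [
    (range(0, 3), "MMS", "Whisper-large + fine-tune", "VITS / Tacotron2 fine-tune"),
    (range(3, 5), "Whisper-large fine-tune", "MMS", "VITS / FastSpeech2"),
    (range(5, 6), "Whisper-large or commercial", None, "XTTS / commercial"),
]

def tts_viability(audio_hours):
    """Apply the stated thresholds: about 5 hr is workable, under 1 hr usually is not."""
    if audio_hours >= 5:
        return "viable"
    if audio_hours < 1:
        return "usually not viable"
    return "unspecified (between 1 and 5 hr)"  # assumption: source is silent here

def recommend_tools(joshi_class, audio_hours):
    """Return the ASR/TTS recommendation for a Joshi class and audio budget."""
    for classes, primary, fallback, tts in _TOOL_TABLE:
        if joshi_class in classes:
            return {
                "asr_primary": primary,
                "asr_fallback": fallback,
                "tts": tts,
                "tts_viability": tts_viability(audio_hours),
            }
    raise ValueError(f"Joshi class must be 0-5, got {joshi_class!r}")
```

For a Class 2 language with 12 hours of audio, `recommend_tools(2, 12)` selects MMS as the primary ASR and reports TTS fine-tuning as viable.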

G2P / IPA

Identify the per-language IPA convention (tone marking, vowel-length notation). Recommended G2P resource: WikiPron as a baseline, refined with community review. Validate transcription strings against the per-language inventory with an IPA validator.
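
Inventory validation can be approximated with greedy longest-match segmentation. The sketch below is not the skill's validator; the function name is hypothetical, the inventory passed in is whatever the per-language convention defines, and reported indices refer to the NFD-normalized string.

```python
import unicodedata

def invalid_segments(transcription, inventory):
    """Return (index, character) pairs that no inventory symbol covers."""
    text = unicodedata.normalize("NFD", transcription)
    # Normalize the inventory the same way and try longer symbols first,
    # so "a" + combining acute wins over bare "a".
    symbols = sorted(
        {unicodedata.normalize("NFD", s) for s in inventory if s},
        key=len,
        reverse=True,
    )
    bad = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        match = next((s for s in symbols if text.startswith(s, i)), None)
        if match is None:
            bad.append((i, text[i]))
            i += 1
        else:
            i += len(match)
    return bad
```

An empty result means every non-space character was covered by some inventory symbol; it does not prove the transcription is phonologically plausible.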

Critical for tone languages: many community ASR pipelines silently strip diacritics. Build a pipeline that rejects diacritic-stripped input, and validate it with the IPA validator.
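
One way to catch silent stripping is to compare combining-mark counts before and after each processing stage. This is a minimal sketch under that assumption; the function names are hypothetical, and a count comparison detects lost marks but not marks swapped for different ones.

```python
import unicodedata

def count_marks(text):
    """Count nonspacing combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.category(ch) == "Mn")

def check_marks_preserved(before, after, stage="unknown"):
    """Raise if a processing stage dropped combining marks; otherwise return after."""
    lost = count_marks(before) - count_marks(after)
    if lost > 0:
        raise ValueError(f"stage {stage!r} dropped {lost} combining mark(s)")
    return after
```

Counting after NFD decomposition makes the check independent of whether a stage emits precomposed or decomposed characters.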

Lhotse CutSet

Lhotse CutSet is the recommended audio representation as of 2026. A "cut" is audio plus supervisions (transcription, speaker, etc.) plus features (mel spectrograms, etc.). ESPnet, k2/icefall, SpeechBrain, and NeMo all consume it. Build pipelines that produce CutSets, not bespoke formats.
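
The supervision side of a cut can be sketched without any dependency. The example below is an illustration only: the field names are modeled on Lhotse's supervision manifests as an assumption and should be checked against the Lhotse documentation, the helper names are hypothetical, and a real pipeline would build the recordings, supervisions, and CutSet through Lhotse's own API.

```python
import json

def to_supervisions(recording_id, language, rows):
    """Convert (start_ms, end_ms, text, speaker) rows to supervision dicts."""
    supervisions = []
    for n, (start_ms, end_ms, text, speaker) in enumerate(rows):
        if end_ms <= start_ms:
            raise ValueError(f"row {n}: end ({end_ms}) must be after start ({start_ms})")
        supervisions.append({
            "id": f"{recording_id}-{n:04d}",
            "recording_id": recording_id,
            "start": start_ms / 1000,                 # seconds
            "duration": (end_ms - start_ms) / 1000,   # seconds
            "channel": 0,
            "text": text,
            "language": language,
            "speaker": speaker,
        })
    return supervisions

def to_jsonl(supervisions):
    """Serialize one JSON object per line, keeping diacritics unescaped."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in supervisions)
```

Writing with `ensure_ascii=False` keeps tone diacritics human-readable in the manifest, which makes manual review by community annotators easier.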

Inputs & Outputs

Input | Description
Audio files + annotation formats | ELAN/Praat/FLEx/SayMore/CSV
Target language ISO code | For G2P + ASR tool selection
Joshi class | For ASR/TTS approach

Output | Description
Pre-processing plan | PUA→Unicode, tier normalization, diacritic preservation
G2P approach | WikiPron baseline / custom
ASR recommendation | MMS / Whisper / fine-tune
TTS viability | Hours required / not viable
Lhotse recipe | Existing / custom
workspace_state.md entry | Speech plan

Example Usage

Language: Yoruba (yor), Class 2, 12 hours of community-recorded audio with ELAN annotation

Speech Plan: Yoruba
- Input format: ELAN EAF
- Pre-processing: BOM strip, NFC norm, PRESERVE diacritics (tone language)
    Tier-name normalization: map "tx@yor" → canonical "transcription"
- G2P: WikiPron Yoruba baseline + community review (tone diacritics critical)
- IPA validation: MANDATORY — reject diacritic-stripped input
- ASR: MMS primary (Class 2, 12hr available); Whisper-large fine-tune secondary
- TTS: VITS fine-tune viable (12hr > 5hr threshold)
- Lhotse recipe: custom (ELAN EAF ingest)
- Ethics: route to linguistic-ethics (community-controlled recordings)
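
The pre-processing step in this plan (BOM strip, NFC normalization, diacritics preserved) can be sketched as follows; the function name is hypothetical. NFC composes characters where a precomposed form exists and leaves the remaining combining marks in place, so tone marks survive.

```python
import unicodedata

def normalize_transcript(text):
    """Strip a leading byte-order mark and apply NFC; never removes diacritics."""
    if text.startswith("\ufeff"):
        text = text[1:]
    return unicodedata.normalize("NFC", text)
```

For Yoruba, a letter such as o with dot below and grave accent has no single precomposed code point, so NFC yields the dotted letter followed by a combining grave rather than dropping the tone mark.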