linguistic-speech
Bridge field-linguistics annotation (ELAN, Praat, FLEx, SayMore) and community-annotated audio into ML pipelines (Lhotse, ESPnet, k2/icefall, MMS, Whisper). G2P/IPA workflows and low-resource ASR/TTS recipe selection.
Overview
Field-recorded endangered-language audio is some of the most valuable and irreplaceable linguistic data in existence. Getting it into an ML pipeline requires navigating format diversity (ELAN EAF, Praat TextGrid, FLEx XML, SayMore IMDI), legacy encoding issues (SIL PUA characters from pre-Unicode fonts), and tone-preservation requirements. linguistic-speech handles this pipeline reliably.
Pipeline Position
Phase: Analyze (Phase 2) — when spoken data is involved
Before this skill: linguistic-scope (language identity, vitality for ethics routing), linguistic-ethics (community-controlled audio requires ethics gate)
After this skill: linguistic-eval (ASR/TTS metrics), linguistic-ethics (final release gate for community-controlled audio)
When It Activates
- Ingesting field-recorded audio + linguistic annotation into ML pipelines
- Choosing G2P approach for the target language
- Selecting low-resource ASR (MMS / Whisper / fine-tune)
- IPA validation in transcription pipelines
- Building Lhotse CutSet from heterogeneous community-annotated sources
- Bridging endangered-language oral data into TTS/ASR research
When NOT to use: Purely text data — not speech-relevant. Pure tokenizer audit → linguistic-tokenize.
What It Does
Input Format Support
| Format | Notes |
|---|---|
| ELAN EAF | Tier-naming conventions vary per project — normalize at ingest |
| Praat TextGrid | Standard phonetic annotation; Lhotse recipe available |
| FLEx FieldWorks XML | Warning: often uses SIL PUA characters from legacy fonts — convert to Unicode at ingest |
| SayMore IMDI | OLAC/IMDI metadata schema; requires mapping |
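The tier-normalization step for ELAN EAF can be sketched with the standard library alone. This is a minimal illustration, not a full EAF reader: it handles only time-aligned annotations, and the `TIER_MAP` contents, the function name `read_eaf_tiers`, and the sample document are assumptions made for the example. Real projects will need their own mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical project-specific mapping; real tier names vary per project.
TIER_MAP = {"tx@yor": "transcription", "ft@yor": "translation"}

# Tiny hand-written EAF fragment used only to exercise the function.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="tx@yor" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>bàbá</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="notes" LINGUISTIC_TYPE_REF="default-lt"/>
</ANNOTATION_DOCUMENT>"""


def read_eaf_tiers(eaf_xml, tier_map):
    """Return {canonical_tier: [(start_ms, end_ms, text), ...]}.

    Tiers whose TIER_ID is not in tier_map are skipped.
    """
    root = ET.fromstring(eaf_xml)
    # TIME_SLOT elements map slot ids to millisecond offsets.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    tiers = {}
    for tier in root.iter("TIER"):
        name = tier.get("TIER_ID")
        if name not in tier_map:
            continue
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # slot without a time value; cannot be aligned
            text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            rows.append((start, end, text))
        tiers[tier_map[name]] = rows
    return tiers


tiers = read_eaf_tiers(SAMPLE_EAF, TIER_MAP)
```

Skipping unmapped tiers keeps the output schema fixed; a stricter variant could raise on unknown tier names so that new naming conventions are noticed at ingest.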
FLEx PUA issue: FLEx FieldWorks XML from older projects often uses SIL Private Use Area (PUA) characters left over from pre-Unicode fonts. Convert them to Unicode at ingest; otherwise downstream tools can fail silently.
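One way to make the PUA problem visible at ingest is to scan for characters in Unicode general category `Co` (Private Use) and refuse to continue until each one has a mapping. The sketch below assumes a project-supplied mapping table; the function names and the example mapping are hypothetical and do not come from any SIL tool.

```python
import unicodedata


def find_pua(text):
    """Return (index, 'U+XXXX') for every Private Use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


def convert_pua(text, mapping):
    """Replace PUA characters using mapping and return NFC text.

    Raises ValueError if any PUA character has no mapping, so the
    problem surfaces at ingest instead of downstream.
    """
    unmapped = sorted({cp for i, cp in find_pua(text) if text[i] not in mapping})
    if unmapped:
        raise ValueError(f"unmapped PUA code points: {', '.join(unmapped)}")
    converted = "".join(mapping.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFC", converted)


# Hypothetical mapping for illustration only; a real table must come from
# the documentation of the legacy font used by the project.
EXAMPLE_MAPPING = {"\uf171": "\u0254"}
```

Failing loudly on unmapped code points is a deliberate choice here: silently dropping or passing them through is the failure mode the note above warns about.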
ASR / TTS Tool Selection
| Class | ASR Primary | ASR Fallback | TTS |
|---|---|---|---|
| 0–2 | MMS (1,107 languages) | Whisper-large + fine-tune | VITS / Tacotron2 fine-tune (≥5 hr audio) |
| 3–4 | Whisper-large fine-tune | MMS | VITS / FastSpeech2 |
| 5 | Whisper-large or commercial | n/a | XTTS / commercial |
MMS (Meta Massively Multilingual Speech, 1,107 languages) is the floor for Class 0–2. Whisper covers ~99 languages (with varying quality). For Class 0–2 languages not covered by Whisper: start with MMS, then fine-tune.
TTS viability: VITS or Tacotron2 fine-tune on ~5 hours can produce intelligible output. Below 1 hour: usually not viable.
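The selection table and the TTS thresholds above can be written down as a small lookup. This is only a restatement of the guidance in this section; the function names are made up for illustration, and the label for the 1 to 5 hour range is an assumption, since the text does not say how to treat it.

```python
from typing import Optional, Tuple


def recommend_asr(joshi_class: int) -> Tuple[str, Optional[str]]:
    """Return (primary, fallback) ASR recommendation for a Joshi class."""
    if joshi_class in (0, 1, 2):
        return ("MMS", "Whisper-large + fine-tune")
    if joshi_class in (3, 4):
        return ("Whisper-large fine-tune", "MMS")
    if joshi_class == 5:
        return ("Whisper-large or commercial", None)
    raise ValueError(f"Joshi class must be 0-5, got {joshi_class}")


def tts_viability(hours_of_audio: float) -> str:
    """Classify TTS fine-tuning viability from hours of audio."""
    if hours_of_audio >= 5:
        return "viable"
    if hours_of_audio < 1:
        return "usually not viable"
    # Assumption: the guidance above does not cover the 1-5 hour range.
    return "borderline"


primary, fallback = recommend_asr(2)
viability = tts_viability(12)
```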
G2P / IPA
Identify the per-language IPA convention (tone marking, vowel-length notation). Recommended G2P resource: WikiPron baseline plus community refinement. Validate transcription strings against the per-language inventory with an IPA validator.
Critical for tone languages: many community ASR pipelines silently strip diacritics. Build a pipeline that rejects diacritic-stripped input instead of passing it through.
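A rough sketch of both checks follows, using only the standard library. It is a simplification under stated assumptions: the tone-mark set (grave, acute, macron) and the 0.5 threshold are illustrative choices, the inventory check works per code point after NFD rather than per phonological segment, and all names are hypothetical. It does not stand in for a dedicated IPA validator.

```python
import unicodedata

# Assumption: tone is written with combining grave, acute, or macron.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def tone_mark_rate(lines, tone_marks=TONE_MARKS):
    """Fraction of non-empty lines with at least one tone mark after NFD."""
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        return 0.0
    marked = sum(
        any(ch in tone_marks for ch in unicodedata.normalize("NFD", ln))
        for ln in lines
    )
    return marked / len(lines)


def reject_if_stripped(lines, min_rate=0.5):
    """Raise ValueError when too few lines carry tone marks."""
    rate = tone_mark_rate(lines)
    if rate < min_rate:
        raise ValueError(
            f"only {rate:.0%} of lines carry tone marks; "
            "input looks diacritic-stripped"
        )


def out_of_inventory(text, inventory):
    """Return the sorted code points in text (after NFD) not in inventory."""
    decomposed = unicodedata.normalize("NFD", text)
    return sorted({ch for ch in decomposed if not ch.isspace() and ch not in inventory})
```

The check is applied to a batch of lines, not to single strings, because individual words in a tone language can legitimately carry no tone mark (for example, when one tone is left unmarked in the orthography).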
Lhotse CutSet
Lhotse CutSet is the canonical 2026 audio representation. A "cut" = audio + supervisions (transcription, speaker, etc.) + features (Mel-spec, etc.). ESPnet, k2/icefall, SpeechBrain, NeMo all consume it. Build pipelines that produce CutSets, not bespoke formats.
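As a library-free illustration of the supervision half of a cut, the sketch below turns time-aligned annotation segments into plain dictionaries. The keys are chosen to mirror the fields of Lhotse's `SupervisionSegment` (`id`, `recording_id`, `start`, `duration`, `channel`, `text`, `language`, `speaker`); the `(start_ms, end_ms, text)` input shape and the function name are assumptions for this example. With Lhotse installed, the expectation is that each dictionary can be passed as keyword arguments to `SupervisionSegment` and combined with recordings into a `CutSet`, but that step is not shown or verified here.

```python
def to_supervision_dicts(recording_id, segments, language, speaker=None):
    """Convert (start_ms, end_ms, text) segments to supervision-style dicts.

    Times are converted from milliseconds to seconds. Segments with
    zero or negative duration are dropped.
    """
    out = []
    for i, (start_ms, end_ms, text) in enumerate(segments):
        if end_ms <= start_ms:
            continue
        out.append(
            {
                "id": f"{recording_id}-{i:04d}",
                "recording_id": recording_id,
                "start": start_ms / 1000.0,
                "duration": (end_ms - start_ms) / 1000.0,
                "channel": 0,  # assumption: mono field recording
                "text": text,
                "language": language,
                "speaker": speaker,
            }
        )
    return out


supervisions = to_supervision_dicts(
    "rec1", [(0, 1500, "x"), (2000, 2000, "y")], language="yor"
)
```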
Inputs & Outputs
| Input | Description |
|---|---|
| Audio files + annotation formats | ELAN/Praat/FLEx/SayMore/CSV |
| Target language ISO code | For G2P + ASR tool selection |
| Joshi class | For ASR/TTS approach |

| Output | Description |
|---|---|
| Pre-processing plan | PUA→Unicode, tier normalization, diacritic preservation |
| G2P approach | WikiPron baseline / custom |
| ASR recommendation | MMS / Whisper / fine-tune |
| TTS viability | Hours required / not viable |
| Lhotse recipe | Existing / custom |
| workspace_state.md entry | Speech plan |
Example Usage
Language: Yoruba (yor), Class 2, 12 hours of community-recorded audio with ELAN annotation
Speech Plan: Yoruba
- Input format: ELAN EAF
- Pre-processing: BOM strip, NFC norm, PRESERVE diacritics (tone language)
- Tier-name normalization: map "tx@yor" → canonical "transcription"
- G2P: WikiPron Yoruba baseline + community review (tone diacritics critical)
- IPA validation: MANDATORY — reject diacritic-stripped input
- ASR: MMS primary (Class 2, 12hr available); Whisper-large fine-tune secondary
- TTS: VITS fine-tune viable (12hr > 5hr threshold)
- Lhotse recipe: custom (ELAN EAF ingest)
- Ethics: route to linguistic-ethics (community-controlled recordings)
Related Skills
- linguistic-ethics: community-controlled audio requires ethics gate
- linguistic-scripts: diacritic preservation policy
- linguistic-eval: ASR uses CER/WER; TTS uses MOS
- linguistic-annotate: IAA methodology for transcription annotation
Related Skills from Other Suites
- Data Loading — audio/speech data ingestion