linguistic-speech

Bridges field-linguistics annotation (ELAN, Praat, FLEx, SayMore) and community-annotated audio into ML pipelines (Lhotse, ESPnet, k2/icefall, MMS, Whisper). Also covers G2P/IPA workflows and low-resource ASR/TTS recipe selection.

Overview

Field-recorded endangered-language audio is some of the most valuable and irreplaceable linguistic data in existence. Getting it into an ML pipeline means handling format diversity (ELAN EAF, Praat TextGrid, FLEx XML, SayMore IMDI), legacy encoding issues (SIL PUA characters from pre-Unicode fonts), and tone-preservation requirements. linguistic-speech produces a plan that covers each of these steps.

Pipeline Position

Phase: Analyze (Phase 2) — when spoken data is involved

Before this skill: linguistic-scope (language identity, vitality for ethics routing), linguistic-ethics (community-controlled audio requires ethics gate)

After this skill: linguistic-eval (ASR/TTS metrics), linguistic-ethics (final release gate for community-controlled audio)

When It Activates

Ingesting field-recorded audio + linguistic annotation into ML pipelines
Choosing G2P approach for the target language
Selecting low-resource ASR (MMS / Whisper / fine-tune)
IPA validation in transcription pipelines
Building Lhotse CutSet from heterogeneous community-annotated sources
Bridging endangered-language oral data into TTS/ASR research

When NOT to use: purely text data, where nothing speech-related applies. For a pure tokenizer audit, use linguistic-tokenize instead.

What It Does

Input Format Support

Format	Notes
ELAN EAF	Tier-naming conventions vary per project — normalize at ingest
Praat TextGrid	Standard phonetic annotation; Lhotse recipe available
FLEx FieldWorks XML	Warning: often uses SIL PUA characters from legacy fonts — convert to Unicode at ingest
SayMore IMDI	OLAC/IMDI metadata schema; requires mapping

FLEx PUA issue: FieldWorks XML from older projects often contains SIL Private Use Area (PUA) characters left over from pre-Unicode fonts. Convert them to Unicode at ingest; otherwise downstream tools may fail silently or mishandle the text.
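The PUA conversion step can be sketched as follows. This is a minimal illustration, not a real SIL mapping: `PUA_TO_UNICODE` and its single entry are hypothetical placeholders, and an actual table depends on which legacy font the project used. The sketch refuses to pass unmapped PUA characters through, so the failure is loud rather than silent.

```python
import unicodedata

# HYPOTHETICAL mapping for illustration only. A real table must come from
# the documentation of the legacy font used by the specific project.
PUA_TO_UNICODE = {
    "\uf170": "\u0254",  # placeholder: some PUA code point -> open o
}


def is_pua(ch: str) -> bool:
    """True if the character is in a Unicode Private Use Area (category Co)."""
    return unicodedata.category(ch) == "Co"


def convert_pua(text: str, mapping: dict = PUA_TO_UNICODE) -> str:
    """Replace mapped PUA characters and raise on any that are unmapped."""
    out = []
    unmapped = set()
    for ch in text:
        if is_pua(ch):
            if ch in mapping:
                out.append(mapping[ch])
            else:
                unmapped.add(ch)
        else:
            out.append(ch)
    if unmapped:
        codes = ", ".join(f"U+{ord(ch):04X}" for ch in sorted(unmapped))
        raise ValueError(f"unmapped PUA characters: {codes}")
    # NFC keeps diacritics but gives them one canonical composed form.
    return unicodedata.normalize("NFC", "".join(out))
```

Raising on unmapped characters is a deliberate choice here: a partially converted transcription is harder to detect later than an ingest job that stops.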

ASR / TTS Tool Selection

Joshi class	ASR primary	ASR fallback	TTS
0–2	MMS (1,107 languages)	Whisper-large + fine-tune	VITS / Tacotron2 fine-tune (≥5 hr audio)
3–4	Whisper-large fine-tune	MMS	VITS / FastSpeech2
5	Whisper-large or commercial	n/a	XTTS / commercial

MMS (Meta Massively Multilingual Speech, 1,107 languages) is the floor for Class 0–2. Whisper covers ~99 languages, with varying quality. For Class 0–2 languages not in Whisper: try MMS first, then fine-tune.

TTS viability: VITS or Tacotron2 fine-tune on ~5 hours can produce intelligible output. Below 1 hour: usually not viable.
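The selection table and the TTS hour thresholds above can be encoded as a small lookup. This is a sketch under stated assumptions: `SpeechPlan`, `select_tools`, and `tts_viability` are hypothetical names, and the "marginal" label for the 1 to 5 hour range is an assumption, since the text only specifies the two ends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechPlan:
    asr_primary: str
    asr_fallback: str
    tts: str


def select_tools(joshi_class: int) -> SpeechPlan:
    """Map a Joshi class (0-5) to the tool choices listed in the table."""
    if not 0 <= joshi_class <= 5:
        raise ValueError("Joshi class must be between 0 and 5")
    if joshi_class <= 2:
        return SpeechPlan("MMS", "Whisper-large + fine-tune",
                          "VITS / Tacotron2 fine-tune")
    if joshi_class <= 4:
        return SpeechPlan("Whisper-large fine-tune", "MMS",
                          "VITS / FastSpeech2")
    return SpeechPlan("Whisper-large or commercial", "n/a",
                      "XTTS / commercial")


def tts_viability(audio_hours: float) -> str:
    """Classify TTS fine-tuning viability from the available audio hours."""
    if audio_hours >= 5:
        return "viable"
    if audio_hours < 1:
        return "usually not viable"
    # ASSUMPTION: the source gives no guidance for 1-5 hours.
    return "marginal"


plan = select_tools(2)
print(plan.asr_primary, "|", plan.tts, "|", tts_viability(12))
```

Keeping the hour check separate from the class lookup makes it easy to revise either threshold without touching the other.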

G2P / IPA

Identify the per-language IPA convention (tone marking, vowel-length notation). Recommended G2P resource: WikiPron baseline plus community refinement. Validate transcription strings against the per-language inventory with an IPA validator.

Critical for tone languages: many community ASR pipelines silently strip diacritics. Build a pipeline that rejects diacritic-stripped input, and enforce this with the IPA validator.
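A rejection check of this kind might look like the sketch below. The inventory is a toy, hypothetical symbol set and not an authoritative inventory for any language; `validate_transcription` is likewise an illustrative name. The "no tone marks at all" test is a heuristic: an utterance with no marked tones can be legitimate, so a real pipeline would more likely flag such input for review than reject it outright.

```python
import unicodedata

# Combining grave and acute accents, used here as the tone marks.
TONE_MARKS = {"\u0300", "\u0301"}

# TOY inventory for illustration only: base letters, tone marks, and
# combining dot below. Replace with the real per-language inventory.
TOY_INVENTORY = set("abdefghijklmnoprstuwy") | TONE_MARKS | {"\u0323"}


def validate_transcription(text, inventory=TOY_INVENTORY,
                           tone_marks=TONE_MARKS):
    """Return a list of problems; an empty list means the string passes."""
    # NFD splits precomposed letters so each diacritic is checked separately.
    nfd = unicodedata.normalize("NFD", text.lower())
    problems = []
    unknown = sorted({ch for ch in nfd
                      if not ch.isspace() and ch not in inventory})
    if unknown:
        codes = " ".join(f"U+{ord(ch):04X}" for ch in unknown)
        problems.append("symbols outside inventory: " + codes)
    if not any(ch in tone_marks for ch in nfd):
        problems.append("no tone marks found; input may be diacritic-stripped")
    return problems


print(validate_transcription("b\u00e0t\u00e0"))  # tone-marked input
print(validate_transcription("bata"))            # stripped input
```

Checking in NFD rather than NFC means the inventory only has to list base letters and combining marks, not every precomposed combination.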

Lhotse CutSet

Lhotse CutSet is treated here as the canonical audio representation (as of 2026). A "cut" bundles audio, supervisions (transcription, speaker, etc.), and optional features (Mel spectrogram, etc.). ESPnet, k2/icefall, SpeechBrain, and NeMo can all consume it, natively or through export or adapter code. Build pipelines that produce CutSets, not bespoke formats.
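As a rough sketch of what a custom ELAN ingest could look like, the code below reads time-aligned annotations from an EAF document and emits records whose keys match the fields of Lhotse's `SupervisionSegment`. The tier alias table, the function names, and the sample document are hypothetical. The Lhotse import is deferred into `to_cutset` so the parsing part runs with the standard library alone.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL alias table: project-specific tier IDs -> canonical names.
TIER_ALIASES = {"tx@yor": "transcription"}


def eaf_to_segments(root, recording_id, language, tier_aliases=TIER_ALIASES):
    """Collect aligned annotations from tiers mapped to 'transcription'."""
    # EAF time slots are in milliseconds; unaligned slots have no TIME_VALUE.
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")
             if ts.get("TIME_VALUE") is not None}
    segments = []
    for tier in root.iter("TIER"):
        if tier_aliases.get(tier.get("TIER_ID")) != "transcription":
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            ref1, ref2 = ann.get("TIME_SLOT_REF1"), ann.get("TIME_SLOT_REF2")
            if ref1 not in slots or ref2 not in slots:
                continue
            start, end = slots[ref1] / 1000, slots[ref2] / 1000
            text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            if not text or end <= start:
                continue
            segments.append({
                "id": f"{recording_id}-{ann.get('ANNOTATION_ID')}",
                "recording_id": recording_id,
                "start": start,
                "duration": end - start,
                "text": text,  # left untouched so diacritics are preserved
                "language": language,
            })
    return segments


def to_cutset(audio_path, recording_id, segments):
    """Build a CutSet from one recording; requires `pip install lhotse`."""
    from lhotse import (CutSet, Recording, RecordingSet,
                        SupervisionSegment, SupervisionSet)
    recording = Recording.from_file(audio_path, recording_id=recording_id)
    supervisions = SupervisionSet.from_segments(
        SupervisionSegment(**seg) for seg in segments)
    return CutSet.from_manifests(
        recordings=RecordingSet.from_recordings([recording]),
        supervisions=supervisions)


SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="500"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1750"/>
  </TIME_ORDER>
  <TIER TIER_ID="tx@yor">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>bàtà</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

segments = eaf_to_segments(ET.fromstring(SAMPLE_EAF), "rec1", "yor")
print(segments)
```

For files on disk, `ET.parse(path).getroot()` would supply the `root` argument. Tiers whose IDs are missing from the alias table are skipped rather than guessed at, which keeps tier-name normalization an explicit, reviewable step.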

Inputs & Outputs

Input	Description
Audio files + annotation formats	ELAN/Praat/FLEx/SayMore/CSV
Target language ISO code	For G2P + ASR tool selection
Joshi class	For ASR/TTS approach

Output	Description
Pre-processing plan	PUA→Unicode, tier normalization, diacritic preservation
G2P approach	WikiPron baseline / custom
ASR recommendation	MMS / Whisper / fine-tune
TTS viability	Hours required / not viable
Lhotse recipe	Existing / custom
`workspace_state.md` entry	Speech plan

Example Usage

Language: Yoruba (yor), Class 2, 12 hours of community-recorded audio with ELAN annotation

Speech Plan: Yoruba
- Input format: ELAN EAF
- Pre-processing: BOM strip, NFC norm, PRESERVE diacritics (tone language)
    Tier-name normalization: map "tx@yor" → canonical "transcription"
- G2P: WikiPron Yoruba baseline + community review (tone diacritics critical)
- IPA validation: MANDATORY — reject diacritic-stripped input
- ASR: MMS primary (Class 2, 12hr available); Whisper-large fine-tune secondary
- TTS: VITS fine-tune viable (12hr > 5hr threshold)
- Lhotse recipe: custom (ELAN EAF ingest)
- Ethics: route to linguistic-ethics (community-controlled recordings)

Related Skills

linguistic-ethics: community-controlled audio requires an ethics gate
linguistic-scripts: diacritic preservation policy
linguistic-eval: ASR uses CER/WER; TTS uses MOS
linguistic-annotate: IAA methodology for transcription annotation

