Guides
Practical guides for common linguistic AI engineering scenarios.
Available Guides
Low-Resource Languages
The Joshi classification system (0–5), how to find data for under-resourced languages, typological considerations (agglutinative, polysynthetic, tonal, script-diverse), and ethical considerations for endangered-language work.
Pipeline Workflow
End-to-end walkthrough of the pipeline using Khmer (khm) as the example, covering scoping, corpus acquisition, bitext mining, transfer learning setup, evaluation, and release.
Ethics and FPIC
FPIC (Free, Prior and Informed Consent) and CARE principles in practice: how the ethics skill gates the pipeline at Scope and Release, the sacred-text handling framework, license compatibility for dataset mixes, and the attribution registry requirement.
Cross-Suite Integration
Using Data Agent Skills and Linguistic Agent Skills together: responsibility boundaries, four common integration patterns (multilingual dataset audit, MT data pipeline, annotation project, eval results analysis), and shared concepts such as deduplication and synthetic data generation.