Guides

Practical guides for common linguistic AI engineering scenarios.

Low-Resource Languages

The Joshi classification system (classes 0–5), how to find data for under-resourced languages, typological factors (agglutinative, polysynthetic, tonal, script-diverse), and ethical considerations for endangered-language work.

Pipeline Workflow

End-to-end walkthrough of all 5 pipeline phases using Khmer (khm) as the example: scoping, corpus acquisition, bitext mining, transfer learning setup, evaluation, and release.

Ethics and FPIC

FPIC (Free, Prior and Informed Consent) and CARE principles in practice: how the ethics skill gates the pipeline at Scope and Release, the sacred-text handling framework, license compatibility for dataset mixes, and the attribution registry requirement.

Cross-Suite Integration

Using Data Agent Skills and Linguistic Agent Skills together: responsibility boundaries, four common integration patterns (multilingual dataset audit, MT data pipeline, annotation project, eval results analysis), and shared concepts such as deduplication and synthetic data generation.
