Getting Started
The Linguistic Agent Skills suite integrates with a wide range of computational linguistics and ML tooling. Each skill references the tools relevant to its domain.

| Tool | Coverage | Used By |
|---|---|---|
| GlotLID (2024) | 1,600+ languages, paragraph-level | linguistic-corpus |
| FastText LID (176 languages) | High-resource speed | linguistic-corpus |
| CLD3 | Broad coverage | linguistic-corpus (fallback) |
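
The table lists CLD3 as a fallback, which suggests a cascade in which a later detector is consulted only when an earlier one is not confident. A minimal sketch of that pattern follows. The `Detector` type, the `identify_language` function, the thresholds, and the stand-in detectors are all hypothetical and are not the API of GlotLID, fastText, or CLD3; real detectors would need to be wrapped to return a `(label, confidence)` pair.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a detector maps text to (language label, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]


def identify_language(
    text: str,
    detectors: Sequence[Tuple[Detector, float]],
    unknown: str = "und",  # ISO 639 code for "undetermined"
) -> Tuple[str, float]:
    """Return the first prediction whose confidence meets its detector's threshold."""
    for detect, threshold in detectors:
        label, confidence = detect(text)
        if confidence >= threshold:
            return label, confidence
    # No detector was confident enough; leave the text unlabeled.
    return unknown, 0.0


# Stand-in detectors for illustration only (assumed outputs, not real model calls).
def primary_stub(text: str) -> Tuple[str, float]:
    return "swh_Latn", 0.42


def fallback_stub(text: str) -> Tuple[str, float]:
    return "sw", 0.91


result = identify_language(
    "Habari ya asubuhi",
    [(primary_stub, 0.5), (fallback_stub, 0.7)],
)
```

Different detectors often emit different label schemes (for example, script-tagged ISO 639-3 versus two-letter codes), so a real pipeline would likely map labels to one scheme before comparing or storing them.
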

| Tool | Purpose | Used By |
|---|---|---|
| Python unicodedata | NFC/NFKC normalization | linguistic-scripts |
| Unicode TR39 confusable data | Mixed-script deduplication | linguistic-scripts |
| detect_confusables.py | Fold/detect confusables + joiners | linguistic-scripts |
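
To illustrate the normalization and mixed-script checks named above, here is a small sketch using only the standard-library `unicodedata` module. The helper names (`normalize`, `scripts_used`, `has_joiners`) are hypothetical, and inferring a script from the first word of a character's Unicode name is a rough heuristic assumed for this example. It is not the TR39 algorithm, and it does not describe what `detect_confusables.py` actually does.

```python
import unicodedata

# Zero-width non-joiner and zero-width joiner. Flagging them is an assumption of
# this sketch; they are legitimate in some orthographies (e.g. Persian, Indic scripts).
JOINERS = {"\u200c", "\u200d"}


def normalize(text: str, form: str = "NFC") -> str:
    """Apply a Unicode normalization form (NFC, NFD, NFKC, or NFKD)."""
    return unicodedata.normalize(form, text)


def scripts_used(text: str) -> set:
    """Heuristic: take the first word of each letter's Unicode name as its script."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def has_joiners(text: str) -> bool:
    """Report whether the text contains ZWJ or ZWNJ code points."""
    return any(ch in JOINERS for ch in text)


# NFC composes "e" + combining acute accent into a single code point.
composed = normalize("e\u0301")
# NFKC additionally folds compatibility characters such as the "fi" ligature.
folded = normalize("\ufb01", "NFKC")
# The second letter below is CYRILLIC SMALL LETTER A, not Latin "a".
mixed = scripts_used("p\u0430ypal")
```

NFKC is lossy (it discards compatibility distinctions), so a pipeline that needs to preserve the original text would typically keep the NFC form for storage and use NFKC only for matching or deduplication keys.
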

| Tool | Purpose | Used By |
|---|---|---|
| SentencePiece 0.1.96+ | Unigram/BPE tokenizer training | linguistic-tokenize |
| FOCUS | Vocab extension (close script pairs) | linguistic-tokenize |
| OFA | Vocab extension (with parallel data) | linguistic-tokenize |
| HyperOfa | Vocab extension (minimal data) | linguistic-tokenize |

| Resource | Languages | Used By |
|---|---|---|
| CulturaX | 167 languages | linguistic-corpus |
| MADLAD-400 | 400+ languages | linguistic-corpus |
| Glot500 | 500+ languages | linguistic-corpus |
| OLDI (Open Language Data Initiative) | 100+ languages | linguistic-corpus |
| Wikipedia dumps | 300+ languages | linguistic-corpus |
| OPUS | Parallel data, many languages | linguistic-bitext |
| FLORES-200 | 200 languages, eval | linguistic-eval |
| NTREX-128 | 128 languages, eval | linguistic-eval |
| Belebele | 122 languages, reading comp | linguistic-eval |

| Tool | Purpose | Used By |
|---|---|---|
| LASER3 | Sentence embeddings for alignment | linguistic-bitext |
| SONAR (Meta 2023) | Sentence embeddings; stronger for Bantu/Indigenous languages | linguistic-bitext |
| Vecalign | Sentence alignment (preferred) | linguistic-bitext |
| hunalign | Sentence alignment (legacy) | linguistic-bitext |

| Tool | Strengths | Used By |
|---|---|---|
| Unsloth | 2× faster QLoRA, single-GPU | linguistic-transfer |
| LLaMA-Factory | Multi-GPU + complex sampling | linguistic-transfer |
| Axolotl | YAML-config middle ground | linguistic-transfer |
| HuggingFace PEFT | Flexible LoRA/QLoRA/DoRA | linguistic-transfer |
| MAD-X / BAD-X adapters | Language + task adapter stacks | linguistic-transfer |

| Tool | Purpose | Used By |
|---|---|---|
| UniMorph | Gold paradigms, 100+ languages | linguistic-morph |
| SIGMORPHON 2022/2023 | Unsupervised segmenters | linguistic-morph |
| HFST | Rule-based FST analyzer | linguistic-morph |
| foma | Alternative FST toolkit | linguistic-morph |
| Morfessor | Statistical segmenter | linguistic-morph |

| Tool | Strengths | Used By |
|---|---|---|
| Trankit (2021) | Best low-resource UD quality (XLM-R) | linguistic-syntax |
| stanza (Stanford 2020) | 70+ languages, fast | linguistic-syntax |
| UDify (2019) | Single multilingual model | linguistic-syntax |
| UD treebank corpus | 100+ languages | linguistic-syntax |

| Tool | Purpose | Used By |
|---|---|---|
| Open Multilingual WordNet (OMW) | Synsets for 100+ languages | linguistic-semantics |
| PARSEME | MWE annotation datasets | linguistic-semantics |
| COMET-22 | Learned MT quality metric | linguistic-semantics, linguistic-eval |
| LaBSE / SONAR | Cross-lingual embeddings | linguistic-semantics, linguistic-eval |

| Tool | Coverage | Used By |
|---|---|---|
| MMS (Meta Massively Multilingual Speech) | 1,107 languages | linguistic-speech |
| Whisper (OpenAI) | ~99 languages | linguistic-speech |
| Lhotse | Audio pipeline (CutSet standard) | linguistic-speech |
| ELAN | Field annotation format | linguistic-speech |
| Praat (TextGrid) | Phonetic annotation | linguistic-speech |
| FLEx FieldWorks | Lexicographic field data | linguistic-speech |
| WikiPron | G2P / IPA crowd-sourced data | linguistic-speech |
| VITS / Tacotron2 | Low-resource TTS | linguistic-speech |

| Tool | Purpose | Used By |
|---|---|---|
| Label Studio | General annotation UI | linguistic-annotate |
| Prodigy | Active-learning annotation | linguistic-annotate |
| INCEpTION | NLP annotation with IAA | linguistic-annotate |
| brat | Lightweight web annotation | linguistic-annotate |
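
Inter-annotator agreement (IAA) is mentioned above without a definition. As one concrete instance, the sketch below computes Cohen's kappa for two annotators labeling the same items. The function name `cohen_kappa` and the sample labels are illustrative assumptions; annotation tools may report other statistics or compute them differently.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")

    n = len(labels_a)
    # Fraction of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up part-of-speech labels from two annotators on four tokens.
annotator_1 = ["NOUN", "NOUN", "VERB", "VERB"]
annotator_2 = ["NOUN", "VERB", "VERB", "VERB"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa applies to exactly two annotators with categorical labels; for more annotators or missing labels, a measure such as Krippendorff's alpha is the usual choice.
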

| Tool | Metric | Used By |
|---|---|---|
| sacrebleu | chrF++, spBLEU, BLEU | linguistic-eval |
| COMET / xCOMET | Learned MT quality | linguistic-eval |
| GEMBA-MQM | LLM-judge MQM rubric | linguistic-eval |
| AfroBench | African language benchmarks | linguistic-eval |
| IndicXTREME | Indic language benchmarks | linguistic-eval |
| SEACrowd | Southeast Asian benchmarks | linguistic-eval |
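
For readers unfamiliar with character-level metrics, the sketch below shows the idea behind chrF: an F-score over character n-gram precision and recall, weighted toward recall. It is a simplified, sentence-level illustration written for this document, not sacrebleu's implementation, and its scores will not match sacrebleu's. It omits the word n-grams that chrF++ adds and any corpus-level aggregation. The function names and the choice to drop spaces are assumptions of this sketch.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing spaces (a simplification of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_sketch(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average n-gram precision and recall, then combine as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        # Clipped overlap: each n-gram counts at most as often as it occurs in both.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))

    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


score = chrf_sketch("the cat sat", "the cats sat")
```

Because the score is built from characters rather than words, it gives partial credit for near-miss inflected forms, which is one reason character-level metrics are often preferred for morphologically rich languages.
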

| Database | Coverage | Used By |
|---|---|---|
| WALS (World Atlas of Language Structures) | 2,662 languages | linguistic-scope |
| Grambank | 2,467 languages | linguistic-scope |
| URIEL / lang2vec | Typological distance vectors | linguistic-scope, linguistic-transfer |
| Glottolog | Language catalog + genealogy | linguistic-scope |