Overview
linguistic-bitext is the Acquire-phase specialist for parallel data. Bitext quality determines MT model quality, and the choices made here — embedding model, margin threshold, register balance — cascade through every downstream evaluation. The wrong embedding (LASER3 on Bantu), the wrong margin threshold (1.06 on Class 0–1), or Bible-only bitext all produce characteristic failures. This skill surfaces these choices before a single training step runs.
The most common mistake is using the published NLLB margin threshold of 1.06 on low-resource pairs. That threshold is calibrated for high-resource pairs — on Class 0–1 it over-filters, discarding half the usable data. The second most common: training MT on Bible-only bitext for general-purpose use, producing a model that sounds 17th-century.
Pipeline Position
This skill operates in Phase 1 — Acquire of the linguistic pipeline.
Preceding skills: linguistic-scope (pair-level resource class, URIEL distance), linguistic-ethics (per-dataset gate), linguistic-scripts (normalization before alignment)
Following skills: linguistic-tokenize (fertility audit on bitext target side), linguistic-transfer (adapter/LoRA planning)
When It Activates
- Building MT data for any low-resource language pair
- Mining parallel sentences from comparable corpora
- Choosing alignment tool (Vecalign vs hunalign vs Bleualign)
- Generating synthetic bitext via back-translation, dictionary substitution, or pivoting
- Auditing existing parallel data quality (margin scores, register skew, length filtering)
When NOT to use: For monolingual corpus → linguistic-corpus. For tokenizer fertility on bitext output → linguistic-tokenize. For ethics/license per source → linguistic-ethics.
What It Does
Embedding model selection:
| Source Family | Recommended | Rationale |
|---|---|---|
| European Latin/Cyrillic | LASER3 | Good coverage |
| Indic | LASER3 / SONAR | Both work |
| African (Bantu, Niger-Congo) | SONAR | LASER3 has coverage gaps |
| Indigenous Americas | SONAR | LASER3 has coverage gaps |
| SEA (Khmer/Lao/Burmese) | LASER3 | Adequate |
| Mixed-script source | SONAR | Handles mixed scripts better |
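The table above can be encoded as a small lookup so the choice is made explicitly rather than by default. This is a minimal sketch: the family keys and the `recommend_embedding` function are hypothetical names, and letting the mixed-script row override the family row is an assumption, not something the table states.

```python
# Lookup mirroring the embedding table; keys are hypothetical labels.
EMBEDDING_BY_FAMILY = {
    "european_latin_cyrillic": "LASER3",
    "indic": "LASER3 or SONAR",
    "african_niger_congo": "SONAR",
    "indigenous_americas": "SONAR",
    "sea_khmer_lao_burmese": "LASER3",
}


def recommend_embedding(family: str, mixed_script: bool = False) -> str:
    """Return the table's recommendation for a source family.

    Assumption: a mixed-script source takes precedence over the family row.
    """
    if mixed_script:
        return "SONAR"
    if family not in EMBEDDING_BY_FAMILY:
        # Fail loudly instead of silently defaulting to one model.
        raise ValueError(f"No recommendation recorded for family {family!r}")
    return EMBEDDING_BY_FAMILY[family]


if __name__ == "__main__":
    print(recommend_embedding("african_niger_congo"))
    print(recommend_embedding("european_latin_cyrillic", mixed_script=True))
```

Raising on an unknown family is deliberate: a silent default is how LASER3 ends up being used on a family where it has coverage gaps.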
Margin threshold by pair class:
| Pair Type | Threshold |
|---|---|
| Class 4–5 ↔ Class 4–5 | 1.06 (NLLB standard) |
| Class 3 ↔ Class 4–5 | 1.05 |
| Class 1–2 ↔ Class 4–5 | 1.04 + manual spot-check |
| Class 0–1 ↔ anything | 1.03 + spot-check + length-ratio filter |
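One way to apply the threshold table is to key the policy on the lower-resourced side of the pair. The sketch below assumes exactly that, which also covers pairings the table does not list (for example Class 3 ↔ Class 3). Class 1 appears in two rows; this sketch resolves it with the "Class 0–1 ↔ anything" row. `FilterPolicy`, `margin_policy`, and `keep_pair` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterPolicy:
    margin_threshold: float
    extra_checks: tuple  # follow-up checks named in the table


def margin_policy(class_a: int, class_b: int) -> FilterPolicy:
    """Pick a margin policy from the lower resource class of the pair.

    Assumption: the lower-resourced side decides the row.
    """
    for resource_class in (class_a, class_b):
        if not 0 <= resource_class <= 5:
            raise ValueError(f"Resource class out of range: {resource_class}")
    low = min(class_a, class_b)
    if low <= 1:
        return FilterPolicy(1.03, ("manual spot-check", "length-ratio filter"))
    if low == 2:
        return FilterPolicy(1.04, ("manual spot-check",))
    if low == 3:
        return FilterPolicy(1.05, ())
    return FilterPolicy(1.06, ())  # Class 4-5 on both sides: NLLB standard


def keep_pair(margin_score: float, policy: FilterPolicy) -> bool:
    """Keep a mined pair when its margin score reaches the threshold."""
    return margin_score >= policy.margin_threshold


if __name__ == "__main__":
    policy = margin_policy(5, 2)  # e.g. a Class 5 <-> Class 2 pair
    print(policy)
    print(keep_pair(1.05, policy))
```

A pair scoring 1.05 is kept under the Class 2 policy but would be dropped under the 1.06 standard, which is the over-filtering effect the table is meant to prevent.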
Length-ratio filtering after margin filter: keep the target/source word ratio in [0.5, 2.0] for typologically similar pairs; widen to [0.3, 3.0] for polysynthetic targets (English–Inuktitut ratios can fall to about 0.3; do not treat these as misalignments). Minimum sentence length: 3 words. Maximum: 200 source words.
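The length rules can be sketched as a single predicate. This version counts whitespace-separated tokens, which is a simplification, and applies the 3-word minimum to the source side only. That second choice is an assumption: with a polysynthetic target, a valid translation of a short source sentence may have fewer than 3 words. The function name is hypothetical.

```python
def passes_length_filters(
    source: str,
    target: str,
    polysynthetic_target: bool = False,
    min_source_words: int = 3,
    max_source_words: int = 200,
) -> bool:
    """Apply sentence-length and target/source ratio filters to one pair."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    # Sentence-length bounds (assumed here to apply to the source side).
    if src_len < min_source_words or src_len > max_source_words:
        return False
    # Wider window for polysynthetic targets, narrower otherwise.
    low, high = (0.3, 3.0) if polysynthetic_target else (0.5, 2.0)
    ratio = tgt_len / src_len  # src_len >= 3 here, so no division by zero
    return low <= ratio <= high


if __name__ == "__main__":
    src = "one two three four five six seven eight nine ten"
    tgt = "w1 w2 w3 w4"  # ratio 0.4
    print(passes_length_filters(src, tgt))
    print(passes_length_filters(src, tgt, polysynthetic_target=True))
```

The same pair fails the default window and passes the polysynthetic one, so the flag should come from the pair's typology, not from how much data the filter happens to discard.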
Synthetic bitext when real parallel < 100K pairs:
- Back-translation: train a target→source model, then back-translate target-side monolingual text. Sample with T=0.7–1.0; never use T=0, because greedy decoding collapses diversity and amplifies translationese drift
- Dictionary substitution: glossary-constrained word/phrase swap
- Pivot MT: route through better-resourced intermediate (En → De → Yor when En-De >> En-Yor)
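Of the three strategies above, dictionary substitution is the simplest to sketch without a trained model. The example below checks the 100K threshold and then swaps glossary phrases, longest first, so multi-word entries win over their sub-words. Everything here is illustrative: the function names are hypothetical, and the target strings are placeholders, not real translations.

```python
import re

SYNTHETIC_THRESHOLD = 100_000  # real parallel pairs, from the rule above


def needs_synthetic(real_pairs: int) -> bool:
    """Synthetic bitext is suggested when real parallel data is below 100K."""
    return real_pairs < SYNTHETIC_THRESHOLD


def substitute_with_glossary(sentence: str, glossary: dict) -> str:
    """Replace glossary phrases in a sentence, matching longest phrases first."""
    if not glossary:
        return sentence
    phrases = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    lookup = {phrase.lower(): target for phrase, target in glossary.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], sentence)


if __name__ == "__main__":
    # Placeholder glossary: the values stand in for vetted target-language terms.
    glossary = {"water": "<TGT:water>", "clean water": "<TGT:clean_water>"}
    if needs_synthetic(45_000):
        print(substitute_with_glossary("We need clean water and water pumps.", glossary))
```

Matching is case-insensitive and word-bounded, so "Water" is replaced but "waterfall" is not. A real pipeline would also need to handle inflection, which this sketch ignores.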
Aligner: Vecalign beats hunalign for low-resource pairs: it runs in linear time and is state-of-the-art on Bible-parallel data. Use hunalign only when retrofitting an existing pipeline.
Register balance targets: Bible/liturgical >30% = archaic-register risk; news >70% = event-bias risk; subtitles >50% = conversational-skew risk.
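The register limits can be checked mechanically once each sentence pair carries a register label. The sketch below assumes such labels already exist (how they are assigned is out of scope here), and the label strings and function name are hypothetical.

```python
from collections import Counter

# Share above which each register is flagged, with the associated risk.
REGISTER_LIMITS = {
    "bible_liturgical": (0.30, "archaic-register risk"),
    "news": (0.70, "event-bias risk"),
    "subtitles": (0.50, "conversational-skew risk"),
}


def register_risks(labels) -> dict:
    """Return {register: (share, risk)} for registers exceeding their limit."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    risks = {}
    for register, (limit, risk) in REGISTER_LIMITS.items():
        share = counts[register] / total
        if share > limit:
            risks[register] = (share, risk)
    return risks


if __name__ == "__main__":
    # Toy corpus: 40% Bible, 30% news, 30% web.
    labels = ["bible_liturgical"] * 4 + ["news"] * 3 + ["web"] * 3
    for register, (share, risk) in register_risks(labels).items():
        print(f"{register}: {share:.0%} -> {risk}")
```

Registers without a stated limit, such as web text, are counted toward the total but never flagged.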
Example Usage
Pair: English ↔ Yoruba (Class 5 ↔ Class 2)
## Bitext Plan: English ↔ Yoruba
**Source pair class:** Class 5 ↔ Class 2
**Recommended embedding:** SONAR (LASER3 has Bantu/Niger-Congo coverage gaps)
**Recommended aligner:** Vecalign
**Margin threshold:** 1.04 + manual spot-check (50 pairs)
**Length-ratio filter:** [0.4, 2.5] (SVO both sides; Yoruba relatively analytic)
**Min/max sentence length:** [3, 200] words
**Synthetic bitext needed?** YES — real pairs ~45K (below 100K)
Strategy: back-translation T=0.8 from target monolingual
**Register balance target:** Bible ≤25%, news ≥20%, web ≥30%
**Estimated final pairs:** ~120K (45K real + 75K synthetic)
**Hand-off:** linguistic-tokenize for fertility audit; linguistic-transfer for adapter strategy
Related Skills
- linguistic-scope — URIEL distance for embedding/threshold selection
- linguistic-ethics — per-dataset gate before mining
- linguistic-tokenize — fertility audit on bitext target side
- linguistic-transfer — adapter/LoRA planning after bitext