linguistic-bitext

Overview

linguistic-bitext is the Acquire-phase specialist for parallel data. Bitext quality determines MT model quality, and the choices made here (embedding model, margin threshold, register balance) cascade through every downstream evaluation. The wrong embedding (LASER3 on Bantu), the wrong margin threshold (1.06 on Class 0–1), and Bible-only bitext each produce characteristic failures. This skill surfaces these choices before a single training step runs.

The most common mistake is applying the published NLLB margin threshold of 1.06 to low-resource pairs. That threshold is calibrated for high-resource pairs; on Class 0–1 it over-filters and can discard as much as half of the usable data. The second most common mistake is training MT on Bible-only bitext for general-purpose use, which produces a model that sounds 17th-century.

Pipeline Position

This skill operates in Phase 1 — Acquire of the linguistic pipeline.

Preceding skills: linguistic-scope (pair-level resource class, URIEL distance), linguistic-ethics (per-dataset gate), linguistic-scripts (normalization before alignment)

Following skills: linguistic-tokenize (fertility audit on bitext target side), linguistic-transfer (adapter/LoRA planning)

When It Activates

Building MT data for any low-resource language pair
Mining parallel sentences from comparable corpora
Choosing an alignment tool (Vecalign vs hunalign vs Bleualign)
Generating synthetic bitext via back-translation, dictionary substitution, or pivoting
Auditing existing parallel data quality (margin scores, register skew, length filtering)

When NOT to use: For monolingual corpus → linguistic-corpus. For tokenizer fertility on bitext output → linguistic-tokenize. For ethics/license per source → linguistic-ethics.

What It Does

Embedding model selection:

Source Family	Recommended	Rationale
European Latin/Cyrillic	LASER3	Good coverage
Indic	LASER3 / SONAR	Both work
African (Bantu, Niger-Congo)	SONAR	LASER3 has coverage gaps
Indigenous Americas	SONAR	LASER3 has coverage gaps
SEA (Khmer/Lao/Burmese)	LASER3	Adequate
Mixed-script source	SONAR	Handles better
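The table above can be encoded as a small lookup so that the encoder choice is made explicitly rather than by default. This is a minimal sketch: the family keys and the function name `recommend_encoder` are hypothetical labels introduced here for illustration, and the mapping only restates the table.

```python
# Hypothetical family keys; the values restate the recommendation table.
RECOMMENDED_ENCODER = {
    "european_latin_cyrillic": ["LASER3"],
    "indic": ["LASER3", "SONAR"],
    "african_niger_congo": ["SONAR"],
    "indigenous_americas": ["SONAR"],
    "sea_khmer_lao_burmese": ["LASER3"],
}


def recommend_encoder(family, mixed_script=False):
    """Return the recommended encoder names for a source family."""
    # The table recommends SONAR for any mixed-script source.
    if mixed_script:
        return ["SONAR"]
    if family not in RECOMMENDED_ENCODER:
        raise ValueError(f"no recommendation recorded for family: {family!r}")
    return RECOMMENDED_ENCODER[family]
```

For example, `recommend_encoder("african_niger_congo")` returns `["SONAR"]`, and an unknown family raises an error instead of silently falling back to a default encoder.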

Margin threshold by pair class:

Pair Type	Threshold
Class 4–5 ↔ Class 4–5	1.06 (NLLB standard)
Class 3 ↔ Class 4–5	1.05
Class 1–2 ↔ Class 4–5	1.04 + manual spot-check
Class 0–1 ↔ anything	1.03 + spot-check + length-ratio filter
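To make the thresholds concrete, the sketch below scores candidate pairs with a ratio-margin criterion: the cosine similarity of a pair divided by the average similarity of each side to its k nearest neighbours. The source does not say which margin variant is used, so the ratio form, the default `k=4`, and the names `ratio_margin_scores` and `mine_pairs` are assumptions made for illustration. It is written in plain Python for readability; a real mining run would use a vectorised nearest-neighbour search.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ratio_margin_scores(src, tgt, k=4):
    """Return a matrix of ratio-margin scores for all (source, target) pairs.

    src and tgt are lists of sentence-embedding vectors (assumed non-zero).
    """
    src = [_normalize(v) for v in src]
    tgt = [_normalize(v) for v in tgt]
    sim = [[sum(a * b for a, b in zip(s, t)) for t in tgt] for s in src]

    def top_k_mean(values):
        # Mean similarity to the k nearest neighbours (fewer if not available).
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top)

    src_nn = [top_k_mean(row) for row in sim]
    tgt_nn = [top_k_mean(col) for col in zip(*sim)]
    return [
        [sim[i][j] / ((src_nn[i] + tgt_nn[j]) / 2) for j in range(len(tgt))]
        for i in range(len(src))
    ]


def mine_pairs(src, tgt, threshold, k=4):
    """Keep each source's best-scoring target if it clears the threshold."""
    scores = ratio_margin_scores(src, tgt, k)
    kept = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=lambda col: row[col])
        if row[j] >= threshold:
            kept.append((i, j, row[j]))
    return kept
```

Under this sketch, lowering the threshold from 1.06 to 1.03 only changes the `threshold` argument, which is why the spot-check and length-ratio filter listed in the table matter: they catch the noisier pairs that a lower cut-off lets through.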

Length-ratio filtering after margin filter: keep the target/source word ratio in [0.5, 2.0] for typologically-similar pairs; widen to [0.3, 3.0] for polysynthetic targets (English–Inuktitut ratios can approach 0.3; do not treat them as misalignment). Min sentence: 3 words. Max: 200 source words.
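A minimal sketch of the length filter follows. It assumes whitespace tokenization and applies the 3-word minimum to the source side only, so that short polysynthetic targets are not discarded; both choices are assumptions, and the function name `keep_pair` is hypothetical.

```python
def keep_pair(source, target, ratio_range=(0.5, 2.0),
              min_words=3, max_source_words=200):
    """Return True if the pair passes the length and length-ratio checks."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    # Assumption: the minimum and maximum are enforced on the source side.
    if src_len < min_words or src_len > max_source_words:
        return False
    if tgt_len == 0:
        return False
    low, high = ratio_range
    ratio = tgt_len / src_len  # target/source word ratio
    return low <= ratio <= high
```

For a polysynthetic target, pass `ratio_range=(0.3, 3.0)` so that a 10-word English source paired with a 3-word target is kept rather than rejected.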

Synthetic bitext when real parallel < 100K pairs:

Back-translation: train a target→source model, then back-translate target-side monolingual text. Sample with temperature T=0.7–1.0 (never T=0; greedy decoding collapses diversity and amplifies translationese drift)
Dictionary substitution: glossary-constrained word/phrase swap
Pivot MT: route through better-resourced intermediate (En → De → Yor when En-De >> En-Yor)
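The temperature advice for back-translation can be illustrated on a single next-token distribution. This is a toy sketch, not a back-translation system: the logits are made up, and the helper names `temperature_probs` and `entropy` are hypothetical. It shows that T=0 reduces to a deterministic argmax, while T in the 0.7–1.0 range keeps probability mass on alternatives.

```python
import math


def temperature_probs(logits, temperature):
    """Softmax over logits scaled by 1/temperature; T=0 means greedy argmax."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats; zero means every sample is identical."""
    return -sum(p * math.log(p) for p in probs if p > 0)


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
greedy = entropy(temperature_probs(toy_logits, 0))
cooler = entropy(temperature_probs(toy_logits, 0.7))
warmer = entropy(temperature_probs(toy_logits, 1.0))
```

Here `greedy` is zero, and `cooler` is smaller than `warmer`: lower temperatures concentrate the distribution, which is the mechanism behind the loss of diversity in synthetic source sentences.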

Aligner: prefer Vecalign over hunalign for low-resource pairs; it runs in linear time and reports state-of-the-art results on Bible-parallel data. Use hunalign only when retrofitting an existing pipeline.

Register balance targets: Bible/liturgical >30% = archaic-register risk; news >70% = event-bias risk; subtitles >50% = conversational-skew risk.
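The register limits above can be checked mechanically once each pair carries a register label. The sketch below assumes one label per sentence pair with the hypothetical label names `bible`, `news`, and `subtitles`; how labels are assigned is outside its scope.

```python
from collections import Counter

# Share above which each register is flagged, with the risk named in the text.
RISK_LIMITS = {
    "bible": (0.30, "archaic-register risk"),
    "news": (0.70, "event-bias risk"),
    "subtitles": (0.50, "conversational-skew risk"),
}


def register_risks(register_labels):
    """Map each over-represented register to (share, risk description)."""
    counts = Counter(register_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    risks = {}
    for register, (limit, risk) in RISK_LIMITS.items():
        share = counts.get(register, 0) / total
        if share > limit:
            risks[register] = (share, risk)
    return risks
```

A corpus with 40% Bible, 30% news, and 30% web pairs would be flagged only for the Bible share, since 0.40 exceeds the 0.30 limit while the other registers stay within theirs.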

Example Usage

Pair: English ↔ Yoruba (Class 5 ↔ Class 2)

## Bitext Plan: English ↔ Yoruba

**Source pair class:** Class 5 ↔ Class 2
**Recommended embedding:** SONAR (LASER3 has Bantu/Niger-Congo coverage gaps)
**Recommended aligner:** Vecalign
**Margin threshold:** 1.04 + manual spot-check (50 pairs)
**Length-ratio filter:** [0.4, 2.5] (SVO both sides; Yoruba relatively analytic)
**Min/max sentence length:** [3, 200] words
**Synthetic bitext needed?** YES — real pairs ~45K (below 100K)
  Strategy: back-translation T=0.8 from target monolingual
**Register balance target:** Bible ≤25%, news ≥20%, web ≥30%
**Estimated final pairs:** ~120K (45K real + 75K synthetic)
**Hand-off:** linguistic-tokenize for fertility audit; linguistic-transfer for adapter strategy

Related Skills

linguistic-scope: URIEL distance for embedding/threshold selection
linguistic-ethics: per-dataset gate before mining
linguistic-tokenize: fertility audit on bitext target side
linguistic-transfer: adapter/LoRA planning after bitext
