linguistic-transfer

Overview

linguistic-transfer plans cross-lingual adaptation of pretrained LLMs before any training job runs. The most common error in this domain is setting LoRA rank by data size rather than typological distance. r=8 is fine for English→Spanish; r=64+ is required for English→Inuktitut. URIEL distance — computed by linguistic-scope — is the correct input. This skill takes that distance and produces an evidence-based adapter plan.

Wrong adapter choice or rank wastes weeks of compute. Continued pretraining on fewer than 100M target tokens overfits and forgets the source language. English is often not the optimal transfer source — for Yoruba, Igbo (URIEL distance 0.18) outperforms English (distance 0.62) by 2–5× on transfer tasks.

Pipeline Position

This skill operates in Phase 1 — Acquire of the linguistic pipeline.

Preceding skills: linguistic-scope (URIEL distance, Joshi class), linguistic-tokenize (vocab extension method — must align with adapter choice), linguistic-corpus and linguistic-bitext (data available) Following skills: linguistic-eval (benchmark selection after transfer plan)

When It Activates

Adding a new language to an existing pretrained LLM (Llama-3, Mistral, Qwen, mBART, NLLB, BLOOM)
Choosing LoRA rank, alpha, target modules
Choosing between continued pretraining vs LoRA vs full fine-tune
Picking a tool (Unsloth, LLaMA-Factory, Axolotl, PEFT)
Designing catastrophic-forgetting mitigation
Picking adapter stack (MAD-X language + task adapters)

When NOT to use: Training English-only with abundant data → standard fine-tune, no specialist needed.

What It Does

Overall approach by class and data:

Target Class	Parallel Data	Best URIEL	Recommended Approach
0–1	< 10K	any	Vocab extension (HyperOfa) + LoRA + multilingual base
1–2	10K–100K	< 0.4	OFA + LoRA r=16–32
1–2	10K–100K	0.4–0.6	OFA + LoRA r=32–64
1–2	10K–100K	> 0.6	Multilingual base + HyperOfa + LoRA r=32+
2–3	100K–1M	< 0.4	Continued pretraining + LoRA
2–3	100K–1M	> 0.6	Vocab extension + LoRA r=32–64 (NOT CP — typology too far)
3–4	1M–10M	any	Continued pretraining + LoRA OR full fine-tune
5	abundant	any	Standard full fine-tune

LoRA config by URIEL distance:

URIEL Distance	Rank	Alpha	Target Modules
< 0.2	8	16	attention only acceptable
0.2–0.4	16	32	all-linear recommended
0.4–0.6	32	64	all-linear
0.6–0.8	64	128	all-linear + embed_tokens (if vocab extended)
> 0.8	128+	256	all-linear + embed_tokens; consider full fine-tune

Attention-only (q_proj, v_proj) is legacy default. Current best practice for typologically-distant transfer: all-linear (q, k, v, o, gate, up, down). Attention-only loses 2–5 BLEU points on hard pairs.

Catastrophic-forgetting mitigation:

Mitigation	When
Mix 10–20% source-language data in training	ALWAYS for cross-lingual fine-tune
Fisher-weighted regularization (EWC)	When source quality must be preserved
KL regularization to base model	Lighter alternative to EWC
Source-task eval throughout training	ALWAYS — catch forgetting early

Tool selection:

Setup	Recommend
Single GPU + QLoRA + speed priority	Unsloth (2× faster than LLaMA-Factory)
Multi-GPU + complex multilingual sampling	LLaMA-Factory
YAML-config + ergonomics	Axolotl
MAD-X / BAD-X adapter stacking	HuggingFace adapters library
Just want a baseline	PEFT directly

Example Usage

Language: Yoruba (yor), URIEL distance to Igbo (best source) = 0.18, Joshi Class 2, 120K pairs available

## Transfer Plan: Yoruba

**Base model:** NLLB-200 distilled (rationale: multilingual base, good Yoruba seed)
**Approach:** OFA vocab extension + LoRA (Class 2, 120K pairs — below CP threshold)
**LoRA config:**
  - rank: 16 (URIEL distance 0.18 — close pair)
  - alpha: 32
  - target modules: all-linear (recommended for Class 2+)
  - embed_tokens: YES (vocab extended via OFA)
  - dropout: 0.05
**Forgetting mitigation:**
  - Source-language data mix: 15% Igbo
  - Regularization: KL to base model
  - Source-task eval cadence: every 500 steps
**Tool:** Unsloth (single GPU; latency-sensitive deployment)
**Estimated training tokens:** ~180M
**Estimated GPU-hours:** ~12h (A100 80GB)
**Hand-off:** linguistic-eval for benchmark + metric selection

linguistic-scope — provides URIEL distance and Joshi class
linguistic-tokenize — vocab extension method must align with adapter choice
linguistic-bitext — provides parallel data for training
linguistic-eval — benchmark selection after transfer plan is complete

Was this page helpful?

Overview

Pipeline Position

When It Activates

What It Does

Example Usage

Related Skills

On this page