Overview
linguistic-transfer plans cross-lingual adaptation of pretrained LLMs before any training job runs. The most common error in this domain is setting LoRA rank by data size rather than by typological distance: r=8 is fine for English→Spanish, while English→Inuktitut needs r=64 or higher. URIEL distance, computed by linguistic-scope, is the correct input. This skill takes that distance and produces an evidence-based adapter plan.
A wrong adapter choice or rank wastes weeks of compute. Continued pretraining on fewer than 100M target tokens overfits and forgets the source language. English is often not the optimal transfer source: for Yoruba, Igbo (URIEL distance 0.18) outperforms English (distance 0.62) by 2–5× on transfer tasks.
Pipeline Position
This skill operates in Phase 1 — Acquire of the linguistic pipeline.
Preceding skills: linguistic-scope (URIEL distance, Joshi class), linguistic-tokenize (vocab extension method — must align with adapter choice), linguistic-corpus and linguistic-bitext (data available)
Following skills: linguistic-eval (benchmark selection after transfer plan)
When It Activates
- Adding a new language to an existing pretrained LLM (Llama-3, Mistral, Qwen, mBART, NLLB, BLOOM)
- Choosing LoRA rank, alpha, target modules
- Choosing between continued pretraining vs LoRA vs full fine-tune
- Picking a tool (Unsloth, LLaMA-Factory, Axolotl, PEFT)
- Designing catastrophic-forgetting mitigation
- Picking adapter stack (MAD-X language + task adapters)
When NOT to use: Training English-only with abundant data → standard fine-tune, no specialist needed.
What It Does
Overall approach by class and data:
| Target Class | Parallel Data | Best URIEL | Recommended Approach |
|---|---|---|---|
| 0–1 | < 10K | any | Vocab extension (HyperOfa) + LoRA + multilingual base |
| 1–2 | 10K–100K | < 0.4 | OFA + LoRA r=16–32 |
| 1–2 | 10K–100K | 0.4–0.6 | OFA + LoRA r=32–64 |
| 1–2 | 10K–100K | > 0.6 | Multilingual base + HyperOfa + LoRA r=32+ |
| 2–3 | 100K–1M | < 0.4 | Continued pretraining + LoRA |
| 2–3 | 100K–1M | > 0.6 | Vocab extension + LoRA r=32–64 (NOT CP — typology too far) |
| 3–4 | 1M–10M | any | Continued pretraining + LoRA OR full fine-tune |
| 5 | abundant | any | Standard full fine-tune |
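The table above can be read as a lookup. A minimal Python sketch under stated assumptions: `Row`, `APPROACH_ROWS`, and `recommend_approach` are hypothetical names, range boundaries are treated as half-open (a value sitting on a boundary falls into the higher row), "abundant" is modeled as no bound on pair count, and combinations the table does not list return `None` rather than a guess.

```python
from typing import List, NamedTuple, Optional

INF = float("inf")


class Row(NamedTuple):
    class_lo: int     # lowest Joshi class covered (inclusive)
    class_hi: int     # highest Joshi class covered (inclusive)
    pairs_lo: float   # parallel pairs, inclusive lower bound
    pairs_hi: float   # parallel pairs, exclusive upper bound
    uriel_lo: float   # best URIEL distance, inclusive lower bound
    uriel_hi: float   # best URIEL distance, exclusive upper bound
    approach: str


# One entry per row of the table, in the same order.
APPROACH_ROWS: List[Row] = [
    Row(0, 1, 0, 10_000, 0.0, INF, "Vocab extension (HyperOfa) + LoRA + multilingual base"),
    Row(1, 2, 10_000, 100_000, 0.0, 0.4, "OFA + LoRA r=16–32"),
    Row(1, 2, 10_000, 100_000, 0.4, 0.6, "OFA + LoRA r=32–64"),
    Row(1, 2, 10_000, 100_000, 0.6, INF, "Multilingual base + HyperOfa + LoRA r=32+"),
    Row(2, 3, 100_000, 1_000_000, 0.0, 0.4, "Continued pretraining + LoRA"),
    Row(2, 3, 100_000, 1_000_000, 0.6, INF, "Vocab extension + LoRA r=32–64 (NOT CP)"),
    Row(3, 4, 1_000_000, 10_000_000, 0.0, INF, "Continued pretraining + LoRA OR full fine-tune"),
    Row(5, 5, 0, INF, 0.0, INF, "Standard full fine-tune"),
]


def recommend_approach(joshi_class: int, parallel_pairs: int, best_uriel: float) -> Optional[str]:
    """Return the first matching table row's approach, or None if no row matches."""
    for row in APPROACH_ROWS:
        if (
            row.class_lo <= joshi_class <= row.class_hi
            and row.pairs_lo <= parallel_pairs < row.pairs_hi
            and row.uriel_lo <= best_uriel < row.uriel_hi
        ):
            return row.approach
    return None  # the table has no row for this combination
```

For instance, `recommend_approach(1, 50_000, 0.5)` lands on the OFA + LoRA r=32–64 row, while `recommend_approach(2, 200_000, 0.5)` returns `None` because the table has no Class 2–3 row for distances between 0.4 and 0.6.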
LoRA config by URIEL distance:
| URIEL Distance | Rank | Alpha | Target Modules |
|---|---|---|---|
| < 0.2 | 8 | 16 | attention only acceptable |
| 0.2–0.4 | 16 | 32 | all-linear recommended |
| 0.4–0.6 | 32 | 64 | all-linear |
| 0.6–0.8 | 64 | 128 | all-linear + embed_tokens (if vocab extended) |
| > 0.8 | 128+ | 256 | all-linear + embed_tokens; consider full fine-tune |
Attention-only (q_proj, v_proj) is the legacy default. Current best practice for typologically distant transfer is all-linear (q, k, v, o, gate, up, down). Attention-only loses 2–5 BLEU points on hard pairs.
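The rank table can be sketched as a small function. This is an illustration, not a fixed API: `LoraPlan` and `lora_plan` are hypothetical names, the band boundaries are assumed to be half-open (0.2 falls in the 0.2–0.4 band, and so on), and the module names follow the Llama-style `*_proj` naming, which other architectures may not share.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed Llama-style names for the seven linear projections listed above.
ALL_LINEAR: Tuple[str, ...] = (
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
)


@dataclass(frozen=True)
class LoraPlan:
    rank: int
    alpha: int
    target_modules: Tuple[str, ...]
    train_embed_tokens: bool
    note: str = ""


def lora_plan(uriel_distance: float, vocab_extended: bool = False) -> LoraPlan:
    """Map a URIEL distance to the rank/alpha/module row of the table."""
    if uriel_distance < 0:
        raise ValueError("URIEL distance must be non-negative")
    if uriel_distance < 0.2:
        return LoraPlan(8, 16, ALL_LINEAR, False, note="attention-only acceptable")
    if uriel_distance < 0.4:
        return LoraPlan(16, 32, ALL_LINEAR, False)
    if uriel_distance < 0.6:
        return LoraPlan(32, 64, ALL_LINEAR, False)
    if uriel_distance < 0.8:
        # embed_tokens is trained only if the vocabulary was extended.
        return LoraPlan(64, 128, ALL_LINEAR, vocab_extended)
    # Top band: the table lists embed_tokens unconditionally.
    return LoraPlan(128, 256, ALL_LINEAR, True, note="consider full fine-tune")


plan = lora_plan(0.62, vocab_extended=True)
print(plan.rank, plan.alpha, plan.train_embed_tokens)  # 64 128 True
```

The 0.62 used here is the English→Yoruba distance quoted in the overview. Alpha is kept at twice the rank throughout, matching every row of the table.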
Catastrophic-forgetting mitigation:
| Mitigation | When |
|---|---|
| Mix 10–20% source-language data in training | ALWAYS for cross-lingual fine-tune |
| Fisher-weighted regularization (EWC) | When source quality must be preserved |
| KL regularization to base model | Lighter alternative to EWC |
| Source-task eval throughout training | ALWAYS — catch forgetting early |
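For the first row of the table, the size of the source-language slice follows from the target fraction: to make source data a fraction f of the final mix, add n_target · f / (1 − f) source examples. A minimal sketch using only the standard library; `source_examples_needed` and `mix_with_source` are hypothetical helper names, and sampling without replacement is an assumption.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def source_examples_needed(n_target: int, source_fraction: float) -> int:
    """Number of source examples so they make up `source_fraction` of the mix."""
    if not 0.0 < source_fraction < 1.0:
        raise ValueError("source_fraction must be strictly between 0 and 1")
    return round(n_target * source_fraction / (1.0 - source_fraction))


def mix_with_source(
    target: Sequence[T],
    source: Sequence[T],
    source_fraction: float = 0.15,
    seed: int = 0,
) -> List[T]:
    """Return the target data plus a random source-language slice, shuffled."""
    k = source_examples_needed(len(target), source_fraction)
    if k > len(source):
        raise ValueError(f"need {k} source examples, only {len(source)} available")
    rng = random.Random(seed)
    mixed = list(target) + rng.sample(list(source), k)
    rng.shuffle(mixed)
    return mixed


# 120K target pairs at a 15% source mix:
print(source_examples_needed(120_000, 0.15))  # 21176
```

Dividing by (1 − f) matters: taking 15% of the target count (18,000) would give a mix that is only about 13% source.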
Tool selection:
| Setup | Recommend |
|---|---|
| Single GPU + QLoRA + speed priority | Unsloth (2× faster than LLaMA-Factory) |
| Multi-GPU + complex multilingual sampling | LLaMA-Factory |
| YAML-config + ergonomics | Axolotl |
| MAD-X / BAD-X adapter stacking | HuggingFace adapters library |
| Just want a baseline | PEFT directly |
Example Usage
Language: Yoruba (yor), URIEL distance to Igbo (best source) = 0.18, Joshi Class 2, 120K pairs available
## Transfer Plan: Yoruba
**Base model:** NLLB-200 distilled (rationale: multilingual base, good Yoruba seed)
**Approach:** OFA vocab extension + LoRA (Class 2, 120K pairs — below CP threshold)
**LoRA config:**
- rank: 16 (URIEL distance 0.18 — close pair)
- alpha: 32
- target modules: all-linear (recommended for Class 2+)
- embed_tokens: YES (vocab extended via OFA)
- dropout: 0.05
**Forgetting mitigation:**
- Source-language data mix: 15% Igbo
- Regularization: KL to base model
- Source-task eval cadence: every 500 steps
**Tool:** Unsloth (single GPU; latency-sensitive deployment)
**Estimated training tokens:** ~180M
**Estimated GPU-hours:** ~12h (A100 80GB)
**Hand-off:** linguistic-eval for benchmark + metric selection
Related Skills
- linguistic-scope — provides URIEL distance and Joshi class
- linguistic-tokenize — vocab extension method must align with adapter choice
- linguistic-bitext — provides parallel data for training
- linguistic-eval — benchmark selection after transfer plan is complete