MAGIC Agent Skills is now open source! Star on GitHub
MAGIC Agent SkillsMAGIC Agent Skills
Skills

Overview

linguistic-transfer plans cross-lingual adaptation of pretrained LLMs before any training job runs. The most common error in this domain is setting LoRA rank by data size rather than typological distance. r=8 is fine for English→Spanish; r=64+ is required for English→Inuktitut. URIEL distance — computed by linguistic-scope — is the correct input. This skill takes that distance and produces an evidence-based adapter plan.

Wrong adapter choice or rank wastes weeks of compute. Continued pretraining on fewer than 100M target tokens overfits and forgets the source language. English is often not the optimal transfer source — for Yoruba, Igbo (URIEL distance 0.18) outperforms English (distance 0.62) by 2–5× on transfer tasks.

Pipeline Position

This skill operates in Phase 1 — Acquire of the linguistic pipeline.

Preceding skills: linguistic-scope (URIEL distance, Joshi class), linguistic-tokenize (vocab extension method — must align with adapter choice), linguistic-corpus and linguistic-bitext (data available) Following skills: linguistic-eval (benchmark selection after transfer plan)

When It Activates

  • Adding a new language to an existing pretrained LLM (Llama-3, Mistral, Qwen, mBART, NLLB, BLOOM)
  • Choosing LoRA rank, alpha, target modules
  • Choosing between continued pretraining vs LoRA vs full fine-tune
  • Picking a tool (Unsloth, LLaMA-Factory, Axolotl, PEFT)
  • Designing catastrophic-forgetting mitigation
  • Picking adapter stack (MAD-X language + task adapters)

When NOT to use: Training English-only with abundant data → standard fine-tune, no specialist needed.

What It Does

Overall approach by class and data:

Target ClassParallel DataBest URIELRecommended Approach
0–1< 10KanyVocab extension (HyperOfa) + LoRA + multilingual base
1–210K–100K< 0.4OFA + LoRA r=16–32
1–210K–100K0.4–0.6OFA + LoRA r=32–64
1–210K–100K> 0.6Multilingual base + HyperOfa + LoRA r=32+
2–3100K–1M< 0.4Continued pretraining + LoRA
2–3100K–1M> 0.6Vocab extension + LoRA r=32–64 (NOT CP — typology too far)
3–41M–10ManyContinued pretraining + LoRA OR full fine-tune
5abundantanyStandard full fine-tune

LoRA config by URIEL distance:

URIEL DistanceRankAlphaTarget Modules
< 0.2816attention only acceptable
0.2–0.41632all-linear recommended
0.4–0.63264all-linear
0.6–0.864128all-linear + embed_tokens (if vocab extended)
> 0.8128+256all-linear + embed_tokens; consider full fine-tune

Attention-only (q_proj, v_proj) is legacy default. Current best practice for typologically-distant transfer: all-linear (q, k, v, o, gate, up, down). Attention-only loses 2–5 BLEU points on hard pairs.

Catastrophic-forgetting mitigation:

MitigationWhen
Mix 10–20% source-language data in trainingALWAYS for cross-lingual fine-tune
Fisher-weighted regularization (EWC)When source quality must be preserved
KL regularization to base modelLighter alternative to EWC
Source-task eval throughout trainingALWAYS — catch forgetting early

Tool selection:

SetupRecommend
Single GPU + QLoRA + speed priorityUnsloth (2× faster than LLaMA-Factory)
Multi-GPU + complex multilingual samplingLLaMA-Factory
YAML-config + ergonomicsAxolotl
MAD-X / BAD-X adapter stackingHuggingFace adapters library
Just want a baselinePEFT directly

Example Usage

Language: Yoruba (yor), URIEL distance to Igbo (best source) = 0.18, Joshi Class 2, 120K pairs available

## Transfer Plan: Yoruba

**Base model:** NLLB-200 distilled (rationale: multilingual base, good Yoruba seed)
**Approach:** OFA vocab extension + LoRA (Class 2, 120K pairs — below CP threshold)
**LoRA config:**
  - rank: 16 (URIEL distance 0.18 — close pair)
  - alpha: 32
  - target modules: all-linear (recommended for Class 2+)
  - embed_tokens: YES (vocab extended via OFA)
  - dropout: 0.05
**Forgetting mitigation:**
  - Source-language data mix: 15% Igbo
  - Regularization: KL to base model
  - Source-task eval cadence: every 500 steps
**Tool:** Unsloth (single GPU; latency-sensitive deployment)
**Estimated training tokens:** ~180M
**Estimated GPU-hours:** ~12h (A100 80GB)
**Hand-off:** linguistic-eval for benchmark + metric selection
Was this page helpful?
Edit on GitHub

Last updated on

On this page