linguistic-codeswitch

Code-switching awareness for ML pipelines: Hinglish, Spanglish, Singlish, MSA + dialect Arabic, Mandarin + Cantonese alternation, and other bilingual/multilingual mixing. Optional Mindset specialist.

Overview

Code-switching is the norm, not noise. Roughly half of the world's population is multilingual, and multilingual speakers routinely mix languages in conversational text. Treating code-switched (CS) data as a "data quality" problem is a category error: you are filtering out the users' natural way of speaking. In low-resource contexts (Twi-English chat, Yoruba-English social media), CS data is often the only natural data available, so filtering it out can leave no usable training data.

Pipeline Position

Phase: Optional (Phase 4) — activate when corpus analysis reveals significant CS data

When to activate: After linguistic-corpus identifies code-switched content in the target data

Related: linguistic-corpus (paragraph-level LID for CS detection), linguistic-tokenize (CS-aware tokenizer training)

When It Activates

  • User-generated text from bilingual / multilingual communities
  • Building a chatbot for CS-prevalent communities
  • Diagnosing model failures on Hinglish / Spanglish / MSA+dialect
  • CS data constitutes a significant portion of available training data

When NOT to use: Monolingual data, where there is no code-switching to handle. For language ID at paragraph granularity, use linguistic-corpus.

What It Does

Matrix Language Frame

One language is the matrix (it supplies the grammatical structure); the other is embedded (it supplies insertions). This is the Matrix Language Frame model of Myers-Scotton (1993). Methods for detecting the matrix language exist and are useful for tokenizer design and corpus stratification.
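As a rough illustration of matrix-language detection, the sketch below picks the language that contributes the most function words in an utterance, on the idea that the matrix language supplies the grammatical frame. It is a minimal heuristic, not the skill's actual method: the function name `guess_matrix_language`, the token-level language tags, and the tiny function-word lists are all assumptions made for this example. A real pipeline would more likely use a token-level LID model and POS tags.

```python
from collections import Counter

# Hypothetical, hand-written function-word lists (illustration only).
FUNCTION_WORDS = {
    "hin": {"main", "tha", "ho", "hai", "ka", "ki"},
    "eng": {"but", "the", "is", "to", "and"},
}

def guess_matrix_language(tagged_tokens, function_words):
    """Guess (matrix, embedded) from (token, lang) pairs.

    The matrix language is the one supplying the most function words.
    Returns (None, all_languages) when there is no evidence or a tie.
    """
    fw_counts = Counter(
        lang
        for tok, lang in tagged_tokens
        if tok.lower() in function_words.get(lang, set())
    )
    langs_present = {lang for _, lang in tagged_tokens}
    ranked = fw_counts.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return None, sorted(langs_present)
    matrix = ranked[0][0]
    return matrix, sorted(langs_present - {matrix})

# Romanized Hinglish utterance with assumed token-level language tags.
utterance = [
    ("main", "hin"), ("kal", "hin"), ("office", "eng"), ("ja", "hin"),
    ("raha", "hin"), ("tha", "hin"), ("but", "eng"), ("meeting", "eng"),
    ("cancel", "eng"), ("ho", "hin"), ("gayi", "hin"),
]

print(guess_matrix_language(utterance, FUNCTION_WORDS))  # ('hin', ['eng'])
```

Counting function words rather than all tokens matters here: an utterance can contain many embedded content words and still be framed grammatically by the other language.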

Key Data Sources

| Benchmark | Languages | Use |
|---|---|---|
| LinCE | Hinglish, Spanglish, MSA+Egyptian, etc. | CS evaluation benchmark |
| GLUECoS | Hindi-English, Spanish-English | CS NLP tasks |
| MADAR | MSA + 25 Arabic dialects | Dialect variation |
| DART | Dialect annotation in tweets | Dialect + CS |

CS-Specific Processing Rules

  • Paragraph-level LID is the floor: document-level LID averages over code-switched content and mislabels it
  • A CS-aware tokenizer preserves code-switched tokens and applies script policy per script (don't strip diacritics from one side just because the other is Latin)
  • MT models trained on monolingual data fail at CS in characteristic ways, such as language-ID confusion and partial-translation hallucination, so a targeted CS eval is needed
  • Never filter CS as noise without checking community usage patterns; you may be filtering out as much as 60% of conversational data
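The first rule can be seen in a small sketch that compares one label for a whole document against one label per paragraph. The toy lexicon lookup below merely stands in for a real LID model; `LEXICON`, `label`, and `paragraph_labels` are hypothetical names introduced for this example, and the word lists are far too small for real use.

```python
from collections import Counter

# Toy lexicons standing in for a real LID model (assumption for illustration).
LEXICON = {
    "eng": {"i", "have", "to", "finish", "this", "report",
            "the", "deadline", "is", "today"},
    "spa": {"pero", "no", "tengo", "tiempo", "hoy"},
}

def label(text):
    """Return the language whose lexicon matches the most tokens, or 'und'."""
    hits = Counter()
    for tok in text.lower().split():
        for lang, words in LEXICON.items():
            if tok in words:
                hits[lang] += 1
    return hits.most_common(1)[0][0] if hits else "und"

def paragraph_labels(document):
    """Label each blank-line-separated paragraph on its own."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return [label(p) for p in paragraphs]

doc = (
    "I have to finish this report today\n\n"
    "Pero no tengo tiempo hoy\n\n"
    "The deadline is today"
)

print(label(doc))             # eng
print(paragraph_labels(doc))  # ['eng', 'spa', 'eng']
```

The document-level label hides the Spanish paragraph entirely. Paragraph-level labels still miss switches inside a single sentence, which need token-level tagging.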

Inputs & Outputs

| Input | Description |
|---|---|
| Target community languages | Both/all languages in the CS mix |
| Corpus samples | For CS detection and stratification |

| Output | Description |
|---|---|
| Matrix/embedded language identification | Per-document or per-paragraph |
| CS prevalence estimate | % of data that is code-switched |
| Processing recommendations | LID strategy, tokenizer policy, eval suite |
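One way to picture these outputs is as a small record plus a prevalence calculation. The sketch below is an assumption about shape only: `CSReport`, its field names, and `cs_prevalence` are hypothetical and not part of the skill's documented interface. Prevalence is computed here as the fraction of paragraphs in which more than one language was detected.

```python
from dataclasses import dataclass, field

@dataclass
class CSReport:
    """Hypothetical container for the three outputs listed above."""
    matrix_language: str
    embedded_languages: list
    cs_prevalence: float  # fraction of paragraphs flagged as code-switched
    recommendations: dict = field(default_factory=dict)

def cs_prevalence(paragraph_languages):
    """paragraph_languages: one set of detected language codes per paragraph."""
    if not paragraph_languages:
        return 0.0
    switched = sum(1 for langs in paragraph_languages if len(langs) > 1)
    return switched / len(paragraph_languages)

# Made-up detections for five paragraphs (illustrative data, not a result).
detected = [{"yor", "eng"}, {"eng"}, {"yor", "eng"}, {"eng", "yor"}, {"yor"}]

report = CSReport(
    matrix_language="eng",
    embedded_languages=["yor"],
    cs_prevalence=cs_prevalence(detected),
    recommendations={"lid": "paragraph-level", "tokenizer": "CS-aware"},
)

print(report.cs_prevalence)  # 0.6
```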

Example Usage

Community: Urban Nigerian youth (Yoruba–English)

Code-Switching Analysis: Yoruba–English
- Matrix language: English (grammatical structure)
- Embedded: Yoruba (lexical insertions, ~25% of tokens)
- CS prevalence: ~60% of social media data (yor-eng mix)
- LID: paragraph-level GlotLID (document-level would misclassify ~40%)
- Tokenizer: CS-aware; preserve Yoruba tone diacritics even in English-matrix context
- Data sources: no existing labeled Yoruba-English CS corpus;
    collect from Twitter/X using language-pair detection
- Eval: construct CS-aware eval set (LinCE methodology);
    test for partial-translation hallucination
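The tokenizer recommendation above (preserve Yoruba tone diacritics) can be illustrated with Python's standard `unicodedata` module. The sketch contrasts NFC normalization, which keeps tone marks, with a lossy policy that strips combining marks. The function names and the three sample forms are chosen for this example; the forms differ only in their tone marking.

```python
import unicodedata

def normalize_keep_diacritics(text):
    """NFC-normalize so equivalent sequences compare equal; marks are kept."""
    return unicodedata.normalize("NFC", text)

def strip_diacritics(text):
    """The lossy policy to avoid: decompose, then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Three forms that differ only in tone marks, written with escapes so the
# example does not depend on how the source file is encoded.
words = [
    "\u1ecdk\u1ecd",        # o-dot-below, k, o-dot-below
    "\u1ecdk\u1ecd\u0301",  # same, plus combining acute on the final vowel
    "\u1ecdk\u1ecd\u0300",  # same, plus combining grave on the final vowel
]

kept = {normalize_keep_diacritics(w) for w in words}
stripped = {strip_diacritics(w) for w in words}

print(len(kept), len(stripped))  # 3 1
```

With marks kept, the three forms stay distinct; after stripping, all of them collapse to the same ASCII string, which is the information loss the per-script policy is meant to prevent.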