linguistic-codeswitch
Code-switching awareness for ML pipelines: Hinglish, Spanglish, Singlish, MSA + dialect Arabic, Mandarin + Cantonese alternation, and other bilingual/multilingual mixing. Optional Mindset specialist.
Overview
Code-switching (CS) is the norm, not noise. Roughly 50% of the world's population is multilingual, and multilingual speakers routinely mix languages in conversational text. Treating code-switched data as a "data quality" problem is a category error: it filters out the way users naturally write. In low-resource contexts (Twi-English chat, Yoruba-English social media), CS data is often the only natural data available, so filtering it out can leave no usable training data.
Pipeline Position
Phase: Optional (Phase 4) — activate when corpus analysis reveals significant CS data
When to activate: After linguistic-corpus identifies code-switched content in the target data
Related: linguistic-corpus (paragraph-level LID for CS detection), linguistic-tokenize (CS-aware tokenizer training)
When It Activates
- User-generated text from bilingual / multilingual communities
- Building a chatbot for CS-prevalent communities
- Diagnosing model failures on Hinglish / Spanglish / MSA+dialect
- CS data constitutes a significant portion of available training data
When NOT to use: Monolingual data, where there is no code-switching to handle. For language ID at paragraph granularity, use linguistic-corpus.
What It Does
Matrix Language Frame
In the Matrix Language Frame model (Myers-Scotton, 1993), one language is the matrix language, which supplies the grammatical structure, and the other is the embedded language, which supplies insertions. Methods for detecting the matrix language exist and are useful for tokenizer training and corpus stratification.
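As a rough illustration of matrix-language detection, the sketch below counts function words per language and takes the language that contributes the most as the matrix language. This is only a crude proxy for the model's idea that the matrix language supplies the grammatical frame. It is not a method prescribed by this skill. The `FUNCTION_WORDS` lists and the `guess_matrix_language` helper are hypothetical and far too small for real use.

```python
from collections import Counter

# Hypothetical, tiny function-word lists for illustration only. A real
# pipeline would need curated closed-class lexicons or a morphological tagger.
FUNCTION_WORDS = {
    "eng": {"the", "is", "to", "a", "of", "and", "i", "was", "for"},
    "hin": {"hai", "ka", "ki", "ko", "mein", "se", "nahi", "aur", "tha"},
}


def guess_matrix_language(tokens):
    """Return (matrix_language, counts) based on function-word counts.

    The language contributing the most function words is taken as the
    matrix language. A tie, or no function-word hits at all, returns None.
    """
    counts = Counter()
    for tok in tokens:
        word = tok.lower()
        for lang, words in FUNCTION_WORDS.items():
            if word in words:
                counts[lang] += 1
    if not counts:
        return None, counts
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None, counts
    return ranked[0][0], counts


# Romanized Hindi-English example: the content words are mixed,
# but the function words (hai, aur, ka, tha) are all Hindi.
sentence = "mujhe yeh movie bahut pasand hai aur story ka ending perfect tha"
matrix, counts = guess_matrix_language(sentence.split())
print(matrix, dict(counts))
```

Returning `None` on ties keeps ambiguous cases visible so they can be reviewed, which matters when the result drives corpus stratification.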
Key Data Sources
| Benchmark | Languages | Use |
|---|---|---|
| LinCE | Hinglish, Spanglish, MSA+Egyptian, etc. | CS evaluation benchmark |
| GLUECoS | Hindi-English, Spanish-English | CS NLP tasks |
| MADAR | MSA + 25 Arabic dialects | Dialect variation |
| DART | Arabic dialects (annotated tweets) | Dialect + CS |
CS-Specific Processing Rules
- Paragraph-level LID is the minimum granularity: document-level LID averages over the switches and assigns a single, often wrong, label
- A CS-aware tokenizer preserves code-switched tokens and keeps the per-script policy intact (don't strip diacritics from one side just because the other is Latin)
- MT models trained on monolingual data fail at CS in characteristic ways: language-ID confusion and partial-translation hallucination. A targeted CS eval is needed
- Never filter CS as noise without checking community usage patterns: doing so can remove a large share of conversational data (about 60% in the Yoruba-English example under Example Usage)
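The first rule can be illustrated with a minimal sketch that compares one label for the whole document against one label per paragraph. The `toy_lid` function and its `LEXICON` are hypothetical stand-ins for a real LID model, included only so the example runs. A real pipeline would pass its own classifier as `identify`.

```python
from collections import Counter

# Toy stand-in for a real LID model. The word lists are hypothetical and
# exist only to make the sketch runnable.
LEXICON = {
    "eng": {"the", "meeting", "is", "at", "noon", "please", "come", "early"},
    "spa": {"la", "reunión", "es", "al", "mediodía", "por", "favor", "llega", "temprano"},
}


def toy_lid(text):
    """Return the language whose lexicon covers the most tokens, or 'und'."""
    hits = Counter()
    for tok in text.lower().split():
        for lang, words in LEXICON.items():
            if tok in words:
                hits[lang] += 1
    return hits.most_common(1)[0][0] if hits else "und"


def paragraph_level_lid(document, identify=toy_lid):
    """Label each non-empty paragraph (split on blank lines) separately."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [(identify(p), p) for p in paragraphs]


def is_code_switched(labels):
    """Treat a document as CS if its paragraphs carry more than one language."""
    return len({lang for lang in labels if lang != "und"}) > 1


doc = "the meeting is at noon please come early\n\nla reunión es al mediodía"
doc_label = toy_lid(doc)  # one label for the whole document
para_labels = [lang for lang, _ in paragraph_level_lid(doc)]
print(doc_label, para_labels, is_code_switched(para_labels))
```

The document-level call returns a single label and hides the Spanish paragraph. The paragraph-level pass exposes both languages. Switching inside a sentence would need finer granularity than this sketch provides, which is why paragraph level is a floor and not a ceiling.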
Inputs & Outputs
| Input | Description |
|---|---|
| Target community languages | Both/all languages in the CS mix |
| Corpus samples | For CS detection and stratification |
| Output | Description |
|---|---|
| Matrix/embedded language identification | Per-document or per-paragraph |
| CS prevalence estimate | % of data that is code-switched |
| Processing recommendations | LID strategy, tokenizer policy, eval suite |
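One possible shape for these outputs is sketched below, assuming token-level language tags are already available from an upstream LID step. The `ParagraphReport` class, its field names, and `cs_prevalence` are hypothetical. They are not a schema defined by this skill.

```python
from dataclasses import dataclass


@dataclass
class ParagraphReport:
    """Hypothetical per-paragraph record; field names are illustrative."""
    matrix_lang: str
    token_langs: list  # one language tag per token

    @property
    def is_code_switched(self):
        # CS here means tokens from more than one language in the paragraph.
        return len(set(self.token_langs)) > 1

    @property
    def embedded_share(self):
        # Fraction of tokens that are not in the matrix language.
        if not self.token_langs:
            return 0.0
        other = sum(1 for lang in self.token_langs if lang != self.matrix_lang)
        return other / len(self.token_langs)


def cs_prevalence(reports):
    """Percentage of paragraphs that are code-switched."""
    if not reports:
        return 0.0
    return 100.0 * sum(r.is_code_switched for r in reports) / len(reports)


# Made-up tags for four paragraphs, for illustration only.
reports = [
    ParagraphReport("eng", ["eng", "eng", "yor", "eng"]),
    ParagraphReport("eng", ["eng", "eng", "eng"]),
    ParagraphReport("yor", ["yor", "eng", "yor", "yor"]),
    ParagraphReport("eng", ["eng", "eng"]),
]
print(f"CS prevalence: {cs_prevalence(reports):.0f}%")
print(f"Embedded share, first paragraph: {reports[0].embedded_share:.2f}")
```

Keeping the matrix language and the token tags on the same record lets the prevalence estimate and the embedded-token share be derived from one pass over the data.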
Example Usage
Community: Urban Nigerian youth (Yoruba–English)
Code-Switching Analysis: Yoruba–English
- Matrix language: English (grammatical structure)
- Embedded: Yoruba (lexical insertions, ~25% of tokens)
- CS prevalence: ~60% of social media data (yor-eng mix)
- LID: paragraph-level GlotLID (document-level would misclassify ~40%)
- Tokenizer: CS-aware; preserve Yoruba tone diacritics even in English-matrix context
- Data sources: no existing labeled Yoruba-English CS corpus;
collect from Twitter/X using language-pair detection
- Eval: construct CS-aware eval set (LinCE methodology);
test for partial-translation hallucination
Related Skills
- linguistic-corpus: paragraph-level LID detects CS in the corpus
- linguistic-tokenize: CS-aware tokenizer training
- linguistic-scripts: per-script diacritic policy must apply to both languages