Cross-Suite Integration Guide
The Linguistic Agent Skills suite and the Data Agent Skills suite are sibling repos with complementary responsibilities. The data suite handles general-purpose tabular and text data pipelines; the linguistic suite handles language-specific decisions within those pipelines. Used together, they cover the full stack from raw multilingual data to a production-ready low-resource LLM.
Responsibility Boundaries
| Task | Suite | Skill |
|---|---|---|
| Load a CSV or parquet file | Data | magic-data-loading |
| Profile data quality | Data | magic-data-profiling |
| Clean missing values, fix types | Data | magic-data-cleaning |
| Deduplicate rows in a dataframe | Data | magic-data-cleaning |
| Resolve ISO 639-3 language identity | Linguistic | linguistic-scope |
| Set Unicode normalization policy | Linguistic | linguistic-scripts |
| MinHash dedup for text corpora | Linguistic | linguistic-corpus |
| Language-ID at paragraph granularity | Linguistic | linguistic-corpus |
| Tokenizer fertility audit | Linguistic | linguistic-tokenize |
| Bitext mining + alignment | Linguistic | linguistic-bitext |
| LoRA / adapter strategy | Linguistic | linguistic-transfer |
| Contamination audit | Linguistic | linguistic-corpus + linguistic-eval |
| Visualize corpus statistics | Data | magic-data-visualization |
| Generate synthetic text | Data | magic-data-synthesis |
| Statistical analysis of eval results | Data | magic-statistical-analysis |
| End-to-end pipeline report | Data | magic-report-generation |
Common Integration Patterns
Pattern 1: Multilingual Dataset Quality Audit
Use data skills for structural quality, linguistic skills for language-specific quality.
1. magic-data-loading → load the multilingual CSV / parquet
2. magic-data-profiling → structural quality (nulls, types, row counts)
3. magic-data-cleaning → fix structural issues (byte-level encoding errors, type coercion)
4. linguistic-scope → identify target language(s) in the dataset
5. linguistic-scripts → set normalization policy per language
6. linguistic-corpus → paragraph-level LID, MinHash dedup, register balance
7. magic-data-visualization → visualize language distribution, register breakdown
8. magic-report-generation → combined quality report
Pattern 2: Low-Resource MT Data Pipeline
Full pipeline from raw web crawl to clean, deduplicated, contamination-checked bitext.
1. magic-data-loading → load raw CC or OPUS dump
2. linguistic-scope → identify source + target languages, resource class
3. linguistic-scripts → normalization policy (NFC, confusable folding)
4. linguistic-ethics → per-source license check before any processing
5. linguistic-corpus → monolingual corpus (LID, dedup, register)
6. linguistic-bitext → mine + align parallel data
7. linguistic-tokenize → fertility audit on bitext target side
8. magic-statistical-analysis → analyze bitext pair statistics (length ratio, score distribution)
9. magic-data-visualization → visualize corpus composition
10. linguistic-eval → contamination check, benchmark selection
Pattern 3: Annotation Project Pipeline
Combine data skills for annotation data management with linguistic skills for IAA and gold-standard creation.
1. magic-data-loading → load annotation CSV from Label Studio / Prodigy export
2. magic-data-profiling → check annotation completeness, label distribution
3. linguistic-annotate → IAA calculation (κ/α/γ), adjudication workflow
4. magic-data-cleaning → resolve adjudicated conflicts into final gold labels
5. magic-statistical-analysis → inter-annotator statistics, confusion matrices
6. linguistic-eval → integrate gold set into eval pipeline
Pattern 4: LLM Eval Results Analysis
Use data skills to analyze model evaluation outputs produced by linguistic eval.
1. linguistic-eval → run benchmark (FLORES+, Belebele, custom probes)
2. magic-data-loading → load eval results CSV
3. magic-statistical-analysis → per-dialect / per-register breakdown, significance tests
4. magic-data-visualization → error analysis charts, per-stratum comparisons
5. magic-report-generation → stakeholder eval report
Shared Concepts
MinHash Deduplication
Both suites handle deduplication but for different data types:
- Data suite (magic-data-cleaning): row-level dedup for structured/tabular data
- Linguistic suite (linguistic-corpus): text-level MinHash dedup with language-specific thresholds (0.9 for low-resource languages, not 0.8)
For text corpora, always use linguistic-corpus dedup: it applies confusable folding first and uses the 0.9 threshold for low-resource data.
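To make the text-level approach concrete, here is a minimal pure-Python sketch of MinHash near-duplicate filtering at the 0.9 threshold mentioned above. It is not the linguistic-corpus implementation. The function names, the character 5-gram shingling, the 128 hash functions, and the use of NFKC plus case folding as a rough stand-in for confusable folding are all assumptions made for illustration.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    # Assumption: NFKC + casefold + whitespace collapse only approximates
    # confusable folding; a real pipeline would use a proper confusables map.
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def shingles(text: str, k: int = 5) -> set[str]:
    # Character k-grams of the normalized text; short texts yield one shingle.
    text = normalize(text)
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash64(gram: str, seed: int) -> int:
    # Salted 64-bit hash; each seed plays the role of one random permutation.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_perm: int = 128) -> list[int]:
    # For every hash function, keep the minimum hash over all shingles.
    grams = shingles(text)
    return [min(_hash64(g, seed) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching signature slots estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs: list[str], threshold: float = 0.9) -> list[str]:
    # Keep a document only if it is below the threshold against every kept one.
    # Pairwise comparison is quadratic; large corpora would need LSH banding.
    kept: list[str] = []
    kept_sigs: list[list[int]] = []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

Because normalization runs before shingling, two strings that differ only in case, spacing, or compatibility forms produce identical signatures and collapse to one entry. Raising the threshold from 0.8 to 0.9 means fewer pairs count as duplicates, so less of a small corpus is discarded.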
Synthetic Data Generation
Both suites generate synthetic data but for different purposes:
- Data suite (magic-data-synthesis): fill sentinel values, format conversion, LLM-based imputation
- Linguistic suite (linguistic-bitext): back-translation, dictionary substitution, pivot MT for low-resource bitext
For synthetic parallel text generation, use linguistic-bitext. For filling structural gaps in a multilingual dataset, use magic-data-synthesis.
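As an illustration of one of the techniques listed for linguistic-bitext, the sketch below shows dictionary substitution in its simplest form: source tokens that appear in a bilingual lexicon are randomly replaced with a target-language entry. The function name, the `rate` and `seed` parameters, and the toy English-to-Swahili lexicon are hypothetical; the skill's actual procedure is not specified here.

```python
import random


def dictionary_substitute(
    tokens: list[str],
    lexicon: dict[str, list[str]],
    rate: float = 0.3,
    seed: int = 0,
) -> list[str]:
    """Replace in-lexicon tokens with a translation with probability `rate`."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    out = []
    for tok in tokens:
        if tok in lexicon and rng.random() < rate:
            # Pick one of the listed translations for this source word.
            out.append(rng.choice(lexicon[tok]))
        else:
            out.append(tok)
    return out


# Toy lexicon for illustration only (hypothetical, not shipped with either suite).
TOY_LEXICON = {"water": ["maji"], "house": ["nyumba"]}
```

With `rate=1.0` every token found in the lexicon is replaced, and with `rate=0.0` the sentence passes through unchanged. Token count is preserved in both cases, which keeps the augmented sentence position-aligned with its source.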
Report Generation
Use magic-report-generation (data suite) for final stakeholder reports that incorporate outputs from both suites. The linguistic suite produces structured findings in workspace_state.md; the data suite can read these and incorporate them into formatted reports.
Installation Side-by-Side
Both suites symlink into ~/.claude/skills/; the linguistic suite also links its commands into ~/.claude/commands/:
# Data suite
ln -s "$(pwd)/magic-data-agent-skills/skills" ~/.claude/skills/data
# Linguistic suite
ln -s "$(pwd)/magic-linguistic-agent-skills/skills" ~/.claude/skills/linguistic
ln -s "$(pwd)/magic-linguistic-agent-skills/commands" ~/.claude/commands/linguistic
Claude Code will use both suites in the same session. The orchestrators are independent: linguistic-orchestrator handles linguistic routing; data skills activate on data-domain triggers.
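A quick way to confirm that the links from the commands above resolve is a small check like the following. It is a sketch, not part of either suite: the function name `check_links` is hypothetical, and the three relative paths are taken directly from the `ln -s` targets shown above.

```python
from pathlib import Path

# Link locations created by the ln -s commands, relative to ~/.claude
EXPECTED_LINKS = ["skills/data", "skills/linguistic", "commands/linguistic"]


def check_links(claude_dir: Path) -> dict[str, bool]:
    """Return, per expected link, whether it is a symlink with a live target."""
    status = {}
    for rel in EXPECTED_LINKS:
        link = claude_dir / rel
        # is_symlink() is True even for dangling links; exists() follows the
        # link, so the conjunction is True only when the target is present.
        status[rel] = link.is_symlink() and link.exists()
    return status
```

Calling `check_links(Path.home() / ".claude")` returns a dictionary with one boolean per link; a `False` entry means the link is missing or points at a directory that no longer exists.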
Cross-Suite Links in Skill Pages
Throughout the Linguistic Agent Skills documentation, you'll see cross-references to related data skills:
- linguistic-corpus → magic-data-cleaning (structural data cleaning before linguistic processing)
- linguistic-eval → magic-statistical-analysis (statistical significance of eval results)
- linguistic-annotate → magic-data-profiling (annotation data completeness checks)
- linguistic-bitext → magic-data-synthesis (synthetic data generation strategy)