
Cross-Suite Integration Guide

The Linguistic Agent Skills suite and the Data Agent Skills suite are sibling repos with complementary responsibilities. The data suite handles general-purpose tabular and text data pipelines; the linguistic suite handles language-specific decisions within those pipelines. Used together, they cover the full stack from raw multilingual data to a production-ready LLM for low-resource languages.

Responsibility Boundaries

| Task | Suite | Skill |
| --- | --- | --- |
| Load a CSV or parquet file | Data | magic-data-loading |
| Profile data quality | Data | magic-data-profiling |
| Clean missing values, fix types | Data | magic-data-cleaning |
| Deduplicate rows in a dataframe | Data | magic-data-cleaning |
| Resolve ISO 639-3 language identity | Linguistic | linguistic-scope |
| Set Unicode normalization policy | Linguistic | linguistic-scripts |
| MinHash dedup for text corpora | Linguistic | linguistic-corpus |
| Language-ID at paragraph granularity | Linguistic | linguistic-corpus |
| Tokenizer fertility audit | Linguistic | linguistic-tokenize |
| Bitext mining + alignment | Linguistic | linguistic-bitext |
| LoRA / adapter strategy | Linguistic | linguistic-transfer |
| Contamination audit | Linguistic | linguistic-corpus + linguistic-eval |
| Visualize corpus statistics | Data | magic-data-visualization |
| Generate synthetic text | Data | magic-data-synthesis |
| Statistical analysis of eval results | Data | magic-statistical-analysis |
| End-to-end pipeline report | Data | magic-report-generation |

Common Integration Patterns

Pattern 1: Multilingual Dataset Quality Audit

Use data skills for structural quality, linguistic skills for language-specific quality.

1. magic-data-loading      → load the multilingual CSV / parquet
2. magic-data-profiling    → structural quality (nulls, types, row counts)
3. magic-data-cleaning     → fix structural issues (byte-level encoding errors, type coercion)
4. linguistic-scope        → identify target language(s) in the dataset
5. linguistic-scripts      → set normalization policy per language
6. linguistic-corpus       → paragraph-level LID, MinHash dedup, register balance
7. magic-data-visualization → visualize language distribution, register breakdown
8. magic-report-generation → combined quality report
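
Step 5 matters because visually identical strings can differ at the code-point level, which silently defeats exact-match comparison and deduplication in the later steps. The following is a minimal sketch using Python's standard `unicodedata` module. It assumes NFC as the policy purely for illustration (the actual form is whatever linguistic-scripts selects for the language), and `normalize_text` is a hypothetical helper name, not part of either suite.

```python
import unicodedata

def normalize_text(text: str, form: str = "NFC") -> str:
    """Apply one Unicode normalization form consistently (NFC assumed here)."""
    return unicodedata.normalize(form, text)

# "é" as one precomposed code point vs. "e" followed by a combining acute accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# The raw strings render identically but compare unequal.
print(composed == decomposed)

# After normalization to the same form they compare equal.
print(normalize_text(composed) == normalize_text(decomposed))

# Code-point counts before and after normalizing the decomposed string.
print(len(decomposed), len(normalize_text(decomposed)))
```

Applying the same form to every text column before step 6 keeps LID and dedup from treating encoding variants as distinct content.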

Pattern 2: Low-Resource MT Data Pipeline

Full pipeline from raw web crawl to clean, deduplicated, contamination-checked bitext.

1. magic-data-loading      → load raw Common Crawl (CC) or OPUS dump
2. linguistic-scope        → identify source + target languages, resource class
3. linguistic-scripts      → normalization policy (NFC, confusable folding)
4. linguistic-ethics       → per-source license check before any processing
5. linguistic-corpus       → monolingual corpus (LID, dedup, register)
6. linguistic-bitext       → mine + align parallel data
7. linguistic-tokenize     → fertility audit on bitext target side
8. magic-statistical-analysis → analyze bitext pair statistics (length ratio, score distribution)
9. magic-data-visualization → visualize corpus composition
10. linguistic-eval        → contamination check, benchmark selection
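
The length-ratio statistic mentioned in step 8 can be illustrated with the standard library alone. This sketch assumes character-length ratios and an acceptance band of 0.5 to 2.0; the band, the sample sentences, and the helper names `length_ratio` and `filter_by_ratio` are all hypothetical, and a sensible band depends on the language pair and script.

```python
from statistics import mean, median

def length_ratio(source: str, target: str) -> float:
    """Target-to-source character length ratio for one sentence pair."""
    return len(target) / max(len(source), 1)

def filter_by_ratio(pairs, low=0.5, high=2.0):
    """Keep pairs whose ratio falls inside [low, high] (illustrative band)."""
    return [(s, t) for s, t in pairs if low <= length_ratio(s, t) <= high]

# Toy sentence pairs; the third is a deliberate misalignment.
pairs = [
    ("Good morning.", "Buenos días."),
    ("Where is the station?", "¿Dónde está la estación?"),
    ("Yes.", "Este texto es claramente demasiado largo para ser una traducción."),
]

ratios = [length_ratio(s, t) for s, t in pairs]
print(f"mean={mean(ratios):.2f} median={median(ratios):.2f}")
print(len(filter_by_ratio(pairs)), "of", len(pairs), "pairs kept")
```

A large gap between the mean and the median ratio, as in this toy data, is a quick signal that a few badly misaligned pairs are present.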

Pattern 3: Annotation Project Pipeline

Combine data skills for annotation data management with linguistic skills for IAA and gold-standard creation.

1. magic-data-loading      → load annotation CSV from Label Studio / Prodigy export
2. magic-data-profiling    → check annotation completeness, label distribution
3. linguistic-annotate     → IAA calculation (κ/α/γ), adjudication workflow
4. magic-data-cleaning     → resolve adjudicated conflicts into final gold labels
5. magic-statistical-analysis → inter-annotator statistics, confusion matrices
6. linguistic-eval         → integrate gold set into eval pipeline
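
As a concrete reference point for the IAA calculation in step 3, here is a minimal sketch of Cohen's κ for two annotators, using only the standard library. It covers κ only (not Krippendorff's α or γ), the label data is invented for illustration, and `cohen_kappa` is a hypothetical helper rather than the linguistic-annotate implementation.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(
        freq_a[label] * freq_b[label] for label in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["POS", "POS", "NEG", "NEG", "POS", "NEG", "POS", "NEG"]
annotator_2 = ["POS", "NEG", "NEG", "NEG", "POS", "NEG", "POS", "POS"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Here the annotators agree on 6 of 8 items (0.75) while chance agreement is 0.5, so κ is 0.5, noticeably lower than the raw agreement rate suggests.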

Pattern 4: LLM Eval Results Analysis

Use data skills to analyze model evaluation outputs produced by linguistic eval.

1. linguistic-eval         → run benchmark (FLORES+, Belebele, custom probes)
2. magic-data-loading      → load eval results CSV
3. magic-statistical-analysis → per-dialect / per-register breakdown, significance tests
4. magic-data-visualization → error analysis charts, per-stratum comparisons
5. magic-report-generation → stakeholder eval report
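
One way to approach the significance tests in step 3 is a paired permutation test on per-item scores, run separately for each dialect or register stratum. The sketch below is an assumption about a suitable test, not the method magic-statistical-analysis necessarily uses; the function name and the sample scores are hypothetical.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-item scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical per-item correctness (1 = correct) for two models on one stratum.
model_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
model_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1]
print(f"p = {paired_permutation_test(model_a, model_b):.4f}")
```

Because the test pairs scores item by item, both models must be evaluated on exactly the same items within each stratum.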

Shared Concepts

MinHash Deduplication

Both suites handle deduplication but for different data types:

  • Data suite (magic-data-cleaning): row-level dedup for structured/tabular data
  • Linguistic suite (linguistic-corpus): text-level MinHash dedup with language-specific similarity thresholds (0.9 for low-resource languages rather than 0.8)

For text corpora, always use linguistic-corpus dedup: it applies confusable folding first and uses the 0.9 threshold intended for low-resource data.
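
To make the threshold concrete, the following is a minimal MinHash sketch built on `hashlib`. It is an illustration under stated assumptions, not the linguistic-corpus implementation: it uses character 5-gram shingles and 128 salted hashes (both arbitrary choices), all function names are hypothetical, and confusable folding is omitted.

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-gram shingles after collapsing whitespace."""
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(items: set, num_perm: int = 128) -> list:
    """One minimum per salted hash; the salts stand in for permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + s.encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The share of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.9) -> bool:
    """Flag a pair as duplicate when estimated similarity meets the threshold."""
    sim = estimated_jaccard(minhash_signature(shingles(text_a)),
                            minhash_signature(shingles(text_b)))
    return sim >= threshold

text_a = "The quick brown fox jumps over the lazy dog near the river bank at dawn."
text_b = "The quick brown fox jumps over the lazy dog near the river bank at dusk."
sim = estimated_jaccard(minhash_signature(shingles(text_a)),
                        minhash_signature(shingles(text_b)))
print(f"estimated similarity: {sim:.2f}")
print("duplicate at 0.8:", is_near_duplicate(text_a, text_b, threshold=0.8))
print("duplicate at 0.9:", is_near_duplicate(text_a, text_b, threshold=0.9))
```

A higher threshold flags fewer pairs as duplicates, so it removes less text, which is the conservative choice when every document in a low-resource corpus is hard to replace.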

Synthetic Data Generation

Both suites generate synthetic data but for different purposes:

  • Data suite (magic-data-synthesis): fill sentinel values, format conversion, LLM-based imputation
  • Linguistic suite (linguistic-bitext): back-translation, dictionary substitution, pivot MT for low-resource bitext

For synthetic parallel text generation, use linguistic-bitext. For filling structural gaps in a multilingual dataset, use magic-data-synthesis.

Report Generation

Use magic-report-generation (data suite) for final stakeholder reports that incorporate outputs from both suites. The linguistic suite produces structured findings in workspace_state.md; the data suite can read these and incorporate them into formatted reports.

Installation Side-by-Side

Both suites symlink into ~/.claude/skills/:

# Run from the directory that contains both cloned repos.
mkdir -p ~/.claude/skills ~/.claude/commands

# Data suite
ln -s "$(pwd)/magic-data-agent-skills/skills" ~/.claude/skills/data

# Linguistic suite
ln -s "$(pwd)/magic-linguistic-agent-skills/skills" ~/.claude/skills/linguistic
ln -s "$(pwd)/magic-linguistic-agent-skills/commands" ~/.claude/commands/linguistic

Claude Code will use both suites in the same session. The orchestrators are independent — linguistic-orchestrator handles linguistic routing; data skills activate on data-domain triggers.
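
A quick way to confirm that both suites are linked is to check that each expected path is a symlink resolving to a directory. The sketch below assumes the three link locations shown above; `check_install` and `EXPECTED_LINKS` are hypothetical names, not part of either suite.

```python
from pathlib import Path

# Link locations taken from the install commands, relative to ~/.claude.
EXPECTED_LINKS = ("skills/data", "skills/linguistic", "commands/linguistic")

def check_install(claude_home: Path) -> dict:
    """Map each expected link to True if it is a symlink resolving to a directory."""
    status = {}
    for rel in EXPECTED_LINKS:
        link = claude_home / rel
        status[rel] = link.is_symlink() and link.resolve().is_dir()
    return status

if __name__ == "__main__":
    for rel, ok in check_install(Path.home() / ".claude").items():
        print(f"{'ok     ' if ok else 'MISSING'}  ~/.claude/{rel}")
```

A link reported as missing usually means the `ln -s` command was run from a directory other than the one containing the cloned repos, so the `$(pwd)` prefix pointed at a path that does not exist.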

Throughout the Linguistic Agent Skills documentation, you'll see cross-references to related data skills:

  • linguistic-corpus → magic-data-cleaning (structural data cleaning before linguistic processing)
  • linguistic-eval → magic-statistical-analysis (statistical significance of eval results)
  • linguistic-annotate → magic-data-profiling (annotation data completeness checks)
  • linguistic-bitext → magic-data-synthesis (synthetic data generation strategy)