MAGIC Agent Skills
Skills Reference

magic-data-cleaning

Clean data by detecting issues, handling missing values, normalizing strings, and executing cleaning plans. This skill does NOT handle sentinel or placeholder values that require an LLM to fill; route those to magic-data-synthesis.

When It Activates

Use this skill when data has quality issues to fix. Trigger phrases: clean, fix, handle missing, nulls, duplicates, normalize, standardize, impute, remove outliers, fix data.

  • Data has missing values, duplicates, type errors, or text issues
  • Need to impute missing values or normalize strings
  • Need complex multi-step cleaning with domain-specific rules
  • After magic-data-profiling reveals quality issues

When NOT to Use: Use magic-data-validation for schema validation; use magic-data-transformation for reshaping; use magic-data-synthesis for LLM-based generation, translation, format conversion, or filling sentinel placeholders with meaningful content.

Quick Facts

Property | Value
Version | 2.0.0
Complexity | medium
Phase | 1
Scripts | 5

Tags

data-science cleaning missing-values normalization deduplication

Scripts

Scriptable Tools (call directly or read + adapt)

Script | Standard CLI Usage | When to Customize
detect_issues.py | python3 detect_issues.py data.csv report.json | --chunk-size N for files over 500K rows
handle_missing.py | python3 handle_missing.py data.csv cleaned.csv | --strategy median|knn for specific imputation; --columns col1,col2 to restrict scope
normalize_strings.py | python3 normalize_strings.py data.csv normalized.csv | --operations trim,encoding to run a subset; --columns col1,col2 to restrict scope
validate_clean.py | python3 validate_clean.py original.csv cleaned.csv report.json | --input-format when both files are non-CSV
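
To make the median strategy concrete, here is a minimal sketch of median imputation on a single column. It is not the code in handle_missing.py: the function name impute_median and the sample data are hypothetical, and it uses only the standard library rather than the pandas and numpy dependencies the skill lists.

```python
import csv
import io
import statistics


def impute_median(rows, column):
    """Fill blank cells in `column` with the median of the non-blank values.

    Hypothetical helper for illustration; not part of the skill's scripts.
    """
    present = [float(r[column]) for r in rows if r[column].strip() != ""]
    if not present:
        return rows  # no values to compute a median from; leave rows as-is
    fill = statistics.median(present)
    return [
        {**r, column: str(fill) if r[column].strip() == "" else r[column]}
        for r in rows
    ]


# Illustrative data: the second row has a missing age.
sample = "id,age\n1,30\n2,\n3,40\n4,50\n"
rows = list(csv.DictReader(io.StringIO(sample)))
cleaned = impute_median(rows, "age")
print([r["age"] for r in cleaned])
```

Only the blank cell changes; existing values are passed through untouched, which is the property a post-cleaning check such as validate_clean.py would presumably want to confirm.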

Reference Implementations (read patterns, write custom code)

Script | Demonstrates | Key Pattern
execute_cleaning_plan.py | Plan-driven multi-step cleaning | JSON plan with per-column strategies; --explain dry-run; --auto-checkpoint for versioned snapshots; 20-pattern mojibake map

New in v2.0.0

--auto-checkpoint Flag

execute_cleaning_plan.py now supports --auto-checkpoint, which creates a numbered snapshot (ckpt_NN_*.csv) after each successful cleaning operation:

python3 execute_cleaning_plan.py data.csv cleaned.csv --plan plan.json --auto-checkpoint
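
The numbering scheme can be pictured with a small sketch. The ckpt_NN_ prefix and .csv suffix come from the description above; everything else is an assumption made for illustration, including the helper name checkpoint_name, numbering from 1, and using the operation name as the variable part of the file name.

```python
import re


def checkpoint_name(step_index, operation):
    """Build a ckpt_NN_*.csv style name for a completed step.

    Assumption: the wildcard part is a slug of the operation name.
    The actual script may fill it differently.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", operation.lower()).strip("_")
    return f"ckpt_{step_index:02d}_{slug}.csv"


# Hypothetical sequence of cleaning operations.
steps = ["handle missing", "normalize strings", "deduplicate"]
names = [checkpoint_name(i, op) for i, op in enumerate(steps, start=1)]
print(names)
```

Zero-padding the step number keeps the snapshots in execution order when the directory is sorted by name.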

--explain Flag

Preview what a cleaning plan will do without writing any files:

python3 execute_cleaning_plan.py data.csv cleaned.csv --plan plan.json --explain

Outputs a JSON execution plan showing each step, affected columns, strategy, and estimated row impact.

Cleaning vs Synthesis Boundary

Signal | Route To
Missing numeric values | Cleaning (calculable fill)
Whitespace, encoding, case issues | Cleaning (deterministic text fixes)
Duplicates | Cleaning (row-level deduplication)
Sentinel values ("X", "N/A", "TBD") | Synthesis (magic-data-synthesis)
Missing text content | Synthesis (magic-data-synthesis)
Format conversion (HTML to markdown) | Synthesis (magic-data-synthesis)
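
The routing rules in the table could be sketched as a per-cell check. This is an illustrative assumption, not how the skills decide internally: the function route, the SENTINELS set, and the is_numeric_column flag are hypothetical, and the sketch covers only the sentinel, missing-value, and whitespace rows of the table.

```python
# Sentinel strings taken from the table above, compared case-insensitively.
SENTINELS = {"x", "n/a", "tbd"}


def route(value, is_numeric_column):
    """Return the skill that should handle a cell, or None if it looks clean."""
    raw = "" if value is None else str(value)
    text = raw.strip()
    if text.lower() in SENTINELS:
        return "magic-data-synthesis"  # placeholder needs meaningful content
    if text == "":
        # Missing numbers have a calculable fill; missing text does not.
        return "magic-data-cleaning" if is_numeric_column else "magic-data-synthesis"
    if raw != text:
        return "magic-data-cleaning"  # stray whitespace is a deterministic fix
    return None


print(route("TBD", False))
print(route(None, True))
print(route("  Alice ", False))
```

A real check would need column context to avoid treating a legitimate value such as "X" as a sentinel.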

Dependencies

pandas numpy
