magic-data-cleaning

Clean data by detecting issues, handling missing values, normalizing strings, and executing cleaning plans. Does NOT handle sentinel or placeholder values that require an LLM to fill; route those to magic-data-synthesis.

When It Activates

Use this skill when data has quality issues to fix. Trigger phrases: clean, fix, handle missing, nulls, duplicates, normalize, standardize, impute, remove outliers, fix data.

Data has missing values, duplicates, type errors, or text issues
Need to impute missing values or normalize strings
Need complex multi-step cleaning with domain-specific rules
After magic-data-profiling reveals quality issues

When NOT to Use: Use magic-data-validation for schema validation; use magic-data-transformation for reshaping; use magic-data-synthesis for LLM-based generation, translation, format conversion, or filling sentinel placeholders with meaningful content.

Quick Facts

Property	Value
Version	2.0.0
Complexity	medium
Phase	1
Scripts	5

Scripts

Scriptable Tools (call directly or read + adapt)

Script	Standard CLI Usage	When to Customize
`detect_issues.py`	`python3 detect_issues.py data.csv report.json`	`--chunk-size N` for files over 500K rows
`handle_missing.py`	`python3 handle_missing.py data.csv cleaned.csv`	`--strategy median|knn` for specific imputation; `--columns col1,col2` to restrict scope
`normalize_strings.py`	`python3 normalize_strings.py data.csv normalized.csv`	`--operations trim,encoding` to run a subset of operations; `--columns col1,col2` to restrict scope
`validate_clean.py`	`python3 validate_clean.py original.csv cleaned.csv report.json`	`--input-format` when both files are non-CSV
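
To make the "read + adapt" option concrete, here is a minimal sketch of what a median fill might look like. It is not the code in `handle_missing.py`; the function name `impute_median` and the sample data are illustrative assumptions, and it uses only the standard library so it runs without pandas.

```python
import csv
import io
import statistics


def impute_median(rows, column):
    """Fill empty cells in `column` with the median of the non-empty values.

    Hypothetical helper for illustration; not taken from handle_missing.py.
    """
    observed = [float(r[column]) for r in rows if r[column].strip() != ""]
    if not observed:
        return rows  # nothing to compute a median from; leave the column as-is
    fill = statistics.median(observed)
    return [
        {**r, column: str(fill) if r[column].strip() == "" else r[column]}
        for r in rows
    ]


# Made-up sample: the second row is missing its price.
sample = "id,price\n1,10\n2,\n3,30\n4,20\n"
rows = list(csv.DictReader(io.StringIO(sample)))
cleaned = impute_median(rows, "price")
```

A median fill is deterministic and computed from the column itself, which is why it counts as cleaning and not synthesis.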

Reference Implementations (read patterns, write custom code)

Script	Demonstrates	Key Pattern
`execute_cleaning_plan.py`	Plan-driven multi-step cleaning	JSON plan with per-column strategies; `--explain` dry-run; `--auto-checkpoint` for versioned snapshots; 20-pattern mojibake map
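
The mojibake map pattern can be sketched as follows. This is an assumption-laden illustration: the 20 patterns in `execute_cleaning_plan.py` are not listed in this document, so the five characters below are a hypothetical subset, and `fix_mojibake` is an illustrative name. The garbled keys are derived by encoding each character as UTF-8 and mis-decoding it as cp1252, which is the usual source of this kind of damage.

```python
# Hypothetical subset of characters; the real script's 20 patterns are not
# documented here.
COMMON_CHARS = ["é", "ü", "ñ", "’", "“"]

# Build garbled -> correct pairs by simulating the UTF-8-read-as-cp1252 mistake.
MOJIBAKE_MAP = {c.encode("utf-8").decode("cp1252"): c for c in COMMON_CHARS}


def fix_mojibake(text):
    """Replace known garbled sequences with the characters they stand for."""
    # Longest patterns first, so three-character sequences are handled
    # before two-character ones.
    for bad in sorted(MOJIBAKE_MAP, key=len, reverse=True):
        text = text.replace(bad, MOJIBAKE_MAP[bad])
    return text


garbled = "Señor’s café".encode("utf-8").decode("cp1252")
repaired = fix_mojibake(garbled)
```

A fixed map only repairs the sequences it lists, which keeps the fix deterministic and easy to audit.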

New in v2.0.0

`--auto-checkpoint` Flag

execute_cleaning_plan.py now supports --auto-checkpoint, which creates a numbered snapshot (ckpt_NN_*.csv) after each successful cleaning operation:

python3 execute_cleaning_plan.py data.csv cleaned.csv --plan plan.json --auto-checkpoint
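
For anyone adapting this pattern in custom code, a rough sketch of numbered snapshots follows. Only the `ckpt_NN_` prefix and the `.csv` suffix come from the description above. Using the step name for the remainder of the filename, and the names `write_checkpoint`, `steps`, and `workdir`, are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_checkpoint(rows, fieldnames, directory, step_index, step_name):
    """Write rows to ckpt_NN_<step_name>.csv and return the path.

    The <step_name> suffix is an assumed convention, not documented behavior.
    """
    path = Path(directory) / f"ckpt_{step_index:02d}_{step_name}.csv"
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


rows = [{"name": " Ada ", "age": ""}]
# Two toy cleaning operations standing in for real plan steps.
steps = [
    ("trim", lambda rs: [{**r, "name": r["name"].strip()} for r in rs]),
    ("fill_age", lambda rs: [{**r, "age": r["age"] or "0"} for r in rs]),
]

workdir = tempfile.mkdtemp()
written = []
for i, (name, op) in enumerate(steps, start=1):
    rows = op(rows)  # snapshot only after the operation has succeeded
    written.append(write_checkpoint(rows, ["name", "age"], workdir, i, name))
```

Because each snapshot is written only after its step returns, a failing step leaves the previous checkpoint as the last good state.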

`--explain` Flag

Preview what a cleaning plan will do without writing any files:

python3 execute_cleaning_plan.py data.csv cleaned.csv --plan plan.json --explain

Outputs a JSON execution plan showing each step, affected columns, strategy, and estimated row impact.

Cleaning vs Synthesis Boundary

Signal	Route To
Missing numeric values	Cleaning (calculable fill)
Whitespace, encoding, case issues	Cleaning (deterministic text fixes)
Duplicates	Cleaning (row-level deduplication)
Sentinel values ("X", "N/A", "TBD")	Synthesis (`magic-data-synthesis`)
Missing text content	Synthesis (`magic-data-synthesis`)
Format conversion (HTML to markdown)	Synthesis (`magic-data-synthesis`)
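
The routing table above can be approximated in code. The sketch below is a simplification under stated assumptions: `route_column` and `SENTINELS` are hypothetical names, the sentinel set contains only the three examples from the table, and the caller is assumed to already know whether a column is numeric.

```python
# The three sentinel examples from the table, lowercased for comparison.
SENTINELS = {"x", "n/a", "tbd"}


def route_column(values, is_numeric):
    """Return the skill that should handle a column, per the table above."""
    cells = [v.strip() for v in values]
    # Placeholder tokens need meaningful content, so they go to synthesis.
    if any(c.lower() in SENTINELS for c in cells):
        return "magic-data-synthesis"
    # Missing text cannot be calculated from the other rows.
    if not is_numeric and any(c == "" for c in cells):
        return "magic-data-synthesis"
    # Missing numbers, whitespace, and case issues are deterministic fixes.
    return "magic-data-cleaning"


numeric_gap = route_column(["10", "", "30"], is_numeric=True)
placeholder = route_column(["Ada", "TBD"], is_numeric=False)
missing_text = route_column(["good", ""], is_numeric=False)
whitespace = route_column([" a ", "b"], is_numeric=False)
```

Duplicates and format conversion are row-level or document-level signals and are left out of this per-column sketch.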

Dependencies

pandas numpy
