magic-data-cleaning
Clean data by detecting issues, handling missing values, normalizing strings, and executing cleaning plans. This skill does NOT handle sentinel/placeholder values that require an LLM; route those to magic-data-synthesis.
When It Activates
Use this skill when data has quality issues to fix. Trigger phrases: clean, fix, handle missing, nulls, duplicates, normalize, standardize, impute, remove outliers, fix data.
- Data has missing values, duplicates, type errors, or text issues
- Need to impute missing values or normalize strings
- Need complex multi-step cleaning with domain-specific rules
- After magic-data-profiling reveals quality issues
When NOT to Use: Use magic-data-validation for schema validation; use magic-data-transformation for reshaping; use magic-data-synthesis for LLM-based generation, translation, format conversion, or filling sentinel placeholders with meaningful content.
Quick Facts
| Property | Value |
|---|---|
| Version | 2.0.0 |
| Complexity | medium |
| Phase | 1 |
| Scripts | 5 |
Tags
data-science cleaning missing-values normalization deduplication
Scripts
Scriptable Tools (call directly or read + adapt)
| Script | Standard CLI Usage | When to Customize |
|---|---|---|
| detect_issues.py | python3 detect_issues.py data.csv report.json | --chunk-size N for files over 500K rows |
| handle_missing.py | python3 handle_missing.py data.csv cleaned.csv | --strategy median\|knn for specific imputation; --columns col1,col2 to restrict scope |
| normalize_strings.py | python3 normalize_strings.py data.csv normalized.csv | --operations trim,encoding to run a subset; --columns col1,col2 to restrict scope |
| validate_clean.py | python3 validate_clean.py original.csv cleaned.csv report.json | --input-format when both files are non-CSV |
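To show how the four tools might be chained, here is a minimal sketch that only builds the command lines. The step order and the intermediate file names (imputed.csv, validation.json) are assumptions for illustration; the script names and flags are taken from the table above.

```python
import shlex
from pathlib import Path


def build_pipeline(src: str, workdir: str = ".") -> list[list[str]]:
    """Return argv lists for a detect -> impute -> normalize -> validate run.

    The ordering and the intermediate file names are assumptions; only the
    script names and flags come from the table above.
    """
    out = Path(workdir)
    imputed = str(out / "imputed.csv")   # hypothetical intermediate file
    cleaned = str(out / "cleaned.csv")   # hypothetical final file
    return [
        ["python3", "detect_issues.py", src, str(out / "report.json")],
        ["python3", "handle_missing.py", src, imputed, "--strategy", "median"],
        ["python3", "normalize_strings.py", imputed, cleaned,
         "--operations", "trim,encoding"],
        ["python3", "validate_clean.py", src, cleaned,
         str(out / "validation.json")],
    ]


if __name__ == "__main__":
    for argv in build_pipeline("data.csv"):
        # Print only; pass argv to subprocess.run(argv, check=True) to execute.
        print(shlex.join(argv))
```

Building argv lists instead of shell strings keeps file names with spaces safe and makes it easy to review the plan before anything runs.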
Reference Implementations (read patterns, write custom code)
| Script | Demonstrates | Key Pattern |
|---|---|---|
| execute_cleaning_plan.py | Plan-driven multi-step cleaning | JSON plan with per-column strategies; --explain dry-run; --auto-checkpoint for versioned snapshots; 20-pattern mojibake map |
New in v2.0.0
--auto-checkpoint Flag
execute_cleaning_plan.py now supports --auto-checkpoint, which creates a numbered snapshot (ckpt_NN_*.csv) after each successful cleaning operation:
python3 execute_cleaning_plan.py data.csv cleaned.csv --plan plan.json --auto-checkpoint
--explain Flag
Preview what a cleaning plan will do without writing any files:
python3 execute_cleaning_plan.py data.csv cleaned.csv --plan plan.json --explain
Outputs a JSON execution plan showing each step, affected columns, strategy, and estimated row impact.
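As a rough illustration of how a plan-driven run with a dry-run mode and numbered checkpoints could be structured, the sketch below uses pandas with a hypothetical plan layout and strategy registry. It is not the code of execute_cleaning_plan.py, and the real plan.json schema, the --explain output fields, and the text after `ckpt_NN_` in checkpoint names may all differ.

```python
import json
from pathlib import Path

import pandas as pd

# Hypothetical plan layout; the real plan.json schema is not documented here.
PLAN = {
    "steps": [
        {"column": "age", "strategy": "median"},
        {"column": "name", "strategy": "trim"},
    ]
}

# Hypothetical registry mapping a strategy name to a column transform.
STRATEGIES = {
    "median": lambda s: s.fillna(s.median()),
    "trim": lambda s: s.str.strip(),
}


def rows_changed(before: pd.Series, after: pd.Series) -> int:
    """Count rows whose value differs, treating NaN == NaN as unchanged."""
    both_missing = before.isna() & after.isna()
    return int(((before != after) & ~both_missing).sum())


def explain(df: pd.DataFrame, plan: dict) -> list[dict]:
    """Dry run: describe each step without modifying df or writing files."""
    report = []
    for i, step in enumerate(plan["steps"], start=1):
        col = df[step["column"]]
        preview = STRATEGIES[step["strategy"]](col)
        report.append({
            "step": i,
            "column": step["column"],
            "strategy": step["strategy"],
            "estimated_rows": rows_changed(col, preview),
        })
    return report


def run(df: pd.DataFrame, plan: dict, checkpoint_dir=None) -> pd.DataFrame:
    """Apply each step in order; optionally snapshot after every step."""
    out = df.copy()
    for i, step in enumerate(plan["steps"], start=1):
        col = step["column"]
        out[col] = STRATEGIES[step["strategy"]](out[col])
        if checkpoint_dir is not None:
            # Follows the ckpt_NN_*.csv pattern; the suffix is an assumption.
            name = f"ckpt_{i:02d}_{step['strategy']}_{col}.csv"
            out.to_csv(Path(checkpoint_dir) / name, index=False)
    return out


if __name__ == "__main__":
    demo = pd.DataFrame({"age": [30.0, None, 50.0],
                         "name": [" Ann", "Bob ", "Cy"]})
    print(json.dumps(explain(demo, PLAN), indent=2))
```

Keeping the dry run as a separate function that never writes to disk makes it safe to call before every real run, and per-step snapshots let a failed step be retried from the last good checkpoint.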
Cleaning vs Synthesis Boundary
| Signal | Route To |
|---|---|
| Missing numeric values | Cleaning (calculable fill) |
| Whitespace, encoding, case issues | Cleaning (deterministic text fixes) |
| Duplicates | Cleaning (row-level deduplication) |
| Sentinel values ("X", "N/A", "TBD") | Synthesis (magic-data-synthesis) |
| Missing text content | Synthesis (magic-data-synthesis) |
| Format conversion (HTML to markdown) | Synthesis (magic-data-synthesis) |
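The routing table above can be approximated in code. The following is a minimal sketch, assuming pandas and a hypothetical helper named `route_signals`; the sentinel set is just the three examples from the table, and real datasets will likely need a longer list.

```python
import pandas as pd

# Example sentinels from the table above; extend per dataset (assumption).
SENTINELS = {"X", "N/A", "TBD"}


def route_signals(series: pd.Series) -> dict[str, str]:
    """Map each signal found in one column to the skill that should handle it."""
    found: dict[str, str] = {}
    if pd.api.types.is_numeric_dtype(series):
        if series.isna().any():
            found["missing numeric values"] = "cleaning"
        return found

    text = series.dropna().astype(str)
    if series.isna().any():
        found["missing text content"] = "magic-data-synthesis"
    if text.str.strip().isin(SENTINELS).any():
        found["sentinel values"] = "magic-data-synthesis"
    if (text != text.str.strip()).any():
        found["whitespace"] = "cleaning"
    return found


if __name__ == "__main__":
    df = pd.DataFrame({
        "price": [9.5, None, 12.0],
        "status": ["ok ", "TBD", None],
    })
    for name in df.columns:
        print(name, route_signals(df[name]))
```

A single column can raise signals for both skills, so the helper returns every signal it finds instead of one verdict per column.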
Dependencies
pandas numpy