magic-data-validation

Validate datasets against inferred or custom schemas, check cross-column constraints, detect sentinel/placeholder values, and catch statistical pitfalls.

When It Activates

Use this skill when verifying data correctness or enforcing rules. Trigger phrases: validate, check format, verify schema, enforce constraints, check for placeholders, sanity check, data quality rules.

Need to verify data quality after cleaning
Need to infer or enforce a schema
Need cross-column constraint checking
Need to detect statistical pitfalls (join explosion, Simpson's paradox, etc.)

When NOT to Use: Use magic-data-cleaning to fix issues; use magic-data-profiling for exploration.

Quick Facts

Property	Value
Version	2.0.0
Complexity	medium
Phase	1
Scripts	6

Scripts

Scriptable Tools (call directly or read + adapt)

Script	Standard CLI Usage	When to Customize
`infer_schema.py`	`python3 infer_schema.py --input data.csv --output schema.json`	`--strict` for p5-p95 bounds and ±10% row range (production schema gates)
`content_validator.py`	`python3 content_validator.py data.csv report.json`	`--distribution-check` for length variance anomalies; `--group-by col`; `--depth deep`; `--sentinel-values X,TODO` for custom list
`validate_schema.py`	`python3 validate_schema.py --input data.csv --schema schema.json --output report.json`	`--explain` for per-violation explanations
`sanity_check.py`	`python3 sanity_check.py --input data.csv --output sanity.json`	Runs all 7 pitfall checks: join explosion, survivorship bias, Simpson's paradox, look-ahead bias, selection bias, metric gaming, ecological fallacy
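
Of the pitfalls `sanity_check.py` covers, join explosion is the easiest to illustrate. The sketch below is not taken from the script; it is a minimal, standard-library illustration of one way to detect the pitfall, and the function name `join_explosion_report` and its return keys are hypothetical. It estimates how many rows an inner join would produce from the key columns alone and lists keys that are duplicated on both sides.

```python
from collections import Counter


def join_explosion_report(left_keys, right_keys):
    """Estimate inner-join row growth from the join-key columns alone.

    Hypothetical helper for illustration; not the sanity_check.py API.
    """
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)

    # Each key contributes (left occurrences x right occurrences) joined rows.
    joined_rows = sum(n * right_counts[k] for k, n in left_counts.items())

    # Keys duplicated on BOTH sides are what multiply rows unexpectedly.
    many_to_many = sorted(
        k for k, n in left_counts.items() if n > 1 and right_counts[k] > 1
    )

    return {
        "left_rows": len(left_keys),
        "joined_rows": joined_rows,
        "many_to_many_keys": many_to_many,
        # Heuristic only: a legitimate one-to-many join also grows the row count.
        "exploded": joined_rows > len(left_keys),
    }


report = join_explosion_report(
    left_keys=["a", "a", "b", "c"],
    right_keys=["a", "a", "b"],
)
print(report)
```

Here key `a` appears twice on each side, so it alone yields four joined rows; checking `many_to_many_keys` before merging is usually cheaper than diagnosing inflated aggregates afterwards.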

Reference Implementations (read patterns, write custom code)

Script	Demonstrates	Key Pattern
`check_constraints.py`	Cross-column constraint checking	Typed constraint dispatching (comparison, vocabulary, conditional_null, unique_together)
`validate_statistics.py`	Internal consistency of statistical results	Requires cross-skill artifact (`descriptive_stats.py` JSON output); tolerance-gated stat comparison
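
The "typed constraint dispatching" pattern named for `check_constraints.py` can be sketched as a lookup from constraint type to handler function. This is an assumption-laden illustration, not the script's code: the constraint dictionary layout (`type`, `name`, `left`, `op`, and so on), the handler names, and the exact semantics of `conditional_null` (here: when one column equals a given value, another column must be null) are all hypothetical. Rows are plain dictionaries to keep the sketch dependency-free.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq}


def check_comparison(rows, c):
    # Flag rows where `left <op> right` does not hold (nulls are skipped).
    op = OPS[c["op"]]
    return [i for i, r in enumerate(rows)
            if r[c["left"]] is not None and r[c["right"]] is not None
            and not op(r[c["left"]], r[c["right"]])]


def check_vocabulary(rows, c):
    # Flag rows whose value is outside the allowed set.
    allowed = set(c["allowed"])
    return [i for i, r in enumerate(rows) if r[c["column"]] not in allowed]


def check_conditional_null(rows, c):
    # Assumed semantics: when `when_column` == `when_value`, `column` must be null.
    return [i for i, r in enumerate(rows)
            if r[c["when_column"]] == c["when_value"] and r[c["column"]] is not None]


def check_unique_together(rows, c):
    # Flag every row that repeats an already-seen combination of columns.
    seen, violations = set(), []
    for i, r in enumerate(rows):
        key = tuple(r[col] for col in c["columns"])
        if key in seen:
            violations.append(i)
        seen.add(key)
    return violations


# The dispatch table: one handler per constraint type.
DISPATCH = {
    "comparison": check_comparison,
    "vocabulary": check_vocabulary,
    "conditional_null": check_conditional_null,
    "unique_together": check_unique_together,
}


def run_constraints(rows, constraints):
    results = {}
    for c in constraints:
        handler = DISPATCH.get(c["type"])
        if handler is None:
            raise ValueError(f"unknown constraint type: {c['type']}")
        results[c["name"]] = handler(rows, c)
    return results


rows = [
    {"id": 1, "start": 1, "end": 5, "status": "open", "closed_at": None},
    {"id": 2, "start": 7, "end": 3, "status": "closed", "closed_at": 9},
    {"id": 2, "start": 7, "end": 3, "status": "archived", "closed_at": None},
    {"id": 3, "start": 2, "end": 4, "status": "open", "closed_at": 8},
]
constraints = [
    {"type": "comparison", "name": "start_before_end",
     "left": "start", "op": "<=", "right": "end"},
    {"type": "vocabulary", "name": "known_status",
     "column": "status", "allowed": ["open", "closed"]},
    {"type": "conditional_null", "name": "open_has_no_close",
     "when_column": "status", "when_value": "open", "column": "closed_at"},
    {"type": "unique_together", "name": "unique_id_start",
     "columns": ["id", "start"]},
]
violations = run_constraints(rows, constraints)
print(violations)
```

Adding a new constraint type under this design means writing one handler and registering it in `DISPATCH`; the runner itself does not change.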

New in v2.0.0

content_validator.py — Three-Layer Content Validation

content_validator.py provides three distinct layers for text column validation:

Sentinel/placeholder detection — catches "N/A", "TBD", "TODO", "placeholder", single spaces, and custom lists
Length anomaly detection — flags columns where content length variance suggests quality issues
Column uniformity checks — detects suspicious uniformity (too many identical values)

# Basic sentinel detection
python3 content_validator.py data.csv report.json

# Deep inspection with length anomaly detection
python3 content_validator.py data.csv report.json --depth deep --distribution-check

# Custom sentinel list
python3 content_validator.py data.csv report.json --sentinel-values "X,TODO,FIXME,unknown"
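
To make the first layer concrete, here is a minimal sketch of sentinel detection over a single text column. It is not the implementation in `content_validator.py`: the function name `find_sentinels`, the case-insensitive matching, and the treatment of any whitespace-only string as a placeholder are assumptions made for illustration. The default list mirrors the examples given above.

```python
# Defaults taken from the examples listed in this section, lowercased.
DEFAULT_SENTINELS = {"n/a", "tbd", "todo", "placeholder"}


def find_sentinels(values, extra=()):
    """Return (index, value) pairs for entries that look like placeholders.

    Hypothetical helper; matching is case-insensitive and ignores
    surrounding whitespace, which is an assumption of this sketch.
    """
    sentinels = DEFAULT_SENTINELS | {s.strip().lower() for s in extra}
    hits = []
    for i, value in enumerate(values):
        if not isinstance(value, str):
            continue  # nulls and non-text values are out of scope here
        # str.isspace() is True only for non-empty, whitespace-only strings.
        if value.isspace() or value.strip().lower() in sentinels:
            hits.append((i, value))
    return hits


column = ["Alice", "N/A", " ", "tbd", "Bob", "FIXME", None, ""]

# A custom list in the same comma-separated form as --sentinel-values.
custom = "X,TODO,FIXME,unknown".split(",")

hits = find_sentinels(column, extra=custom)
print(hits)
```

In this sketch the empty string is deliberately not flagged, on the assumption that true missing values are handled by null checks rather than sentinel detection.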

validate_statistics.py — Statistical Consistency Checking

validate_statistics.py recomputes descriptive statistics and compares them against a previously reported stats JSON. Use when auditing analysis results or verifying that reported statistics match actual data after a transformation.

python3 validate_statistics.py data.csv stats_report.json validation_result.json
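
The recompute-and-compare idea can be sketched in a few lines. This is an illustration under stated assumptions, not the script's logic: the reported stats are assumed to be a flat mapping of statistic name to number (the real `descriptive_stats.py` JSON layout is not shown here), and the names `recompute`, `compare_stats`, and the `rel_tol` default are hypothetical.

```python
import math
import statistics


def recompute(values):
    """Recompute a small set of descriptive statistics from raw values."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }


def compare_stats(values, reported, rel_tol=1e-6):
    """Return the reported statistics that disagree with the data.

    A statistic passes when it is within `rel_tol` (relative tolerance)
    of the recomputed value; unknown statistic names are ignored.
    """
    actual = recompute(values)
    mismatches = {}
    for name, claimed in reported.items():
        if name not in actual:
            continue
        if not math.isclose(actual[name], claimed, rel_tol=rel_tol):
            mismatches[name] = {"reported": claimed, "actual": actual[name]}
    return mismatches


values = [2.0, 4.0, 4.0, 6.0]
reported = {"count": 4, "mean": 4.0, "max": 6.5}  # "max" is deliberately wrong

mismatches = compare_stats(values, reported)
print(mismatches)
```

The tolerance gate matters because recomputed floating-point statistics rarely match a rounded, serialized report exactly; an equality check would flag harmless rounding as a violation.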

Validation Ordering

Always run validation in this order: schema → constraints → content → sanity. Each layer catches what the previous one misses.
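
One way to enforce this ordering in custom code is to fix the sequence in a single place and run whatever checks are supplied against it. The sketch below is hypothetical: `ORDER`, `run_in_order`, and the toy check functions are illustrative names, not part of the skill's scripts.

```python
ORDER = ("schema", "constraints", "content", "sanity")


def run_in_order(checks, data):
    """Run the supplied checks in the fixed ORDER, whatever order they were given in."""
    unknown = set(checks) - set(ORDER)
    if unknown:
        raise ValueError(f"unknown stages: {sorted(unknown)}")
    report = {}
    for stage in ORDER:
        if stage in checks:
            report[stage] = checks[stage](data)
    return report


rows = [
    {"name": "Ann", "age": 31},
    {"name": "TBD", "age": "n/a"},
]

# Toy checks, deliberately registered out of order; each returns violating row indices.
checks = {
    "content": lambda rs: [i for i, r in enumerate(rs)
                           if r["name"].upper() in {"TBD", "N/A"}],
    "schema": lambda rs: [i for i, r in enumerate(rs)
                          if not isinstance(r["age"], int)],
}

report = run_in_order(checks, rows)
print(report)
```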

Dependencies

pandas numpy
