magic-data-validation
Validate datasets against inferred or custom schemas, check cross-column constraints, detect sentinel/placeholder values, and catch statistical pitfalls.
When It Activates
Use this skill when verifying data correctness or enforcing rules. Trigger phrases: validate, check format, verify schema, enforce constraints, check for placeholders, sanity check, data quality rules.
- You need to verify data quality after cleaning
- You need to infer or enforce a schema
- You need to check cross-column constraints
- You need to detect statistical pitfalls (join explosion, Simpson's paradox, etc.)
When NOT to Use: Use magic-data-cleaning to fix issues; use magic-data-profiling for exploration.
Quick Facts
| Property | Value |
|---|---|
| Version | 2.0.0 |
| Complexity | medium |
| Phase | 1 |
| Scripts | 6 |
Tags
data-science validation schema constraints pitfalls
Scripts
Scriptable Tools (call directly or read + adapt)
| Script | Standard CLI Usage | When to Customize |
|---|---|---|
| infer_schema.py | python3 infer_schema.py --input data.csv --output schema.json | --strict for p5-p95 bounds and ±10% row range (production schema gates) |
| content_validator.py | python3 content_validator.py data.csv report.json | --distribution-check for length variance anomalies; --group-by col; --depth deep; --sentinel-values X,TODO for custom list |
| validate_schema.py | python3 validate_schema.py --input data.csv --schema schema.json --output report.json | --explain for per-violation explanations |
| sanity_check.py | python3 sanity_check.py --input data.csv --output sanity.json | Runs all 7 pitfalls: join explosion, survivorship bias, Simpson's paradox, look-ahead, selection bias, metric gaming, ecological fallacy |
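As a rough illustration of what a strict inferred schema might contain, the sketch below derives 5th to 95th percentile bounds for numeric columns and a ±10% window around the observed row count. It does not reproduce infer_schema.py: the function names, the schema layout, and the reading of "p5-p95 bounds" and "±10% row range" are assumptions, and it uses only the standard library rather than pandas.

```python
import statistics


def infer_numeric_bounds(values, strict=False):
    """Return (low, high) for a numeric column.

    strict=True uses the 5th and 95th percentiles (assumed meaning of
    "p5-p95 bounds"); otherwise the observed min and max are used.
    """
    if strict and len(values) >= 2:
        # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
        cuts = statistics.quantiles(values, n=20, method="inclusive")
        return cuts[0], cuts[-1]
    return min(values), max(values)


def infer_schema(rows, strict=False):
    """Infer a simple schema from a list of dict rows with string values.

    The returned layout is hypothetical, not the real schema.json format.
    """
    schema = {"columns": {}}
    if strict:
        # Assumed meaning of "±10% row range": accepted row counts
        # within 10% of the count seen at inference time.
        n = len(rows)
        schema["row_count"] = {"min": round(n * 0.9), "max": round(n * 1.1)}
    for name in rows[0]:
        raw = [r[name] for r in rows if r[name] != ""]
        if not raw:
            schema["columns"][name] = {"type": "unknown"}
            continue
        try:
            nums = [float(v) for v in raw]
        except ValueError:
            # At least one value is not numeric: treat the column as text.
            schema["columns"][name] = {"type": "string"}
            continue
        low, high = infer_numeric_bounds(nums, strict)
        schema["columns"][name] = {"type": "number", "min": low, "max": high}
    return schema
```

Rows in this shape can be produced with `csv.DictReader`. Percentile bounds are tighter than min/max, so a strict schema of this kind would reject outliers that a permissive one accepts.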
Reference Implementations (read patterns, write custom code)
| Script | Demonstrates | Key Pattern |
|---|---|---|
| check_constraints.py | Cross-column constraint checking | Typed constraint dispatching (comparison, vocabulary, conditional_null, unique_together) |
| validate_statistics.py | Internal consistency of statistical results | Requires cross-skill artifact (descriptive_stats.py JSON output); tolerance-gated stat comparison |
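The typed constraint dispatching pattern named in the table can be sketched as a lookup from constraint type to checker function. This is a minimal illustration, not the code in check_constraints.py: the constraint dictionary keys (`left`, `op`, `right`, `allowed`, `when_column`, `when_value`, `columns`) are hypothetical, and `conditional_null` is assumed to mean "this column must be null whenever the condition holds".

```python
import operator

# Comparison operators supported by the hypothetical "comparison" type.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq}


def check_comparison(rows, c):
    """Rows where left <op> right is false (rows with nulls are skipped)."""
    op = OPS[c["op"]]
    return [i for i, r in enumerate(rows)
            if r[c["left"]] is not None and r[c["right"]] is not None
            and not op(r[c["left"]], r[c["right"]])]


def check_vocabulary(rows, c):
    """Rows whose value is outside the allowed vocabulary."""
    allowed = set(c["allowed"])
    return [i for i, r in enumerate(rows) if r[c["column"]] not in allowed]


def check_conditional_null(rows, c):
    """Rows where the condition holds but the column is not null (assumed semantics)."""
    return [i for i, r in enumerate(rows)
            if r[c["when_column"]] == c["when_value"]
            and r[c["column"]] is not None]


def check_unique_together(rows, c):
    """Rows that repeat an earlier combination of the listed columns."""
    seen, duplicates = set(), []
    for i, r in enumerate(rows):
        key = tuple(r[col] for col in c["columns"])
        if key in seen:
            duplicates.append(i)
        seen.add(key)
    return duplicates


# The dispatch table: one checker per constraint type.
CHECKERS = {
    "comparison": check_comparison,
    "vocabulary": check_vocabulary,
    "conditional_null": check_conditional_null,
    "unique_together": check_unique_together,
}


def check_constraints(rows, constraints):
    """Run each constraint through its checker and collect violating row indices."""
    report = []
    for c in constraints:
        checker = CHECKERS.get(c["type"])
        if checker is None:
            raise ValueError(f"unknown constraint type: {c['type']}")
        report.append({"constraint": c, "violations": checker(rows, c)})
    return report
```

Adding a new constraint type then means writing one function and registering it in `CHECKERS`, without touching the loop that drives the checks.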
New in v2.0.0
content_validator.py — Three-Layer Content Validation
content_validator.py provides three distinct layers for text column validation:
- Sentinel/placeholder detection — catches "N/A", "TBD", "TODO", "placeholder", single spaces, and custom lists
- Length anomaly detection — flags columns where content length variance suggests quality issues
- Column uniformity checks — detects suspicious uniformity (too many identical values)
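The three layers above can be approximated in a few lines of standard-library Python. This is a sketch under stated assumptions, not the logic of content_validator.py: the default sentinel set, the 0.9 uniformity threshold, and the use of a coefficient of variation for length spread are all illustrative choices.

```python
import statistics
from collections import Counter

# Illustrative defaults taken from the examples listed above.
DEFAULT_SENTINELS = {"n/a", "tbd", "todo", "placeholder"}


def find_sentinels(values, extra=()):
    """Return indices of values that look like placeholders.

    Matching is case-insensitive and ignores surrounding whitespace.
    A non-empty, whitespace-only value (such as a single space) also counts;
    a truly empty string is treated as missing, not as a sentinel.
    """
    sentinels = DEFAULT_SENTINELS | {s.strip().lower() for s in extra}
    hits = []
    for i, v in enumerate(values):
        stripped = v.strip()
        if (v != "" and stripped == "") or stripped.lower() in sentinels:
            hits.append(i)
    return hits


def length_variation(values):
    """Coefficient of variation of string lengths (0.0 when the mean length is 0)."""
    lengths = [len(v) for v in values]
    mean = statistics.fmean(lengths)
    return statistics.pstdev(lengths) / mean if mean else 0.0


def is_suspiciously_uniform(values, threshold=0.9):
    """True when the most common value covers at least `threshold` of the column."""
    if not values:
        return False
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values) >= threshold
```

For example, `find_sentinels(["ok", "N/A", " "])` flags the second and third entries. What counts as a high `length_variation` depends on the column, so any cutoff should be tuned per dataset.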
# Basic sentinel detection
python3 content_validator.py data.csv report.json
# Deep inspection with length anomaly detection
python3 content_validator.py data.csv report.json --depth deep --distribution-check
# Custom sentinel list
python3 content_validator.py data.csv report.json --sentinel-values "X,TODO,FIXME,unknown"
validate_statistics.py: Statistical Consistency Checking
validate_statistics.py recomputes descriptive statistics and compares them against a previously reported stats JSON. Use when auditing analysis results or verifying that reported statistics match actual data after a transformation.
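The recompute-and-compare idea described above can be sketched as a tolerance-gated comparison. The flat `{"mean": ..., "std": ...}` layout used here is a hypothetical stand-in for the descriptive_stats.py JSON output, whose real structure is not shown in this document, and the tolerance defaults are assumptions.

```python
import math
import statistics


def recompute_stats(values):
    """Recompute a few descriptive statistics from raw numeric values."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def compare_stats(reported, recomputed, rel_tol=1e-6, abs_tol=1e-12):
    """Return the reported statistics that disagree with the recomputed ones.

    A statistic passes when math.isclose accepts the pair under the given
    tolerances; statistics missing from `recomputed` are reported as mismatches.
    """
    mismatches = {}
    for key, expected in reported.items():
        actual = recomputed.get(key)
        if actual is None or not math.isclose(
            expected, actual, rel_tol=rel_tol, abs_tol=abs_tol
        ):
            mismatches[key] = {"reported": expected, "recomputed": actual}
    return mismatches
```

A relative tolerance absorbs rounding in reported figures (for example a standard deviation printed to two decimals) while still catching statistics that went stale after a transformation.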
python3 validate_statistics.py data.csv stats_report.json validation_result.json
Validation Ordering
Always follow: schema → constraints → content → sanity. Each layer catches what the previous one misses.
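One way to keep that order fixed in custom code is a small runner that accepts one callable per layer and always invokes them in the same sequence. The runner, its `stages` argument, and the issue-list return shape are hypothetical; in practice each callable might wrap one of the scripts above or a custom check.

```python
# The layer order stated above; the names are labels, not script names.
STAGE_ORDER = ("schema", "constraints", "content", "sanity")


def run_validation(data, stages):
    """Run every validation layer in the fixed order.

    `stages` maps each layer name to a callable taking `data` and returning
    a list of issue strings. All four layers must be supplied so that no
    layer is skipped silently.
    """
    missing = [name for name in STAGE_ORDER if name not in stages]
    if missing:
        raise ValueError(f"missing validation stages: {missing}")
    report = {}
    for name in STAGE_ORDER:
        report[name] = stages[name](data)
    return report
```

This version runs every layer even when an earlier one reports issues; stopping at the first failing layer is an equally reasonable design if later checks are expensive.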
Dependencies
pandas numpy