magic-data-validation

Validate datasets against inferred or custom schemas, check cross-column constraints, detect sentinel/placeholder values, and catch statistical pitfalls.

When It Activates

Use this skill when verifying data correctness or enforcing rules. Trigger phrases: validate, check format, verify schema, enforce constraints, check for placeholders, sanity check, data quality rules.

  • You have cleaned the data and need to verify its quality
  • You need to infer or enforce a schema
  • You need to check cross-column constraints
  • You need to detect statistical pitfalls (join explosion, Simpson's paradox, etc.)

When NOT to Use: Use magic-data-cleaning to fix issues; use magic-data-profiling for exploration.

Quick Facts

Property | Value
Version | 2.0.0
Complexity | medium
Phase | 1
Scripts | 6

Tags

data-science validation schema constraints pitfalls

Scripts

Scriptable Tools (call directly or read + adapt)

Script | Standard CLI Usage | When to Customize
infer_schema.py | python3 infer_schema.py --input data.csv --output schema.json | --strict for p5-p95 bounds and ±10% row range (production schema gates)
content_validator.py | python3 content_validator.py data.csv report.json | --distribution-check for length variance anomalies; --group-by col; --depth deep; --sentinel-values X,TODO for custom list
validate_schema.py | python3 validate_schema.py --input data.csv --schema schema.json --output report.json | --explain for per-violation explanations
sanity_check.py | python3 sanity_check.py --input data.csv --output sanity.json | Runs all 7 pitfalls: join explosion, survivorship bias, Simpson's paradox, look-ahead, selection bias, metric gaming, ecological fallacy
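
To make the --strict description more concrete, here is a minimal sketch of what "p5-p95 bounds and ±10% row range" could mean for a single numeric column. It is not taken from infer_schema.py: the function name infer_strict_bounds, the output keys, and the percentile interpolation method are all assumptions, and plain Python is used only to keep the example dependency-free.

```python
import statistics

def infer_strict_bounds(values, row_count, row_tolerance=0.10):
    """Hypothetical helper: derive strict bounds for one numeric column."""
    # quantiles(n=20) returns 19 cut points; the first is p5, the last is p95.
    # method="inclusive" is an assumption about how percentiles are interpolated.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return {
        "min": cuts[0],                                        # p5 lower bound
        "max": cuts[-1],                                       # p95 upper bound
        "min_rows": round(row_count * (1 - row_tolerance)),    # -10% rows
        "max_rows": round(row_count * (1 + row_tolerance)),    # +10% rows
    }

# Illustrative data: the integers 1..100 in a dataset of 1000 rows.
bounds = infer_strict_bounds(list(range(1, 101)), row_count=1000)
print(bounds)
```

A gate built this way rejects a new batch whose values fall outside the central 90% of the reference data or whose row count drifts by more than 10%, which is stricter than plain min/max bounds.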

Reference Implementations (read patterns, write custom code)

Script | Demonstrates | Key Pattern
check_constraints.py | Cross-column constraint checking | Typed constraint dispatching (comparison, vocabulary, conditional_null, unique_together)
validate_statistics.py | Internal consistency of statistical results | Requires cross-skill artifact (descriptive_stats.py JSON output); tolerance-gated stat comparison
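
The "typed constraint dispatching" pattern named for check_constraints.py can be sketched as a lookup table from constraint type to handler function. The sketch below is illustrative only: the constraint dictionary layout, the handler names, and the exact semantics of each type (in particular conditional_null) are assumptions, not the script's actual format. Rows are plain dictionaries to keep the example dependency-free.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def check_comparison(rows, c):
    # Rows with a missing operand are skipped rather than flagged.
    op = OPS[c["op"]]
    return [i for i, r in enumerate(rows)
            if r[c["left"]] is not None and r[c["right"]] is not None
            and not op(r[c["left"]], r[c["right"]])]

def check_vocabulary(rows, c):
    allowed = set(c["allowed"])
    return [i for i, r in enumerate(rows) if r[c["column"]] not in allowed]

def check_conditional_null(rows, c):
    # Assumed meaning: when when_column equals when_value, column must be null.
    return [i for i, r in enumerate(rows)
            if r[c["when_column"]] == c["when_value"]
            and r[c["column"]] is not None]

def check_unique_together(rows, c):
    seen, bad = set(), []
    for i, r in enumerate(rows):
        key = tuple(r[col] for col in c["columns"])
        if key in seen:
            bad.append(i)  # flag the repeat, not the first occurrence
        seen.add(key)
    return bad

# The dispatch table: one handler per constraint type.
HANDLERS = {
    "comparison": check_comparison,
    "vocabulary": check_vocabulary,
    "conditional_null": check_conditional_null,
    "unique_together": check_unique_together,
}

def check_constraints(rows, constraints):
    report = []
    for c in constraints:
        handler = HANDLERS.get(c["type"])
        if handler is None:
            raise ValueError(f"unknown constraint type: {c['type']}")
        report.append({"type": c["type"], "violations": handler(rows, c)})
    return report

rows = [
    {"id": 1, "status": "open", "start": 1, "end": 5, "closed_at": None},
    {"id": 2, "status": "closed", "start": 7, "end": 3, "closed_at": 9},
    {"id": 2, "status": "archived", "start": 2, "end": 3, "closed_at": None},
    {"id": 3, "status": "open", "start": 2, "end": 4, "closed_at": 8},
]
constraints = [
    {"type": "comparison", "left": "start", "op": "<=", "right": "end"},
    {"type": "vocabulary", "column": "status", "allowed": ["open", "closed"]},
    {"type": "conditional_null", "when_column": "status",
     "when_value": "open", "column": "closed_at"},
    {"type": "unique_together", "columns": ["id", "end"]},
]
report = check_constraints(rows, constraints)
```

Adding a new constraint type then means writing one handler and registering it in HANDLERS, without touching the loop in check_constraints.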

New in v2.0.0

content_validator.py — Three-Layer Content Validation

content_validator.py provides three distinct layers for text column validation:

  1. Sentinel/placeholder detection — catches "N/A", "TBD", "TODO", "placeholder", single spaces, and custom lists
  2. Length anomaly detection — flags columns where content length variance suggests quality issues
  3. Column uniformity checks — detects suspicious uniformity (too many identical values)

# Basic sentinel detection
python3 content_validator.py data.csv report.json

# Deep inspection with length anomaly detection
python3 content_validator.py data.csv report.json --depth deep --distribution-check

# Custom sentinel list
python3 content_validator.py data.csv report.json --sentinel-values "X,TODO,FIXME,unknown"
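
For readers adapting the script, the first and third layers can be approximated in a few lines. This is a rough sketch, not the code in content_validator.py: the function names, the case-insensitive matching, and the treatment of any whitespace-only string as a placeholder are assumptions.

```python
from collections import Counter

# Default list taken from the layer description above.
DEFAULT_SENTINELS = {"n/a", "tbd", "todo", "placeholder"}

def find_sentinels(values, extra=()):
    """Hypothetical helper: return indices of sentinel/placeholder values."""
    sentinels = DEFAULT_SENTINELS | {s.lower() for s in extra}
    hits = []
    for i, v in enumerate(values):
        if not isinstance(v, str):
            continue
        # Whitespace-only strings (such as a single space) count as placeholders.
        if v.isspace() or v.strip().lower() in sentinels:
            hits.append(i)
    return hits

def top_value_share(values):
    """Hypothetical helper: fraction of rows taken by the most common value."""
    if not values:
        return 0.0
    return Counter(values).most_common(1)[0][1] / len(values)

column = ["Alice", "N/A", " ", "todo", "Bob", "FIXME"]
default_hits = find_sentinels(column)
custom_hits = find_sentinels(column, extra=["FIXME", "unknown"])
share = top_value_share(["a", "a", "a", "b"])
```

A uniformity check would compare the share against a threshold; the source does not state which threshold the script uses, so none is assumed here.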

validate_statistics.py — Statistical Consistency Checking

validate_statistics.py recomputes descriptive statistics and compares them against a previously reported stats JSON. Use when auditing analysis results or verifying that reported statistics match actual data after a transformation.

python3 validate_statistics.py data.csv stats_report.json validation_result.json
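
The idea of a tolerance-gated comparison can be sketched as follows. The stat names, the flat dictionary layout of the reported stats, and the tolerance value are assumptions; the actual JSON produced by descriptive_stats.py may be structured differently.

```python
import math
import statistics

def compare_stats(values, reported, rel_tol=1e-6):
    """Hypothetical helper: recompute stats and list those that disagree."""
    recomputed = {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample std (ddof=1), an assumption
        "min": min(values),
        "max": max(values),
    }
    mismatches = {}
    for name, claimed in reported.items():
        actual = recomputed.get(name)
        # A stat passes only if it is known and within the relative tolerance.
        if actual is None or not math.isclose(actual, claimed, rel_tol=rel_tol):
            mismatches[name] = {"reported": claimed, "recomputed": actual}
    return mismatches

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# The reported std of 2.0 is the population value; the sample value is larger.
mismatches = compare_stats(values, {"count": 8, "mean": 5.0, "std": 2.0})
print(mismatches)
```

The example also shows a common cause of mismatches: a report computed with population standard deviation checked against a recomputation that uses the sample formula.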

Validation Ordering

Always follow: schema → constraints → content → sanity. Each layer catches what the previous one misses.
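
One way to wire this ordering together is a small driver that builds the documented CLI commands in sequence. This is a sketch under assumptions: the report file names are hypothetical, the constraints stage is left as a placeholder because check_constraints.py is a reference implementation with no documented CLI, and stopping on a non-zero exit code assumes the scripts signal failure that way.

```python
import subprocess

def build_pipeline(data, schema, out_prefix):
    """Return (stage, command) pairs in the order schema, constraints, content, sanity."""
    return [
        ("schema", ["python3", "validate_schema.py", "--input", data,
                    "--schema", schema, "--output", f"{out_prefix}_schema.json"]),
        # No documented CLI: plug in custom code adapted from check_constraints.py.
        ("constraints", None),
        ("content", ["python3", "content_validator.py", data,
                     f"{out_prefix}_content.json"]),
        ("sanity", ["python3", "sanity_check.py", "--input", data,
                    "--output", f"{out_prefix}_sanity.json"]),
    ]

def run_pipeline(stages):
    """Run each stage in order; raises CalledProcessError on the first failure."""
    for name, command in stages:
        if command is None:
            print(f"[{name}] skipped: custom code required")
            continue
        print(f"[{name}] {' '.join(command)}")
        subprocess.run(command, check=True)

stages = build_pipeline("data.csv", "schema.json", "report")
stage_names = [name for name, _ in stages]
# Call run_pipeline(stages) from the directory that contains the scripts.
```

Running the stages in this order means later checks only see data that has already passed the cheaper structural checks.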

Dependencies

pandas numpy
