magic-data-profiling

Profile datasets — run quality scoring, distribution analysis, outlier detection, and issue detection. Use when assessing data quality, getting a quality overview, or profiling before cleaning.

When It Activates

Use this skill when the user wants to understand data quality, distributions, or characteristics. Trigger phrases: profile, quality, check quality, assess, what types, categorize, classify, outliers, distributions, summarize data.

Need to understand data characteristics before cleaning or analysis
Need distribution analysis (skewness, normality tests)
Need to detect outliers or assess data quality
Need correlation analysis with significance testing
Need to discover categorical groupings or classify value types

When NOT to Use: Use magic-statistical-analysis for hypothesis testing and magic-data-exploration for pattern discovery. If the data has already been profiled, re-profile only after transformations.

Quick Facts

Property	Value
Version	2.0.0
Complexity	medium
Phase	1
Scripts	8

Scripts

Scriptable Tools (call directly or read + adapt)

Script	Standard CLI Usage	When to Customize
`quality_score.py`	`python3 quality_score.py data.parquet logs/quality.json`	Custom dimension weights, additional dimensions, domain-specific thresholds
`detect_all_issues.py`	`python3 detect_all_issues.py data.parquet report.json`	`--include-content-validation` for sentinel checks; `--sentinel-patterns` for custom list
`distribution_analysis.py`	`python3 distribution_analysis.py data.csv dist.json`	`--columns col1,col2` to limit scope on wide datasets
`outlier_detection.py`	`python3 outlier_detection.py data.csv outliers.json`	`--method zscore --threshold 3.0` for normal data; `--method both` for dual detection
`correlation_matrix.py`	`python3 correlation_matrix.py data.csv corr.json`	`--method pearson|spearman` to override auto; `--columns` to limit scope. Outputs JSON + CSV matrix + PNG heatmap
`deep_quality_analysis.py`	`python3 deep_quality_analysis.py data.csv analysis.json`	`--depth deep` for full investigation; `--columns` for targeted analysis; `--sample N` for large datasets
`detect_categories.py`	`python3 detect_categories.py --input data.csv --output cats.json`	`--column name` to override auto-selection; `--method tfidf_kmeans` to force clustering
`classify_answers.py`	`python3 classify_answers.py --input data.csv --output classify.json`	`--column col` when auto-selection picks wrong column; `--sample N` for large datasets
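
The `--method zscore --threshold 3.0` option in the table above names a standard technique. As a rough illustration only, z-score flagging on a single numeric column can be sketched with the standard library. This is not the code inside `outlier_detection.py`: the function name `zscore_outliers` is hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose absolute z-score exceeds threshold.

    Hypothetical helper for illustration; the real script may differ,
    for example in how it handles missing values or which std it uses.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return []  # constant column: no value deviates from the mean
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]
```

With the population standard deviation, the largest possible z-score in a column of n values is sqrt(n - 1), so a threshold of 3.0 can never flag anything in a column of 10 or fewer values. That is one reason to treat z-score results on very small datasets with caution.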

New in v2.0.0

detect_all_issues.py — Combined Meta-Profiler

detect_all_issues.py runs the quality, distribution, outlier, correlation, category, and answer-classification analyses in a single pass and writes one JSON report with the results nested under top-level keys: {sentinels, quality, distributions, outliers, correlations, categories, answer_classification, errors}.

Use this instead of running each analysis script individually.

python3 detect_all_issues.py data.parquet report.json

# Include sentinel/placeholder detection
python3 detect_all_issues.py data.parquet report.json --include-content-validation
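
Once the report is written, a quick check of which sections are present can save a manual read through the JSON. The sketch below relies only on the top-level keys listed above. The helper names `summarize_report` and `load_report` are hypothetical, and the internal shape of each section (including `errors`) is not documented here, so the code treats every section as opaque JSON.

```python
import json

# Top-level keys as listed in this document for the combined report.
EXPECTED_KEYS = [
    "sentinels", "quality", "distributions", "outliers",
    "correlations", "categories", "answer_classification", "errors",
]


def summarize_report(report):
    """Split expected keys into present and missing, and flag recorded errors."""
    present = [k for k in EXPECTED_KEYS if k in report]
    missing = [k for k in EXPECTED_KEYS if k not in report]
    # Assumption: any non-empty value under "errors" means something failed.
    has_errors = bool(report.get("errors"))
    return {"present": present, "missing": missing, "has_errors": has_errors}


def load_report(path):
    """Read a report file such as report.json into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `summarize_report(load_report("report.json"))` returns a small dict that can be printed or asserted on in a pipeline step before any section is read in detail.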

Do not run detect_all_issues.py on datasets over 1M rows without sampling first: it runs 6 sub-analyses sequentially, which can take 30+ minutes and risks running out of memory (OOM) on the correlation heatmaps.
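
One way to sample first is a reservoir sample, which keeps a fixed number of rows while streaming through the file, so memory use stays flat. This is a sketch under assumptions: the function name `sample_csv` is hypothetical, the input is taken to be a CSV with a header row, and whether detect_all_issues.py accepts CSV input is not stated here (its example uses Parquet).

```python
import csv
import random


def sample_csv(src_path, dst_path, n_rows, seed=0):
    """Write a uniform random sample of n_rows data rows from src to dst.

    Uses reservoir sampling (Algorithm R), so at most n_rows rows are
    held in memory. Returns the number of rows written.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    reservoir = []
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)  # assumes the first line is a header row
        for i, row in enumerate(reader):
            if i < n_rows:
                reservoir.append(row)
            else:
                j = rng.randint(0, i)  # inclusive on both ends
                if j < n_rows:
                    reservoir[j] = row
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

For Parquet input, pandas `DataFrame.sample(n=..., random_state=...)` serves the same purpose once the frame is loaded, at the cost of reading the full file into memory.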

Dependencies

pandas numpy scipy matplotlib seaborn

