magic-data-profiling
Profile datasets — run quality scoring, distribution analysis, outlier detection, and issue detection. Use when assessing data quality, getting a quality overview, or profiling before cleaning.
When It Activates
Use this skill when the user wants to understand data quality, distributions, or characteristics. Trigger phrases: profile, quality, check quality, assess, what types, categorize, classify, outliers, distributions, summarize data.
- Need to understand data characteristics before cleaning or analysis
- Need distribution analysis (skewness, normality tests)
- Need to detect outliers or assess data quality
- Need correlation analysis with significance testing
- Need to discover categorical groupings or classify value types
When NOT to Use: Use magic-statistical-analysis for hypothesis testing and magic-data-exploration for pattern discovery. If the data has already been profiled, re-profile only after transformations.
Quick Facts
| Property | Value |
|---|---|
| Version | 2.0.0 |
| Complexity | medium |
| Phase | 1 |
| Scripts | 8 |
Tags
data-science profiling statistics quality eda
Scripts
Scriptable Tools (call directly or read + adapt)
| Script | Standard CLI Usage | When to Customize |
|---|---|---|
| quality_score.py | python3 quality_score.py data.parquet logs/quality.json | Custom dimension weights, additional dimensions, domain-specific thresholds |
| detect_all_issues.py | python3 detect_all_issues.py data.parquet report.json | --include-content-validation for sentinel checks; --sentinel-patterns for custom list |
| distribution_analysis.py | python3 distribution_analysis.py data.csv dist.json | --columns col1,col2 to limit scope on wide datasets |
| outlier_detection.py | python3 outlier_detection.py data.csv outliers.json | --method zscore --threshold 3.0 for normal data; --method both for dual detection |
| correlation_matrix.py | python3 correlation_matrix.py data.csv corr.json | --method pearson\|spearman to override auto; --columns to limit scope. Outputs JSON + CSV matrix + PNG heatmap |
| deep_quality_analysis.py | python3 deep_quality_analysis.py data.csv analysis.json | --depth deep for full investigation; --columns for targeted analysis; --sample N for large datasets |
| detect_categories.py | python3 detect_categories.py --input data.csv --output cats.json | --column name to override auto-selection; --method tfidf_kmeans to force clustering |
| classify_answers.py | python3 classify_answers.py --input data.csv --output classify.json | --column col when auto-selection picks wrong column; --sample N for large datasets |
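To make the --method zscore --threshold 3.0 and --method both options more concrete, here is a minimal standard-library sketch of two common outlier rules. It is not the code in outlier_detection.py: the function names are hypothetical, and pairing z-score with the IQR rule for "both" is an assumption, since the table does not say which second method the script uses.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose absolute z-score exceeds threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # constant column: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def detect_outliers(values, method="both", threshold=3.0):
    """Run one or both rules. 'both' = z-score plus IQR (an assumption here)."""
    if method not in ("zscore", "iqr", "both"):
        raise ValueError(f"unknown method: {method}")
    result = {}
    if method in ("zscore", "both"):
        result["zscore"] = zscore_outliers(values, threshold)
    if method in ("iqr", "both"):
        result["iqr"] = iqr_outliers(values)
    return result


if __name__ == "__main__":
    sample = [10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 100]
    print(detect_outliers(sample, method="both"))
```

The z-score rule assumes roughly normal data, which matches the table's note that --threshold 3.0 is meant "for normal data"; the IQR rule makes no such assumption, which is one reason the two are often run together.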
New in v2.0.0
detect_all_issues.py — Combined Meta-Profiler
detect_all_issues.py runs the quality, distribution, outlier, and correlation analyses (plus category detection and answer classification, judging by the report keys) in a single invocation, producing one JSON report with nested sub-analyses: {sentinels, quality, distributions, outliers, correlations, categories, answer_classification, errors}.
Use this instead of running each analysis script individually.
python3 detect_all_issues.py data.parquet report.json
# Include sentinel/placeholder detection
python3 detect_all_issues.py data.parquet report.json --include-content-validation
Do not run detect_all_issues.py on datasets over 1M rows without sampling first: it runs 6 sub-analyses sequentially, which can take 30+ minutes and risks running out of memory (OOM) on correlation heatmaps.
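One way to sample before profiling a large dataset is reservoir sampling, which draws a uniform random sample in a single streaming pass without loading the whole file into memory. The sketch below uses only the standard library and works on CSV. The function name, file names, and sample size are hypothetical, and whether detect_all_issues.py accepts CSV input is an assumption (its examples here only show Parquet).

```python
import csv
import random


def sample_csv_rows(src_path, dst_path, k, seed=0):
    """Copy the header plus a uniform random sample of up to k data rows.

    Uses reservoir sampling (Algorithm R), so memory use is bounded by k rows.
    Returns the number of data rows written.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    reservoir = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            raise ValueError("source file is empty")
        for i, row in enumerate(reader):
            if i < k:
                reservoir.append(row)  # fill the reservoir first
            else:
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = row  # replace with probability k / (i + 1)
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)


if __name__ == "__main__":
    # Hypothetical file names; adjust to the dataset at hand.
    written = sample_csv_rows("data.csv", "data_sample.csv", k=100_000)
    print(f"wrote {written} sampled rows")
```

For Parquet input, the same idea could be expressed with pandas (DataFrame.sample followed by DataFrame.to_parquet), at the cost of reading the full file into memory first. Note that the sampled rows come out in reservoir order, not source order, which should not matter for profiling but does matter for anything order-sensitive.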
Dependencies
pandas numpy scipy matplotlib seaborn
Related Skills from Other Suites
- Linguistic Annotate — annotation quality assessment
magic-data-loading
Load and ingest data from any source — files (CSV, TSV, Parquet, JSON, JSONL, Excel), databases (SQLite, PostgreSQL, MySQL via connection string), or remote repositories (HuggingFace Hub datasets). Auto-detects format, encoding, and delimiter for files. Use when a user mentions data, a dataset, a file, a database, a table, records, or any structured data source they want to work with — even vague references like 'I have some data' or 'help me with this dataset'.
magic-data-cleaning
Clean data by detecting issues, handling missing values, normalizing strings, and executing cleaning plans. Use when: (1) data has missing values or nulls to impute, (2) text columns need normalization or deduplication, (3) type errors or inconsistent formats need fixing, (4) planning a cleaning strategy before execution. Does NOT handle sentinel/placeholder values requiring LLM — route those to magic-data-synthesis. Trigger keywords: clean, fix nulls, handle missing, normalize, deduplicate, impute, strip whitespace.