Load and Profile a CSV

In this tutorial you'll watch the MAGIC Data Agent load a CSV file and run a full data-quality profile on it. The agent automatically detects the file format, loads it into memory, and then routes to the profiling skill to score quality and surface critical issues — all from a single natural-language prompt.

This is the recommended starting point if you're new to MAGIC Data Agent Skills. It uses two skills (magic-data-loading and magic-data-profiling) and takes under two minutes end-to-end.

Setup

Follow these steps to set up your environment. If you've already installed, skip to the sample data.

Install the skills

npx @votee-ai/magic-data-agent-skills init

Create a workspace

mkdir magic-tutorial && cd magic-tutorial

Download sample data

curl -O https://docs.votee.ai/tutorials/data/sales_data.csv

Sample Data

This tutorial uses a 515-row sales dataset with intentional quality issues — mixed date formats, duplicate rows, missing values, and extreme outliers. Download it to follow along:

sales_data.csv

515 rows × 12 columns — sales transactions with embedded quality issues

The Agent Trace

The interactive viewer below shows every step the agent took — from receiving your prompt to delivering the quality report. Click any step to expand its details, or use ← → arrow keys to navigate.

magic-data-loadingmagic-data-profiling2 minutes

Load data/sales_data.csv and tell me about the data quality issues

User prompt

00:00

magic-data-loadingSkillloadcsv

00:03

Detected 'Load' keyword with a file path reference — the agent invokes the skill, loading 21,739 characters of domain knowledge about format detection, encoding handling, and loading strategies

file inspection + CSV loadExecution

00:06

Input

{
  "file_path": "data/sales_data.csv"
}

Output

{
  "rows": 515,
  "columns": 12,
  "encoding": "UTF-8",
  "format": "CSV"
}

wc -l data/sales_data.csv && file data/sales_data.csv

magic-data-profilingSkillqualityissues

00:13

magic-data-loading → magic-data-profiling— User asked about 'quality issues' — data loaded successfully (515 rows × 12 columns). Activating profiling skill for comprehensive analysis.

Loaded profiling skill with quality scoring patterns, distribution analysis methods, and outlier detection strategies

comprehensive quality profilingExecution

00:18

Input

{
  "dataframe_shape": [
    515,
    12
  ]
}

Output

{
  "quality_score": 86.9,
  "grade": "B",
  "duplicates": 15,
  "date_formats": 3,
  "quantity_outliers": 5,
  "missing_email_pct": "13.0%",
  "missing_phone_pct": "8.2%",
  "missing_notes_pct": "89.7%",
  "revenue_consistency": "100%",
  "email_shared_across_reps": true
}

import pandas as pd
import numpy as np
import re

df = pd.read_csv('data/sales_data.csv')
# Quality dimensions: completeness, uniqueness,
# consistency (date formats, email mapping),
# validity (outliers, revenue check)

date format & outlier deep-diveExecution

00:25

Input

{
  "columns": [
    "date",
    "quantity"
  ]
}

Output

{
  "date_formats": {
    "YYYY-MM-DD": 410,
    "MM/DD/YYYY": 69,
    "DD-Mon-YYYY": 36
  },
  "quantity_outliers": [
    {
      "order_id": "ORD-1389",
      "quantity": 15862,
      "revenue": "$1.6M"
    },
    {
      "order_id": "ORD-1444",
      "quantity": 12353,
      "revenue": "$1.7M"
    },
    {
      "order_id": "ORD-1256",
      "quantity": 11581
    },
    {
      "order_id": "ORD-1042",
      "quantity": 9262
    },
    {
      "order_id": "ORD-1137",
      "quantity": 5741
    }
  ]
}

# Date format detection
mask_iso = df['date'].str.match(r'^\d{4}-\d{2}-\d{2}$')
mask_us = df['date'].str.match(r'^\d{2}/\d{2}/\d{4}$')
mask_dmy = df['date'].str.match(r'^\d{2}-[A-Z][a-z]{2}-\d{4}$')

Results deliveredOutput

00:32

Data Quality Report: `sales_data.csv`

Loaded: 515 rows × 12 columns | Overall quality: 86.9/100 (Grade B) — but the aggregate masks serious column-level issues.

Critical issues

Duplicate `order_id` values — 15 duplicates (e.g., ORD-1072, ORD-1066). Breaks joins and aggregations if order_id is a primary key.

Three inconsistent date formats — 410 rows ISO, 69 rows US MM/DD/YYYY, 36 rows DD-Mon-YYYY. All parseable, but downstream sort/filter will misfire without normalization.

Extreme quantity outliers — 5 rows with quantities of 5,741–15,862 vs. median ~165. Revenue arithmetic is consistent (qty × price = revenue, 0 mismatches), so the error is upstream.

`email` ↔ `sales_rep` mapping is broken — only 30 unique emails for 10 reps, every email assigned to multiple reps. Effectively random.

Moderate issues

Missingness: notes 89.7%, email 13.0%, phone 8.2%

Whitespace in notes: 53 values with leading/trailing spaces

Clean dimensions

region (5), channel (4), product (10), sales_rep (10) — all clean

revenue = quantity × unit_price holds for all 515 rows

No negative or zero quantities/prices

Date range: 2024-01-01 → 2024-12-30

Artifacts

sales_data.csv

515 rows × 12 columns — sales transactions with quality issues

Data Flow

Step	Skill	Input	Output	Key Config
1	`magic-data-loading`	`sales_data.csv` (raw file)	DataFrame (515×12)	encoding: auto-detect
2	`magic-data-profiling`	DataFrame (515×12)	Quality report (86.9/100, Grade B)	method: inline (small dataset)

What Happened

Skill Activation

The agent detected "Load" and "CSV" in the prompt and activated magic-data-loading. It read the skill's SKILL.md to understand format detection, encoding handling, and checkpoint patterns.

Data Loading

The agent loaded sales_data.csv into pandas — 515 rows × 12 columns. It immediately inspected shape, dtypes, and null counts to plan the next step.

Automatic Transition to Profiling

Because the prompt said "quality issues", the agent activated magic-data-profiling, transitioning from the loading skill without any extra instruction. It read the profiling skill's SKILL.md to configure the inline analysis method appropriate for this dataset size.

Comprehensive Profiling

The agent ran five analysis passes:

Date format detection — found 3 formats via regex matching
Missing value analysis — email (13%), phone (8.2%), notes (89.7%)
Email cross-mapping — discovered emails shared across 6–9 reps
Duplicate detection — found 15 exact duplicate rows
Outlier detection — 5 extreme quantity values using the IQR method

Quality Report

The agent delivered a severity-rated report with 7 issues (3 High, 3 Medium, 1 Low) and recommended cleaning steps, scoring the dataset at 86.9/100 (Grade B).

Key Issues Found

#	Severity	Issue	Impact
1	High	15 exact duplicate rows	Inflates aggregations by 3%
2	High	3 inconsistent date formats	Breaks time-series analysis
3	High	Emails shared across all reps	Cannot use as identifier
4	Medium	5 extreme quantity outliers (5,741–15,862)	Skews averages by 10x
5	Medium	67 missing emails (13%)	Gaps in contact data
6	Medium	42 missing phones (8.2%)	Gaps in contact data
7	Low	Notes column 90% empty, rest is placeholder	No analytical value

Try It Yourself

Copy this prompt and paste it into your MAGIC session:

Load data/sales_data.csv and tell me about the data quality issues

The agent will:

Detect the file format and encoding
Load the data into memory
Transition automatically to the profiling skill
Return a quality score with a prioritized list of issues

What You Learned

The agent automatically selects skills based on keywords in your prompt — you don't need to name skills explicitly.
Skill transitions happen seamlessly: magic-data-loading hands off to magic-data-profiling without any extra instruction from you.
The agent runs multiple analysis passes (dates, nulls, duplicates, outliers, cross-column validation) in a single profiling session.
Missing values, duplicates, and outliers are all surfaced in a single pass with severity ratings to help you prioritize.

Load and Profile a CSV

Setup

Install the skills

Create a workspace

Download sample data

Sample Data

sales_data.csv

The Agent Trace