MAGIC Agent Skills is now open source! Star on GitHub
MAGIC Agent SkillsMAGIC Agent Skills
TutorialsEasy

Load and Profile a CSV

In this tutorial you'll watch the MAGIC Data Agent load a CSV file and run a full data-quality profile on it. The agent automatically detects the file format, loads it into memory, and then routes to the profiling skill to score quality and surface critical issues — all from a single natural-language prompt.

This is the recommended starting point if you're new to MAGIC Data Agent Skills. It uses two skills (magic-data-loading and magic-data-profiling) and takes under two minutes end-to-end.

Setup

Follow these steps to set up your environment. If you've already installed, skip to the sample data.

Install the skills

npx @votee-ai/magic-data-agent-skills init

Create a workspace

mkdir magic-tutorial && cd magic-tutorial

Download sample data

curl -O https://docs.votee.ai/tutorials/data/sales_data.csv

Sample Data

This tutorial uses a 515-row sales dataset with intentional quality issues — mixed date formats, duplicate rows, missing values, and extreme outliers. Download it to follow along:

The Agent Trace

The interactive viewer below shows every step the agent took — from receiving your prompt to delivering the quality report. Click any step to expand its details, or use ← → arrow keys to navigate.

magic-data-loadingmagic-data-profiling2 minutes

Load data/sales_data.csv and tell me about the data quality issues

User prompt

00:00
magic-data-loadingSkillloadcsv
00:03

Detected 'Load' keyword with a file path reference — the agent invokes the skill, loading 21,739 characters of domain knowledge about format detection, encoding handling, and loading strategies

file inspection + CSV loadExecution
00:06

Input

{
  "file_path": "data/sales_data.csv"
}

Output

{
  "rows": 515,
  "columns": 12,
  "encoding": "UTF-8",
  "format": "CSV"
}
wc -l data/sales_data.csv && file data/sales_data.csv
magic-data-profilingSkillqualityissues
00:13

magic-data-loadingmagic-data-profilingUser asked about 'quality issues' — data loaded successfully (515 rows × 12 columns). Activating profiling skill for comprehensive analysis.

Loaded profiling skill with quality scoring patterns, distribution analysis methods, and outlier detection strategies

comprehensive quality profilingExecution
00:18

Input

{
  "dataframe_shape": [
    515,
    12
  ]
}

Output

{
  "quality_score": 86.9,
  "grade": "B",
  "duplicates": 15,
  "date_formats": 3,
  "quantity_outliers": 5,
  "missing_email_pct": "13.0%",
  "missing_phone_pct": "8.2%",
  "missing_notes_pct": "89.7%",
  "revenue_consistency": "100%",
  "email_shared_across_reps": true
}
import pandas as pd
import numpy as np
import re

df = pd.read_csv('data/sales_data.csv')
# Quality dimensions: completeness, uniqueness,
# consistency (date formats, email mapping),
# validity (outliers, revenue check)
date format & outlier deep-diveExecution
00:25

Input

{
  "columns": [
    "date",
    "quantity"
  ]
}

Output

{
  "date_formats": {
    "YYYY-MM-DD": 410,
    "MM/DD/YYYY": 69,
    "DD-Mon-YYYY": 36
  },
  "quantity_outliers": [
    {
      "order_id": "ORD-1389",
      "quantity": 15862,
      "revenue": "$1.6M"
    },
    {
      "order_id": "ORD-1444",
      "quantity": 12353,
      "revenue": "$1.7M"
    },
    {
      "order_id": "ORD-1256",
      "quantity": 11581
    },
    {
      "order_id": "ORD-1042",
      "quantity": 9262
    },
    {
      "order_id": "ORD-1137",
      "quantity": 5741
    }
  ]
}
# Date format detection
mask_iso = df['date'].str.match(r'^\d{4}-\d{2}-\d{2}$')
mask_us = df['date'].str.match(r'^\d{2}/\d{2}/\d{4}$')
mask_dmy = df['date'].str.match(r'^\d{2}-[A-Z][a-z]{2}-\d{4}$')
Results deliveredOutput
00:32

Data Quality Report: `sales_data.csv`

Loaded: 515 rows × 12 columns | Overall quality: 86.9/100 (Grade B) — but the aggregate masks serious column-level issues.

Critical issues

  • Duplicate `order_id` values — 15 duplicates (e.g., ORD-1072, ORD-1066). Breaks joins and aggregations if order_id is a primary key.
  • Three inconsistent date formats — 410 rows ISO, 69 rows US MM/DD/YYYY, 36 rows DD-Mon-YYYY. All parseable, but downstream sort/filter will misfire without normalization.
  • Extreme quantity outliers — 5 rows with quantities of 5,741–15,862 vs. median ~165. Revenue arithmetic is consistent (qty × price = revenue, 0 mismatches), so the error is upstream.
  • `email` ↔ `sales_rep` mapping is broken — only 30 unique emails for 10 reps, every email assigned to multiple reps. Effectively random.
  • Moderate issues

  • Missingness: notes 89.7%, email 13.0%, phone 8.2%
  • Whitespace in notes: 53 values with leading/trailing spaces
  • Clean dimensions

  • region (5), channel (4), product (10), sales_rep (10) — all clean
  • revenue = quantity × unit_price holds for all 515 rows
  • No negative or zero quantities/prices
  • Date range: 2024-01-01 → 2024-12-30
  • Artifacts

    sales_data.csv

    515 rows × 12 columns — sales transactions with quality issues

    Data Flow

    StepSkillInputOutputKey Config
    1magic-data-loadingsales_data.csv (raw file)DataFrame (515×12)encoding: auto-detect
    2magic-data-profilingDataFrame (515×12)Quality report (86.9/100, Grade B)method: inline (small dataset)

    What Happened

    Skill Activation

    The agent detected "Load" and "CSV" in the prompt and activated magic-data-loading. It read the skill's SKILL.md to understand format detection, encoding handling, and checkpoint patterns.

    Data Loading

    The agent loaded sales_data.csv into pandas — 515 rows × 12 columns. It immediately inspected shape, dtypes, and null counts to plan the next step.

    Automatic Transition to Profiling

    Because the prompt said "quality issues", the agent activated magic-data-profiling, transitioning from the loading skill without any extra instruction. It read the profiling skill's SKILL.md to configure the inline analysis method appropriate for this dataset size.

    Comprehensive Profiling

    The agent ran five analysis passes:

    • Date format detection — found 3 formats via regex matching
    • Missing value analysis — email (13%), phone (8.2%), notes (89.7%)
    • Email cross-mapping — discovered emails shared across 6–9 reps
    • Duplicate detection — found 15 exact duplicate rows
    • Outlier detection — 5 extreme quantity values using the IQR method

    Quality Report

    The agent delivered a severity-rated report with 7 issues (3 High, 3 Medium, 1 Low) and recommended cleaning steps, scoring the dataset at 86.9/100 (Grade B).

    Key Issues Found

    #SeverityIssueImpact
    1High15 exact duplicate rowsInflates aggregations by 3%
    2High3 inconsistent date formatsBreaks time-series analysis
    3HighEmails shared across all repsCannot use as identifier
    4Medium5 extreme quantity outliers (5,741–15,862)Skews averages by 10x
    5Medium67 missing emails (13%)Gaps in contact data
    6Medium42 missing phones (8.2%)Gaps in contact data
    7LowNotes column 90% empty, rest is placeholderNo analytical value

    Try It Yourself

    Copy this prompt and paste it into your MAGIC session:

    Load data/sales_data.csv and tell me about the data quality issues

    The agent will:

    1. Detect the file format and encoding
    2. Load the data into memory
    3. Transition automatically to the profiling skill
    4. Return a quality score with a prioritized list of issues

    What You Learned

    • The agent automatically selects skills based on keywords in your prompt — you don't need to name skills explicitly.
    • Skill transitions happen seamlessly: magic-data-loading hands off to magic-data-profiling without any extra instruction from you.
    • The agent runs multiple analysis passes (dates, nulls, duplicates, outliers, cross-column validation) in a single profiling session.
    • Missing values, duplicates, and outliers are all surfaced in a single pass with severity ratings to help you prioritize.

    Alternate Approaches

    Next Steps

    Was this page helpful?
    Edit on GitHub

    Last updated on

    On this page