MAGIC Agent Skills
Skills Reference

magic-data-loading

Load and ingest data from any source — files (CSV, TSV, Parquet, JSON, JSONL, Excel), databases (SQLite, PostgreSQL, MySQL via connection string), or remote repositories (HuggingFace Hub datasets). Auto-detects format, encoding, and delimiter for files.
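To make the auto-detection idea concrete, here is a minimal sketch of content sniffing using only the Python standard library. It is not the skill's actual implementation: the function name `sniff_format`, the heuristics, and the returned keys are assumptions chosen for illustration, and it assumes `raw` holds the whole file or a sample that ends on a line boundary.

```python
import csv
import json


def sniff_format(raw: bytes) -> dict:
    """Guess a file's format from its bytes (illustrative heuristics only)."""
    # Binary formats announce themselves with magic bytes.
    if raw.startswith(b"PAR1"):
        return {"format": "parquet"}
    if raw.startswith(b"PK\x03\x04"):
        # .xlsx is a ZIP container; any other ZIP archive would match too.
        return {"format": "excel"}

    # Text formats: try UTF-8 first (utf-8-sig also strips a BOM),
    # then fall back to Latin-1, which can decode any byte sequence.
    try:
        text, encoding = raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        text, encoding = raw.decode("latin-1"), "latin-1"

    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {"format": "unknown", "encoding": encoding}

    if text.lstrip()[:1] in ("{", "["):
        # One JSON document for the whole file?
        try:
            json.loads(text)
            return {"format": "json", "encoding": encoding}
        except json.JSONDecodeError:
            pass
        # Otherwise, one JSON value per line means JSONL.
        try:
            for ln in lines:
                json.loads(ln)
            return {"format": "jsonl", "encoding": encoding}
        except json.JSONDecodeError:
            pass

    # Delimited text: let csv.Sniffer pick among common delimiters.
    try:
        dialect = csv.Sniffer().sniff("\n".join(lines[:20]), delimiters=",\t;|")
    except csv.Error:
        return {"format": "text", "encoding": encoding}
    return {"format": "delimited", "encoding": encoding,
            "delimiter": dialect.delimiter}
```

A caller might read the first few kilobytes of a file in binary mode and pass them to `sniff_format` before choosing a loader. A production detector would likely use a statistical encoding detector such as chardet (listed under Dependencies) rather than the two-step fallback shown here.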

When It Activates

Use this skill when the user mentions data, dataset, CSV, file, database, table, HuggingFace, records, or any structured data source. Trigger phrases: load, read, import, ingest, open, parse, I have data, I have a dataset, connect to database, check HuggingFace.

  • User provides a data file (CSV, TSV, Parquet, JSON, JSONL, Excel)
  • User mentions a database connection, table name, or SQL query
  • User references a HuggingFace dataset (by name, URL, or "HF" mention)
  • User mentions "data", "dataset", "records", or "table" without specifying a source
  • Need to load data for analysis, cleaning, or transformation
  • Need to preview data before full processing
  • Need to detect file format or encoding

When NOT to Use: Data is already loaded as a DataFrame; use magic-data-profiling for analysis instead.

Quick Facts

Property | Value
Version | 2.0.0
Complexity | low
Phase | 1
Scripts | 11

Tags

data-science ingestion loading csv parquet json excel database huggingface data-source

Scripts

Callable Tools (call directly via CLI)

Script | Purpose | Example
detect_format.py | Content-sniffing format detection | python3 detect_format.py input_file output.json
inspect_hf_dataset.py | Inspect HF dataset remotely (no download) | python3 inspect_hf_dataset.py --dataset org/name [--sample-rows 3]
download_hf_dataset.py | Download HF dataset to local directory | python3 download_hf_dataset.py --dataset org/name --output data/input/hf/ [--patterns "*.parquet"]
generate_dataset_card.py | Generate dataset card from data | python3 generate_dataset_card.py --input data.parquet --repo org/name [--license apache-2.0]
connect_database.py | Connect, health check, list tables | python3 connect_database.py --env-var DATABASE_URL
inspect_schema.py | Schema discovery (tables, columns, FKs) | python3 inspect_schema.py --env-var DATABASE_URL [--table name]
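As an illustration of what schema discovery involves, the following sketch lists tables, columns, and foreign keys for a SQLite database using only the standard library. It is an assumption-laden stand-in, not the code in inspect_schema.py: the function names, the returned structure, and the handling of a `sqlite:///` prefix in the environment variable are all hypothetical, and other databases would need a driver or SQLAlchemy instead.

```python
import os
import sqlite3


def connect_from_env(env_var: str = "DATABASE_URL") -> sqlite3.Connection:
    """Open a SQLite connection from an env var (assumed form: sqlite:///path)."""
    url = os.environ[env_var]
    path = url[len("sqlite:///"):] if url.startswith("sqlite:///") else url
    return sqlite3.connect(path)


def inspect_schema(conn: sqlite3.Connection) -> dict:
    """Return {table: {"columns": [...], "foreign_keys": [...]}} for user tables."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    schema = {}
    for table in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        schema[table] = {
            "columns": [
                {"name": c[1], "type": c[2], "primary_key": bool(c[5])}
                for c in cols
            ],
            "foreign_keys": [
                {"column": fk[3], "ref_table": fk[2], "ref_column": fk[4]}
                for fk in fks
            ],
        }
    return schema
```

Reading the connection string from an environment variable, as the --env-var flag suggests, keeps credentials out of command history and scripts.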

Scriptable Tools (call directly or read + adapt)

Script | Standard CLI Usage | When to Customize
load_file.py | python3 load_file.py input.csv output.parquet | --nrows 10000 for sampling; --chunk_size N for >100MB files; --flatten-depth 2 for nested JSONL; --explain for dry-run; supports hf:// URIs
validate_load.py | python3 validate_load.py loaded.csv --original_path raw.csv --output_path report.json | Always add --original_path to catch silent data loss
sample_rows.py | python3 sample_rows.py data.parquet sample.csv --n 100 --method head | --method random for unbiased sample; --method stratified --stratify_column label for class-balanced
extract_data.py | python3 extract_data.py --query "SELECT * FROM table LIMIT 100" | Always provide --query; add --output path.parquet to save as checkpoint
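The three sampling methods named for sample_rows.py differ in what they guarantee. The sketch below shows one plausible reading of them over a list of dict rows. It is hypothetical: the function signature, the fixed seed, and the interpretation of "stratified" as an equal number of rows per class are assumptions, not the script's documented behavior.

```python
import random
from collections import defaultdict


def sample_rows(rows, n, method="head", stratify_column=None, seed=0):
    """Return up to n rows using head, random, or stratified sampling."""
    if not rows:
        return []
    if method == "head":
        # Fast, but biased toward whatever order the file happens to be in.
        return rows[:n]

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    if method == "random":
        return rng.sample(rows, min(n, len(rows)))

    if method == "stratified":
        if stratify_column is None:
            raise ValueError("stratified sampling needs stratify_column")
        groups = defaultdict(list)
        for row in rows:
            groups[row[stratify_column]].append(row)
        # Assumption: class-balanced means an equal share of n per class.
        per_class = max(1, n // len(groups))
        sample = []
        for label in sorted(groups, key=str):  # deterministic class order
            members = groups[label]
            sample.extend(rng.sample(members, min(per_class, len(members))))
        return sample

    raise ValueError(f"unknown method: {method}")
```

With a 90/10 class split, head or random sampling of 10 rows can easily miss the minority class, while the stratified branch returns 5 rows of each.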

Reference Implementations (read patterns, write custom code)

Script | Demonstrates | Key Pattern
text_parser.py | State-machine text parsing | Two modes (template vs raw text); markers/fields/separators are always data-specific

New in v2.0.0

JSONL Support with --flatten-depth

load_file.py now natively handles JSONL files (.jsonl / .ndjson). Use --flatten-depth N to flatten nested JSON fields to depth N:

python3 load_file.py records.jsonl output.parquet --flatten-depth 2

With --flatten-depth 2, a nested record such as {"meta": {"source": "web"}} is flattened into a single meta.source column.
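The flattening step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than load_file.py's code: here "depth N" is taken to mean that a column name may contain at most N key segments, with anything nested deeper left as a dict value, and the names `flatten` and `load_jsonl` are hypothetical.

```python
import json


def flatten(record: dict, depth: int, sep: str = ".") -> dict:
    """Flatten nested dicts so column names have at most `depth` key segments."""
    flat = {}

    def walk(value, path):
        # Descend only while the column name would stay within the depth limit.
        if isinstance(value, dict) and value and len(path) < depth:
            for key, child in value.items():
                walk(child, path + [str(key)])
        else:
            flat[sep.join(path)] = value

    for key, value in record.items():
        walk(value, [str(key)])
    return flat


def load_jsonl(text: str, depth: int) -> list:
    """Parse JSONL text (one JSON object per line) and flatten each record."""
    return [flatten(json.loads(line), depth)
            for line in text.splitlines() if line.strip()]
```

Under this reading, depth 1 leaves `meta` as a nested value, depth 2 yields `meta.source`, and a three-level record at depth 2 keeps its innermost level as a dict.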

text_parser.py — Semi-Structured Text Parsing

text_parser.py parses semi-structured text fields (tag-value, key-value, delimited templates) into structured CSV columns. It is a reference implementation — read it to understand the approach, then write custom code adapted to your text format.

Two parsing modes:

  • Template mode — text follows a predictable template with known field markers
  • Raw text mode — free-form text where fields appear in unpredictable order
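A minimal sketch of the state-machine approach for tag-value text follows. It does not reproduce text_parser.py: the marker names, the `TAG: value` line convention, and both function names are hypothetical, which fits the point that markers, fields, and separators must be adapted to each dataset.

```python
import csv
import io


def parse_tagged_text(text, markers=("TITLE", "AUTHOR", "BODY")):
    """Parse 'TAG: value' text; unmarked lines continue the current field."""
    fields = {}
    current = None  # state: the field currently being filled
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        tag = head.strip().upper()
        if sep and tag in markers:
            # A known marker switches state to that field.
            current = tag
            fields[current] = rest.strip()
        elif current is not None and line.strip():
            # Any other non-blank line is a continuation of the current field.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


def records_to_csv(records, columns):
    """Write parsed records as CSV text; missing fields become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

Because the parser keys on markers rather than line position, it tolerates fields that appear in any order. A stricter template parser could instead require the markers to appear in a fixed sequence and flag records that deviate.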

Dependencies

pandas chardet openpyxl pyarrow sqlalchemy huggingface_hub httpx
