magic-data-loading

Load and ingest data from files (CSV, TSV, Parquet, JSON, JSONL, Excel), databases (SQLite, PostgreSQL, MySQL via connection string), or remote repositories (HuggingFace Hub datasets). For files, the format, encoding, and delimiter are detected automatically.
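
The encoding and delimiter detection mentioned above can be approximated with the standard library alone. The sketch below is an illustration, not the skill's implementation: `sniff_text` is a hypothetical helper, and the fallback order (BOM, strict UTF-8, then Latin-1) is an assumption. The skill lists `chardet` among its dependencies, which presumably handles encodings more robustly than this.

```python
import csv


def sniff_text(raw: bytes, delimiters: str = ",\t;|"):
    """Guess (encoding, delimiter) for a sample of a delimited text file.

    Hypothetical helper for illustration; not part of magic-data-loading.
    """
    # Encoding: honour a UTF-8 BOM, then try strict UTF-8, then fall back
    # to Latin-1, which can decode any byte sequence (assumed fallback).
    if raw.startswith(b"\xef\xbb\xbf"):
        encoding = "utf-8-sig"
    else:
        try:
            raw.decode("utf-8")
            encoding = "utf-8"
        except UnicodeDecodeError:
            encoding = "latin-1"
    text = raw.decode(encoding)

    # Delimiter: csv.Sniffer looks for a candidate character whose count
    # is consistent from line to line.
    dialect = csv.Sniffer().sniff(text, delimiters=delimiters)
    return encoding, dialect.delimiter


# A semicolon-separated sample containing Latin-1 bytes (0xE9 is "e acute").
sample = b"nom;ville\nRen\xe9;Paris\nZo\xe9;Lyon\n"
encoding, delimiter = sniff_text(sample)
```

Sniffing works on a sample, so in practice one would pass only the first few kilobytes of a file rather than its whole content.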

When It Activates

Use this skill when the user mentions data, dataset, CSV, file, database, table, HuggingFace, records, or any structured data source. Trigger phrases: load, read, import, ingest, open, parse, I have data, I have a dataset, connect to database, check HuggingFace.

User provides a data file (CSV, TSV, Parquet, JSON, JSONL, Excel)
User mentions a database connection, table name, or SQL query
User references a HuggingFace dataset (by name, URL, or "HF" mention)
User mentions "data", "dataset", "records", or "table" without specifying a source
Need to load data for analysis, cleaning, or transformation
Need to preview data before full processing
Need to detect file format or encoding

When NOT to Use: The data is already loaded as a DataFrame. In that case, use magic-data-profiling for analysis instead.

Quick Facts

Property	Value
Version	2.0.0
Complexity	low
Phase	1
Scripts	11

Scripts

Callable Tools (call directly via CLI)

Script	Purpose	Example
`detect_format.py`	Content-sniffing format detection	`python3 detect_format.py input_file output.json`
`inspect_hf_dataset.py`	Inspect HF dataset remotely (no download)	`python3 inspect_hf_dataset.py --dataset org/name [--sample-rows 3]`
`download_hf_dataset.py`	Download HF dataset to local directory	`python3 download_hf_dataset.py --dataset org/name --output data/input/hf/ [--patterns "*.parquet"]`
`generate_dataset_card.py`	Generate dataset card from data	`python3 generate_dataset_card.py --input data.parquet --repo org/name [--license apache-2.0]`
`connect_database.py`	Connect, health check, list tables	`python3 connect_database.py --env-var DATABASE_URL`
`inspect_schema.py`	Schema discovery (tables, columns, FKs)	`python3 inspect_schema.py --env-var DATABASE_URL [--table name]`
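
Content sniffing, as opposed to trusting the file extension, usually means checking the first bytes of a file against known signatures. The following is a minimal sketch of that idea and makes no claim about how `detect_format.py` works internally: `sniff_format`, the signature table, and the returned labels are all hypothetical. The byte signatures themselves are the standard ones for Parquet, SQLite, and ZIP containers (which `.xlsx` files are).

```python
import json

# Well-known leading byte signatures; the labels are illustrative.
MAGIC_PREFIXES = [
    (b"PAR1", "parquet"),
    (b"SQLite format 3\x00", "sqlite"),
    (b"PK\x03\x04", "zip-container (possibly xlsx)"),
]


def _parses_as_json(line: str) -> bool:
    try:
        json.loads(line)
        return True
    except json.JSONDecodeError:
        return False


def sniff_format(head: bytes) -> str:
    """Classify a file from its first bytes (hypothetical helper)."""
    for magic, name in MAGIC_PREFIXES:
        if head.startswith(magic):
            return name

    text = head.decode("utf-8", errors="replace").lstrip()
    if text[:1] in ("{", "["):
        lines = [ln for ln in text.splitlines() if ln.strip()]
        # JSONL: several lines, and the first line is a complete JSON value.
        if len(lines) > 1 and _parses_as_json(lines[0]):
            return "jsonl"
        return "json"

    # Anything else is treated as delimited text in this sketch.
    return "delimited-text"
```

For example, `sniff_format(open(path, "rb").read(4096))` would classify a file from its first 4 KB, where `path` is whatever input file is being inspected.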

Scriptable Tools (call directly or read + adapt)

Script	Standard CLI Usage	When to Customize
`load_file.py`	`python3 load_file.py input.csv output.parquet`	`--nrows 10000` for sampling; `--chunk_size N` for >100MB files; `--flatten-depth 2` for nested JSONL; `--explain` for dry-run; supports `hf://` URIs
`validate_load.py`	`python3 validate_load.py loaded.csv --original_path raw.csv --output_path report.json`	Always add `--original_path` to catch silent data loss
`sample_rows.py`	`python3 sample_rows.py data.parquet sample.csv --n 100 --method head`	`--method random` for unbiased sample; `--method stratified --stratify_column label` for class-balanced
`extract_data.py`	`python3 extract_data.py --query "SELECT * FROM table LIMIT 100"`	Always provide `--query`; add `--output path.parquet` to save as checkpoint
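
To make the difference between the sampling methods concrete, here is a minimal sketch of class-balanced (stratified) sampling in plain Python. It is an assumption about what `--method stratified --stratify_column label` aims for, not a copy of `sample_rows.py`: the function name, the even split across classes, and the fixed seed are all illustrative choices.

```python
import random
from collections import defaultdict


def stratified_sample(rows, stratify_column, n, seed=0):
    """Draw up to n rows, split evenly across the classes in stratify_column.

    Hypothetical helper; rows is a list of dicts.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[stratify_column]].append(row)
    if not groups:
        return []

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    per_class = max(1, n // len(groups))
    sample = []
    for members in groups.values():
        # Small classes contribute everything they have.
        sample.extend(rng.sample(members, min(per_class, len(members))))
    return sample


# 90 negative and 10 positive rows: head or random sampling would mostly
# return negatives, while the stratified draw returns 5 of each.
rows = [{"id": i, "label": "neg"} for i in range(90)]
rows += [{"id": i, "label": "pos"} for i in range(90, 100)]
balanced = stratified_sample(rows, "label", n=10)
```

With heavily imbalanced labels, a `head` sample can miss the minority class entirely, which is the case the stratified method is meant to cover.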

Reference Implementations (read patterns, write custom code)

Script	Demonstrates	Key Pattern
`text_parser.py`	State-machine text parsing	Two modes (template vs raw text); markers/fields/separators are always data-specific

New in v2.0.0

JSONL Support with `--flatten-depth`

`load_file.py` now natively handles JSONL files (`.jsonl` / `.ndjson`). Use `--flatten-depth N` to flatten nested JSON fields to depth N:

python3 load_file.py records.jsonl output.parquet --flatten-depth 2

With `--flatten-depth 2`, the nested record `{"meta": {"source": "web"}}` is flattened into a single `meta.source` column.
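
The flattening behaviour can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `load_file.py`: `flatten` is a hypothetical function, and "depth" is interpreted here as the maximum number of key levels joined into one column name, which matches the `meta.source` example at depth 2. Anything nested deeper is kept as a nested value.

```python
import json


def flatten(record: dict, max_depth: int, sep: str = ".") -> dict:
    """Flatten nested dicts so column names span at most max_depth key levels.

    Hypothetical helper; the depth semantics are an assumption.
    """
    flat = {}

    def walk(value, path):
        # Descend only into non-empty dicts, and only while the column
        # name would stay within max_depth levels.
        if isinstance(value, dict) and value and len(path) < max_depth:
            for key, child in value.items():
                walk(child, path + [key])
        else:
            flat[sep.join(path)] = value

    for key, value in record.items():
        walk(value, [key])
    return flat


# Each JSONL line is one JSON object; blank lines are skipped.
lines = ['{"id": 1, "meta": {"source": "web", "geo": {"cc": "DE"}}}', ""]
rows = [flatten(json.loads(line), max_depth=2) for line in lines if line.strip()]
```

Here `meta.source` becomes its own column, while `meta.geo` stays a nested object because flattening it would need a third level.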

text_parser.py — Semi-Structured Text Parsing

`text_parser.py` parses semi-structured text fields (tag-value, key-value, delimited templates) into structured CSV columns. It is a reference implementation: read it to understand the approach, then write custom code adapted to your text format.

Two parsing modes:

Template mode — text follows a predictable template with known field markers
Raw text mode — free-form text where fields appear in unpredictable order
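
As an illustration of the state-machine approach for tag-value text, the sketch below keeps track of the field currently being filled and switches state whenever a known marker starts a line. Everything in it is hypothetical and data-specific: the markers (`TI`, `AU`, `AB`), the `-` separator, and the function name are invented for the example and are not taken from `text_parser.py`.

```python
# Hypothetical marker-to-column mapping; real data needs its own.
MARKERS = {"TI": "title", "AU": "author", "AB": "abstract"}


def parse_tagged(text: str, markers=MARKERS, sep: str = "-") -> dict:
    """Parse 'TAG - value' lines, in any order, into a dict of fields."""
    record = {}
    current = None  # state: the field that continuation lines belong to
    for line in text.splitlines():
        tag, found, rest = line.partition(sep)
        tag = tag.strip()
        if found and tag in markers:
            # A known marker starts a new field (a repeated tag overwrites).
            current = markers[tag]
            record[current] = rest.strip()
        elif current is not None and line.strip():
            # No marker: treat the line as a continuation of the last field.
            record[current] += " " + line.strip()
    return record


text = "TI - Loading data\nAB - First line\n  of a well-known text\nAU - A. Author"
record = parse_tagged(text)
```

Because fields are recognised by marker rather than by position, the same loop handles records whose fields appear in different orders. A strict template, by contrast, could often be handled with a single regular expression.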

Dependencies

pandas chardet openpyxl pyarrow sqlalchemy huggingface_hub httpx

