magic-data-loading
Load and ingest data from any source — files (CSV, TSV, Parquet, JSON, JSONL, Excel), databases (SQLite, PostgreSQL, MySQL via connection string), or remote repositories (HuggingFace Hub datasets). Auto-detects format, encoding, and delimiter for files.
When It Activates
Use this skill when the user mentions data, dataset, CSV, file, database, table, HuggingFace, records, or any structured data source. Trigger phrases: load, read, import, ingest, open, parse, I have data, I have a dataset, connect to database, check HuggingFace.
- User provides a data file (CSV, TSV, Parquet, JSON, JSONL, Excel)
- User mentions a database connection, table name, or SQL query
- User references a HuggingFace dataset (by name, URL, or "HF" mention)
- User mentions "data", "dataset", "records", or "table" without specifying a source
- Need to load data for analysis, cleaning, or transformation
- Need to preview data before full processing
- Need to detect file format or encoding
When NOT to Use: Data is already loaded as a DataFrame; use magic-data-profiling for analysis instead.
Quick Facts
| Property | Value |
|---|---|
| Version | 2.0.0 |
| Complexity | low |
| Phase | 1 |
| Scripts | 11 |
Tags
data-science ingestion loading csv parquet json excel database huggingface data-source
Scripts
Callable Tools (call directly via CLI)
| Script | Purpose | Example |
|---|---|---|
| detect_format.py | Content-sniffing format detection | python3 detect_format.py input_file output.json |
| inspect_hf_dataset.py | Inspect HF dataset remotely (no download) | python3 inspect_hf_dataset.py --dataset org/name [--sample-rows 3] |
| download_hf_dataset.py | Download HF dataset to local directory | python3 download_hf_dataset.py --dataset org/name --output data/input/hf/ [--patterns "*.parquet"] |
| generate_dataset_card.py | Generate dataset card from data | python3 generate_dataset_card.py --input data.parquet --repo org/name [--license apache-2.0] |
| connect_database.py | Connect, health check, list tables | python3 connect_database.py --env-var DATABASE_URL |
| inspect_schema.py | Schema discovery (tables, columns, FKs) | python3 inspect_schema.py --env-var DATABASE_URL [--table name] |
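
To make "content-sniffing" concrete, here is a minimal sketch of the general technique: look at the leading bytes of a file instead of trusting its extension. It is not the code of detect_format.py. The names `sniff_format` and `MAGIC_BYTES` and the returned labels are hypothetical, and mapping every ZIP container to "excel" is a simplifying assumption.

```python
import csv
import json

# Well-known leading signatures of binary formats.
MAGIC_BYTES = {
    b"PAR1": "parquet",
    b"PK\x03\x04": "excel",  # assumption: .xlsx is a ZIP container; any ZIP matches
    b"SQLite format 3\x00": "sqlite",
}


def sniff_format(raw: bytes) -> str:
    """Guess a format label from file content (hypothetical helper).

    `raw` should be the whole file or a sample that ends on a line boundary.
    """
    for magic, label in MAGIC_BYTES.items():
        if raw.startswith(magic):
            return label

    # Text formats: decode leniently and drop a UTF-8 byte order mark.
    text = raw.decode("utf-8", errors="replace").lstrip("\ufeff")
    if not text.strip():
        return "unknown"
    lines = [line for line in text.splitlines() if line.strip()]

    if text.lstrip()[0] in "{[":
        # One JSON document for the whole text means JSON.
        try:
            json.loads(text)
            return "json"
        except json.JSONDecodeError:
            pass
        # One JSON document per line means JSONL.
        try:
            for line in lines:
                json.loads(line)
            return "jsonl"
        except json.JSONDecodeError:
            pass

    # Delimited text: let the standard library guess the delimiter.
    try:
        dialect = csv.Sniffer().sniff("\n".join(lines), delimiters=",\t;|")
    except csv.Error:
        return "unknown"
    return "tsv" if dialect.delimiter == "\t" else "csv"
```

Checking binary signatures first keeps the text heuristics from running on binary data, and restricting the candidate delimiters stops `csv.Sniffer` from picking an ordinary letter that happens to occur once per line.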
Scriptable Tools (call directly or read + adapt)
| Script | Standard CLI Usage | When to Customize |
|---|---|---|
| load_file.py | python3 load_file.py input.csv output.parquet | --nrows 10000 for sampling; --chunk_size N for >100MB files; --flatten-depth 2 for nested JSONL; --explain for dry-run; supports hf:// URIs |
| validate_load.py | python3 validate_load.py loaded.csv --original_path raw.csv --output_path report.json | Always add --original_path to catch silent data loss |
| sample_rows.py | python3 sample_rows.py data.parquet sample.csv --n 100 --method head | --method random for unbiased sample; --method stratified --stratify_column label for class-balanced |
| extract_data.py | python3 extract_data.py --query "SELECT * FROM table LIMIT 100" | Always provide --query; add --output path.parquet to save as checkpoint |
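
The advice to compare a loaded file against its original is easiest to see with a row-count check. The sketch below is not the logic of validate_load.py. The helpers `count_rows` and `compare_row_counts` and the report keys are hypothetical, and it assumes both inputs are CSV text with a single header row.

```python
import csv
import io


def count_rows(csv_text: str) -> int:
    """Count data rows (excluding the header) with a real CSV parser."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if row)


def compare_row_counts(original_text: str, loaded_text: str) -> dict:
    """Build a small report showing whether any rows went missing."""
    original = count_rows(original_text)
    loaded = count_rows(loaded_text)
    return {
        "original_rows": original,
        "loaded_rows": loaded,
        "rows_lost": original - loaded,
        "ok": original == loaded,
    }


raw = 'id,note\n1,"line one\nline two"\n2,ok\n3,fine\n'
loaded = 'id,note\n1,"line one\nline two"\n2,ok\n'
report = compare_row_counts(raw, loaded)
```

Counting with `csv.reader` rather than counting newline characters matters here: the quoted field in row 1 spans two physical lines, so a naive line count would report four rows where the file holds three. When reading from disk instead of strings, open the files with `newline=""` for the same reason.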
Reference Implementations (read patterns, write custom code)
| Script | Demonstrates | Key Pattern |
|---|---|---|
text_parser.py | State-machine text parsing | Two modes (template vs raw text); markers/fields/separators are always data-specific |
New in v2.0.0
JSONL Support with --flatten-depth
load_file.py now natively handles JSONL files (.jsonl / .ndjson). Use --flatten-depth N to flatten nested JSON fields to depth N:
    python3 load_file.py records.jsonl output.parquet --flatten-depth 2

This flattens {"meta": {"source": "web"}} into a meta.source column at depth 2.
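
Depth-limited flattening can be sketched in a few lines. This is an illustration under stated assumptions, not the code inside load_file.py: `flatten_record` and `load_jsonl` are hypothetical names, the depth is assumed to be the maximum number of key segments in a column name (so depth 2 yields meta.source), and objects nested deeper than that are assumed to be kept as nested values.

```python
import json
from typing import Any


def flatten_record(record: dict[str, Any], max_depth: int, sep: str = ".") -> dict[str, Any]:
    """Flatten nested dicts into dotted column names of at most max_depth segments."""
    flat: dict[str, Any] = {}

    def walk(value: Any, path: list[str]) -> None:
        # Descend only while the column name stays within max_depth segments.
        if isinstance(value, dict) and value and len(path) < max_depth:
            for key, child in value.items():
                walk(child, path + [str(key)])
        else:
            flat[sep.join(path)] = value

    for key, value in record.items():
        walk(value, [str(key)])
    return flat


def load_jsonl(text: str, max_depth: int) -> list[dict[str, Any]]:
    """Parse one JSON object per non-empty line and flatten each record."""
    return [
        flatten_record(json.loads(line), max_depth)
        for line in text.splitlines()
        if line.strip()
    ]


rows = load_jsonl('{"id": 1, "meta": {"source": "web", "geo": {"country": "DE"}}}\n', 2)
```

With a depth of 2, `rows[0]` has the columns `id`, `meta.source`, and `meta.geo`; the `geo` object sits one level too deep and stays nested rather than becoming `meta.geo.country`.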
text_parser.py — Semi-Structured Text Parsing
text_parser.py parses semi-structured text fields (tag-value, key-value, delimited templates) into structured CSV columns. It is a reference implementation — read it to understand the approach, then write custom code adapted to your text format.
Two parsing modes:
- Template mode — text follows a predictable template with known field markers
- Raw text mode — free-form text where fields appear in unpredictable order
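
A line-based state machine is one way to handle both modes, since it does not care about the order in which fields appear. The sketch below is not taken from text_parser.py. The function `parse_records`, the marker strings, and the `---` separator are hypothetical, and real markers and separators depend on the data.

```python
def parse_records(text: str, markers: dict[str, str], separator: str = "---") -> list[dict[str, str]]:
    """Parse marker-prefixed text into one dict per record.

    `markers` maps an output column name to the marker that starts the field.
    """
    records: list[dict[str, str]] = []
    current: dict[str, str] = {}
    field = None  # state: the field currently being filled, if any

    for line in text.splitlines():
        stripped = line.strip()

        # A separator line closes the current record and resets the state.
        if stripped == separator:
            if current:
                records.append(current)
            current, field = {}, None
            continue

        for name, marker in markers.items():
            if stripped.startswith(marker):
                # A marker switches the state to a new field.
                field = name
                current[field] = stripped[len(marker):].strip()
                break
        else:
            # No marker: treat the line as a continuation of the open field.
            if field is not None and stripped:
                current[field] = (current[field] + " " + stripped).strip()

    if current:
        records.append(current)
    return records


sample = "TITLE: Data loading\nBODY: first line\nsecond line\n---\nBODY: only body\nTITLE: Reordered\n"
parsed = parse_records(sample, {"title": "TITLE:", "body": "BODY:"})
```

Each resulting dict can be written out as a CSV row, for example with `csv.DictWriter` using the marker names as the column header.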
Dependencies
pandas chardet openpyxl pyarrow sqlalchemy huggingface_hub httpx
Related Skills from Other Suites
- Linguistic Corpus — corpus sourcing for NLP
- Speech Processing — speech data pipelines