magic-data-transformation

Transform data by reshaping, aggregating, merging, deriving columns, and delivering to external destinations (database, HuggingFace Hub). Use when: (1) pivoting, melting, or unpivoting tables, (2) grouping and aggregating data, (3) joining or merging multiple datasets, (4) creating calculated or derived columns, (5) uploading/delivering/pushing data to HuggingFace Hub or database. Trigger keywords: pivot, melt, reshape, groupby, aggregate, merge, join, vlookup, deliver, upload, HuggingFace, push to Hub.

When It Activates

Use this skill when reshaping, joining, or deriving data. Trigger phrases: transform, reshape, pivot, melt, merge, join, aggregate, group by, derive column, split dataset, convert format, instruction tuning.

Need to pivot, melt, stack, or unstack data
Need group-by aggregations
Need to merge/join multiple datasets
Need to create calculated or derived columns
After magic-data-cleaning, before analysis

When NOT to Use: Use magic-data-cleaning for quality fixes; use magic-data-exploration for analysis.

Quick Facts

Property	Value
Version	2.0.0
Complexity	medium
Phase	1
Scripts	7

Scripts

Callable Tools (call directly via CLI)

Script	Purpose	Example
`deliver_to_db.py`	Write transformed data to a database table	`python3 deliver_to_db.py --input data.parquet --table target_table`
`deliver_to_hf.py`	Publish dataset to HuggingFace Hub	`python3 deliver_to_hf.py --input dataset_folder/ --repo org/repo-name`

Scriptable Tools (call directly or read + adapt)

Script	Standard CLI Usage	When to Customize
`validate_transform.py`	`python3 validate_transform.py original.csv transformed.csv report.csv`	`--expected-shape rows,cols` for dimensional assertion; `--key-columns id,date` to verify key preservation
`aggregate.py`	`python3 aggregate.py data.csv agg.csv --group_cols region --agg_cols revenue --functions mean,sum,count`	`--explain` for dry-run
`merge_datasets.py`	`python3 merge_datasets.py left.csv right.csv merged.csv --on customer_id --how left`	`--left-on`/`--right-on` when key names differ
`reshape.py`	`python3 reshape.py data.csv reshaped.csv --operation pivot --index_col date --columns_col region --values_col revenue`	`--operation stack|unstack` needs only the input and output paths (no column parameters)
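When adapting these scripts, it can help to see what the CLI examples above roughly correspond to in plain pandas. The sketch below is an approximation under stated assumptions, not the scripts' actual implementation: the sample data and the `customers` table are hypothetical, and only the column names are taken from the examples in the table.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the CLI examples above.
sales = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "region": ["east", "west", "east", "west"],
    "customer_id": [1, 2, 1, 3],
    "revenue": [100.0, 200.0, 150.0, 50.0],
})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bo"]})

# Roughly: aggregate.py --group_cols region --agg_cols revenue --functions mean,sum,count
agg = sales.groupby("region")["revenue"].agg(["mean", "sum", "count"]).reset_index()

# Roughly: merge_datasets.py --on customer_id --how left
merged = sales.merge(customers, on="customer_id", how="left")

# Roughly: reshape.py --operation pivot --index_col date --columns_col region --values_col revenue
wide = sales.pivot(index="date", columns="region", values="revenue")

# Melt is the inverse reshape: back from wide to long form.
long = wide.reset_index().melt(id_vars="date", var_name="region", value_name="revenue")

# Checks in the spirit of validate_transform.py (--expected-shape, --key-columns):
# a left join must keep every left row and every left key.
assert merged.shape == (4, 5)
assert set(sales["customer_id"]) == set(merged["customer_id"])
```

A left join keeps the row for `customer_id` 3 even though it has no match in `customers`; its `name` is simply missing, which is the kind of side effect a post-transform validation step is meant to surface.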

Reference Implementations (read patterns, write custom code)

Script	Demonstrates	Key Pattern
`derive_columns.py`	Safe expression evaluation for computed columns	Safe `pd.eval()` sandbox; blocked unsafe patterns; `--expressions` is effectively required
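The pattern named above can be sketched as follows. This is a minimal illustration, assuming expressions arrive as `name = expression` strings; the `derive` helper, the `BLOCKED` deny-list, and the input format are all hypothetical and are not taken from `derive_columns.py`. A deny-list like this reduces risk but is not a complete sandbox.

```python
import re

import pandas as pd

# Hypothetical deny-list: dunder access, import/exec/eval/open, and "@"
# (which pandas uses to reach Python variables outside the frame).
BLOCKED = re.compile(r"__|\bimport\b|\bexec\b|\beval\b|\bopen\b|@")


def derive(df, expressions):
    """Return a copy of df with one new column per 'name = expression' string."""
    out = df.copy()
    for item in expressions:
        name, sep, expr = item.partition("=")
        name, expr = name.strip(), expr.strip()
        if not sep or not name.isidentifier() or not expr:
            raise ValueError(f"expected 'name = expression', got {item!r}")
        if BLOCKED.search(expr):
            raise ValueError(f"blocked pattern in expression: {expr!r}")
        # DataFrame.eval resolves bare names against the frame's columns.
        out[name] = out.eval(expr)
    return out


orders = pd.DataFrame({"price": [2.0, 3.0], "qty": [5, 4]})
result = derive(orders, ["total = price * qty"])
```

Rejecting an expression before it reaches `eval` keeps the failure mode simple: the caller gets a `ValueError` naming the offending expression, and the input frame is never modified because the work happens on a copy.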

New in v2.0.0

`--auto-checkpoint` and `--explain` Flags

`aggregate.py`, `merge_datasets.py`, and `derive_columns.py` support:

`--explain`: prints a JSON execution plan without writing any files. Use it to preview what the operation will do before committing.
`--auto-checkpoint`: creates a numbered snapshot (`ckpt_NN_*.csv`) after each successful operation.

```bash
# Preview an aggregation without writing output
python3 aggregate.py data.csv agg.csv --group_cols region --agg_cols revenue --functions sum --explain

# Run with automatic versioned checkpoints
python3 merge_datasets.py left.csv right.csv merged.csv --on id --auto-checkpoint
```
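For custom code that should behave like these two flags, one possible shape is sketched below. Everything here is an assumption made for illustration: the plan's JSON fields, the `explain_plan` and `write_checkpoint` helpers, and the checkpoint directory are hypothetical. Only the `ckpt_NN_*.csv` naming comes from the description above.

```python
import json
import re
import tempfile
from pathlib import Path

import pandas as pd


def explain_plan(operation, inputs, output, params):
    """Describe an operation as JSON without reading or writing any data."""
    plan = {
        "operation": operation,
        "inputs": list(inputs),
        "output": output,
        "params": params,
    }
    return json.dumps(plan, indent=2)


def write_checkpoint(df, directory, label):
    """Save df as ckpt_NN_<label>.csv using the next unused number."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    numbers = []
    for path in directory.glob("ckpt_*_*.csv"):
        match = re.match(r"ckpt_(\d+)_", path.name)
        if match:
            numbers.append(int(match.group(1)))
    target = directory / f"ckpt_{max(numbers, default=0) + 1:02d}_{label}.csv"
    df.to_csv(target, index=False)
    return target


plan = explain_plan("aggregate", ["data.csv"], "agg.csv",
                    {"group_cols": ["region"], "functions": ["sum"]})

frame = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
with tempfile.TemporaryDirectory() as tmp:
    first = write_checkpoint(frame, tmp, "merge")
    second = write_checkpoint(frame, tmp, "aggregate")
    checkpoint_names = [first.name, second.name]
```

Numbering from the files already on disk, rather than from an in-memory counter, lets separate script invocations continue the same sequence.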

JSONL Support

All transformation scripts now accept .jsonl input natively. Output format is determined by the output file extension.
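Custom code that follows the same convention could dispatch on the file extension, as in this minimal sketch. The `read_table` and `write_table` helpers are hypothetical, and only `.csv` and `.jsonl` are handled here; the scripts themselves may support more formats.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_table(path):
    """Load a table, choosing the reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".jsonl":
        return pd.read_json(path, lines=True)  # one JSON object per line
    if suffix == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"unsupported input format: {suffix!r}")


def write_table(df, path):
    """Save a table, choosing the writer from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".jsonl":
        df.to_json(path, orient="records", lines=True)
    elif suffix == ".csv":
        df.to_csv(path, index=False)
    else:
        raise ValueError(f"unsupported output format: {suffix!r}")


# Round trip: CSV in, JSONL out, then read the JSONL back.
source = pd.DataFrame({"id": [1, 2], "region": ["east", "west"]})
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    jsonl_path = Path(tmp) / "data.jsonl"
    write_table(source, csv_path)
    write_table(read_table(csv_path), jsonl_path)
    round_trip = read_table(jsonl_path)
    jsonl_line_count = len(jsonl_path.read_text().splitlines())
```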

Dependencies

pandas numpy
