magic-data-transformation
Transform data by reshaping, aggregating, merging, deriving columns, and delivering to external destinations (database, HuggingFace Hub). Use when: (1) pivoting, melting, or unpivoting tables, (2) grouping and aggregating data, (3) joining or merging multiple datasets, (4) creating calculated or derived columns, (5) uploading/delivering/pushing data to HuggingFace Hub or database. Trigger keywords: pivot, melt, reshape, groupby, aggregate, merge, join, vlookup, deliver, upload, HuggingFace, push to Hub.
When It Activates
Use this skill when reshaping, joining, or deriving data. Trigger phrases: transform, reshape, pivot, melt, merge, join, aggregate, group by, derive column, split dataset, convert format, instruction tuning.
- Need to pivot, melt, stack, or unstack data
- Need group-by aggregations
- Need to merge/join multiple datasets
- Need to create calculated or derived columns
- After magic-data-cleaning, before analysis
When NOT to Use: Use magic-data-cleaning for quality fixes; use magic-data-exploration for analysis.
Quick Facts
| Property | Value |
|---|---|
| Version | 2.0.0 |
| Complexity | medium |
| Phase | 1 |
| Scripts | 7 |
Tags
data-science transformation reshape aggregate merge join
Scripts
Callable Tools (call directly via CLI)
| Script | Purpose | Example |
|---|---|---|
| `deliver_to_db.py` | Write transformed data to a database table | `python3 deliver_to_db.py --input data.parquet --table target_table` |
| `deliver_to_hf.py` | Publish a dataset to HuggingFace Hub | `python3 deliver_to_hf.py --input dataset_folder/ --repo org/repo-name` |
Scriptable Tools (call directly or read + adapt)
| Script | Standard CLI Usage | When to Customize |
|---|---|---|
| `validate_transform.py` | `python3 validate_transform.py original.csv transformed.csv report.csv` | `--expected-shape rows,cols` for a dimensional assertion; `--key-columns id,date` to verify key preservation |
| `aggregate.py` | `python3 aggregate.py data.csv agg.csv --group_cols region --agg_cols revenue --functions mean,sum,count` | `--explain` for a dry run |
| `merge_datasets.py` | `python3 merge_datasets.py left.csv right.csv merged.csv --on customer_id --how left` | `--left-on`/`--right-on` when key names differ |
| `reshape.py` | `python3 reshape.py data.csv reshaped.csv --operation pivot --index_col date --columns_col region --values_col revenue` | `--operation stack` or `--operation unstack` needs only input and output paths (no column parameters) |
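When adapting these scripts, it can help to see the pandas calls that the CLI examples above roughly correspond to. The following is a minimal sketch, not the scripts' actual implementation: the sample data and the `customers` table are hypothetical, and only the column names (`region`, `revenue`, `customer_id`, `date`) are taken from the examples.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the CLI examples above.
sales = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "region": ["east", "west", "east", "west"],
    "customer_id": [1, 2, 1, 3],
    "revenue": [100.0, 150.0, 200.0, 50.0],
})
customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["smb", "enterprise"]})

# aggregate.py-style: group by region, apply mean/sum/count to revenue.
agg = sales.groupby("region")["revenue"].agg(["mean", "sum", "count"]).reset_index()

# merge_datasets.py-style: a left join keeps every row of the left table;
# unmatched keys (customer_id 3 here) get NaN in the right-hand columns.
merged = sales.merge(customers, on="customer_id", how="left")

# reshape.py-style pivot: one row per date, one column per region.
wide = sales.pivot(index="date", columns="region", values="revenue")

# Melt reverses the pivot back to long form.
long = wide.reset_index().melt(id_vars="date", var_name="region", value_name="revenue")
```

Comparing row counts before and after the merge (`len(sales)` against `len(merged)`) is a quick way to catch accidental row duplication from non-unique join keys.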
Reference Implementations (read patterns, write custom code)
| Script | Demonstrates | Key Pattern |
|---|---|---|
| `derive_columns.py` | Safe expression evaluation for computed columns | Safe `pd.eval()` sandbox; blocked unsafe patterns; `--expressions` is effectively required |
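The pattern of screening expressions before evaluation might look like the sketch below. It is an illustration under stated assumptions, not the code in `derive_columns.py`: the `BLOCKED` pattern list and the `derive` helper are hypothetical, and it uses `DataFrame.eval` (the DataFrame-bound counterpart of `pd.eval()`) for the assignment syntax.

```python
import re
import pandas as pd

# Hypothetical blocklist; the real script may block a different set of patterns.
BLOCKED = re.compile(r"__|@|\bimport\b|\blambda\b|\bexec\b|\beval\b|\bopen\b")

def derive(df: pd.DataFrame, expressions: list[str]) -> pd.DataFrame:
    """Apply 'new_col = expr' assignments in order, rejecting risky expressions."""
    out = df.copy()
    for expr in expressions:
        if "=" not in expr:
            raise ValueError(f"expected 'name = expression', got: {expr!r}")
        if BLOCKED.search(expr):
            raise ValueError(f"blocked pattern in expression: {expr!r}")
        # DataFrame.eval returns a new frame with the assigned column added.
        out = out.eval(expr)
    return out

df = pd.DataFrame({"revenue": [100.0, 250.0], "cost": [60.0, 200.0]})
result = derive(df, ["profit = revenue - cost", "margin = profit / revenue"])
```

Because expressions are applied in order, a later expression can reference a column created by an earlier one, as `margin` does with `profit` here.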
New in v2.0.0
--auto-checkpoint and --explain Flags
aggregate.py, merge_datasets.py, and derive_columns.py support:
- `--explain`: prints a JSON execution plan without writing any files. Use it to preview what the operation will do before committing.
- `--auto-checkpoint`: creates a numbered snapshot (`ckpt_NN_*.csv`) after each successful operation.
# Preview an aggregation without writing output
python3 aggregate.py data.csv agg.csv --group_cols region --agg_cols revenue --functions sum --explain
# Run with automatic versioned checkpoints
python3 merge_datasets.py left.csv right.csv merged.csv --on id --auto-checkpoint
JSONL Support
All transformation scripts now accept .jsonl input natively. Output format is determined by the output file extension.
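Extension-based format selection can be sketched as follows when writing custom code around the scripts. The helpers `read_table` and `write_table` are hypothetical names, and the set of supported extensions is an assumption; the scripts themselves may handle more formats or dispatch differently.

```python
import tempfile
from pathlib import Path

import pandas as pd

def read_table(path) -> pd.DataFrame:
    """Load a table, choosing the parser from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".jsonl":
        return pd.read_json(path, lines=True)  # one JSON object per line
    if suffix == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"unsupported input extension: {suffix!r}")

def write_table(df: pd.DataFrame, path) -> None:
    """Write a table in the format implied by the output extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".jsonl":
        df.to_json(path, orient="records", lines=True)
    elif suffix == ".csv":
        df.to_csv(path, index=False)
    else:
        raise ValueError(f"unsupported output extension: {suffix!r}")

# Round trip: JSONL in, CSV out, using a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data.jsonl"
    src.write_text('{"region": "east", "revenue": 100}\n{"region": "west", "revenue": 150}\n')
    df = read_table(src)
    write_table(df, Path(tmp) / "out.csv")
    round_trip = pd.read_csv(Path(tmp) / "out.csv")
```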
Dependencies
pandas numpy