magic-data-synthesis
Synthesize, generate, and transform data using LLM-based operations via the DataDesigner engine. DataDesigner is the primary engine for all LLM-powered data generation tasks.
When It Activates
Use this skill when generating data with an LLM or creating synthetic examples. Trigger phrases: synthesize, generate, fill missing, translate, annotate, enrich, augment data, create new examples, DataDesigner, LLM generation. Typical situations:
- Columns have missing values, sentinels ("X", "N/A", "TBD"), or placeholders needing contextual generation
- Format conversion (HTML→markdown), translation, annotation, labeling, summarization
- Structured field extraction from unstructured text into multiple columns
- Reference join leaves gaps → LLM fills remaining (use enrich_from_reference.py first)
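The reference-first pattern from the last bullet can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the logic of enrich_from_reference.py: the helper names `is_missing` and `fill_from_reference` are hypothetical, and the sentinel set is taken from the examples above plus the empty string.

```python
# Sentinel strings from the bullet list above; the empty string is an added assumption.
SENTINELS = {"X", "N/A", "TBD", ""}

def is_missing(value):
    """Return True for None or a sentinel placeholder (surrounding whitespace ignored)."""
    return value is None or str(value).strip() in SENTINELS

def fill_from_reference(rows, reference, join_col, target_col):
    """Fill target_col from a reference lookup and return rows that still have gaps."""
    # Build a lookup of usable reference values keyed by the join column.
    lookup = {
        ref[join_col]: ref[target_col]
        for ref in reference
        if not is_missing(ref.get(target_col))
    }
    still_missing = []
    for row in rows:
        if not is_missing(row.get(target_col)):
            continue  # already has a real value
        if row[join_col] in lookup:
            row[target_col] = lookup[row[join_col]]  # cheap, deterministic fill
        else:
            still_missing.append(row)  # candidate for LLM generation
    return still_missing

rows = [
    {"id": 1, "brand": "Acme"},
    {"id": 2, "brand": "TBD"},
    {"id": 3, "brand": "N/A"},
]
reference = [{"id": 2, "brand": "Globex"}]

remaining = fill_from_reference(rows, reference, join_col="id", target_col="brand")
print([r["id"] for r in remaining])
```

Only the rows returned in `remaining` would need an LLM call, which keeps generation cost proportional to the gaps the reference data cannot cover.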
When NOT to Use: Rule-based fixes (regex, type casting, dedup) → magic-data-cleaning. Reshaping, joins, aggregation → magic-data-transformation. Schema enforcement → magic-data-validation. If a Python function can produce correct output for every case, use programmatic generation instead of DataDesigner.
Quick Facts
| Property | Value |
|---|---|
| Version | 3.0.0 |
| Complexity | high |
| Phase | 2 |
| Scripts | 6 |
Tags
data-science synthesis generation llm transformation enrichment annotation
DataDesigner: Primary Synthesis Engine
DataDesigner is the primary engine for all LLM-powered data generation. It replaces the legacy batch_synthesize.py approach with a config-driven model that provides preview gates, cost estimation, and automated quality control.
Key Differences from Legacy Approach
| Aspect | Legacy (batch_synthesize.py) | DataDesigner |
|---|---|---|
| Engine | Script-based batch generation | Config-driven Python API |
| Config | Inline parameters per run | Python file with load_config_builder() |
| Preview | Optional | Hard gate before full generation |
| Quality | Manual review | LLMJudgeColumnConfig automated scoring |
| Cost | Unknown until run | estimate_from_preview() upfront |
| Models | Remote API only | Local models + remote APIs + thinking models |
Config-Driven Generation
DataDesigner configurations are Python files with a load_config_builder() function. The agent writes this config adapted to the task:
from data_designer import DataDesigner, LLMJudgeColumnConfig

def load_config_builder():
    dd = DataDesigner()
    # Define source columns
    dd.add_column("product_name", dtype="string", description="Product name from catalog")
    dd.add_column("category", dtype="category", values=["electronics", "clothing", "home"])
    # LLM-generated column with quality scoring
    dd.add_column(
        "product_description",
        dtype="string",
        prompt="Write a 2-sentence product description for {product_name} in category {category}",
        judge=LLMJudgeColumnConfig(
            criteria="Is the description accurate, informative, and appropriate for the category?",
            passing_score=0.8,
        ),
    )
    return dd

Preview Gate (Required)
Always run a preview before full generation. The preview gate shows sample output and cost estimate before committing:
dd = load_config_builder()
# Preview: generates 5 rows, estimates cost
preview = dd.preview(n_rows=5)
print(preview.sample_data)
print(f"Estimated cost for 1000 rows: ${dd.estimate_from_preview(preview, n_rows=1000):.4f}")
# Only proceed after user confirms the preview looks correct
result = dd.generate(n_rows=1000)
result.to_parquet("data/output/synthesized.parquet")
LLMJudgeColumnConfig: Automated Quality Scoring
LLMJudgeColumnConfig attaches an LLM judge to any generated column. The judge scores each generated value and flags rows that fall below the threshold:
from data_designer import LLMJudgeColumnConfig
judge = LLMJudgeColumnConfig(
    criteria="Is the translation accurate and natural-sounding?",
    passing_score=0.75,  # 0.0–1.0 scale
    model="claude-3-5-haiku",  # Use a fast model for judging
)
Rows that fail the judge threshold are returned in result.failed_rows for review or regeneration.
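To make the threshold behavior concrete, here is a minimal sketch of splitting scored rows into passing and failing sets. It does not call DataDesigner. The `judge_score` field name and the inclusive comparison (a score equal to the threshold passes) are assumptions made for illustration.

```python
def split_by_score(rows, passing_score=0.75, score_key="judge_score"):
    """Partition rows into (passed, failed) using an inclusive threshold."""
    passed = [r for r in rows if r[score_key] >= passing_score]
    failed = [r for r in rows if r[score_key] < passing_score]
    return passed, failed

# Hypothetical judged output on the 0.0-1.0 scale.
scored = [
    {"text": "Bonjour le monde", "judge_score": 0.92},
    {"text": "Salut monde", "judge_score": 0.75},
    {"text": "???", "judge_score": 0.40},
]

passed, failed = split_by_score(scored, passing_score=0.75)
print(len(passed), "passed;", len(failed), "queued for review or regeneration")
```

Failed rows can then be reviewed by hand or fed back into another generation pass with a revised prompt.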
Scripts
Scriptable Tools (call directly or read + adapt)
| Script | Standard CLI Usage | When to Customize |
|---|---|---|
| synthesis_config.py | python3 synthesis_config.py --template fill_missing --output config.py | --template for starting points; edit the generated config for your specific columns |
| generate_column.py | python3 generate_column.py data.csv output.csv --column description --prompt "Write a description for {name}" | Use for single-column generation without a full DataDesigner config |
| validate_synthetic.py | python3 validate_synthetic.py output.csv validation.json | --judge-threshold 0.8 to adjust pass rate; --sample N for spot-checking |
| enrich_from_reference.py | python3 enrich_from_reference.py data.csv reference.csv enriched.csv --join-col id | Use when reference data can fill most gaps; LLM fills only what reference misses |
Reference Implementations (read patterns, write custom code)
| Script | Demonstrates | Key Pattern |
|---|---|---|
| batch_synthesize.py | Legacy batch generation approach | Kept for reference; DataDesigner is the preferred approach for new work |
| synthesis_prompt_builder.py | Prompt construction patterns | Template variables, context injection, constraint encoding in prompts |
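The template-variable pattern mentioned above can be illustrated with the standard library alone. This is a sketch, not the contents of synthesis_prompt_builder.py: `template_fields` and `build_prompt` are hypothetical helpers, and the sample row is invented.

```python
import string

def template_fields(template):
    """List the {placeholder} names used in a prompt template, in order."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]

def build_prompt(template, row):
    """Fill a template from a row, failing early if a referenced column is absent."""
    missing = [f for f in template_fields(template) if f not in row]
    if missing:
        raise KeyError(f"row is missing template fields: {missing}")
    return template.format_map(row)

template = "Write a 2-sentence product description for {product_name} in category {category}"
row = {"product_name": "Trail Lamp", "category": "home"}

print(template_fields(template))
print(build_prompt(template, row))
```

Checking for missing fields before formatting surfaces a misnamed column at config time rather than partway through a paid generation run.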
Cost Estimation
Always estimate cost before full generation. The estimate is based on token counts derived from the preview sample:
preview = dd.preview(n_rows=10)
cost_1k = dd.estimate_from_preview(preview, n_rows=1000)
cost_10k = dd.estimate_from_preview(preview, n_rows=10000)
print(f"1K rows: ${cost_1k:.3f} | 10K rows: ${cost_10k:.3f}")
Cost scales linearly with row count for most column types. Thinking models (o1, claude-3-7-sonnet with extended thinking) cost significantly more, so use them only for complex reasoning tasks.
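The linear scaling can be worked through by hand. The sketch below is not the implementation of estimate_from_preview(); it assumes cost is driven only by input and output token counts, and the token totals and per-million-token prices are placeholder numbers chosen for the arithmetic.

```python
def extrapolate_cost(preview_input_tokens, preview_output_tokens, preview_rows,
                     target_rows, usd_per_m_input, usd_per_m_output):
    """Scale preview token usage linearly to target_rows and price it in USD."""
    if preview_rows <= 0:
        raise ValueError("preview_rows must be positive")
    scale = target_rows / preview_rows
    input_cost = preview_input_tokens * scale * usd_per_m_input / 1_000_000
    output_cost = preview_output_tokens * scale * usd_per_m_output / 1_000_000
    return input_cost + output_cost

# Placeholder preview: 10 rows used 1,500 input and 600 output tokens.
# Placeholder prices: $1.00 per million input tokens, $5.00 per million output tokens.
cost_1k = extrapolate_cost(1500, 600, 10, 1_000, 1.0, 5.0)
cost_10k = extrapolate_cost(1500, 600, 10, 10_000, 1.0, 5.0)
print(f"1K rows: ${cost_1k:.3f} | 10K rows: ${cost_10k:.3f}")
```

With these placeholder figures, 1,000 rows come to about $0.45 and 10,000 rows to about $4.50, a tenfold increase for tenfold rows. A small preview can misestimate when row lengths vary widely, so a larger preview sample gives a steadier per-row average.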
Supported Operations
| Operation | Description | Key Config |
|---|---|---|
| Fill missing values | Replace nulls/sentinels with contextually generated content | prompt references other columns for context |
| Translation | Translate a text column to another language | prompt="Translate to French: {text}" |
| Format conversion | Convert HTML to Markdown, JSON to prose, etc. | prompt specifies input/output format |
| Annotation/labeling | Classify or label records | Use dtype="category" with values=[...] |
| Field extraction | Extract structured fields from unstructured text | dtype="string" + extraction prompt |
| New column generation | Generate a new column from existing context | prompt references multiple source columns |
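For the annotation/labeling row, a simple post-generation check is to confirm that every label falls in the allowed value list. The sketch below is independent of DataDesigner: `find_invalid_labels` is a hypothetical helper, and the sample rows are invented.

```python
def find_invalid_labels(rows, column, allowed):
    """Return (row_index, value) pairs whose label is outside the allowed set."""
    allowed_set = set(allowed)
    return [
        (i, row.get(column))
        for i, row in enumerate(rows)
        if row.get(column) not in allowed_set
    ]

ALLOWED = ["electronics", "clothing", "home"]
labeled = [
    {"product_name": "USB-C hub", "category": "electronics"},
    {"product_name": "Wool scarf", "category": "Clothing"},   # wrong case
    {"product_name": "Desk lamp", "category": "furniture"},   # not in the list
]

invalid = find_invalid_labels(labeled, "category", ALLOWED)
print(invalid)
```

The comparison is case-sensitive on purpose, so near-miss labels are surfaced for correction instead of being silently accepted.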
Dependencies
pandas numpy data-designer tiktoken