magic-data-synthesis

Synthesize, generate, and transform data using LLM-based operations via the DataDesigner engine. DataDesigner is the primary engine for all LLM-powered data generation tasks.

When It Activates

Use this skill when generating data with an LLM or creating synthetic examples. Trigger phrases: synthesize, generate, fill missing, translate, annotate, enrich, augment data, create new examples, DataDesigner, LLM generation.

Typical situations:
Columns contain missing values, sentinels ("X", "N/A", "TBD"), or placeholders that need contextual generation
Format conversion (HTML→markdown), translation, annotation, labeling, or summarization
Structured field extraction from unstructured text into multiple columns
A reference join leaves gaps and the LLM fills the remainder (run enrich_from_reference.py first)
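
Before reaching for synthesis, it can help to measure how many values actually match the first situation above (nulls and sentinels). The following is a minimal sketch in plain Python over a list of row dicts; the helper name `count_gaps` and the sample rows are illustrative assumptions and are not part of this skill's scripts.

```python
from collections import Counter

# Sentinel strings taken from the list above; extend for your dataset.
SENTINELS = {"X", "N/A", "TBD"}


def count_gaps(rows, sentinels=SENTINELS):
    """Return a Counter mapping column name -> number of blank or sentinel values."""
    gaps = Counter()
    for row in rows:
        for column, value in row.items():
            is_blank = value is None or (isinstance(value, str) and value.strip() == "")
            is_sentinel = isinstance(value, str) and value.strip() in sentinels
            if is_blank or is_sentinel:
                gaps[column] += 1
    return gaps


# Hypothetical sample rows, for illustration only.
rows = [
    {"name": "Desk lamp", "description": "TBD"},
    {"name": "Wool scarf", "description": None},
    {"name": "N/A", "description": "Soft merino scarf."},
]
gaps = count_gaps(rows)
```

Columns with a nonzero count are candidates for contextual generation; columns with zero gaps probably do not need this skill.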

When NOT to Use: Rule-based fixes (regex, type casting, dedup) → magic-data-cleaning. Reshaping, joins, aggregation → magic-data-transformation. Schema enforcement → magic-data-validation. If a Python function can produce correct output for every case, use programmatic generation instead of DataDesigner.
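
As an illustration of the last rule, a column that a deterministic function can derive correctly for every row does not need an LLM. The sketch below assumes a hypothetical task (building a URL slug from a product name); the function name `make_slug` is illustrative only.

```python
import re


def make_slug(product_name):
    """Deterministic slug: lowercase, runs of non-alphanumerics collapsed to one hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", product_name.lower())
    return slug.strip("-")


slug = make_slug("Desk Lamp (LED)")
```

Because the output is fully determined by the input, there is nothing for a judge to score and no generation cost to estimate.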

Quick Facts

Property	Value
Version	3.0.0
Complexity	high
Phase	2
Scripts	6

DataDesigner: Primary Synthesis Engine

DataDesigner is the primary engine for all LLM-powered data generation. It replaces the legacy batch_synthesize.py approach with a config-driven model that provides preview gates, cost estimation, and automated quality control.

Key Differences from Legacy Approach

Aspect	Legacy (`batch_synthesize.py`)	DataDesigner
Engine	Script-based batch generation	Config-driven Python API
Config	Inline parameters per run	Python file with `load_config_builder()`
Preview	Optional	Hard gate before full generation
Quality	Manual review	`LLMJudgeColumnConfig` automated scoring
Cost	Unknown until run	`estimate_from_preview()` upfront
Models	Remote API only	Local models + remote APIs + thinking models

Config-Driven Generation

DataDesigner configurations are Python files that define a load_config_builder() function. The agent writes this config and adapts it to the task:

from data_designer import DataDesigner, LLMJudgeColumnConfig

def load_config_builder():
    dd = DataDesigner()

    # Define source columns
    dd.add_column("product_name", dtype="string", description="Product name from catalog")
    dd.add_column("category", dtype="category", values=["electronics", "clothing", "home"])

    # LLM-generated column with quality scoring
    dd.add_column(
        "product_description",
        dtype="string",
        prompt="Write a 2-sentence product description for {product_name} in category {category}",
        judge=LLMJudgeColumnConfig(
            criteria="Is the description accurate, informative, and appropriate for the category?",
            passing_score=0.8
        )
    )
    return dd

Preview Gate (Required)

Always run a preview before full generation. The preview gate shows sample output and a cost estimate before you commit to a full run:

dd = load_config_builder()

# Preview: generates 5 rows, estimates cost
preview = dd.preview(n_rows=5)
print(preview.sample_data)
print(f"Estimated cost for 1000 rows: ${dd.estimate_from_preview(preview, n_rows=1000):.4f}")

# Only proceed after user confirms the preview looks correct
result = dd.generate(n_rows=1000)
result.to_parquet("data/output/synthesized.parquet")
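
The snippet above relies on a comment to enforce the confirmation step. One way to make the gate explicit in code is a small wrapper that refuses to generate unless a confirmation callback returns True. This is a sketch under assumptions: `run_with_preview_gate` and the stub callables are hypothetical and are not part of DataDesigner.

```python
def run_with_preview_gate(preview_fn, generate_fn, confirm_fn,
                          preview_rows=5, full_rows=1000):
    """Run a preview, ask for confirmation, and only then run full generation.

    All three callables are supplied by the caller:
      preview_fn(n_rows=...)  -> preview object
      confirm_fn(preview)     -> bool
      generate_fn(n_rows=...) -> result
    Returns the generation result, or None if the preview was rejected.
    """
    preview = preview_fn(n_rows=preview_rows)
    if not confirm_fn(preview):
        return None
    return generate_fn(n_rows=full_rows)


# Stub callables so the sketch runs without any LLM backend.
calls = []


def fake_preview(n_rows):
    calls.append(("preview", n_rows))
    return [f"sample {i}" for i in range(n_rows)]


def fake_generate(n_rows):
    calls.append(("generate", n_rows))
    return [f"row {i}" for i in range(n_rows)]


rejected = run_with_preview_gate(fake_preview, fake_generate, lambda p: False)
accepted = run_with_preview_gate(fake_preview, fake_generate, lambda p: True, full_rows=3)
```

Assuming `dd.preview` and `dd.generate` accept `n_rows` as shown earlier, they could be passed in place of the stubs, with `confirm_fn` prompting the user.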

LLMJudgeColumnConfig — Automated Quality Scoring

LLMJudgeColumnConfig attaches an LLM judge to any generated column. The judge scores each generated value and flags rows that fall below the threshold:

from data_designer import LLMJudgeColumnConfig

judge = LLMJudgeColumnConfig(
    criteria="Is the translation accurate and natural-sounding?",
    passing_score=0.75,  # 0.0–1.0 scale
    model="claude-3-5-haiku"  # Use a fast model for judging
)

Rows that fail the judge threshold are returned in result.failed_rows for review or regeneration.
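
A regeneration pass over failed rows might look like the sketch below. It assumes failed rows can be treated as dicts and that the caller supplies its own generate and score callables; the shape of `result.failed_rows` is not documented here, and `regenerate_failed` is a hypothetical helper, not a DataDesigner function.

```python
def regenerate_failed(failed_rows, generate_one, score_one,
                      passing_score=0.75, max_attempts=3):
    """Retry each failed row up to max_attempts times.

    generate_one(row) -> new value, score_one(row, value) -> float in [0, 1].
    Returns (recovered, still_failing).
    """
    recovered, still_failing = [], []
    for row in failed_rows:
        for _ in range(max_attempts):
            value = generate_one(row)
            if score_one(row, value) >= passing_score:
                recovered.append({**row, "value": value})
                break
        else:
            # No attempt reached the threshold; leave the row for manual review.
            still_failing.append(row)
    return recovered, still_failing


# Deterministic stubs standing in for the LLM and the judge.
attempts = {}


def fake_generate(row):
    attempts[row["id"]] = attempts.get(row["id"], 0) + 1
    return f"{row['text']} (attempt {attempts[row['id']]})"


def fake_score(row, value):
    # Row 1 passes on its second attempt; row 2 never passes.
    return 0.9 if row["id"] == 1 and attempts[row["id"]] >= 2 else 0.4


recovered, still_failing = regenerate_failed(
    [{"id": 1, "text": "bonjour"}, {"id": 2, "text": "merci"}],
    fake_generate,
    fake_score,
)
```

Capping the number of attempts keeps regeneration cost bounded; rows that still fail are better routed to review than retried indefinitely.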

Scripts

Scriptable Tools (call directly or read + adapt)

Script	Standard CLI Usage	When to Customize
`synthesis_config.py`	`python3 synthesis_config.py --template fill_missing --output config.py`	`--template` for starting points; edit the generated config for your specific columns
`generate_column.py`	`python3 generate_column.py data.csv output.csv --column description --prompt "Write a description for {name}"`	Use for single-column generation without a full DataDesigner config
`validate_synthetic.py`	`python3 validate_synthetic.py output.csv validation.json`	`--judge-threshold 0.8` to adjust pass rate; `--sample N` for spot-checking
`enrich_from_reference.py`	`python3 enrich_from_reference.py data.csv reference.csv enriched.csv --join-col id`	Use when reference data can fill most gaps; LLM fills only what reference misses
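
The reference-first pattern behind `enrich_from_reference.py` can be sketched as follows: fill what the reference table covers, then hand only the leftover gaps to the LLM. This is an illustration of the idea in plain Python, not the script's actual implementation; `fill_from_reference` and the sample data are hypothetical.

```python
def fill_from_reference(rows, reference, join_col, target_col,
                        sentinels=("X", "N/A", "TBD")):
    """Fill target_col from a reference lookup keyed on join_col.

    Returns (enriched_rows, still_missing_keys). Rows are copied, not mutated.
    """
    lookup = {ref[join_col]: ref[target_col] for ref in reference if ref.get(target_col)}
    enriched, still_missing = [], []
    for row in rows:
        value = row.get(target_col)
        missing = value is None or value in sentinels
        if missing and row[join_col] in lookup:
            row = {**row, target_col: lookup[row[join_col]]}
        elif missing:
            still_missing.append(row[join_col])
        enriched.append(row)
    return enriched, still_missing


rows = [
    {"id": 1, "brand": "Acme"},
    {"id": 2, "brand": "N/A"},
    {"id": 3, "brand": None},
]
reference = [{"id": 2, "brand": "Globex"}]
enriched, still_missing = fill_from_reference(rows, reference, "id", "brand")
```

Only the keys in `still_missing` would then need LLM generation, which keeps the paid portion of the job as small as possible.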

Reference Implementations (read patterns, write custom code)

Script	Demonstrates	Key Pattern
`batch_synthesize.py`	Legacy batch generation approach	Kept for reference; DataDesigner is the preferred approach for new work
`synthesis_prompt_builder.py`	Prompt construction patterns	Template variables, context injection, constraint encoding in prompts
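
The prompt patterns named in the last row (template variables, context injection, constraint encoding) can be sketched with the standard library alone. This assumes `str.format`-style placeholders like those in the prompts shown in this document; whether DataDesigner substitutes them the same way is an assumption, and `template_fields` and `render_prompt` are hypothetical helpers, not code from `synthesis_prompt_builder.py`.

```python
from string import Formatter


def template_fields(prompt):
    """Return the placeholder names used in a str.format-style template."""
    return [name for _, name, _, _ in Formatter().parse(prompt) if name]


def render_prompt(prompt, row, constraints=()):
    """Inject row values into the template and append explicit constraints."""
    missing = [field for field in template_fields(prompt) if field not in row]
    if missing:
        raise KeyError(f"row is missing template fields: {missing}")
    rendered = prompt.format(**row)
    if constraints:
        rendered += "\nConstraints:\n" + "\n".join(f"- {c}" for c in constraints)
    return rendered


template = "Write a 2-sentence product description for {product_name} in category {category}"
fields = template_fields(template)
rendered = render_prompt(
    template,
    {"product_name": "Desk lamp", "category": "home"},
    constraints=("At most 40 words",),
)
```

Checking the placeholders against the row before rendering surfaces a misspelled column name at config time rather than partway through a paid generation run.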

Cost Estimation

Always estimate cost before full generation. The estimate is based on token counts derived from the preview sample:

preview = dd.preview(n_rows=10)
cost_1k = dd.estimate_from_preview(preview, n_rows=1000)
cost_10k = dd.estimate_from_preview(preview, n_rows=10000)
print(f"1K rows: ${cost_1k:.3f} | 10K rows: ${cost_10k:.3f}")

Cost scales linearly with row count for most column types, which is why a small preview can be extrapolated to a full run. Thinking models (o1, claude-3-7-sonnet with extended thinking) cost significantly more, so use them only for complex reasoning tasks.
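
The linear extrapolation can be written out as plain arithmetic. The sketch below is not how `estimate_from_preview()` is implemented; the function name, the token counts, and the per-1K-token prices are placeholder assumptions chosen only to make the numbers easy to follow.

```python
def extrapolate_cost(preview_input_tokens, preview_output_tokens, preview_rows,
                     target_rows, usd_per_1k_input, usd_per_1k_output):
    """Scale preview token usage linearly to target_rows and price it in USD."""
    scale = target_rows / preview_rows
    input_cost = preview_input_tokens * scale / 1000 * usd_per_1k_input
    output_cost = preview_output_tokens * scale / 1000 * usd_per_1k_output
    return input_cost + output_cost


# Placeholder numbers: a 10-row preview that used 1200 input and 800 output tokens,
# priced at made-up rates (not real model pricing).
cost_1k = extrapolate_cost(1200, 800, 10, 1000,
                           usd_per_1k_input=0.001, usd_per_1k_output=0.005)
cost_10k = extrapolate_cost(1200, 800, 10, 10000,
                            usd_per_1k_input=0.001, usd_per_1k_output=0.005)
```

With these placeholder inputs, a tenfold increase in rows gives a tenfold increase in cost. A larger preview gives a more stable per-row token average to scale from.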

Supported Operations

Operation	Description	Key Config
Fill missing values	Replace nulls/sentinels with contextually generated content	`prompt` references other columns for context
Translation	Translate a text column to another language	`prompt="Translate to French: {text}"`
Format conversion	Convert HTML to Markdown, JSON to prose, etc.	`prompt` specifies input/output format
Annotation/labeling	Classify or label records	Use `dtype="category"` with `values=[...]`
Field extraction	Extract structured fields from unstructured text	`dtype="string"` + extraction prompt
New column generation	Generate a new column from existing context	`prompt` references multiple source columns

Dependencies

pandas numpy data-designer tiktoken
