MAGIC Agent Skills

magic-data-synthesis

Synthesize, generate, and transform data with LLM-based operations through the DataDesigner engine, the primary engine for all LLM-powered data generation tasks.

When It Activates

Use this skill when generating data with an LLM or creating synthetic examples. Trigger phrases: synthesize, generate, fill missing, translate, annotate, enrich, augment data, create new examples, DataDesigner, LLM generation.

  • Columns have missing values, sentinels ("X", "N/A", "TBD"), or placeholders needing contextual generation
  • Format conversion (HTML→markdown), translation, annotation, labeling, summarization
  • Structured field extraction from unstructured text into multiple columns
  • A reference join leaves gaps that the LLM must fill (run enrich_from_reference.py first)
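
To make the first activation condition concrete, here is a minimal sketch of how missing values and sentinels might be located before any generation is requested. It uses only the standard library; the SENTINELS set and the find_gaps helper are hypothetical names chosen for illustration and are not part of DataDesigner.

```python
import csv
import io

# Assumed sentinel set, based on the examples listed above; extend per dataset.
SENTINELS = {"", "X", "N/A", "TBD"}


def find_gaps(rows, columns):
    """Return (row_index, column) pairs whose value is missing or a sentinel."""
    gaps = []
    for i, row in enumerate(rows):
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() in SENTINELS:
                gaps.append((i, col))
    return gaps


# Small in-memory CSV so the sketch runs without any input file.
sample = io.StringIO("name,description\nLamp,TBD\nDesk,Solid oak desk\nChair,N/A\n")
rows = list(csv.DictReader(sample))
gaps = find_gaps(rows, ["description"])
print(gaps)
```

Only the cells reported by a check like this need LLM generation; everything else can be left untouched.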

When NOT to Use: Rule-based fixes (regex, type casting, dedup) → magic-data-cleaning. Reshaping, joins, aggregation → magic-data-transformation. Schema enforcement → magic-data-validation. If a Python function can produce correct output for every case, use programmatic generation instead of DataDesigner.

Quick Facts

| Property | Value |
|---|---|
| Version | 3.0.0 |
| Complexity | high |
| Phase | 2 |
| Scripts | 6 |

Tags

data-science synthesis generation llm transformation enrichment annotation

DataDesigner: Primary Synthesis Engine

DataDesigner is the primary engine for all LLM-powered data generation. It replaces the legacy batch_synthesize.py approach with a config-driven model that provides preview gates, cost estimation, and automated quality control.

Key Differences from Legacy Approach

| Aspect | Legacy (batch_synthesize.py) | DataDesigner |
|---|---|---|
| Engine | Script-based batch generation | Config-driven Python API |
| Config | Inline parameters per run | Python file with load_config_builder() |
| Preview | Optional | Hard gate before full generation |
| Quality | Manual review | LLMJudgeColumnConfig automated scoring |
| Cost | Unknown until run | estimate_from_preview() upfront |
| Models | Remote API only | Local models + remote APIs + thinking models |

Config-Driven Generation

DataDesigner configurations are Python files with a load_config_builder() function. The agent writes this config adapted to the task:

from data_designer import DataDesigner, LLMJudgeColumnConfig

def load_config_builder():
    dd = DataDesigner()

    # Define source columns
    dd.add_column("product_name", dtype="string", description="Product name from catalog")
    dd.add_column("category", dtype="category", values=["electronics", "clothing", "home"])

    # LLM-generated column with quality scoring
    dd.add_column(
        "product_description",
        dtype="string",
        prompt="Write a 2-sentence product description for {product_name} in category {category}",
        judge=LLMJudgeColumnConfig(
            criteria="Is the description accurate, informative, and appropriate for the category?",
            passing_score=0.8
        )
    )
    return dd
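
The prompt above refers to other columns through {column} placeholders. How DataDesigner resolves them internally is not described here, so the following is only a sketch that assumes Python str.format-style substitution; placeholders and render_prompt are hypothetical helpers, and the sample row is made up.

```python
import string

PROMPT = "Write a 2-sentence product description for {product_name} in category {category}"


def placeholders(template):
    """List the column names a template references, in order of appearance."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]


def render_prompt(template, row):
    """Fill the template from one row, failing loudly on a missing column."""
    missing = [name for name in placeholders(template) if name not in row]
    if missing:
        raise KeyError(f"row is missing columns: {missing}")
    return template.format(**row)


# Illustrative row; the values are invented for this sketch.
row = {"product_name": "Trail Lamp", "category": "electronics"}
rendered = render_prompt(PROMPT, row)
print(rendered)
```

Checking placeholders against the available columns before generation catches typos in a prompt without spending any tokens.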

Preview Gate (Required)

Always run a preview before full generation. The preview gate shows sample output and a cost estimate before you commit to the full run:

dd = load_config_builder()

# Preview: generates 5 rows, estimates cost
preview = dd.preview(n_rows=5)
print(preview.sample_data)
print(f"Estimated cost for 1000 rows: ${dd.estimate_from_preview(preview, n_rows=1000):.4f}")

# Only proceed after user confirms the preview looks correct
result = dd.generate(n_rows=1000)
result.to_parquet("data/output/synthesized.parquet")
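
The gate itself can be expressed as a small control-flow wrapper. The sketch below is an illustration, not DataDesigner code: run_with_preview_gate is a hypothetical helper, and the fake_preview and fake_generate functions stand in for dd.preview and dd.generate so the example runs without an LLM.

```python
def run_with_preview_gate(preview_fn, generate_fn, confirm_fn,
                          preview_rows=5, full_rows=1000):
    """Run a small preview, ask for confirmation, and only then generate in full."""
    sample = preview_fn(preview_rows)
    if not confirm_fn(sample):
        return None  # gate closed: nothing beyond the preview is generated
    return generate_fn(full_rows)


calls = []  # records which stages actually ran


def fake_preview(n):
    calls.append(("preview", n))
    return [f"sample row {i}" for i in range(n)]


def fake_generate(n):
    calls.append(("generate", n))
    return [f"row {i}" for i in range(n)]


# A rejected preview stops the run; an accepted one proceeds to full generation.
rejected = run_with_preview_gate(fake_preview, fake_generate, lambda sample: False)
accepted = run_with_preview_gate(fake_preview, fake_generate, lambda sample: True)
print(rejected, len(accepted))
```

In an agent workflow, confirm_fn would be the point where the user reviews the sample and the cost estimate.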

LLMJudgeColumnConfig — Automated Quality Scoring

LLMJudgeColumnConfig attaches an LLM judge to any generated column. The judge scores each generated value and flags rows that fall below the threshold:

from data_designer import LLMJudgeColumnConfig

judge = LLMJudgeColumnConfig(
    criteria="Is the translation accurate and natural-sounding?",
    passing_score=0.75,  # 0.0–1.0 scale
    model="claude-3-5-haiku"  # Use a fast model for judging
)

Rows that fail the judge threshold are returned in result.failed_rows for review or regeneration.
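
As a rough picture of what the threshold does, the sketch below partitions already-scored rows into passing and failing groups. It assumes that a score equal to passing_score counts as a pass, which the text above does not specify; split_by_score and the sample scores are hypothetical.

```python
def split_by_score(scored_rows, passing_score):
    """Partition rows into (passed, failed) by judge score on a 0.0-1.0 scale."""
    passed, failed = [], []
    for row in scored_rows:
        # Assumption: a score exactly at the threshold passes.
        if row["score"] >= passing_score:
            passed.append(row)
        else:
            failed.append(row)
    return passed, failed


# Invented scores for illustration only.
scored = [
    {"id": 1, "score": 0.90},
    {"id": 2, "score": 0.75},
    {"id": 3, "score": 0.60},
]
passed, failed = split_by_score(scored, passing_score=0.75)
print([r["id"] for r in passed], [r["id"] for r in failed])
```

The failed group plays the role of result.failed_rows: candidates for manual review or a second generation pass.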

Scripts

Scriptable Tools (call directly or read + adapt)

| Script | Standard CLI Usage | When to Customize |
|---|---|---|
| synthesis_config.py | python3 synthesis_config.py --template fill_missing --output config.py | --template for starting points; edit the generated config for your specific columns |
| generate_column.py | python3 generate_column.py data.csv output.csv --column description --prompt "Write a description for {name}" | Use for single-column generation without a full DataDesigner config |
| validate_synthetic.py | python3 validate_synthetic.py output.csv validation.json | --judge-threshold 0.8 to adjust pass rate; --sample N for spot-checking |
| enrich_from_reference.py | python3 enrich_from_reference.py data.csv reference.csv enriched.csv --join-col id | Use when reference data can fill most gaps; LLM fills only what reference misses |
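
The reference-first pattern behind enrich_from_reference.py can be sketched as a lookup followed by a list of leftovers. This is not the script's actual implementation, which is not shown on this page; fill_from_reference, the sentinel tuple, and the sample data are assumptions made for illustration.

```python
def fill_from_reference(rows, reference, join_col, target_col,
                        sentinels=("", "N/A", "TBD")):
    """Fill target_col in place from a reference table.

    Returns the join keys of rows that the reference could not fill,
    i.e. the rows that would still need LLM generation.
    """
    lookup = {ref[join_col]: ref[target_col]
              for ref in reference if ref.get(target_col)}
    remaining = []
    for row in rows:
        value = row.get(target_col)
        if value is None or value in sentinels:
            if row[join_col] in lookup:
                row[target_col] = lookup[row[join_col]]
            else:
                remaining.append(row[join_col])
    return remaining


# Invented sample data: id 2 is covered by the reference, id 3 is not.
data = [
    {"id": 1, "description": "Solid oak desk"},
    {"id": 2, "description": "TBD"},
    {"id": 3, "description": "N/A"},
]
reference = [{"id": 2, "description": "Adjustable desk lamp"}]
remaining = fill_from_reference(data, reference, "id", "description")
print(remaining)
```

Running the join first keeps the LLM workload, and therefore the cost, limited to the rows the reference data cannot answer.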

Reference Implementations (read patterns, write custom code)

| Script | Demonstrates | Key Pattern |
|---|---|---|
| batch_synthesize.py | Legacy batch generation approach | Kept for reference; DataDesigner is the preferred approach for new work |
| synthesis_prompt_builder.py | Prompt construction patterns | Template variables, context injection, constraint encoding in prompts |

Cost Estimation

Always estimate cost before full generation. The estimate is based on token counts derived from the preview sample:

# Build the designer from the config file shown earlier
dd = load_config_builder()

preview = dd.preview(n_rows=10)
cost_1k = dd.estimate_from_preview(preview, n_rows=1000)
cost_10k = dd.estimate_from_preview(preview, n_rows=10000)
print(f"1K rows: ${cost_1k:.3f} | 10K rows: ${cost_10k:.3f}")

Cost scales linearly with row count for most column types. Thinking models (o1, claude-3-7-sonnet with extended thinking) cost significantly more, so reserve them for complex reasoning tasks.
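
The linear scaling can be illustrated with a worked extrapolation from preview token counts. This sketch does not reproduce estimate_from_preview(); the estimate_cost helper, the token counts, and the per-1K-token prices are all placeholder assumptions, not real model pricing.

```python
def estimate_cost(preview_tokens_in, preview_tokens_out, preview_rows,
                  n_rows, price_in_per_1k, price_out_per_1k):
    """Extrapolate cost linearly from a preview's average tokens per row."""
    per_row_in = preview_tokens_in / preview_rows
    per_row_out = preview_tokens_out / preview_rows
    cost_per_row = (per_row_in * price_in_per_1k
                    + per_row_out * price_out_per_1k) / 1000
    return cost_per_row * n_rows


# Placeholder numbers: a 10-row preview that used 1200 input and 600 output tokens.
preview_stats = dict(preview_tokens_in=1200, preview_tokens_out=600, preview_rows=10,
                     price_in_per_1k=0.001, price_out_per_1k=0.005)
cost_1k = estimate_cost(n_rows=1000, **preview_stats)
cost_10k = estimate_cost(n_rows=10000, **preview_stats)
print(f"1K rows: ${cost_1k:.3f} | 10K rows: ${cost_10k:.3f}")
```

Because the per-row average comes from a handful of preview rows, a larger preview gives a steadier estimate when row lengths vary widely.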

Supported Operations

| Operation | Description | Key Config |
|---|---|---|
| Fill missing values | Replace nulls/sentinels with contextually generated content | prompt references other columns for context |
| Translation | Translate a text column to another language | prompt="Translate to French: {text}" |
| Format conversion | Convert HTML to Markdown, JSON to prose, etc. | prompt specifies input/output format |
| Annotation/labeling | Classify or label records | Use dtype="category" with values=[...] |
| Field extraction | Extract structured fields from unstructured text | dtype="string" + extraction prompt |
| New column generation | Generate a new column from existing context | prompt references multiple source columns |
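
For annotation and labeling, the values list acts as a closed vocabulary. How DataDesigner enforces it is not described here, so the following is only a sketch of a post-generation check; coerce_label is a hypothetical helper and the raw outputs are invented.

```python
ALLOWED = ["electronics", "clothing", "home"]


def coerce_label(raw, allowed):
    """Return the matching allowed label, or None if the output is off-list."""
    # Trim whitespace, surrounding quotes, and trailing periods, then lowercase.
    cleaned = raw.strip().strip('".').lower()
    return cleaned if cleaned in allowed else None


# Invented raw model outputs for illustration.
raw_outputs = [' "Electronics". ', "home", "furniture"]
labels = [coerce_label(raw, ALLOWED) for raw in raw_outputs]
print(labels)
```

Outputs that map to None are off-vocabulary and would be candidates for regeneration rather than silent acceptance.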

Dependencies

pandas numpy data-designer tiktoken
