Checkpoint Strategy

A checkpoint is a saved snapshot of your data at a specific point in the pipeline. Every skill that transforms data writes a checkpoint before handing off to the next skill. This seemingly simple practice has large consequences for reliability, debuggability, and reproducibility.

Why Checkpoints Matter

Recovery Without Re-Running

Data pipelines fail. A network error mid-transformation, a package import failure, or an AI assistant session that closes unexpectedly would otherwise mean starting over from scratch. With checkpoints, the agent restarts from the most recent completed step — not from the raw source.

For a seven-step pipeline, a failure at step six means resuming from the step-five checkpoint and re-running only steps six and seven, not the entire pipeline.
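As a rough illustration of how a resume might find its starting point, the sketch below scans a checkpoint directory and returns the file with the highest step number. The function name `latest_checkpoint` and the regular expression are assumptions made for this example; they are not taken from the agent's implementation.

```python
import re
from pathlib import Path
from typing import Optional

# Matches names such as ckpt_05_outliers_capped.csv (pattern assumed for this sketch).
CKPT_RE = re.compile(r"^ckpt_(\d{2})_([a-z0-9_]+)\.([a-z0-9]+)$")


def latest_checkpoint(directory: Path) -> Optional[Path]:
    """Return the checkpoint file with the highest step number, or None if there is none."""
    best_path = None
    best_step = -1
    for path in directory.iterdir():
        match = CKPT_RE.match(path.name)
        if match is None:
            continue  # ignore files that do not follow the naming convention
        step = int(match.group(1))
        if step > best_step:
            best_step = step
            best_path = path
    return best_path
```

After a rollback, the highest-numbered file on disk may be a superseded checkpoint, so the checkpoint recorded in workspace_state.md should take precedence over a directory scan like this one.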

Debugging at Any Step

When a pipeline produces unexpected results — wrong distributions, surprising row counts, a validation failure — the checkpoints let you inspect the data at each stage of transformation. You can compare ckpt_03_nulls_imputed.csv with ckpt_04_duplicates_removed.csv to see exactly what the deduplication step changed.
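A comparison like this can be done with the standard library alone. The sketch below is one possible approach, assuming both checkpoints are CSV files with a header row and the same columns; the helper names `load_rows` and `diff_checkpoints` are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def load_rows(path):
    """Read a CSV checkpoint into a list of row tuples, skipping the header row."""
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # discard the header
        return [tuple(row) for row in reader]


def diff_checkpoints(before, after):
    """Summarize row counts and the rows that disappeared between two checkpoints."""
    before_rows = load_rows(before)
    after_rows = load_rows(after)
    # Multiset difference: rows (counted with multiplicity) present before but not after.
    removed = list((Counter(before_rows) - Counter(after_rows)).elements())
    return {
        "rows_before": len(before_rows),
        "rows_after": len(after_rows),
        "removed": removed,
    }
```

Calling `diff_checkpoints("ckpt_03_nulls_imputed.csv", "ckpt_04_duplicates_removed.csv")` would list each duplicate row that the deduplication step dropped.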

Reproducibility Across Sessions

Checkpoints combined with the analysis journal give you a complete record of every transformation applied to your data. Months later you can reconstruct exactly what happened to a dataset by reading the journal and the checkpoint sequence.

Safe Experimentation

Because checkpoints are immutable snapshots, you can try a different cleaning approach on a checkpoint without losing the original. If the experiment produces worse results, you roll back to the checkpoint and try another approach.

Naming Convention

Every checkpoint follows a strict naming pattern:

ckpt_{NN}_{operation}.{extension}

Components

NN — Step number (two-digit, zero-padded)

The step number is the position in the pipeline, not in the skill. Across a full pipeline of multiple skills, step numbers increment continuously: 01, 02, 03, ... 12. This means you can sort checkpoints alphabetically and get the correct chronological order.
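The zero padding is what makes a plain alphabetical sort work. The short sketch below contrasts padded and unpadded names; the unpadded list is a made-up counterexample, not something the agent would produce.

```python
def chronological(names):
    """Plain lexicographic sort; matches pipeline order only when steps are zero-padded."""
    return sorted(names)


padded = ["ckpt_10_pivoted.parquet", "ckpt_02_type_cast.csv", "ckpt_01_raw_data.csv"]
unpadded = ["ckpt_10_pivoted.parquet", "ckpt_2_type_cast.csv", "ckpt_1_raw_data.csv"]

print(chronological(padded))    # steps come out as 01, 02, 10
print(chronological(unpadded))  # step 10 sorts ahead of steps 1 and 2
```

With two digits, this ordering holds for pipelines of up to 99 steps.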

operation — What was done (snake_case)

Keep the name descriptive and concise, in snake_case. It should be readable by someone who was not present when the pipeline ran.

| Good | Avoid |
| --- | --- |
| `raw_data` | `step1` |
| `nulls_imputed` | `cleaned` (too vague) |
| `outliers_removed` | `data2` |
| `schema_validated` | `final` (final is never final) |
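The naming rules can be captured in a small helper. This is a minimal sketch, assuming a hypothetical function `checkpoint_name`; the `DISCOURAGED` set only contains the examples from the "Avoid" column above and is not an exhaustive rule.

```python
import re

# snake_case: lowercase words made of letters and digits, joined by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")

# Vague names taken from the "Avoid" column; illustrative only.
DISCOURAGED = {"step1", "cleaned", "data2", "final"}


def checkpoint_name(step: int, operation: str, extension: str) -> str:
    """Build a name of the form ckpt_{NN}_{operation}.{extension}."""
    if not 1 <= step <= 99:
        raise ValueError("step must fit in two digits (1 to 99)")
    if SNAKE_CASE.fullmatch(operation) is None:
        raise ValueError(f"operation is not snake_case: {operation!r}")
    if operation in DISCOURAGED:
        raise ValueError(f"operation name is too vague: {operation!r}")
    # Accept the extension with or without a leading dot.
    return f"ckpt_{step:02d}_{operation}.{extension.lstrip('.')}"
```

For example, `checkpoint_name(3, "nulls_imputed", "csv")` returns `ckpt_03_nulls_imputed.csv`.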

extension — File format

Match the format of the data written. The agent may switch formats between steps for efficiency:

| Format | When Used |
| --- | --- |
| `.csv` | Small datasets, human-readable outputs, first checkpoints |
| `.parquet` | Large datasets, after profiling confirms size warrants it |
| `.json` | Nested or semi-structured data |
| `.pkl` | DataFrames with complex dtypes that CSV cannot preserve |

Avoid .pkl for long-term storage: pickle files are not guaranteed to load across Python or pandas versions. Use .parquet for DataFrames that need to persist beyond the current project.

Full Sequence Example

ckpt_01_raw_data.csv
ckpt_02_type_cast.csv
ckpt_03_nulls_imputed.csv
ckpt_04_duplicates_removed.csv
ckpt_05_outliers_capped.csv
ckpt_06_normalized.parquet
ckpt_07_schema_validated.parquet
ckpt_08_pivoted.parquet
ckpt_09_analysis_complete.parquet

Reading this list tells you the complete transformation history without opening a single file.
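Because the names are regular, the history can also be extracted programmatically. The sketch below parses a list of checkpoint names into (step, operation, format) tuples; the function name `transformation_history` and the regular expression are assumptions for illustration.

```python
import re

# Pattern assumed from the naming convention: ckpt_{NN}_{operation}.{extension}
CKPT_RE = re.compile(r"^ckpt_(\d{2})_([a-z0-9_]+)\.([a-z0-9]+)$")


def transformation_history(names):
    """Return (step, operation, format) tuples in pipeline order."""
    history = []
    for name in sorted(names):  # zero-padded steps sort chronologically
        match = CKPT_RE.match(name)
        if match is None:
            raise ValueError(f"not a checkpoint name: {name!r}")
        step, operation, extension = match.groups()
        history.append((int(step), operation.replace("_", " "), extension))
    return history
```

Applied to the sequence above, the first entry would be `(1, "raw data", "csv")` and the last `(9, "analysis complete", "parquet")`.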

Multi-Session State

Checkpoints persist on disk, but the agent also needs to know which checkpoint represents the current pipeline state — especially when resuming a session. This is handled by workspace/workspace_state.md.

What `workspace_state.md` Tracks

## Pipeline State

**Status:** in-progress
**Current phase:** Execute
**Last completed step:** 5
**Last checkpoint:** workspace/data/checkpoints/ckpt_05_outliers_capped.csv
**Next skill:** magic-data-validation
**Remaining route:** [magic-data-validation, magic-report-generation]

## Quality Gates

**Profiling threshold:** 70
**Cleaning threshold:** 85
**Validation mode:** strict

## Session Notes

- User requested median imputation for all numeric columns
- Outlier capping at 99th percentile (user confirmed)
- Target schema: schemas/orders_schema.json

When you return to a project and say "resume my pipeline", the agent reads this file and knows exactly where to continue.

workspace_state.md is plain Markdown. You can edit it directly to adjust thresholds, change the remaining route, or correct a session note. The agent reads it fresh at the start of each session.
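Since the file is plain Markdown, reading it takes only a few lines. The sketch below shows one way to pull out the `**Key:** value` fields and the bulleted session notes; `parse_state` and `parse_route` are hypothetical helpers, and the agent's own parsing may differ.

```python
import re

# Matches lines such as "**Last completed step:** 5".
FIELD_RE = re.compile(r"^\*\*(.+?):\*\*\s*(.*)$")


def parse_state(text):
    """Return (fields, notes): a dict of key/value fields and a list of bullet notes."""
    fields = {}
    notes = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
        elif line.startswith("- "):
            notes.append(line[2:])
    return fields, notes


def parse_route(value):
    """Turn "[skill-a, skill-b]" into ["skill-a", "skill-b"]."""
    inner = value.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]
```

All values come back as strings, so a caller would convert fields such as the last completed step with `int(...)` before using them.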

Rolling Back

If a transformation produces bad results, you can roll back to any prior checkpoint using the /magic:rollback command:

"/magic:rollback ckpt_03_nulls_imputed.csv"

The agent will:

1. Set the active checkpoint to ckpt_03_nulls_imputed.csv
2. Update workspace_state.md to reflect the rollback (step number, next skill, etc.)
3. Mark checkpoints 04 and beyond as superseded (they are not deleted, only excluded from the active route)
4. Ask what you want to do differently before re-running from that checkpoint, starting at step 4

Superseded checkpoints are kept on disk by default. Run /magic:cleanup-checkpoints to delete them after you are satisfied with the pipeline results. Always verify the active pipeline produces correct results before cleaning up.
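The bookkeeping behind a rollback can be sketched as a split of the checkpoint list around the target. The code below is an illustration under the assumption that step numbers alone decide which checkpoints are superseded; `plan_rollback` is a hypothetical name, not the command's implementation.

```python
import re

STEP_RE = re.compile(r"^ckpt_(\d{2})_")


def step_of(name):
    """Extract the two-digit step number from a checkpoint name."""
    match = STEP_RE.match(name)
    if match is None:
        raise ValueError(f"not a checkpoint name: {name!r}")
    return int(match.group(1))


def plan_rollback(checkpoints, target):
    """Split checkpoints into the active chain and the superseded remainder."""
    if target not in checkpoints:
        raise ValueError(f"unknown checkpoint: {target!r}")
    target_step = step_of(target)
    return {
        "active": sorted(n for n in checkpoints if step_of(n) <= target_step),
        "superseded": sorted(n for n in checkpoints if step_of(n) > target_step),
        # The pipeline continues with the step after the target checkpoint.
        "next_step": target_step + 1,
    }
```

Nothing is deleted here: the plan only records which files leave the active route, which mirrors the default behavior described above.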

Checkpoint Size and Storage

For large datasets, checkpoint files can consume significant disk space. The agent manages this automatically in a few ways:

- Format promotion: switches from CSV to Parquet after the first few steps to reduce file size
- Compression: Parquet checkpoints use Snappy compression by default
- Cleanup command: /magic:cleanup-checkpoints removes superseded checkpoints while keeping the active chain intact
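To make the cleanup behavior concrete, the sketch below removes every checkpoint file that is not part of the active chain. It is an assumption-laden illustration: the source does not say how superseded checkpoints are tracked, so this example simply takes the list of active file names as input, and `cleanup_checkpoints` is a hypothetical name.

```python
from pathlib import Path


def cleanup_checkpoints(directory, active_names, dry_run=True):
    """Remove checkpoint files not in the active chain; return the affected names.

    With dry_run=True (the default) nothing is deleted, so the result can be reviewed first.
    """
    active = set(active_names)
    affected = []
    for path in sorted(Path(directory).glob("ckpt_*")):
        if path.is_file() and path.name not in active:
            if not dry_run:
                path.unlink()
            affected.append(path.name)
    return affected
```

Defaulting to a dry run reflects the advice to verify the active pipeline's results before deleting anything.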

For very large datasets (over 1 GB uncompressed), the agent may suggest writing checkpoints as partitioned Parquet files and will update the checkpoint path in the state file accordingly.

Disabling Checkpoints

If your dataset is small and you want to skip checkpoint overhead:

"Run the cleaning pipeline without checkpoints"

The agent will still write a final checkpoint after the last step (so the workspace state remains valid) but will skip intermediate saves. The analysis journal still records every step.

Disabling intermediate checkpoints removes the ability to roll back mid-pipeline. Use only for short pipelines where re-running from scratch is fast.
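The difference between the two modes can be sketched as a small pipeline runner. This is only an illustration: `run_pipeline`, its `intermediate` flag, and the `save` callback are hypothetical names, and the steps are plain functions standing in for skills.

```python
from typing import Any, Callable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]  # (operation name, transformation)


def run_pipeline(
    data: Any,
    steps: List[Step],
    save: Callable[[int, str, Any], None],
    intermediate: bool = True,
) -> Any:
    """Run each step in order, saving a checkpoint after it when allowed.

    With intermediate=False only the final step is saved, so the workspace
    state stays valid but mid-pipeline rollback is not possible.
    """
    for index, (operation, transform) in enumerate(steps, start=1):
        data = transform(data)
        is_last = index == len(steps)
        if intermediate or is_last:
            save(index, operation, data)
    return data
```

With `intermediate=False`, a failure at any step leaves no checkpoint to resume from, which is why this mode suits only pipelines that are quick to re-run.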
