Workspace Patterns

For Tier 2 (multi-step pipeline) and Tier 3 (multi-dataset project) tasks, MAGIC creates a structured workspace with persistent state files. These files let the agent resume sessions, track decisions, and maintain data provenance across a full pipeline.

Full Directory Structure

workspace/
├── workspace_state.md          # Phase tracker, quality score, task list, last checkpoint
├── specs/
│   └── data-spec.md            # Single source of truth for dataset properties
├── logs/
│   └── analysis_journal.md     # Decision log (timestamp, context, options, chosen, rationale)
├── data/
│   ├── input/                  # Original source files (never modified)
│   ├── processed/              # Current working version
│   └── checkpoints/
│       ├── ckpt_01_loaded.csv
│       ├── ckpt_02_cleaned.csv
│       └── ckpt_03_validated.csv
└── output/
    ├── reports/                # Generated Markdown reports
    └── visualizations/         # Charts and plots (PNG, SVG)

Initialize the workspace by running /magic:init-workspace in your AI assistant, or by asking the agent to set up a workspace.

The workspace root defaults to ./workspace/ relative to your project. Tier 1 tasks (single operations) do not create this structure; they write their output directly.
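To make the layout concrete, here is a minimal Python sketch that creates the same directories and empty state files by hand. The function name `create_workspace_skeleton` and both constants are hypothetical; this illustrates the tree above and is not the implementation of /magic:init-workspace.

```python
from pathlib import Path

# Directory and file names are copied from the tree above. Everything else
# (function name, constants) is an illustrative assumption.
WORKSPACE_DIRS = [
    "specs",
    "logs",
    "data/input",
    "data/processed",
    "data/checkpoints",
    "output/reports",
    "output/visualizations",
]
WORKSPACE_FILES = [
    "workspace_state.md",
    "specs/data-spec.md",
    "logs/analysis_journal.md",
]


def create_workspace_skeleton(root: str = "workspace") -> Path:
    """Create missing directories and empty state files under `root`."""
    base = Path(root)
    for rel in WORKSPACE_DIRS:
        (base / rel).mkdir(parents=True, exist_ok=True)
    for rel in WORKSPACE_FILES:
        path = base / rel
        if not path.exists():  # leave existing state files untouched
            path.touch()
    return base
```

Calling `create_workspace_skeleton()` a second time is harmless: existing directories and files are left as they are.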

workspace_state.md — Phase Tracker

workspace_state.md is the agent's persistent memory. It is read at the start of every session and updated at the end of each phase.

What It Contains

Current phase — which of the 5 lifecycle phases is active (Discover, Plan, Execute, Validate, Deliver)
Last checkpoint — path to the most recent data checkpoint
Quality score — current dataset quality score (updated after each profiling or cleaning pass)
Task list — remaining steps in the current pipeline
User preferences — target schema, preferred output format, quality thresholds

Resuming a Session

When you return to a project, the agent reads workspace_state.md and resumes from where you left off:

"Resume my data pipeline"

The agent identifies the last completed step, loads the most recent checkpoint, and continues with the next item in the task list.

State files are plain Markdown — you can read and edit them directly. To force a fresh start, delete or clear the state file.
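Because the state file is plain text, working out where to resume can be sketched in a few lines of Python. The file layout in `SAMPLE_STATE` (`Key: value` lines plus a checkbox task list) is an assumption made for this example, as are the values in it; the actual layout of workspace_state.md may differ.

```python
import re

# ASSUMED layout and values, for illustration only.
SAMPLE_STATE = """\
Current phase: Execute
Last checkpoint: data/checkpoints/ckpt_03_nulls_imputed.csv
Quality score: 82

Task list:
- [x] Load data
- [x] Impute nulls
- [ ] Remove duplicates
- [ ] Validate
"""


def read_resume_point(state_text: str) -> dict:
    """Return the active phase, last checkpoint, and first unfinished task."""
    fields = {}
    tasks = []
    for line in state_text.splitlines():
        task = re.match(r"^- \[( |x)\] (.+)$", line)
        if task:
            tasks.append((task.group(1) == "x", task.group(2).strip()))
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skip bare labels such as "Task list:"
            fields[key.strip().lower()] = value.strip()
    next_task = next((name for done, name in tasks if not done), None)
    return {
        "phase": fields.get("current phase"),
        "last_checkpoint": fields.get("last checkpoint"),
        "next_task": next_task,
    }
```

With the sample above, `read_resume_point(SAMPLE_STATE)["next_task"]` is `"Remove duplicates"`.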

specs/data-spec.md — Single Source of Truth

data-spec.md is the authoritative description of the dataset. It is created during the Plan phase and updated whenever the dataset schema or properties change.

What It Contains

Schema — column names, types, nullability, expected ranges
Quality targets — minimum quality score, acceptable null rates per column
Business rules — cross-column constraints, domain-specific validation rules
Source provenance — original file, row/column counts, loading parameters used
Processing history — which operations have been applied and in what order

Why It Matters

data-spec.md keeps the phases from drifting apart in their assumptions about the data. Without it, each phase may assume a different schema or set of rules. With it, the agent checks every operation against the spec and flags deviations.
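As a rough illustration of checking data against a spec, the sketch below validates rows against type, nullability, and range rules. The in-memory `SPEC` dictionary, the `customer_id` column, and the function name `find_deviations` are assumptions for this example; how the agent actually parses and enforces data-spec.md is not described here.

```python
from typing import Any

# Hypothetical in-memory form of the schema section of data-spec.md.
SPEC = {
    "customer_id": {"type": int, "nullable": False},
    "revenue": {"type": float, "nullable": True, "min": 0.0},
}


def find_deviations(rows: list[dict[str, Any]], spec: dict) -> list[str]:
    """Return one human-readable message per deviation from the spec."""
    problems = []
    for i, row in enumerate(rows):
        for extra in sorted(row.keys() - spec.keys()):
            problems.append(f"row {i}: unexpected column '{extra}'")
        for col, rule in spec.items():
            if col not in row:
                problems.append(f"row {i}: missing column '{col}'")
                continue
            value = row[col]
            if value is None:
                if not rule.get("nullable", True):
                    problems.append(f"row {i}: '{col}' is null")
                continue
            if not isinstance(value, rule["type"]):
                problems.append(
                    f"row {i}: '{col}' has type {type(value).__name__}, "
                    f"expected {rule['type'].__name__}"
                )
                continue
            if "min" in rule and value < rule["min"]:
                problems.append(f"row {i}: '{col}' = {value} is below minimum {rule['min']}")
            if "max" in rule and value > rule["max"]:
                problems.append(f"row {i}: '{col}' = {value} is above maximum {rule['max']}")
    return problems
```

An empty result means every row matches the spec; any message in the list is a deviation worth recording in the decision log.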

logs/analysis_journal.md — Decision Log

analysis_journal.md is an append-only log of every significant decision made during the pipeline. It is the first place to look when a pipeline produces unexpected results.

Entry Format

Each entry records:

| Field | Description |
| --- | --- |
| Timestamp | When the decision was made |
| Context | What situation prompted this decision |
| Options considered | What alternatives were available |
| Chosen | What was decided |
| Rationale | Why this option was chosen |

Sample Entry

## 2026-05-20 14:32 — magic-data-cleaning / null imputation

**Context:** `revenue` column has 342 nulls (2.7%). Missing rate is below 5% threshold.

**Options considered:**
- A. Mean imputation — fast but sensitive to outliers
- B. Median imputation — robust to skew; distribution is right-skewed
- C. KNN imputation — preserves correlations; data has no strong correlated columns
- D. Drop rows — 2.7% loss acceptable but unnecessary

**Chosen:** B — median imputation

**Rationale:** Distribution is right-skewed (skewness 2.4). Median is more representative than mean. No strongly correlated columns to exploit with KNN.
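The sample entry's structure can be reproduced programmatically. The following sketch formats an entry in the same shape and appends it to the journal without truncating earlier entries. The function names and parameters are hypothetical; only the entry layout comes from the sample above.

```python
from datetime import datetime
from pathlib import Path
from string import ascii_uppercase

DASH = "\u2014"  # the separator used in the sample entry's heading


def format_entry(when: datetime, skill: str, topic: str, context: str,
                 options: list[str], chosen: str, rationale: str) -> str:
    """Render one journal entry in the layout of the sample entry."""
    lines = [
        f"## {when:%Y-%m-%d %H:%M} {DASH} {skill} / {topic}",
        "",
        f"**Context:** {context}",
        "",
        "**Options considered:**",
    ]
    # Options are lettered A, B, C, ... in the order given.
    lines += [f"- {letter}. {text}" for letter, text in zip(ascii_uppercase, options)]
    lines += ["", f"**Chosen:** {chosen}", "", f"**Rationale:** {rationale}", ""]
    return "\n".join(lines)


def append_entry(journal: Path, entry: str) -> None:
    """Append an entry; mode "a" never rewrites what is already logged."""
    journal.parent.mkdir(parents=True, exist_ok=True)
    with journal.open("a", encoding="utf-8") as fh:
        fh.write("\n" + entry)
```

Opening the file in append mode is what keeps the log append-only from the writer's side; nothing in this sketch prevents a person from editing the file by hand.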

data/checkpoints/ — Versioned Snapshots

Checkpoints are immutable snapshots of the data at each pipeline step. They are never overwritten — each step creates a new file with an incrementing number. This enables rollback to any prior state using /magic:rollback.
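One way to get the never-overwritten property is to open the destination in exclusive-create mode, as in the sketch below. The function `save_checkpoint` is a hypothetical helper for illustration, not the mechanism MAGIC itself uses.

```python
import shutil
from pathlib import Path


def save_checkpoint(source: Path, checkpoint: Path) -> Path:
    """Copy `source` to `checkpoint`, refusing to replace an existing file."""
    checkpoint.parent.mkdir(parents=True, exist_ok=True)
    # Mode "xb" raises FileExistsError if the checkpoint already exists,
    # so a snapshot can be created once but never silently overwritten.
    with source.open("rb") as src, checkpoint.open("xb") as dst:
        shutil.copyfileobj(src, dst)
    return checkpoint
```

A second call with the same destination raises `FileExistsError` instead of replacing the earlier snapshot.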

Naming Convention

ckpt_{NN}_{operation}.{extension}

| Part | Description | Example |
| --- | --- | --- |
| `NN` | Two-digit step number (zero-padded) | `01`, `02`, `12` |
| `operation` | Snake-case description of what was done | `loaded`, `cleaned`, `validated` |
| `extension` | Format matching the data | `csv`, `parquet`, `jsonl` |

Examples

ckpt_01_loaded.csv
ckpt_02_profiled.csv          # Data unchanged; metadata written separately
ckpt_03_nulls_imputed.csv
ckpt_04_duplicates_removed.csv
ckpt_05_normalized.parquet    # Format changed for efficiency
ckpt_06_validated.parquet     # Passed validation gate
ckpt_07_report.md             # Final report
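The naming convention is regular enough to parse and generate with a short helper. In this sketch the regular expression and both function names are assumptions derived from the table and examples above; rejecting step numbers above 99 is also an assumption, since the convention only specifies two digits.

```python
import re

# ckpt_{NN}_{operation}.{extension}: two digits, snake_case operation.
CKPT_PATTERN = re.compile(r"^ckpt_(\d{2})_([a-z0-9]+(?:_[a-z0-9]+)*)\.([a-z0-9]+)$")


def parse_checkpoint_name(name: str):
    """Return (step, operation, extension), or None if the name does not match."""
    match = CKPT_PATTERN.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), match.group(3)


def next_checkpoint_name(existing_names, operation: str, extension: str) -> str:
    """Build the name for the next step, one past the highest existing number."""
    parsed = (parse_checkpoint_name(n) for n in existing_names)
    step = max((p[0] for p in parsed if p), default=0) + 1
    if step > 99:
        raise ValueError("two-digit step numbers are exhausted")
    name = f"ckpt_{step:02d}_{operation}.{extension}"
    if parse_checkpoint_name(name) is None:
        raise ValueError(f"operation or extension is not valid: {name!r}")
    return name
```

For example, given `ckpt_01_loaded.csv` and `ckpt_02_profiled.csv`, the next name for operation `nulls_imputed` and extension `csv` is `ckpt_03_nulls_imputed.csv`. Files that do not follow the convention are ignored when computing the step number.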

`--auto-checkpoint` Flag

Scripts that support --auto-checkpoint create numbered snapshots automatically after each successful operation — no manual checkpoint management needed:

python3 execute_cleaning_plan.py data.csv cleaned.csv --plan plan.json --auto-checkpoint
# Creates: ckpt_01_cleaned.csv, ckpt_02_cleaned.csv, ... for each step in the plan

Do not rename or move checkpoint files manually. The agent tracks their paths in workspace_state.md. Renaming breaks the rollback chain.

data/input/ — Original Source Files

The input/ directory holds original source files and is never modified. If loading fails or produces unexpected results, the original is always available for re-loading with different parameters.

For Tier 3 projects with multiple source datasets, use subdirectories:

data/input/
├── customers/
│   └── customers_2026.csv
├── transactions/
│   ├── jan.parquet
│   └── feb.parquet
└── hf/
    └── org/dataset-name/   # HuggingFace download output
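If you want to confirm that nothing under input/ has changed between sessions, a checksum manifest is one option. The sketch below records a SHA-256 digest per file and reports any differences later. It is an illustrative add-on under assumed function names, not a documented MAGIC feature.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to input_dir) to its digest."""
    return {
        p.relative_to(input_dir).as_posix(): hash_file(p)
        for p in sorted(input_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(input_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified since the manifest."""
    current = build_manifest(input_dir)
    return sorted(
        name
        for name in current.keys() | manifest.keys()
        if current.get(name) != manifest.get(name)
    )
```

Build the manifest once after the source files are placed in data/input/, store it wherever is convenient, and call `changed_files` at the start of a later session; an empty list means the originals are intact.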

Multi-Session Workflows

Long pipelines that span multiple sessions rely on the workspace to carry state from one session to the next:

Session 1: Load and profile the data, write ckpt_01 and ckpt_02, and update workspace_state.md.
Session 2: The agent reads the state file, picks up at the Discover→Plan PAUSE gate, continues with cleaning and transformation, and writes ckpt_03 through ckpt_05.
Session 3: Validate, generate the visualizations and report, write ckpt_06 and ckpt_07, and deliver the report and charts to output/.

Each session appends entries to analysis_journal.md, so the full decision history is always in one place.
