# Data Quality Report: sales_data.csv

**Generated:** 2026-05-21
**Dataset:** `data/sales_data.csv`
**Template:** standard

## Summary

The dataset scores **88.9/100 (Grade B)** on the composite quality index — broadly usable but with four issues worth addressing before downstream analysis: (1) **20% of rows use inconsistent date formats**, (2) **2.9% of rows are exact duplicates**, (3) **5 records carry implausibly large `quantity` values** that distort revenue aggregates, and (4) the `notes` column is **89.7% empty** with the remaining values uniformly carrying trailing whitespace. Completeness in core business fields (`order_id`, `date`, `product`, `region`, `quantity`, `unit_price`, `revenue`) is 100%; quality issues are concentrated in contact fields (`email`, `phone`) and the free-text `notes` column.

The headline recommendation: **normalize dates, drop duplicates, and triage the 5 outlier orders with the data owner before computing any revenue totals** — those 5 rows alone account for the bulk of the variance in `revenue` (std = 117,264; max = 1,722,626 vs. median 6,595).

## Data Provenance

| Field | Value |
| --- | --- |
| Source file | `data/sales_data.csv` |
| Loaded checkpoint | `data/sales_loaded.csv` |
| Format / encoding | CSV / ASCII (auto-detected) |
| Rows | 515 |
| Columns | 12 |
| File size | 0.05 MB |
| Loaded on | 2026-05-21 |

**Schema (12 columns):**

| Column | Type | Non-null | Notes |
| --- | --- | --- | --- |
| `order_id` | string | 515 (100%) | 500 unique → 15 duplicate IDs |
| `date` | string | 515 (100%) | Mixed formats (see Finding 1) |
| `product` | string | 515 (100%) | 5 categories |
| `region` | string | 515 (100%) | 5 regions |
| `channel` | string | 515 (100%) | 3 channels |
| `quantity` | integer | 515 (100%) | Highly right-skewed (skew=11.18) |
| `unit_price` | float | 515 (100%) | Range 5.79–150.00 |
| `revenue` | float | 515 (100%) | Highly right-skewed (skew=12.81) |
| `sales_rep` | string | 515 (100%) | First names |
| `email` | string | 448 (87.0%) | 67 missing (13.0%) |
| `phone` | string | 473 (91.8%) | 42 missing (8.2%) |
| `notes` | string | 53 (10.3%) | 462 missing (89.7%); all values padded with trailing whitespace |

## Methodology

Quality was assessed via the MAGIC profiling pipeline:

1. **Format detection** (`detect_format.py`) — content-sniffed file format, encoding, and delimiter before load.
2. **Load + validate** (`load_file.py`) — read into pandas with auto-detected encoding; 515/515 rows loaded with no parser errors.
3. **Composite quality score** (`quality_score.py`) — weighted average of four dimensions: completeness (30%), consistency (25%), uniqueness (20%), validity (25%).
4. **Issue detection** (`detect_all_issues.py`) — ran six sub-analyses: sentinel scan, distribution analysis (Shapiro-Wilk normality + skewness), IQR outlier detection (1.5×IQR), correlation matrix, category cardinality analysis, and content validation.
5. **Manual format checks** — regex audit of `date` values against ISO-8601 (`YYYY-MM-DD`); whitespace check on `notes` against `.strip()` equivalence.

Outlier detection used IQR because all three numeric columns failed Shapiro-Wilk normality (p ≪ 0.05); Z-score would be unreliable on this skewed shape.

## Key Findings

### 1. Date column has three competing formats (high impact)

105 of 515 rows (20.4%) deviate from ISO-8601. Observed formats:

| Format | Example | Approx. count |
| --- | --- | --- |
| `YYYY-MM-DD` (ISO) | `2024-11-04` | 410 (79.6%) |
| `MM/DD/YYYY` | `08/18/2024` | ~70 |
| `DD-Mon-YYYY` | `20-Feb-2024` | ~35 |

**Impact:** Any time-series aggregation will silently misparse or drop these rows depending on the parser. `pd.to_datetime` with default arguments will coerce some to `NaT`.

### 2. Duplicate orders (medium impact)

`order_id` has 15 duplicate values (500 unique IDs across 515 rows). 30 rows participate in exact full-row duplicate pairs — i.e., the duplicates are not just shared IDs with different data, they are byte-identical re-entries (2.9% of the dataset).

**Impact:** Revenue and quantity sums double-count these orders by roughly 3%.

### 3. Five extreme `quantity` outliers inflate revenue totals (high impact)

IQR detection flags 5 rows (0.97%) where `quantity` exceeds the upper fence of 306.5 by 1–2 orders of magnitude. The same 5 rows are the only outliers in `revenue`.

| order_id | date | product | quantity | unit_price | revenue |
| --- | --- | --- | --- | --- | --- |
| ORD-1444 | 2024-03-15 | Widget A | **12,353** | 139.45 | 1,722,625.85 |
| ORD-1137 | 2024-03-09 | Display HD | **5,741** | 63.47 | 364,381.27 |
| ORD-1042 | 2024-02-12 | Power Unit | **9,262** | 31.35 | 290,363.70 |
| ORD-1389 | 26-Aug-2024 | Widget A | **15,862** | 100.75 | 1,598,096.50 |
| ORD-1256 | 2024-03-29 | Gadget Y | **11,581** | 103.72 | 1,201,181.32 |

For context, the median quantity is 105 and the 75th percentile is 152. These five rows alone push the mean from ~110 to 206 and the std to 1,118. They could be legitimate bulk/wholesale orders or data-entry errors (extra digit). **Triage with the data owner before treating revenue aggregates as trustworthy.**

### 4. `notes` column is sparse and uniformly malformed (low impact, easy fix)

- 462/515 rows (89.7%) are null.
- All 53 non-null values have leading/trailing whitespace (`"some notes  "` style). 53/53 fail `value == value.strip()`.

**Impact:** Trivial — a single `.str.strip()` resolves it. Worth fixing because GROUP BY operations on this column would treat `"some notes"` and `"some notes  "` as distinct.

### 5. Contact fields have moderate missingness

- `email`: 67 missing (13.0%)
- `phone`: 42 missing (8.2%)

Among present emails, 30 unique addresses serve 448 records — emails appear to be sales-rep identifiers (`rep1@…repN@`), not customer contacts. The pattern suggests these aren't genuinely "missing per customer" but unassigned-rep records. Confirm with the data owner before imputing.

### 6. Distribution shapes (informational)

| Column | Skewness | Kurtosis | Shape | Normal? |
| --- | --- | --- | --- | --- |
| `quantity` | 11.18 | 129.72 | Right-skewed | No |
| `unit_price` | -0.09 | -1.07 | Light-tailed (near-uniform) | No |
| `revenue` | 12.81 | 168.49 | Right-skewed | No |

`unit_price` is approximately uniform across 5.79–150.00 — consistent with synthetic or simulated data. Use Spearman correlations and median-based summaries downstream.

## Dimension Breakdown

| Dimension | Score | Detail |
| --- | --- | --- |
| Completeness | 81.5 | 9.2% of cells missing; concentrated in `notes` (462) |
| Consistency | 83.3 | 1 flagged issue (`notes` whitespace); **date-format inconsistency not captured by the heuristic** — see Finding 1 |
| Uniqueness | 97.1 | 15 duplicate rows (2.9%) |
| Validity | 96.8 | 0.6% IQR outliers (5 rows in `quantity` / `revenue`) |
| **Overall** | **88.9** | **Grade B** |

## Caveats and Limitations

- The composite quality score (88.9, Grade B) hides Finding 1: the consistency dimension scored 83.3 because only `notes` whitespace was flagged. **The 20% date-format inconsistency is not in the heuristic** and would, if measured, drop the score further.
- Outlier detection used IQR with the default 1.5× threshold. The 5 flagged `quantity` rows could be legitimate wholesale orders rather than errors — the report cannot distinguish without business context.
- Sample size (N=515) is small enough that the Shapiro-Wilk test reliably rejected normality, but distribution shape claims for sub-segments (e.g., per-region quantity) would have lower confidence.
- No historical baseline is available — this is a point-in-time profile, not a regression vs. a prior snapshot.
- Composite quality scores measure *structural* quality (missingness, type consistency, dedup, outliers). They do not measure semantic validity (e.g., whether `Widget A` is a real product) or business-rule compliance (e.g., whether `revenue ≈ quantity × unit_price`). A targeted business-rule check is recommended.
- The correlation matrix sub-analysis failed to write its JSON output during issue detection; pairwise correlations are not included in this report.

## Next Steps

1. **Normalize dates to ISO-8601** before any time-series work. Suggested: `pd.to_datetime(df['date'], errors='coerce')` with explicit format hints per detected pattern, then verify zero `NaT`.
2. **Drop or merge the 15 duplicate `order_id` rows** after confirming with the data owner whether duplicates are accidental re-entries or legitimate split orders.
3. **Review the 5 large-quantity outliers with the data owner**: legitimate bulk orders or data-entry errors? This single decision changes total revenue by ~70%.
4. **Strip whitespace in `notes`** (`df['notes'] = df['notes'].str.strip()`). Trivial fix, prevents downstream join/group bugs.
5. **Verify business rule `revenue ≈ quantity × unit_price`** across all rows — discrepancies would indicate either rounding rules (discounts, taxes) or data corruption.
6. **Clarify `email` semantics with the data owner** — these look like sales-rep identifiers, not customer emails. If so, rename to `sales_rep_email` and treat missing values as "unassigned" rather than imputing.
7. **Re-profile after cleaning** using `quality_score.py` to confirm score lifts above 95.
