# Deck 16 · Product Data Enhancement (PDX → Snowflake)

**Status:** Draft (not yet built)
**Saved:** 2026-06-28 by Jeff (verbal)
**Owner:** Plex / OMX Master Data + Commercial
**Sibling:** Deck 05 (Global Catalog) — Catalog is the platform/PIM; this deck is the data-quality programme feeding it.

---

## One-line thesis

**PDX takes all product data, ranged or not, and we round-trip un-ranged SKUs through Snowflake as the staging layer.** This deck is the work to get every SKU we sell, quote, or might one day stock into PDX with the quality our search, quote, switching and IBP layers depend on.

## The wedge — why now

- Search relevance (Deck 08), Switching (Deck 09), Quoting (Deck 04), and IBP (Deck 14) are all only as good as the underlying product data
- Today: ranged SKUs sit in PDX; off-range / quote-only SKUs sit unmanaged, leading to slow supplier loops and missing spec sheets
- PDX is the canonical platform; Snowflake is the round-trip staging for un-ranged data until it earns its way into PDX
- Modern AI (Anthropic / OpenAI) makes enrichment of descriptions, attributes, classifications and images cheap per SKU compared with the historical manual-labour cost

## What this deck covers

1. **PDX as the master** — ranged or not, all product data lives there
2. **Snowflake staging for un-ranged** — when a SKU is quoted, requested, or scraped via Plex-CI, it lands in Snowflake first; promoted into PDX when it qualifies
3. **AI-assisted enrichment** — descriptions, attributes, dimensions, GTINs, image-presence checks, taxonomic classification — generated/validated by AI then human-reviewed
4. **Data-quality scorecard** — completeness, accuracy, consistency, age — per SKU, per category, per supplier
5. **Spec sheet capture** — supplier loop reduction; pull from supplier portals when possible, OCR when not
6. **Image normalisation** — catalog hero, alt views, scale references — standardised format
7. **Cross-references** — competitor SKU → OMX SKU (links to Deck 20 Product Matching), supplier SKU → OMX SKU

## What this deck explicitly does NOT do

- Not the PIM / Catalog platform itself (Deck 05 owns)
- Not competitor pricing (Deck 13 — Plex-CI owns)
- Not SKU matching at scale (Deck 20 — Product Matching owns; this deck feeds it)
- Not IBP (Deck 14) — but provides the clean SKU master IBP needs

---

## Problem framing (what's broken)

- **Off-range = unmanaged** — when quoting team finds a SKU outside the ranged catalog, it's a multi-day supplier loop
- **Spec-sheet hunt** — quoting/customer-service rep manually searches supplier websites for product information
- **Inconsistent attributes** — same product might be described 3 different ways across PDX, Pronto, web, supplier feeds
- **Image gaps** — many SKUs have no hero, alt views, or scale references
- **No data-quality SLA** — no scorecard, no remediation backlog, no clear ownership
- **Snowflake as accidental graveyard** — un-ranged SKUs sit there because nobody promotes them up

## Benefits (the value story)

| Lever | Mechanism | Sizing approach |
|---|---|---|
| **Quoting speed** | Spec sheets + attributes immediately available | Quote turnaround 3-5 days → <1hr (Deck 04 benchmark) |
| **Search relevance** | Better attributes = better search results | Conversion lift in Deck 08 |
| **Switching enablement** | Cross-references to competitor SKUs | Deck 09 funnel works |
| **AI enrichment cost** | $0.02-$0.10 per SKU (AI) vs $4-$13 per SKU (manual), per the research section | ~150k SKUs × $4+ saving per SKU = material |
| **Margin lift** | Better data = better pricing decisions in PPSS/CI | Indirect; supports Deck 02 / Deck 13 |
| **Off-range margin recovery** | Industry: 15-30% of revenue is non-stock; 3-8% margin loss → $1.8M-$4.8M leak | Direct from Deck 05 research |

---

## Layout candidates from the gold standard

- **Problem vector grid (4)**: Off-range-pain / Spec-sheet-hunt / Image-gaps / Attribute-inconsistency
- **PDX as the centre — round-trip diagram**: Ranged SKU → PDX. Un-ranged → Snowflake → enrichment → promote → PDX.
- **AI-enrichment pipeline**: Supplier feed → AI propose → human-validate → PDX commit. Per-SKU economics.
- **Data-quality scorecard**: completeness × accuracy × consistency × age — visual heat map by category
- **Before/after a SKU record**: today's sparse data vs the enhanced data
- **Roadmap**: Sprint 1 (Snowflake staging + AI enrichment POC, top 100 quote-SKUs) → Sprint 2 (top 1k off-range, image normalisation) → Sprint 3 (cross-references + Plex-CI feedback loop)
- **The ask**: AI API budget + master-data team capacity + PDX platform integration; payback from quoting speed + margin recovery

---

## Open questions to resolve

1. **PDX tooling** — which PDX product / vendor / build-in-house? (Need to inspect current OMX state)
2. **Snowflake → PDX promotion rule** — what triggers promotion (quote frequency? supplier confirmation? human review?)
3. **AI provider** — Anthropic / OpenAI / both? Prompt-locked brand voice?
4. **Quality SLA owner** — who owns the data-quality scorecard? Master Data team capacity?
5. **Image pipeline** — AI-generated alt views (likely no — risk), but standard sourcing from suppliers? Photoshop-as-a-service for gaps?
6. **GTIN / barcode** — completeness in current SKU master? Gap analysis needed
7. **Connection to Deck 05** — sequencing — Deck 05 needs Deck 16's outputs; Deck 16 needs Deck 05's platform shape

## Audience

**Primary:** Chief Commercial Officer + Master Data lead + Chief Digital Officer.
**Secondary:** Customer Service (quoting team) + Sales (off-range frequent flyers).
**Tertiary:** Buying — data quality drives pricing decisions.

## Reference

- Memory: **PDX as the master, Snowflake as un-ranged round-trip** (Jeff verbal 2026-06-28)
- Memory: **Deck 05 Global Catalog research** — off-range margin leak $1.8M-$4.8M annual; PIM tools $100k-$300k/yr packaged
- Memory: **OMX dbt models** at `lens/Current/libraries/dbt/models/presented/` — existing structured product data
- Memory: **Plex-CI** as competitor SKU data source — feeds the cross-reference layer
- Memory: **FDL REVIEW_OMX_dbt v2.0** standards — landing → ODS/STAGING → DW → PRESENTED; this deck's data flows fit that pattern

---

## Research deepening (background-agent, 2026-06-28)

### PIM platform pricing — what "PDX or build" looks like at market

| Platform | Annual licence (USD/EUR) | Strength | OMX fit |
|---|---|---|---|
| **Akeneo Growth** | $25k-$45k/yr entry | Strong open-source heritage; native AI enrichment via embedded GPT-4.1-mini; BYOK add-on for Claude/own model | Best for OMX scale; AI-native |
| **Akeneo Enterprise** | $60k-$200k+/yr | Enterprise governance + workflow | Premium tier |
| **Salsify Enterprise** | $75k-$300k+/yr (rarely <$50k) | Strong DAM + retail distribution channels | Heavy on retailer-syndication |
| **Pimcore PaaS** | $20k-$60k/yr; OSS free | MDM + CMS + DAM in one; open-source core | Best if OMX wants to extend with developers |
| **inRiver** | Custom enterprise | Multi-channel sales/marketing focus | Slightly off-vector for distribution B2B |
| **Stibo Systems STEP** | Custom $100k+ | Very large enterprise MDM+PIM | Over-engineered for OMX |
| **Custom-on-Snowflake (PDX wrapper)** | Compute + dbt + UI ($30-50k tooling) | Tightly fits the Snowflake-staging-round-trip thesis | Match for un-ranged round-trip |

**Sources:**
- Akeneo pricing — https://www.g2.com/products/akeneo-pim/pricing / https://www.akeneo.com/akeneo-pim/
- Akeneo AI enrichment (embedded GPT-4.1-mini, BYOK option) — https://help.akeneo.com/serenity-boost-your-productivity/ai-enhanced-enrichment-in-the-pim
- Salsify / Pimcore / inRiver / Stibo comparison — https://edana.ch/en/2025/05/24/akeneo-pimcore-salsify-choosing-and-integrating-a-pim-into-your-it-system/
- Best PIM 2026 review — https://www.inriver.com/resources/best-pim-solutions/ / https://www.viewpointanalysis.com/post/product-information-management-pim-software-options-2026

**Implication for deck:** the "PDX as master + Snowflake staging" pattern is sensible because **PIM platforms don't natively model the un-ranged SKU pool**; they assume every SKU is in-catalog. OMX's hybrid uses Snowflake for the dirty-data round-trip and promotes a SKU only when a quality threshold is met.
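
A minimal sketch of what a promotion gate could look like, assuming three inputs: recent quote frequency, an attribute-completeness score, and a human sign-off flag. The `StagedSku` shape, the field names and both thresholds are hypothetical placeholders; the actual promotion rule is still listed as an open question in this brief.

```python
from dataclasses import dataclass


@dataclass
class StagedSku:
    """An un-ranged SKU in the Snowflake staging layer (hypothetical shape)."""
    sku: str
    quote_count_90d: int     # how often the SKU was quoted in the last 90 days
    completeness: float      # 0.0 to 1.0, share of required attributes populated
    human_reviewed: bool     # an analyst has signed off the AI-proposed data


# Placeholder thresholds, not agreed values.
MIN_QUOTES = 3
MIN_COMPLETENESS = 0.8


def promotion_decision(item: StagedSku) -> str:
    """Return 'promote', 'review', or 'hold' for a staged SKU."""
    if item.completeness < MIN_COMPLETENESS:
        return "hold"        # keep enriching in staging
    if item.quote_count_90d < MIN_QUOTES:
        return "hold"        # not enough demand to justify promotion yet
    if not item.human_reviewed:
        return "review"      # good enough on paper, needs human validation
    return "promote"         # passes every gate; commit to PDX


print(promotion_decision(StagedSku("X-100", 5, 0.9, False)))
```

Returning a three-way decision rather than a boolean keeps the human-review queue visible as its own state, which matches the AI propose → human-validate → PDX commit flow described for the enrichment pipeline.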

### AI-enrichment cost per SKU — concrete numbers

| Provider | Cost shape | Cost per SKU (estimate) |
|---|---|---|
| **Claude Sonnet 4.6 / Opus 4.7** | $3-$15/1M tokens input; $15-$75/1M output | $0.02-$0.10/SKU (description + 5-10 attributes + classification) |
| **GPT-4.1 mini** (Akeneo embedded) | $0.15-$0.60/1M tokens | $0.005-$0.02/SKU |
| **GPT-5 / Claude Opus full** | $10-$50/1M+ | $0.10-$0.50/SKU |
| **Manual master-data analyst** | NZ rate ~$50-$80/hr; 5-10 min/SKU | $4-$13/SKU |

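At OMX scale, **~50,000 ranged + 100,000+ un-ranged candidate SKUs ≈ 150k SKUs to enrich**. Manual cost at $4-$13/SKU: **$600k-$1.95M**. AI cost (Claude Sonnet, full enrichment, at $0.02-$0.10/SKU): **$3k-$15k**. That is a **>99% cost reduction** when like-for-like ends of the ranges are compared, and still 97.5% in the most conservative pairing ($15k AI vs $600k manual). Human-in-loop validation is typically required on 5-10% of high-value items, which is still <$50k labour.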
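
The arithmetic behind these totals can be reproduced in a few lines. This is a sketch using only the per-SKU figures from the table above and the 150k SKU count; the function and variable names are illustrative.

```python
# Figures taken from the table and paragraph above.
SKU_COUNT = 150_000              # ~50k ranged + 100k+ un-ranged candidates
MANUAL_PER_SKU = (4.0, 13.0)     # analyst cost per SKU, low/high
AI_PER_SKU = (0.02, 0.10)        # Claude Sonnet estimate per SKU, low/high


def total_range(per_sku, sku_count=SKU_COUNT):
    """Scale a (low, high) per-SKU cost to a (low, high) programme total."""
    low, high = per_sku
    return (low * sku_count, high * sku_count)


def reduction_range(manual_total, ai_total):
    """Return (worst, best) fractional saving of AI over manual.

    Worst case pairs the most expensive AI run with the cheapest manual run;
    best case pairs the cheapest AI run with the most expensive manual run.
    """
    worst = 1 - ai_total[1] / manual_total[0]
    best = 1 - ai_total[0] / manual_total[1]
    return (worst, best)


manual = total_range(MANUAL_PER_SKU)
ai = total_range(AI_PER_SKU)
worst, best = reduction_range(manual, ai)

print(f"Manual: ${manual[0]:,.0f} to ${manual[1]:,.0f}")
print(f"AI:     ${ai[0]:,.0f} to ${ai[1]:,.0f}")
print(f"Saving: {worst:.1%} to {best:.1%}")
```

Pairing the extremes of both ranges gives the most conservative saving, which is worth quoting on the slide alongside the headline figure so the claim survives a challenge on the inputs.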

### Data-quality scorecard — industry framework

DAMA-DMBOK2 data-quality dimensions (already in Data Architect framework):
- **Completeness** (% required attributes populated)
- **Accuracy** (% matching ground truth)
- **Consistency** (% same value across sources: PDX vs Pronto vs web vs supplier feed)
- **Age / Currency** (median days since last refresh)
- **Validity** (% values within allowed domain/format)
- **Uniqueness** (% of SKUs with no detected duplicate)

OMX-specific scorecard categories to consider: GTIN/barcode presence, hero-image presence, alt-views count, supplier spec-sheet attached, taxonomy classification depth, dimensions populated.
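
As a sketch of how two of these dimensions could be scored, the example below computes completeness per SKU, averages it per category, and measures cross-source consistency for a single attribute. The required-field list, the record shape and the sample data are all invented for illustration and are not taken from the PDX schema.

```python
from collections import defaultdict

# Hypothetical required attributes, loosely mirroring the categories above.
REQUIRED_FIELDS = ("gtin", "hero_image", "spec_sheet", "taxonomy_path", "dimensions")


def completeness(record: dict) -> float:
    """Fraction of required attributes that are populated (not None or empty)."""
    populated = sum(1 for field in REQUIRED_FIELDS if record.get(field))
    return populated / len(REQUIRED_FIELDS)


def consistency(values_by_source: dict) -> float:
    """Fraction of sources agreeing with the most common value of one attribute."""
    values = [v for v in values_by_source.values() if v is not None]
    if not values:
        return 0.0
    most_common = max(set(values), key=values.count)
    return values.count(most_common) / len(values)


def completeness_by_category(records: list) -> dict:
    """Average completeness per category, the raw input for one heat-map row."""
    scores = defaultdict(list)
    for record in records:
        scores[record["category"]].append(completeness(record))
    return {category: sum(vals) / len(vals) for category, vals in scores.items()}


# Invented sample records, for illustration only.
records = [
    {"sku": "A1", "category": "Seating", "gtin": "00000000000000",
     "hero_image": "a1.jpg", "spec_sheet": None,
     "taxonomy_path": "Furniture>Seating", "dimensions": None},
    {"sku": "B2", "category": "Seating", "gtin": None, "hero_image": None,
     "spec_sheet": None, "taxonomy_path": "Furniture>Seating", "dimensions": None},
]

print(completeness_by_category(records))
print(consistency({"pdx": "Black", "pronto": "Black", "web": "BLK", "supplier": "Black"}))
```

Exact string comparison will count "Black" and "BLK" as a disagreement, so a real consistency score would likely need value normalisation first. Accuracy, validity and age need reference data or timestamps and are left out of this sketch.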

### Off-range margin recovery — sourced sizing

From Deck 05 research baseline: 15-30% of OMX revenue is non-stock; 3-8% margin loss on those sales = **$1.8M-$4.8M annual leak**. Deck 16's data-quality programme is the **prerequisite** to capturing that leak: the quoting team cannot price an off-range item with confidence if attributes are missing.
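
The revenue base behind this range is not stated here, but it can be back-solved from the quoted figures. The sketch below does only that; the implied values are inferences from the arithmetic, not numbers from the Deck 05 research, and should be checked against the source before they appear on a slide.

```python
# Figures quoted in the paragraph above.
LEAK = (1_800_000, 4_800_000)     # annual margin leak, low/high
MARGIN_LOSS = (0.03, 0.08)        # margin loss on non-stock sales, low/high
NON_STOCK_SHARE = (0.15, 0.30)    # non-stock share of revenue, low/high

# Inferred, not sourced: the non-stock revenue base each end of the range implies.
implied_base = tuple(leak / loss for leak, loss in zip(LEAK, MARGIN_LOSS))

# Inferred, not sourced: total revenue that base would represent at each share.
base = implied_base[0]
implied_total = (base / NON_STOCK_SHARE[1], base / NON_STOCK_SHARE[0])

print(f"Implied non-stock base: ${implied_base[0]:,.0f} and ${implied_base[1]:,.0f}")
print(f"Implied total revenue:  ${implied_total[0]:,.0f} to ${implied_total[1]:,.0f}")
```

Both ends of the leak range resolve to the same non-stock base of about $60M, so the quoted range appears to vary only the margin-loss rate. If the 15-30% share were also varied against a fixed revenue figure, the resulting range would be wider than $1.8M-$4.8M.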

---

## Vectors + visuals

### Lucide icons
- **Cover hero:** i-file-text (product data) + i-sparkles (enriched)
- **Problem grid:** i-search (off-range pain) / i-coffee (spec-sheet hunt) / i-shopping-cart (image gaps) / i-shuffle (attribute inconsistency)
- **PDX round-trip diagram:** i-repeat (round-trip) with i-door (promotion gate)
- **AI enrichment pipeline:** i-laptop (supplier feed) → i-brain (AI propose) → i-life-buoy (human validate) → i-file-text (PDX commit)
- **Data-quality scorecard:** i-bar-chart (heat map)
- **Cross-reference layer:** i-shuffle (competitor SKU map)

### Image concepts (cover + 5 key slides)
1. **Cover hero** — Close-up of a SKU record being enriched: blank fields filling in (description / attributes / dimensions / image). Source: Figma mockup of a PDX record, side-by-side before/after. Anchor: "Every SKU. Every attribute. Every supplier. Round-tripped."
2. **Off-range pain slide** — Photo of a quoting-team rep on the phone, surrounded by 4 browser tabs of supplier websites, sticky notes everywhere. Source: OMX customer-service-team photography (with consent) or staged stock. NZ context critical — Albany or Auckland-South office aesthetic.
3. **PDX round-trip diagram** — Pure infographic: Ranged SKUs flow direct to PDX; un-ranged route through Snowflake -> AI enrichment -> human review -> promotion threshold -> PDX. OMX Slate-Blue + Coral palette.
4. **AI enrichment economics slide** — Side-by-side: manual ($4-13/SKU stack of analyst hours) vs AI ($0.02-0.10/SKU laptop with API call). Annotated with NZ analyst rate. Source: composite infographic + photo elements.
5. **Before/after SKU record slide** — Real OMX SKU (sanitised) shown as: today's sparse 6-field record vs enriched 25-field record with hero image, alt views, GTIN, dimensions, spec sheet attached, classified to taxonomy depth-4. Source: Figma mock built from actual PDX schema.
6. **Data-quality scorecard slide** — Heat-map visual of all categories x quality dimensions (completeness / accuracy / consistency / age). Red-amber-green grid. Pure infographic.