End-to-end customer analytics pipeline: it ingests Snowflake data into Dataiku DSS; computes RFM (recency, frequency, monetary) scores, customer lifetime value (CLV) estimates, and churn risk; writes the results back to Snowflake; and is mirrored on Databricks with validated parity.
Snowflake (DEV.DATAIKU_DEMO)
├── CUSTOMERS (1,000 rows)
└── TRANSACTIONS (8,000 rows)
│
▼ Dataiku DSS (DEMO project)
┌───────────────────────────────────────────┐
│ [Shaker] filter STATUS = 'completed' │
│ → transactions_completed │
│ │
│ [Join] LEFT JOIN on CUSTOMER_ID │
│ → customer_transactions_joined │
│ │
│ [Python] RFM + CLV + Churn analytics │
│ → CUSTOMER_ANALYTICS_OUTPUT │
└───────────────────────────────────────────┘
│
▼
Snowflake DEV.DATAIKU_DEMO.CUSTOMER_ANALYTICS_OUTPUT
Databricks dev.dataiku_demo.customer_analytics_output ← migrated, parity verified
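The three recipe steps above can be sketched locally with pandas. This is a minimal stand-in, not the actual DSS recipes: column names (ORDER_DATE, AMOUNT, STATUS) and the CLV heuristic are illustrative assumptions.

```python
import pandas as pd

def rfm_clv_churn(customers: pd.DataFrame, tx: pd.DataFrame,
                  as_of: pd.Timestamp) -> pd.DataFrame:
    """Filter, join, and score: a local stand-in for the three recipes."""
    # [Shaker] keep completed transactions only
    tx = tx[tx["STATUS"] == "completed"]
    # [Join] LEFT JOIN customers to their transactions on CUSTOMER_ID
    joined = customers.merge(tx, on="CUSTOMER_ID", how="left")
    # [Python] aggregate per customer
    g = joined.groupby("CUSTOMER_ID").agg(
        LAST_ORDER=("ORDER_DATE", "max"),
        FREQUENCY=("ORDER_DATE", "count"),
        MONETARY=("AMOUNT", "sum"),
    ).reset_index()
    g["RECENCY_DAYS"] = (as_of - g["LAST_ORDER"]).dt.days
    # Customers with no completed orders: treat as maximally stale, zero spend.
    g["RECENCY_DAYS"] = g["RECENCY_DAYS"].fillna(g["RECENCY_DAYS"].max() + 1)
    g["MONETARY"] = g["MONETARY"].fillna(0.0)

    def quintile(s: pd.Series, best_is_low: bool) -> pd.Series:
        # Rank first so qcut always has unique bin edges, then bucket 1..5.
        ranks = s.rank(method="first", ascending=not best_is_low)
        return pd.qcut(ranks, 5, labels=[1, 2, 3, 4, 5]).astype(int)

    g["R"] = quintile(g["RECENCY_DAYS"], best_is_low=True)   # recent buyers = 5
    g["F"] = quintile(g["FREQUENCY"], best_is_low=False)
    g["M"] = quintile(g["MONETARY"], best_is_low=False)
    # Toy CLV heuristic (an assumption, not the project's real model):
    # historical spend scaled by the recency score as a retention proxy.
    g["CLV_EST"] = g["MONETARY"] * (g["R"] / 5.0)
    # Flag the bottom two recency quintiles as churn risks.
    g["CHURN_RISK"] = g["R"] <= 2
    return g
```

The rank-before-qcut trick keeps the quintile bucketing stable even when many customers share the same frequency or spend.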
Parity was validated with Datafold, a data reliability platform that runs cross-database diffs at scale using bisection hashing.
- Datadiff run: https://app.datafold.com/datadiffs/13857162
- Algorithm: bisection hash on CUSTOMER_ID
- Result: 0 differences across all 1,000 rows
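In outline, bisection hashing compares a checksum of each whole key range and recurses only into ranges whose checksums disagree, so identical tables cost a handful of hashes instead of a row-by-row scan. A toy illustration in pure Python (not the data-diff implementation; assumes both sides are sorted by key and have aligned key sets):

```python
import hashlib

def seg_hash(rows):
    # Fold a sorted segment of (key, value) rows into one digest.
    h = hashlib.sha256()
    for k, v in rows:
        h.update(f"{k}:{v}".encode())
    return h.hexdigest()

def bisect_diff(a, b):
    """Return keys whose rows differ, comparing segment hashes and
    recursing only into segments whose hashes do not match."""
    if seg_hash(a) == seg_hash(b):
        return []  # whole segment identical -> no per-row work needed
    if len(a) <= 1 or len(b) <= 1:
        # Base case: surface the mismatched key(s) directly.
        return sorted({k for k, v in set(a) ^ set(b)})
    mid_a, mid_b = len(a) // 2, len(b) // 2
    return bisect_diff(a[:mid_a], b[:mid_b]) + bisect_diff(a[mid_a:], b[mid_b:])
```

With 0 differences, the comparison terminates after the first top-level hash match on each segment.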
The validate_parity.py script uses the same open-source data-diff library that powers Datafold cloud.
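A minimal sketch of what such a script might look like, assuming the open-source data-diff package's connect_to_table/diff_tables API; the connection URIs and the summarize helper are illustrative assumptions, not copied from the project:

```python
import os

def summarize(diffs):
    # Tally diff output: diff_tables() yields ("+", row) for rows only in
    # the target and ("-", row) for rows only in the source.
    counts = {"+": 0, "-": 0}
    for sign, _row in diffs:
        counts[sign] += 1
    return counts

def run_parity_check() -> dict:
    """Connect both warehouses and count differing rows (sketch only)."""
    from data_diff import connect_to_table, diff_tables

    snowflake = connect_to_table(
        os.environ["SNOWFLAKE_URI"],  # placeholder env var, not the real config
        "DEV.DATAIKU_DEMO.CUSTOMER_ANALYTICS_OUTPUT",
        key_columns=("CUSTOMER_ID",),
    )
    databricks = connect_to_table(
        os.environ["DATABRICKS_URI"],  # placeholder env var
        "dev.dataiku_demo.customer_analytics_output",
        key_columns=("CUSTOMER_ID",),
    )
    return summarize(diff_tables(snowflake, databricks))
```

A parity pass corresponds to both counts being zero, matching the "0 differences across all 1,000 rows" result above.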
