E-Commerce Analytics Platform

Phase 5: Gold Layer Analytics

Duration: Days 12-14 | 6-8 hours total
Goal: Create business-ready aggregations, KPIs, and analytics models


OVERVIEW

In Phase 5, you will:

  - Build customer and product metric tables with composite quality and health scores
  - Implement RFM (Recency, Frequency, Monetary) customer segmentation
  - Create cohort retention analysis by registration month
  - Generate daily, weekly, and monthly time-series aggregations
  - Build executive summary and top-performer tables

Gold Layer Philosophy: Business-ready, highly aggregated data optimized for reporting and dashboards. No joins required for end users.


PREREQUISITES

Before starting Phase 5:

  - Phase 4 complete: Silver dimensional models (dim_customers, dim_products, fact_orders, fact_order_items, fact_web_events) built and tested
  - dbt connected to your Databricks workspace


ARCHITECTURE: GOLD LAYER

Silver Tables (dimensional model)
        │
        ▼  dbt Gold models
Gold Tables (aggregated)
  [customer_metrics] [product_performance] [daily_revenue]
  [rfm_segments] [cohort_analysis]

STEP 5.1: Set Up Gold Layer Structure (15 minutes)

Actions:

  1. Create gold models directory:

cd dbt
mkdir -p models/gold/metrics
mkdir -p models/gold/aggregations
mkdir -p models/gold/cohorts
  2. Update dbt_project.yml with Gold configuration:

Add this to the models section:

models:
  ecommerce_analytics:
    # ... existing staging and silver config ...

    # Gold layer - business aggregations
    gold:
      +materialized: table
      +schema: gold
      metrics:
        +tags: ['gold', 'metrics']
      aggregations:
        +tags: ['gold', 'aggregations']
      cohorts:
        +tags: ['gold', 'cohorts']
  3. Create Gold database in Databricks:

In Databricks SQL Editor or notebook:

CREATE DATABASE IF NOT EXISTS gold;
USE gold;

✅ CHECKPOINT


STEP 5.2: Create Customer Metrics (1 hour)

Build comprehensive customer analytics.

Actions:

  1. Create models/gold/metrics/customer_metrics.sql:

{{ config(materialized='table', schema='gold') }}

with customers as (
    select * from {{ ref('dim_customers') }}
),

orders as (
    select * from {{ ref('fact_orders') }}
    where status = 'completed'
),

-- Customer order metrics
customer_orders as (
    select
        customer_id,
        count(*) as order_count,
        sum(total) as total_revenue,
        avg(total) as avg_order_value,
        min(order_date) as first_order_date,
        max(order_date) as last_order_date,
        datediff(max(order_date), min(order_date)) as customer_lifespan_days,
        sum(num_items) as total_items_purchased,
        avg(num_items) as avg_items_per_order
    from orders
    group by customer_id
),

-- Recent activity
recent_orders as (
    select
        customer_id,
        count(*) as orders_last_30_days
    from orders
    where order_date >= date_sub(current_date(), 30)
    group by customer_id
),

final as (
    select
        -- Customer info
        c.customer_id,
        c.full_name,
        c.email,
        c.country,
        c.state,
        c.segment,
        c.value_segment,
        c.lifecycle_stage,
        c.marketing_opt_in,

        -- Registration
        c.registration_date,
        c.days_since_registration,

        -- Order metrics
        coalesce(co.order_count, 0) as lifetime_orders,
        coalesce(co.total_revenue, 0) as lifetime_revenue,
        coalesce(co.avg_order_value, 0) as avg_order_value,
        coalesce(co.total_items_purchased, 0) as total_items_purchased,
        coalesce(co.avg_items_per_order, 0) as avg_items_per_order,

        -- Dates
        co.first_order_date,
        co.last_order_date,
        datediff(current_date(), co.last_order_date) as days_since_last_order,
        co.customer_lifespan_days,

        -- Recent activity
        coalesce(ro.orders_last_30_days, 0) as orders_last_30_days,

        -- Engagement metrics
        case
            when co.customer_lifespan_days > 0
                then round(co.order_count / (co.customer_lifespan_days / 30.0), 2)
            else 0
        end as avg_orders_per_month,
        case
            when co.order_count > 0
                then round(co.total_revenue / co.order_count, 2)
            else 0
        end as revenue_per_order,

        -- Customer quality score (0-100)
        least(100,
            (case when co.order_count >= 10 then 25 else co.order_count * 2.5 end)
            + (case when co.total_revenue >= 1000 then 25 else co.total_revenue / 40 end)
            + (case
                   when datediff(current_date(), co.last_order_date) <= 30 then 25
                   when datediff(current_date(), co.last_order_date) <= 90 then 15
                   else 5
               end)
            + (case when c.marketing_opt_in then 15 else 0 end)
            + (case when c.is_complete_profile then 10 else 0 end)
        ) as customer_quality_score,

        -- Timestamps
        current_timestamp() as metric_updated_at
    from customers c
    left join customer_orders co on c.customer_id = co.customer_id
    left join recent_orders ro on c.customer_id = ro.customer_id
)

select * from final
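The quality-score arithmetic in this model is easy to sanity-check outside SQL. A minimal Python sketch of the same capped composite scoring (the function name and standalone inputs are illustrative, not part of the dbt project):

```python
def customer_quality_score(order_count, total_revenue, days_since_last_order,
                           marketing_opt_in, is_complete_profile):
    """Mirror of the SQL CASE logic: composite score capped at 100."""
    score = 25 if order_count >= 10 else order_count * 2.5        # frequency: up to 25
    score += 25 if total_revenue >= 1000 else total_revenue / 40  # spend: up to 25
    if days_since_last_order <= 30:                               # recency: 25 / 15 / 5
        score += 25
    elif days_since_last_order <= 90:
        score += 15
    else:
        score += 5
    score += 15 if marketing_opt_in else 0
    score += 10 if is_complete_profile else 0
    return min(100, score)

# A frequent, high-spend, recently active customer with a full profile maxes out:
print(customer_quality_score(12, 1500, 10, True, True))   # -> 100
# A customer with no orders and no profile bottoms out at the recency floor:
print(customer_quality_score(0, 0, 365, False, False))    # -> 5
```

Note the `least(100, ...)` cap matters: the five components sum to a maximum of 100 only when each is capped, so the SQL and the sketch agree at the boundaries.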
  2. Create models/gold/metrics/customer_metrics.yml:

version: 2

models:
  - name: customer_metrics
    description: |
      Comprehensive customer metrics for analysis and segmentation.
      One row per customer with all key performance indicators.
    columns:
      - name: customer_id
        description: "Primary key"
        tests:
          - unique
          - not_null
      - name: lifetime_revenue
        description: "Total revenue from completed orders"
      - name: customer_quality_score
        description: |
          Composite score (0-100) based on:
          - Order frequency (25 points)
          - Total spend (25 points)
          - Recent activity (25 points)
          - Marketing opt-in (15 points)
          - Complete profile (10 points)
  3. Run the model:

dbt run --models customer_metrics

✅ CHECKPOINT


STEP 5.3: Create RFM Analysis (1 hour)

Implement Recency, Frequency, Monetary segmentation.

Actions:

  1. Create models/gold/cohorts/rfm_segments.sql:

{{ config(materialized='table', schema='gold') }}

with customer_rfm_values as (
    select
        customer_id,
        datediff(current_date(), last_order_date) as recency_days,
        lifetime_orders as frequency,
        lifetime_revenue as monetary
    from {{ ref('customer_metrics') }}
    where lifetime_orders > 0
),

-- Calculate quartiles for each RFM component
rfm_quartiles as (
    select
        percentile_approx(recency_days, 0.25) as recency_q1,
        percentile_approx(recency_days, 0.50) as recency_q2,
        percentile_approx(recency_days, 0.75) as recency_q3,
        percentile_approx(frequency, 0.25) as frequency_q1,
        percentile_approx(frequency, 0.50) as frequency_q2,
        percentile_approx(frequency, 0.75) as frequency_q3,
        percentile_approx(monetary, 0.25) as monetary_q1,
        percentile_approx(monetary, 0.50) as monetary_q2,
        percentile_approx(monetary, 0.75) as monetary_q3
    from customer_rfm_values
),

-- Assign RFM scores (1-4, where 4 is best)
rfm_scores as (
    select
        cv.customer_id,
        cv.recency_days,
        cv.frequency,
        cv.monetary,
        -- Recency score (fewer days = better = higher score)
        case
            when cv.recency_days <= q.recency_q1 then 4
            when cv.recency_days <= q.recency_q2 then 3
            when cv.recency_days <= q.recency_q3 then 2
            else 1
        end as recency_score,
        -- Frequency score (more orders = better = higher score)
        case
            when cv.frequency >= q.frequency_q3 then 4
            when cv.frequency >= q.frequency_q2 then 3
            when cv.frequency >= q.frequency_q1 then 2
            else 1
        end as frequency_score,
        -- Monetary score (more revenue = better = higher score)
        case
            when cv.monetary >= q.monetary_q3 then 4
            when cv.monetary >= q.monetary_q2 then 3
            when cv.monetary >= q.monetary_q1 then 2
            else 1
        end as monetary_score
    from customer_rfm_values cv
    cross join rfm_quartiles q
),

final as (
    select
        customer_id,
        recency_days,
        frequency,
        monetary,
        recency_score,
        frequency_score,
        monetary_score,
        -- Combined RFM score
        concat(
            cast(recency_score as string),
            cast(frequency_score as string),
            cast(monetary_score as string)
        ) as rfm_score,
        -- Overall RFM value (3-12 scale)
        recency_score + frequency_score + monetary_score as rfm_total,
        -- Segment classification
        case
            when recency_score >= 3 and frequency_score >= 3 and monetary_score >= 3 then 'Champions'
            when recency_score >= 3 and frequency_score <= 2 and monetary_score >= 3 then 'Loyal Customers'
            when recency_score >= 3 and frequency_score <= 2 and monetary_score <= 2 then 'Potential Loyalists'
            when recency_score >= 3 and frequency_score >= 3 and monetary_score <= 2 then 'Recent Customers'
            when recency_score <= 2 and frequency_score >= 3 and monetary_score >= 3 then 'At Risk'
            when recency_score <= 2 and frequency_score >= 2 and monetary_score >= 2 then 'Cant Lose Them'
            when recency_score <= 2 and frequency_score <= 2 and monetary_score >= 3 then 'Hibernating High Value'
            when recency_score <= 1 and frequency_score >= 2 and monetary_score <= 2 then 'About to Sleep'
            when recency_score <= 1 and frequency_score <= 1 and monetary_score <= 2 then 'Lost'
            else 'Other'
        end as rfm_segment,
        current_timestamp() as segment_updated_at
    from rfm_scores
)

select * from final
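The quartile-to-score mapping above is the heart of RFM: each component gets a 1-4 score from its quartile boundaries, with recency inverted because fewer days is better. A small Python sketch of that mapping (function name and sample boundaries are illustrative):

```python
def quartile_score(value, q1, q2, q3, higher_is_better=True):
    """Assign a 1-4 score from quartile boundaries, as in the SQL CASE blocks."""
    if higher_is_better:          # frequency, monetary
        if value >= q3:
            return 4
        if value >= q2:
            return 3
        if value >= q1:
            return 2
        return 1
    else:                         # recency: lower days = better
        if value <= q1:
            return 4
        if value <= q2:
            return 3
        if value <= q3:
            return 2
        return 1

# With frequency quartiles q1=2, q2=5, q3=9 orders, 7 orders lands in the third band:
print(quartile_score(7, 2, 5, 9))                                # -> 3
# With recency quartiles 30/60/120 days, a 10-day-old order is top band:
print(quartile_score(10, 30, 60, 120, higher_is_better=False))   # -> 4
```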
  2. Create models/gold/cohorts/rfm_segment_summary.sql:

{{ config(materialized='table', schema='gold') }}

with rfm as (
    select * from {{ ref('rfm_segments') }}
),

segment_stats as (
    select
        rfm_segment,
        count(*) as customer_count,
        round(avg(recency_days), 1) as avg_recency_days,
        round(avg(frequency), 1) as avg_frequency,
        round(avg(monetary), 2) as avg_monetary,
        round(sum(monetary), 2) as total_revenue,
        round(avg(rfm_total), 1) as avg_rfm_score
    from rfm
    group by rfm_segment
)

select
    rfm_segment,
    customer_count,
    round(customer_count * 100.0 / sum(customer_count) over (), 2) as pct_of_customers,
    avg_recency_days,
    avg_frequency,
    avg_monetary,
    total_revenue,
    round(total_revenue * 100.0 / sum(total_revenue) over (), 2) as pct_of_revenue,
    avg_rfm_score
from segment_stats
order by total_revenue desc
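The share-of-total columns (pct_of_customers, pct_of_revenue) use an unbounded `sum(...) over ()` window so each row can be divided by the grand total in one pass. The same arithmetic in a quick Python sketch (helper name illustrative):

```python
def pct_of_total(values):
    """Per-row equivalent of round(value * 100.0 / sum(value) over (), 2)."""
    total = sum(values)
    return [round(v * 100.0 / total, 2) for v in values]

# Three segments with revenue 300, 100, 100 -> shares of the 500 total:
print(pct_of_total([300, 100, 100]))   # -> [60.0, 20.0, 20.0]
```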
  3. Run RFM models:

dbt run --models rfm_segments rfm_segment_summary
  4. View results in Databricks:

SELECT * FROM gold.rfm_segment_summary
ORDER BY total_revenue DESC;

✅ CHECKPOINT


STEP 5.4: Create Product Performance Metrics (1 hour)

Actions:

  1. Create models/gold/metrics/product_performance.sql:

{{ config(materialized='table', schema='gold') }}

with products as (
    select * from {{ ref('dim_products') }}
),

-- Sales metrics by time period
sales_30d as (
    select
        oi.product_id,
        sum(oi.quantity) as qty_sold_30d,
        sum(oi.total_price) as revenue_30d,
        count(distinct oi.order_id) as orders_30d
    from {{ ref('fact_order_items') }} oi
    join {{ ref('fact_orders') }} o on oi.order_id = o.order_id
    where o.order_date >= date_sub(current_date(), 30)
      and o.status = 'completed'
    group by oi.product_id
),

sales_90d as (
    select
        oi.product_id,
        sum(oi.quantity) as qty_sold_90d,
        sum(oi.total_price) as revenue_90d
    from {{ ref('fact_order_items') }} oi
    join {{ ref('fact_orders') }} o on oi.order_id = o.order_id
    where o.order_date >= date_sub(current_date(), 90)
      and o.status = 'completed'
    group by oi.product_id
),

-- Web engagement metrics
product_engagement as (
    select
        product_id,
        count(*) as total_views,
        count(distinct session_id) as unique_sessions,
        count(distinct customer_id) as unique_viewers
    from {{ ref('fact_web_events') }}
    where event_type = 'product_view'
      and product_id is not null
      and event_date >= date_sub(current_date(), 30)
    group by product_id
),

final as (
    select
        -- Product info
        p.product_id,
        p.product_name,
        p.category,
        p.subcategory,
        p.price,
        p.cost,
        p.profit_margin_pct,
        p.stock_quantity,
        p.stock_status,
        p.is_active,

        -- Lifetime metrics
        p.total_quantity_sold as lifetime_quantity_sold,
        p.total_revenue as lifetime_revenue,
        p.times_ordered as lifetime_orders,

        -- 30-day metrics
        coalesce(s30.qty_sold_30d, 0) as qty_sold_30d,
        coalesce(s30.revenue_30d, 0) as revenue_30d,
        coalesce(s30.orders_30d, 0) as orders_30d,

        -- 90-day metrics
        coalesce(s90.qty_sold_90d, 0) as qty_sold_90d,
        coalesce(s90.revenue_90d, 0) as revenue_90d,

        -- Web metrics (30 days)
        coalesce(pe.total_views, 0) as views_30d,
        coalesce(pe.unique_sessions, 0) as sessions_30d,
        coalesce(pe.unique_viewers, 0) as viewers_30d,

        -- Conversion metrics
        case
            when coalesce(pe.total_views, 0) > 0
                then round(coalesce(s30.orders_30d, 0) * 100.0 / pe.total_views, 2)
            else 0
        end as view_to_purchase_rate_30d,

        -- Velocity metrics (units per day)
        round(coalesce(s30.qty_sold_30d, 0) / 30.0, 2) as daily_velocity_30d,
        round(coalesce(s90.qty_sold_90d, 0) / 90.0, 2) as daily_velocity_90d,

        -- Profit metrics
        round(coalesce(s30.revenue_30d, 0) * (p.profit_margin_pct / 100), 2) as profit_30d,
        round(p.total_revenue * (p.profit_margin_pct / 100), 2) as lifetime_profit,

        -- Days of inventory remaining
        case
            when coalesce(s30.qty_sold_30d, 0) > 0
                then round(p.stock_quantity / (s30.qty_sold_30d / 30.0), 0)
            else 999
        end as days_of_inventory,

        -- Product health score (0-100)
        least(100,
            (case when coalesce(s30.orders_30d, 0) >= 10 then 30 else coalesce(s30.orders_30d, 0) * 3 end)
            + (case
                   when p.stock_quantity >= 100 then 20
                   when p.stock_quantity >= 50 then 15
                   when p.stock_quantity > 0 then 10
                   else 0
               end)
            + (case
                   when p.profit_margin_pct >= 50 then 25
                   when p.profit_margin_pct >= 30 then 20
                   when p.profit_margin_pct >= 20 then 15
                   else 10
               end)
            + (case when coalesce(pe.total_views, 0) >= 100 then 15 else coalesce(pe.total_views, 0) * 0.15 end)
            + (case when p.is_active then 10 else 0 end)
        ) as product_health_score,

        -- Performance tier
        case
            when coalesce(s30.revenue_30d, 0) >= 1000 then 'Platinum'
            when coalesce(s30.revenue_30d, 0) >= 500 then 'Gold'
            when coalesce(s30.revenue_30d, 0) >= 100 then 'Silver'
            when coalesce(s30.revenue_30d, 0) > 0 then 'Bronze'
            else 'Inactive'
        end as performance_tier_30d,

        current_timestamp() as metric_updated_at
    from products p
    left join sales_30d s30 on p.product_id = s30.product_id
    left join sales_90d s90 on p.product_id = s90.product_id
    left join product_engagement pe on p.product_id = pe.product_id
)

select * from final
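The days-of-inventory column divides current stock by the 30-day daily velocity, with 999 as a sentinel for products that have not sold recently (avoiding a divide-by-zero). A short Python sketch of the same rule (function name illustrative):

```python
def days_of_inventory(stock_quantity, qty_sold_30d):
    """Stock divided by daily velocity over the last 30 days; 999 when no recent sales."""
    if qty_sold_30d <= 0:
        return 999                                  # sentinel, mirrors the SQL ELSE branch
    daily_velocity = qty_sold_30d / 30.0            # units per day
    return round(stock_quantity / daily_velocity)

# 120 units in stock, 60 sold in the last 30 days (2/day) -> 60 days of cover:
print(days_of_inventory(120, 60))   # -> 60
# No recent sales -> sentinel:
print(days_of_inventory(50, 0))     # -> 999
```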
  2. Create models/gold/metrics/category_performance.sql:

{{ config(materialized='table', schema='gold') }}

select
    category,
    count(*) as product_count,
    sum(case when is_active then 1 else 0 end) as active_products,

    -- Revenue metrics
    round(sum(revenue_30d), 2) as revenue_30d,
    round(sum(revenue_90d), 2) as revenue_90d,
    round(sum(lifetime_revenue), 2) as lifetime_revenue,

    -- Profit metrics
    round(sum(profit_30d), 2) as profit_30d,
    round(sum(lifetime_profit), 2) as lifetime_profit,
    round(avg(profit_margin_pct), 2) as avg_profit_margin,

    -- Sales metrics
    sum(qty_sold_30d) as units_sold_30d,
    sum(qty_sold_90d) as units_sold_90d,
    sum(lifetime_quantity_sold) as lifetime_units_sold,

    -- Engagement metrics
    sum(views_30d) as total_views_30d,
    round(avg(view_to_purchase_rate_30d), 2) as avg_conversion_rate,

    -- Inventory
    sum(stock_quantity) as total_inventory,
    round(avg(days_of_inventory), 0) as avg_days_of_inventory,

    -- Performance
    round(avg(product_health_score), 1) as avg_health_score,

    current_timestamp() as metric_updated_at
from {{ ref('product_performance') }}
group by category
order by revenue_30d desc
  3. Run product models:

dbt run --models product_performance category_performance

✅ CHECKPOINT


STEP 5.5: Create Time-Series Aggregations (1 hour)

Actions:

  1. Create models/gold/aggregations/daily_revenue.sql:

{{ config(materialized='table', schema='gold') }}

with orders as (
    select * from {{ ref('fact_orders') }}
    where status = 'completed'
),

daily_metrics as (
    select
        order_date,

        -- Order metrics
        count(distinct order_id) as total_orders,
        count(distinct customer_id) as unique_customers,

        -- Revenue metrics
        round(sum(total), 2) as gross_revenue,
        round(sum(subtotal), 2) as subtotal,
        round(sum(discount), 2) as total_discounts,
        round(sum(shipping_cost), 2) as shipping_revenue,
        round(sum(tax), 2) as tax_collected,

        -- Average metrics
        round(avg(total), 2) as avg_order_value,
        round(avg(num_items), 2) as avg_items_per_order,
        round(avg(discount_pct), 2) as avg_discount_pct,

        -- Order composition
        sum(num_items) as total_items_sold,
        sum(total_quantity) as total_units_sold,

        -- Payment methods
        sum(case when payment_method = 'credit_card' then 1 else 0 end) as orders_credit_card,
        sum(case when payment_method = 'debit_card' then 1 else 0 end) as orders_debit_card,
        sum(case when payment_method = 'paypal' then 1 else 0 end) as orders_paypal,
        sum(case when payment_method = 'apple_pay' then 1 else 0 end) as orders_apple_pay,

        -- Order sizes
        sum(case when order_size = 'Small' then 1 else 0 end) as orders_small,
        sum(case when order_size = 'Medium' then 1 else 0 end) as orders_medium,
        sum(case when order_size = 'Large' then 1 else 0 end) as orders_large,
        sum(case when order_size = 'Extra Large' then 1 else 0 end) as orders_xl,

        -- Flags
        sum(case when has_discount then 1 else 0 end) as orders_with_discount,
        sum(case when free_shipping then 1 else 0 end) as orders_free_shipping,
        sum(case when is_weekend then 1 else 0 end) as orders_weekend
    from orders
    group by order_date
),

-- Add rolling averages
with_rolling as (
    select
        *,
        -- 7-day rolling averages
        round(avg(gross_revenue) over (
            order by order_date rows between 6 preceding and current row
        ), 2) as revenue_7d_avg,
        round(avg(total_orders) over (
            order by order_date rows between 6 preceding and current row
        ), 1) as orders_7d_avg,
        round(avg(avg_order_value) over (
            order by order_date rows between 6 preceding and current row
        ), 2) as aov_7d_avg,
        -- 30-day rolling averages
        round(avg(gross_revenue) over (
            order by order_date rows between 29 preceding and current row
        ), 2) as revenue_30d_avg,
        round(avg(total_orders) over (
            order by order_date rows between 29 preceding and current row
        ), 1) as orders_30d_avg
    from daily_metrics
)

select
    *,
    current_timestamp() as metric_updated_at
from with_rolling
order by order_date desc
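The window frames above (`rows between 6 preceding and current row`) average the current day with up to six prior days; early rows simply average over fewer days. The same behavior sketched in Python (helper name illustrative):

```python
def rolling_avg(values, window):
    """Equivalent of AVG(...) OVER (ORDER BY d ROWS BETWEEN window-1 PRECEDING AND CURRENT ROW)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]  # shorter frame at the start of the series
        out.append(round(sum(chunk) / len(chunk), 2))
    return out

# Four days of revenue with a 3-day frame:
print(rolling_avg([100, 200, 300, 400], 3))   # -> [100.0, 150.0, 200.0, 300.0]
```

This is why a 7-day rolling average never produces nulls at the start of the series, unlike a lag-based calculation.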
  2. Create models/gold/aggregations/monthly_revenue.sql:

{{ config(materialized='table', schema='gold') }}

with orders as (
    select * from {{ ref('fact_orders') }}
    where status = 'completed'
),

monthly_metrics as (
    select
        date_format(order_date, 'yyyy-MM') as order_month,
        year(order_date) as order_year,
        month(order_date) as order_month_num,

        -- Order metrics
        count(distinct order_id) as total_orders,
        count(distinct customer_id) as unique_customers,
        round(count(distinct order_id) * 1.0 / count(distinct customer_id), 2) as orders_per_customer,

        -- Revenue metrics
        round(sum(total), 2) as gross_revenue,
        round(sum(discount), 2) as total_discounts,
        round(avg(total), 2) as avg_order_value,

        -- Items
        sum(num_items) as total_items,
        round(avg(num_items), 2) as avg_items_per_order,

        -- Customer segments
        count(distinct case when shipping_country = 'USA' then customer_id end) as customers_usa,
        count(distinct case when shipping_country != 'USA' then customer_id end) as customers_international
    from orders
    group by date_format(order_date, 'yyyy-MM'), year(order_date), month(order_date)
),

with_growth as (
    select
        *,
        -- Month-over-month growth
        round(
            (gross_revenue - lag(gross_revenue) over (order by order_month))
            / lag(gross_revenue) over (order by order_month) * 100,
        2) as revenue_mom_growth_pct,
        round(
            (total_orders - lag(total_orders) over (order by order_month)) * 1.0
            / lag(total_orders) over (order by order_month) * 100,
        2) as orders_mom_growth_pct,
        -- Year-over-year growth
        round(
            (gross_revenue - lag(gross_revenue, 12) over (order by order_month))
            / lag(gross_revenue, 12) over (order by order_month) * 100,
        2) as revenue_yoy_growth_pct
    from monthly_metrics
)

select
    *,
    current_timestamp() as metric_updated_at
from with_growth
order by order_month desc
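The growth columns divide the change from `lag(...)` by the prior value; the first month has no lag, so its growth comes back null. A small sketch of the same calculation (illustrative helper, not project code):

```python
def mom_growth_pct(monthly_revenue):
    """Per-month equivalent of (x - lag(x)) / lag(x) * 100; None where there is no prior month."""
    out = [None]  # the first month has no lag -> null in SQL, None here
    for prev, cur in zip(monthly_revenue, monthly_revenue[1:]):
        out.append(round((cur - prev) / prev * 100, 2))
    return out

# Revenue 1000 -> 1100 -> 990 gives +10% then -10%:
print(mom_growth_pct([1000, 1100, 990]))   # -> [None, 10.0, -10.0]
```

The YoY column is the same formula with a lag of 12, so the first twelve months of history are null by construction.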
  3. Create models/gold/aggregations/weekly_web_activity.sql:

{{ config(materialized='table', schema='gold') }}

with events as (
    select * from {{ ref('fact_web_events') }}
),

weekly_metrics as (
    select
        date_trunc('week', event_date) as week_start_date,

        -- Event counts
        count(*) as total_events,
        count(distinct session_id) as unique_sessions,
        count(distinct customer_id) as unique_authenticated_users,

        -- Event types
        sum(case when event_type = 'page_view' then 1 else 0 end) as page_views,
        sum(case when event_type = 'product_view' then 1 else 0 end) as product_views,
        sum(case when event_type = 'add_to_cart' then 1 else 0 end) as add_to_carts,
        sum(case when event_type = 'search' then 1 else 0 end) as searches,
        sum(case when event_type = 'wishlist_add' then 1 else 0 end) as wishlist_adds,

        -- Devices
        sum(case when device_type = 'desktop' then 1 else 0 end) as events_desktop,
        sum(case when device_type = 'mobile' then 1 else 0 end) as events_mobile,
        sum(case when device_type = 'tablet' then 1 else 0 end) as events_tablet,

        -- User types
        sum(case when user_type = 'Authenticated' then 1 else 0 end) as events_authenticated,
        sum(case when user_type = 'Anonymous' then 1 else 0 end) as events_anonymous,

        -- Engagement
        round(count(*) * 1.0 / count(distinct session_id), 2) as avg_events_per_session
    from events
    group by date_trunc('week', event_date)
)

select
    *,
    -- Conversion funnel metrics
    round(add_to_carts * 100.0 / nullif(product_views, 0), 2) as product_to_cart_rate,
    round(searches * 100.0 / nullif(page_views, 0), 2) as search_rate,
    -- Device mix
    round(events_mobile * 100.0 / total_events, 2) as mobile_pct,
    round(events_desktop * 100.0 / total_events, 2) as desktop_pct,
    current_timestamp() as metric_updated_at
from weekly_metrics
order by week_start_date desc
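The funnel rates guard against empty weeks with `nullif(y, 0)`, which turns a zero denominator into null so the division yields null instead of erroring. The same guard in Python (helper name illustrative):

```python
def funnel_rate(numerator, denominator):
    """Equivalent of round(x * 100.0 / nullif(y, 0), 2): None instead of dividing by zero."""
    if denominator == 0:
        return None   # nullif(y, 0) makes the whole expression null in SQL
    return round(numerator * 100.0 / denominator, 2)

# 25 add-to-carts out of 500 product views:
print(funnel_rate(25, 500))   # -> 5.0
# A week with no product views stays null rather than raising:
print(funnel_rate(10, 0))     # -> None
```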
  4. Run all time-series models:

dbt run --models daily_revenue monthly_revenue weekly_web_activity

✅ CHECKPOINT


STEP 5.6: Create Cohort Analysis (1 hour)

Build customer cohort analysis by registration month.

Actions:

  1. Create models/gold/cohorts/customer_cohorts.sql:

{{ config(materialized='table', schema='gold') }}

with customers as (
    select
        customer_id,
        date_format(registration_date, 'yyyy-MM') as cohort_month,
        registration_date
    from {{ ref('dim_customers') }}
),

orders as (
    select
        customer_id,
        order_date,
        total,
        date_format(order_date, 'yyyy-MM') as order_month
    from {{ ref('fact_orders') }}
    where status = 'completed'
),

-- First order per customer
first_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        date_format(min(order_date), 'yyyy-MM') as first_order_month
    from orders
    group by customer_id
),

-- Calculate months since cohort for each order
cohort_orders as (
    select
        c.cohort_month,
        c.customer_id,
        o.order_month,
        o.total,
        months_between(
            to_date(o.order_month, 'yyyy-MM'),
            to_date(c.cohort_month, 'yyyy-MM')
        ) as months_since_registration
    from customers c
    join orders o on c.customer_id = o.customer_id
),

-- Aggregate by cohort and month
cohort_metrics as (
    select
        cohort_month,
        months_since_registration,
        count(distinct customer_id) as active_customers,
        count(*) as total_orders,
        round(sum(total), 2) as total_revenue,
        round(avg(total), 2) as avg_order_value
    from cohort_orders
    where months_since_registration >= 0
    group by cohort_month, months_since_registration
),

-- Cohort sizes
cohort_sizes as (
    select
        cohort_month,
        count(*) as cohort_size
    from customers
    group by cohort_month
),

final as (
    select
        cm.cohort_month,
        cs.cohort_size,
        cm.months_since_registration,
        cm.active_customers,
        round(cm.active_customers * 100.0 / cs.cohort_size, 2) as retention_rate,
        cm.total_orders,
        cm.total_revenue,
        cm.avg_order_value,
        round(cm.total_revenue / cs.cohort_size, 2) as revenue_per_cohort_customer,
        current_timestamp() as metric_updated_at
    from cohort_metrics cm
    join cohort_sizes cs on cm.cohort_month = cs.cohort_month
)

select * from final
order by cohort_month, months_since_registration
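Because both sides of the `months_between` call are month-start dates parsed from 'yyyy-MM' strings, the result is always a whole number of months. The same index computed directly from the strings (helper name illustrative; Spark's `months_between` returns a float, but on month starts it is integral):

```python
def months_between_ym(order_month, cohort_month):
    """Whole months between two 'yyyy-MM' strings, e.g. the cohort's month index."""
    oy, om = map(int, order_month.split('-'))
    cy, cm = map(int, cohort_month.split('-'))
    return (oy - cy) * 12 + (om - cm)

# An order in March 2024 from a December 2023 cohort is month 3:
print(months_between_ym('2024-03', '2023-12'))   # -> 3
# Orders in the registration month itself land in month 0:
print(months_between_ym('2023-12', '2023-12'))   # -> 0
```

A negative index would mean an order predates registration, which is why the model filters `months_since_registration >= 0`.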
  2. Create models/gold/cohorts/cohort_retention_summary.sql:

{{ config(materialized='table', schema='gold') }}

with cohorts as (
    select * from {{ ref('customer_cohorts') }}
),

-- Get month 0, 1, 3, 6, 12 retention rates for each cohort
retention_pivoted as (
    select
        cohort_month,
        cohort_size,
        max(case when months_since_registration = 0 then retention_rate end) as retention_month_0,
        max(case when months_since_registration = 1 then retention_rate end) as retention_month_1,
        max(case when months_since_registration = 3 then retention_rate end) as retention_month_3,
        max(case when months_since_registration = 6 then retention_rate end) as retention_month_6,
        max(case when months_since_registration = 12 then retention_rate end) as retention_month_12,
        sum(total_revenue) as cohort_lifetime_revenue,
        round(sum(total_revenue) / cohort_size, 2) as revenue_per_customer
    from cohorts
    group by cohort_month, cohort_size
)

select
    cohort_month,
    cohort_size,
    retention_month_0,
    retention_month_1,
    retention_month_3,
    retention_month_6,
    retention_month_12,
    cohort_lifetime_revenue,
    revenue_per_customer,
    current_timestamp() as metric_updated_at
from retention_pivoted
order by cohort_month desc
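The `max(case when ... then retention_rate end)` pattern is a manual pivot: each cohort has at most one row per month index, so `max` simply picks that row's rate (or null when the cohort has no data for that month yet). A Python sketch of the same pivot (names illustrative):

```python
def pivot_retention(rows, months=(0, 1, 3, 6, 12)):
    """Pivot (cohort_month, months_since_registration, retention_rate) rows into
    one dict per cohort keyed by month index, None where no data exists yet."""
    out = {}
    for cohort, month_idx, rate in rows:
        out.setdefault(cohort, {m: None for m in months})
        if month_idx in months:
            out[cohort][month_idx] = rate
    return out

rows = [('2023-01', 0, 100.0), ('2023-01', 1, 42.5), ('2023-01', 3, 18.0)]
print(pivot_retention(rows)['2023-01'][1])    # -> 42.5
print(pivot_retention(rows)['2023-01'][12])   # -> None (cohort too young)
```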
  3. Run cohort models:

dbt run --models customer_cohorts cohort_retention_summary

✅ CHECKPOINT


STEP 5.7: Create Executive Dashboard Aggregation (45 minutes)

Build a single executive summary table.

Actions:

  1. Create models/gold/metrics/executive_summary.sql:

{{ config(materialized='table', schema='gold') }}

with current_metrics as (

    -- Revenue metrics (last 30 days)
    select
        'Revenue - Last 30 Days' as metric_name,
        'Revenue' as metric_category,
        round(sum(gross_revenue), 2) as metric_value,
        'currency' as metric_type
    from {{ ref('daily_revenue') }}
    where order_date >= date_sub(current_date(), 30)

    union all

    -- Revenue growth (most recent month; the subquery keeps
    -- ORDER BY/LIMIT from applying to the whole UNION)
    select
        'Revenue Growth - MoM' as metric_name,
        'Revenue' as metric_category,
        revenue_mom_growth_pct as metric_value,
        'percentage' as metric_type
    from (
        select revenue_mom_growth_pct
        from {{ ref('monthly_revenue') }}
        order by order_month desc
        limit 1
    )

    union all

    -- Active customers
    select
        'Active Customers - Last 30 Days' as metric_name,
        'Customers' as metric_category,
        count(distinct customer_id) as metric_value,
        'count' as metric_type
    from {{ ref('fact_orders') }}
    where order_date >= date_sub(current_date(), 30)
      and status = 'completed'

    union all

    -- Total customers
    select
        'Total Customers' as metric_name,
        'Customers' as metric_category,
        count(*) as metric_value,
        'count' as metric_type
    from {{ ref('dim_customers') }}

    union all

    -- Average order value
    select
        'Average Order Value - Last 30 Days' as metric_name,
        'Orders' as metric_category,
        round(avg(total), 2) as metric_value,
        'currency' as metric_type
    from {{ ref('fact_orders') }}
    where order_date >= date_sub(current_date(), 30)
      and status = 'completed'

    union all

    -- Orders last 30 days
    select
        'Orders - Last 30 Days' as metric_name,
        'Orders' as metric_category,
        count(*) as metric_value,
        'count' as metric_type
    from {{ ref('fact_orders') }}
    where order_date >= date_sub(current_date(), 30)
      and status = 'completed'

    union all

    -- Conversion rate
    select
        'Conversion Rate - Last 30 Days' as metric_name,
        'Performance' as metric_category,
        round(
            count(distinct case when status = 'completed' then customer_id end) * 100.0
            / count(distinct customer_id),
        2) as metric_value,
        'percentage' as metric_type
    from {{ ref('fact_orders') }}
    where order_date >= date_sub(current_date(), 30)

    union all

    -- Average customer lifetime value
    select
        'Average Customer LTV' as metric_name,
        'Customers' as metric_category,
        round(avg(lifetime_revenue), 2) as metric_value,
        'currency' as metric_type
    from {{ ref('customer_metrics') }}
    where lifetime_orders > 0

    union all

    -- Active products
    select
        'Active Products' as metric_name,
        'Products' as metric_category,
        count(*) as metric_value,
        'count' as metric_type
    from {{ ref('dim_products') }}
    where is_active = true

    union all

    -- Web traffic
    select
        'Website Sessions - Last 30 Days' as metric_name,
        'Web Traffic' as metric_category,
        count(distinct session_id) as metric_value,
        'count' as metric_type
    from {{ ref('fact_web_events') }}
    where event_date >= date_sub(current_date(), 30)
)

select
    metric_category,
    metric_name,
    metric_value,
    metric_type,
    current_timestamp() as metric_updated_at
from current_metrics
order by metric_category, metric_name
  2. Create models/gold/metrics/top_performers.sql:

{{ config(materialized='table', schema='gold') }}

-- Top customers
with top_customers as (
    select
        'Top Customer' as category,
        full_name as name,
        lifetime_revenue as value,
        row_number() over (order by lifetime_revenue desc) as rank_order
    from {{ ref('customer_metrics') }}
    order by lifetime_revenue desc
    limit 10
),

-- Top products
top_products as (
    select
        'Top Product' as category,
        product_name as name,
        revenue_30d as value,
        row_number() over (order by revenue_30d desc) as rank_order
    from {{ ref('product_performance') }}
    order by revenue_30d desc
    limit 10
),

-- Top categories
top_categories as (
    select
        'Top Category' as category,
        category as name,
        revenue_30d as value,
        row_number() over (order by revenue_30d desc) as rank_order
    from {{ ref('category_performance') }}
    order by revenue_30d desc
    limit 5
),

combined as (
    select * from top_customers
    union all
    select * from top_products
    union all
    select * from top_categories
)

select
    category,
    rank_order,
    name,
    round(value, 2) as value,
    current_timestamp() as metric_updated_at
from combined
order by category, rank_order
  3. Run executive models:

dbt run --models executive_summary top_performers

✅ CHECKPOINT


STEP 5.8: Create Gold Layer Schema Documentation (30 minutes)

Actions:

  1. Create models/gold/schema.yml:

version: 2

models:
  # Metrics
  - name: customer_metrics
    description: "Comprehensive customer KPIs and scores"
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: product_performance
    description: "Product sales and engagement metrics"
    columns:
      - name: product_id
        tests:
          - unique
          - not_null
  - name: category_performance
    description: "Category-level aggregated metrics"
  - name: executive_summary
    description: "Key business metrics for executive dashboard"
  - name: top_performers
    description: "Top customers, products, and categories"

  # Aggregations
  - name: daily_revenue
    description: "Daily revenue and order metrics with rolling averages"
    columns:
      - name: order_date
        tests:
          - unique
          - not_null
  - name: monthly_revenue
    description: "Monthly revenue with growth calculations"
  - name: weekly_web_activity
    description: "Weekly web traffic and engagement metrics"

  # Cohorts
  - name: rfm_segments
    description: "Customer RFM segmentation analysis"
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: rfm_segment_summary
    description: "Aggregated statistics by RFM segment"
  - name: customer_cohorts
    description: "Customer retention by registration cohort"
  - name: cohort_retention_summary
    description: "Cohort retention rates over time"
  2. Generate documentation:

dbt docs generate
dbt docs serve

✅ CHECKPOINT


STEP 5.9: Create Validation Notebook (30 minutes)

Actions:

  1. Create Databricks notebook gold_validation:

# Databricks notebook source
# MAGIC %md
# MAGIC # Gold Layer Validation

# COMMAND ----------

print("=" * 70)
print("GOLD LAYER TABLES")
print("=" * 70)

tables = spark.sql("SHOW TABLES IN gold").collect()
for table in tables:
    count = spark.sql(f"SELECT COUNT(*) FROM gold.{table.tableName}").collect()[0][0]
    print(f"gold.{table.tableName:<30} {count:>12,} records")

print("=" * 70)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Executive Summary

# COMMAND ----------

print("\n📊 EXECUTIVE DASHBOARD METRICS\n")
spark.sql("""
    SELECT metric_category, metric_name, metric_value, metric_type
    FROM gold.executive_summary
    ORDER BY metric_category, metric_name
""").show(100, truncate=False)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Top Performers

# COMMAND ----------

print("\n🏆 TOP 10 CUSTOMERS BY REVENUE\n")
spark.sql("""
    SELECT name, value AS lifetime_revenue
    FROM gold.top_performers
    WHERE category = 'Top Customer'
    ORDER BY rank_order
""").show(10, truncate=False)

print("\n🏆 TOP 10 PRODUCTS (30 DAYS)\n")
spark.sql("""
    SELECT name, value AS revenue_30d
    FROM gold.top_performers
    WHERE category = 'Top Product'
    ORDER BY rank_order
""").show(10, truncate=False)

# COMMAND ----------

# MAGIC %md
# MAGIC ## RFM Segments

# COMMAND ----------

print("\n📈 RFM SEGMENT DISTRIBUTION\n")
spark.sql("""
    SELECT rfm_segment, customer_count, pct_of_customers,
           avg_monetary, total_revenue, pct_of_revenue
    FROM gold.rfm_segment_summary
    ORDER BY total_revenue DESC
""").show(truncate=False)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Revenue Trends

# COMMAND ----------

print("\n📊 MONTHLY REVENUE TRENDS (LAST 12 MONTHS)\n")
spark.sql("""
    SELECT order_month, total_orders, gross_revenue, avg_order_value,
           revenue_mom_growth_pct, revenue_yoy_growth_pct
    FROM gold.monthly_revenue
    ORDER BY order_month DESC
    LIMIT 12
""").show(truncate=False)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Cohort Analysis

# COMMAND ----------

print("\n👥 COHORT RETENTION SUMMARY\n")
spark.sql("""
    SELECT cohort_month, cohort_size, retention_month_1,
           retention_month_3, retention_month_6, revenue_per_customer
    FROM gold.cohort_retention_summary
    ORDER BY cohort_month DESC
    LIMIT 12
""").show(truncate=False)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Product Performance

# COMMAND ----------

print("\n📦 TOP CATEGORIES BY REVENUE (30 DAYS)\n")
spark.sql("""
    SELECT category, product_count, revenue_30d, profit_30d,
           avg_profit_margin, units_sold_30d
    FROM gold.category_performance
    ORDER BY revenue_30d DESC
""").show(truncate=False)

# COMMAND ----------

print("\n" + "=" * 70)
print("✅ GOLD LAYER VALIDATION COMPLETE")
print("=" * 70)
print("\nAll aggregations validated and ready for reporting!")
  2. Run the validation notebook
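The table loop at the top of the notebook is effectively an emptiness check on every gold table. The same pattern, sketched against a stand-in dict of counts instead of a live SparkSession (function name and sample table names are hypothetical):

```python
def validate_counts(table_counts, min_rows=1):
    """Return the names of any tables that came back with fewer than min_rows rows."""
    return [table for table, count in table_counts.items() if count < min_rows]

# Stand-in for the SHOW TABLES / COUNT(*) loop in the notebook:
counts = {"customer_metrics": 9500, "rfm_segments": 7200, "daily_revenue": 0}
print(validate_counts(counts))   # -> ['daily_revenue']
```

Wiring this into the notebook as an assertion (e.g. `assert not failures`) turns the printed report into a hard failure, which is useful once the validation runs on a schedule.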

✅ CHECKPOINT


STEP 5.10: Commit Phase 5 to Git (15 minutes)

Actions:

# Check status
git status

# Add all new files
git add dbt/models/gold/
git add databricks/notebook_exports/   # if exported

# Commit
git commit -m "Phase 5 complete: Gold layer analytics

- Created 12 Gold layer aggregation tables
- Built customer metrics with quality scores
- Implemented RFM segmentation analysis
- Created cohort retention analysis
- Built product performance metrics
- Generated time-series aggregations (daily, weekly, monthly)
- Created executive dashboard summaries
- All metrics validated and documented"

# Push to GitHub
git push origin main

✅ CHECKPOINT


PHASE 5 COMPLETE! 🎉

What You Built:

✅ Customer Analytics (3 models)

✅ Product Analytics (2 models)

✅ Time-Series Aggregations (3 models)

✅ Cohort Analysis (2 models)

✅ Executive Dashboards (2 models)

✅ Total: 12 Gold layer tables with 360,500+ aggregated records


Key Metrics Created

Customer Metrics:

  - Lifetime orders, revenue, and average order value
  - Days since last order and orders per month
  - Customer quality score (0-100) and RFM segment

Product Metrics:

  - 30-day and 90-day revenue, quantity, and daily velocity
  - View-to-purchase conversion and days of inventory
  - Product health score (0-100) and performance tier

Business KPIs:

  - Daily, weekly, and monthly revenue with 7-day and 30-day rolling averages
  - MoM and YoY growth rates
  - Cohort retention rates and revenue per cohort customer


What's Next: Phase 6

In Phase 6, you will:

Estimated Time: 4-6 hours over Days 15-16


Troubleshooting

Issue: dbt run takes too long
Solution: Increase Databricks cluster size or use incremental models

Issue: Aggregations don't match source data
Solution: Check join conditions and date filters, ensure completed orders only

Issue: RFM scores seem incorrect
Solution: Verify quartile calculations, check for null values in recency/frequency/monetary

Issue: Cohort retention rates > 100%
Solution: Check cohort_size calculation, ensure unique customer counts

Issue: Executive summary shows nulls
Solution: Verify all source tables have data for date ranges used


Best Practices Applied

  1. Aggregation layer separation - Gold is purely aggregated, no raw details
  2. Materialized as tables - Fast query performance for dashboards
  3. Rolling metrics - 7-day and 30-day averages for trend analysis
  4. Growth calculations - MoM and YoY comparisons
  5. Composite scores - Health and quality scores for quick assessment
  6. Executive-ready - Single query access to all key metrics
  7. Consistent naming - All metrics follow naming conventions

Usage Examples

Query executive summary:

SELECT * FROM gold.executive_summary ORDER BY metric_category;

Get RFM Champions:

SELECT *
FROM gold.rfm_segments
WHERE rfm_segment = 'Champions'
ORDER BY monetary DESC
LIMIT 100;

Monthly revenue trend:

SELECT order_month, gross_revenue, revenue_mom_growth_pct
FROM gold.monthly_revenue
ORDER BY order_month DESC
LIMIT 12;

Top products this month:

SELECT product_name, category, revenue_30d, qty_sold_30d
FROM gold.product_performance
WHERE revenue_30d > 0
ORDER BY revenue_30d DESC
LIMIT 20;

Resources


Phase 5 Manual Version 1.0
Last Updated: 2025-01-01