E-Commerce Analytics Platform

Phase 5: Gold Layer Analytics

Duration: Days 12-14 | 6-8 hours total
Goal: Create business-ready aggregations, KPIs, and analytics models


OVERVIEW

In Phase 5, you will:

  - Build customer and product metric tables with composite quality and health scores
  - Implement RFM (Recency, Frequency, Monetary) customer segmentation
  - Create cohort retention analysis by registration month
  - Generate daily, weekly, and monthly time-series aggregations
  - Build executive summary and top-performer tables

Gold Layer Philosophy: Business-ready, highly aggregated data optimized for reporting and dashboards. No joins required for end users.


PREREQUISITES

Before starting Phase 5:

  - Phase 4 complete: Silver dimensional models (dim_customers, dim_products, fact_orders, fact_order_items, fact_web_events) built and tested
  - dbt connected to your Databricks workspace


ARCHITECTURE: GOLD LAYER

Silver Tables (dimensional model)
        │
        ▼  dbt Gold models
Gold Tables (aggregated)
  [customer_metrics] [product_performance] [daily_revenue]
  [rfm_segments] [cohort_analysis]

STEP 5.1: Set Up Gold Layer Structure (15 minutes)

Actions:

  1. Create gold models directory:

cd dbt
mkdir -p models/gold/metrics
mkdir -p models/gold/aggregations
mkdir -p models/gold/cohorts
  2. Update dbt_project.yml with Gold configuration:

Add this to the models section:

models:
  ecommerce_analytics:
    # ... existing staging and silver config ...

    # Gold layer - business aggregations
    gold:
      +materialized: table
      +schema: gold
      metrics:
        +tags: ['gold', 'metrics']
      aggregations:
        +tags: ['gold', 'aggregations']
      cohorts:
        +tags: ['gold', 'cohorts']
  3. Create Gold database in Databricks:

In Databricks SQL Editor or notebook:

CREATE DATABASE IF NOT EXISTS gold;
USE gold;

✅ CHECKPOINT


STEP 5.2: Create Customer Metrics (1 hour)

Build comprehensive customer analytics.

Actions:

  1. Create models/gold/metrics/customer_metrics.sql:

{{ config(materialized='table', schema='gold') }}

with customers as (
    select * from {{ ref('dim_customers') }}
),

orders as (
    select * from {{ ref('fact_orders') }}
    where status = 'completed'
),

-- Customer order metrics
customer_orders as (
    select
        customer_id,
        count(*) as order_count,
        sum(total) as total_revenue,
        avg(total) as avg_order_value,
        min(order_date) as first_order_date,
        max(order_date) as last_order_date,
        datediff(max(order_date), min(order_date)) as customer_lifespan_days,
        sum(num_items) as total_items_purchased,
        avg(num_items) as avg_items_per_order
    from orders
    group by customer_id
),

-- Recent activity
recent_orders as (
    select
        customer_id,
        count(*) as orders_last_30_days
    from orders
    where order_date >= date_sub(current_date(), 30)
    group by customer_id
),

final as (
    select
        -- Customer info
        c.customer_id,
        c.full_name,
        c.email,
        c.country,
        c.state,
        c.segment,
        c.value_segment,
        c.lifecycle_stage,
        c.marketing_opt_in,

        -- Registration
        c.registration_date,
        c.days_since_registration,

        -- Order metrics
        coalesce(co.order_count, 0) as lifetime_orders,
        coalesce(co.total_revenue, 0) as lifetime_revenue,
        coalesce(co.avg_order_value, 0) as avg_order_value,
        coalesce(co.total_items_purchased, 0) as total_items_purchased,
        coalesce(co.avg_items_per_order, 0) as avg_items_per_order,

        -- Dates
        co.first_order_date,
        co.last_order_date,
        datediff(current_date(), co.last_order_date) as days_since_last_order,
        co.customer_lifespan_days,

        -- Recent activity
        coalesce(ro.orders_last_30_days, 0) as orders_last_30_days,

        -- Engagement metrics
        case
            when co.customer_lifespan_days > 0
                then round(co.order_count / (co.customer_lifespan_days / 30.0), 2)
            else 0
        end as avg_orders_per_month,
        case
            when co.order_count > 0
                then round(co.total_revenue / co.order_count, 2)
            else 0
        end as revenue_per_order,

        -- Customer quality score (0-100)
        least(100,
            (case when co.order_count >= 10 then 25 else co.order_count * 2.5 end)
            + (case when co.total_revenue >= 1000 then 25 else co.total_revenue / 40 end)
            + (case
                   when datediff(current_date(), co.last_order_date) <= 30 then 25
                   when datediff(current_date(), co.last_order_date) <= 90 then 15
                   else 5
               end)
            + (case when c.marketing_opt_in then 15 else 0 end)
            + (case when c.is_complete_profile then 10 else 0 end)
        ) as customer_quality_score,

        -- Timestamps
        current_timestamp() as metric_updated_at
    from customers c
    left join customer_orders co on c.customer_id = co.customer_id
    left join recent_orders ro on c.customer_id = ro.customer_id
)

select * from final
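The quality-score arithmetic in this model is easy to sanity-check outside SQL. A minimal Python sketch of the same capped composite scoring (the function name and standalone inputs are illustrative, not part of the dbt project):

```python
def customer_quality_score(order_count, total_revenue, days_since_last_order,
                           marketing_opt_in, is_complete_profile):
    """Mirror of the SQL CASE logic: composite score capped at 100."""
    score = 25 if order_count >= 10 else order_count * 2.5        # frequency: up to 25
    score += 25 if total_revenue >= 1000 else total_revenue / 40  # spend: up to 25
    if days_since_last_order <= 30:                               # recency: 25 / 15 / 5
        score += 25
    elif days_since_last_order <= 90:
        score += 15
    else:
        score += 5
    score += 15 if marketing_opt_in else 0
    score += 10 if is_complete_profile else 0
    return min(100, score)

# A frequent, high-spend, recently active customer with a full profile maxes out:
print(customer_quality_score(12, 1500, 10, True, True))   # -> 100
# A customer with no orders and no profile bottoms out at the recency floor:
print(customer_quality_score(0, 0, 365, False, False))    # -> 5
```

Note the `least(100, ...)` cap matters: the five components sum to a maximum of 100 only when each is capped, so the SQL and the sketch agree at the boundaries.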
  2. Create models/gold/metrics/customer_metrics.yml:

version: 2

models:
  - name: customer_metrics
    description: |
      Comprehensive customer metrics for analysis and segmentation.
      One row per customer with all key performance indicators.
    columns:
      - name: customer_id
        description: "Primary key"
        tests:
          - unique
          - not_null
      - name: lifetime_revenue
        description: "Total revenue from completed orders"
      - name: customer_quality_score
        description: |
          Composite score (0-100) based on:
          - Order frequency (25 points)
          - Total spend (25 points)
          - Recent activity (25 points)
          - Marketing opt-in (15 points)
          - Complete profile (10 points)
  3. Run the model:

dbt run --models customer_metrics

✅ CHECKPOINT


STEP 5.3: Create RFM Analysis (1 hour)

Implement Recency, Frequency, Monetary segmentation.

Actions:

  1. Create models/gold/cohorts/rfm_segments.sql:

{{ config(materialized='table', schema='gold') }}

with customer_rfm_values as (
    select
        customer_id,
        datediff(current_date(), last_order_date) as recency_days,
        lifetime_orders as frequency,
        lifetime_revenue as monetary
    from {{ ref('customer_metrics') }}
    where lifetime_orders > 0
),

-- Calculate quartiles for each RFM component
rfm_quartiles as (
    select
        percentile_approx(recency_days, 0.25) as recency_q1,
        percentile_approx(recency_days, 0.50) as recency_q2,
        percentile_approx(recency_days, 0.75) as recency_q3,
        percentile_approx(frequency, 0.25) as frequency_q1,
        percentile_approx(frequency, 0.50) as frequency_q2,
        percentile_approx(frequency, 0.75) as frequency_q3,
        percentile_approx(monetary, 0.25) as monetary_q1,
        percentile_approx(monetary, 0.50) as monetary_q2,
        percentile_approx(monetary, 0.75) as monetary_q3
    from customer_rfm_values
),

-- Assign RFM scores (1-4, where 4 is best)
rfm_scores as (
    select
        cv.customer_id,
        cv.recency_days,
        cv.frequency,
        cv.monetary,
        -- Recency score (fewer days = better = higher score)
        case
            when cv.recency_days <= q.recency_q1 then 4
            when cv.recency_days <= q.recency_q2 then 3
            when cv.recency_days <= q.recency_q3 then 2
            else 1
        end as recency_score,
        -- Frequency score (more orders = better = higher score)
        case
            when cv.frequency >= q.frequency_q3 then 4
            when cv.frequency >= q.frequency_q2 then 3
            when cv.frequency >= q.frequency_q1 then 2
            else 1
        end as frequency_score,
        -- Monetary score (more revenue = better = higher score)
        case
            when cv.monetary >= q.monetary_q3 then 4
            when cv.monetary >= q.monetary_q2 then 3
            when cv.monetary >= q.monetary_q1 then 2
            else 1
        end as monetary_score
    from customer_rfm_values cv
    cross join rfm_quartiles q
),

final as (
    select
        customer_id,
        recency_days,
        frequency,
        monetary,
        recency_score,
        frequency_score,
        monetary_score,
        -- Combined RFM score
        concat(
            cast(recency_score as string),
            cast(frequency_score as string),
            cast(monetary_score as string)
        ) as rfm_score,
        -- Overall RFM value (3-12 scale)
        recency_score + frequency_score + monetary_score as rfm_total,
        -- Segment classification
        case
            when recency_score >= 3 and frequency_score >= 3 and monetary_score >= 3 then 'Champions'
            when recency_score >= 3 and frequency_score <= 2 and monetary_score >= 3 then 'Loyal Customers'
            when recency_score >= 3 and frequency_score <= 2 and monetary_score <= 2 then 'Potential Loyalists'
            when recency_score >= 3 and frequency_score >= 3 and monetary_score <= 2 then 'Recent Customers'
            when recency_score <= 2 and frequency_score >= 3 and monetary_score >= 3 then 'At Risk'
            when recency_score <= 2 and frequency_score >= 2 and monetary_score >= 2 then 'Cant Lose Them'
            when recency_score <= 2 and frequency_score <= 2 and monetary_score >= 3 then 'Hibernating High Value'
            when recency_score <= 1 and frequency_score >= 2 and monetary_score <= 2 then 'About to Sleep'
            when recency_score <= 1 and frequency_score <= 1 and monetary_score <= 2 then 'Lost'
            else 'Other'
        end as rfm_segment,
        current_timestamp() as segment_updated_at
    from rfm_scores
)

select * from final
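The quartile-to-score mapping above is the heart of RFM: each component gets a 1-4 score from its quartile boundaries, with recency inverted because fewer days is better. A small Python sketch of that mapping (function name and sample boundaries are illustrative):

```python
def quartile_score(value, q1, q2, q3, higher_is_better=True):
    """Assign a 1-4 score from quartile boundaries, as in the SQL CASE blocks."""
    if higher_is_better:          # frequency, monetary
        if value >= q3:
            return 4
        if value >= q2:
            return 3
        if value >= q1:
            return 2
        return 1
    else:                         # recency: lower days = better
        if value <= q1:
            return 4
        if value <= q2:
            return 3
        if value <= q3:
            return 2
        return 1

# With frequency quartiles q1=2, q2=5, q3=9 orders, 7 orders lands in the third band:
print(quartile_score(7, 2, 5, 9))                                # -> 3
# With recency quartiles 30/60/120 days, a 10-day-old order is top band:
print(quartile_score(10, 30, 60, 120, higher_is_better=False))   # -> 4
```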
  2. Create models/gold/cohorts/rfm_segment_summary.sql:

{{ config(materialized='table', schema='gold') }}

with rfm as (
    select * from {{ ref('rfm_segments') }}
),

segment_stats as (
    select
        rfm_segment,
        count(*) as customer_count,
        round(avg(recency_days), 1) as avg_recency_days,
        round(avg(frequency), 1) as avg_frequency,
        round(avg(monetary), 2) as avg_monetary,
        round(sum(monetary), 2) as total_revenue,
        round(avg(rfm_total), 1) as avg_rfm_score
    from rfm
    group by rfm_segment
)

select
    rfm_segment,
    customer_count,
    round(customer_count * 100.0 / sum(customer_count) over (), 2) as pct_of_customers,
    avg_recency_days,
    avg_frequency,
    avg_monetary,
    total_revenue,
    round(total_revenue * 100.0 / sum(total_revenue) over (), 2) as pct_of_revenue,
    avg_rfm_score
from segment_stats
order by total_revenue desc
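The share-of-total columns (pct_of_customers, pct_of_revenue) use an unbounded `sum(...) over ()` window so each row can be divided by the grand total in one pass. The same arithmetic in a quick Python sketch (helper name illustrative):

```python
def pct_of_total(values):
    """Per-row equivalent of round(value * 100.0 / sum(value) over (), 2)."""
    total = sum(values)
    return [round(v * 100.0 / total, 2) for v in values]

# Three segments with revenue 300, 100, 100 -> shares of the 500 total:
print(pct_of_total([300, 100, 100]))   # -> [60.0, 20.0, 20.0]
```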
  3. Run RFM models:

dbt run --models rfm_segments rfm_segment_summary
  4. View results in Databricks:

SELECT * FROM gold.rfm_segment_summary
ORDER BY total_revenue DESC;

✅ CHECKPOINT


STEP 5.4: Create Product Performance Metrics (1 hour)

Actions:

  1. Create models/gold/metrics/product_performance.sql:

{{ config(materialized='table', schema='gold') }}

with products as (
    select * from {{ ref('dim_products') }}
),

-- Sales metrics by time period
sales_30d as (
    select
        oi.product_id,
        sum(oi.quantity) as qty_sold_30d,
        sum(oi.total_price) as revenue_30d,
        count(distinct oi.order_id) as orders_30d
    from {{ ref('fact_order_items') }} oi
    join {{ ref('fact_orders') }} o on oi.order_id = o.order_id
    where o.order_date >= date_sub(current_date(), 30)
      and o.status = 'completed'
    group by oi.product_id
),

sales_90d as (
    select
        oi.product_id,
        sum(oi.quantity) as qty_sold_90d,
        sum(oi.total_price) as revenue_90d
    from {{ ref('fact_order_items') }} oi
    join {{ ref('fact_orders') }} o on oi.order_id = o.order_id
    where o.order_date >= date_sub(current_date(), 90)
      and o.status = 'completed'
    group by oi.product_id
),

-- Web engagement metrics
product_engagement as (
    select
        product_id,
        count(*) as total_views,
        count(distinct session_id) as unique_sessions,
        count(distinct customer_id) as unique_viewers
    from {{ ref('fact_web_events') }}
    where event_type = 'product_view'
      and product_id is not null
      and event_date >= date_sub(current_date(), 30)
    group by product_id
),

final as (
    select
        -- Product info
        p.product_id,
        p.product_name,
        p.category,
        p.subcategory,
        p.price,
        p.cost,
        p.profit_margin_pct,
        p.stock_quantity,
        p.stock_status,
        p.is_active,

        -- Lifetime metrics
        p.total_quantity_sold as lifetime_quantity_sold,
        p.total_revenue as lifetime_revenue,
        p.times_ordered as lifetime_orders,

        -- 30-day metrics
        coalesce(s30.qty_sold_30d, 0) as qty_sold_30d,
        coalesce(s30.revenue_30d, 0) as revenue_30d,
        coalesce(s30.orders_30d, 0) as orders_30d,

        -- 90-day metrics
        coalesce(s90.qty_sold_90d, 0) as qty_sold_90d,
        coalesce(s90.revenue_90d, 0) as revenue_90d,

        -- Web metrics (30 days)
        coalesce(pe.total_views, 0) as views_30d,
        coalesce(pe.unique_sessions, 0) as sessions_30d,
        coalesce(pe.unique_viewers, 0) as viewers_30d,

        -- Conversion metrics
        case
            when coalesce(pe.total_views, 0) > 0
                then round(coalesce(s30.orders_30d, 0) * 100.0 / pe.total_views, 2)
            else 0
        end as view_to_purchase_rate_30d,

        -- Velocity metrics (units per day)
        round(coalesce(s30.qty_sold_30d, 0) / 30.0, 2) as daily_velocity_30d,
        round(coalesce(s90.qty_sold_90d, 0) / 90.0, 2) as daily_velocity_90d,

        -- Profit metrics
        round(coalesce(s30.revenue_30d, 0) * (p.profit_margin_pct / 100), 2) as profit_30d,
        round(p.total_revenue * (p.profit_margin_pct / 100), 2) as lifetime_profit,

        -- Days of inventory remaining
        case
            when coalesce(s30.qty_sold_30d, 0) > 0
                then round(p.stock_quantity / (s30.qty_sold_30d / 30.0), 0)
            else 999
        end as days_of_inventory,

        -- Product health score (0-100)
        least(100,
            (case when coalesce(s30.orders_30d, 0) >= 10 then 30 else coalesce(s30.orders_30d, 0) * 3 end)
            + (case
                   when p.stock_quantity >= 100 then 20
                   when p.stock_quantity >= 50 then 15
                   when p.stock_quantity > 0 then 10
                   else 0
               end)
            + (case
                   when p.profit_margin_pct >= 50 then 25
                   when p.profit_margin_pct >= 30 then 20
                   when p.profit_margin_pct >= 20 then 15
                   else 10
               end)
            + (case when coalesce(pe.total_views, 0) >= 100 then 15 else coalesce(pe.total_views, 0) * 0.15 end)
            + (case when p.is_active then 10 else 0 end)
        ) as product_health_score,

        -- Performance tier
        case
            when coalesce(s30.revenue_30d, 0) >= 1000 then 'Platinum'
            when coalesce(s30.revenue_30d, 0) >= 500 then 'Gold'
            when coalesce(s30.revenue_30d, 0) >= 100 then 'Silver'
            when coalesce(s30.revenue_30d, 0) > 0 then 'Bronze'
            else 'Inactive'
        end as performance_tier_30d,

        current_timestamp() as metric_updated_at
    from products p
    left join sales_30d s30 on p.product_id = s30.product_id
    left join sales_90d s90 on p.product_id = s90.product_id
    left join product_engagement pe on p.product_id = pe.product_id
)

select * from final
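The days-of-inventory column divides current stock by the 30-day daily velocity, with 999 as a sentinel for products that have not sold recently (avoiding a divide-by-zero). A short Python sketch of the same rule (function name illustrative):

```python
def days_of_inventory(stock_quantity, qty_sold_30d):
    """Stock divided by daily velocity over the last 30 days; 999 when no recent sales."""
    if qty_sold_30d <= 0:
        return 999                                  # sentinel, mirrors the SQL ELSE branch
    daily_velocity = qty_sold_30d / 30.0            # units per day
    return round(stock_quantity / daily_velocity)

# 120 units in stock, 60 sold in the last 30 days (2/day) -> 60 days of cover:
print(days_of_inventory(120, 60))   # -> 60
# No recent sales -> sentinel:
print(days_of_inventory(50, 0))     # -> 999
```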
  2. Create models/gold/metrics/category_performance.sql:

{{ config(materialized='table', schema='gold') }}

select
    category,
    count(*) as product_count,
    sum(case when is_active then 1 else 0 end) as active_products,

    -- Revenue metrics
    round(sum(revenue_30d), 2) as revenue_30d,
    round(sum(revenue_90d), 2) as revenue_90d,
    round(sum(lifetime_revenue), 2) as lifetime_revenue,

    -- Profit metrics
    round(sum(profit_30d), 2) as profit_30d,
    round(sum(lifetime_profit), 2) as lifetime_profit,
    round(avg(profit_margin_pct), 2) as avg_profit_margin,

    -- Sales metrics
    sum(qty_sold_30d) as units_sold_30d,
    sum(qty_sold_90d) as units_sold_90d,
    sum(lifetime_quantity_sold) as lifetime_units_sold,

    -- Engagement metrics
    sum(views_30d) as total_views_30d,
    round(avg(view_to_purchase_rate_30d), 2) as avg_conversion_rate,

    -- Inventory
    sum(stock_quantity) as total_inventory,
    round(avg(days_of_inventory), 0) as avg_days_of_inventory,

    -- Performance
    round(avg(product_health_score), 1) as avg_health_score,

    current_timestamp() as metric_updated_at
from {{ ref('product_performance') }}
group by category
order by revenue_30d desc
  3. Run product models:

dbt run --models product_performance category_performance

✅ CHECKPOINT


STEP 5.5: Create Time-Series Aggregations (1 hour)

Actions:

  1. Create models/gold/aggregations/daily_revenue.sql:

{{ config(materialized='table', schema='gold') }}

with orders as (
    select * from {{ ref('fact_orders') }}
    where status = 'completed'
),

daily_metrics as (
    select
        order_date,

        -- Order metrics
        count(distinct order_id) as total_orders,
        count(distinct customer_id) as unique_customers,

        -- Revenue metrics
        round(sum(total), 2) as gross_revenue,
        round(sum(subtotal), 2) as subtotal,
        round(sum(discount), 2) as total_discounts,
        round(sum(shipping_cost), 2) as shipping_revenue,
        round(sum(tax), 2) as tax_collected,

        -- Average metrics
        round(avg(total), 2) as avg_order_value,
        round(avg(num_items), 2) as avg_items_per_order,
        round(avg(discount_pct), 2) as avg_discount_pct,

        -- Order composition
        sum(num_items) as total_items_sold,
        sum(total_quantity) as total_units_sold,

        -- Payment methods
        sum(case when payment_method = 'credit_card' then 1 else 0 end) as orders_credit_card,
        sum(case when payment_method = 'debit_card' then 1 else 0 end) as orders_debit_card,
        sum(case when payment_method = 'paypal' then 1 else 0 end) as orders_paypal,
        sum(case when payment_method = 'apple_pay' then 1 else 0 end) as orders_apple_pay,

        -- Order sizes
        sum(case when order_size = 'Small' then 1 else 0 end) as orders_small,
        sum(case when order_size = 'Medium' then 1 else 0 end) as orders_medium,
        sum(case when order_size = 'Large' then 1 else 0 end) as orders_large,
        sum(case when order_size = 'Extra Large' then 1 else 0 end) as orders_xl,

        -- Flags
        sum(case when has_discount then 1 else 0 end) as orders_with_discount,
        sum(case when free_shipping then 1 else 0 end) as orders_free_shipping,
        sum(case when is_weekend then 1 else 0 end) as orders_weekend
    from orders
    group by order_date
),

-- Add rolling averages
with_rolling as (
    select
        *,
        -- 7-day rolling averages
        round(avg(gross_revenue) over (
            order by order_date rows between 6 preceding and current row
        ), 2) as revenue_7d_avg,
        round(avg(total_orders) over (
            order by order_date rows between 6 preceding and current row
        ), 1) as orders_7d_avg,
        round(avg(avg_order_value) over (
            order by order_date rows between 6 preceding and current row
        ), 2) as aov_7d_avg,
        -- 30-day rolling averages
        round(avg(gross_revenue) over (
            order by order_date rows between 29 preceding and current row
        ), 2) as revenue_30d_avg,
        round(avg(total_orders) over (
            order by order_date rows between 29 preceding and current row
        ), 1) as orders_30d_avg
    from daily_metrics
)

select
    *,
    current_timestamp() as metric_updated_at
from with_rolling
order by order_date desc
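The window frames above (`rows between 6 preceding and current row`) average the current day with up to six prior days; early rows simply average over fewer days. The same behavior sketched in Python (helper name illustrative):

```python
def rolling_avg(values, window):
    """Equivalent of AVG(...) OVER (ORDER BY d ROWS BETWEEN window-1 PRECEDING AND CURRENT ROW)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]  # shorter frame at the start of the series
        out.append(round(sum(chunk) / len(chunk), 2))
    return out

# Four days of revenue with a 3-day frame:
print(rolling_avg([100, 200, 300, 400], 3))   # -> [100.0, 150.0, 200.0, 300.0]
```

This is why a 7-day rolling average never produces nulls at the start of the series, unlike a lag-based calculation.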
  2. Create models/gold/aggregations/monthly_revenue.sql:

{{ config(materialized='table', schema='gold') }}

with orders as (
    select * from {{ ref('fact_orders') }}
    where status = 'completed'
),

monthly_metrics as (
    select
        date_format(order_date, 'yyyy-MM') as order_month,
        year(order_date) as order_year,
        month(order_date) as order_month_num,

        -- Order metrics
        count(distinct order_id) as total_orders,
        count(distinct customer_id) as unique_customers,
        round(count(distinct order_id) * 1.0 / count(distinct customer_id), 2) as orders_per_customer,

        -- Revenue metrics
        round(sum(total), 2) as gross_revenue,
        round(sum(discount), 2) as total_discounts,
        round(avg(total), 2) as avg_order_value,

        -- Items
        sum(num_items) as total_items,
        round(avg(num_items), 2) as avg_items_per_order,

        -- Customer segments
        count(distinct case when shipping_country = 'USA' then customer_id end) as customers_usa,
        count(distinct case when shipping_country != 'USA' then customer_id end) as customers_international
    from orders
    group by date_format(order_date, 'yyyy-MM'), year(order_date), month(order_date)
),

with_growth as (
    select
        *,
        -- Month-over-month growth
        round(
            (gross_revenue - lag(gross_revenue) over (order by order_month))
            / lag(gross_revenue) over (order by order_month) * 100,
        2) as revenue_mom_growth_pct,
        round(
            (total_orders - lag(total_orders) over (order by order_month)) * 1.0
            / lag(total_orders) over (order by order_month) * 100,
        2) as orders_mom_growth_pct,
        -- Year-over-year growth
        round(
            (gross_revenue - lag(gross_revenue, 12) over (order by order_month))
            / lag(gross_revenue, 12) over (order by order_month) * 100,
        2) as revenue_yoy_growth_pct
    from monthly_metrics
)

select
    *,
    current_timestamp() as metric_updated_at
from with_growth
order by order_month desc
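The growth columns divide the change from `lag(...)` by the prior value; the first month has no lag, so its growth comes back null. A small sketch of the same calculation (illustrative helper, not project code):

```python
def mom_growth_pct(monthly_revenue):
    """Per-month equivalent of (x - lag(x)) / lag(x) * 100; None where there is no prior month."""
    out = [None]  # the first month has no lag -> null in SQL, None here
    for prev, cur in zip(monthly_revenue, monthly_revenue[1:]):
        out.append(round((cur - prev) / prev * 100, 2))
    return out

# Revenue 1000 -> 1100 -> 990 gives +10% then -10%:
print(mom_growth_pct([1000, 1100, 990]))   # -> [None, 10.0, -10.0]
```

The YoY column is the same formula with a lag of 12, so the first twelve months of history are null by construction.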
  3. Create models/gold/aggregations/weekly_web_activity.sql:

{{ config(materialized='table', schema='gold') }}

with events as (
    select * from {{ ref('fact_web_events') }}
),

weekly_metrics as (
    select
        date_trunc('week', event_date) as week_start_date,

        -- Event counts
        count(*) as total_events,
        count(distinct session_id) as unique_sessions,
        count(distinct customer_id) as unique_authenticated_users,

        -- Event types
        sum(case when event_type = 'page_view' then 1 else 0 end) as page_views,
        sum(case when event_type = 'product_view' then 1 else 0 end) as product_views,
        sum(case when event_type = 'add_to_cart' then 1 else 0 end) as add_to_carts,
        sum(case when event_type = 'search' then 1 else 0 end) as searches,
        sum(case when event_type = 'wishlist_add' then 1 else 0 end) as wishlist_adds,

        -- Devices
        sum(case when device_type = 'desktop' then 1 else 0 end) as events_desktop,
        sum(case when device_type = 'mobile' then 1 else 0 end) as events_mobile,
        sum(case when device_type = 'tablet' then 1 else 0 end) as events_tablet,

        -- User types
        sum(case when user_type = 'Authenticated' then 1 else 0 end) as events_authenticated,
        sum(case when user_type = 'Anonymous' then 1 else 0 end) as events_anonymous,

        -- Engagement
        round(count(*) * 1.0 / count(distinct session_id), 2) as avg_events_per_session
    from events
    group by date_trunc('week', event_date)
)

select
    *,
    -- Conversion funnel metrics
    round(add_to_carts * 100.0 / nullif(product_views, 0), 2) as product_to_cart_rate,
    round(searches * 100.0 / nullif(page_views, 0), 2) as search_rate,
    -- Device mix
    round(events_mobile * 100.0 / total_events, 2) as mobile_pct,
    round(events_desktop * 100.0 / total_events, 2) as desktop_pct,
    current_timestamp() as metric_updated_at
from weekly_metrics
order by week_start_date desc
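The funnel rates guard against empty weeks with `nullif(y, 0)`, which turns a zero denominator into null so the division yields null instead of erroring. The same guard in Python (helper name illustrative):

```python
def funnel_rate(numerator, denominator):
    """Equivalent of round(x * 100.0 / nullif(y, 0), 2): None instead of dividing by zero."""
    if denominator == 0:
        return None   # nullif(y, 0) makes the whole expression null in SQL
    return round(numerator * 100.0 / denominator, 2)

# 25 add-to-carts out of 500 product views:
print(funnel_rate(25, 500))   # -> 5.0
# A week with no product views stays null rather than raising:
print(funnel_rate(10, 0))     # -> None
```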
  4. Run all time-series models:

dbt run --models daily_revenue monthly_revenue weekly_web_activity

✅ CHECKPOINT


STEP 5.6: Create Cohort Analysis (1 hour)

Build customer cohort analysis by registration month.

Actions:

  1. Create models/gold/cohorts/customer_cohorts.sql:

{{ config(materialized='table', schema='gold') }}

with customers as (
    select
        customer_id,
        date_format(registration_date, 'yyyy-MM') as cohort_month,
        registration_date
    from {{ ref('dim_customers') }}
),

orders as (
    select
        customer_id,
        order_date,
        total,
        date_format(order_date, 'yyyy-MM') as order_month
    from {{ ref('fact_orders') }}
    where status = 'completed'
),

-- First order per customer
first_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        date_format(min(order_date), 'yyyy-MM') as first_order_month
    from orders
    group by customer_id
),

-- Calculate months since cohort for each order
cohort_orders as (
    select
        c.cohort_month,
        c.customer_id,
        o.order_month,
        o.total,
        months_between(
            to_date(o.order_month, 'yyyy-MM'),
            to_date(c.cohort_month, 'yyyy-MM')
        ) as months_since_registration
    from customers c
    join orders o on c.customer_id = o.customer_id
),

-- Aggregate by cohort and month
cohort_metrics as (
    select
        cohort_month,
        months_since_registration,
        count(distinct customer_id) as active_customers,
        count(*) as total_orders,
        round(sum(total), 2) as total_revenue,
        round(avg(total), 2) as avg_order_value
    from cohort_orders
    where months_since_registration >= 0
    group by cohort_month, months_since_registration
),

-- Cohort sizes
cohort_sizes as (
    select
        cohort_month,
        count(*) as cohort_size
    from customers
    group by cohort_month
),

final as (
    select
        cm.cohort_month,
        cs.cohort_size,
        cm.months_since_registration,
        cm.active_customers,
        round(cm.active_customers * 100.0 / cs.cohort_size, 2) as retention_rate,
        cm.total_orders,
        cm.total_revenue,
        cm.avg_order_value,
        round(cm.total_revenue / cs.cohort_size, 2) as revenue_per_cohort_customer,
        current_timestamp() as metric_updated_at
    from cohort_metrics cm
    join cohort_sizes cs on cm.cohort_month = cs.cohort_month
)

select * from final
order by cohort_month, months_since_registration
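Because both sides of the `months_between` call are month-start dates parsed from 'yyyy-MM' strings, the result is always a whole number of months. The same index computed directly from the strings (helper name illustrative; Spark's `months_between` returns a float, but on month starts it is integral):

```python
def months_between_ym(order_month, cohort_month):
    """Whole months between two 'yyyy-MM' strings, e.g. the cohort's month index."""
    oy, om = map(int, order_month.split('-'))
    cy, cm = map(int, cohort_month.split('-'))
    return (oy - cy) * 12 + (om - cm)

# An order in March 2024 from a December 2023 cohort is month 3:
print(months_between_ym('2024-03', '2023-12'))   # -> 3
# Orders in the registration month itself land in month 0:
print(months_between_ym('2023-12', '2023-12'))   # -> 0
```

A negative index would mean an order predates registration, which is why the model filters `months_since_registration >= 0`.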
  2. Create models/gold/cohorts/cohort_retention_summary.sql:

{{ config(materialized='table', schema='gold') }}

with cohorts as (
    select * from {{ ref('customer_cohorts') }}
),

-- Get month 0, 1, 3, 6, 12 retention rates for each cohort
retention_pivoted as (
    select
        cohort_month,
        cohort_size,
        max(case when months_since_registration = 0 then retention_rate end) as retention_month_0,
        max(case when months_since_registration = 1 then retention_rate end) as retention_month_1,
        max(case when months_since_registration = 3 then retention_rate end) as retention_month_3,
        max(case when months_since_registration = 6 then retention_rate end) as retention_month_6,
        max(case when months_since_registration = 12 then retention_rate end) as retention_month_12,
        sum(total_revenue) as cohort_lifetime_revenue,
        round(sum(total_revenue) / cohort_size, 2) as revenue_per_customer
    from cohorts
    group by cohort_month, cohort_size
)

select
    cohort_month,
    cohort_size,
    retention_month_0,
    retention_month_1,
    retention_month_3,
    retention_month_6,
    retention_month_12,
    cohort_lifetime_revenue,
    revenue_per_customer,
    current_timestamp() as metric_updated_at
from retention_pivoted
order by cohort_month desc
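The `max(case when ... then retention_rate end)` pattern is a manual pivot: each cohort has at most one row per month index, so `max` simply picks that row's rate (or null when the cohort has no data for that month yet). A Python sketch of the same pivot (names illustrative):

```python
def pivot_retention(rows, months=(0, 1, 3, 6, 12)):
    """Pivot (cohort_month, months_since_registration, retention_rate) rows into
    one dict per cohort keyed by month index, None where no data exists yet."""
    out = {}
    for cohort, month_idx, rate in rows:
        out.setdefault(cohort, {m: None for m in months})
        if month_idx in months:
            out[cohort][month_idx] = rate
    return out

rows = [('2023-01', 0, 100.0), ('2023-01', 1, 42.5), ('2023-01', 3, 18.0)]
print(pivot_retention(rows)['2023-01'][1])    # -> 42.5
print(pivot_retention(rows)['2023-01'][12])   # -> None (cohort too young)
```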
  3. Run cohort models:

dbt run --models customer_cohorts cohort_retention_summary

✅ CHECKPOINT


STEP 5.7: Create Executive Dashboard Aggregation (45 minutes)

Build a single executive summary table.

Actions:

  1. Create models/gold/metrics/executive_summary.sql:

{{ config(materialized='table', schema='gold') }}

with current_metrics as (

    -- Revenue metrics (last 30 days)
    select
        'Revenue - Last 30 Days' as metric_name,
        'Revenue' as metric_category,
        round(sum(gross_revenue), 2) as metric_value,
        'currency' as metric_type
    from {{ ref('daily_revenue') }}
    where order_date >= date_sub(current_date(), 30)

    union all

    -- Revenue growth (most recent month; the subquery keeps
    -- ORDER BY/LIMIT from applying to the whole UNION)
    select
        'Revenue Growth - MoM' as metric_name,
        'Revenue' as metric_category,
        revenue_mom_growth_pct as metric_value,
        'percentage' as metric_type
    from (
        select revenue_mom_growth_pct
        from {{ ref('monthly_revenue') }}
        order by order_month desc
        limit 1
    )

    union all

    -- Active customers
    select
        'Active Customers - Last 30 Days' as metric_name,
        'Customers' as metric_category,
        count(distinct customer_id) as metric_value,
        'count' as metric_type
    from {{ ref('fact_orders') }}
    where order_date >= date_sub(current_date(), 30)
      and status = 'completed'

    union all

    -- Total customers
    select
        'Total Customers' as metric_name,
        'Customers' as metric_category,
        count(*) as metric_value,
        'count' as metric_type
    from {{ ref('dim_customers') }}

    union all

    -- Average order value
    select
        'Average Order Value - Last 30 Days' as metric_name,
        'Orders' as metric_category,
        round(avg(total), 2) as metric_value,
        'currency' as metric_type
    from {{ ref('fact_orders') }}
    where order_date >= date_sub(current_date(), 30)
      and status = 'completed'

    union all

    -- Orders last 30 days
    select
        'Orders - Last 30 Days' as metric_name,
        'Orders' as metric_category,
        count(*) as metric_value,
        'count' as metric_type
    from {{ ref('fact_orders') }}
    where order_date >= date_sub(current_date(), 30)
      and status = 'completed'

    union all

    -- Conversion rate
    select
        'Conversion Rate - Last 30 Days' as metric_name,
        'Performance' as metric_category,
        round(
            count(distinct case when status = 'completed' then customer_id end) * 100.0
            / count(distinct customer_id),
        2) as metric_value,
        'percentage' as metric_type
    from {{ ref('fact_orders') }}
    where order_date >= date_sub(current_date(), 30)

    union all

    -- Average customer lifetime value
    select
        'Average Customer LTV' as metric_name,
        'Customers' as metric_category,
        round(avg(lifetime_revenue), 2) as metric_value,
        'currency' as metric_type
    from {{ ref('customer_metrics') }}
    where lifetime_orders > 0

    union all

    -- Active products
    select
        'Active Products' as metric_name,
        'Products' as metric_category,
        count(*) as metric_value,
        'count' as metric_type
    from {{ ref('dim_products') }}
    where is_active = true

    union all

    -- Web traffic
    select
        'Website Sessions - Last 30 Days' as metric_name,
        'Web Traffic' as metric_category,
        count(distinct session_id) as metric_value,
        'count' as metric_type
    from {{ ref('fact_web_events') }}
    where event_date >= date_sub(current_date(), 30)
)

select
    metric_category,
    metric_name,
    metric_value,
    metric_type,
    current_timestamp() as metric_updated_at
from current_metrics
order by metric_category, metric_name
  2. Create models/gold/metrics/top_performers.sql:

{{ config(materialized='table', schema='gold') }}

-- Top customers
with top_customers as (
    select
        'Top Customer' as category,
        full_name as name,
        lifetime_revenue as value,
        row_number() over (order by lifetime_revenue desc) as rank_order
    from {{ ref('customer_metrics') }}
    order by lifetime_revenue desc
    limit 10
),

-- Top products
top_products as (
    select
        'Top Product' as category,
        product_name as name,
        revenue_30d as value,
        row_number() over (order by revenue_30d desc) as rank_order
    from {{ ref('product_performance') }}
    order by revenue_30d desc
    limit 10
),

-- Top categories
top_categories as (
    select
        'Top Category' as category,
        category as name,
        revenue_30d as value,
        row_number() over (order by revenue_30d desc) as rank_order
    from {{ ref('category_performance') }}
    order by revenue_30d desc
    limit 5
),

combined as (
    select * from top_customers
    union all
    select * from top_products
    union all
    select * from top_categories
)

select
    category,
    rank_order,
    name,
    round(value, 2) as value,
    current_timestamp() as metric_updated_at
from combined
order by category, rank_order
  3. Run executive models:

dbt run --models executive_summary top_performers

✅ CHECKPOINT


STEP 5.8: Create Gold Layer Schema Documentation (30 minutes)

Actions:

  1. Create models/gold/schema.yml:

version: 2

models:
  # Metrics
  - name: customer_metrics
    description: "Comprehensive customer KPIs and scores"
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: product_performance
    description: "Product sales and engagement metrics"
    columns:
      - name: product_id
        tests:
          - unique
          - not_null
  - name: category_performance
    description: "Category-level aggregated metrics"
  - name: executive_summary
    description: "Key business metrics for executive dashboard"
  - name: top_performers
    description: "Top customers, products, and categories"

  # Aggregations
  - name: daily_revenue
    description: "Daily revenue and order metrics with rolling averages"
    columns:
      - name: order_date
        tests:
          - unique
          - not_null
  - name: monthly_revenue
    description: "Monthly revenue with growth calculations"
  - name: weekly_web_activity
    description: "Weekly web traffic and engagement metrics"

  # Cohorts
  - name: rfm_segments
    description: "Customer RFM segmentation analysis"
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: rfm_segment_summary
    description: "Aggregated statistics by RFM segment"
  - name: customer_cohorts
    description: "Customer retention by registration cohort"
  - name: cohort_retention_summary
    description: "Cohort retention rates over time"
  2. Generate documentation:

dbt docs generate
dbt docs serve

✅ CHECKPOINT


STEP 5.9: Create Validation Notebook (30 minutes)

Actions:

  1. Create Databricks notebook gold_validation:

# Databricks notebook source
# MAGIC %md
# MAGIC # Gold Layer Validation

# COMMAND ----------

print("=" * 70)
print("GOLD LAYER TABLES")
print("=" * 70)

tables = spark.sql("SHOW TABLES IN gold").collect()
for table in tables:
    count = spark.sql(f"SELECT COUNT(*) FROM gold.{table.tableName}").collect()[0][0]
    print(f"gold.{table.tableName:<30} {count:>12,} records")

print("=" * 70)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Executive Summary

# COMMAND ----------

print("\n📊 EXECUTIVE DASHBOARD METRICS\n")
spark.sql("""
    SELECT metric_category, metric_name, metric_value, metric_type
    FROM gold.executive_summary
    ORDER BY metric_category, metric_name
""").show(100, truncate=False)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Top Performers

# COMMAND ----------

print("\n🏆 TOP 10 CUSTOMERS BY REVENUE\n")
spark.sql("""
    SELECT name, value AS lifetime_revenue
    FROM gold.top_performers
    WHERE category = 'Top Customer'
    ORDER BY rank_order
""").show(10, truncate=False)

print("\n🏆 TOP 10 PRODUCTS (30 DAYS)\n")
spark.sql("""
    SELECT name, value AS revenue_30d
    FROM gold.top_performers
    WHERE category = 'Top Product'
    ORDER BY rank_order
""").show(10, truncate=False)

# COMMAND ----------

# MAGIC %md
# MAGIC ## RFM Segments

# COMMAND ----------

print("\n📈 RFM SEGMENT DISTRIBUTION\n")
spark.sql("""
    SELECT rfm_segment, customer_count, pct_of_customers,
           avg_monetary, total_revenue, pct_of_revenue
    FROM gold.rfm_segment_summary
    ORDER BY total_revenue DESC
""").show(truncate=False)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Revenue Trends

# COMMAND ----------

print("\n📊 MONTHLY REVENUE TRENDS (LAST 12 MONTHS)\n")
spark.sql("""
    SELECT order_month, total_orders, gross_revenue, avg_order_value,
           revenue_mom_growth_pct, revenue_yoy_growth_pct
    FROM gold.monthly_revenue
    ORDER BY order_month DESC
    LIMIT 12
""").show(truncate=False)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Cohort Analysis

# COMMAND ----------

print("\n👥 COHORT RETENTION SUMMARY\n")
spark.sql("""
    SELECT cohort_month, cohort_size, retention_month_1,
           retention_month_3, retention_month_6, revenue_per_customer
    FROM gold.cohort_retention_summary
    ORDER BY cohort_month DESC
    LIMIT 12
""").show(truncate=False)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Product Performance

# COMMAND ----------

print("\n📦 TOP CATEGORIES BY REVENUE (30 DAYS)\n")
spark.sql("""
    SELECT category, product_count, revenue_30d, profit_30d,
           avg_profit_margin, units_sold_30d
    FROM gold.category_performance
    ORDER BY revenue_30d DESC
""").show(truncate=False)

# COMMAND ----------

print("\n" + "=" * 70)
print("✅ GOLD LAYER VALIDATION COMPLETE")
print("=" * 70)
print("\nAll aggregations validated and ready for reporting!")
  2. Run the validation notebook
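The table loop at the top of the notebook is effectively an emptiness check on every gold table. The same pattern, sketched against a stand-in dict of counts instead of a live SparkSession (function name and sample table names are hypothetical):

```python
def validate_counts(table_counts, min_rows=1):
    """Return the names of any tables that came back with fewer than min_rows rows."""
    return [table for table, count in table_counts.items() if count < min_rows]

# Stand-in for the SHOW TABLES / COUNT(*) loop in the notebook:
counts = {"customer_metrics": 9500, "rfm_segments": 7200, "daily_revenue": 0}
print(validate_counts(counts))   # -> ['daily_revenue']
```

Wiring this into the notebook as an assertion (e.g. `assert not failures`) turns the printed report into a hard failure, which is useful once the validation runs on a schedule.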

✅ CHECKPOINT


STEP 5.10: Commit Phase 5 to Git (15 minutes)

Actions:

# Check status
git status

# Add all new files
git add dbt/models/gold/
git add databricks/notebook_exports/   # if exported

# Commit
git commit -m "Phase 5 complete: Gold layer analytics

- Created 12 Gold layer aggregation tables
- Built customer metrics with quality scores
- Implemented RFM segmentation analysis
- Created cohort retention analysis
- Built product performance metrics
- Generated time-series aggregations (daily, weekly, monthly)
- Created executive dashboard summaries
- All metrics validated and documented"

# Push to GitHub
git push origin main

✅ CHECKPOINT


PHASE 5 COMPLETE! 🎉

What You Built:

✅ Customer Analytics (3 models)

✅ Product Analytics (2 models)

✅ Time-Series Aggregations (3 models)

✅ Cohort Analysis (2 models)

✅ Executive Dashboards (2 models)

✅ Total: 12 Gold layer tables with 360,500+ aggregated records


Key Metrics Created

Customer Metrics:

  - Lifetime orders, revenue, and average order value
  - Days since last order and orders per month
  - Customer quality score (0-100) and RFM segment

Product Metrics:

  - 30-day and 90-day revenue, quantity, and daily velocity
  - View-to-purchase conversion and days of inventory
  - Product health score (0-100) and performance tier

Business KPIs:

  - Daily, weekly, and monthly revenue with 7-day and 30-day rolling averages
  - MoM and YoY growth rates
  - Cohort retention rates and revenue per cohort customer


What's Next: Phase 6

In Phase 6, you will:

Estimated Time: 4-6 hours over Days 15-16


Troubleshooting

Issue: dbt run takes too long
Solution: Increase Databricks cluster size or use incremental models

Issue: Aggregations don't match source data
Solution: Check join conditions and date filters, ensure completed orders only

Issue: RFM scores seem incorrect
Solution: Verify quartile calculations, check for null values in recency/frequency/monetary

Issue: Cohort retention rates > 100%
Solution: Check cohort_size calculation, ensure unique customer counts

Issue: Executive summary shows nulls
Solution: Verify all source tables have data for date ranges used


Best Practices Applied

  1. Aggregation layer separation - Gold is purely aggregated, no raw details
  2. Materialized as tables - Fast query performance for dashboards
  3. Rolling metrics - 7-day and 30-day averages for trend analysis
  4. Growth calculations - MoM and YoY comparisons
  5. Composite scores - Health and quality scores for quick assessment
  6. Executive-ready - Single query access to all key metrics
  7. Consistent naming - All metrics follow naming conventions

Usage Examples

Query executive summary:

SELECT * FROM gold.executive_summary ORDER BY metric_category;

Get RFM Champions:

SELECT *
FROM gold.rfm_segments
WHERE rfm_segment = 'Champions'
ORDER BY monetary DESC
LIMIT 100;

Monthly revenue trend:

SELECT order_month, gross_revenue, revenue_mom_growth_pct
FROM gold.monthly_revenue
ORDER BY order_month DESC
LIMIT 12;

Top products this month:

SELECT product_name, category, revenue_30d, qty_sold_30d
FROM gold.product_performance
WHERE revenue_30d > 0
ORDER BY revenue_30d DESC
LIMIT 20;

Resources


Phase 5 Manual Version 1.0
Last Updated: 2025-01-01