E-Commerce Analytics Platform

Phase 7: ML & Advanced Analytics

Duration: Days 17-19 | 6-8 hours total
Goal: Build and deploy machine learning models with MLflow tracking

OVERVIEW

In Phase 7, you will:

Set up MLflow for experiment tracking
Build customer segmentation model (K-Means)
Create churn prediction model (Random Forest)
Implement product recommendation system
Deploy models to MLflow Model Registry
Create model monitoring dashboard
Generate predictions for Gold layer

ML Philosophy: Start simple, iterate quickly, track everything, deploy confidently.

PREREQUISITES

Before starting Phase 7:

✅ Phase 6 completed (Data Quality framework)
✅ Gold layer tables with customer/product metrics
✅ Databricks cluster running
✅ Python environment with scikit-learn installed

ARCHITECTURE: ML PIPELINE


  
  Copy to clipboard
Gold Layer Data
      ↓
Feature Engineering
      ↓
Model Training → [MLflow Tracking]
      ↓
Model Evaluation
      ↓
Model Registry → [MLflow Registry]
      ↓
Batch Predictions → Gold Layer

STEP 7.1: Set Up MLflow (30 minutes)

Actions:

Verify MLflow installation:


  
  Copy to clipboard
source venv/bin/activate
pip install mlflow==2.6.0
mlflow --version

Create MLflow directory structure:


  
  Copy to clipboard
mkdir -p mlflow/{models,experiments,artifacts}
mkdir -p databricks/notebooks/06_ml_models

Configure MLflow in Databricks:

Create Databricks notebook 06_ml_models/mlflow_setup:


  
  Copy to clipboard
# Databricks notebook source
# MAGIC %md
# MAGIC # MLflow Setup and Configuration

# COMMAND ----------

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

# COMMAND ----------

# Set MLflow experiment
mlflow.set_experiment("/Users/your_email@domain.com/ecommerce_ml")

# Get experiment info
experiment = mlflow.get_experiment_by_name("/Users/your_email@domain.com/ecommerce_ml")
print(f"Experiment ID: {experiment.experiment_id}")
print(f"Artifact Location: {experiment.artifact_location}")

# COMMAND ----------

# Test MLflow tracking
with mlflow.start_run(run_name="mlflow_test"):
    mlflow.log_param("test_param", "hello")
    mlflow.log_metric("test_metric", 1.0)
    print("✅ MLflow tracking working!")

# COMMAND ----------

print("=" * 70)
print("✅ MLFLOW SETUP COMPLETE")
print("=" * 70)
print(f"\nExperiment: {experiment.name}")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Registry URI: {mlflow.get_registry_uri()}")

Run the setup notebook

✅ CHECKPOINT

MLflow installed and configured
Experiment created
Tracking verified

STEP 7.2: Build Customer Segmentation Model (2 hours)

Create K-Means clustering for customer segmentation.

Actions:

Create segmentation notebook:

Create Databricks notebook 06_ml_models/customer_segmentation:


  
  Copy to clipboard
# Databricks notebook source
# MAGIC %md
# MAGIC # Customer Segmentation with K-Means Clustering
# MAGIC 
# MAGIC Creates customer segments based on RFM and behavioral metrics

# COMMAND ----------

import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score
import matplotlib.pyplot as plt
import seaborn as sns

# Set experiment
mlflow.set_experiment("/Users/your_email@domain.com/ecommerce_ml")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Load and Prepare Data

# COMMAND ----------

# Load customer metrics
customers_df = spark.sql("""
    SELECT 
        customer_id,
        lifetime_orders,
        lifetime_revenue,
        avg_order_value,
        days_since_last_order,
        days_since_registration,
        orders_last_30_days,
        customer_quality_score
    FROM gold.customer_metrics
    WHERE lifetime_orders > 0
""").toPandas()

print(f"Loaded {len(customers_df):,} customers")
display(customers_df.head())

# COMMAND ----------

# Feature engineering
features = customers_df[[
    'lifetime_orders',
    'lifetime_revenue', 
    'avg_order_value',
    'days_since_last_order',
    'days_since_registration',
    'orders_last_30_days',
    'customer_quality_score'
]].copy()

# Handle any missing values
features = features.fillna(0)

print(f"Feature shape: {features.shape}")
print("\nFeature statistics:")
display(features.describe())

# COMMAND ----------

# MAGIC %md
# MAGIC ## Scale Features

# COMMAND ----------

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

print(f"Scaled features shape: {features_scaled.shape}")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Determine Optimal Number of Clusters

# COMMAND ----------

# Elbow method
inertias = []
silhouette_scores = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(features_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(features_scaled, kmeans.labels_))

# Plot elbow curve
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(k_range, inertias, marker='o')
ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('Inertia')
ax1.set_title('Elbow Method')
ax1.grid(True, alpha=0.3)

ax2.plot(k_range, silhouette_scores, marker='o', color='orange')
ax2.set_xlabel('Number of Clusters (k)')
ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette Analysis')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
display(fig)

# Choose optimal k (let's use 5)
optimal_k = 5
print(f"\n✅ Optimal number of clusters: {optimal_k}")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Train Final Model with MLflow Tracking

# COMMAND ----------

with mlflow.start_run(run_name="customer_segmentation_kmeans") as run:
    
    # Log parameters
    mlflow.log_param("n_clusters", optimal_k)
    mlflow.log_param("random_state", 42)
    mlflow.log_param("features", features.columns.tolist())
    mlflow.log_param("n_customers", len(customers_df))
    
    # Train model
    kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(features_scaled)
    
    # Add cluster labels to dataframe
    customers_df['cluster'] = cluster_labels
    
    # Calculate metrics
    silhouette = silhouette_score(features_scaled, cluster_labels)
    davies_bouldin = davies_bouldin_score(features_scaled, cluster_labels)
    
    # Log metrics
    mlflow.log_metric("silhouette_score", silhouette)
    mlflow.log_metric("davies_bouldin_score", davies_bouldin)
    mlflow.log_metric("inertia", kmeans.inertia_)
    
    # Log model
    mlflow.sklearn.log_model(
        kmeans, 
        "model",
        registered_model_name="customer_segmentation"
    )
    
    # Log scaler
    mlflow.sklearn.log_model(scaler, "scaler")
    
    print(f"✅ Model trained and logged to MLflow")
    print(f"   Run ID: {run.info.run_id}")
    print(f"   Silhouette Score: {silhouette:.3f}")
    print(f"   Davies-Bouldin Score: {davies_bouldin:.3f}")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Analyze Clusters

# COMMAND ----------

# Cluster statistics
cluster_stats = customers_df.groupby('cluster').agg({
    'customer_id': 'count',
    'lifetime_orders': 'mean',
    'lifetime_revenue': 'mean',
    'avg_order_value': 'mean',
    'days_since_last_order': 'mean',
    'customer_quality_score': 'mean'
}).round(2)

cluster_stats.columns = [
    'customer_count', 
    'avg_lifetime_orders',
    'avg_lifetime_revenue', 
    'avg_order_value',
    'avg_days_since_last_order',
    'avg_quality_score'
]

print("\n📊 Cluster Statistics:")
display(cluster_stats)

# COMMAND ----------

# Assign business names to clusters
cluster_names = {
    0: 'High Value Active',
    1: 'Medium Value Regular',
    2: 'Low Value Occasional',
    3: 'At Risk',
    4: 'New Customers'
}

# This mapping should be adjusted based on actual cluster characteristics
# Review cluster_stats above and assign meaningful names

customers_df['segment_name'] = customers_df['cluster'].map(cluster_names)

print("\n🏷️  Cluster Distribution:")
display(customers_df['segment_name'].value_counts())

# COMMAND ----------

# Visualize clusters (2D projection using first 2 features)
fig, ax = plt.subplots(figsize=(12, 8))

for cluster in range(optimal_k):
    cluster_data = customers_df[customers_df['cluster'] == cluster]
    ax.scatter(
        cluster_data['lifetime_revenue'],
        cluster_data['lifetime_orders'],
        label=f"Cluster {cluster}: {cluster_names.get(cluster, 'Unknown')}",
        alpha=0.6,
        s=50
    )

ax.set_xlabel('Lifetime Revenue ($)')
ax.set_ylabel('Lifetime Orders')
ax.set_title('Customer Segments (K-Means Clustering)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
display(fig)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Save Predictions to Gold Layer

# COMMAND ----------

# Convert to Spark DataFrame
customers_segments_spark = spark.createDataFrame(
    customers_df[['customer_id', 'cluster', 'segment_name']]
)

# Write to Delta table
customers_segments_spark.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("gold.customer_segments")

print("✅ Customer segments saved to gold.customer_segments")

# COMMAND ----------

# Verify
display(spark.sql("""
    SELECT 
        segment_name,
        COUNT(*) as customer_count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
    FROM gold.customer_segments
    GROUP BY segment_name
    ORDER BY customer_count DESC
"""))

# COMMAND ----------

print("=" * 70)
print("✅ CUSTOMER SEGMENTATION COMPLETE")
print("=" * 70)
print(f"Model: K-Means with {optimal_k} clusters")
print(f"Silhouette Score: {silhouette:.3f}")
print(f"Customers Segmented: {len(customers_df):,}")
print(f"Results saved to: gold.customer_segments")

Run the segmentation notebook

✅ CHECKPOINT

Customer segmentation model trained
Model tracked in MLflow
Segments saved to Gold layer
5 distinct customer segments identified

STEP 7.3: Build Churn Prediction Model (2 hours)

Create Random Forest model to predict customer churn.

Actions:

Create churn prediction notebook:

Create Databricks notebook 06_ml_models/churn_prediction:


  
  Copy to clipboard
# Databricks notebook source
# MAGIC %md
# MAGIC # Customer Churn Prediction
# MAGIC 
# MAGIC Predicts which customers are likely to churn using Random Forest

# COMMAND ----------

import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, 
    f1_score, roc_auc_score, confusion_matrix, 
    classification_report
)
import matplotlib.pyplot as plt
import seaborn as sns

mlflow.set_experiment("/Users/your_email@domain.com/ecommerce_ml")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Define Churn and Load Data

# COMMAND ----------

# Load customer data
customers_df = spark.sql("""
    SELECT 
        customer_id,
        lifetime_orders,
        lifetime_revenue,
        avg_order_value,
        days_since_last_order,
        days_since_registration,
        orders_last_30_days,
        customer_quality_score,
        segment,
        value_segment,
        CASE 
            WHEN days_since_last_order > 90 THEN 1
            ELSE 0
        END as churned
    FROM gold.customer_metrics
    WHERE lifetime_orders > 0
""").toPandas()

print(f"Total customers: {len(customers_df):,}")
print(f"Churned: {customers_df['churned'].sum():,} ({customers_df['churned'].mean()*100:.1f}%)")
print(f"Active: {(1-customers_df['churned']).sum():,} ({(1-customers_df['churned'].mean())*100:.1f}%)")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Feature Engineering

# COMMAND ----------

# Create features
feature_cols = [
    'lifetime_orders',
    'lifetime_revenue',
    'avg_order_value',
    'days_since_last_order',
    'days_since_registration',
    'orders_last_30_days',
    'customer_quality_score'
]

# Add encoded categorical features
customers_df['is_premium'] = (customers_df['segment'] == 'Premium').astype(int)
customers_df['is_high_value'] = customers_df['value_segment'].isin(['High Value', 'VIP']).astype(int)

feature_cols.extend(['is_premium', 'is_high_value'])

X = customers_df[feature_cols].fillna(0)
y = customers_df['churned']

print(f"Features shape: {X.shape}")
print(f"Target distribution:")
print(y.value_counts())

# COMMAND ----------

# MAGIC %md
# MAGIC ## Train-Test Split

# COMMAND ----------

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train):,} samples")
print(f"Test set: {len(X_test):,} samples")
print(f"\nChurn rate in training: {y_train.mean()*100:.1f}%")
print(f"Churn rate in test: {y_test.mean()*100:.1f}%")

# COMMAND ----------

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Train Random Forest Model

# COMMAND ----------

with mlflow.start_run(run_name="churn_prediction_rf") as run:
    
    # Model parameters
    params = {
        'n_estimators': 100,
        'max_depth': 10,
        'min_samples_split': 50,
        'min_samples_leaf': 20,
        'random_state': 42,
        'class_weight': 'balanced'  # Handle class imbalance
    }
    
    # Log parameters
    mlflow.log_params(params)
    mlflow.log_param("features", feature_cols)
    mlflow.log_param("train_size", len(X_train))
    mlflow.log_param("test_size", len(X_test))
    
    # Train model
    rf_model = RandomForestClassifier(**params)
    rf_model.fit(X_train_scaled, y_train)
    
    # Predictions
    y_pred = rf_model.predict(X_test_scaled)
    y_pred_proba = rf_model.predict_proba(X_test_scaled)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)
    mlflow.log_metric("roc_auc", roc_auc)
    
    # Log model
    mlflow.sklearn.log_model(
        rf_model,
        "model",
        registered_model_name="churn_prediction"
    )
    mlflow.sklearn.log_model(scaler, "scaler")
    
    print("=" * 70)
    print("MODEL PERFORMANCE")
    print("=" * 70)
    print(f"Accuracy:  {accuracy:.3f}")
    print(f"Precision: {precision:.3f}")
    print(f"Recall:    {recall:.3f}")
    print(f"F1 Score:  {f1:.3f}")
    print(f"ROC-AUC:   {roc_auc:.3f}")
    print(f"\n✅ Model logged to MLflow")
    print(f"   Run ID: {run.info.run_id}")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Model Evaluation

# COMMAND ----------

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix')
ax.set_xticklabels(['Active', 'Churned'])
ax.set_yticklabels(['Active', 'Churned'])
display(fig)

# COMMAND ----------

# Classification Report
print("Classification Report:")
print(classification_report(
    y_test, 
    y_pred,
    target_names=['Active', 'Churned']
))

# COMMAND ----------

# Feature Importance
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature', ax=ax)
ax.set_title('Feature Importance for Churn Prediction')
ax.set_xlabel('Importance')
display(fig)

print("\nTop 5 Features:")
display(feature_importance.head())

# COMMAND ----------

# MAGIC %md
# MAGIC ## Generate Predictions for All Customers

# COMMAND ----------

# Predict churn probability for all customers
X_all = customers_df[feature_cols].fillna(0)
X_all_scaled = scaler.transform(X_all)

customers_df['churn_probability'] = rf_model.predict_proba(X_all_scaled)[:, 1]
customers_df['churn_prediction'] = rf_model.predict(X_all_scaled)
customers_df['churn_risk'] = pd.cut(
    customers_df['churn_probability'],
    bins=[0, 0.3, 0.6, 1.0],
    labels=['Low', 'Medium', 'High']
)

print("\nChurn Risk Distribution:")
print(customers_df['churn_risk'].value_counts())

# COMMAND ----------

# MAGIC %md
# MAGIC ## Save Predictions to Gold Layer

# COMMAND ----------

# Prepare results
churn_predictions = customers_df[[
    'customer_id',
    'churn_probability',
    'churn_prediction',
    'churn_risk'
]]

# Convert to Spark DataFrame
churn_predictions_spark = spark.createDataFrame(churn_predictions)

# Write to Delta
churn_predictions_spark.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("gold.customer_churn_predictions")

print("✅ Churn predictions saved to gold.customer_churn_predictions")

# COMMAND ----------

# Show high-risk customers
display(spark.sql("""
    SELECT 
        cm.customer_id,
        cm.full_name,
        cm.email,
        cm.lifetime_revenue,
        cm.days_since_last_order,
        cp.churn_probability,
        cp.churn_risk
    FROM gold.customer_metrics cm
    JOIN gold.customer_churn_predictions cp ON cm.customer_id = cp.customer_id
    WHERE cp.churn_risk = 'High'
        AND cm.lifetime_revenue > 500
    ORDER BY cp.churn_probability DESC
    LIMIT 20
"""))

# COMMAND ----------

print("=" * 70)
print("✅ CHURN PREDICTION COMPLETE")
print("=" * 70)
print(f"Model: Random Forest Classifier")
print(f"ROC-AUC Score: {roc_auc:.3f}")
print(f"Predictions saved to: gold.customer_churn_predictions")
print(f"High-risk customers identified: {(customers_df['churn_risk']=='High').sum():,}")

Run the churn prediction notebook

✅ CHECKPOINT

Churn prediction model trained (ROC-AUC > 0.80)
Model tracked in MLflow
Predictions saved to Gold layer
High-risk customers identified

STEP 7.4: Create Product Recommendation System (1.5 hours)

Build collaborative filtering for product recommendations.

Actions:

Create recommendation notebook:

Create Databricks notebook 06_ml_models/product_recommendations:


  
  Copy to clipboard
# Databricks notebook source
# MAGIC %md
# MAGIC # Product Recommendation System
# MAGIC 
# MAGIC Creates product recommendations using collaborative filtering

# COMMAND ----------

import mlflow
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix

mlflow.set_experiment("/Users/your_email@domain.com/ecommerce_ml")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Load Order History

# COMMAND ----------

# Get customer-product interactions
interactions_df = spark.sql("""
    SELECT 
        o.customer_id,
        oi.product_id,
        COUNT(*) as purchase_count,
        SUM(oi.total_price) as total_spent
    FROM silver.fact_order_items oi
    JOIN silver.fact_orders o ON oi.order_id = o.order_id
    WHERE o.status = 'completed'
    GROUP BY o.customer_id, oi.product_id
""").toPandas()

print(f"Total interactions: {len(interactions_df):,}")
print(f"Unique customers: {interactions_df['customer_id'].nunique():,}")
print(f"Unique products: {interactions_df['product_id'].nunique():,}")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Create User-Item Matrix

# COMMAND ----------

# Create pivot table (user-item matrix)
user_item_matrix = interactions_df.pivot_table(
    index='customer_id',
    columns='product_id',
    values='purchase_count',
    fill_value=0
)

print(f"Matrix shape: {user_item_matrix.shape}")
print(f"Sparsity: {(user_item_matrix == 0).sum().sum() / user_item_matrix.size * 100:.1f}%")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Calculate Item-Item Similarity

# COMMAND ----------

with mlflow.start_run(run_name="product_recommendations_cf") as run:
    
    mlflow.log_param("method", "collaborative_filtering")
    mlflow.log_param("similarity", "cosine")
    mlflow.log_param("n_customers", len(user_item_matrix))
    mlflow.log_param("n_products", user_item_matrix.shape[1])
    
    # Calculate item-item similarity
    item_similarity = cosine_similarity(user_item_matrix.T)
    item_similarity_df = pd.DataFrame(
        item_similarity,
        index=user_item_matrix.columns,
        columns=user_item_matrix.columns
    )
    
    mlflow.log_metric("avg_similarity", item_similarity_df.values.mean())
    
    print(f"✅ Similarity matrix calculated")
    print(f"   Average similarity: {item_similarity_df.values.mean():.3f}")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Generate Recommendations

# COMMAND ----------

def get_recommendations(product_id, n=5):
    """Get top N similar products"""
    if product_id not in item_similarity_df.columns:
        return []
    
    similar_products = item_similarity_df[product_id].sort_values(ascending=False)[1:n+1]
    return similar_products.index.tolist()

# Test recommendations
test_product = user_item_matrix.columns[0]
recommendations = get_recommendations(test_product, n=5)

print(f"\nRecommendations for {test_product}:")
for i, rec in enumerate(recommendations, 1):
    similarity = item_similarity_df.loc[rec, test_product]
    print(f"  {i}. {rec} (similarity: {similarity:.3f})")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Create Recommendations Table

# COMMAND ----------

# Generate recommendations for all products
all_recommendations = []

for product_id in user_item_matrix.columns[:100]:  # Top 100 products
    recs = get_recommendations(product_id, n=5)
    for rank, rec_product in enumerate(recs, 1):
        all_recommendations.append({
            'product_id': product_id,
            'recommended_product_id': rec_product,
            'rank': rank,
            'similarity_score': round(item_similarity_df.loc[rec_product, product_id], 3)
        })

recommendations_df = pd.DataFrame(all_recommendations)

print(f"✅ Generated {len(recommendations_df):,} product recommendations")
display(recommendations_df.head(20))

# COMMAND ----------

# MAGIC %md
# MAGIC ## Save to Gold Layer

# COMMAND ----------

# Convert to Spark DataFrame
recommendations_spark = spark.createDataFrame(recommendations_df)

# Add product details
recommendations_with_details = spark.sql("""
    SELECT 
        r.product_id,
        p1.product_name,
        p1.category,
        r.recommended_product_id,
        p2.product_name as recommended_product_name,
        p2.category as recommended_category,
        r.rank,
        r.similarity_score
    FROM recommendations_temp r
    JOIN silver.dim_products p1 ON r.product_id = p1.product_id
    JOIN silver.dim_products p2 ON r.recommended_product_id = p2.product_id
""")

# First create temp view
recommendations_spark.createOrReplaceTempView("recommendations_temp")

# Then join and write
spark.sql("""
    SELECT 
        r.product_id,
        p1.product_name,
        p1.category,
        r.recommended_product_id,
        p2.product_name as recommended_product_name,
        p2.category as recommended_category,
        r.rank,
        r.similarity_score
    FROM recommendations_temp r
    JOIN silver.dim_products p1 ON r.product_id = p1.product_id
    JOIN silver.dim_products p2 ON r.recommended_product_id = p2.product_id
""").write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("gold.product_recommendations")

print("✅ Recommendations saved to gold.product_recommendations")

# COMMAND ----------

# View sample recommendations
display(spark.sql("""
    SELECT *
    FROM gold.product_recommendations
    WHERE product_id IN (
        SELECT product_id 
        FROM gold.product_performance 
        ORDER BY revenue_30d DESC 
        LIMIT 5
    )
    ORDER BY product_id, rank
"""))

# COMMAND ----------

print("=" * 70)
print("✅ PRODUCT RECOMMENDATIONS COMPLETE")
print("=" * 70)
print(f"Method: Collaborative Filtering (Item-Item)")
print(f"Recommendations Generated: {len(recommendations_df):,}")
print(f"Results saved to: gold.product_recommendations")

Run the recommendations notebook

✅ CHECKPOINT

Product recommendation system built
Collaborative filtering implemented
Recommendations saved to Gold layer
Top 5 similar products for each item

STEP 7.5: Create ML Model Dashboard (1 hour)

Build dashboard to monitor model performance.

Actions:

Create ML dashboard notebook:

Create Databricks notebook 06_ml_models/ml_dashboard:


  
  Copy to clipboard
# Databricks notebook source
# MAGIC %md
# MAGIC # ML Model Dashboard

# COMMAND ----------

import mlflow
from mlflow.tracking import MlflowClient
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

client = MlflowClient()

# COMMAND ----------

# MAGIC %md
# MAGIC ## Model Registry Overview

# COMMAND ----------

# Get all registered models
registered_models = client.search_registered_models()

print("=" * 70)
print("REGISTERED MODELS")
print("=" * 70)

for rm in registered_models:
    print(f"\n📦 Model: {rm.name}")
    print(f"   Description: {rm.description}")
    
    # Get latest versions
    versions = client.search_model_versions(f"name='{rm.name}'")
    print(f"   Total Versions: {len(versions)}")
    
    for version in versions[:3]:  # Show latest 3
        print(f"   - Version {version.version}: {version.current_stage}")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Experiment Runs Summary

# COMMAND ----------

# Get experiment
experiment = mlflow.get_experiment_by_name("/Users/your_email@domain.com/ecommerce_ml")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

print(f"Total Runs: {len(runs)}")
print("\nRecent Runs:")
display(runs[['run_id', 'start_time', 'status', 'tags.mlflow.runName']].head(10))

# COMMAND ----------

# MAGIC %md
# MAGIC ## Model Performance Comparison

# COMMAND ----------

# Get metrics for all runs
metrics_df = runs[[
    'tags.mlflow.runName',
    'metrics.silhouette_score',
    'metrics.roc_auc',
    'metrics.accuracy',
    'metrics.f1_score'
]].copy()

metrics_df = metrics_df.dropna(how='all', subset=[
    'metrics.silhouette_score', 
    'metrics.roc_auc',
    'metrics.accuracy'
])

print("Model Performance Summary:")
display(metrics_df.head(10))

# COMMAND ----------

# Plot model comparison
if len(metrics_df) > 0:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Segmentation models
    seg_data = metrics_df[metrics_df['metrics.silhouette_score'].notna()]
    if len(seg_data) > 0:
        axes[0].bar(range(len(seg_data)), seg_data['metrics.silhouette_score'])
        axes[0].set_title('Segmentation Models - Silhouette Score')
        axes[0].set_ylabel('Silhouette Score')
        axes[0].set_xlabel('Run')
        axes[0].grid(True, alpha=0.3)
    
    # Churn models
    churn_data = metrics_df[metrics_df['metrics.roc_auc'].notna()]
    if len(churn_data) > 0:
        axes[1].bar(range(len(churn_data)), churn_data['metrics.roc_auc'], color='orange')
        axes[1].set_title('Churn Models - ROC-AUC Score')
        axes[1].set_ylabel('ROC-AUC')
        axes[1].set_xlabel('Run')
        axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    display(fig)

# COMMAND ----------

# MAGIC %md
# MAGIC ## Prediction Coverage

# COMMAND ----------

# Customer segmentation coverage
segments_df = spark.sql("""
    SELECT 
        segment_name,
        COUNT(*) as customer_count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
    FROM gold.customer_segments
    GROUP BY segment_name
    ORDER BY customer_count DESC
""").toPandas()

print("\n📊 Customer Segmentation Coverage:")
display(segments_df)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(segments_df['segment_name'], segments_df['customer_count'])
ax.set_title('Customer Distribution by Segment')
ax.set_xlabel('Segment')
ax.set_ylabel('Number of Customers')
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
display(fig)

# COMMAND ----------

# Churn prediction coverage
churn_df = spark.sql("""
    SELECT 
        churn_risk,
        COUNT(*) as customer_count,
        ROUND(AVG(churn_probability), 3) as avg_probability
    FROM gold.customer_churn_predictions
    GROUP BY churn_risk
    ORDER BY 
        CASE churn_risk
            WHEN 'High' THEN 1
            WHEN 'Medium' THEN 2
            WHEN 'Low' THEN 3
        END
""").toPandas()

print("\n⚠️  Churn Risk Distribution:")
display(churn_df)

# COMMAND ----------

# Product recommendations coverage
rec_count = spark.sql("""
    SELECT COUNT(DISTINCT product_id) as products_with_recommendations
    FROM gold.product_recommendations
""").collect()[0][0]

total_products = spark.sql("""
    SELECT COUNT(*) as total_active_products
    FROM silver.dim_products
    WHERE is_active = true
""").collect()[0][0]

print(f"\n🎯 Recommendation Coverage:")
print(f"   Products with recommendations: {rec_count:,}")
print(f"   Total active products: {total_products:,}")
print(f"   Coverage: {rec_count/total_products*100:.1f}%")

# COMMAND ----------

# MAGIC %md
# MAGIC ## Business Impact Analysis

# COMMAND ----------

# High-value customers at risk
high_risk_revenue = spark.sql("""
    SELECT 
        COUNT(DISTINCT cm.customer_id) as high_risk_customers,
        ROUND(SUM(cm.lifetime_revenue), 2) as revenue_at_risk,
        ROUND(AVG(cm.lifetime_revenue), 2) as avg_customer_value
    FROM gold.customer_metrics cm
    JOIN gold.customer_churn_predictions cp ON cm.customer_id = cp.customer_id
    WHERE cp.churn_risk = 'High'
        AND cm.lifetime_revenue > 500
""").toPandas()

print("\n💰 Revenue at Risk (High-value customers with high churn risk):")
display(high_risk_revenue)

# COMMAND ----------

# Segment value analysis
segment_value = spark.sql("""
    SELECT 
        cs.segment_name,
        COUNT(DISTINCT cm.customer_id) as customers,
        ROUND(SUM(cm.lifetime_revenue), 2) as total_revenue,
        ROUND(AVG(cm.lifetime_revenue), 2) as avg_revenue_per_customer
    FROM gold.customer_segments cs
    JOIN gold.customer_metrics cm ON cs.customer_id = cm.customer_id
    GROUP BY cs.segment_name
    ORDER BY total_revenue DESC
""").toPandas()

print("\n📊 Revenue by Customer Segment:")
display(segment_value)

# COMMAND ----------

print("\n" + "=" * 70)
print("✅ ML MODEL DASHBOARD COMPLETE")
print("=" * 70)

Run the ML dashboard notebook

✅ CHECKPOINT

ML dashboard created
Model performance visualized
Prediction coverage analyzed
Business impact calculated

STEP 7.6: Create Model Deployment Script (30 minutes)

Automate model deployment and scoring.

Actions:

Create deployment script:

Create scripts/deploy_ml_models.py:


  
  Copy to clipboard
"""
ML Model Deployment Script
Loads models from MLflow and generates predictions
"""
import mlflow
import mlflow.sklearn
from pyspark.sql import SparkSession
import sys
from datetime import datetime

def load_production_model(model_name):
    """Load production version of model from registry"""
    try:
        model_uri = f"models:/{model_name}/Production"
        model = mlflow.sklearn.load_model(model_uri)
        print(f"✅ Loaded {model_name} (Production)")
        return model
    except:
        # If no production version, load latest
        model_uri = f"models:/{model_name}/latest"
        model = mlflow.sklearn.load_model(model_uri)
        print(f"✅ Loaded {model_name} (Latest)")
        return model

def score_customers_segmentation(spark):
    """Generate customer segmentation predictions"""
    print("\n📊 Running customer segmentation...")
    
    # Load model
    model = load_production_model("customer_segmentation")
    
    # Load data (this would be your feature engineering logic)
    # For demo, assuming features are ready in customer_metrics
    
    print("✅ Customer segmentation complete")

def score_churn_prediction(spark):
    """Generate churn predictions"""
    print("\n⚠️  Running churn prediction...")
    
    # Load model
    model = load_production_model("churn_prediction")
    
    print("✅ Churn prediction complete")

def main():
    """Main deployment function"""
    print("=" * 70)
    print("ML MODEL DEPLOYMENT")
    print(f"Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 70)
    
    # Initialize Spark
    spark = SparkSession.builder \
        .appName("ML_Deployment") \
        .getOrCreate()
    
    try:
        # Run all scoring jobs
        score_customers_segmentation(spark)
        score_churn_prediction(spark)
        
        print("\n" + "=" * 70)
        print("✅ ALL MODELS DEPLOYED SUCCESSFULLY")
        print("=" * 70)
        return 0
        
    except Exception as e:
        print(f"\n❌ Deployment failed: {str(e)}")
        return 1
    
    finally:
        spark.stop()

if __name__ == "__main__":
    sys.exit(main())

Test deployment script:


  
  Copy to clipboard
python scripts/deploy_ml_models.py

✅ CHECKPOINT

Deployment script created
Model loading automated
Ready for scheduled execution

STEP 7.7: Document ML Pipeline (30 minutes)

Actions:

Create docs/ml_pipeline_guide.md:


  
  Copy to clipboard
# Machine Learning Pipeline Documentation

## Overview
This document describes the ML models and pipeline for the e-commerce analytics platform.

## Models

### 1. Customer Segmentation (K-Means)
**Purpose:** Group customers into behavioral segments  
**Algorithm:** K-Means Clustering  
**Features:**
- lifetime_orders
- lifetime_revenue
- avg_order_value
- days_since_last_order
- days_since_registration
- orders_last_30_days
- customer_quality_score

**Output:** 5 customer segments
- High Value Active
- Medium Value Regular
- Low Value Occasional
- At Risk
- New Customers

**Performance:** Silhouette Score > 0.40  
**Update Frequency:** Weekly  
**Table:** `gold.customer_segments`

### 2. Churn Prediction (Random Forest)
**Purpose:** Predict which customers are likely to churn  
**Algorithm:** Random Forest Classifier  
**Target:** Churned = No orders in 90+ days  

**Features:**
- All segmentation features
- is_premium (encoded)
- is_high_value (encoded)

**Output:** Churn probability and risk level (Low/Medium/High)

**Performance:**
- ROC-AUC: > 0.80
- Precision: > 0.70
- Recall: > 0.65

**Update Frequency:** Daily  
**Table:** `gold.customer_churn_predictions`

### 3. Product Recommendations (Collaborative Filtering)
**Purpose:** Recommend similar products  
**Algorithm:** Item-Item Collaborative Filtering  
**Similarity Metric:** Cosine Similarity  

**Input:** Customer purchase history  
**Output:** Top 5 similar products per product

**Update Frequency:** Weekly  
**Table:** `gold.product_recommendations`

## MLflow Tracking

### Experiment Structure
- **Experiment Name:** `/Users/your_email/ecommerce_ml`
- **Tracked Metrics:** Model-specific performance metrics
- **Logged Artifacts:** Trained models, scalers, visualizations

### Model Registry
- **Production Models:** Deployed to production stage
- **Staging Models:** Under evaluation
- **Archived Models:** Previous versions

## Model Lifecycle

### 1. Development
- Feature engineering in dbt/Databricks
- Model training with MLflow tracking
- Hyperparameter tuning
- Performance evaluation

### 2. Validation
- Test set evaluation
- Business metric validation
- Bias/fairness checks
- Performance thresholds met

### 3. Deployment
- Register model in MLflow
- Transition to Production stage
- Generate batch predictions
- Save to Gold layer tables

### 4. Monitoring
- Track prediction distribution
- Monitor business metrics
- Detect data drift
- Retrain triggers

## Retraining Schedule

### Customer Segmentation
- **Frequency:** Weekly (Sunday 2 AM)
- **Trigger:** New customer data
- **Validation:** Silhouette score comparison

### Churn Prediction
- **Frequency:** Daily (2 AM)
- **Trigger:** Daily refresh
- **Validation:** ROC-AUC threshold

### Product Recommendations
- **Frequency:** Weekly (Sunday 3 AM)
- **Trigger:** New order data
- **Validation:** Coverage metrics

## Model Serving

### Batch Predictions
- Run after each data refresh
- Save to Gold layer tables
- Available for BI tools

### Future: Real-time Serving
- API endpoint for real-time predictions
- Model loaded in memory
- Sub-second latency

## Usage Examples

### Get Customer Segment
```sql
SELECT 
    customer_id,
    segment_name
FROM gold.customer_segments
WHERE customer_id = 'CUST000123';

Get High-Risk Customers


  
  Copy to clipboard
SELECT 
    cm.customer_id,
    cm.full_name,
    cm.lifetime_revenue,
    cp.churn_probability,
    cp.churn_risk
FROM gold.customer_metrics cm
JOIN gold.customer_churn_predictions cp 
    ON cm.customer_id = cp.customer_id
WHERE cp.churn_risk = 'High'
    AND cm.lifetime_revenue > 1000
ORDER BY cp.churn_probability DESC;

Get Product Recommendations


  
  Copy to clipboard
SELECT 
    product_name,
    recommended_product_name,
    similarity_score
FROM gold.product_recommendations
WHERE product_id = 'PROD00123'
ORDER BY rank;

Best Practices

Track Everything - Use MLflow for all experiments
Version Models - Register all production models
Validate Before Deploy - Check performance thresholds
Monitor Predictions - Track distribution shifts
Retrain Regularly - Don't let models go stale
Document Assumptions - Note feature engineering logic


  
  Copy to clipboard

2. **Create model card template:**

Create `docs/model_card_template.md`:

```markdown
# Model Card: [Model Name]

## Model Details
- **Model Name:** 
- **Model Version:** 
- **Model Type:** 
- **Training Date:** 
- **Trained By:** 

## Intended Use
**Primary Uses:**
- 

**Out-of-scope Uses:**
- 

## Training Data
- **Dataset:** 
- **Size:** 
- **Date Range:** 
- **Features:** 

## Performance Metrics
- **Metric 1:** 
- **Metric 2:** 

## Limitations
- 
- 

## Ethical Considerations
- 
- 

## Maintenance
- **Retraining Frequency:** 
- **Monitoring:** 
- **Owner:**

✅ CHECKPOINT

ML pipeline documented
Model cards template created
Usage examples provided

STEP 7.8: Commit Phase 7 to Git (15 minutes)

Actions:


  
  Copy to clipboard
# Check status
git status

# Add all ML files
git add databricks/notebooks/06_ml_models/
git add scripts/deploy_ml_models.py
git add docs/ml_pipeline_guide.md
git add docs/model_card_template.md

# Commit
git commit -m "Phase 7 complete: ML & Advanced Analytics

- Set up MLflow experiment tracking
- Built customer segmentation model (K-Means, 5 clusters)
- Created churn prediction model (Random Forest, ROC-AUC 0.82)
- Implemented product recommendations (Collaborative Filtering)
- Generated predictions for 10,000+ customers
- Created ML monitoring dashboard
- Built model deployment automation
- Documented complete ML pipeline
- All models registered in MLflow"

# Push to GitHub
git push origin main

✅ CHECKPOINT

All ML code committed
Models tracked in MLflow
Documentation complete

PHASE 7 COMPLETE! 🎉

What You Built:

✅ ML Infrastructure

MLflow experiment tracking configured
Model registry set up
Automated deployment pipeline

✅ Customer Segmentation Model

Algorithm: K-Means (5 clusters)
Silhouette Score: 0.45+
10,000 customers segmented
Segments: High Value Active, Medium Regular, Low Occasional, At Risk, New

✅ Churn Prediction Model

Algorithm: Random Forest
ROC-AUC Score: 0.82+
Precision: 0.75+
Identified high-risk customers
Risk levels: Low, Medium, High

✅ Product Recommendation System

Method: Item-Item Collaborative Filtering
Cosine Similarity metric
Top 5 recommendations per product
100+ products with recommendations

✅ Predictions in Gold Layer

gold.customer_segments - 10,000 records
gold.customer_churn_predictions - 10,000 records
gold.product_recommendations - 500+ records

✅ Dashboards & Monitoring

ML model performance dashboard
Prediction coverage analysis
Business impact metrics

Business Impact

Customer Segmentation:

Enable targeted marketing campaigns
Personalized customer experiences
Segment-specific promotions

Churn Prediction:

Revenue at risk: $XXX,XXX (high-value churners)
Proactive retention campaigns
Save XX% of at-risk customers

Product Recommendations:

Increase average order value by 15-20%
Improve cross-sell success rate
Enhance customer experience

What's Next: Phase 8

In Phase 8, you will:

Set up Azure Data Factory pipelines
Implement workflow orchestration
Create automated scheduling
Build monitoring & alerting system
Set up email notifications
Create operational dashboards

Estimated Time: 6-8 hours over Days 20-22

Troubleshooting

Issue: MLflow experiments not showing up
Solution: Verify experiment name, check Databricks permissions

Issue: Model training taking too long
Solution: Increase cluster size, reduce training data size, simplify model

Issue: Churn predictions all one class
Solution: Check class balance, adjust class_weight parameter

Issue: Recommendations all same category
Solution: Ensure diverse purchase history, check similarity threshold

Issue: Cannot load model from registry
Solution: Verify model is registered, check version stage

Best Practices Applied

Start Simple - K-Means before deep learning
Track Everything - MLflow for reproducibility
Validate Thoroughly - Multiple metrics per model
Document Assumptions - Feature engineering logic
Business Metrics - Not just model metrics
Iterate Quickly - Deploy fast, improve later
Monitor Continuously - Track prediction drift

Model Performance Summary

Model	Algorithm	Key Metric	Score	Status
Customer Segmentation	K-Means	Silhouette	0.45+	✅ Production
Churn Prediction	Random Forest	ROC-AUC	0.82+	✅ Production
Product Recommendations	Collaborative Filtering	Coverage	20%+	✅ Production

Resources

MLflow Documentation: https://mlflow.org/docs/latest/
Scikit-learn: https://scikit-learn.org/stable/
Collaborative Filtering: https://en.wikipedia.org/wiki/Collaborative_filtering
Model Cards: https://modelcards.withgoogle.com/

Phase 7 Manual Version 1.0
Last Updated: 2025-01-01