Duration: Days 1-3 | 6-8 hours total
Goal: Set up project structure and generate realistic sample data
In Phase 1, you will:
Go to GitHub and create a new repository named `ecommerce-data-platform`. Clone it to your local machine:
git clone https://github.com/YOUR_USERNAME/ecommerce-data-platform.git
cd ecommerce-data-platform
# Config and data directories
mkdir -p config
mkdir -p data/raw data/mock_apis data/schemas
# Databricks notebooks
mkdir -p databricks/notebooks/01_setup
mkdir -p databricks/notebooks/02_bronze_ingestion
mkdir -p databricks/notebooks/03_silver_transformations
mkdir -p databricks/notebooks/04_gold_aggregations
mkdir -p databricks/notebooks/05_data_quality
mkdir -p databricks/notebooks/06_ml_models
mkdir -p databricks/jobs
# dbt directories
mkdir -p dbt/models/bronze
mkdir -p dbt/models/silver
mkdir -p dbt/models/gold
mkdir -p dbt/tests
mkdir -p dbt/macros
# Other directories
mkdir -p terraform
mkdir -p scripts
mkdir -p tests
mkdir -p .github/workflows
mkdir -p docs
tree -L 3  # macOS/Linux; on Windows, use 'tree /F' or 'dir /s'
You should see all directories created. Run ls -la to verify.
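If `mkdir -p` isn't available (for example in a plain Windows Command Prompt), the same layout can be created with a short, cross-platform Python snippet; the directory names below are taken directly from the `mkdir` commands above:

```python
from pathlib import Path

# Same directory layout as the mkdir commands above
DIRS = [
    "config", "data/raw", "data/mock_apis", "data/schemas",
    "databricks/notebooks/01_setup",
    "databricks/notebooks/02_bronze_ingestion",
    "databricks/notebooks/03_silver_transformations",
    "databricks/notebooks/04_gold_aggregations",
    "databricks/notebooks/05_data_quality",
    "databricks/notebooks/06_ml_models",
    "databricks/jobs",
    "dbt/models/bronze", "dbt/models/silver", "dbt/models/gold",
    "dbt/tests", "dbt/macros",
    "terraform", "scripts", "tests", ".github/workflows", "docs",
]

for d in DIRS:
    Path(d).mkdir(parents=True, exist_ok=True)  # behaves like mkdir -p
```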
Create `requirements.txt` in the project root:
# Core data processing
pandas==2.0.3
numpy==1.24.3
pyarrow==12.0.1
# Azure SDK
azure-identity==1.13.0
azure-storage-blob==12.17.0
azure-keyvault-secrets==4.7.0
# Databricks
databricks-cli==0.17.7
databricks-sql-connector==2.9.3
# dbt
dbt-databricks==1.6.2
# Machine Learning
scikit-learn==1.3.0
mlflow==2.6.0
matplotlib==3.7.2
seaborn==0.12.2
# API and utilities
requests==2.31.0
python-dotenv==1.0.0
pyyaml==6.0.1
# Testing
pytest==7.4.0
pytest-cov==4.1.0
# Code quality
black==23.7.0
flake8==6.0.0
# Development
jupyter==1.0.0
ipykernel==6.25.0
Create `.env.example` in the project root:
# Azure Configuration
AZURE_SUBSCRIPTION_ID=your-subscription-id-here
AZURE_RESOURCE_GROUP=ecommerce-analytics-rg
AZURE_STORAGE_ACCOUNT=ecommercedata001
AZURE_STORAGE_KEY=your-storage-key-here
# Databricks Configuration
DATABRICKS_HOST=https://adb-xxxxx.azuredatabricks.net
DATABRICKS_TOKEN=your-databricks-personal-access-token
DATABRICKS_CLUSTER_ID=your-cluster-id-here
# Environment Settings
ENVIRONMENT=development
LOG_LEVEL=INFO
Create `setup.py` in the project root:
from setuptools import setup, find_packages
setup(
name="ecommerce-data-platform",
version="1.0.0",
description="E-commerce analytics platform with Databricks medallion architecture",
author="Your Name",
author_email="your.email@example.com",
packages=find_packages(),
install_requires=[
line.strip()
for line in open("requirements.txt").readlines()
if line.strip() and not line.startswith("#")
],
python_requires=">=3.9",
classifiers=[
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
],
)
Replace the contents of `README.md` with:
# E-Commerce Analytics Platform
Production-ready e-commerce analytics platform implementing Databricks medallion architecture.
## Architecture
- **Bronze Layer:** Raw data ingestion from CSV files
- **Silver Layer:** Cleaned and validated data transformations using dbt
- **Gold Layer:** Business-ready aggregations and analytics
## Tech Stack
- **Data Processing:** Databricks (PySpark), Delta Lake
- **Transformations:** dbt (data build tool)
- **Cloud:** Azure Data Lake Storage, Azure Data Factory
- **ML:** MLflow, scikit-learn
- **Infrastructure:** Terraform
- **CI/CD:** GitHub Actions
## Project Status
🚧 **Phase 1 Complete:** Environment setup and sample data generation
⏳ **Phase 2 In Progress:** Azure infrastructure setup
## Quick Start
See [docs/setup_guide.md](docs/setup_guide.md) for detailed setup instructions.
## License
MIT License
Add these lines to your existing `.gitignore`:
# Environment variables
.env
.env.local
# Virtual environment
venv/
env/
# Data files (don't commit large CSVs)
data/raw/*.csv
*.parquet
# Databricks
.databricks/
# dbt
dbt/target/
dbt/dbt_packages/
dbt/logs/
# Terraform
terraform/.terraform/
terraform/*.tfstate
terraform/*.tfstate.backup
All 5 configuration files created. Run `git status` to see the new files.
python3 -m venv venv
# On macOS/Linux:
source venv/bin/activate
# On Windows (Command Prompt):
venv\Scripts\activate
# On Windows (PowerShell):
venv\Scripts\Activate.ps1
You should see (venv) prefix in your terminal.
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
This will take 3-5 minutes to install all packages.
pip list | grep -E "pandas|databricks|dbt"

You should see pandas, databricks-cli, and dbt-databricks in the output.
No errors during installation; all key packages are visible in `pip list`.
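The `grep` pipe above assumes a Unix shell; a shell-agnostic alternative is to query package metadata directly from Python:

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it's absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

for name in ["pandas", "databricks-cli", "dbt-databricks", "mlflow"]:
    print(f"{name}: {installed_version(name) or 'NOT INSTALLED'}")
```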
This is the main step where you create realistic data for the project.
Create `scripts/generate_sample_data.py`:
"""
Generate realistic e-commerce sample data for analytics platform.
Creates customers, products, orders, order items, and web events.
"""
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
from pathlib import Path
# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)
# Configuration
START_DATE = datetime(2023, 1, 1)
END_DATE = datetime(2024, 12, 31)
NUM_CUSTOMERS = 10000
NUM_PRODUCTS = 500
NUM_ORDERS = 50000
NUM_WEB_EVENTS = 200000
OUTPUT_DIR = Path("data/raw")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print("=" * 70)
print("E-COMMERCE DATA GENERATION SCRIPT")
print("=" * 70)
print(f"\nConfiguration:")
print(f" Date Range: {START_DATE.date()} to {END_DATE.date()}")
print(f" Customers: {NUM_CUSTOMERS:,}")
print(f" Products: {NUM_PRODUCTS:,}")
print(f" Orders: {NUM_ORDERS:,}")
print(f" Web Events: {NUM_WEB_EVENTS:,}")
print(f" Output: {OUTPUT_DIR}\n")
# ====================
# 1. GENERATE CUSTOMERS
# ====================
def generate_customers(n=NUM_CUSTOMERS):
"""Generate customer master data"""
print(f"📊 Generating {n:,} customers...")
segments = ['Premium', 'Regular', 'Occasional', 'New']
segment_weights = [0.15, 0.35, 0.35, 0.15]
countries = ['USA', 'UK', 'Canada', 'Germany', 'France', 'Australia']
country_weights = [0.40, 0.15, 0.15, 0.12, 0.10, 0.08]
us_states = ['CA', 'NY', 'TX', 'FL', 'IL', 'PA', 'OH', 'GA', 'NC', 'MI']
customers = []
for i in range(n):
# Registration date (weighted toward recent)
days_ago = int(np.random.exponential(scale=300))
reg_date = END_DATE - timedelta(days=min(days_ago, 730))
country = np.random.choice(countries, p=country_weights)
state = np.random.choice(us_states) if country == 'USA' else None
customer = {
'customer_id': f"CUST{i+1:06d}",
'email': f"customer{i+1}@email.com",
'first_name': f"FirstName{i+1}",
'last_name': f"LastName{i+1}",
'registration_date': reg_date.strftime('%Y-%m-%d'),
'country': country,
'state': state,
'segment': np.random.choice(segments, p=segment_weights),
'marketing_opt_in': np.random.choice([True, False], p=[0.6, 0.4]),
'created_at': reg_date.strftime('%Y-%m-%d %H:%M:%S'),
'updated_at': (reg_date + timedelta(days=random.randint(0, 100))).strftime('%Y-%m-%d %H:%M:%S')
}
customers.append(customer)
df = pd.DataFrame(customers)
print(f" ✅ Generated {len(df):,} customers")
print(f" Segments: Premium={len(df[df['segment']=='Premium'])}, Regular={len(df[df['segment']=='Regular'])}")
return df
# ====================
# 2. GENERATE PRODUCTS
# ====================
def generate_products(n=NUM_PRODUCTS):
"""Generate product catalog"""
print(f"\n📦 Generating {n:,} products...")
categories = {
'Electronics': ['Smartphone', 'Laptop', 'Tablet', 'Headphones', 'Camera', 'Smart Watch'],
'Clothing': ['Shirt', 'Pants', 'Dress', 'Jacket', 'Shoes', 'Accessories'],
'Home & Garden': ['Furniture', 'Decor', 'Kitchen', 'Bedding', 'Tools'],
'Books': ['Fiction', 'Non-Fiction', 'Educational', 'Comics'],
'Sports': ['Equipment', 'Apparel', 'Supplements', 'Accessories']
}
products = []
product_id = 1
for category, subcategories in categories.items():
num_in_category = n // len(categories)
for _ in range(num_in_category):
subcategory = random.choice(subcategories)
# Price varies by category
if category == 'Electronics':
base_price = random.uniform(50, 2000)
elif category == 'Clothing':
base_price = random.uniform(20, 300)
elif category == 'Home & Garden':
base_price = random.uniform(30, 1000)
elif category == 'Books':
base_price = random.uniform(10, 50)
else: # Sports
base_price = random.uniform(25, 500)
product = {
'product_id': f"PROD{product_id:05d}",
'product_name': f"{subcategory} {product_id}",
'category': category,
'subcategory': subcategory,
'price': round(base_price, 2),
'cost': round(base_price * random.uniform(0.4, 0.7), 2),
'stock_quantity': random.randint(0, 1000),
'supplier': f"Supplier_{random.randint(1, 50)}",
'created_at': (START_DATE + timedelta(days=random.randint(0, 365))).strftime('%Y-%m-%d %H:%M:%S'),
'is_active': random.choice([True, True, True, False]) # 75% active
}
products.append(product)
product_id += 1
df = pd.DataFrame(products)
print(f" ✅ Generated {len(df):,} products")
print(f" Active: {len(df[df['is_active']==True])}, Price range: ${df['price'].min():.2f}-${df['price'].max():.2f}")
return df
# ====================
# 3. GENERATE ORDERS
# ====================
def generate_orders(customers_df, products_df, n=NUM_ORDERS):
"""Generate order transactions and order items"""
print(f"\n🛒 Generating {n:,} orders...")
statuses = ['completed', 'pending', 'cancelled', 'returned']
status_weights = [0.85, 0.05, 0.05, 0.05]
payment_methods = ['credit_card', 'debit_card', 'paypal', 'apple_pay', 'google_pay']
payment_weights = [0.45, 0.25, 0.15, 0.08, 0.07]
active_products = products_df[products_df['is_active']].copy()
orders = []
order_items = []
for order_num in range(n):
order_id = f"ORD{order_num+1:07d}"
# Select customer
customer = customers_df.sample(1).iloc[0]
customer_id = customer['customer_id']
# Order date (weighted toward recent)
days_ago = int(np.random.exponential(scale=180))
order_date = END_DATE - timedelta(days=min(days_ago, 730))
# Ensure order after customer registration
customer_reg = pd.to_datetime(customer['registration_date'])
if order_date < customer_reg:
order_date = customer_reg + timedelta(days=random.randint(1, 30))
# Number of items (1-4, weighted toward 1-2)
num_items = np.random.choice([1, 2, 3, 4], p=[0.50, 0.30, 0.15, 0.05])
# Select products
ordered_products = active_products.sample(n=num_items, replace=False)
# Calculate order totals
subtotal = 0
for _, product in ordered_products.iterrows():
quantity = random.randint(1, 3)
item_total = product['price'] * quantity
subtotal += item_total
order_items.append({
'order_id': order_id,
'product_id': product['product_id'],
'quantity': quantity,
'unit_price': product['price'],
'total_price': round(item_total, 2)
})
# Discount for premium customers
discount = 0
if customer['segment'] == 'Premium':
discount = subtotal * random.uniform(0.05, 0.15)
elif random.random() < 0.1: # 10% chance of promo
discount = subtotal * random.uniform(0.05, 0.10)
# Shipping (free over $50)
shipping_cost = 0 if subtotal > 50 else random.uniform(5, 15)
# Tax (8%)
tax = (subtotal - discount) * 0.08
total = subtotal - discount + shipping_cost + tax
orders.append({
'order_id': order_id,
'customer_id': customer_id,
'order_date': order_date.strftime('%Y-%m-%d'),
'order_timestamp': order_date.strftime('%Y-%m-%d %H:%M:%S'),
'status': np.random.choice(statuses, p=status_weights),
'payment_method': np.random.choice(payment_methods, p=payment_weights),
'subtotal': round(subtotal, 2),
'discount': round(discount, 2),
'shipping_cost': round(shipping_cost, 2),
'tax': round(tax, 2),
'total': round(total, 2),
'shipping_address': f"{random.randint(1, 9999)} Main St",
'shipping_city': f"City{random.randint(1, 100)}",
'shipping_state': customer['state'],
'shipping_country': customer['country'],
'created_at': order_date.strftime('%Y-%m-%d %H:%M:%S'),
'updated_at': (order_date + timedelta(hours=random.randint(1, 48))).strftime('%Y-%m-%d %H:%M:%S')
})
# Progress indicator
if (order_num + 1) % 10000 == 0:
print(f" Progress: {order_num + 1:,} / {n:,} orders")
orders_df = pd.DataFrame(orders)
order_items_df = pd.DataFrame(order_items)
print(f" ✅ Generated {len(orders_df):,} orders with {len(order_items_df):,} line items")
print(f" Total Revenue: ${orders_df['total'].sum():,.2f}")
print(f" Avg Order Value: ${orders_df['total'].mean():.2f}")
return orders_df, order_items_df
# ====================
# 4. GENERATE WEB EVENTS
# ====================
def generate_web_events(customers_df, products_df, n=NUM_WEB_EVENTS):
"""Generate website interaction events"""
print(f"\n🌐 Generating {n:,} web events...")
event_types = ['page_view', 'product_view', 'add_to_cart', 'remove_from_cart',
'search', 'wishlist_add', 'review_submit']
event_weights = [0.45, 0.25, 0.12, 0.03, 0.10, 0.03, 0.02]
devices = ['desktop', 'mobile', 'tablet']
device_weights = [0.45, 0.45, 0.10]
browsers = ['Chrome', 'Safari', 'Firefox', 'Edge', 'Other']
browser_weights = [0.55, 0.25, 0.10, 0.07, 0.03]
active_products = products_df[products_df['is_active']].copy()
events = []
for event_num in range(n):
# Timestamp
days_ago = int(np.random.exponential(scale=90))
event_time = END_DATE - timedelta(
days=min(days_ago, 730),
hours=random.randint(0, 23),
minutes=random.randint(0, 59),
seconds=random.randint(0, 59)
)
# 70% registered users, 30% anonymous
if random.random() < 0.7:
customer = customers_df.sample(1).iloc[0]
customer_id = customer['customer_id']
session_id = f"SESSION_{customer_id}_{event_time.strftime('%Y%m%d%H')}"
else:
customer_id = None
session_id = f"SESSION_ANON_{random.randint(1, 50000)}_{event_time.strftime('%Y%m%d%H')}"
event_type = np.random.choice(event_types, p=event_weights)
# Product for product-related events
product_id = None
if event_type in ['product_view', 'add_to_cart', 'remove_from_cart', 'wishlist_add', 'review_submit']:
product_id = active_products.sample(1).iloc[0]['product_id']
# Search query
search_query = None
if event_type == 'search':
search_terms = ['laptop', 'phone', 'shoes', 'furniture', 'book', 'camera', 'watch']
search_query = random.choice(search_terms)
event = {
'event_id': f"EVT{event_num+1:08d}",
'session_id': session_id,
'customer_id': customer_id,
'event_timestamp': event_time.strftime('%Y-%m-%d %H:%M:%S'),
'event_type': event_type,
'product_id': product_id,
'search_query': search_query,
'device_type': np.random.choice(devices, p=device_weights),
'browser': np.random.choice(browsers, p=browser_weights),
'page_url': f"/products/{random.randint(1, 100)}" if event_type == 'page_view' else None,
'referrer_url': random.choice([None, 'google.com', 'facebook.com', 'instagram.com', None, None])
}
events.append(event)
# Progress indicator
if (event_num + 1) % 50000 == 0:
print(f" Progress: {event_num + 1:,} / {n:,} events")
df = pd.DataFrame(events)
print(f" ✅ Generated {len(df):,} web events")
return df
# ====================
# EXECUTE DATA GENERATION
# ====================
print("\n" + "=" * 70)
print("STARTING DATA GENERATION")
print("=" * 70 + "\n")
# Generate all datasets
customers_df = generate_customers()
products_df = generate_products()
orders_df, order_items_df = generate_orders(customers_df, products_df)
web_events_df = generate_web_events(customers_df, products_df)
# ====================
# SAVE TO CSV FILES
# ====================
print("\n" + "=" * 70)
print("SAVING DATA TO CSV FILES")
print("=" * 70 + "\n")
customers_df.to_csv(OUTPUT_DIR / 'customers.csv', index=False)
print(f"✅ Saved customers.csv ({len(customers_df):,} rows)")
products_df.to_csv(OUTPUT_DIR / 'products.csv', index=False)
print(f"✅ Saved products.csv ({len(products_df):,} rows)")
orders_df.to_csv(OUTPUT_DIR / 'orders.csv', index=False)
print(f"✅ Saved orders.csv ({len(orders_df):,} rows)")
order_items_df.to_csv(OUTPUT_DIR / 'order_items.csv', index=False)
print(f"✅ Saved order_items.csv ({len(order_items_df):,} rows)")
web_events_df.to_csv(OUTPUT_DIR / 'web_events.csv', index=False)
print(f"✅ Saved web_events.csv ({len(web_events_df):,} rows)")
# ====================
# FINAL SUMMARY
# ====================
print("\n" + "=" * 70)
print("DATA GENERATION COMPLETE!")
print("=" * 70)
print(f"\n📁 All files saved to: {OUTPUT_DIR}")
print(f"\n📊 Dataset Summary:")
print(f" • Customers: {len(customers_df):>8,} records")
print(f" • Products: {len(products_df):>8,} records")
print(f" • Orders: {len(orders_df):>8,} records")
print(f" • Order Items: {len(order_items_df):>8,} records")
print(f" • Web Events: {len(web_events_df):>8,} records")
print(f" • TOTAL: {len(customers_df) + len(products_df) + len(orders_df) + len(order_items_df) + len(web_events_df):>8,} records")
print(f"\n💰 Total Revenue: ${orders_df['total'].sum():,.2f}")
print(f"\n🎉 Ready for Phase 2: Azure Infrastructure Setup!")
print("=" * 70 + "\n")
python scripts/generate_sample_data.py
This will take 2-3 minutes to complete. You'll see progress indicators for orders and web events.
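Because the script seeds both RNGs at the top (`np.random.seed(42)` and `random.seed(42)`), rerunning it regenerates the same data. The idea in miniature:

```python
import random

import numpy as np

def draw():
    """Reseed both RNGs, then make the same style of draws the generator makes."""
    np.random.seed(42)
    random.seed(42)
    return int(np.random.exponential(scale=300)), random.randint(1, 3)

# Two "runs" with the same seeds produce identical values
assert draw() == draw()
```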
You should see five new CSV files (customers.csv, products.csv, orders.csv, order_items.csv, web_events.csv) in the data/raw/ directory.
python3
Then run:
import pandas as pd
# Load and check each file
customers = pd.read_csv('data/raw/customers.csv')
products = pd.read_csv('data/raw/products.csv')
orders = pd.read_csv('data/raw/orders.csv')
order_items = pd.read_csv('data/raw/order_items.csv')
web_events = pd.read_csv('data/raw/web_events.csv')
print(f"Customers: {len(customers):,} rows")
print(f"Products: {len(products):,} rows")
print(f"Orders: {len(orders):,} rows")
print(f"Order Items: {len(order_items):,} rows")
print(f"Web Events: {len(web_events):,} rows")
# Check first few rows
print("\nCustomers sample:")
print(customers.head())
# Exit Python
exit()
All files load successfully, row counts match expected values.
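Beyond row counts, it's worth spot-checking referential integrity: every `order_id` in `order_items.csv` should exist in `orders.csv`, and every non-null `customer_id` in `orders.csv` should exist in `customers.csv`. A small helper for this (column and file names as generated above):

```python
import pandas as pd

def orphan_keys(child, child_key, parent, parent_key):
    """Return child rows whose key has no matching row in parent."""
    return child[~child[child_key].isin(parent[parent_key])]

# Usage against the generated files:
# orders = pd.read_csv("data/raw/orders.csv")
# order_items = pd.read_csv("data/raw/order_items.csv")
# print(len(orphan_keys(order_items, "order_id", orders, "order_id")))  # expect 0
```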
# Check what files were created
git status
# Add all new files (except data/raw/*.csv which is in .gitignore)
git add .
# Commit with descriptive message
git commit -m "Phase 1 complete: Project setup and sample data generation
- Created project directory structure
- Added configuration files (requirements.txt, setup.py, .env.example)
- Generated 260,500+ records of realistic e-commerce data
- Created data generation script with 5 datasets
- Set up Python virtual environment"
# Push to GitHub
git push origin main
Check GitHub - you should see all files except the large CSV files (they're gitignored).
✅ Project Infrastructure
✅ Development Environment
✅ Sample Data (260,500+ records)
✅ Data Quality
In Phase 2, you will:
Estimated Time: 4-6 hours over Days 4-5
Issue: `pip install` fails with a permission error
Solution: Make sure the virtual environment is activated (you should see `(venv)` in your terminal)
Issue: Data generation script runs out of memory
Solution: Reduce NUM_CUSTOMERS, NUM_ORDERS, or NUM_WEB_EVENTS in the script
Issue: CSV files not in data/raw/
Solution: Check that OUTPUT_DIR path is correct and directory was created
Phase 1 Manual Version 1.0
Last Updated: 2025-01-01