E-Commerce Analytics Platform

Phase 1: Environment Setup & Data Generation

Duration: Days 1-3 | 6-8 hours total
Goal: Set up project structure and generate realistic sample data


OVERVIEW

In Phase 1, you will:

  • Create the GitHub repository and full project directory structure
  • Write five configuration files (requirements.txt, .env.example, setup.py, README.md, .gitignore)
  • Set up a Python virtual environment and install dependencies
  • Generate five realistic sample datasets (260,500+ records)
  • Verify the data and commit Phase 1 to Git

STEP 1.1: Create GitHub Repository (30 minutes)

Actions:

  1. Go to GitHub and create new repository:

    • Repository name: ecommerce-data-platform
    • Description: "Production e-commerce analytics platform with Databricks medallion architecture"
    • Visibility: Public
    • Initialize with README: Yes
    • Add .gitignore: Python template
    • License: MIT (optional)
  2. Clone to your local machine:

```bash
git clone https://github.com/YOUR_USERNAME/ecommerce-data-platform.git
cd ecommerce-data-platform
```
  3. Create complete directory structure:

```bash
# Config and data directories
mkdir -p config
mkdir -p data/raw data/mock_apis data/schemas

# Databricks notebooks
mkdir -p databricks/notebooks/01_setup
mkdir -p databricks/notebooks/02_bronze_ingestion
mkdir -p databricks/notebooks/03_silver_transformations
mkdir -p databricks/notebooks/04_gold_aggregations
mkdir -p databricks/notebooks/05_data_quality
mkdir -p databricks/notebooks/06_ml_models
mkdir -p databricks/jobs

# dbt directories
mkdir -p dbt/models/bronze
mkdir -p dbt/models/silver
mkdir -p dbt/models/gold
mkdir -p dbt/tests
mkdir -p dbt/macros

# Other directories
mkdir -p terraform
mkdir -p scripts
mkdir -p tests
mkdir -p .github/workflows
mkdir -p docs
```
  4. Verify structure:

```bash
tree -L 3  # or use 'ls -R' on Windows
```

✅ CHECKPOINT

You should see all directories created. Run ls -la to verify.


STEP 1.2: Create Configuration Files (45 minutes)

File 1: requirements.txt

Create file in project root:

```
# Core data processing
pandas==2.0.3
numpy==1.24.3
pyarrow==12.0.1

# Azure SDK
azure-identity==1.13.0
azure-storage-blob==12.17.0
azure-keyvault-secrets==4.7.0

# Databricks
databricks-cli==0.17.7
databricks-sql-connector==2.9.3

# dbt
dbt-databricks==1.6.2

# Machine Learning
scikit-learn==1.3.0
mlflow==2.6.0
matplotlib==3.7.2
seaborn==0.12.2

# API and utilities
requests==2.31.0
python-dotenv==1.0.0
pyyaml==6.0.1

# Testing
pytest==7.4.0
pytest-cov==4.1.0

# Code quality
black==23.7.0
flake8==6.0.0

# Development
jupyter==1.0.0
ipykernel==6.25.0
```

File 2: .env.example

Create in project root:

```
# Azure Configuration
AZURE_SUBSCRIPTION_ID=your-subscription-id-here
AZURE_RESOURCE_GROUP=ecommerce-analytics-rg
AZURE_STORAGE_ACCOUNT=ecommercedata001
AZURE_STORAGE_KEY=your-storage-key-here

# Databricks Configuration
DATABRICKS_HOST=https://adb-xxxxx.azuredatabricks.net
DATABRICKS_TOKEN=your-databricks-personal-access-token
DATABRICKS_CLUSTER_ID=your-cluster-id-here

# Environment Settings
ENVIRONMENT=development
LOG_LEVEL=INFO
```
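In the project itself you'd load these values with python-dotenv's `load_dotenv()` (it's pinned in requirements.txt). As a sketch of what that file format means, here is a hand-rolled parser — `parse_env` is purely illustrative, not part of the project:

```python
def parse_env(text):
    """Parse .env-style text into a dict, skipping blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # ignore comments, blank lines, and malformed entries
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """
# Environment Settings
ENVIRONMENT=development
LOG_LEVEL=INFO
"""
print(parse_env(sample)["ENVIRONMENT"])  # development
```

Keys to the left of the first `=` become dict keys; everything after it is the value.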

File 3: setup.py

Create in project root:

```python
from setuptools import setup, find_packages

setup(
    name="ecommerce-data-platform",
    version="1.0.0",
    description="E-commerce analytics platform with Databricks medallion architecture",
    author="Your Name",
    author_email="your.email@example.com",
    packages=find_packages(),
    install_requires=[
        line.strip()
        for line in open("requirements.txt").readlines()
        if line.strip() and not line.startswith("#")
    ],
    python_requires=">=3.9",
    classifiers=[
        "Development Status :: 4 - Beta",
        "Intended Audience :: Developers",
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",
        "Programming Language :: Python :: 3.11",
    ],
)
```
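The `install_requires` comprehension in setup.py reuses requirements.txt as the single source of version pins, keeping only non-blank, non-comment lines. The same logic, shown standalone (`parse_requirements` is an illustrative name, not in the project):

```python
def parse_requirements(text):
    """Keep non-empty lines that aren't comments -- the setup.py filter."""
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]

sample = """# Core data processing
pandas==2.0.3

numpy==1.24.3
"""
print(parse_requirements(sample))  # ['pandas==2.0.3', 'numpy==1.24.3']
```

With setup.py in place you can also install the project itself in editable mode via `pip install -e .`.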

File 4: Update README.md

Replace contents with:

```markdown
# E-Commerce Analytics Platform

Production-ready e-commerce analytics platform implementing Databricks medallion architecture.

## Architecture

- **Bronze Layer:** Raw data ingestion from CSV files
- **Silver Layer:** Cleaned and validated data transformations using dbt
- **Gold Layer:** Business-ready aggregations and analytics

## Tech Stack

- **Data Processing:** Databricks (PySpark), Delta Lake
- **Transformations:** dbt (data build tool)
- **Cloud:** Azure Data Lake Storage, Azure Data Factory
- **ML:** MLflow, scikit-learn
- **Infrastructure:** Terraform
- **CI/CD:** GitHub Actions

## Project Status

🚧 **Phase 1 Complete:** Environment setup and sample data generation

⏳ **Phase 2 In Progress:** Azure infrastructure setup

## Quick Start

See [docs/setup_guide.md](docs/setup_guide.md) for detailed setup instructions.

## License

MIT License
```

File 5: .gitignore

Add these lines to your existing .gitignore:

```
# Environment variables
.env
.env.local

# Virtual environment
venv/
env/

# Data files (don't commit large CSVs)
data/raw/*.csv
*.parquet

# Databricks
.databricks/

# dbt
dbt/target/
dbt/dbt_packages/
dbt/logs/

# Terraform
terraform/.terraform/
terraform/*.tfstate
terraform/*.tfstate.backup
```

✅ CHECKPOINT

All 5 configuration files created. Run git status to see new files.


STEP 1.3: Set Up Python Virtual Environment (20 minutes)

Actions:

  1. Create virtual environment:

```bash
python3 -m venv venv
```
  2. Activate the virtual environment:

```bash
# On macOS/Linux:
source venv/bin/activate

# On Windows (Command Prompt):
venv\Scripts\activate

# On Windows (PowerShell):
venv\Scripts\Activate.ps1
```

You should see a (venv) prefix in your terminal.

  3. Upgrade pip:

```bash
pip install --upgrade pip setuptools wheel
```
  4. Install all requirements:

```bash
pip install -r requirements.txt
```

This will take 3-5 minutes to install all packages.

  5. Verify installation:

```bash
pip list | grep -E "pandas|databricks|dbt"
```

You should see pandas, databricks-cli, and dbt-databricks in the output.

✅ CHECKPOINT

No errors during installation. All key packages visible in pip list.


STEP 1.4: Generate Sample E-Commerce Data (2 hours)

This is the main step where you create realistic data for the project.

Create scripts/generate_sample_data.py

```python
"""
Generate realistic e-commerce sample data for analytics platform.
Creates customers, products, orders, order items, and web events.
"""
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
from pathlib import Path

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)

# Configuration
START_DATE = datetime(2023, 1, 1)
END_DATE = datetime(2024, 12, 31)
NUM_CUSTOMERS = 10000
NUM_PRODUCTS = 500
NUM_ORDERS = 50000
NUM_WEB_EVENTS = 200000

OUTPUT_DIR = Path("data/raw")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("=" * 70)
print("E-COMMERCE DATA GENERATION SCRIPT")
print("=" * 70)
print(f"\nConfiguration:")
print(f"  Date Range: {START_DATE.date()} to {END_DATE.date()}")
print(f"  Customers: {NUM_CUSTOMERS:,}")
print(f"  Products: {NUM_PRODUCTS:,}")
print(f"  Orders: {NUM_ORDERS:,}")
print(f"  Web Events: {NUM_WEB_EVENTS:,}")
print(f"  Output: {OUTPUT_DIR}\n")


# ====================
# 1. GENERATE CUSTOMERS
# ====================
def generate_customers(n=NUM_CUSTOMERS):
    """Generate customer master data."""
    print(f"📊 Generating {n:,} customers...")

    segments = ['Premium', 'Regular', 'Occasional', 'New']
    segment_weights = [0.15, 0.35, 0.35, 0.15]
    countries = ['USA', 'UK', 'Canada', 'Germany', 'France', 'Australia']
    country_weights = [0.40, 0.15, 0.15, 0.12, 0.10, 0.08]
    us_states = ['CA', 'NY', 'TX', 'FL', 'IL', 'PA', 'OH', 'GA', 'NC', 'MI']

    customers = []
    for i in range(n):
        # Registration date (weighted toward recent)
        days_ago = int(np.random.exponential(scale=300))
        reg_date = END_DATE - timedelta(days=min(days_ago, 730))

        country = np.random.choice(countries, p=country_weights)
        state = np.random.choice(us_states) if country == 'USA' else None

        customer = {
            'customer_id': f"CUST{i+1:06d}",
            'email': f"customer{i+1}@email.com",
            'first_name': f"FirstName{i+1}",
            'last_name': f"LastName{i+1}",
            'registration_date': reg_date.strftime('%Y-%m-%d'),
            'country': country,
            'state': state,
            'segment': np.random.choice(segments, p=segment_weights),
            'marketing_opt_in': np.random.choice([True, False], p=[0.6, 0.4]),
            'created_at': reg_date.strftime('%Y-%m-%d %H:%M:%S'),
            'updated_at': (reg_date + timedelta(days=random.randint(0, 100))).strftime('%Y-%m-%d %H:%M:%S')
        }
        customers.append(customer)

    df = pd.DataFrame(customers)
    print(f"   ✅ Generated {len(df):,} customers")
    print(f"   Segments: Premium={len(df[df['segment']=='Premium'])}, Regular={len(df[df['segment']=='Regular'])}")
    return df


# ====================
# 2. GENERATE PRODUCTS
# ====================
def generate_products(n=NUM_PRODUCTS):
    """Generate product catalog."""
    print(f"\n📦 Generating {n:,} products...")

    categories = {
        'Electronics': ['Smartphone', 'Laptop', 'Tablet', 'Headphones', 'Camera', 'Smart Watch'],
        'Clothing': ['Shirt', 'Pants', 'Dress', 'Jacket', 'Shoes', 'Accessories'],
        'Home & Garden': ['Furniture', 'Decor', 'Kitchen', 'Bedding', 'Tools'],
        'Books': ['Fiction', 'Non-Fiction', 'Educational', 'Comics'],
        'Sports': ['Equipment', 'Apparel', 'Supplements', 'Accessories']
    }

    products = []
    product_id = 1
    for category, subcategories in categories.items():
        num_in_category = n // len(categories)
        for _ in range(num_in_category):
            subcategory = random.choice(subcategories)

            # Price varies by category
            if category == 'Electronics':
                base_price = random.uniform(50, 2000)
            elif category == 'Clothing':
                base_price = random.uniform(20, 300)
            elif category == 'Home & Garden':
                base_price = random.uniform(30, 1000)
            elif category == 'Books':
                base_price = random.uniform(10, 50)
            else:  # Sports
                base_price = random.uniform(25, 500)

            product = {
                'product_id': f"PROD{product_id:05d}",
                'product_name': f"{subcategory} {product_id}",
                'category': category,
                'subcategory': subcategory,
                'price': round(base_price, 2),
                'cost': round(base_price * random.uniform(0.4, 0.7), 2),
                'stock_quantity': random.randint(0, 1000),
                'supplier': f"Supplier_{random.randint(1, 50)}",
                'created_at': (START_DATE + timedelta(days=random.randint(0, 365))).strftime('%Y-%m-%d %H:%M:%S'),
                'is_active': random.choice([True, True, True, False])  # 75% active
            }
            products.append(product)
            product_id += 1

    df = pd.DataFrame(products)
    print(f"   ✅ Generated {len(df):,} products")
    print(f"   Active: {len(df[df['is_active']==True])}, Price range: ${df['price'].min():.2f}-${df['price'].max():.2f}")
    return df


# ====================
# 3. GENERATE ORDERS
# ====================
def generate_orders(customers_df, products_df, n=NUM_ORDERS):
    """Generate order transactions and order items."""
    print(f"\n🛒 Generating {n:,} orders...")

    statuses = ['completed', 'pending', 'cancelled', 'returned']
    status_weights = [0.85, 0.05, 0.05, 0.05]
    payment_methods = ['credit_card', 'debit_card', 'paypal', 'apple_pay', 'google_pay']
    payment_weights = [0.45, 0.25, 0.15, 0.08, 0.07]

    active_products = products_df[products_df['is_active']].copy()

    orders = []
    order_items = []
    for order_num in range(n):
        order_id = f"ORD{order_num+1:07d}"

        # Select customer
        customer = customers_df.sample(1).iloc[0]
        customer_id = customer['customer_id']

        # Order date (weighted toward recent)
        days_ago = int(np.random.exponential(scale=180))
        order_date = END_DATE - timedelta(days=min(days_ago, 730))

        # Ensure order after customer registration
        customer_reg = pd.to_datetime(customer['registration_date'])
        if order_date < customer_reg:
            order_date = customer_reg + timedelta(days=random.randint(1, 30))

        # Number of items (1-4, weighted toward 1-2)
        num_items = np.random.choice([1, 2, 3, 4], p=[0.50, 0.30, 0.15, 0.05])

        # Select products
        ordered_products = active_products.sample(n=num_items, replace=False)

        # Calculate order totals
        subtotal = 0
        for _, product in ordered_products.iterrows():
            quantity = random.randint(1, 3)
            item_total = product['price'] * quantity
            subtotal += item_total

            order_items.append({
                'order_id': order_id,
                'product_id': product['product_id'],
                'quantity': quantity,
                'unit_price': product['price'],
                'total_price': round(item_total, 2)
            })

        # Discount for premium customers
        discount = 0
        if customer['segment'] == 'Premium':
            discount = subtotal * random.uniform(0.05, 0.15)
        elif random.random() < 0.1:  # 10% chance of promo
            discount = subtotal * random.uniform(0.05, 0.10)

        # Shipping (free over $50)
        shipping_cost = 0 if subtotal > 50 else random.uniform(5, 15)

        # Tax (8%)
        tax = (subtotal - discount) * 0.08
        total = subtotal - discount + shipping_cost + tax

        orders.append({
            'order_id': order_id,
            'customer_id': customer_id,
            'order_date': order_date.strftime('%Y-%m-%d'),
            'order_timestamp': order_date.strftime('%Y-%m-%d %H:%M:%S'),
            'status': np.random.choice(statuses, p=status_weights),
            'payment_method': np.random.choice(payment_methods, p=payment_weights),
            'subtotal': round(subtotal, 2),
            'discount': round(discount, 2),
            'shipping_cost': round(shipping_cost, 2),
            'tax': round(tax, 2),
            'total': round(total, 2),
            'shipping_address': f"{random.randint(1, 9999)} Main St",
            'shipping_city': f"City{random.randint(1, 100)}",
            'shipping_state': customer['state'],
            'shipping_country': customer['country'],
            'created_at': order_date.strftime('%Y-%m-%d %H:%M:%S'),
            'updated_at': (order_date + timedelta(hours=random.randint(1, 48))).strftime('%Y-%m-%d %H:%M:%S')
        })

        # Progress indicator
        if (order_num + 1) % 10000 == 0:
            print(f"   Progress: {order_num + 1:,} / {n:,} orders")

    orders_df = pd.DataFrame(orders)
    order_items_df = pd.DataFrame(order_items)
    print(f"   ✅ Generated {len(orders_df):,} orders with {len(order_items_df):,} line items")
    print(f"   Total Revenue: ${orders_df['total'].sum():,.2f}")
    print(f"   Avg Order Value: ${orders_df['total'].mean():.2f}")
    return orders_df, order_items_df


# ====================
# 4. GENERATE WEB EVENTS
# ====================
def generate_web_events(customers_df, products_df, n=NUM_WEB_EVENTS):
    """Generate website interaction events."""
    print(f"\n🌐 Generating {n:,} web events...")

    event_types = ['page_view', 'product_view', 'add_to_cart', 'remove_from_cart',
                   'search', 'wishlist_add', 'review_submit']
    event_weights = [0.45, 0.25, 0.12, 0.03, 0.10, 0.03, 0.02]
    devices = ['desktop', 'mobile', 'tablet']
    device_weights = [0.45, 0.45, 0.10]
    browsers = ['Chrome', 'Safari', 'Firefox', 'Edge', 'Other']
    browser_weights = [0.55, 0.25, 0.10, 0.07, 0.03]

    active_products = products_df[products_df['is_active']].copy()

    events = []
    for event_num in range(n):
        # Timestamp
        days_ago = int(np.random.exponential(scale=90))
        event_time = END_DATE - timedelta(
            days=min(days_ago, 730),
            hours=random.randint(0, 23),
            minutes=random.randint(0, 59),
            seconds=random.randint(0, 59)
        )

        # 70% registered users, 30% anonymous
        if random.random() < 0.7:
            customer = customers_df.sample(1).iloc[0]
            customer_id = customer['customer_id']
            session_id = f"SESSION_{customer_id}_{event_time.strftime('%Y%m%d%H')}"
        else:
            customer_id = None
            session_id = f"SESSION_ANON_{random.randint(1, 50000)}_{event_time.strftime('%Y%m%d%H')}"

        event_type = np.random.choice(event_types, p=event_weights)

        # Product for product-related events
        product_id = None
        if event_type in ['product_view', 'add_to_cart', 'remove_from_cart',
                          'wishlist_add', 'review_submit']:
            product_id = active_products.sample(1).iloc[0]['product_id']

        # Search query
        search_query = None
        if event_type == 'search':
            search_terms = ['laptop', 'phone', 'shoes', 'furniture', 'book', 'camera', 'watch']
            search_query = random.choice(search_terms)

        event = {
            'event_id': f"EVT{event_num+1:08d}",
            'session_id': session_id,
            'customer_id': customer_id,
            'event_timestamp': event_time.strftime('%Y-%m-%d %H:%M:%S'),
            'event_type': event_type,
            'product_id': product_id,
            'search_query': search_query,
            'device_type': np.random.choice(devices, p=device_weights),
            'browser': np.random.choice(browsers, p=browser_weights),
            'page_url': f"/products/{random.randint(1, 100)}" if event_type == 'page_view' else None,
            'referrer_url': random.choice([None, 'google.com', 'facebook.com', 'instagram.com', None, None])
        }
        events.append(event)

        # Progress indicator
        if (event_num + 1) % 50000 == 0:
            print(f"   Progress: {event_num + 1:,} / {n:,} events")

    df = pd.DataFrame(events)
    print(f"   ✅ Generated {len(df):,} web events")
    return df


# ====================
# EXECUTE DATA GENERATION
# ====================
print("\n" + "=" * 70)
print("STARTING DATA GENERATION")
print("=" * 70 + "\n")

# Generate all datasets
customers_df = generate_customers()
products_df = generate_products()
orders_df, order_items_df = generate_orders(customers_df, products_df)
web_events_df = generate_web_events(customers_df, products_df)

# ====================
# SAVE TO CSV FILES
# ====================
print("\n" + "=" * 70)
print("SAVING DATA TO CSV FILES")
print("=" * 70 + "\n")

customers_df.to_csv(OUTPUT_DIR / 'customers.csv', index=False)
print(f"✅ Saved customers.csv ({len(customers_df):,} rows)")

products_df.to_csv(OUTPUT_DIR / 'products.csv', index=False)
print(f"✅ Saved products.csv ({len(products_df):,} rows)")

orders_df.to_csv(OUTPUT_DIR / 'orders.csv', index=False)
print(f"✅ Saved orders.csv ({len(orders_df):,} rows)")

order_items_df.to_csv(OUTPUT_DIR / 'order_items.csv', index=False)
print(f"✅ Saved order_items.csv ({len(order_items_df):,} rows)")

web_events_df.to_csv(OUTPUT_DIR / 'web_events.csv', index=False)
print(f"✅ Saved web_events.csv ({len(web_events_df):,} rows)")

# ====================
# FINAL SUMMARY
# ====================
print("\n" + "=" * 70)
print("DATA GENERATION COMPLETE!")
print("=" * 70)
print(f"\n📁 All files saved to: {OUTPUT_DIR}")
print(f"\n📊 Dataset Summary:")
print(f"   • Customers:   {len(customers_df):>8,} records")
print(f"   • Products:    {len(products_df):>8,} records")
print(f"   • Orders:      {len(orders_df):>8,} records")
print(f"   • Order Items: {len(order_items_df):>8,} records")
print(f"   • Web Events:  {len(web_events_df):>8,} records")
print(f"   • TOTAL:       {len(customers_df) + len(products_df) + len(orders_df) + len(order_items_df) + len(web_events_df):>8,} records")
print(f"\n💰 Total Revenue: ${orders_df['total'].sum():,.2f}")
print(f"\n🎉 Ready for Phase 2: Azure Infrastructure Setup!")
print("=" * 70 + "\n")
```

Run the Data Generation Script

```bash
python scripts/generate_sample_data.py
```

This will take 2-3 minutes to complete. You'll see progress indicators for orders and web events.

✅ CHECKPOINT

You should see:

  • Progress indicators for orders and web events, ending with "DATA GENERATION COMPLETE!" and no errors
  • Five CSV files in data/raw/: customers.csv, products.csv, orders.csv, order_items.csv, and web_events.csv


STEP 1.5: Verify Generated Data (15 minutes)

Quick verification with Python:

```bash
python3
```

Then run:

```python
import pandas as pd

# Load and check each file
customers = pd.read_csv('data/raw/customers.csv')
products = pd.read_csv('data/raw/products.csv')
orders = pd.read_csv('data/raw/orders.csv')
order_items = pd.read_csv('data/raw/order_items.csv')
web_events = pd.read_csv('data/raw/web_events.csv')

print(f"Customers:   {len(customers):,} rows")
print(f"Products:    {len(products):,} rows")
print(f"Orders:      {len(orders):,} rows")
print(f"Order Items: {len(order_items):,} rows")
print(f"Web Events:  {len(web_events):,} rows")

# Check first few rows
print("\nCustomers sample:")
print(customers.head())

# Exit Python
exit()
```

✅ CHECKPOINT

All files load successfully, row counts match expected values.
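Row counts alone won't catch broken foreign keys. As an optional extra check, the sketch below verifies that every child-side key (e.g. orders.customer_id) exists on the parent side; `orphaned_keys` is an illustrative helper, and the tiny in-memory frames stand in for the generated CSVs:

```python
import pandas as pd

def orphaned_keys(child, parent):
    """Return child-side key values with no matching parent row."""
    return set(child.dropna()) - set(parent)

# Tiny in-memory stand-ins; in practice pass the columns loaded from data/raw/.
customers = pd.DataFrame({'customer_id': ['CUST000001', 'CUST000002']})
orders = pd.DataFrame({
    'order_id': ['ORD0000001', 'ORD0000002'],
    'customer_id': ['CUST000001', 'CUST000002'],
})

missing = orphaned_keys(orders['customer_id'], customers['customer_id'])
print(missing)  # an empty set means every order points at a real customer
```

The same call works for order_items.product_id against products.product_id; `dropna()` skips the intentionally-null customer_id values in web_events.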


STEP 1.6: Commit Phase 1 to Git (15 minutes)

```bash
# Check what files were created
git status

# Add all new files (except data/raw/*.csv which is in .gitignore)
git add .

# Commit with descriptive message
git commit -m "Phase 1 complete: Project setup and sample data generation

- Created project directory structure
- Added configuration files (requirements.txt, setup.py, .env.example)
- Generated 260,500+ records of realistic e-commerce data
- Created data generation script with 5 datasets
- Set up Python virtual environment"

# Push to GitHub
git push origin main
```

✅ CHECKPOINT

Check GitHub - you should see all files except the large CSV files (they're gitignored).


PHASE 1 COMPLETE! 🎉

What You Built:

Project Infrastructure

  • GitHub repository with the full directory structure (config, data, databricks, dbt, terraform, scripts, tests, docs)
  • Five configuration files: requirements.txt, .env.example, setup.py, README.md, and .gitignore

Development Environment

  • Python virtual environment with pinned packages for data processing, Azure, Databricks, dbt, ML, and testing

Sample Data (260,500+ records)

  • customers.csv (10,000), products.csv (500), orders.csv (50,000), order_items.csv, and web_events.csv (200,000)

Data Quality

  • Fixed random seeds for reproducibility, realistic weighted distributions, and order dates that never precede customer registration


What's Next: Phase 2

In Phase 2, you will:

  • Provision Azure resources (resource group, Data Lake Storage account, Databricks workspace)
  • Fill in the real credentials for the settings listed in .env.example
Estimated Time: 4-6 hours over Days 4-5


Troubleshooting

Issue: pip install fails with permission error
Solution: Make sure virtual environment is activated (you should see (venv) in terminal)

Issue: Data generation script runs out of memory
Solution: Reduce NUM_CUSTOMERS, NUM_ORDERS, or NUM_WEB_EVENTS in the script
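If trimming the counts isn't enough, large datasets can also be streamed to disk in batches instead of built as one giant in-memory list. A minimal sketch, assuming a generator of row dicts — `write_batches`, the batch size, and the sample filename are illustrative, not part of the project script:

```python
import csv
from pathlib import Path

def write_batches(path, rows_iter, fieldnames, batch_size=10_000):
    """Stream rows to CSV in fixed-size batches to keep memory flat."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        batch = []
        for row in rows_iter:
            batch.append(row)
            if len(batch) >= batch_size:
                writer.writerows(batch)
                batch.clear()
        writer.writerows(batch)  # flush the remainder

# Usage: a lazy generator keeps only one batch in memory at a time
rows = ({"event_id": f"EVT{i:08d}", "event_type": "page_view"} for i in range(25_000))
write_batches(Path("web_events_sample.csv"), rows, ["event_id", "event_type"])
```

The same pattern would apply to `generate_web_events` by turning its loop body into a generator and appending to the CSV per batch.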

Issue: CSV files not in data/raw/
Solution: Check that OUTPUT_DIR path is correct and directory was created


Phase 1 Manual Version 1.0
Last Updated: 2025-01-01