Duration: Days 1-3 | 6-8 hours total
Goal: Set up project structure and generate realistic sample data
In Phase 1, you will:
Go to GitHub and create a new repository named `ecommerce-data-platform`. Clone it to your local machine:
git clone https://github.com/YOUR_USERNAME/ecommerce-data-platform.git
cd ecommerce-data-platform
# Config and data directories
mkdir -p config
mkdir -p data/raw data/mock_apis data/schemas
# Databricks notebooks
mkdir -p databricks/notebooks/01_setup
mkdir -p databricks/notebooks/02_bronze_ingestion
mkdir -p databricks/notebooks/03_silver_transformations
mkdir -p databricks/notebooks/04_gold_aggregations
mkdir -p databricks/notebooks/05_data_quality
mkdir -p databricks/notebooks/06_ml_models
mkdir -p databricks/jobs
# dbt directories
mkdir -p dbt/models/bronze
mkdir -p dbt/models/silver
mkdir -p dbt/models/gold
mkdir -p dbt/tests
mkdir -p dbt/macros
# Other directories
mkdir -p terraform
mkdir -p scripts
mkdir -p tests
mkdir -p .github/workflows
mkdir -p docs
tree -L 3  # macOS/Linux; on Windows, use 'tree /F' or 'dir /s'
You should see all directories created. Run ls -la to verify.
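If `mkdir -p` isn't available (for example in a plain Windows Command Prompt), the same layout can be created with a short, cross-platform Python snippet; the directory names below are taken directly from the `mkdir` commands above:

```python
from pathlib import Path

# Same directory layout as the mkdir commands above
DIRS = [
    "config", "data/raw", "data/mock_apis", "data/schemas",
    "databricks/notebooks/01_setup",
    "databricks/notebooks/02_bronze_ingestion",
    "databricks/notebooks/03_silver_transformations",
    "databricks/notebooks/04_gold_aggregations",
    "databricks/notebooks/05_data_quality",
    "databricks/notebooks/06_ml_models",
    "databricks/jobs",
    "dbt/models/bronze", "dbt/models/silver", "dbt/models/gold",
    "dbt/tests", "dbt/macros",
    "terraform", "scripts", "tests", ".github/workflows", "docs",
]

for d in DIRS:
    Path(d).mkdir(parents=True, exist_ok=True)  # behaves like mkdir -p
```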
Create `requirements.txt` in the project root:
# Core data processing
pandas==2.0.3
numpy==1.24.3
pyarrow==12.0.1
# Azure SDK
azure-identity==1.13.0
azure-storage-blob==12.17.0
azure-keyvault-secrets==4.7.0
# Databricks
databricks-cli==0.17.7
databricks-sql-connector==2.9.3
# dbt
dbt-databricks==1.6.2
# Machine Learning
scikit-learn==1.3.0
mlflow==2.6.0
matplotlib==3.7.2
seaborn==0.12.2
# API and utilities
requests==2.31.0
python-dotenv==1.0.0
pyyaml==6.0.1
# Testing
pytest==7.4.0
pytest-cov==4.1.0
# Code quality
black==23.7.0
flake8==6.0.0
# Development
jupyter==1.0.0
ipykernel==6.25.0
Create `.env.example` in the project root:
# Azure Configuration
AZURE_SUBSCRIPTION_ID=your-subscription-id-here
AZURE_RESOURCE_GROUP=ecommerce-analytics-rg
AZURE_STORAGE_ACCOUNT=ecommercedata001
AZURE_STORAGE_KEY=your-storage-key-here
# Databricks Configuration
DATABRICKS_HOST=https://adb-xxxxx.azuredatabricks.net
DATABRICKS_TOKEN=your-databricks-personal-access-token
DATABRICKS_CLUSTER_ID=your-cluster-id-here
# Environment Settings
ENVIRONMENT=development
LOG_LEVEL=INFO
Create `setup.py` in the project root:
from setuptools import setup, find_packages
setup(
name="ecommerce-data-platform",
version="1.0.0",
description="E-commerce analytics platform with Databricks medallion architecture",
author="Your Name",
author_email="your.email@example.com",
packages=find_packages(),
install_requires=[
line.strip()
for line in open("requirements.txt").readlines()
if line.strip() and not line.startswith("#")
],
python_requires=">=3.9",
classifiers=[
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
],
)
Replace the contents of `README.md` with:
# E-Commerce Analytics Platform
Production-ready e-commerce analytics platform implementing Databricks medallion architecture.
## Architecture
- **Bronze Layer:** Raw data ingestion from CSV files
- **Silver Layer:** Cleaned and validated data transformations using dbt
- **Gold Layer:** Business-ready aggregations and analytics
## Tech Stack
- **Data Processing:** Databricks (PySpark), Delta Lake
- **Transformations:** dbt (data build tool)
- **Cloud:** Azure Data Lake Storage, Azure Data Factory
- **ML:** MLflow, scikit-learn
- **Infrastructure:** Terraform
- **CI/CD:** GitHub Actions
## Project Status
🚧 **Phase 1 Complete:** Environment setup and sample data generation
⏳ **Phase 2 In Progress:** Azure infrastructure setup
## Quick Start
See [docs/setup_guide.md](docs/setup_guide.md) for detailed setup instructions.
## License
MIT License
Add these lines to your existing `.gitignore`:
# Environment variables
.env
.env.local
# Virtual environment
venv/
env/
# Data files (don't commit large CSVs)
data/raw/*.csv
*.parquet
# Databricks
.databricks/
# dbt
dbt/target/
dbt/dbt_packages/
dbt/logs/
# Terraform
terraform/.terraform/
terraform/*.tfstate
terraform/*.tfstate.backup
All 5 configuration files created. Run `git status` to see the new files.
python3 -m venv venv
# On macOS/Linux:
source venv/bin/activate
# On Windows (Command Prompt):
venv\Scripts\activate
# On Windows (PowerShell):
venv\Scripts\Activate.ps1
You should see (venv) prefix in your terminal.
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
This will take 3-5 minutes to install all packages.
pip list | grep -E "pandas|databricks|dbt"

You should see pandas, databricks-cli, and dbt-databricks in the output.
No errors during installation; all key packages are visible in `pip list`.
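The `grep` pipe above assumes a Unix shell; a shell-agnostic alternative is to query package metadata directly from Python:

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it's absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

for name in ["pandas", "databricks-cli", "dbt-databricks", "mlflow"]:
    print(f"{name}: {installed_version(name) or 'NOT INSTALLED'}")
```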
This is the main step where you create realistic data for the project.
Create `scripts/generate_sample_data.py`:
"""
Generate realistic e-commerce sample data for analytics platform.
Creates customers, products, orders, order items, and web events.
"""
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
from pathlib import Path
# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)
# Configuration
START_DATE = datetime(2023, 1, 1)
END_DATE = datetime(2024, 12, 31)
NUM_CUSTOMERS = 10000
NUM_PRODUCTS = 500
NUM_ORDERS = 50000
NUM_WEB_EVENTS = 200000
OUTPUT_DIR = Path("data/raw")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print("=" * 70)
print("E-COMMERCE DATA GENERATION SCRIPT")
print("=" * 70)
print(f"\nConfiguration:")
print(f" Date Range: {START_DATE.date()} to {END_DATE.date()}")
print(f" Customers: {NUM_CUSTOMERS:,}")
print(f" Products: {NUM_PRODUCTS:,}")
print(f" Orders: {NUM_ORDERS:,}")
print(f" Web Events: {NUM_WEB_EVENTS:,}")
print(f" Output: {OUTPUT_DIR}\n")
# ====================
# 1. GENERATE CUSTOMERS
# ====================
def generate_customers(n=NUM_CUSTOMERS):
"""Generate customer master data"""
print(f"📊 Generating {n:,} customers...")
segments = ['Premium', 'Regular', 'Occasional', 'New']
segment_weights = [0.15, 0.35, 0.35, 0.15]
countries = ['USA', 'UK', 'Canada', 'Germany', 'France', 'Australia']
country_weights = [0.40, 0.15, 0.15, 0.12, 0.10, 0.08]
us_states = ['CA', 'NY', 'TX', 'FL', 'IL', 'PA', 'OH', 'GA', 'NC', 'MI']
customers = []
for i in range(n):
# Registration date (weighted toward recent)
days_ago = int(np.random.exponential(scale=300))
reg_date = END_DATE - timedelta(days=min(days_ago, 730))
country = np.random.choice(countries, p=country_weights)
state = np.random.choice(us_states) if country == 'USA' else None
customer = {
'customer_id': f"CUST{i+1:06d}",
'email': f"customer{i+1}@email.com",
'first_name': f"FirstName{i+1}",
'last_name': f"LastName{i+1}",
'registration_date': reg_date.strftime('%Y-%m-%d'),
'country': country,
'state': state,
'segment': np.random.choice(segments, p=segment_weights),
'marketing_opt_in': np.random.choice([True, False], p=[0.6, 0.4]),
'created_at': reg_date.strftime('%Y-%m-%d %H:%M:%S'),
'updated_at': (reg_date + timedelta(days=random.randint(0, 100))).strftime('%Y-%m-%d %H:%M:%S')
}
customers.append(customer)
df = pd.DataFrame(customers)
print(f" ✅ Generated {len(df):,} customers")
print(f" Segments: Premium={len(df[df['segment']=='Premium'])}, Regular={len(df[df['segment']=='Regular'])}")
return df
# ====================
# 2. GENERATE PRODUCTS
# ====================
def generate_products(n=NUM_PRODUCTS):
"""Generate product catalog"""
print(f"\n📦 Generating {n:,} products...")
categories = {
'Electronics': ['Smartphone', 'Laptop', 'Tablet', 'Headphones', 'Camera', 'Smart Watch'],
'Clothing': ['Shirt', 'Pants', 'Dress', 'Jacket', 'Shoes', 'Accessories'],
'Home & Garden': ['Furniture', 'Decor', 'Kitchen', 'Bedding', 'Tools'],
'Books': ['Fiction', 'Non-Fiction', 'Educational', 'Comics'],
'Sports': ['Equipment', 'Apparel', 'Supplements', 'Accessories']
}
products = []
product_id = 1
for category, subcategories in categories.items():
num_in_category = n // len(categories)
for _ in range(num_in_category):
subcategory = random.choice(subcategories)
# Price varies by category
if category == 'Electronics':
base_price = random.uniform(50, 2000)
elif category == 'Clothing':
base_price = random.uniform(20, 300)
elif category == 'Home & Garden':
base_price = random.uniform(30, 1000)
elif category == 'Books':
base_price = random.uniform(10, 50)
else: # Sports
base_price = random.uniform(25, 500)
product = {
'product_id': f"PROD{product_id:05d}",
'product_name': f"{subcategory} {product_id}",
'category': category,
'subcategory': subcategory,
'price': round(base_price, 2),
'cost': round(base_price * random.uniform(0.4, 0.7), 2),
'stock_quantity': random.randint(0, 1000),
'supplier': f"Supplier_{random.randint(1, 50)}",
'created_at': (START_DATE + timedelta(days=random.randint(0, 365))).strftime('%Y-%m-%d %H:%M:%S'),
'is_active': random.choice([True, True, True, False]) # 75% active
}
products.append(product)
product_id += 1
df = pd.DataFrame(products)
print(f" ✅ Generated {len(df):,} products")
print(f" Active: {len(df[df['is_active']==True])}, Price range: ${df['price'].min():.2f}-${df['price'].max():.2f}")
return df
# ====================
# 3. GENERATE ORDERS
# ====================
def generate_orders(customers_df, products_df, n=NUM_ORDERS):
"""Generate order transactions and order items"""
print(f"\n🛒 Generating {n:,} orders...")
statuses = ['completed', 'pending', 'cancelled', 'returned']
status_weights = [0.85, 0.05, 0.05, 0.05]
payment_methods = ['credit_card', 'debit_card', 'paypal', 'apple_pay', 'google_pay']
payment_weights = [0.45, 0.25, 0.15, 0.08, 0.07]
active_products = products_df[products_df['is_active']].copy()
orders = []
order_items = []
for order_num in range(n):
order_id = f"ORD{order_num+1:07d}"
# Select customer
customer = customers_df.sample(1).iloc[0]
customer_id = customer['customer_id']
# Order date (weighted toward recent)
days_ago = int(np.random.exponential(scale=180))
order_date = END_DATE - timedelta(days=min(days_ago, 730))
# Ensure order after customer registration
customer_reg = pd.to_datetime(customer['registration_date'])
if order_date < customer_reg:
order_date = customer_reg + timedelta(days=random.randint(1, 30))
# Number of items (1-4, weighted toward 1-2)
num_items = np.random.choice([1, 2, 3, 4], p=[0.50, 0.30, 0.15, 0.05])
# Select products
ordered_products = active_products.sample(n=num_items, replace=False)
# Calculate order totals
subtotal = 0
for _, product in ordered_products.iterrows():
quantity = random.randint(1, 3)
item_total = product['price'] * quantity
subtotal += item_total
order_items.append({
'order_id': order_id,
'product_id': product['product_id'],
'quantity': quantity,
'unit_price': product['price'],
'total_price': round(item_total, 2)
})
# Discount for premium customers
discount = 0
if customer['segment'] == 'Premium':
discount = subtotal * random.uniform(0.05, 0.15)
elif random.random() < 0.1: # 10% chance of promo
discount = subtotal * random.uniform(0.05, 0.10)
# Shipping (free over $50)
shipping_cost = 0 if subtotal > 50 else random.uniform(5, 15)
# Tax (8%)
tax = (subtotal - discount) * 0.08
total = subtotal - discount + shipping_cost + tax
orders.append({
'order_id': order_id,
'customer_id': customer_id,
'order_date': order_date.strftime('%Y-%m-%d'),
'order_timestamp': order_date.strftime('%Y-%m-%d %H:%M:%S'),
'status': np.random.choice(statuses, p=status_weights),
'payment_method': np.random.choice(payment_methods, p=payment_weights),
'subtotal': round(subtotal, 2),
'discount': round(discount, 2),
'shipping_cost': round(shipping_cost, 2),
'tax': round(tax, 2),
'total': round(total, 2),
'shipping_address': f"{random.randint(1, 9999)} Main St",
'shipping_city': f"City{random.randint(1, 100)}",
'shipping_state': customer['state'],
'shipping_country': customer['country'],
'created_at': order_date.strftime('%Y-%m-%d %H:%M:%S'),
'updated_at': (order_date + timedelta(hours=random.randint(1, 48))).strftime('%Y-%m-%d %H:%M:%S')
})
# Progress indicator
if (order_num + 1) % 10000 == 0:
print(f" Progress: {order_num + 1:,} / {n:,} orders")
orders_df = pd.DataFrame(orders)
order_items_df = pd.DataFrame(order_items)
print(f" ✅ Generated {len(orders_df):,} orders with {len(order_items_df):,} line items")
print(f" Total Revenue: ${orders_df['total'].sum():,.2f}")
print(f" Avg Order Value: ${orders_df['total'].mean():.2f}")
return orders_df, order_items_df
# ====================
# 4. GENERATE WEB EVENTS
# ====================
def generate_web_events(customers_df, products_df, n=NUM_WEB_EVENTS):
"""Generate website interaction events"""
print(f"\n🌐 Generating {n:,} web events...")
event_types = ['page_view', 'product_view', 'add_to_cart', 'remove_from_cart',
'search', 'wishlist_add', 'review_submit']
event_weights = [0.45, 0.25, 0.12, 0.03, 0.10, 0.03, 0.02]
devices = ['desktop', 'mobile', 'tablet']
device_weights = [0.45, 0.45, 0.10]
browsers = ['Chrome', 'Safari', 'Firefox', 'Edge', 'Other']
browser_weights = [0.55, 0.25, 0.10, 0.07, 0.03]
active_products = products_df[products_df['is_active']].copy()
events = []
for event_num in range(n):
# Timestamp
days_ago = int(np.random.exponential(scale=90))
event_time = END_DATE - timedelta(
days=min(days_ago, 730),
hours=random.randint(0, 23),
minutes=random.randint(0, 59),
seconds=random.randint(0, 59)
)
# 70% registered users, 30% anonymous
if random.random() < 0.7:
customer = customers_df.sample(1).iloc[0]
customer_id = customer['customer_id']
session_id = f"SESSION_{customer_id}_{event_time.strftime('%Y%m%d%H')}"
else:
customer_id = None
session_id = f"SESSION_ANON_{random.randint(1, 50000)}_{event_time.strftime('%Y%m%d%H')}"
event_type = np.random.choice(event_types, p=event_weights)
# Product for product-related events
product_id = None
if event_type in ['product_view', 'add_to_cart', 'remove_from_cart', 'wishlist_add', 'review_submit']:
product_id = active_products.sample(1).iloc[0]['product_id']
# Search query
search_query = None
if event_type == 'search':
search_terms = ['laptop', 'phone', 'shoes', 'furniture', 'book', 'camera', 'watch']
search_query = random.choice(search_terms)
event = {
'event_id': f"EVT{event_num+1:08d}",
'session_id': session_id,
'customer_id': customer_id,
'event_timestamp': event_time.strftime('%Y-%m-%d %H:%M:%S'),
'event_type': event_type,
'product_id': product_id,
'search_query': search_query,
'device_type': np.random.choice(devices, p=device_weights),
'browser': np.random.choice(browsers, p=browser_weights),
'page_url': f"/products/{random.randint(1, 100)}" if event_type == 'page_view' else None,
'referrer_url': random.choice([None, 'google.com', 'facebook.com', 'instagram.com', None, None])
}
events.append(event)
# Progress indicator
if (event_num + 1) % 50000 == 0:
print(f" Progress: {event_num + 1:,} / {n:,} events")
df = pd.DataFrame(events)
print(f" ✅ Generated {len(df):,} web events")
return df
# ====================
# EXECUTE DATA GENERATION
# ====================
print("\n" + "=" * 70)
print("STARTING DATA GENERATION")
print("=" * 70 + "\n")
# Generate all datasets
customers_df = generate_customers()
products_df = generate_products()
orders_df, order_items_df = generate_orders(customers_df, products_df)
web_events_df = generate_web_events(customers_df, products_df)
# ====================
# SAVE TO CSV FILES
# ====================
print("\n" + "=" * 70)
print("SAVING DATA TO CSV FILES")
print("=" * 70 + "\n")
customers_df.to_csv(OUTPUT_DIR / 'customers.csv', index=False)
print(f"✅ Saved customers.csv ({len(customers_df):,} rows)")
products_df.to_csv(OUTPUT_DIR / 'products.csv', index=False)
print(f"✅ Saved products.csv ({len(products_df):,} rows)")
orders_df.to_csv(OUTPUT_DIR / 'orders.csv', index=False)
print(f"✅ Saved orders.csv ({len(orders_df):,} rows)")
order_items_df.to_csv(OUTPUT_DIR / 'order_items.csv', index=False)
print(f"✅ Saved order_items.csv ({len(order_items_df):,} rows)")
web_events_df.to_csv(OUTPUT_DIR / 'web_events.csv', index=False)
print(f"✅ Saved web_events.csv ({len(web_events_df):,} rows)")
# ====================
# FINAL SUMMARY
# ====================
print("\n" + "=" * 70)
print("DATA GENERATION COMPLETE!")
print("=" * 70)
print(f"\n📁 All files saved to: {OUTPUT_DIR}")
print(f"\n📊 Dataset Summary:")
print(f" • Customers: {len(customers_df):>8,} records")
print(f" • Products: {len(products_df):>8,} records")
print(f" • Orders: {len(orders_df):>8,} records")
print(f" • Order Items: {len(order_items_df):>8,} records")
print(f" • Web Events: {len(web_events_df):>8,} records")
print(f" • TOTAL: {len(customers_df) + len(products_df) + len(orders_df) + len(order_items_df) + len(web_events_df):>8,} records")
print(f"\n💰 Total Revenue: ${orders_df['total'].sum():,.2f}")
print(f"\n🎉 Ready for Phase 2: Azure Infrastructure Setup!")
print("=" * 70 + "\n")
python scripts/generate_sample_data.py
This will take 2-3 minutes to complete. You'll see progress indicators for orders and web events.
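Because the script seeds both RNGs at the top (`np.random.seed(42)` and `random.seed(42)`), rerunning it regenerates the same data. The idea in miniature:

```python
import random

import numpy as np

def draw():
    """Reseed both RNGs, then make the same style of draws the generator makes."""
    np.random.seed(42)
    random.seed(42)
    return int(np.random.exponential(scale=300)), random.randint(1, 3)

# Two "runs" with the same seeds produce identical values
assert draw() == draw()
```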
You should see five new CSV files (customers.csv, products.csv, orders.csv, order_items.csv, web_events.csv) in the data/raw/ directory.
python3
Then run:
import pandas as pd
# Load and check each file
customers = pd.read_csv('data/raw/customers.csv')
products = pd.read_csv('data/raw/products.csv')
orders = pd.read_csv('data/raw/orders.csv')
order_items = pd.read_csv('data/raw/order_items.csv')
web_events = pd.read_csv('data/raw/web_events.csv')
print(f"Customers: {len(customers):,} rows")
print(f"Products: {len(products):,} rows")
print(f"Orders: {len(orders):,} rows")
print(f"Order Items: {len(order_items):,} rows")
print(f"Web Events: {len(web_events):,} rows")
# Check first few rows
print("\nCustomers sample:")
print(customers.head())
# Exit Python
exit()
All files load successfully, row counts match expected values.
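Beyond row counts, it's worth spot-checking referential integrity: every `order_id` in `order_items.csv` should exist in `orders.csv`, and every non-null `customer_id` in `orders.csv` should exist in `customers.csv`. A small helper for this (column and file names as generated above):

```python
import pandas as pd

def orphan_keys(child, child_key, parent, parent_key):
    """Return child rows whose key has no matching row in parent."""
    return child[~child[child_key].isin(parent[parent_key])]

# Usage against the generated files:
# orders = pd.read_csv("data/raw/orders.csv")
# order_items = pd.read_csv("data/raw/order_items.csv")
# print(len(orphan_keys(order_items, "order_id", orders, "order_id")))  # expect 0
```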
# Check what files were created
git status
# Add all new files (except data/raw/*.csv which is in .gitignore)
git add .
# Commit with descriptive message
git commit -m "Phase 1 complete: Project setup and sample data generation
- Created project directory structure
- Added configuration files (requirements.txt, setup.py, .env.example)
- Generated 260,500+ records of realistic e-commerce data
- Created data generation script with 5 datasets
- Set up Python virtual environment"
# Push to GitHub
git push origin main
Check GitHub - you should see all files except the large CSV files (they're gitignored).
✅ Project Infrastructure
✅ Development Environment
✅ Sample Data (260,500+ records)
✅ Data Quality
In Phase 2, you will:
Estimated Time: 4-6 hours over Days 4-5
Issue: `pip install` fails with a permission error
Solution: Make sure the virtual environment is activated (you should see `(venv)` in your terminal)
Issue: Data generation script runs out of memory
Solution: Reduce NUM_CUSTOMERS, NUM_ORDERS, or NUM_WEB_EVENTS in the script
Issue: CSV files not in data/raw/
Solution: Check that OUTPUT_DIR path is correct and directory was created
Phase 1 Manual Version 1.0
Last Updated: 2025-01-01