Duration: Days 4-5 | 4-6 hours total
Goal: Set up Azure cloud infrastructure for data storage and processing
In Phase 2, you will create an Azure resource group, a Data Lake Storage Gen2 account with bronze/silver/gold containers, and a Key Vault for secrets, stand up a Databricks workspace and cluster, upload the sample data to the cloud, and verify connectivity end to end.
Before starting Phase 2:
Sign up for an Azure Free Account at https://azure.microsoft.com/free (new accounts include a $200 credit).
Install Azure CLI:
macOS:
brew update
brew install azure-cli
Windows: download and run the MSI installer from https://aka.ms/installazurecliwindows, or install with winget: winget install -e --id Microsoft.AzureCLI
Linux (Ubuntu/Debian):
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az --version
You should see version information for Azure CLI.
az login
This will open a browser window for authentication. Sign in with your Azure credentials.
az account list --output table
You should see your subscription (probably named "Azure subscription 1" or similar).
az account set --subscription "YOUR_SUBSCRIPTION_ID"
Replace YOUR_SUBSCRIPTION_ID with the ID from the previous command.
az account show --output table
A resource group is a container that holds related Azure resources.
az group create \
--name ecommerce-analytics-rg \
--location eastus \
--tags Environment=Development Project=EcommerceAnalytics
Location options: eastus, westus2, westeurope, southeastasia. Choose the region closest to you for better performance.
az group list --output table
You should see "ecommerce-analytics-rg" in the list.
az group show --name ecommerce-analytics-rg --output json
Azure Data Lake Storage Gen2 will hold all of our data, organized in a medallion architecture (bronze, silver, and gold layers).
az storage account create \
--name ecommercedata001 \
--resource-group ecommerce-analytics-rg \
--location eastus \
--sku Standard_LRS \
--kind StorageV2 \
--hierarchical-namespace true \
--tags Environment=Development
Note: Storage account names must be globally unique, all lowercase, 3-24 characters. If "ecommercedata001" is taken, try "ecommercedata" + your initials + random numbers.
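If you want to script a compliant name instead of guessing, here is a small optional sketch (the prefix is just an example; swap in your own):
import secrets
import string

prefix = "ecommercedata"  # example prefix; replace with your own
suffix = "".join(secrets.choice(string.digits) for _ in range(4))
name = (prefix + suffix)[:24]
# Constraints: 3-24 characters, lowercase letters and digits only, globally unique
assert 3 <= len(name) <= 24 and name.isalnum() and name.islower()
print(name)  # e.g. ecommercedata4821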
Wait for creation to complete (takes 1-2 minutes)
Get storage account key:
az storage account keys list \
--resource-group ecommerce-analytics-rg \
--account-name ecommercedata001 \
--query '[0].value' \
--output tsv
IMPORTANT: Copy this key and save it securely. You'll need it later.
CONN_STRING=$(az storage account show-connection-string \
--name ecommercedata001 \
--resource-group ecommerce-analytics-rg \
--output tsv)
This stores the connection string in a variable for use in the next steps.
# Bronze layer (raw data)
az storage container create \
--name bronze \
--connection-string $CONN_STRING
# Silver layer (cleaned data)
az storage container create \
--name silver \
--connection-string $CONN_STRING
# Gold layer (aggregated data)
az storage container create \
--name gold \
--connection-string $CONN_STRING
# Raw landing zone
az storage container create \
--name raw-landing \
--connection-string $CONN_STRING
# Logs
az storage container create \
--name logs \
--connection-string $CONN_STRING
az storage container list \
--connection-string $CONN_STRING \
--output table
You should see 5 containers: bronze, silver, gold, raw-landing, logs
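Because the hierarchical namespace is enabled, each container is addressable as an ADLS Gen2 path. Here is a minimal sketch of how the layers map to the abfss:// URIs you will use later from Databricks (the account name and the "ecommerce" folder are illustrative):
STORAGE_ACCOUNT = "ecommercedata001"  # replace with your actual storage account name

def layer_path(container: str, relative_path: str = "") -> str:
    """Build an ADLS Gen2 URI for a container in the data lake."""
    return f"abfss://{container}@{STORAGE_ACCOUNT}.dfs.core.windows.net/{relative_path}"

BRONZE = layer_path("bronze", "ecommerce")   # raw ingested data
SILVER = layer_path("silver", "ecommerce")   # cleaned and conformed data
GOLD = layer_path("gold", "ecommerce")       # aggregated, analytics-ready data

print(BRONZE)  # abfss://bronze@ecommercedata001.dfs.core.windows.net/ecommerce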
az storage blob upload-batch \
--destination raw-landing \
--source data/raw \
--connection-string $CONN_STRING \
--pattern "*.csv"
This uploads all CSV files from your local data/raw folder to the raw-landing container.
az storage blob list \
--container-name raw-landing \
--connection-string $CONN_STRING \
--output table
You should see 5 CSV files: customers.csv, products.csv, orders.csv, order_items.csv, web_events.csv
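If you prefer to do the upload from Python instead of the CLI, here is a minimal sketch using the azure-storage-blob SDK; it assumes the same connection string (the environment variable name below is just an example) and the local data/raw folder from Phase 1:
import os
from pathlib import Path

from azure.storage.blob import BlobServiceClient

# Connection string from `az storage account show-connection-string`;
# the environment variable name here is only an example.
conn_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

blob_service_client = BlobServiceClient.from_connection_string(conn_string)
container_client = blob_service_client.get_container_client("raw-landing")

for csv_path in Path("data/raw").glob("*.csv"):
    with open(csv_path, "rb") as f:
        # overwrite=True makes the upload safe to re-run
        container_client.upload_blob(name=csv_path.name, data=f, overwrite=True)
    print(f"Uploaded {csv_path.name}")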
Key Vault securely stores secrets like storage keys and connection strings.
az keyvault create \
--name ecommerce-kv001 \
--resource-group ecommerce-analytics-rg \
--location eastus \
--enabled-for-deployment true \
--enabled-for-template-deployment true
Note: Key Vault names must be globally unique. If taken, try adding your initials.
# First, get the storage key
STORAGE_KEY=$(az storage account keys list \
--resource-group ecommerce-analytics-rg \
--account-name ecommercedata001 \
--query '[0].value' \
--output tsv)
# Then store it in Key Vault
az keyvault secret set \
--vault-name ecommerce-kv001 \
--name storage-account-key \
--value "$STORAGE_KEY"
az keyvault secret list \
--vault-name ecommerce-kv001 \
--output table
You should see "storage-account-key" in the list.
az keyvault show \
--name ecommerce-kv001 \
--resource-group ecommerce-analytics-rg \
--query properties.vaultUri \
--output tsv
Save this URI - you'll need it when configuring Databricks.
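To confirm you can read the secret back programmatically (and as a preview of how pipeline code will fetch it), here is a minimal sketch using azure-identity and azure-keyvault-secrets. It assumes both packages are installed (pip install azure-identity azure-keyvault-secrets) and that your signed-in identity has permission to get secrets from this vault:
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Vault URI from the command above; replace if your vault name differs
vault_uri = "https://ecommerce-kv001.vault.azure.net/"

# DefaultAzureCredential picks up your local `az login` session
credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_uri, credential=credential)

secret = client.get_secret("storage-account-key")
print(f"Retrieved secret '{secret.name}' ({len(secret.value)} characters)")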
az databricks workspace create \
--resource-group ecommerce-analytics-rg \
--name ecommerce-databricks \
--location eastus \
--sku premium \
--tags Environment=Development
Note: This takes 5-10 minutes to complete. The premium SKU is needed for some advanced features we'll use later.
az databricks workspace show \
--resource-group ecommerce-analytics-rg \
--name ecommerce-databricks \
--query workspaceUrl \
--output tsv
Save this URL - this is where you'll access Databricks.
pip install databricks-cli
databricks configure --token
When prompted, enter the Databricks host (the workspace URL from the previous step, prefixed with https://) and a personal access token generated in the Databricks UI under User Settings.
databricks workspace ls /
You should see a list of folders in your workspace (Users, Shared, etc.)
Create config/databricks_cluster.json in your project:
{
"cluster_name": "ecommerce-analytics-cluster",
"spark_version": "13.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"driver_node_type_id": "Standard_DS3_v2",
"autoscale": {
"min_workers": 2,
"max_workers": 4
},
"autotermination_minutes": 30,
"spark_conf": {
"spark.sql.adaptive.enabled": "true",
"spark.databricks.delta.preview.enabled": "true",
"spark.databricks.delta.optimizeWrite.enabled": "true",
"spark.databricks.delta.autoCompact.enabled": "true"
},
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"custom_tags": {
"Environment": "Development",
"Project": "EcommerceAnalytics"
}
}
databricks clusters create --json-file config/databricks_cluster.json
This returns a cluster ID. Save this ID!
databricks clusters get --cluster-id YOUR_CLUSTER_ID
Keep running this until you see "state": "RUNNING"
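If you prefer to poll from Python, here is a minimal sketch against the Databricks clusters REST API. It assumes DATABRICKS_HOST, DATABRICKS_TOKEN, and DATABRICKS_CLUSTER_ID are set in your .env (you will populate these in the next step):
import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()
host = os.environ["DATABRICKS_HOST"].rstrip("/")
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
cluster_id = os.environ["DATABRICKS_CLUSTER_ID"]

while True:
    resp = requests.get(
        f"{host}/api/2.0/clusters/get",
        headers=headers,
        params={"cluster_id": cluster_id},
    )
    resp.raise_for_status()
    state = resp.json()["state"]
    print(f"Cluster state: {state}")
    if state in ("RUNNING", "TERMINATED", "ERROR"):
        break
    time.sleep(30)  # still PENDING or RESIZING; check again in 30 seconds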
Alternative: check in the Databricks UI by opening the Compute page; the cluster should show a green, running status.
Update the .env file in your project root with all the values you collected:
# Azure Configuration
AZURE_SUBSCRIPTION_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
AZURE_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
AZURE_RESOURCE_GROUP=ecommerce-analytics-rg
# Storage Configuration
AZURE_STORAGE_ACCOUNT=ecommercedata001
AZURE_STORAGE_KEY=<your-storage-key-from-step-2.3>
AZURE_STORAGE_CONTAINER_BRONZE=bronze
AZURE_STORAGE_CONTAINER_SILVER=silver
AZURE_STORAGE_CONTAINER_GOLD=gold
AZURE_STORAGE_CONTAINER_RAW=raw-landing
# Key Vault Configuration
AZURE_KEYVAULT_NAME=ecommerce-kv001
AZURE_KEYVAULT_URI=https://ecommerce-kv001.vault.azure.net/
# Databricks Configuration
DATABRICKS_HOST=https://adb-XXXXXXXXXX.XX.azuredatabricks.net
DATABRICKS_TOKEN=<your-personal-access-token>
DATABRICKS_CLUSTER_ID=<your-cluster-id-from-step-2.6>
# Environment Settings
ENVIRONMENT=development
LOG_LEVEL=INFO
REGION=eastus
cat .gitignore | grep .env
Should show .env is ignored (so secrets don't get committed to git).
Create a test script to verify everything is connected.
scripts/test_azure_connection.py:
"""
Test Azure connectivity and verify Phase 2 setup
"""
import os
from dotenv import load_dotenv
from azure.storage.blob import BlobServiceClient
from azure.identity import DefaultAzureCredential
# Load environment variables
load_dotenv()
print("=" * 70)
print("TESTING AZURE CONNECTIVITY")
print("=" * 70)
# Test 1: Storage Account Connection
print("\n1. Testing Storage Account Connection...")
try:
storage_account = os.getenv('AZURE_STORAGE_ACCOUNT')
storage_key = os.getenv('AZURE_STORAGE_KEY')
connection_string = f"DefaultEndpointsProtocol=https;AccountName={storage_account};AccountKey={storage_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
# List containers
containers = list(blob_service_client.list_containers())
print(f" ✅ Connected to storage account: {storage_account}")
print(f" ✅ Found {len(containers)} containers:")
for container in containers:
print(f" - {container.name}")
except Exception as e:
print(f" ❌ Error: {e}")
# Test 2: Check data in raw-landing
print("\n2. Testing Data in raw-landing Container...")
try:
container_client = blob_service_client.get_container_client("raw-landing")
blobs = list(container_client.list_blobs())
print(f" ✅ Found {len(blobs)} files in raw-landing:")
for blob in blobs:
size_mb = blob.size / (1024 * 1024)
print(f" - {blob.name} ({size_mb:.2f} MB)")
except Exception as e:
print(f" ❌ Error: {e}")
# Test 3: Databricks Connection
print("\n3. Testing Databricks Connection...")
try:
databricks_host = os.getenv('DATABRICKS_HOST')
databricks_token = os.getenv('DATABRICKS_TOKEN')
if databricks_host and databricks_token:
print(f" ✅ Databricks credentials configured")
print(f" Host: {databricks_host}")
print(f" Token: {'*' * 20} (hidden)")
else:
print(" ❌ Databricks credentials not found in .env")
except Exception as e:
print(f" ❌ Error: {e}")
# Test 4: Environment Variables
print("\n4. Verifying Environment Variables...")
required_vars = [
'AZURE_SUBSCRIPTION_ID',
'AZURE_RESOURCE_GROUP',
'AZURE_STORAGE_ACCOUNT',
'AZURE_STORAGE_KEY',
'DATABRICKS_HOST',
'DATABRICKS_TOKEN',
'DATABRICKS_CLUSTER_ID'
]
all_present = True
for var in required_vars:
value = os.getenv(var)
if value:
print(f" ✅ {var}: configured")
else:
print(f" ❌ {var}: MISSING")
all_present = False
print("\n" + "=" * 70)
if all_present:
print("✅ ALL TESTS PASSED - Phase 2 setup complete!")
else:
print("❌ SOME TESTS FAILED - Check configuration")
print("=" * 70)
python scripts/test_azure_connection.py
======================================================================
TESTING AZURE CONNECTIVITY
======================================================================
1. Testing Storage Account Connection...
✅ Connected to storage account: ecommercedata001
✅ Found 5 containers:
- bronze
- silver
- gold
- raw-landing
- logs
2. Testing Data in raw-landing Container...
✅ Found 5 files in raw-landing:
- customers.csv (X.XX MB)
- products.csv (X.XX MB)
- orders.csv (X.XX MB)
- order_items.csv (X.XX MB)
- web_events.csv (X.XX MB)
3. Testing Databricks Connection...
✅ Databricks credentials configured
Host: https://adb-XXXXXXXXXX.XX.azuredatabricks.net
Token: ******************** (hidden)
4. Verifying Environment Variables...
✅ AZURE_SUBSCRIPTION_ID: configured
✅ AZURE_RESOURCE_GROUP: configured
✅ AZURE_STORAGE_ACCOUNT: configured
✅ AZURE_STORAGE_KEY: configured
✅ DATABRICKS_HOST: configured
✅ DATABRICKS_TOKEN: configured
✅ DATABRICKS_CLUSTER_ID: configured
======================================================================
✅ ALL TESTS PASSED - Phase 2 setup complete!
======================================================================
# Check status
git status
# Add new files (but not .env which is gitignored)
git add config/databricks_cluster.json
git add scripts/test_azure_connection.py
git add .env.example # Update this with new variables
# Commit
git commit -m "Phase 2 complete: Azure infrastructure setup
- Created Azure resource group and storage account
- Set up Data Lake with medallion architecture containers
- Created Databricks workspace and cluster
- Uploaded sample data to cloud
- Configured secure connections with Key Vault
- All connectivity tests passing"
# Push to GitHub
git push origin main
✅ Azure Infrastructure — resource group, storage account, Key Vault, and Databricks workspace created
✅ Storage Architecture — bronze, silver, gold, raw-landing, and logs containers in ADLS Gen2
✅ Data Upload — all five sample CSV files uploaded to the raw-landing container
✅ Databricks Environment — workspace created, CLI configured, autoscaling cluster running
✅ Security Setup — storage key stored in Key Vault, secrets kept out of git via .env
✅ Cost Optimization — 30-minute auto-termination and a small autoscale range (2-4 workers)
With free tier and $200 credit:
Total estimated cost: $20-40/month if the cluster runs 8 hours/day. With auto-termination enabled, much less.
In Phase 3, you will:
Estimated Time: 6-8 hours over Days 6-8
Issue: Storage account name already taken
Solution: Use a more unique name like "ecommercedata" + your initials + random numbers
Issue: "Premium SKU not available in region"
Solution: Try a different region (westus2, westeurope) or use Standard SKU
Issue: Cluster won't start
Solution: Check quota limits in the Azure portal; you may need to request an increase.
Issue: Connection test fails
Solution: Verify all values in .env are correct, with no extra spaces or quotes.
Issue: "Cannot upload blob"
Solution: Verify the connection string is correct and check that public network access to the storage account is enabled (or that your client IP is allowed in its firewall rules).
Phase 2 Manual Version 1.0
Last Updated: 2025-01-01