E-Commerce Analytics Platform

Phase 2: Azure Infrastructure Setup

Duration: Days 4-5 | 4-6 hours total
Goal: Set up Azure cloud infrastructure for data storage and processing


OVERVIEW

In Phase 2, you will:

  • Create an Azure account and install the Azure CLI
  • Create a resource group to hold all project resources
  • Set up Azure Data Lake Storage Gen2 with medallion-architecture containers
  • Store secrets in Azure Key Vault
  • Create a Databricks workspace and cluster
  • Configure environment variables and test end-to-end connectivity


PREREQUISITES

Before starting Phase 2:

  • Phase 1 is complete and the five sample CSV files exist in data/raw
  • You have a credit card available for Azure account verification (free tier, no charges)
  • Python 3 and pip are installed locally
  • Your project Git repository from Phase 1 is set up, including a .gitignore


STEP 2.1: Create Azure Account (30 minutes)

Actions:

  1. Sign up for Azure Free Account:

    • Visit: https://azure.microsoft.com/free
    • Click "Start free"
    • Sign in with Microsoft account (or create one)
    • Complete verification (requires a credit card, but free-tier usage won't be charged)
    • You get $200 of free credit to use within the first 30 days
  2. Install Azure CLI:

macOS:

brew update
brew install azure-cli

Windows:

winget install --exact --id Microsoft.AzureCLI

(Alternatively, download and run the MSI installer from https://aka.ms/installazurecliwindows.)

Linux (Ubuntu/Debian):

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
  3. Verify installation:
az --version

You should see version information for Azure CLI.

  4. Log in to Azure:
az login

This will open a browser window for authentication. Sign in with your Azure credentials.
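If you are working on a machine without a browser (for example over SSH), the CLI also supports a device-code flow; a minimal alternative:

az login --use-device-code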

  5. List your subscriptions:
az account list --output table

You should see your subscription (probably named "Azure subscription 1" or similar).

  6. Set the active subscription:
az account set --subscription "YOUR_SUBSCRIPTION_ID"

Replace YOUR_SUBSCRIPTION_ID with the ID from the previous command.
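If you would rather not copy the ID by hand, you can capture it into a shell variable first. A small sketch, assuming you want the first subscription in the list:

SUBSCRIPTION_ID=$(az account list --query "[0].id" --output tsv)
az account set --subscription "$SUBSCRIPTION_ID"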

  7. Verify the active subscription:
az account show --output table

✅ CHECKPOINT


STEP 2.2: Create Resource Group (15 minutes)

A resource group is a container that holds related Azure resources.

Actions:

  1. Create resource group:
az group create \
  --name ecommerce-analytics-rg \
  --location eastus \
  --tags Environment=Development Project=EcommerceAnalytics

Location options: eastus, westus2, westeurope, southeastasia. Choose the region closest to you for better performance.
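To see every region available to your subscription, you can ask the CLI directly:

az account list-locations --output table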

  2. Verify resource group creation:
az group list --output table

You should see "ecommerce-analytics-rg" in the list.

  3. Show resource group details:
az group show --name ecommerce-analytics-rg --output json

✅ CHECKPOINT


STEP 2.3: Create Data Lake Storage (45 minutes)

Azure Data Lake Storage Gen2 will hold all of our data, organized in a medallion architecture (bronze, silver, and gold layers).

Actions:

  1. Create storage account:
az storage account create \
  --name ecommercedata001 \
  --resource-group ecommerce-analytics-rg \
  --location eastus \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true \
  --tags Environment=Development

Note: Storage account names must be globally unique, all lowercase, 3-24 characters. If "ecommercedata001" is taken, try "ecommercedata" + your initials + random numbers.
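You can check whether a name is still available before attempting to create the account:

az storage account check-name --name ecommercedata001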

  2. Wait for creation to complete (takes 1-2 minutes)

  3. Get the storage account key:

az storage account keys list \
  --resource-group ecommerce-analytics-rg \
  --account-name ecommercedata001 \
  --query '[0].value' \
  --output tsv

IMPORTANT: Copy this key and save it securely. You'll need it later.

  4. Get the connection string:
CONN_STRING=$(az storage account show-connection-string \
  --name ecommercedata001 \
  --resource-group ecommerce-analytics-rg \
  --output tsv)

This stores the connection string in a shell variable for use in the next steps.
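To confirm the variable was actually populated without printing the full secret, a quick sanity check (a small sketch; adjust the messages as you like):

if [ -n "$CONN_STRING" ]; then echo "Connection string captured"; else echo "CONN_STRING is empty - re-run the previous command"; fi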

  5. Create containers for the medallion architecture:
# Bronze layer (raw data)
az storage container create \
  --name bronze \
  --connection-string $CONN_STRING

# Silver layer (cleaned data)
az storage container create \
  --name silver \
  --connection-string $CONN_STRING

# Gold layer (aggregated data)
az storage container create \
  --name gold \
  --connection-string $CONN_STRING

# Raw landing zone
az storage container create \
  --name raw-landing \
  --connection-string $CONN_STRING

# Logs
az storage container create \
  --name logs \
  --connection-string $CONN_STRING
  6. Verify the containers were created:
az storage container list \
  --connection-string $CONN_STRING \
  --output table

You should see 5 containers: bronze, silver, gold, raw-landing, logs
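Equivalently, the five az storage container create calls above can be collapsed into a shell loop; a sketch:

for c in bronze silver gold raw-landing logs; do
  az storage container create --name "$c" --connection-string "$CONN_STRING"
done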

  7. Upload sample data from Phase 1:
az storage blob upload-batch \
  --destination raw-landing \
  --source data/raw \
  --connection-string $CONN_STRING \
  --pattern "*.csv"

This uploads all CSV files from your local data/raw folder to the raw-landing container.
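If you later need to re-upload a single file rather than the whole folder, az storage blob upload handles one blob at a time; for example (customers.csv is just an illustration):

az storage blob upload \
  --container-name raw-landing \
  --file data/raw/customers.csv \
  --name customers.csv \
  --connection-string $CONN_STRING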

  8. Verify the data upload:
az storage blob list \
  --container-name raw-landing \
  --connection-string $CONN_STRING \
  --output table

You should see 5 CSV files: customers.csv, products.csv, orders.csv, order_items.csv, web_events.csv

✅ CHECKPOINT


STEP 2.4: Create Azure Key Vault (30 minutes)

Key Vault securely stores secrets like storage keys and connection strings.

Actions:

  1. Create Key Vault:
az keyvault create \
  --name ecommerce-kv001 \
  --resource-group ecommerce-analytics-rg \
  --location eastus \
  --enabled-for-deployment true \
  --enabled-for-template-deployment true

Note: Key Vault names must be globally unique. If taken, try adding your initials.

  2. Store the storage account key in Key Vault:
# First, get the storage key
STORAGE_KEY=$(az storage account keys list \
  --resource-group ecommerce-analytics-rg \
  --account-name ecommercedata001 \
  --query '[0].value' \
  --output tsv)

# Then store it in Key Vault
az keyvault secret set \
  --vault-name ecommerce-kv001 \
  --name storage-account-key \
  --value "$STORAGE_KEY"
  3. Verify the secret was stored:
az keyvault secret list \
  --vault-name ecommerce-kv001 \
  --output table

You should see "storage-account-key" in the list.
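To confirm the value round-trips correctly (and to preview how other services will read it later), you can pull the secret back out. Note that this prints the key to your terminal:

az keyvault secret show \
  --vault-name ecommerce-kv001 \
  --name storage-account-key \
  --query value \
  --output tsv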

  4. Get the Key Vault URI (needed for Databricks):
az keyvault show \
  --name ecommerce-kv001 \
  --resource-group ecommerce-analytics-rg \
  --query properties.vaultUri \
  --output tsv

Save this URI - you'll need it when configuring Databricks.

✅ CHECKPOINT


STEP 2.5: Create Databricks Workspace (1 hour)

Actions:

  1. Create Databricks workspace:
az databricks workspace create \
  --resource-group ecommerce-analytics-rg \
  --name ecommerce-databricks \
  --location eastus \
  --sku premium \
  --tags Environment=Development

Note: This takes 5-10 minutes to complete. The premium SKU is needed for some advanced features we'll use later.

  2. Get the workspace URL:
az databricks workspace show \
  --resource-group ecommerce-analytics-rg \
  --name ecommerce-databricks \
  --query workspaceUrl \
  --output tsv

Save this URL - this is where you'll access Databricks.
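As a convenience for the next step, you can capture the URL in a variable and open it directly. A sketch assuming macOS (use xdg-open on Linux or start on Windows):

WORKSPACE_URL=$(az databricks workspace show \
  --resource-group ecommerce-analytics-rg \
  --name ecommerce-databricks \
  --query workspaceUrl \
  --output tsv)
open "https://$WORKSPACE_URL"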

  3. Open the Databricks workspace in your browser:

    • Go to https://<workspace-url> (the URL from the previous step) and sign in with your Azure credentials.
  4. Generate a Personal Access Token:

    • In the workspace, click your user icon (top right) and open User Settings.
    • Under Developer > Access tokens, click "Generate new token", add a comment (e.g. "cli-access"), and set a lifetime.
    • Copy the token immediately - it is shown only once.
  5. Install the Databricks CLI:
pip install databricks-cli
  6. Configure the Databricks CLI:
databricks configure --token

When prompted:

  • Databricks Host: enter the workspace URL from step 2, prefixed with https://
  • Token: paste the personal access token you generated in step 4
  7. Verify the CLI connection:
databricks workspace ls /

You should see a list of folders in your workspace (Users, Shared, etc.)

✅ CHECKPOINT


STEP 2.6: Create Databricks Cluster (45 minutes)

Actions:

  1. Create cluster configuration file:

Create config/databricks_cluster.json in your project:

{
  "cluster_name": "ecommerce-analytics-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "driver_node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 4
  },
  "autotermination_minutes": 30,
  "spark_conf": {
    "spark.sql.adaptive.enabled": "true",
    "spark.databricks.delta.preview.enabled": "true",
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true"
  },
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "custom_tags": {
    "Environment": "Development",
    "Project": "EcommerceAnalytics"
  }
}
  2. Create the cluster using the CLI:
databricks clusters create --json-file config/databricks_cluster.json

This returns a cluster ID. Save this ID!
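If you have jq installed, an alternative to the command above captures the ID into a variable for the following steps (run one or the other, not both, or you will create two clusters):

CLUSTER_ID=$(databricks clusters create --json-file config/databricks_cluster.json | jq -r .cluster_id)
echo "Cluster ID: $CLUSTER_ID"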

  3. Wait for the cluster to start (5-10 minutes):
databricks clusters get --cluster-id YOUR_CLUSTER_ID

Keep running this until you see "state": "RUNNING"
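Instead of re-running the command by hand, a small polling loop works too. A sketch assuming jq is installed and CLUSTER_ID holds your cluster ID:

while true; do
  STATE=$(databricks clusters get --cluster-id "$CLUSTER_ID" | jq -r .state)
  echo "Cluster state: $STATE"
  [ "$STATE" = "RUNNING" ] && break
  sleep 30
done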

Alternative: Check in the Databricks UI:

  • Open your workspace URL and go to Compute in the left sidebar; the cluster shows a green circle once it is running.
  4. Verify the cluster is running:

In the Databricks UI, open Compute and confirm that "ecommerce-analytics-cluster" shows the state Running with the configured autoscale range (2-4 workers).

✅ CHECKPOINT


STEP 2.7: Update Environment Variables (15 minutes)

Update your .env file with all the values you collected:

Actions:

  1. Create .env file in project root:
# Azure Configuration
AZURE_SUBSCRIPTION_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
AZURE_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
AZURE_RESOURCE_GROUP=ecommerce-analytics-rg

# Storage Configuration
AZURE_STORAGE_ACCOUNT=ecommercedata001
AZURE_STORAGE_KEY=<your-storage-key-from-step-2.3>
AZURE_STORAGE_CONTAINER_BRONZE=bronze
AZURE_STORAGE_CONTAINER_SILVER=silver
AZURE_STORAGE_CONTAINER_GOLD=gold
AZURE_STORAGE_CONTAINER_RAW=raw-landing

# Key Vault Configuration
AZURE_KEYVAULT_NAME=ecommerce-kv001
AZURE_KEYVAULT_URI=https://ecommerce-kv001.vault.azure.net/

# Databricks Configuration
DATABRICKS_HOST=https://adb-XXXXXXXXXX.XX.azuredatabricks.net
DATABRICKS_TOKEN=<your-personal-access-token>
DATABRICKS_CLUSTER_ID=<your-cluster-id-from-step-2.6>

# Environment Settings
ENVIRONMENT=development
LOG_LEVEL=INFO
REGION=eastus
  2. Verify .env is in .gitignore:
cat .gitignore | grep .env

The output should show .env, confirming that secrets won't be committed to git.
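Step 2.9 commits an updated .env.example; one way to regenerate it from your real .env with the secret values stripped out (a sed sketch; review the result before committing):

sed -E 's/=.*/=/' .env > .env.example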

✅ CHECKPOINT


STEP 2.8: Test Azure Connection (15 minutes)

Create a test script to verify everything is connected.

Actions:

  1. Create scripts/test_azure_connection.py:
"""
Test Azure connectivity and verify Phase 2 setup
"""
import os

from dotenv import load_dotenv
from azure.storage.blob import BlobServiceClient
from azure.identity import DefaultAzureCredential

# Load environment variables
load_dotenv()

print("=" * 70)
print("TESTING AZURE CONNECTIVITY")
print("=" * 70)

# Test 1: Storage Account Connection
print("\n1. Testing Storage Account Connection...")
try:
    storage_account = os.getenv('AZURE_STORAGE_ACCOUNT')
    storage_key = os.getenv('AZURE_STORAGE_KEY')
    connection_string = f"DefaultEndpointsProtocol=https;AccountName={storage_account};AccountKey={storage_key};EndpointSuffix=core.windows.net"
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    # List containers
    containers = list(blob_service_client.list_containers())
    print(f"   ✅ Connected to storage account: {storage_account}")
    print(f"   ✅ Found {len(containers)} containers:")
    for container in containers:
        print(f"      - {container.name}")
except Exception as e:
    print(f"   ❌ Error: {e}")

# Test 2: Check data in raw-landing
print("\n2. Testing Data in raw-landing Container...")
try:
    container_client = blob_service_client.get_container_client("raw-landing")
    blobs = list(container_client.list_blobs())
    print(f"   ✅ Found {len(blobs)} files in raw-landing:")
    for blob in blobs:
        size_mb = blob.size / (1024 * 1024)
        print(f"      - {blob.name} ({size_mb:.2f} MB)")
except Exception as e:
    print(f"   ❌ Error: {e}")

# Test 3: Databricks Connection
print("\n3. Testing Databricks Connection...")
try:
    databricks_host = os.getenv('DATABRICKS_HOST')
    databricks_token = os.getenv('DATABRICKS_TOKEN')
    if databricks_host and databricks_token:
        print("   ✅ Databricks credentials configured")
        print(f"      Host: {databricks_host}")
        print(f"      Token: {'*' * 20} (hidden)")
    else:
        print("   ❌ Databricks credentials not found in .env")
except Exception as e:
    print(f"   ❌ Error: {e}")

# Test 4: Environment Variables
print("\n4. Verifying Environment Variables...")
required_vars = [
    'AZURE_SUBSCRIPTION_ID',
    'AZURE_RESOURCE_GROUP',
    'AZURE_STORAGE_ACCOUNT',
    'AZURE_STORAGE_KEY',
    'DATABRICKS_HOST',
    'DATABRICKS_TOKEN',
    'DATABRICKS_CLUSTER_ID'
]
all_present = True
for var in required_vars:
    value = os.getenv(var)
    if value:
        print(f"   ✅ {var}: configured")
    else:
        print(f"   ❌ {var}: MISSING")
        all_present = False

print("\n" + "=" * 70)
if all_present:
    print("✅ ALL TESTS PASSED - Phase 2 setup complete!")
else:
    print("❌ SOME TESTS FAILED - Check configuration")
print("=" * 70)
  2. Run the test script:
python scripts/test_azure_connection.py

Expected Output:

======================================================================
TESTING AZURE CONNECTIVITY
======================================================================

1. Testing Storage Account Connection...
   ✅ Connected to storage account: ecommercedata001
   ✅ Found 5 containers:
      - bronze
      - silver
      - gold
      - raw-landing
      - logs

2. Testing Data in raw-landing Container...
   ✅ Found 5 files in raw-landing:
      - customers.csv (X.XX MB)
      - products.csv (X.XX MB)
      - orders.csv (X.XX MB)
      - order_items.csv (X.XX MB)
      - web_events.csv (X.XX MB)

3. Testing Databricks Connection...
   ✅ Databricks credentials configured
      Host: https://adb-XXXXXXXXXX.XX.azuredatabricks.net
      Token: ******************** (hidden)

4. Verifying Environment Variables...
   ✅ AZURE_SUBSCRIPTION_ID: configured
   ✅ AZURE_RESOURCE_GROUP: configured
   ✅ AZURE_STORAGE_ACCOUNT: configured
   ✅ AZURE_STORAGE_KEY: configured
   ✅ DATABRICKS_HOST: configured
   ✅ DATABRICKS_TOKEN: configured
   ✅ DATABRICKS_CLUSTER_ID: configured

======================================================================
✅ ALL TESTS PASSED - Phase 2 setup complete!
======================================================================

✅ CHECKPOINT


STEP 2.9: Commit Phase 2 Changes (10 minutes)

# Check status
git status

# Add new files (but not .env, which is gitignored)
git add config/databricks_cluster.json
git add scripts/test_azure_connection.py
git add .env.example  # Update this with new variables

# Commit
git commit -m "Phase 2 complete: Azure infrastructure setup

- Created Azure resource group and storage account
- Set up Data Lake with medallion architecture containers
- Created Databricks workspace and cluster
- Uploaded sample data to cloud
- Configured secure connections with Key Vault
- All connectivity tests passing"

# Push to GitHub
git push origin main

✅ CHECKPOINT


PHASE 2 COMPLETE! 🎉

What You Built:

Azure Infrastructure
  • Resource group "ecommerce-analytics-rg" in eastus, tagged by environment and project

Storage Architecture
  • ADLS Gen2 storage account with hierarchical namespace and five containers (bronze, silver, gold, raw-landing, logs) implementing the medallion architecture

Data Upload
  • All five Phase 1 CSV files uploaded to the raw-landing container

Databricks Environment
  • Premium Databricks workspace and an autoscaling cluster (2-4 Standard_DS3_v2 workers) defined in config/databricks_cluster.json

Security Setup
  • Key Vault holding the storage account key, with local secrets kept in a gitignored .env file

Cost Optimization
  • Cluster auto-termination after 30 minutes of inactivity and Standard_LRS storage to keep spend low


Cost Breakdown (Approximate)

With free tier and $200 credit:

Total estimated cost: roughly $20-40/month if the cluster runs 8 hours/day. With auto-termination enabled, much less.
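The biggest saver is simply not leaving the cluster up. Besides auto-termination, you can stop it explicitly when you finish a session; with the legacy databricks-cli, delete terminates (stops) the cluster without removing its configuration, and start brings it back:

# Stop (terminate) the cluster when you are done for the day
databricks clusters delete --cluster-id YOUR_CLUSTER_ID

# Restart it at the beginning of your next session
databricks clusters start --cluster-id YOUR_CLUSTER_ID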


What's Next: Phase 3

In Phase 3, you will:

Estimated Time: 6-8 hours over Days 6-8


Troubleshooting

Issue: Storage account name already taken
Solution: Use a more unique name like "ecommercedata" + your initials + random numbers

Issue: "Premium SKU not available in region"
Solution: Try a different region (westus2, westeurope) or use Standard SKU

Issue: Cluster won't start
Solution: Check quota limits in the Azure portal; you may need to request a quota increase

Issue: Connection test fails
Solution: Verify all values in .env are correct, with no extra spaces or quotes

Issue: "Cannot upload blob"
Solution: Verify the connection string is correct and that the storage account's network/public access settings allow connections from your machine


Resources


Phase 2 Manual Version 1.0
Last Updated: 2025-01-01