Duration: Days 4-5 | 4-6 hours total
Goal: Set up Azure cloud infrastructure for data storage and processing
In Phase 2, you will create an Azure resource group, a Data Lake Storage Gen2 account with bronze/silver/gold containers, and a Key Vault for secrets, stand up a Databricks workspace and cluster, upload the sample data to the cloud, and verify connectivity end to end.
Before starting Phase 2:
Sign up for an Azure Free Account at https://azure.microsoft.com/free (new accounts include a $200 credit).
Install Azure CLI:
macOS:
brew update
brew install azure-cli
Windows: download and run the MSI installer from https://aka.ms/installazurecliwindows, or install with winget: winget install -e --id Microsoft.AzureCLI
Linux (Ubuntu/Debian):
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az --version
You should see version information for Azure CLI.
az login
This will open a browser window for authentication. Sign in with your Azure credentials.
az account list --output table
You should see your subscription (probably named "Azure subscription 1" or similar).
az account set --subscription "YOUR_SUBSCRIPTION_ID"
Replace YOUR_SUBSCRIPTION_ID with the ID from the previous command.
az account show --output table
A resource group is a container that holds related Azure resources.
az group create \
--name ecommerce-analytics-rg \
--location eastus \
--tags Environment=Development Project=EcommerceAnalytics
Location options: eastus, westus2, westeurope, southeastasia. Choose the region closest to you for better performance.
az group list --output table
You should see "ecommerce-analytics-rg" in the list.
az group show --name ecommerce-analytics-rg --output json
Azure Data Lake Storage Gen2 will hold all of our data, organized in a medallion architecture (bronze, silver, and gold layers).
az storage account create \
--name ecommercedata001 \
--resource-group ecommerce-analytics-rg \
--location eastus \
--sku Standard_LRS \
--kind StorageV2 \
--hierarchical-namespace true \
--tags Environment=Development
Note: Storage account names must be globally unique, all lowercase, 3-24 characters. If "ecommercedata001" is taken, try "ecommercedata" + your initials + random numbers.
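If you want to script a compliant name instead of guessing, here is a small optional sketch (the prefix is just an example; swap in your own):
import secrets
import string

prefix = "ecommercedata"  # example prefix; replace with your own
suffix = "".join(secrets.choice(string.digits) for _ in range(4))
name = (prefix + suffix)[:24]
# Constraints: 3-24 characters, lowercase letters and digits only, globally unique
assert 3 <= len(name) <= 24 and name.isalnum() and name.islower()
print(name)  # e.g. ecommercedata4821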
Wait for creation to complete (takes 1-2 minutes)
Get storage account key:
az storage account keys list \
--resource-group ecommerce-analytics-rg \
--account-name ecommercedata001 \
--query '[0].value' \
--output tsv
IMPORTANT: Copy this key and save it securely. You'll need it later.
CONN_STRING=$(az storage account show-connection-string \
--name ecommercedata001 \
--resource-group ecommerce-analytics-rg \
--output tsv)
This stores the connection string in a variable for use in the next steps.
# Bronze layer (raw data)
az storage container create \
--name bronze \
--connection-string $CONN_STRING
# Silver layer (cleaned data)
az storage container create \
--name silver \
--connection-string $CONN_STRING
# Gold layer (aggregated data)
az storage container create \
--name gold \
--connection-string $CONN_STRING
# Raw landing zone
az storage container create \
--name raw-landing \
--connection-string $CONN_STRING
# Logs
az storage container create \
--name logs \
--connection-string $CONN_STRING
az storage container list \
--connection-string $CONN_STRING \
--output table
You should see 5 containers: bronze, silver, gold, raw-landing, logs
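Because the hierarchical namespace is enabled, each container is addressable as an ADLS Gen2 path. Here is a minimal sketch of how the layers map to the abfss:// URIs you will use later from Databricks (the account name and the "ecommerce" folder are illustrative):
STORAGE_ACCOUNT = "ecommercedata001"  # replace with your actual storage account name

def layer_path(container: str, relative_path: str = "") -> str:
    """Build an ADLS Gen2 URI for a container in the data lake."""
    return f"abfss://{container}@{STORAGE_ACCOUNT}.dfs.core.windows.net/{relative_path}"

BRONZE = layer_path("bronze", "ecommerce")   # raw ingested data
SILVER = layer_path("silver", "ecommerce")   # cleaned and conformed data
GOLD = layer_path("gold", "ecommerce")       # aggregated, analytics-ready data

print(BRONZE)  # abfss://bronze@ecommercedata001.dfs.core.windows.net/ecommerce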
az storage blob upload-batch \
--destination raw-landing \
--source data/raw \
--connection-string $CONN_STRING \
--pattern "*.csv"
This uploads all CSV files from your local data/raw folder to the raw-landing container.
az storage blob list \
--container-name raw-landing \
--connection-string $CONN_STRING \
--output table
You should see 5 CSV files: customers.csv, products.csv, orders.csv, order_items.csv, web_events.csv
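If you prefer to do the upload from Python instead of the CLI, here is a minimal sketch using the azure-storage-blob SDK; it assumes the same connection string (the environment variable name below is just an example) and the local data/raw folder from Phase 1:
import os
from pathlib import Path

from azure.storage.blob import BlobServiceClient

# Connection string from `az storage account show-connection-string`;
# the environment variable name here is only an example.
conn_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

blob_service_client = BlobServiceClient.from_connection_string(conn_string)
container_client = blob_service_client.get_container_client("raw-landing")

for csv_path in Path("data/raw").glob("*.csv"):
    with open(csv_path, "rb") as f:
        # overwrite=True makes the upload safe to re-run
        container_client.upload_blob(name=csv_path.name, data=f, overwrite=True)
    print(f"Uploaded {csv_path.name}")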
Key Vault securely stores secrets like storage keys and connection strings.
az keyvault create \
--name ecommerce-kv001 \
--resource-group ecommerce-analytics-rg \
--location eastus \
--enabled-for-deployment true \
--enabled-for-template-deployment true
Note: Key Vault names must be globally unique. If taken, try adding your initials.
# First, get the storage key
STORAGE_KEY=$(az storage account keys list \
--resource-group ecommerce-analytics-rg \
--account-name ecommercedata001 \
--query '[0].value' \
--output tsv)
# Then store it in Key Vault
az keyvault secret set \
--vault-name ecommerce-kv001 \
--name storage-account-key \
--value "$STORAGE_KEY"
az keyvault secret list \
--vault-name ecommerce-kv001 \
--output table
You should see "storage-account-key" in the list.
az keyvault show \
--name ecommerce-kv001 \
--resource-group ecommerce-analytics-rg \
--query properties.vaultUri \
--output tsv
Save this URI - you'll need it when configuring Databricks.
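To confirm you can read the secret back programmatically (and as a preview of how pipeline code will fetch it), here is a minimal sketch using azure-identity and azure-keyvault-secrets. It assumes both packages are installed (pip install azure-identity azure-keyvault-secrets) and that your signed-in identity has permission to get secrets from this vault:
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Vault URI from the command above; replace if your vault name differs
vault_uri = "https://ecommerce-kv001.vault.azure.net/"

# DefaultAzureCredential picks up your local `az login` session
credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_uri, credential=credential)

secret = client.get_secret("storage-account-key")
print(f"Retrieved secret '{secret.name}' ({len(secret.value)} characters)")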
az databricks workspace create \
--resource-group ecommerce-analytics-rg \
--name ecommerce-databricks \
--location eastus \
--sku premium \
--tags Environment=Development
Note: This takes 5-10 minutes to complete. The premium SKU is needed for some advanced features we'll use later.
az databricks workspace show \
--resource-group ecommerce-analytics-rg \
--name ecommerce-databricks \
--query workspaceUrl \
--output tsv
Save this URL - this is where you'll access Databricks.
pip install databricks-cli
databricks configure --token
When prompted, enter the Databricks host (the workspace URL from the previous step, prefixed with https://) and a personal access token generated in the Databricks UI under User Settings.
databricks workspace ls /
You should see a list of folders in your workspace (Users, Shared, etc.)
Create config/databricks_cluster.json in your project:
{
"cluster_name": "ecommerce-analytics-cluster",
"spark_version": "13.3.x-scala2.12",
"node_type_id": "Standard_DS3_v2",
"driver_node_type_id": "Standard_DS3_v2",
"autoscale": {
"min_workers": 2,
"max_workers": 4
},
"autotermination_minutes": 30,
"spark_conf": {
"spark.sql.adaptive.enabled": "true",
"spark.databricks.delta.preview.enabled": "true",
"spark.databricks.delta.optimizeWrite.enabled": "true",
"spark.databricks.delta.autoCompact.enabled": "true"
},
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"custom_tags": {
"Environment": "Development",
"Project": "EcommerceAnalytics"
}
}
databricks clusters create --json-file config/databricks_cluster.json
This returns a cluster ID. Save this ID!
databricks clusters get --cluster-id YOUR_CLUSTER_ID
Keep running this until you see "state": "RUNNING"
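If you prefer to poll from Python, here is a minimal sketch against the Databricks clusters REST API. It assumes DATABRICKS_HOST, DATABRICKS_TOKEN, and DATABRICKS_CLUSTER_ID are set in your .env (you will populate these in the next step):
import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()
host = os.environ["DATABRICKS_HOST"].rstrip("/")
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
cluster_id = os.environ["DATABRICKS_CLUSTER_ID"]

while True:
    resp = requests.get(
        f"{host}/api/2.0/clusters/get",
        headers=headers,
        params={"cluster_id": cluster_id},
    )
    resp.raise_for_status()
    state = resp.json()["state"]
    print(f"Cluster state: {state}")
    if state in ("RUNNING", "TERMINATED", "ERROR"):
        break
    time.sleep(30)  # still PENDING or RESIZING; check again in 30 seconds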
Alternative: check in the Databricks UI by opening the Compute page; the cluster should show a green, running status.
Update the .env file in your project root with all the values you collected:
# Azure Configuration
AZURE_SUBSCRIPTION_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
AZURE_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
AZURE_RESOURCE_GROUP=ecommerce-analytics-rg
# Storage Configuration
AZURE_STORAGE_ACCOUNT=ecommercedata001
AZURE_STORAGE_KEY=<your-storage-key-from-step-2.3>
AZURE_STORAGE_CONTAINER_BRONZE=bronze
AZURE_STORAGE_CONTAINER_SILVER=silver
AZURE_STORAGE_CONTAINER_GOLD=gold
AZURE_STORAGE_CONTAINER_RAW=raw-landing
# Key Vault Configuration
AZURE_KEYVAULT_NAME=ecommerce-kv001
AZURE_KEYVAULT_URI=https://ecommerce-kv001.vault.azure.net/
# Databricks Configuration
DATABRICKS_HOST=https://adb-XXXXXXXXXX.XX.azuredatabricks.net
DATABRICKS_TOKEN=<your-personal-access-token>
DATABRICKS_CLUSTER_ID=<your-cluster-id-from-step-2.6>
# Environment Settings
ENVIRONMENT=development
LOG_LEVEL=INFO
REGION=eastus
cat .gitignore | grep .env
Should show .env is ignored (so secrets don't get committed to git).
Create a test script to verify everything is connected.
scripts/test_azure_connection.py:
"""
Test Azure connectivity and verify Phase 2 setup
"""
import os
from dotenv import load_dotenv
from azure.storage.blob import BlobServiceClient
from azure.identity import DefaultAzureCredential
# Load environment variables
load_dotenv()
print("=" * 70)
print("TESTING AZURE CONNECTIVITY")
print("=" * 70)
# Test 1: Storage Account Connection
print("\n1. Testing Storage Account Connection...")
try:
storage_account = os.getenv('AZURE_STORAGE_ACCOUNT')
storage_key = os.getenv('AZURE_STORAGE_KEY')
connection_string = f"DefaultEndpointsProtocol=https;AccountName={storage_account};AccountKey={storage_key};EndpointSuffix=core.windows.net"
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
# List containers
containers = list(blob_service_client.list_containers())
print(f" ✅ Connected to storage account: {storage_account}")
print(f" ✅ Found {len(containers)} containers:")
for container in containers:
print(f" - {container.name}")
except Exception as e:
print(f" ❌ Error: {e}")
# Test 2: Check data in raw-landing
print("\n2. Testing Data in raw-landing Container...")
try:
container_client = blob_service_client.get_container_client("raw-landing")
blobs = list(container_client.list_blobs())
print(f" ✅ Found {len(blobs)} files in raw-landing:")
for blob in blobs:
size_mb = blob.size / (1024 * 1024)
print(f" - {blob.name} ({size_mb:.2f} MB)")
except Exception as e:
print(f" ❌ Error: {e}")
# Test 3: Databricks Connection
print("\n3. Testing Databricks Connection...")
try:
databricks_host = os.getenv('DATABRICKS_HOST')
databricks_token = os.getenv('DATABRICKS_TOKEN')
if databricks_host and databricks_token:
print(f" ✅ Databricks credentials configured")
print(f" Host: {databricks_host}")
print(f" Token: {'*' * 20} (hidden)")
else:
print(" ❌ Databricks credentials not found in .env")
except Exception as e:
print(f" ❌ Error: {e}")
# Test 4: Environment Variables
print("\n4. Verifying Environment Variables...")
required_vars = [
'AZURE_SUBSCRIPTION_ID',
'AZURE_RESOURCE_GROUP',
'AZURE_STORAGE_ACCOUNT',
'AZURE_STORAGE_KEY',
'DATABRICKS_HOST',
'DATABRICKS_TOKEN',
'DATABRICKS_CLUSTER_ID'
]
all_present = True
for var in required_vars:
value = os.getenv(var)
if value:
print(f" ✅ {var}: configured")
else:
print(f" ❌ {var}: MISSING")
all_present = False
print("\n" + "=" * 70)
if all_present:
print("✅ ALL TESTS PASSED - Phase 2 setup complete!")
else:
print("❌ SOME TESTS FAILED - Check configuration")
print("=" * 70)
python scripts/test_azure_connection.py
======================================================================
TESTING AZURE CONNECTIVITY
======================================================================
1. Testing Storage Account Connection...
✅ Connected to storage account: ecommercedata001
✅ Found 5 containers:
- bronze
- silver
- gold
- raw-landing
- logs
2. Testing Data in raw-landing Container...
✅ Found 5 files in raw-landing:
- customers.csv (X.XX MB)
- products.csv (X.XX MB)
- orders.csv (X.XX MB)
- order_items.csv (X.XX MB)
- web_events.csv (X.XX MB)
3. Testing Databricks Connection...
✅ Databricks credentials configured
Host: https://adb-XXXXXXXXXX.XX.azuredatabricks.net
Token: ******************** (hidden)
4. Verifying Environment Variables...
✅ AZURE_SUBSCRIPTION_ID: configured
✅ AZURE_RESOURCE_GROUP: configured
✅ AZURE_STORAGE_ACCOUNT: configured
✅ AZURE_STORAGE_KEY: configured
✅ DATABRICKS_HOST: configured
✅ DATABRICKS_TOKEN: configured
✅ DATABRICKS_CLUSTER_ID: configured
======================================================================
✅ ALL TESTS PASSED - Phase 2 setup complete!
======================================================================
# Check status
git status
# Add new files (but not .env which is gitignored)
git add config/databricks_cluster.json
git add scripts/test_azure_connection.py
git add .env.example # Update this with new variables
# Commit
git commit -m "Phase 2 complete: Azure infrastructure setup
- Created Azure resource group and storage account
- Set up Data Lake with medallion architecture containers
- Created Databricks workspace and cluster
- Uploaded sample data to cloud
- Configured secure connections with Key Vault
- All connectivity tests passing"
# Push to GitHub
git push origin main
✅ Azure Infrastructure — resource group, storage account, Key Vault, and Databricks workspace created
✅ Storage Architecture — bronze, silver, gold, raw-landing, and logs containers in ADLS Gen2
✅ Data Upload — all five sample CSV files uploaded to the raw-landing container
✅ Databricks Environment — workspace created, CLI configured, autoscaling cluster running
✅ Security Setup — storage key stored in Key Vault, secrets kept out of git via .env
✅ Cost Optimization — 30-minute auto-termination and a small autoscale range (2-4 workers)
With free tier and $200 credit:
Total estimated cost: $20-40/month if the cluster runs 8 hours/day. With auto-termination enabled, much less.
In Phase 3, you will:
Estimated Time: 6-8 hours over Days 6-8
Issue: Storage account name already taken
Solution: Use a more unique name like "ecommercedata" + your initials + random numbers
Issue: "Premium SKU not available in region"
Solution: Try a different region (westus2, westeurope) or use Standard SKU
Issue: Cluster won't start
Solution: Check quota limits in the Azure portal; you may need to request an increase.
Issue: Connection test fails
Solution: Verify all values in .env are correct, with no extra spaces or quotes.
Issue: "Cannot upload blob"
Solution: Verify the connection string is correct and check that public network access to the storage account is enabled (or that your client IP is allowed in its firewall rules).
Phase 2 Manual Version 1.0
Last Updated: 2025-01-01