In this project, you will step into the shoes of an entry-level data analyst at an online retail company, helping interpret real-world data to help make a key business decision.
In this project, you will be working with transactional data from an online retail store. The dataset contains information about customer purchases, including product details, quantities, prices, and timestamps. Your task is to explore and analyze this dataset to gain insights into the store's sales trends, customer behavior, and popular products.
By conducting exploratory data analysis, you will identify patterns, outliers, and correlations in the data, allowing you to make data-driven decisions and recommendations to optimize the store's operations and improve customer satisfaction. Through visualizations and statistical analysis, you will uncover key trends, such as the busiest sales months, best-selling products, and the store's most valuable customers. Ultimately, this project aims to provide actionable insights that can drive strategic business decisions and enhance the store's overall performance in the competitive online retail market.
Before starting this project, you should have some basic knowledge of Python programming and Pandas. In addition, you may want to use the following packages in your Python environment:
These packages should already be installed in Coursera's Jupyter Notebook environment, however if you'd like to install additional packages that are not included in this environment or are working off platform you can install additional packages using !pip install packagename
within a notebook cell such as:
!pip install pandas
!pip install matplotlib
The dataset you will be working with is the "Online Retail" dataset. It contains transactional data of an online retail store from 2010 to 2011. The dataset is available as a .xlsx file named Online Retail.xlsx
. This data file is already included in the Coursera Jupyter Notebook environment, however if you are working off-platform it can also be downloaded here.
The dataset contains the following columns:
You may explore this dataset in any way you would like - however if you'd like some help getting started, here are a few ideas:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_excel("Online Retail.xlsx")
print(df.info())
df["CustomerID"] = df["CustomerID"].fillna(0).astype("int32")
df['revenue'] = df['Quantity'] * df['UnitPrice']
#remove outlier orders with crazy price, must be mistake. this one basket has a price of 600 instead of its usual $3
df= df[~((df['StockCode'] == 22502) & (df['UnitPrice'] > 600))]
#remove postage, manuals, and bank charges, ect -- we just want to see sales data.
df = df[~df['StockCode'].isin(['M','DOT','BANK CHARGES', 'CRUK','AMAZONFEE','POST',
'PADS','B','D','C2','gift_0001_30','gift_0001_40',
'gift_0001_50','gift_0001_20','gift_0001_10','S'])]
unique_items = df['StockCode'].nunique()
unique_customers = df["CustomerID"].nunique()
unique_countries = df['Country'].nunique()
total_units_sold = df['Quantity'].sum()
total_revenue = df['revenue'].sum()
print(f"Unique products: {unique_items}")
print(f"Unique customers: {unique_customers}")
print(f"Unique countires: {unique_countries}")
print(f"total units sold: {total_units_sold}")
print(f"total revenue: ${total_revenue}")
# Check for duplicate rows
df["DayOfWeek"] = df["InvoiceDate"].dt.day_name()
df["Month"] = df["InvoiceDate"].dt.month_name()
rev_by_day = df.groupby('DayOfWeek')['revenue'].sum()
rev_by_month = df.groupby('Month')['revenue'].sum()
# Sales by Day of the Week
plt.figure(figsize=(10, 4))
sales_by_day = rev_by_day.reindex(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
sales_by_day.plot(kind="bar", color="skyblue")
plt.title("Revenue by Day of the Week")
plt.xlabel("Day")
plt.ylabel("Total Revenue")
plt.xticks(rotation=45)
plt.show()
# Sales by Month
plt.figure(figsize=(10, 4))
sales_by_month = rev_by_month.reindex(["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"])
sales_by_month.plot(kind="bar", color="lightgreen")
plt.title("Revenue by Month")
plt.xlabel("Month")
plt.ylabel("Total Revenue")
plt.xticks(rotation=45)
plt.show()
countries = df.groupby('Country')['revenue'].sum().sort_values(ascending=False).head(5)
print(countries)
plt.figure(figsize=(12, 7))
# Plot the pie chart with labels
plt.figure(figsize=(12, 7))
countries.plot(kind="pie", autopct='%1.1f%%')
# Add title and show the chart
plt.title("Top 5 Countries by Revenue")
plt.ylabel("") # Removes default y-axis label
plt.show()
countries.plot(kind="bar")
plt.ylabel('Revenue')
plt.xlabel('Product ID')
products = df.groupby('Description')['revenue'].sum().sort_values(ascending=False).head(15)
products2 = df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(15)
productss = df.groupby('StockCode').agg({
'revenue': 'sum',
'Description': 'first', # Keeps the first description for each StockCode
'Quantity': 'sum'
}).sort_values(by='revenue', ascending=False).head(15)
print(productss)
# Plot the pie chart with labels
plt.figure(figsize=(12, 7))
products.plot(kind="pie", autopct='%1.1f%%')
# Add title and show the chart
plt.title("Top 15 Products by Revenue")
plt.ylabel("") # Removes default y-axis label
plt.show()
products.plot(kind="bar")
plt.ylabel('Revenue')
plt.xlabel('Product ID')
# Plot the pie chart with labels
plt.figure(figsize=(12, 7))
products2.plot(kind="pie", autopct='%1.1f%%')
# Add title and show the chart
plt.title("Top 15 Products by Quantity")
plt.ylabel("") # Removes default y-axis label
plt.show()
products2.plot(kind="bar")
plt.ylabel('Quantity')
plt.xlabel('Product ID')
plt.figure(figsize=(12, 5))
sns.boxplot(x=df["Quantity"])
plt.title("Outliers in Quantity")
plt.show()
plt.figure(figsize=(12, 5))
sns.boxplot(x=df["UnitPrice"])
plt.title("Outliers in Unit Price")
plt.show()
sns.boxplot(x=df["revenue"])
plt.title("Outliers in Revenue")
plt.show()
UK = df[df['Country'] == 'United Kingdom']
UK.sort_values(by='revenue', ascending = False).head(25)
#UK.sort_values(by='Quantity', ascending = True).head(25)
top_cust_rev = df.groupby('CustomerID')['revenue'].sum().sort_values(ascending=False).head(15)
top_cust_rev1 = df.groupby('CustomerID').agg({
'revenue':'sum',
'Quantity':'sum',
'Country':'first'}).sort_values(by = 'revenue',ascending = False).head(15)
print(top_cust_rev1)
plt.figure(figsize=(14, 10))
top_cust_rev.plot(kind='pie', autopct='%1.1f%%')
plt.title("Top 15 CustomerID's including all Nulls as 0")
plt.ylabel("")
tops_cust= df[~df['CustomerID'].isin([0])]
top_custs = tops_cust.groupby('CustomerID')['revenue'].sum().sort_values(ascending=False).head(15)
plt.figure(figsize=(12, 7))
top_custs.plot(kind='pie',autopct='%1.1f%%')
plt.title('Top 15 CustomerIDs Excluding all nulls')
plt.ylabel("")
#lets see where our top customers are from, and their favorite products
t_cust = tops_cust.groupby('CustomerID').agg({'revenue':'sum','Quantity':'sum','UnitPrice':'first','Country':'first', 'Description': lambda x: x.mode().iloc[0] }).sort_values(by='revenue', ascending = False).head(25)
print(t_cust)
!pip install plotly
import plotly.express as px
#revenue by country
country_revenue = df.groupby("Country")["revenue"].sum().reset_index()
fig = px.choropleth(
country_revenue,
locations="Country",
locationmode="country names",
color="revenue",
color_continuous_scale="reds",
title="Revenue by Country"
)
fig.show()
## analysis removing all customer null entries
products123 = tops_cust.groupby('Description')['revenue'].sum().sort_values(ascending=False).head(15)
products1234 = tops_cust.groupby('Description').agg({
'revenue':'sum',
'Country': 'max'}).sort_values(by ='revenue',ascending=False).head(15)
print(products1234)
# Plot the pie chart with labels
plt.figure(figsize=(12, 7))
products123.plot(kind="pie", autopct='%1.1f%%')
# Add title and show the chart
plt.title("Top 15 Products by Revenue")
plt.ylabel("") # Removes default y-axis label
plt.show()
products123.plot(kind="bar")
plt.ylabel('Revenue')
plt.xlabel('Product ID')
#we noticed prices for same products are different at times,
#we want to find out what price performed best, to see what prices we can get away with
#FIND MOST SALES PER PRICE POINT
price_performance = df.groupby(["StockCode", "UnitPrice"])["Quantity"].sum().reset_index()
best_prices = price_performance.loc[price_performance.groupby("StockCode")["Quantity"].idxmax()]
# Show first few rows
print(best_prices.head())
# Plot top 20 best-performing prices as a bar chart
plt.figure(figsize=(12, 6))
top_products = best_prices.sort_values(by="Quantity", ascending=False).head(20)
plt.bar(top_products["StockCode"].astype(str), top_products["UnitPrice"], color="skyblue")
plt.xlabel("Product StockCode")
plt.ylabel("Best Performing Price")
plt.title("Best Performing Prices for Top 20 Products")
plt.xticks(rotation=90)
plt.show()
# Find best price by revenue
price_performance = df.groupby(["StockCode", "UnitPrice"])["Quantity"].sum().reset_index()
price_performance["Revenue"] = price_performance["UnitPrice"] * price_performance["Quantity"]
best_prices = price_performance.loc[price_performance.groupby("StockCode")["Revenue"].idxmax()]
# Select top 20 revenue-generating products
top_products = best_prices.sort_values(by="Revenue", ascending=False).head(20)
# Create the bar chart
plt.figure(figsize=(13, 8))
bars = plt.bar(top_products["StockCode"].astype(str), top_products["Revenue"], color="green")
# Add labels above bars (show price instead of revenue)
for bar, price in zip(bars, top_products["UnitPrice"]):
yval = bar.get_height() # Get revenue (height of the bar)
plt.text(bar.get_x() + bar.get_width()/2, yval + (yval * 0.02), f"${price:.2f}", ha="center", va="bottom", fontsize=10)
# Formatting
plt.xlabel("Product StockCode")
plt.ylabel("Total Revenue")
plt.title("Best Performing Prices for Top 20 Products")
plt.xticks(rotation=90)
plt.show()
df['StockCode'] = df['StockCode'].astype(str)
top_item = df[df['StockCode'] == '22423']
prices = top_item.groupby('UnitPrice')['Quantity'].sum().sort_values(ascending=False)
print(prices.head(50))
ppp=df.groupby('StockCode').agg({'UnitPrice':'max','Description':'first','Quantity':'sum'}).sort_values(by = 'UnitPrice', ascending=False)
print(ppp.head(25))
df['CustomerID'] = df['CustomerID'].astype(str)
t = df[df['CustomerID'] == '14646']
tt = t.groupby('Description').agg({'Quantity':'sum','UnitPrice':'max','revenue':'sum'}).sort_values(by='Quantity',ascending=False)
print(tt.head(50))