Back to Projects
online_retail

Portfolio Project: Online Retail Exploratory Data Analysis with Python

Overview

In this project, you will step into the shoes of an entry-level data analyst at an online retail company, helping interpret real-world data to help make a key business decision.

Case Study

In this project, you will be working with transactional data from an online retail store. The dataset contains information about customer purchases, including product details, quantities, prices, and timestamps. Your task is to explore and analyze this dataset to gain insights into the store's sales trends, customer behavior, and popular products.

By conducting exploratory data analysis, you will identify patterns, outliers, and correlations in the data, allowing you to make data-driven decisions and recommendations to optimize the store's operations and improve customer satisfaction. Through visualizations and statistical analysis, you will uncover key trends, such as the busiest sales months, best-selling products, and the store's most valuable customers. Ultimately, this project aims to provide actionable insights that can drive strategic business decisions and enhance the store's overall performance in the competitive online retail market.

Prerequisites

Before starting this project, you should have some basic knowledge of Python programming and Pandas. In addition, you may want to use the following packages in your Python environment:

  • pandas
  • numpy
  • seaborn
  • matplotlib

These packages should already be installed in Coursera's Jupyter Notebook environment, however if you'd like to install additional packages that are not included in this environment or are working off platform you can install additional packages using !pip install packagename within a notebook cell such as:

  • !pip install pandas
  • !pip install matplotlib

Project Objectives

  1. Describe data to answer key questions to uncover insights
  2. Gain valuable insights that will help improve online retail performance
  3. Provide analytic insights and data-driven recommendations

Dataset

The dataset you will be working with is the "Online Retail" dataset. It contains transactional data of an online retail store from 2010 to 2011. The dataset is available as a .xlsx file named Online Retail.xlsx. This data file is already included in the Coursera Jupyter Notebook environment, however if you are working off-platform it can also be downloaded here.

The dataset contains the following columns:

  • InvoiceNo: Invoice number of the transaction
  • StockCode: Unique code of the product
  • Description: Description of the product
  • Quantity: Quantity of the product in the transaction
  • InvoiceDate: Date and time of the transaction
  • UnitPrice: Unit price of the product
  • CustomerID: Unique identifier of the customer
  • Country: Country where the transaction occurred

Tasks

You may explore this dataset in any way you would like - however if you'd like some help getting started, here are a few ideas:

  1. Load the dataset into a Pandas DataFrame and display the first few rows to get an overview of the data.
  2. Perform data cleaning by handling missing values, if any, and removing any redundant or unnecessary columns.
  3. Explore the basic statistics of the dataset, including measures of central tendency and dispersion.
  4. Perform data visualization to gain insights into the dataset. Generate appropriate plots, such as histograms, scatter plots, or bar plots, to visualize different aspects of the data.
  5. Analyze the sales trends over time. Identify the busiest months and days of the week in terms of sales.
  6. Explore the top-selling products and countries based on the quantity sold.
  7. Identify any outliers or anomalies in the dataset and discuss their potential impact on the analysis.
  8. Draw conclusions and summarize your findings from the exploratory data analysis.

Task 1: Load the Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_excel("Online Retail.xlsx")
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
None

Null customer Id's are filled with '0' for now, to be filtered later. Also will make a revenue column, and remove non-product data from our set before checking out key statistics:

In [2]:
df["CustomerID"] = df["CustomerID"].fillna(0).astype("int32")
df['revenue'] = df['Quantity'] * df['UnitPrice']

#remove outlier orders with crazy price, must be mistake. this one basket has a price of 600 instead of its usual $3
df= df[~((df['StockCode'] == 22502) & (df['UnitPrice'] > 600))]

#remove postage, manuals, and bank charges, ect -- we just want to see sales data.
df = df[~df['StockCode'].isin(['M','DOT','BANK CHARGES', 'CRUK','AMAZONFEE','POST',
                               'PADS','B','D','C2','gift_0001_30','gift_0001_40',
                               'gift_0001_50','gift_0001_20','gift_0001_10','S'])]





unique_items = df['StockCode'].nunique()
unique_customers = df["CustomerID"].nunique()
unique_countries = df['Country'].nunique()
total_units_sold = df['Quantity'].sum()
total_revenue = df['revenue'].sum()

print(f"Unique products:  {unique_items}")
print(f"Unique customers:  {unique_customers}")
print(f"Unique countires:  {unique_countries}")
print(f"total units sold:  {total_units_sold}")
print(f"total revenue:  ${total_revenue}")
# Check for duplicate rows
Unique products:  4054
Unique customers:  4364
Unique countires:  38
total units sold:  5166077
total revenue:  $9753091.930000002

Below compares sales/revenue by day-of-week to see any patterns, and check month over month growth. Saturday has no sales data, and the monthly chart indicates a strong end of year growth:

In [3]:
df["DayOfWeek"] = df["InvoiceDate"].dt.day_name()
df["Month"] = df["InvoiceDate"].dt.month_name()

rev_by_day = df.groupby('DayOfWeek')['revenue'].sum()
rev_by_month = df.groupby('Month')['revenue'].sum()


# Sales by Day of the Week
plt.figure(figsize=(10, 4))
sales_by_day = rev_by_day.reindex(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
sales_by_day.plot(kind="bar", color="skyblue")

plt.title("Revenue by Day of the Week")
plt.xlabel("Day")
plt.ylabel("Total Revenue")
plt.xticks(rotation=45)
plt.show()


# Sales by Month
plt.figure(figsize=(10, 4))
sales_by_month = rev_by_month.reindex(["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"])
sales_by_month.plot(kind="bar", color="lightgreen")

plt.title("Revenue by Month")
plt.xlabel("Month")
plt.ylabel("Total Revenue")
plt.xticks(rotation=45)
plt.show()

Which countries give the most business? The answer is UK by far. We can try to expand marketing to the other top countries:

In [4]:
countries = df.groupby('Country')['revenue'].sum().sort_values(ascending=False).head(5)
print(countries)
plt.figure(figsize=(12, 7))

# Plot the pie chart with labels
plt.figure(figsize=(12, 7))
countries.plot(kind="pie", autopct='%1.1f%%')

# Add title and show the chart
plt.title("Top 5 Countries by Revenue")
plt.ylabel("")  # Removes default y-axis label
plt.show()

countries.plot(kind="bar")
plt.ylabel('Revenue')
plt.xlabel('Product ID')
Country
United Kingdom    8.262101e+06
Netherlands       2.834795e+05
EIRE              2.596635e+05
Germany           2.008084e+05
France            1.821634e+05
Name: revenue, dtype: float64
<Figure size 864x504 with 0 Axes>
Out[4]:
Text(0.5, 0, 'Product ID')

Below shows top products by quantity and revenue. The biggest quantity is not always the biggest revenue:

In [5]:
products = df.groupby('Description')['revenue'].sum().sort_values(ascending=False).head(15)
products2 = df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(15)

productss = df.groupby('StockCode').agg({
    'revenue': 'sum',  
    'Description': 'first',  # Keeps the first description for each StockCode
    'Quantity': 'sum'
}).sort_values(by='revenue', ascending=False).head(15)

print(productss)

# Plot the pie chart with labels
plt.figure(figsize=(12, 7))
products.plot(kind="pie", autopct='%1.1f%%')
# Add title and show the chart
plt.title("Top 15 Products by Revenue")
plt.ylabel("")  # Removes default y-axis label
plt.show()

products.plot(kind="bar")
plt.ylabel('Revenue')
plt.xlabel('Product ID')


# Plot the pie chart with labels
plt.figure(figsize=(12, 7))
products2.plot(kind="pie", autopct='%1.1f%%')
# Add title and show the chart
plt.title("Top 15 Products by Quantity")
plt.ylabel("")  # Removes default y-axis label
plt.show()

products2.plot(kind="bar")
plt.ylabel('Quantity')
plt.xlabel('Product ID')
             revenue                         Description  Quantity
StockCode                                                         
22423      164762.19            REGENCY CAKESTAND 3 TIER     12980
47566       98302.98                       PARTY BUNTING     18022
85123A      97894.50  WHITE HANGING HEART T-LIGHT HOLDER     38830
85099B      92356.03             JUMBO BAG RED RETROSPOT     47363
23084       66756.59                  RABBIT NIGHT LIGHT     30646
22086       63791.94     PAPER CHAIN KIT 50'S CHRISTMAS      18902
84879       58959.73       ASSORTED COLOUR BIRD ORNAMENT     36221
79321       53768.06                       CHILLI LIGHTS     10343
22197       50987.47                SMALL POPCORN HOLDER     56450
23298       42700.02                      SPOTTY BUNTING      8534
22386       41619.66             JUMBO BAG PINK POLKADOT     21009
23203       40991.38                             mailout     20013
21137       40596.96            BLACK RECORD COVER FRAME     11616
22720       37413.44   SET OF 3 CAKE TINS PANTRY DESIGN       7286
23284       36565.39       DOORMAT KEEP CALM AND COME IN      5264
Out[5]:
Text(0.5, 0, 'Product ID')

below takes a look at the outliers in quantity, unitprice, and revenue. negative values were due to returns and canceled out when summing:

In [6]:
plt.figure(figsize=(12, 5))
sns.boxplot(x=df["Quantity"])
plt.title("Outliers in Quantity")
plt.show()

plt.figure(figsize=(12, 5))
sns.boxplot(x=df["UnitPrice"])
plt.title("Outliers in Unit Price")
plt.show()

sns.boxplot(x=df["revenue"])
plt.title("Outliers in Revenue")
plt.show()

Look below to see the country UK's top 25 orders sorted by revenue

In [7]:
UK = df[df['Country'] == 'United Kingdom']
UK.sort_values(by='revenue', ascending = False).head(25)
#UK.sort_values(by='Quantity', ascending = True).head(25)
Out[7]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country revenue DayOfWeek Month
540421 581483 23843 PAPER CRAFT , LITTLE BIRDIE 80995 2011-12-09 09:15:00 2.08 16446 United Kingdom 168469.60 Friday December
61619 541431 23166 MEDIUM CERAMIC TOP STORAGE JAR 74215 2011-01-18 10:01:00 1.04 12346 United Kingdom 77183.60 Tuesday January
348325 567423 23243 SET OF TEA COFFEE SUGAR TINS PANTRY 1412 2011-09-20 11:05:00 5.06 17450 United Kingdom 7144.72 Tuesday September
52711 540815 21108 FAIRY CAKE FLANNEL ASSORTED COLOUR 3114 2011-01-11 12:55:00 2.10 15749 United Kingdom 6539.40 Tuesday January
160546 550461 21108 FAIRY CAKE FLANNEL ASSORTED COLOUR 3114 2011-04-18 13:20:00 2.10 15749 United Kingdom 6539.40 Monday April
52709 540815 85123A WHITE HANGING HEART T-LIGHT HOLDER 1930 2011-01-11 12:55:00 2.55 15749 United Kingdom 4921.50 Tuesday January
467804 576365 22086 PAPER CHAIN KIT 50'S CHRISTMAS 688 2011-11-14 17:55:00 6.95 0 United Kingdom 4781.60 Monday November
160542 550461 85123A WHITE HANGING HEART T-LIGHT HOLDER 1930 2011-04-18 13:20:00 2.40 15749 United Kingdom 4632.00 Monday April
52772 540818 48185 DOORMAT FAIRY CAKE 670 2011-01-11 12:57:00 6.75 15749 United Kingdom 4522.50 Tuesday January
244419 558526 23173 REGENCY TEAPOT ROSES 540 2011-06-30 11:01:00 8.15 17949 United Kingdom 4401.00 Thursday June
160544 550461 48185 DOORMAT FAIRY CAKE 670 2011-04-18 13:20:00 6.35 15749 United Kingdom 4254.50 Monday April
291249 562439 84879 ASSORTED COLOUR BIRD ORNAMENT 2880 2011-08-04 18:06:00 1.45 12931 United Kingdom 4176.00 Thursday August
52134 540689 22470 HEART OF WICKER LARGE 1284 2011-01-11 08:43:00 3.21 17450 United Kingdom 4121.64 Tuesday January
533812 581115 22413 METAL SIGN TAKE IT OR LEAVE IT 1404 2011-12-07 12:20:00 2.75 15195 United Kingdom 3861.00 Wednesday December
16438 537659 21623 VINTAGE UNION JACK MEMOBOARD 600 2010-12-07 16:43:00 6.38 18102 United Kingdom 3828.00 Tuesday December
348319 567423 23113 PANTRY CHOPPING BOARD 756 2011-09-20 11:05:00 5.06 17450 United Kingdom 3825.36 Tuesday September
52710 540815 21175 GIN + TONIC DIET METAL SIGN 2000 2011-01-11 12:55:00 1.85 15749 United Kingdom 3700.00 Tuesday January
348323 567423 22722 SET OF 6 SPICE TINS PANTRY DESIGN 852 2011-09-20 11:05:00 4.25 17450 United Kingdom 3621.00 Tuesday September
447035 574941 22197 POPCORN HOLDER 1820 2011-11-07 17:42:00 1.95 0 United Kingdom 3549.00 Monday November
160545 550461 21175 GIN + TONIC DIET METAL SIGN 2000 2011-04-18 13:20:00 1.69 15749 United Kingdom 3380.00 Monday April
447040 574941 22086 PAPER CHAIN KIT 50'S CHRISTMAS 478 2011-11-07 17:42:00 6.95 0 United Kingdom 3322.10 Monday November
52771 540818 47556B TEA TIME TEA TOWELS 1300 2011-01-11 12:57:00 2.55 15749 United Kingdom 3315.00 Tuesday January
411229 572209 23556 LANDMARK FRAME COVENT GARDEN 300 2011-10-21 12:08:00 10.95 18102 United Kingdom 3285.00 Friday October
411235 572209 23557 LANDMARK FRAME BAKER STREET 300 2011-10-21 12:08:00 10.95 18102 United Kingdom 3285.00 Friday October
411234 572209 23553 LANDMARK FRAME CAMDEN TOWN 300 2011-10-21 12:08:00 10.95 18102 United Kingdom 3285.00 Friday October

Below explores the top 15 customers on a pie chart, including all of the null values previously filled with 0:

In [8]:
top_cust_rev = df.groupby('CustomerID')['revenue'].sum().sort_values(ascending=False).head(15)
top_cust_rev1 = df.groupby('CustomerID').agg({
    'revenue':'sum',
    'Quantity':'sum',
    'Country':'first'}).sort_values(by = 'revenue',ascending = False).head(15)
print(top_cust_rev1)

plt.figure(figsize=(14, 10))
top_cust_rev.plot(kind='pie', autopct='%1.1f%%')
plt.title("Top 15 CustomerID's including all Nulls as 0")
plt.ylabel("")
               revenue  Quantity         Country
CustomerID                                      
0           1506048.11    264391  United Kingdom
14646        278778.02    197132     Netherlands
18102        259657.30     64124  United Kingdom
17450        189735.53     69041  United Kingdom
14911        128882.13     77098            EIRE
12415        123638.18     77242       Australia
14156        113855.32     57013            EIRE
17511         88138.20     63014  United Kingdom
16684         65920.12     49391  United Kingdom
13694         62924.10     61808  United Kingdom
15311         59419.34     37720  United Kingdom
16029         58193.85     32203  United Kingdom
13089         57385.88     30787  United Kingdom
15061         54250.34     28638  United Kingdom
14096         53258.43     16336  United Kingdom
Out[8]:
Text(0, 0.5, '')

Filtering the null values out of the top customer chart:

In [9]:
tops_cust= df[~df['CustomerID'].isin([0])] 

top_custs = tops_cust.groupby('CustomerID')['revenue'].sum().sort_values(ascending=False).head(15)

plt.figure(figsize=(12, 7))
top_custs.plot(kind='pie',autopct='%1.1f%%')
plt.title('Top 15 CustomerIDs Excluding all nulls')
plt.ylabel("")

#lets see where our top customers are from, and their favorite products
t_cust = tops_cust.groupby('CustomerID').agg({'revenue':'sum','Quantity':'sum','UnitPrice':'first','Country':'first', 'Description': lambda x: x.mode().iloc[0] }).sort_values(by='revenue', ascending = False).head(25)
print(t_cust)
              revenue  Quantity  UnitPrice         Country  \
CustomerID                                                   
14646       278778.02    197132       0.29     Netherlands   
18102       259657.30     64124       4.78  United Kingdom   
17450       189735.53     69041       6.60  United Kingdom   
14911       128882.13     77098       9.95            EIRE   
12415       123638.18     77242       1.85       Australia   
14156       113855.32     57013       4.25            EIRE   
17511        88138.20     63014       8.50  United Kingdom   
16684        65920.12     49391       2.10  United Kingdom   
13694        62924.10     61808       1.06  United Kingdom   
15311        59419.34     37720       4.25  United Kingdom   
16029        58193.85     32203       4.95  United Kingdom   
13089        57385.88     30787      10.95  United Kingdom   
15061        54250.34     28638       0.36  United Kingdom   
14096        53258.43     16336       2.46  United Kingdom   
17949        53215.74     27572      10.95  United Kingdom   
15769        51823.72     27660       2.55  United Kingdom   
14298        50862.44     58021       1.25  United Kingdom   
14088        50415.49     12593       5.95  United Kingdom   
17841        40333.22     22850       2.10  United Kingdom   
13798        36351.42     23514       2.55  United Kingdom   
16422        34311.00     33584       1.25  United Kingdom   
12931        33544.81     23577       0.72  United Kingdom   
16013        33453.45     13942       4.25  United Kingdom   
15838        33362.28      9008       0.03  United Kingdom   
17389        31317.48      7442       2.95  United Kingdom   

                                  Description  
CustomerID                                     
14646                     SPACEBOY LUNCH BOX   
18102                 CREAM HEART CARD HOLDER  
17450              COOK WITH WINE METAL SIGN   
14911                REGENCY CAKESTAND 3 TIER  
12415                    DOLLY GIRL LUNCH BOX  
14156                REGENCY CAKESTAND 3 TIER  
17511                 JUMBO BAG RED RETROSPOT  
16684         GREEN REGENCY TEACUP AND SAUCER  
13694                        JUMBO BAG APPLES  
15311               EDWARDIAN PARASOL NATURAL  
16029                 JUMBO BAG RED RETROSPOT  
13089           SILVER HANGING T-LIGHT HOLDER  
15061         CLOTHES PEGS RETROSPOT PACK 24   
14096       SET OF 3 WOODEN HEART DECORATIONS  
17949                          POPCORN HOLDER  
15769                          JUMBO BAG OWLS  
14298                 FENG SHUI PILLAR CANDLE  
14088        ANTIQUE GLASS DRESSING TABLE POT  
17841                           CHILLI LIGHTS  
13798                    HOT BATHS METAL SIGN  
16422                 PAPER BUNTING RETROSPOT  
12931           RED TOADSTOOL LED NIGHT LIGHT  
16013                           CHILLI LIGHTS  
15838                 JUMBO BAG RED RETROSPOT  
17389             CREAM SWEETHEART MINI CHEST  

Lets see a interactive map of the countries we do business with:

In [10]:
!pip install plotly
import plotly.express as px

#revenue by country
country_revenue = df.groupby("Country")["revenue"].sum().reset_index()

fig = px.choropleth(
    country_revenue,
    locations="Country",
    locationmode="country names",
    color="revenue",
    color_continuous_scale="reds",
    title="Revenue by Country"
)


fig.show()
Collecting plotly
  Downloading plotly-5.18.0-py3-none-any.whl (15.6 MB)
     |████████████████████████████████| 15.6 MB 30.8 MB/s            
Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from plotly) (20.1)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.2.3-py3-none-any.whl (24 kB)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from packaging->plotly) (1.14.0)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging->plotly) (2.4.7)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.18.0 tenacity-8.2.3
WARNING: You are using pip version 21.3.1; however, version 24.0 is available.
You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.

Below we take a look at our best selling products, as well as the products which bring in the most revenue:

In [11]:
## analysis removing all customer null entries


products123 = tops_cust.groupby('Description')['revenue'].sum().sort_values(ascending=False).head(15)

products1234 = tops_cust.groupby('Description').agg({
    'revenue':'sum',
    'Country': 'max'}).sort_values(by ='revenue',ascending=False).head(15)
print(products1234)

# Plot the pie chart with labels
plt.figure(figsize=(12, 7))
products123.plot(kind="pie", autopct='%1.1f%%')
# Add title and show the chart
plt.title("Top 15 Products by Revenue")
plt.ylabel("")  # Removes default y-axis label
plt.show()

products123.plot(kind="bar")
plt.ylabel('Revenue')
plt.xlabel('Product ID')
                                      revenue         Country
Description                                                  
REGENCY CAKESTAND 3 TIER            132870.40     Unspecified
WHITE HANGING HEART T-LIGHT HOLDER   93823.85  United Kingdom
JUMBO BAG RED RETROSPOT              83236.76  United Kingdom
PARTY BUNTING                        67687.53  United Kingdom
ASSORTED COLOUR BIRD ORNAMENT        56499.22     Unspecified
RABBIT NIGHT LIGHT                   51137.80  United Kingdom
CHILLI LIGHTS                        45936.81  United Kingdom
PAPER CHAIN KIT 50'S CHRISTMAS       41500.48  United Kingdom
BLACK RECORD COVER FRAME             39009.38  United Kingdom
JUMBO BAG PINK POLKADOT              36473.01  United Kingdom
SPOTTY BUNTING                       35056.44  United Kingdom
DOORMAT KEEP CALM AND COME IN        34312.60     Unspecified
WOOD BLACK BOARD ANT WHITE FINISH    34307.06  United Kingdom
SET OF 3 CAKE TINS PANTRY DESIGN     32607.80  United Kingdom
JAM MAKING SET WITH JARS             31658.47     Unspecified
Out[11]:
Text(0.5, 0, 'Product ID')

Notice, the stock items have different pricepoints throughout the year. so it would be a good idea to see which price points proved to generate the most revenue. We need more data, such as profit, to really see the real optimal price. the below graph shows the top 20 products and their optimal pricing based on quantity sold: (see revenue graph in the next cell)

In [12]:
#we noticed prices for same products are different at times, 
#we want to find out what price performed best, to see what prices we can get away with

#FIND MOST SALES PER PRICE POINT
price_performance = df.groupby(["StockCode", "UnitPrice"])["Quantity"].sum().reset_index()
best_prices = price_performance.loc[price_performance.groupby("StockCode")["Quantity"].idxmax()]

# Show first few rows
print(best_prices.head())

# Plot top 20 best-performing prices as a bar chart
plt.figure(figsize=(12, 6))
top_products = best_prices.sort_values(by="Quantity", ascending=False).head(20)
plt.bar(top_products["StockCode"].astype(str), top_products["UnitPrice"], color="skyblue")

plt.xlabel("Product StockCode")
plt.ylabel("Best Performing Price")
plt.title("Best Performing Prices for Top 20 Products")
plt.xticks(rotation=90)
plt.show()
   StockCode  UnitPrice  Quantity
1      10002       0.85       824
5      10080       0.39       301
7      10120       0.21       193
11     10125       0.85       966
15     10133       0.42      2046

Below is the graph filtering the different prices of our top 20 products BY REVENUE. It goes through sales at each pricepoint and finds the pricepoint that led to the greatest revenue for that item. the X axis is the total revenue generated at the optimal pricepoint in the data label.

In [13]:
# Find best price by revenue
price_performance = df.groupby(["StockCode", "UnitPrice"])["Quantity"].sum().reset_index()
price_performance["Revenue"] = price_performance["UnitPrice"] * price_performance["Quantity"]
best_prices = price_performance.loc[price_performance.groupby("StockCode")["Revenue"].idxmax()]

# Select top 20 revenue-generating products
top_products = best_prices.sort_values(by="Revenue", ascending=False).head(20)

# Create the bar chart
plt.figure(figsize=(13, 8))
bars = plt.bar(top_products["StockCode"].astype(str), top_products["Revenue"], color="green")

# Add labels above bars (show price instead of revenue)
for bar, price in zip(bars, top_products["UnitPrice"]):
    yval = bar.get_height()  # Get revenue (height of the bar)
    plt.text(bar.get_x() + bar.get_width()/2, yval + (yval * 0.02), f"${price:.2f}", ha="center", va="bottom", fontsize=10)

# Formatting
plt.xlabel("Product StockCode")
plt.ylabel("Total Revenue")
plt.title("Best Performing Prices for Top 20 Products")
plt.xticks(rotation=90)
plt.show()

curious on that item (22423 stockcode) that produced the most revenue:

In [14]:
df['StockCode'] = df['StockCode'].astype(str)
top_item = df[df['StockCode'] == '22423']

prices = top_item.groupby('UnitPrice')['Quantity'].sum().sort_values(ascending=False)
print(prices.head(50))
UnitPrice
10.95    7846
12.75    3796
24.96     783
25.49     287
12.72     251
4.00       30
5.00       20
9.86        5
32.04       4
0.00      -42
Name: Quantity, dtype: int64
In [ ]:
 

Maximum unit prices for items

In [15]:
ppp=df.groupby('StockCode').agg({'UnitPrice':'max','Description':'first','Quantity':'sum'}).sort_values(by = 'UnitPrice', ascending=False)
print(ppp.head(25))
           UnitPrice                          Description  Quantity
StockCode                                                          
22655         295.00          VINTAGE RED KITCHEN CABINET        39
22656         295.00         VINTAGE BLUE KITCHEN CABINET        18
22826         195.00        LOVE SEAT ANTIQUE WHITE METAL        54
22828         165.00         REGENCY MIRROR WITH SHUTTERS        10
22827         165.00   RUSTIC  SEVENTEEN DRAWER SIDEBOARD        28
22823         125.00        CHEST NATURAL WOOD 20 DRAWERS        24
22833         100.00          HALL CABINET WITH 3 DRAWERS        76
23064          99.96               CINDERELLA CHANDELIER        142
21473          99.96    SWEETHEART CREAM STEEL TABLE RECT        27
22016          83.33  Dotcomgiftshop Gift Voucher £100.00         1
22803          82.50             IVORY EMBROIDERED QUILT         54
21769          79.95          VINTAGE POST OFFICE CABINET         6
21686          78.29           MEDINA STAMPED METAL STOOL        36
84078A         76.55     SET/4 WHITE RETRO STORAGE CUBES        380
22929          65.00               SCHOOL DESK AND CHAIR         16
84632          59.95     DECORATIVE HANGING SHELVING UNIT         5
22503          59.53            CABIN BAG VINTAGE PAISLEY        80
22504          58.29          CABIN BAG VINTAGE RETROSPOT       242
21476          58.29   STEEL SWEETHEART ROUND TABLE CREAM        41
84616          58.29            SILVER ROCCOCO CHANDELIER        11
17003          57.60                  BROCADE RING PURSE      23053
22769          51.02         CHALKBOARD KITCHEN ORGANISER        -5
22764          49.96   RUSTIC WOODEN CABINET, GLASS DOORS         7
23485          49.96        BOTANICAL GARDENS WALL CLOCK        244
84963B         49.95          BLUE PAINTED KASHMIRI CHAIR         3

Below I was curious about a top customer's (14646 on customer pie chart) favorite items:

In [16]:
df['CustomerID'] = df['CustomerID'].astype(str)
t = df[df['CustomerID'] == '14646']
tt = t.groupby('Description').agg({'Quantity':'sum','UnitPrice':'max','revenue':'sum'}).sort_values(by='Quantity',ascending=False)
print(tt.head(50))
                                     Quantity  UnitPrice  revenue
Description                                                      
RABBIT NIGHT LIGHT                       4801       2.08  9568.48
SPACEBOY LUNCH BOX                       4492       1.95  7415.40
PACK OF 72 RETROSPOT CAKE CASES          4104       0.55  1726.80
DOLLY GIRL LUNCH BOX                     4096       1.65  6758.40
ROUND SNACK BOXES SET OF4 WOODLAND       3120       2.55  7956.00
RED TOADSTOOL LED NIGHT LIGHT            2376       1.65  3459.60
WOODLAND CHARLOTTE BAG                   2300       0.72  1656.00
RED RETROSPOT CHARLOTTE BAG              2100       0.72  1512.00
5 HOOK HANGER RED MAGIC TOADSTOOL        2016       1.45  2923.20
JUMBO BAG RED RETROSPOT                  2000       1.79  3468.00
SPACEBOY BIRTHDAY CARD                   1872       0.36   673.92
GINGHAM HEART DECORATION                 1872       0.72  1347.84
MINI PAINT SET VINTAGE                   1728       0.55   950.40
PLASTERS IN TIN SPACEBOY                 1632       1.45  2366.40
ROUND SNACK BOXES SET OF 4 FRUITS        1584       2.55  4039.20
PLASTERS IN TIN WOODLAND ANIMALS         1536       1.45  2227.20
JUMBO BAG WOODLAND ANIMALS               1500       1.79  2629.00
PACK OF 60 SPACEBOY CAKE CASES           1440       0.42   604.80
CHARLOTTE BAG PINK POLKADOT              1400       0.72  1008.00
PACK OF 60 MUSHROOM CAKE CASES           1320       0.42   554.40
CHARLOTTE BAG APPLES DESIGN              1301       0.85   936.85
LUNCH BAG WOODLAND                       1300       1.45  1885.00
LUNCH BAG RED RETROSPOT                  1300       1.45  1885.00
CARD DOLLY GIRL                          1296       0.36   466.56
SPACEBOY BEAKER                          1201       1.25  1297.25
 DOLLY GIRL BEAKER                       1201       1.25  1297.25
60 CAKE CASES DOLLY GIRL DESIGN          1200       0.42   504.00
LUNCH BAG SPACEBOY DESIGN                1200       1.45  1740.00
FOLKART ZINC HEART CHRISTMAS DEC         1152       0.72   829.44
CHARLOTTE BAG DOLLY GIRL DESIGN          1100       0.72   792.00
GUMBALL COAT RACK                        1044       2.10  2192.40
PACK OF 12 TRADITIONAL CRAYONS           1008       0.36   362.88
FOOD CONTAINER SET 3 LOVE HEART          1008       1.65  1663.20
FILIGRIS HEART WITH BUTTERFLY            1008       1.25  1095.84
CHARLOTTE BAG SUKI DESIGN                1002       0.85   721.70
CHARLOTTE BAG VINTAGE ALPHABET           1001       0.85   720.85
STRAWBERRY LUNCH BOX WITH CUTLERY         967       2.55  2033.85
PACK OF 20 NAPKINS RED APPLES             960       0.72   668.16
60 TEATIME FAIRY CAKE CASES               960       0.42   403.20
72 SWEETHEART FAIRY CAKE CASES            960       0.42   403.20
SET/10 RED POLKADOT PARTY CANDLES         912       1.25   975.84
SET/10 BLUE POLKADOT PARTY CANDLES        864       1.06   915.84
SET/10 PINK POLKADOT PARTY CANDLES        864       1.06   915.84
PLASTERS IN TIN VINTAGE PAISLEY           864       1.45  1252.80
10 COLOUR SPACEBOY PEN                    864       0.72   622.08
PACK OF 60 DINOSAUR CAKE CASES            840       0.42   352.80
LUNCH BAG PINK POLKADOT                   820       1.65  1193.00
LUNCH BAG APPLE DESIGN                    801       1.65  1161.65
DOUGHNUT LIP GLOSS                        800       1.04   832.00
SPACEBOY CHILDRENS BOWL                   793       1.25   847.25
In [ ]:
 
Back to Projects