Final Project for Fundamentals of Data Visualization¶
Project Outline:¶
- A brief recap of data, goals, and tasks
- Data cleansing
- Visualization implementation
- Summary of key elements of my design with justification
- A discussion of evaluation approach, test subjects, and evaluation results
- A synthesis of my findings
For the sake of grading, I run the visualizations above the data cleansing section in this final report.
Dataset Recap: NHL Goalies¶
A brief recap of your data, goals, and tasks, focusing on those that most directly influence your design
The dataset I chose covers NHL goalies from 10/4/2021 to 1/18/2025 and was exported as a CSV file from sportsreference.com. It consists of about 25-30 game records per goalie (90 goalies in total) with the fields player, date, age, team, loc, opponent, final score, win/loss, minutes played, goals allowed (GA), saves, shots, save percentage (SV%), and goals against average (GAA), where GAA = GA * 60 / minutes played. (NOTE: HIGHER GAA = BAD)
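As a quick check of the two derived fields, the sketch below recomputes GAA and SV% for the first sample game shown later in the notebook (3 goals allowed, 25 saves on 28 shots, `MIN` of `57:08:00`). The function names are hypothetical, and reading `57:08:00` as 57 minutes and 8 seconds is an assumption, although it does reproduce the listed GAA.

```python
def goals_against_average(goals_allowed, minutes, seconds=0):
    """GAA = GA * 60 / minutes played (higher is worse)."""
    minutes_played = minutes + seconds / 60
    if minutes_played <= 0:
        raise ValueError("minutes played must be positive")
    return goals_allowed * 60 / minutes_played


def save_percentage(saves, shots):
    """SV% = saves / shots; undefined (None) when no shots were faced."""
    if shots == 0:
        return None
    return saves / shots


# First sample row: GA=3, SV=25, Shots=28, MIN read as 57 min 8 s (assumption)
gaa = goals_against_average(3, minutes=57, seconds=8)
sv = save_percentage(25, 28)
print(round(gaa, 2))  # matches the 3.15 listed for that game
print(round(sv, 3))   # matches the 0.893 listed for that game
```

Returning `None` for zero shots is one possible convention for a game in which the goalie faced no shots; the source data may handle that case differently.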
Goals¶
- compare goalie GAA trends over time,
- compare goalie SV% trends over time,
- rank goalies on overall save % (which goalie is best),
- rank goalies on overall GAA (which goalie is worst),
- see how GAA affects SV%,
- see if total minutes played has an effect on SV%.
Tasks¶
These goals lead to the core tasks: creating time-series trend charts, a snapshot ranking of goalies by average GAA and SV%, a scatterplot of GAA vs SV%, and a scatterplot of minutes played vs SV%.
Note¶
To make the graphs cleaner (and the overall analysis easier) for my friends, classmates, and family, I filtered the dataset to the top 10 goalies by average SV% among goalies with at least 20 recorded games.
Visualization Implementation¶
Screenshots of and/or a link to your visualization implementation
import warnings
from altair.utils.deprecation import AltairDeprecationWarning
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=AltairDeprecationWarning)
stack.properties(title="Click on any legend to filter all visuals by player")
Visualization Summary¶
A summary of the key elements of your design and accompanying justification
This design was created using Altair. Each graph is interactive: you can zoom in, filter by player across all visuals (click a name in the legend), and hover your mouse over a mark to read its tooltip.
It contains two bar charts, two scatter plots, and two time-series charts.
Justifications¶
Because the bar charts are zoomed in on the x-axes, you can see at a glance which goalies have the highest average SV% and GAA. You can also hover your mouse over a bar to see the exact numbers and which team the goalie belongs to.
The time series charts in the middle may seem messy at first, but if you filter by clicking a player on the legend, you can clearly get a view of which goalies post a flat, low‐variance curve (steadily strong) and which ones bounce up and down (inconsistent). A “noisy” line suggests susceptibility to hot or cold streaks.
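The "flat versus noisy" reading can also be put into numbers. The sketch below is one possible way to do that, using the per-goalie standard deviation of game-level SV% as a consistency measure. The goalie names and values are made up for illustration, and `consistency_summary` is a hypothetical helper, not part of the notebook.

```python
from statistics import mean, stdev


def consistency_summary(sv_by_goalie):
    """Return {goalie: (mean SV%, std dev of SV%)}, most consistent first."""
    summary = {
        name: (mean(values), stdev(values))
        for name, values in sv_by_goalie.items()
    }
    # A smaller standard deviation means a flatter, lower-variance curve
    return dict(sorted(summary.items(), key=lambda item: item[1][1]))


# Made-up game-level SV% values, for illustration only
sample = {
    "Goalie B": [0.960, 0.850, 0.940, 0.870],  # bounces up and down
    "Goalie A": [0.920, 0.915, 0.925, 0.918],  # steady
}

for name, (avg, spread) in consistency_summary(sample).items():
    print(f"{name}: mean SV% = {avg:.3f}, std dev = {spread:.3f}")
```

With the real data, the input could be built from the top-10 table, for example by grouping the game rows by `Player` and collecting the `SV%` column.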
The GAA vs SV% scatter shows the close inverse relationship between GAA and SV%: games with a higher GAA generally have a lower SV%.
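One way to quantify that relationship is a Pearson correlation between game-level GAA and SV%. The sketch below uses only the three Adin Hill games printed by `df.head(3)` in the data cleansing section, so the resulting number is illustrative rather than a finding; `pearson_r` is a hypothetical helper written out by hand.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# GAA and SV% for the three sample games shown by df.head(3)
gaa = [3.15, 1.00, 3.08]
sv_pct = [0.893, 0.938, 0.850]

r = pearson_r(gaa, sv_pct)
print(f"r = {r:.2f}")  # negative: higher GAA goes with lower SV%
```

On the full top-10 table, the same helper could be applied to the `GAA` and `SV%` columns after dropping rows with a missing `SV%`.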
The Minutes vs SV% scatter shows the relationship between minutes played and SV%, to evaluate whether more minutes lead to worse performance. However, it seems that all of these goalies (for the most part) play the full 60 minutes either way, since they are starters.
Final Evaluation¶
Conduct your evaluation based on the plan outlined in your Module 3 discussion post, making sure to conduct your evaluation with at least three people. Include a discussion of your final evaluation approach, including the procedure, people recruited, and results. Note that, due to the difficulty of recruiting experts, you can use colleagues, friends, classmates, or family to evaluate your designs if experts or others from your target population are unavailable.
I recruited 3 hockey-savvy friends for a Think-Aloud evaluation approach over a video conference. I got together with them on a video call, briefed them on the data, let them perform the tasks given as I watched with little input, then I evaluated their metrics and recorded their results.
Briefing:¶
Given these visualizations on this goalie data (the legend is clickable and mouse hoverable to view tooltip data) -- your goal is to complete these 6 tasks:
- Find the top 3 goalies by SV%.
- Find the top 3 goalies by GAA, are they the same? (NOTE: HIGHER GAA = BAD)
- Which team does the goalie with the highest SV% belong to?
- Which team does the goalie with the worst GAA belong to?
- What date did the highest GAA occur and which goalie allowed it?
- Locate any game where a goalie played less than 30 minutes, and had below a 0.8 SV%. What is the date?
Metrics Evaluated:¶
These metrics give me an overview of how well my visualizations work for someone exploring them for the first time, and an idea of whether I need any UI changes.
- Accuracy of answers
- Time to Complete Each Task
- Think-aloud notes on confusing elements
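To score accuracy, participants' answers need a reference answer taken directly from the data. The sketch below shows one way the reference answers for tasks 5 and 6 could be computed from game records. The records are made up for illustration, the helper names are hypothetical, and the keys mirror the notebook's `minutes`, `SV%`, and `GAA` columns.

```python
def highest_gaa_game(records):
    """Task 5: the single game with the highest GAA."""
    return max(records, key=lambda r: r["GAA"])


def short_low_sv_games(records, max_minutes=30, max_sv=0.8):
    """Task 6: games with fewer than max_minutes played and SV% below max_sv."""
    return [
        r for r in records
        if r["minutes"] < max_minutes
        and r["SV%"] is not None
        and r["SV%"] < max_sv
    ]


# Made-up game records, for illustration only
games = [
    {"Player": "Goalie A", "Date": "2024-11-02", "minutes": 60, "SV%": 0.931, "GAA": 2.00},
    {"Player": "Goalie B", "Date": "2024-11-05", "minutes": 18, "SV%": 0.750, "GAA": 10.00},
    {"Player": "Goalie A", "Date": "2024-11-09", "minutes": 59, "SV%": 0.880, "GAA": 3.05},
]

worst = highest_gaa_game(games)
print(worst["Player"], worst["Date"])
print([g["Date"] for g in short_low_sv_games(games)])
```

With the notebook's data, the records could come from something like `df_top10.to_dict("records")`.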
Results:¶
- Friend 1: Answered tasks 1-3 correctly and very quickly using the bar charts. Misinterpreted task 4 (also answering quickly) and gave me the lowest GAA, not the WORST GAA, which would be the highest. For task 5, he knew to use the GAA time-series graph because the task asked for the date, quickly hovered over the highest GAA value, and got the correct answer. For task 6, he initially looked at the time-series graphs as he did in task 5 (since it asked for the date), but quickly realized that the answer would come from the minutes graph, and got the correct response.
- Friend 2: Answered tasks 1-2 correctly and very quickly using the bar charts. Hesitated on tasks 3-4, but found the answers through the tooltip after a few minutes. Completed task 5 very quickly, going to the correct graph, and stated verbally that he knew to use the tooltip after his experience with the previous questions. Task 6 was straightforward and quick.
- Friend 3: Answered tasks 1-4 correctly and very quickly using the bar charts. Struggled on task 5: he was looking for the date in the GAA bar chart tooltip, but it was not there. After moving on to the GAA time-series graph, he quickly located the highest value and read the tooltip successfully. For task 6, he knew which chart to look at right away, but initially struggled to pick a point on the graph because the "30" mark on the x-axis was not showing. Once he zoomed in and saw the vertical line for 30 minutes, the choice was clear.
Synthesis & Next Steps¶
A synthesis of your findings, including what elements of your approach worked well and what elements you would refine in future iterations.
What Worked¶
The bar charts seemed very intuitive to read, and the tooltips were very valuable in helping participants get the correct answers. Since they got (mostly) correct answers quickly and with little verbal frustration, I would consider my visualizations a success based on these criteria.
What to Refine¶
I would add a better graph for time-series analysis, with date filters and better controls for choosing a goalie. The legend filtering works, but the legend is very small and annoying to click. For this type of data and the tasks we were completing, I feel that a dedicated dashboarding tool would be easier to implement than Altair or Streamlit.
import pandas as pd
import altair as alt
df = pd.read_csv("nhlgoalies.csv")
df.head(3)
| | Player | Date | Age | Team | Loc | Opp | Result | DEC | MIN | GA | SV | Shots | SV% | GAA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adin Hill | 1/17/2025 | 28-251 | VEG | @ | CAR | L 2-3 | L | 57:08:00 | 3 | 25 | 28 | 0.893 | 3.15 |
| 1 | Adin Hill | 1/12/2025 | 28-246 | VEG | NaN | MIN | W 4-1 | W | 59:46:00 | 1 | 15 | 16 | 0.938 | 1.00 |
| 2 | Adin Hill | 1/9/2025 | 28-243 | VEG | NaN | NYI | L 0-4 | L | 58:30:00 | 3 | 17 | 20 | 0.850 | 3.08 |
df['Date'] = pd.to_datetime(df['Date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1544 entries, 0 to 1543
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Player  1544 non-null   object
 1   Date    1544 non-null   datetime64[ns]
 2   Age     1544 non-null   object
 3   Team    1544 non-null   object
 4   Loc     773 non-null    object
 5   Opp     1544 non-null   object
 6   Result  1544 non-null   object
 7   DEC     1470 non-null   object
 8   MIN     1544 non-null   object
 9   GA      1544 non-null   int64
 10  SV      1544 non-null   int64
 11  Shots   1544 non-null   int64
 12  SV%     1543 non-null   float64
 13  GAA     1544 non-null   float64
dtypes: datetime64[ns](1), float64(2), int64(3), object(8)
memory usage: 169.0+ KB
df['Loc'] = df['Loc'].notna().map({True: 'away', False: 'home'})
df.head(3)
| | Player | Date | Age | Team | Loc | Opp | Result | DEC | MIN | GA | SV | Shots | SV% | GAA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adin Hill | 2025-01-17 | 28-251 | VEG | away | CAR | L 2-3 | L | 57:08:00 | 3 | 25 | 28 | 0.893 | 3.15 |
| 1 | Adin Hill | 2025-01-12 | 28-246 | VEG | home | MIN | W 4-1 | W | 59:46:00 | 1 | 15 | 16 | 0.938 | 1.00 |
| 2 | Adin Hill | 2025-01-09 | 28-243 | VEG | home | NYI | L 0-4 | L | 58:30:00 | 3 | 17 | 20 | 0.850 | 3.08 |
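# Keep the whole-minute part of MIN (e.g. "57:08:00" -> 57)
df['minutes'] = df['MIN'].str.split(':').str[0].astype(int)
# Drop columns that the visualizations do not use
df = df.drop(columns=["Result", "DEC", "Opp"])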
df.head(1)
| | Player | Date | Age | Team | Loc | MIN | GA | SV | Shots | SV% | GAA | minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adin Hill | 2025-01-17 | 28-251 | VEG | away | 57:08:00 | 3 | 25 | 28 | 0.893 | 3.15 | 57 |
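The `minutes` column above keeps only the whole-minute part of `MIN`. If finer resolution were wanted, a helper along the following lines could convert the string to fractional minutes. This is a sketch under the assumption that the first two fields of `MIN` are minutes and seconds (the trailing `:00` looks like an export artifact); `parse_minutes` is a hypothetical name.

```python
def parse_minutes(min_str):
    """Convert a MIN string such as "57:08:00" to fractional minutes.

    Assumes the first field is minutes and the second is seconds;
    any further fields are ignored.
    """
    parts = min_str.strip().split(":")
    minutes = int(parts[0])
    seconds = int(parts[1]) if len(parts) > 1 else 0
    return minutes + seconds / 60


print(parse_minutes("58:30:00"))            # 58.5
print(round(parse_minutes("57:08:00"), 2))  # 57.13
```

Applied to the DataFrame, this could look like `df['MIN'].map(parse_minutes)`. For the first sample game, 3 goals over the parsed minutes gives a GAA of about 3.15, which agrees with the listed value.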
# Filter the DataFrame to the top 10 goalies by highest average SV%, requiring at least
# 20 games to avoid goalies who played only one or two games and ended up with a 1.0 SV% average.
# count games per goalie
games_played = df['Player'].value_counts()
# pick goalies with at least 20 games
eligible = games_played[games_played >= 20].index
# subset to those goalies
df_eligible = df[df['Player'].isin(eligible)]
top10 = (
    df_eligible
    .groupby('Player', as_index=False)['SV%']
    .mean()
    .nlargest(10, 'SV%')
)
df_top10 = df[df['Player'].isin(top10['Player'])]
print("Top 10 Goalies by AVG SV%: ")
df_top10.groupby('Player')[['SV%','GAA']].mean().sort_values(by = "SV%",ascending=False)
Top 10 Goalies by AVG SV%:
| Player | SV% | GAA |
|---|---|---|
| Connor Hellebuyck | 0.928944 | 1.965556 |
| Logan Thompson | 0.923538 | 2.118846 |
| Darcy Kuemper | 0.921087 | 2.102174 |
| Dustin Wolf | 0.916960 | 2.510000 |
| Joey Daccord | 0.916172 | 2.400690 |
| Jacob Markström | 0.912941 | 2.198235 |
| Linus Ullmark | 0.910870 | 2.312174 |
| Filip Gustavsson | 0.910613 | 2.781290 |
| Andrei Vasilevskiy | 0.907114 | 2.453429 |
| Joseph Woll | 0.906250 | 2.671667 |
# Scatter plot of SV% vs GAA, one point per game, colored by player
select_legend = alt.selection_multi(fields=["Player"], bind="legend")
chart1 = alt.Chart(df_top10).mark_circle().encode(
    x=alt.X("GAA:Q"),
    y=alt.Y("SV%:Q", scale=alt.Scale(domain=[0.6, 1.0])),
    color=alt.Color("Player:N"),  # color by player
    opacity=alt.condition(
        select_legend,
        alt.value(1),    # full opacity when selected
        alt.value(0.2)   # dim when not
    ),
    tooltip=["Player:N", "GAA:Q", "SV%:Q", "Team:N"]
).add_selection(
    select_legend
).interactive()
# Bar chart of mean SV% per player
chart2 = alt.Chart(df_top10).mark_bar().encode(
    y='Player:N',
    x=alt.X('mean(SV%):Q',
            scale=alt.Scale(domain=[0.88, 0.94])
    ),
    color="Player:N",
    opacity=alt.condition(
        select_legend,
        alt.value(1),    # full opacity when selected
        alt.value(0.2)   # dim when not
    ),
    tooltip=["Player:N", "mean(SV%):Q", "Team:N"]
).add_selection(
    select_legend
).interactive()
# Time series of GAA comparing players
# aggregate (if you have multiple entries per date)
df_trends = (
    df_top10
    .groupby(['Player', 'Team', 'Date'], as_index=False)
    .agg(GAA=('GAA', 'mean'))
)
chart3 = alt.Chart(df_trends).mark_line(point=True).encode(
    x='Date:T',
    y='GAA:Q',
    color='Player:N',
    opacity=alt.condition(
        select_legend,
        alt.value(1),    # full opacity when selected
        alt.value(0.2)   # dim when not
    ),
    tooltip=["Player:N", "GAA:Q", "Team:N", "Date:T"]
).properties(
    title="Seasonal GAA Trends by Goalie"
).add_selection(
    select_legend
).interactive()
# Bar chart of mean GAA per player
chart4 = alt.Chart(df_top10).mark_bar().encode(
    y='Player:N',
    x=alt.X('mean(GAA):Q'),
    color="Player:N",
    opacity=alt.condition(
        select_legend,
        alt.value(1),    # full opacity when selected
        alt.value(0.2)   # dim when not
    ),
    tooltip=["Player:N", "mean(GAA):Q", "Team:N"]
).add_selection(
    select_legend
).interactive()
# Scatter plot of SV% vs minutes played, one point per game, colored by player
chart5 = alt.Chart(df_top10).mark_circle().encode(
    x=alt.X("minutes:Q"),
    y=alt.Y("SV%:Q", scale=alt.Scale(domain=[0.6, 1.0])),
    color=alt.Color("Player:N"),  # color by player
    opacity=alt.condition(
        select_legend,
        alt.value(1),    # full opacity when selected
        alt.value(0.2)   # dim when not
    ),
    tooltip=["Player:N", "GAA:Q", "SV%:Q", "Team:N", "Date:T"]
).add_selection(
    select_legend
).interactive()
chart5 = chart5.properties(title="Minutes Played vs SV%")
# Time series of SV% comparing players
# aggregate (if you have multiple entries per date)
df_trends2 = (
    df_top10
    .groupby(['Player', 'Team', 'Date'], as_index=False)
    .agg(SVV=('SV%', 'mean'))
)
chart6 = alt.Chart(df_trends2).mark_line(point=True).encode(
    x=alt.X('Date:T', title='Date'),
    y=alt.Y('SVV:Q', title='Save %', scale=alt.Scale(domain=[0.7, 1.025])),
    color='Player:N',
    opacity=alt.condition(
        select_legend,
        alt.value(1),    # full opacity when selected
        alt.value(0.2)   # dim when not
    ),
    tooltip=["Player:N", "SVV:Q", "Team:N", "Date:T"]
).properties(
    title="Seasonal SV% Trends by Goalie"
).add_selection(
    select_legend
).interactive()
chart6
# Title charts 1, 2, and 4:
chart1 = chart1.properties(title="GAA vs SV% Scatter")
chart2 = chart2.properties(title="Avg Save %")
chart4 = chart4.properties(title="Avg GAA")  # GAA is not a percentage
# Place the charts side by side in three rows:
combined = alt.hconcat(chart1, chart5).resolve_scale(color='independent')
combined2 = alt.hconcat(chart3, chart6).resolve_scale(color='independent')
combined3 = alt.hconcat(chart2, chart4).resolve_scale(color='independent')
# Stack the three rows vertically:
stack = alt.vconcat(combined, combined2, combined3).resolve_scale(color='independent')
stack.properties(title="Click on any legend to filter all visuals by player")