In the last two R tutorials, we worked with a relatively small dataset and visualized goals in various ways. Today, I want to take a small step towards the power R can bring to a data scientist, as well as a curious fan.
The EvolvingWild twins have been a huge help in this series by letting me use their site, which was built almost entirely in R, for acquiring my datasets. They also maintain a public scraper, which I use and which makes it easy to acquire the play-by-play data we will be working with today.
Remember that with all of these tutorials, you must have R and RStudio installed on your computer in order to participate. You can find the code in a nicer format on my GitHub.
Remember to load the tidyverse before we start. I can’t tell you how many times I dive right into editing my code and forget to do that.
#if you don't have it installed already
install.packages("tidyverse")
#load it in
library(tidyverse)
#load in raw data
raw_data <- read_csv("https://raw.githubusercontent.com/ShawnEFerris/PastRnak/master/EH_game_log_david_pastrnak_box_score_all_regular_2020-05-12.csv")
Today’s dataset is all 4015 shots the Bruins took this season. We can load in the dataset using read_csv() and call it pbp_data.
#load in play by play data
pbp_data <- read_csv("https://raw.githubusercontent.com/ShawnEFerris/PastRnak/master/EH_pbp_query_20192020_2020-05-30.csv")
The last dataset was very small, which made it easy to view in full. For a larger dataset like this one, another function you could use is head().
#view head of data
head(pbp_data)
This will show you the first 6 rows and 12 variables. The other 43 variables are listed below the output, along with their classes.
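With this many columns, head() cuts most of them off. Here is a minimal sketch of two alternatives, dim() and glimpse(), run on a small made-up tibble. The values are invented; only the column names come from the play-by-play data.

```r
library(dplyr)

#toy stand-in for pbp_data; the values are made up for illustration
toy_pbp <- tibble(
  event_type = c("SHOT", "GOAL", "MISS", "BLOCK"),
  event_distance = c(34.2, 11.0, 48.5, 27.3),
  game_strength_state = c("5v5", "5v4", "5v5", "5v5")
)

#number of rows and columns
dim(toy_pbp)

#one line per variable, with its class and first few values
glimpse(toy_pbp)
```

On the real data you would swap toy_pbp for pbp_data.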
The next thing we are going to do is filter all of these shot attempts down to just David Pastrnak’s 5v5 unblocked shots. For those who’ve already done the first two tutorials in this series, this should be rather easy.
#filter to just Pastrnak 5v5 unblocked shots
pasta_shots <- pbp_data %>%
  filter(event_player_1 == "DAVID.PASTRNAK",
         game_strength_state == "5v5",
         event_type != "BLOCK")
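A quick sanity check after a filter like this is to count what is left by event type. The sketch below runs the same filter on a made-up five-row table and confirms that only the unblocked 5v5 Pastrnak rows survive. The rows are invented, and the "BRAD.MARCHAND" spelling is an assumption about how the data formats player names.

```r
library(dplyr)

#toy stand-in for pbp_data; every row here is invented
toy_pbp <- tibble(
  event_player_1 = c("DAVID.PASTRNAK", "DAVID.PASTRNAK", "DAVID.PASTRNAK",
                     "DAVID.PASTRNAK", "BRAD.MARCHAND"),
  game_strength_state = c("5v5", "5v5", "5v4", "5v5", "5v5"),
  event_type = c("SHOT", "BLOCK", "GOAL", "GOAL", "SHOT")
)

#same three conditions as the filter above
toy_shots <- toy_pbp %>%
  filter(event_player_1 == "DAVID.PASTRNAK",
         game_strength_state == "5v5",
         event_type != "BLOCK")

#how many rows of each event type remain
toy_shots %>% count(event_type)
```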
Let’s take a trip to Evolving-Hockey.com for a moment. When looking at the top 14 Bruins forwards in terms of 5v5 time-on-ice, something interesting pops out. Although David Pastrnak has the second-highest shooting percentage on the team, he is 11th in terms of expected shooting percentage. Why is that?
Shot distance is the most influential factor in an expected goals model, so let’s look at the distribution of Pastrnak’s shots using a histogram. I set the bins to 5-foot intervals. It is important to remember that the histogram centers its bins on midpoints. Given that there are no shots closer than 6 feet, the first bin will have a midpoint of 5 and will represent shots between 2.5 and 7.5 feet. You could manually override this with the boundary or center arguments of geom_histogram(), but we will skip over that for now.
#histogram of shot distance
ggplot(pasta_shots, aes(x = event_distance)) +
  geom_histogram(binwidth = 5, color = "black", fill = "#FFB81C")
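To see which bin a given shot lands in, the midpoint arithmetic can be reproduced by hand. This is a rough sketch with invented distances, assuming bins centered on multiples of the 5-foot bin width as described above. bin_mid is a hypothetical helper name, not part of ggplot2.

```r
#made-up shot distances in feet
distances <- c(6, 7.4, 7.6, 12, 31)

#midpoint of the 5-foot bin each distance falls into
#(values sitting exactly on a bin edge are not handled here)
bin_mid <- function(x, width = 5) round(x / width) * width

#each distance next to its bin midpoint and bin edges
data.frame(distance = distances,
           bin_mid = bin_mid(distances),
           bin_low = bin_mid(distances) - 2.5,
           bin_high = bin_mid(distances) + 2.5)
```

If you would rather have bins that run from 5 to 10, 10 to 15, and so on, geom_histogram() accepts boundary = 0 alongside binwidth = 5.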
Feel free to add a title and customize your axis labels like we did in the first two tutorials. You can also find his average shot distance if you are curious, like I was.
#find average shot distance
mean(pasta_shots$event_distance)
His average shot distance was a little over 30 feet, which is about halfway between the goal line and the blue line. That’s fairly far out. We can also look at distance by event type using geom_density().
#density of shot distance split by event type
ggplot(pasta_shots, aes(x = event_distance, color = event_type)) +
  geom_density() +
  scale_color_manual(values = c("blue", "orange", "black"))
His goals come disproportionately from closer to the net, which should make sense given that shooting percentage rises the closer to the net a shooter is. What we might want to look at is shooting percentage vs. expected shooting percentage based on distance.
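To put a number on what the density curves show, you could summarise distance by event type. This is a minimal sketch on invented rows; with the real data you would swap in pasta_shots.

```r
library(dplyr)

#toy stand-in for pasta_shots; the distances are made up
toy_shots <- tibble(
  event_type = c("GOAL", "GOAL", "SHOT", "SHOT", "SHOT", "MISS", "MISS"),
  event_distance = c(9, 15, 22, 35, 48, 30, 55)
)

#count and median distance for each event type
distance_by_type <- toy_shots %>%
  group_by(event_type) %>%
  summarise(shots = n(),
            median_distance = median(event_distance))

distance_by_type
```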
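We will first use mutate() and round() to round event distance to 0 decimal places, then group by distance and sum the goals and expected goals at each foot. Dividing each sum by the number of unblocked shots at that distance gives shooting percentage and expected shooting percentage.
#group by shot distance and goals vs. expected goals
shots_by_distance <- pasta_shots %>%
  mutate(event_distance = round(event_distance, 0)) %>%
  group_by(event_distance) %>%
  summarise(shots = n(),
            G = sum(event_type == "GOAL"),
            xG = sum(pred_goal)) %>%
  #assumed definitions: rates per unblocked shot at each distance
  mutate(sh_per = G / shots,
         x_sh_per = xG / shots)
We can then manually make the curves and color them ourselves using geom_smooth() with method set to "loess" and the standard error band turned off.
#smoothed shooting percentage vs. expected shooting percentage
ggplot(shots_by_distance) +
  geom_smooth(aes(event_distance, sh_per), method = "loess", se = FALSE, color = "black") +
  geom_smooth(aes(event_distance, x_sh_per), method = "loess", se = FALSE, color = "#FFB81C") +
  labs(x = "Event Distance", y = "Unblocked Shooting Percentage",
       title = "Shooting Percentage (Black) vs. Expected Shooting Percentage (Gold)")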
From this we could likely narrow down that the difference in shooting percentage almost entirely stems from his shots that are within 30 feet, which is coincidentally his average shot distance.
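One way to check a claim like this is to split the shots at 30 feet and compare actual and expected shooting percentage in each group. This is a rough sketch on made-up rows. The values are invented, pred_goal is assumed to hold each shot’s expected-goal probability, and treating exactly 30 feet as "within" is my own choice.

```r
library(dplyr)

#toy stand-in for pasta_shots; every value is invented
toy_shots <- tibble(
  event_distance = c(8, 14, 22, 28, 35, 42, 50, 58),
  event_type = c("GOAL", "SHOT", "GOAL", "MISS", "SHOT", "SHOT", "GOAL", "MISS"),
  pred_goal = c(0.20, 0.10, 0.08, 0.06, 0.04, 0.03, 0.02, 0.01)
)

by_range <- toy_shots %>%
  #label each shot as inside or outside 30 feet
  mutate(range = if_else(event_distance <= 30, "within 30 ft", "beyond 30 ft")) %>%
  group_by(range) %>%
  summarise(shots = n(),
            G = sum(event_type == "GOAL"),
            xG = sum(pred_goal)) %>%
  #rates per unblocked shot in each range
  mutate(sh_per = G / shots,
         x_sh_per = xG / shots)

by_range
```

A large gap between sh_per and x_sh_per in one range but not the other would point to where the finishing is coming from.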
This type of data exploration can then lead into film analysis. A curious mind may filter this down to a list of all of his shots within 30 feet and then look back at the video in order to determine what is driving this. Are these shots disproportionately off of the rush? Is this a factor of playing with an elite playmaker in Brad Marchand? Data science can, and should, complement those who look at the qualitative side of the game.
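The shot list for that kind of film review might be pulled like this. It is only a sketch on invented rows: game_id and game_seconds are assumed column names for locating a shot on video, so check them against the names in your own pbp_data.

```r
library(dplyr)

#toy stand-in for pasta_shots; ids, times and distances are invented
toy_shots <- tibble(
  game_id = c(1, 1, 2, 3),
  game_seconds = c(312, 1450, 2203, 87),
  event_type = c("GOAL", "SHOT", "MISS", "GOAL"),
  event_distance = c(12, 41, 27, 33)
)

#shots from 30 feet or closer, in game order, with only the columns needed for review
film_list <- toy_shots %>%
  filter(event_distance <= 30) %>%
  arrange(game_id, game_seconds) %>%
  select(game_id, game_seconds, event_type, event_distance)

film_list
```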
For more resources this week, I highly recommend checking out the presentations from this year’s RStudio Conference. In particular, Dani Chu presented on identifying routes using the NFL’s new tracking data and Namita Nandakumar presented on expected goals. Both are employed by NHL Seattle.