
PastRnak: Learning code with the Bruins

Visualizing Pastrnak’s goal scoring this season

NHL: Detroit Red Wings at Boston Bruins Bob DeChiara-USA TODAY Sports

Introduction

While we sit in quarantine, there is no better time to learn a new skill. There are no sports, no hanging out at a bar, no going to the movies, etc. The only way people are satisfying their sports cravings is through the Internet, just like you are right now! Visiting blogs, surfing through Elite Prospects, and watching old highlights all help fill that void.

One of the ways I fill the void is by coding in R. It’s a free programming language that is very popular among data scientists. With R I can collect (or upload), explore, and visualize datasets. It’s like having your own customizable hockey website.

Because I was already using R for hockey-related things, I thought I would share some of my knowledge by creating some Bruins-themed tutorials. We will specifically explore David Pastrnak’s season, starting simple and then digging for more interesting things.

This tutorial doesn’t require much expertise, just that you have R and RStudio installed on your computer. You can do that here and here. I also recommend Meghan Hall’s article in which she introduces R with hockey, as well as her presentation on moving from Excel to R.

The Exercise

In this exercise, we will visualize how David Pastrnak accumulated goals throughout the season. You can find the full code on GitHub. To manipulate the data, we will use filter(), mutate(), and select() from the tidyverse, along with cumsum(), which comes with base R. To visualize it, we will use geom_line(), geom_bar(), and geom_area() from ggplot2.

Before we start, we must load in the tidyverse package:

# if you don't have it installed already
install.packages("tidyverse")

# load it in
library(tidyverse)

Next, we must load in our dataset. I have a csv file on my GitHub that we will read in using read_csv(). If you choose to download it to your computer instead, read_csv() also accepts a local file path, and base R’s read.csv() is another option for local files.

raw_data <- read_csv("https://raw.githubusercontent.com/ShawnEFerris/PastRnak/master/EH_game_log_david_pastrnak_box_score_all_regular_2020-05-12.csv")

This should give you a dataset with 71 observations and 31 variables. Feel free to look through the data by running view(raw_data).
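If you would rather work from a downloaded copy, read_csv() reads a path on your computer the same way it reads a URL. Here is a minimal sketch of that idea; the temporary file and its two made-up rows are stand-ins for a real download, and the names local_path and local_data are only for illustration:

```r
library(readr)

# Stand-in for a CSV you downloaded; the path and contents are made up
local_path <- tempfile(fileext = ".csv")
writeLines(c("Date,G", "2019-10-03,0", "2019-10-05,1"), local_path)

# read_csv() takes a local file path just like it takes a URL
local_data <- read_csv(local_path, show_col_types = FALSE)

# Number of rows and columns that were read in
dim(local_data)
```

In practice you would swap local_path for wherever you saved the game log.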

We can’t quite work with this data just yet for a few reasons. If you looked at the data, you should’ve noticed that the 71st observation is a totals row. We’re going to want to take that out using filter(Date != "Total"). An exclamation mark followed by an equal sign means “is not equal to.”
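To see what != does inside filter(), here is a small sketch on a made-up table. The toy_log name and its values are invented for illustration and are not taken from the real game log:

```r
library(dplyr)

# Hypothetical three-row game log with a totals row at the bottom
toy_log <- tibble(
  Date = c("2019-10-03", "2019-10-05", "Total"),
  G    = c(0, 1, 1)
)

# Keep every row whose Date is not the string "Total"
games_only <- toy_log %>%
  filter(Date != "Total")

# Two game rows remain; the totals row is gone
nrow(games_only)
```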

We will also want to create columns for the game number, for how many goals Pastrnak had on the season following that game, and for his goals above expected. We can do this using mutate(), row_number(), and cumsum(). mutate() creates columns, row_number() numbers the rows in order, and cumsum() returns a cumulative sum. You can cumulatively sum a numeric column, like goals, or keep a running count of how many times a string appears in a column. You can also add, subtract, multiply, and divide these cumulative sums, which we will do to find goals above expected.
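Here is a rough sketch of those three ideas on a made-up five-game stretch. The Result column is hypothetical (I am not claiming the real game log has one), and all of the numbers are invented:

```r
library(dplyr)

# Hypothetical five-game stretch; every value here is made up
toy_games <- tibble(
  Result = c("W", "L", "W", "W", "L"),
  G      = c(1, 0, 2, 0, 1),
  ixG    = c(0.5, 0.3, 0.9, 0.4, 0.6)
)

toy_running <- toy_games %>%
  mutate(
    game_num       = row_number(),           # 1, 2, 3, 4, 5
    goals_so_far   = cumsum(G),              # 1, 1, 3, 3, 4
    wins_so_far    = cumsum(Result == "W"),  # 1, 1, 2, 3, 3
    above_expected = cumsum(G) - cumsum(ixG) # running goals minus running ixG
  )

toy_running
```

Counting a string works because Result == "W" is TRUE or FALSE for each row, and cumsum() treats TRUE as 1 and FALSE as 0.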

And finally, we don’t need all 31 variables, only the three we create with mutate(), so we will keep just those using select(). We will save the result as cleaned_data, and the code looks like this:

cleaned_data <- raw_data %>%
  # drop the totals row
  filter(Date != "Total") %>%
  # number the games and build the running totals
  mutate(game_num = row_number(),
         goals_at_game = cumsum(G),
         goals_above_expected = cumsum(G) - cumsum(ixG)) %>%
  # keep only the three new columns
  select(game_num, goals_at_game, goals_above_expected)

Now we have 70 observations of 3 variables: the game number, the number of goals Pastrnak had accumulated on the season following that game, and his goals above expected following that game. We can finally visualize the data and show off our work.

We will start off by using ggplot(). Every ggplot visualization has to start with this. We can also specify our dataset, x-axis, and y-axis here. Then we can call geom_line() to make a line graph. The code would look like this:

ggplot(cleaned_data, aes(game_num, goals_at_game)) + geom_line()

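That’s okay, but I want the line to be a little thicker, so I will specify size=1.25 in geom_line(). (Newer versions of ggplot2, 3.4.0 and later, prefer linewidth=1.25 for line thickness and will warn when size is used for lines.) I also want to label my axes and add a title using labs(). Finally, I want to make it look a little fresher by using theme_minimal(). The code looks like this: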

ggplot(cleaned_data, aes(game_num, goals_at_game)) +
  geom_line(size = 1.25) +
  labs(x = "Game Number", y = "Goals", title = "How David Pastrnak Accumulated Goals in 2019-20") +
  theme_minimal()

And the graph looks like this:
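Because ggplot2 builds a chart by adding pieces together with +, you can also save the base plot to a variable and add to it step by step. A minimal sketch, assuming a made-up toy_totals table (not the real data) and the illustrative names base_plot and line_plot:

```r
library(ggplot2)

# Made-up running goal totals for five games (not real data)
toy_totals <- data.frame(
  game_num      = 1:5,
  goals_at_game = c(1, 1, 3, 3, 4)
)

# Start with only the dataset and the axis mappings
base_plot <- ggplot(toy_totals, aes(x = game_num, y = goals_at_game))

# Add the line, the labels, and the theme as separate pieces
line_plot <- base_plot +
  geom_line() +
  labs(x = "Game Number", y = "Goals", title = "Toy example") +
  theme_minimal()

# Draw the finished chart
print(line_plot)
```

Naming the x and y arguments inside aes() is optional, but it can make the code easier to read when you come back to it later.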

Because we used ggplot() to specify the dataset and axes, we can easily change the chart type by exchanging geom_line() for something else. For example, we can make a bar graph by calling geom_bar(stat="identity"). Because I also want to add some style, I will set color="black" and fill="#FFB81C", which is the hex color code for Bruins gold. The code looks like this:

ggplot(cleaned_data, aes(game_num, goals_at_game)) +
  geom_bar(stat = "identity", color = "black", fill = "#FFB81C") +
  labs(x = "Game Number", y = "Goals", title = "How David Pastrnak Accumulated Goals in 2019-20") +
  theme_minimal()

And the graph looks like this:
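As a side note, ggplot2 also has geom_col(), which draws bars at the heights given in the data without needing stat="identity". A short sketch on a made-up toy_totals table (the values are invented, and bar_plot is just an illustrative name):

```r
library(ggplot2)

# Made-up running goal totals for five games (not real data)
toy_totals <- data.frame(
  game_num      = 1:5,
  goals_at_game = c(1, 1, 3, 3, 4)
)

# geom_col() uses the y values as bar heights directly
bar_plot <- ggplot(toy_totals, aes(x = game_num, y = goals_at_game)) +
  geom_col(color = "black", fill = "#FFB81C") +
  theme_minimal()

print(bar_plot)
```

Either spelling should give the same bars, so use whichever reads better to you.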

David Pastrnak was the league’s best shooter this season. He finished with over 20 more goals than Evolving Hockey’s expected goals model predicted. We can visualize this as well, and I thought an area plot would be cool for it. I will call geom_area() and switch around the fill and color so that the outline is gold and the fill is black. However, I want a more transparent, gray look to my area chart, so I will set alpha=0.2. And I didn’t forget to change my y-axis and title! The code looks like this:

ggplot(cleaned_data, aes(game_num, goals_above_expected)) +
  geom_area(fill = "black", color = "#FFB81C", alpha = 0.2) +
  labs(x = "Game Number", y = "Goals Above Expected", title = "How David Pastrnak Accumulated Goals from Shooting Talent in 2019-20") +
  theme_minimal()

And the graph looks like this:

Conclusion

This exercise is something that many people could do quickly in Excel. Still, I hope it inspires you to learn more about R. As this series goes on, the datasets will require more manipulation, which is where the benefits of using R start to show.

Special thanks to Evolving Hockey for the data. They have handy game logs among other tools on their website. And if you need any help, my DMs are open on Twitter (@shawnferris98), or you can email me at shawnf1629@gmail.com.