PastRnak: Looking at goals by opponent

Includes Q&A with Bruins data analyst Josh Pohlkamp-Hartt

NHL: Montreal Canadiens at Boston Bruins Bob DeChiara-USA TODAY Sports

Last week I wrote the first tutorial in the PastRnak series, where we learn R with a Bruins twist. If you haven’t already read that, I suggest doing that first. As with all of these tutorials, you must have R and RStudio installed on your computer.

In the fall of 2018, the Bruins began to look for two people to add to their front office. One would be a data engineer, tasked with the thankless job of taking complex datasets and making them useful, among other things. As we hear more and more about player tracking, and look at the NFL’s new tracking data, this person will be extremely important to the team moving forward. As we announced in February of 2019, Campbell Weaver got that job.

The other job was a data analyst, someone who is largely tasked with analyzing datasets in order to create useful tools for the rest of the front office and coaching staff to use. Josh Pohlkamp-Hartt was hired to fill that position. Josh previously worked for Apple and has a PhD in Statistics from Queen’s University in Kingston, ON. While limited in what he could say, he was nice enough to answer a few of my questions.

Q: When and why did you start using R?

A: I started using R in undergrad around 2006-2007. This was before RStudio, so I did a lot of work in the RGUI, which was not nearly as functional. The development of the Tidy packages has improved the learnability of R significantly. I was lucky enough to have Hadley come to Apple (former employer) and give us a seminar and show us how to use Tidy. I still use way more apply statements and loops than I probably need to.

Q: Since being hired by the Bruins, how often have you used R?

A: Constantly. We work on data-driven decisions and this is my primary tool for interacting with data.

Q: Why do you use R instead of Excel or Apple Numbers?

A: With R (or Python) the focus is on the actions taken on the data rather than the elements of the data. In Excel/Numbers, the presentation of the data in a table is great for acute exploration. We can do similar in R with View(). We also can visualize the data in many other ways that are more difficult in Excel/Numbers. Another significant advantage is the active research community producing useful packages. If I was to try and use a new method or model, it is usually as simple as reading the package paper on CRAN (and probably some Stack Overflow…). Overall Excel/Numbers are good tools for data exploration and for more complex inference it requires R or Python. I am pretty sure that to build Neural Nets (an example of a complicated model) in Excel, you have to call R or Python.

Q: What recommendations would you give to someone trying to learn R?

A: For learning R, it is mostly about repetition. Try to recreate tutorials (like this nice one Shawn has made) and explore data on your own. Tidy Tuesdays from RStudio is a great way to collaborate, learn and explore new data. There are numerous online learning platforms that are all helpful, I would probably start with Hadley’s book on Data Science. These are great references but shouldn’t be gospel. Try things out, make mistakes, get frustrated and learn - that’s how we all got to where we are today.

In this week’s tutorial, we will utilize group_by() and summarise() to look at how many goals David Pastrnak scored against each opponent this season. We will use the same dataset as last time. If this is the first exercise you choose to do, you will need to make sure you install and load tidyverse as well as load in the dataset.

# if you don't have it installed already
install.packages("tidyverse")

# load it in
library(tidyverse)

# load in raw data
raw_data <- read_csv("")

If you did the exercise from last week, you still need to load tidyverse by running library(tidyverse). Now it is time to clean the data.

As we did last week, we must filter out the totals row by using filter(Date != "Total"). Next we will group by opponent using group_by(Opponent). When combining that with the summarise function, we are essentially making one row for each individual opponent, and the values will be the columns we create in summarise. We will create a goals column and a time on ice column by using sum(variable).

goals_by_opponent <- raw_data %>%
  filter(Date != "Total") %>%
  group_by(Opponent) %>%
  summarise(G = sum(G),
            TOI = sum(TOI))
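To see what filter(), group_by() and summarise() do together without needing the real dataset, here is a minimal sketch on a tiny made-up game log. The toy_log data frame and every number in it are invented for illustration (they are not Pastrnak’s real stats), and it assumes TOI is stored as a plain number of minutes.

```r
library(dplyr)

# Made-up game log for illustration only; NOT real stats.
# The last row mimics the "Total" row that the real export includes.
toy_log <- data.frame(
  Date     = c("2019-10-03", "2019-10-05", "2019-11-02", "2019-11-05", "Total"),
  Opponent = c("DAL", "ARI", "DAL", "ARI", NA),
  G        = c(1, 0, 2, 1, 4),
  TOI      = c(18.5, 17.0, 19.5, 20.0, 75.0)
)

toy_by_opponent <- toy_log %>%
  filter(Date != "Total") %>%     # drop the totals row so it isn't counted twice
  group_by(Opponent) %>%          # one group per opponent
  summarise(G = sum(G),           # collapse each group to a single row
            TOI = sum(TOI))

print(toy_by_opponent)
```

The four game rows collapse into two rows, one per opponent, with the goals and time on ice added up within each group.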

After cleaning the data, we can graph the results with a simple bar graph.

ggplot(goals_by_opponent, aes(x = Opponent, y = G)) +
  geom_bar(stat = "identity")

If you run it, that graph is going to look like the one below.

Pretty ugly, right? I want to sort the x-axis by using reorder(). It’s very simple. You choose the variable you want to reorder, and then the variable you want to sort by. In this case, reorder(Opponent, G). Next I want a colored fill based on time on ice. All I have to do is add fill = TOI in the aes(). Finally, I want to flip the coordinates so that the team names fit better. I can do this by using coord_flip(). The rest is adding/changing my labels and choosing a theme (I’m using the black-and-white theme, theme_bw(), this week). The code is below.
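Before the full plot code, it may help to see what reorder() does on its own. This is a small sketch using base R only; the opponent and goals vectors are made up for illustration and are not taken from the dataset.

```r
# Made-up values for illustration only.
opponent <- c("TOR", "MTL", "NYR")
goals    <- c(2, 5, 1)

# By default, factor levels (and therefore the axis order) are alphabetical.
default_levels <- levels(factor(opponent))
print(default_levels)

# reorder() sorts the levels by the second argument, lowest value first.
sorted_levels <- levels(reorder(opponent, goals))
print(sorted_levels)
```

Because the levels are sorted in ascending order, the opponent with the most goals ends up last. After coord_flip(), that puts it at the top of the chart. To sort the other way, reorder(Opponent, -G) should do it. The plot code itself follows.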

ggplot(goals_by_opponent, aes(x = reorder(Opponent, G), y = G, fill = TOI)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Opponent", y = "Goals", title = "David Pastrnak's 2019-20 Goals by Opponent", fill = "Time on Ice") +
  theme_bw()

The graph looks like this:

Lastly, I want to get rid of the small white space between the bars and the teams. I can do this by adding scale_y_continuous(expand = c(0, 0)). The final code looks like this:

ggplot(goals_by_opponent, aes(x = reorder(Opponent, G), y = G, fill = TOI)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Opponent", y = "Goals", title = "David Pastrnak's 2019-20 Goals by Opponent", fill = "Time on Ice") +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw()

And the final product looks like this:

I hope you enjoyed this tutorial. The code is up on my GitHub. My challenge for you is to find out how you can create your own color scale for the fill. The unfortunate part about coding is that it involves a lot of Googling and trial and error.