Now that we have obtained and cleaned the data, let's take some time to explore it, gain an understanding of what the different fields mean, and learn how we can use them to create something useful.
If you completed the previous recipe, you should have cleaned and formatted offense and defense datasets in preparation for this recipe.
In order to analyze the data, complete the following steps:
Merge the offense and defense data frames into a data frame called combined. This will get all of our data in one place and make it easier for us to do some exploration:
combined <- merge(offense, defense, by.x="Team", by.y="Team")
Since some of the offense and defense columns have the same name, we will rename them to avoid confusion later. We'll also get rid of the column from the defense data frame that shows the number of games because it is redundant now that we have combined the data:
# Rename the offense columns
colnames(combined)[2] <- "Games"
colnames(combined)[3] <- "OffPPG"
colnames(combined)[4] <- "OffYPG"
colnames(combined)[5] <- "OffPassYPG"
colnames(combined)[6] <- "OffRushYPG"
# Drop the redundant games column that came from the defense data frame
combined$G.y <- NULL
# Rename the defense columns
colnames(combined)[15] <- "DefPPG"
colnames(combined)[16] <- "DefYPG"
colnames(combined)[17] <- "DefRushYPG"
colnames(combined)[18] <- "DefPassYPG"
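To see why columns such as G.y appear after a merge, here is a minimal sketch using two made-up data frames (offense_toy and defense_toy are hypothetical stand-ins, not the recipe's data). When both inputs share a non-key column name, merge appends .x and .y suffixes by default. The sketch also renames columns by name rather than by position, which is one way to avoid depending on column order:

```r
# Hypothetical toy data; the values are invented for illustration only
offense_toy <- data.frame(Team = c("A", "B"), G = c(16, 16), PPG = c(30.1, 20.5))
defense_toy <- data.frame(Team = c("A", "B"), G = c(16, 16), PPG = c(18.2, 24.0))

# Non-key columns that exist in both inputs get .x (first) and .y (second) suffixes
merged_toy <- merge(offense_toy, defense_toy, by = "Team")
print(names(merged_toy))

# Rename by name instead of by position, then drop the redundant games column
names(merged_toy)[names(merged_toy) == "G.x"] <- "Games"
names(merged_toy)[names(merged_toy) == "PPG.x"] <- "OffPPG"
names(merged_toy)[names(merged_toy) == "PPG.y"] <- "DefPPG"
merged_toy$G.y <- NULL
print(names(merged_toy))
```

Renaming by position, as in the recipe, is shorter, but renaming by name keeps working if the column order of the input files ever changes.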
hist(combined$OffPPG, breaks=10, main="Offensive Points Per Game", xlab="Offensive PPG",ylab="Number of Teams")
The histogram will look like the following diagram:
According to the histogram, most teams score an average of 18 to 28 points per game. There is one team that averages significantly more, and one team that averages significantly less.
The average offensive points scored per game is 23.4, and the standard deviation is 4.36. The highest scoring team averaged 37.9 points per game, or 3.32 standard deviations above the mean. The lowest scoring team averaged 15.4 points per game, or 1.84 standard deviations below the mean. This is shown through the following commands:
mean(combined$OffPPG)
[1] 23.41875
sd(combined$OffPPG)
[1] 4.361373
max(combined$OffPPG)
[1] 37.9
min(combined$OffPPG)
[1] 15.4
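The distance from the mean in standard deviations is simply (value - mean) / sd. As a minimal sketch, the following recomputes it from the summary statistics printed above; z_score is a hypothetical helper name introduced here for illustration:

```r
# Summary statistics copied from the output above
mean_ppg <- 23.41875
sd_ppg <- 4.361373

# Hypothetical helper: number of standard deviations a value lies from the mean
z_score <- function(value, mean_value, sd_value) {
  (value - mean_value) / sd_value
}

z_max <- z_score(37.9, mean_ppg, sd_ppg)  # highest scoring team
z_min <- z_score(15.4, mean_ppg, sd_ppg)  # lowest scoring team (negative: below the mean)
print(round(c(highest = z_max, lowest = z_min), 2))
```

With the real data frame loaded, the same calculation could be applied to the whole column at once, for example (combined$OffPPG - mean(combined$OffPPG)) / sd(combined$OffPPG).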
hist(combined$DefPPG, breaks=10, main="Defensive Points Per Game", xlab="Defensive PPG",ylab="Number of Teams")
This produces the following diagram:
There is a little less variability here, as most teams allow between 20 and 30 points per game. There are only a few teams with very good defenses that limit the offenses that they face to fewer than 20 points per game on average.
hist(combined$"1stD/G", breaks=10, main="Offensive 1st Downs Per Game", xlab="1st Downs/Game",ylab="Number of Teams")
The diagram produced should look like the following:
From this, we can tell that most teams gain between 17 and 20 first downs per game. Again, as in the points per game histogram, there is one team that gets an exceedingly high number of first downs per game. In both cases, offensive points per game and first downs per game, the outlier is the Denver Broncos team.
You can create histograms for any column in your dataset by simply swapping the name of the column from any of the lines of code we just used. Try a few more and see what other insights you find!
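If you plan to try several columns, a small wrapper can save some typing. The following is a sketch only: plot_column_hist is a hypothetical helper, and toy is a made-up data frame standing in for combined so that the example runs on its own:

```r
# Hypothetical stand-in for the combined data frame; values are invented
toy <- data.frame(
  OffPPG = c(16.2, 18.5, 20.1, 22.4, 23.0, 24.8, 26.3, 35.0),
  DefPPG = c(15.1, 19.7, 21.2, 23.5, 24.0, 25.6, 27.4, 29.8)
)

# Hypothetical helper: draw a histogram for one column, selected by name
plot_column_hist <- function(df, column, label, breaks = 10) {
  hist(df[[column]], breaks = breaks,
       main = paste(label, "Per Game"),
       xlab = label, ylab = "Number of Teams")
}

# Draw one histogram per column
for (column in names(toy)) {
  plot_column_hist(toy, column, column)
}
```

Using df[[column]] rather than df$column lets you pass the column name as a string, which also works for names such as "1stD/G" that are not valid R identifiers.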
library(ggplot2)

# Reorder the teams by offensive points per game so that the bars are sorted
ppg <- transform(combined, Team=reorder(Team, combined$OffPPG))
ggplot(ppg, aes(x=Team, y=OffPPG)) +
  geom_bar(stat='identity', color="black", fill="blue") +
  coord_flip() +
  labs(x="Team", y="Avg Points per Game") +
  ggtitle("Avg Points per Game") +
  theme(plot.title = element_text(size=18, face="bold"))
This produces the following diagram:
Here, you can see each team's points-per-game figure visually, with the teams sorted in descending order.
# Reorder the teams by yards allowed per game (negated to reverse the sort)
ypg <- transform(combined, Team=reorder(Team, -combined$DefYPG))
ggplot(ypg, aes(x=Team, y=DefYPG)) +
  geom_bar(stat='identity', color="black", fill="blue") +
  coord_flip() +
  labs(x="Team", y="Avg Yards Allowed per Game") +
  ggtitle("Avg Yards Allowed per Game") +
  theme(plot.title = element_text(size=18, face="bold"))
You can refer to the following diagram:
From these charts, we can get a visual sense of what fans saw throughout the season, specifically the incredible offense of the Denver Broncos team and the unstoppable defense of the Seattle Seahawks team, which ultimately led them to a Super Bowl victory against the Broncos.
Try creating bar charts for a few more fields and see what other insights about the teams you can draw.
# Inside aes(), refer to columns by name; ggplot looks them up in combined
ggplot(combined, aes(x=OffYPG, y=OffPPG)) +
  geom_point(shape=5, size=2) +
  geom_smooth() +
  labs(x="Yards per Game", y="Points per Game") +
  ggtitle("Offense Yards vs. Points per Game") +
  theme(plot.title = element_text(size=18, face="bold"))
This produces the following diagram:
As you can see, these two variables are positively correlated: as yards per game increase, points per game usually increase as well. We can calculate the correlation coefficient with the following code:
cor(combined$OffYPG, combined$OffPPG)
[1] 0.7756408
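By default, cor computes the Pearson correlation coefficient. To make that concrete, here is a minimal sketch that computes the coefficient by hand and compares it with cor. The vectors x and y are made-up numbers used purely for illustration, not values from the dataset:

```r
# Hypothetical yards-per-game and points-per-game values, invented for illustration
x <- c(300, 320, 350, 380, 410, 450)
y <- c(18, 21, 20, 25, 27, 33)

# Pearson's r: the sum of cross-deviations divided by the square root of
# the product of the two sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The built-in function uses method = "pearson" by default
r_builtin <- cor(x, y)

print(all.equal(r_manual, r_builtin))
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little or no linear relationship.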
ggplot(combined, aes(x=DefYPG, y=DefPPG)) +
  geom_point(shape=5, size=2) +
  geom_smooth() +
  labs(x="Yards Allowed per Game", y="Points Allowed per Game") +
  ggtitle("Defense Yards vs. Points per Game") +
  theme(plot.title = element_text(size=18, face="bold"))
This produces the following graph:
Looking at the scatter plot, there does seem to be some positive correlation here too, although not quite as strong as the previous offense relationship. Let's calculate the correlation for these two variables as well:
cor(combined$DefYPG, combined$DefPPG)
[1] 0.6823588
ggplot(combined, aes(x=TOP, y=OffPPG)) +
  geom_point(shape=5, size=2) +
  geom_smooth() +
  labs(x="Time of Possession (Seconds)", y="Points per Game") +
  ggtitle("Time of Possession vs. Points per Game") +
  theme(plot.title = element_text(size=18, face="bold"))
This produces the following graph:
Oddly enough, the correlation between these two variables is not as strong as we might have guessed. Apparently, there are teams at different levels of efficiency, some scoring lots of points in very little time, and others scoring relatively few points over longer periods of time. When we calculate the correlation coefficient for these, we find that the value is much lower:
cor(combined$TOP, combined$OffPPG)
[1] 0.2530245
When creating histograms in R, an important thing to consider is the number of breaks (columns) you want the histogram to have. Having more breaks gives you a finer level of detail, but having too many defeats the purpose of the histogram, which is to bin values that are close together to compare how often observations occur in a given range of values versus other ranges. In our experience, using 10 bins is usually a good starting point, and then you can adjust it higher or lower as you see fit.
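One detail worth knowing is that when breaks is a single number, hist treats it as a suggestion and picks "pretty" break points near that count, so the number of bins you get may not match exactly. The sketch below illustrates this with simulated values (not the recipe's data) and uses plot=FALSE to inspect the bins without drawing anything:

```r
# Simulated values for illustration only; not taken from the dataset
set.seed(42)
values <- rnorm(32, mean = 23, sd = 4)

# Request different bin counts; plot = FALSE returns the histogram object only
coarse <- hist(values, breaks = 5, plot = FALSE)
medium <- hist(values, breaks = 10, plot = FALSE)
fine <- hist(values, breaks = 30, plot = FALSE)

# Number of bins actually used for each request
bins_used <- sapply(list(coarse = coarse, medium = medium, fine = fine),
                    function(h) length(h$counts))
print(bins_used)
```

Comparing the actual bin counts this way can help when deciding whether a histogram is too coarse or too noisy for the number of observations you have.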
We created the bar charts using the ggplot2 package. We first arranged the data into the desired order using the transform function and then graphed the resulting data frame. With ggplot2, you can change just about any feature of the charts that you create, including the outline and fill colors of the bars, how the axes and chart titles look, and much more!
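The ordering step works because ggplot2 draws the categories of a discrete axis in the order of the factor's levels, and reorder sets those levels according to a second variable. Here is a minimal base R sketch of that step, using a made-up three-team data frame (toy is hypothetical):

```r
# Hypothetical toy data; team names and values are invented
toy <- data.frame(Team = c("A", "B", "C"), OffPPG = c(30, 18, 24))

# reorder() sorts the factor levels by OffPPG in ascending order
sorted_toy <- transform(toy, Team = reorder(Team, OffPPG))
print(levels(sorted_toy$Team))

# Negating the sort variable reverses the level order
descending_toy <- transform(toy, Team = reorder(Team, -OffPPG))
print(levels(descending_toy$Team))
```

The rows of the data frame stay where they are; only the level order of Team changes, and that is what determines where each bar is drawn.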
The same is true of the scatter plots we created in this section with ggplot2. For example, we changed the points to hollow diamonds (shape=5), though we could have chosen from a number of different shapes and sizes for our points.
Hadley Wickham, the creator of the ggplot2 package, has a great reference website that you can use to figure out how to make your charts and plots look exactly the way you want them to. The site can be found at http://docs.ggplot2.org/current/.
The ggplot2 package reference, available at http://docs.ggplot2.org/current/