Analyzing and understanding football data

Now that we have obtained and cleaned the data, let's take some time to explore it, gain an understanding of what the different fields mean, and learn how we can use them to create something useful.

Getting ready

If you completed the previous recipe, you should have cleaned and formatted offense and defense datasets in preparation for this recipe.

How to do it…

In order to analyze the data, complete the following steps:

  1. The first thing we will do is combine the offense and defense data frames into a data frame called combined. This will get all of our data in one place and make it easier for us to do some exploration:
    combined <- merge(offense, defense, by.x="Team", by.y="Team")
    

    Since some of the offense and defense columns have the same name, we will rename them to avoid confusion later. We'll also get rid of the column from the defense data frame that shows the number of games because it is redundant now that we have combined data:

    colnames(combined)[2] <- "Games"
    colnames(combined)[3] <- "OffPPG"
    colnames(combined)[4] <- "OffYPG"
    colnames(combined)[5] <- "OffPassYPG"
    colnames(combined)[6] <- "OffRushYPG"
    combined$G.y <- NULL
    colnames(combined)[15] <- "DefPPG"
    colnames(combined)[16] <- "DefYPG"
    colnames(combined)[17] <- "DefRushYPG"
    colnames(combined)[18] <- "DefPassYPG"
    
  2. Now, we're ready to start exploring our data! One of the best places to start when exploring data is histograms. A histogram shows how the values in a single column of the data frame are distributed, so that you can get a sense of which values are typical, which are low, and which are high. First, let's create a histogram of offensive points per game across all teams:
    hist(combined$OffPPG, breaks=10, main="Offensive Points Per Game", xlab="Offensive PPG",ylab="Number of Teams")
    

    The histogram will look like the following diagram:

    [Figure: histogram of offensive points per game]

    According to the histogram, most teams score an average of 18 to 28 points per game. There is one team that averages significantly more, and one team that averages significantly less.

    The average offensive points scored per game is 23.4, and the standard deviation is 4.36. The highest-scoring team averaged 37.9 points per game, or 3.32 standard deviations above the mean. The lowest-scoring team averaged 15.4 points per game, or 1.84 standard deviations below the mean. This is shown through the following commands:

    mean(combined$OffPPG)
    [1] 23.41875
    sd(combined$OffPPG)
    [1] 4.361373
    
    max(combined$OffPPG)
    [1] 37.9
    min(combined$OffPPG)
    [1] 15.4
    
  3. Next, let's see how points allowed per game, a defensive statistic, are distributed:
    hist(combined$DefPPG, breaks=10, main="Defensive Points Per Game", xlab="Defensive PPG",ylab="Number of Teams")
    

    This produces the following diagram:

    [Figure: histogram of defensive points per game]

    There is a little less variability here, as most teams allow between 20 and 30 points per game. There are only a few teams with very good defenses that limit the offenses that they face to fewer than 20 points per game on average.

  4. Let's do one more histogram on the number of first downs per game, an offensive statistic:
    hist(combined$"1stD/G", breaks=10, main="Offensive 1st Downs Per Game", xlab="1st Downs/Game",ylab="Number of Teams")
    

    The diagram produced should look like the following:

    [Figure: histogram of offensive first downs per game]

    From this, we can tell that most teams gain between 17 and 20 first downs per game. Again, as in the points per game histogram, there is one team that gets an exceedingly high number of first downs per game. In both cases, offensive points per game and first downs per game, the outlier is the Denver Broncos team.

    You can create histograms for any column in your dataset by simply swapping the name of the column from any of the lines of code we just used. Try a few more and see what other insights you find!

  5. The next type of chart we will use is the bar chart. Bar charts can look similar to histograms, but they answer a different question: a bar chart shows how each team's figure compares with the others, whereas a histogram bins the values and counts how many teams fall into each bin. Let's start off by creating a bar chart for offensive points per game:
    library(ggplot2)  # needed for ggplot(); skip if it is already loaded
    ppg <- transform(combined,Team=reorder(Team,combined$OffPPG))
    ggplot(ppg,aes(x=Team, y=OffPPG)) +
      geom_bar(stat='identity',color="black",fill="blue") + coord_flip() + labs(x="Team",y="Avg Points per Game") + 
      ggtitle("Avg Points per Game") + theme(plot.title = element_text(size=18, face="bold"))
    

    This produces the following diagram:

    [Figure: bar chart of average points per game by team]

    Here, you can see each team's points-per-game figure, with the teams sorted in descending order from top to bottom.

  6. Next, let's try another bar graph for defense yards allowed per game:
    ypg <- transform(combined,Team=reorder(Team,-combined$DefYPG))
    ggplot(ypg,aes(x=Team, y=DefYPG)) +
      geom_bar(stat='identity',color="black",fill="blue") + coord_flip() + labs(x="Team",y="Avg Yards Allowed per Game") + 
      ggtitle("Avg Yards Allowed per Game") + theme(plot.title = element_text(size=18, face="bold"))
    

    You can refer to the following diagram:

    [Figure: bar chart of average yards allowed per game by team]

    From these charts, we can get a visual sense of what fans saw throughout the season, specifically the incredible offense of the Denver Broncos and the dominant defense of the Seattle Seahawks, which ultimately led Seattle to a Super Bowl victory over the Broncos.

    Try creating bar charts for a few more fields and see what other insights about the teams you can draw.

  7. The final type of graph we will use in this section is the scatter plot. These graphs are good at showing relationships and correlations between two different variables visually. For example, let's see how offensive yards and offensive points per game are related:
    ggplot(combined, aes(x=OffYPG, y=OffPPG)) +
      geom_point(shape=5, size=2) + geom_smooth() + 
      labs(x="Yards per Game",y="Points per Game") + ggtitle("Offense Yards vs. Points per Game") + 
      theme(plot.title = element_text(size=18, face="bold"))
    

    This produces the following diagram:

    [Figure: scatter plot of offense yards vs. points per game]

    As you can see, these two variables are positively correlated: as yards per game increases, points per game usually increases too. We can calculate the correlation coefficient with the following code:

    cor(combined$OffYPG,combined$OffPPG)
    [1] 0.7756408

  8. Let's look at whether the same holds on defense, between yards allowed and points allowed per game. In theory, a defense that limits the number of yards an offense gains should also limit the number of points that offense is able to score:
    ggplot(combined, aes(x=DefYPG, y=DefPPG)) +
      geom_point(shape=5, size=2) + geom_smooth() + 
      labs(x="Yards Allowed per Game",y="Points Allowed per Game") + ggtitle("Defense Yards vs. Points per Game") + 
      theme(plot.title = element_text(size=18, face="bold"))
    

    This produces the following graph:

    [Figure: scatter plot of defense yards vs. points per game]

    Looking at the scatter plot, there does seem to be some positive correlation here too, although not quite as strong as the previous offense relationship. Let's calculate the correlation for these two variables as well:

    cor(combined$DefYPG,combined$DefPPG)
    [1] 0.6823588
    
  9. Let's try one more correlation. One can postulate that the longer a team is on offense, the more points per game it is likely to score. To test whether this is true, we can create a scatter plot of time of possession against offensive points per game:
    ggplot(combined, aes(x=TOP, y=OffPPG)) +
      geom_point(shape=5, size=2) + geom_smooth() + 
      labs(x="Time of Possession (Seconds)",y="Points per Game") + ggtitle("Time of Possession vs. Points per Game") + 
      theme(plot.title = element_text(size=18, face="bold"))
    

    This produces the following graph:

    [Figure: scatter plot of time of possession vs. points per game]

    Oddly enough, the correlation between these two variables is not as strong as we might have guessed. Apparently, there are teams at different levels of efficiency, some scoring lots of points in very little time, and others scoring relatively few points over longer periods of time. When we calculate the correlation coefficient for these, we find that the value is much lower:

    cor(combined$TOP,combined$OffPPG)
    [1] 0.2530245
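
Looking back at step 2, the "standard deviations above (or below) the mean" figures are z-scores. The following is a minimal sketch of that arithmetic, using the four values printed in step 2 as plain numbers; the helper name z_score is an assumption introduced here for illustration and is not part of the recipe:

```r
# Values printed in step 2 (mean, sd, max, and min of OffPPG).
off_mean <- 23.41875
off_sd   <- 4.361373
off_max  <- 37.9
off_min  <- 15.4

# A z-score is the distance from the mean, measured in standard deviations.
# z_score is a hypothetical helper used only in this sketch.
z_score <- function(x, center, spread) (x - center) / spread

z_high <- z_score(off_max, off_mean, off_sd)
z_low  <- z_score(off_min, off_mean, off_sd)

# Positive means above the mean, negative means below it.
print(round(c(highest = z_high, lowest = z_low), 2))
```

With the real data frame, the same calculation could be applied to the whole column at once, for example (combined$OffPPG - mean(combined$OffPPG)) / sd(combined$OffPPG).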
    

How it works…

When creating histograms in R, an important thing to consider is the number of breaks (bins) you want the histogram to have. Having more breaks gives you a finer level of detail, but having too many defeats the purpose of the histogram, which is to group values that are close together so that you can compare how often observations fall in one range of values versus another. In our experience, using 10 bins is usually a good starting point, and then you can adjust it higher or lower as you see fit.
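
One detail worth knowing is that hist() treats a single number passed to breaks as a suggestion and rounds the breakpoints to "pretty" values, so the number of bins drawn may differ from the number requested. The sketch below, which uses made-up points-per-game values rather than the recipe's data, compares that behavior with passing an explicit vector of bin edges:

```r
# Hypothetical points-per-game values, used only for illustration.
set.seed(42)
ppg <- round(rnorm(32, mean = 23, sd = 4), 1)

# A single number is a suggestion; hist() picks "pretty" breakpoints near it.
suggested <- hist(ppg, breaks = 10, plot = FALSE)

# An explicit vector of edges is used as given: 11 edges make 10 bins.
edges <- seq(min(ppg), max(ppg), length.out = 11)
exact <- hist(ppg, breaks = edges, plot = FALSE)

cat("Bins from breaks = 10:    ", length(suggested$counts), "\n")
cat("Bins from explicit edges: ", length(exact$counts), "\n")
```

Setting plot = FALSE returns the histogram object without drawing it, which is a convenient way to inspect the breakpoints and counts before settling on a bin width.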

We created the bar charts using the ggplot2 package. We first arranged the data into the desired order using the transform function and then graphed the resulting data frame. With ggplot2, you can change just about any feature of the charts that you create, including the outline and fill colors of the bars, how the axes and chart titles look, and much more!
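
To make the ordering step more concrete, here is a small sketch of what reorder() does inside transform(). The three-team data frame is hypothetical and stands in for combined; only the Team and OffPPG column names are taken from the recipe:

```r
# Hypothetical teams and scoring averages, for illustration only.
toy <- data.frame(
  Team   = c("A", "B", "C"),
  OffPPG = c(30, 10, 20)
)

# reorder() returns a factor whose levels are sorted by the second argument.
asc  <- transform(toy, Team = reorder(Team, OffPPG))
desc <- transform(toy, Team = reorder(Team, -OffPPG))

print(levels(asc$Team))   # levels sorted by increasing OffPPG
print(levels(desc$Team))  # levels sorted by decreasing OffPPG
```

ggplot2 draws a discrete axis in factor-level order, and coord_flip() places the first level at the bottom of the chart. That is why sorting by the value itself puts the largest bar at the top, while negating the value, as in the yards-allowed chart, puts the smallest bar at the top.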

The same is true of the scatter plots we created in this section with ggplot2. For example, we drew the points as hollow diamonds (shape=5), though we could have chosen from a number of different shapes and sizes for our points.
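
The correlation coefficients reported alongside the scatter plots come from cor(), which computes Pearson's r by default. As a rough illustration of what that number represents, the sketch below recomputes it by hand as the covariance divided by the product of the standard deviations. The data values and the helper name pearson_r are assumptions made up for this example:

```r
# Hypothetical yards-per-game and points-per-game values, for illustration only.
ypg <- c(310, 325, 340, 355, 370, 390, 410, 457)
ppg <- c(17.5, 20.1, 19.8, 23.4, 22.9, 26.0, 25.1, 37.9)

# Pearson's r: covariance scaled by the product of the standard deviations.
# pearson_r is a hypothetical helper used only in this sketch.
pearson_r <- function(x, y) cov(x, y) / (sd(x) * sd(y))

r_manual  <- pearson_r(ypg, ppg)
r_builtin <- cor(ypg, ppg)  # method = "pearson" is the default

print(c(manual = r_manual, builtin = r_builtin))
```

The result always lies between -1 and 1. Values near 1 indicate a strong positive linear relationship and values near 0 indicate little linear relationship, which is how the coefficients for the three scatter plots can be compared.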

There's more…

Hadley Wickham, the creator of the ggplot2 package, has a great reference website that you can use to figure out how to make your charts and plots look exactly the way you want. The site can be found at http://docs.ggplot2.org/current/.
