Acquiring and cleaning football data

There are many places on the Internet to obtain football data, including various websites that track schedules, scores, and statistics. When looking for datasets, the main things we take into consideration are the usefulness, quality, and format of the data. For this project, we will pull data from http://sports.yahoo.com/ as it is a well-established source and has the necessary stats in a relatively well-organized format, which will only require some light cleaning.

Getting ready

If you've installed and loaded the packages listed in the Introduction section of this chapter and set your working directory to the location where you want to save your files, you should have everything you need to continue.

How to do it…

Perform the following steps to acquire and clean the data:

  1. The first thing we will do is acquire offensive data for each team for a season. Since the last complete season, at the time of writing this book, is the 2013 season, it is the one we will use. So, we will set the year variable to 2013:
    year <- 2013
    
  2. Next, we will embed the year into the URL where the data is located and assign the entire URL string to the url variable:
    url <- paste("http://sports.yahoo.com/nfl/stats/byteam?group=Offense&cat=Total&conference=NFL&year=season_",year,"&sort=530&old_category=Total&old_group=Offense", sep="")
    
  3. Now that we have the complete URL, we can pull the data from it:
    offense <- readHTMLTable(url, encoding = "UTF-8", colClasses="character")[[7]]
    

    This will create a data frame named offense that will contain the offensive stats for all 32 teams, as shown in the following screenshot:

  4. Looking at the data that we've just pulled, the first thing we notice is that it needs some cleaning up. Every second column is blank, and all of the fields were read in as character strings. So, let's get rid of the blank columns and then assign a data type to each of the remaining columns, using the following commands:
    offense <- offense[,-c(2,4,6,8,10,12,14,16,18,20,22,24,26,28)]
    offense[,1] <- as.character(offense[,1])
    offense[,2:13] <- apply(offense[,2:13],2,as.numeric)
    offense[,14] <- as.numeric(substr(offense[,14], 1, 2))*60 + as.numeric(substr(offense[,14], 4, 6))
    

    Note

    The last column, labeled TOP, is the time of possession, or the average amount of time the team was on offense per game. It was previously formatted as minutes:seconds, so we just changed it so that it reflects the total number of seconds per game during which the team's offense has possession of the ball.

    Now, our offense data is clean and formatted properly, as shown in the following screenshot:

  5. Next, let's do the same thing with the data for defense. As with offense, we will start by embedding the year into the URL where we can obtain the defense data:
    url <- paste("http://sports.yahoo.com/nfl/stats/byteam?group=Defense&cat=Total&conference=NFL&year=season_",year,"&sort=530&old_category=Total&old_group=Defense", sep="")
    
  6. Next, we will pass this URL string into the readHTMLTable function to pull the data:
    defense <- readHTMLTable(url, encoding = "UTF-8", colClasses="character")[[7]]
    

    Note

    When scraping data from a web page, the readHTMLTable function reads the entire page and returns a list with one element for each HTML table it finds. We add [[7]] at the end because the table we want is the seventh element of that list. For fun, try changing this number to see what the other tables on the page look like when they are read by the function.

    The following screenshot shows the data:

  7. Just as our offense data needed to be cleaned, so does the defense data. We will use essentially the same commands, substituting the name defense for offense. Note that since the time of possession does not apply to defense, it is not included in the defense data, so there is one fewer blank column to drop and no time conversion to do:
    defense <- defense[,-c(2,4,6,8,10,12,14,16,18,20,22,24,26)]
    defense[,1] <- as.character(defense[,1])
    defense[,2:13] <- apply(defense[,2:13],2,as.numeric)
    

    Now, our defense data is also clean and formatted, as shown in the following screenshot:

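The cleaning in steps 4 and 7 depends on fixed column positions and, for the time of possession, on fixed character positions, which assumes two-digit minutes. The following is a minimal sketch of the same cleaning done by content rather than by position. It runs on a small made-up data frame, not on the real Yahoo table: the values, the X1 to X3 column names, and the top_to_seconds helper are all assumptions for illustration.

```r
# Toy stand-in for the scraped table: made-up values, not real NFL stats.
# Every second column is blank, mimicking the layout described in step 4.
raw <- data.frame(
  Team = c("Team A", "Team B"),
  X1   = c("", ""),
  G    = c("16", "16"),
  X2   = c("", ""),
  Pts  = c("25.4", "19.8"),
  X3   = c("", ""),
  TOP  = c("31:05", "9:47"),
  stringsAsFactors = FALSE
)

# Keep only the columns that contain at least one non-blank value.
is_blank <- vapply(raw, function(col) all(is.na(col) | trimws(col) == ""),
                   logical(1))
clean <- raw[, !is_blank, drop = FALSE]

# Hypothetical helper: convert "minutes:seconds" to total seconds by
# splitting on the colon, so one-digit minutes such as "9:47" also work.
top_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
         numeric(1))
}

# Convert the stat columns to numeric, then the time of possession.
num_cols <- c("G", "Pts")
clean[num_cols] <- lapply(clean[num_cols], as.numeric)
clean$TOP <- top_to_seconds(clean$TOP)

print(clean)
```

Selecting columns by content means the same code would work for both the offense and defense tables, at the cost of being a little longer than the one-line index vectors used in the recipe.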

How it works…

The paste() function in R is used to concatenate strings. For those who are new to manipulating data, concatenation means joining things together end to end. By default, paste() puts a single space between the pieces it joins; passing sep="" (or using paste0()) joins them with nothing in between, which is what a URL needs. We used this function because we wanted to embed the year that we want to pull into the URL for the web page. This lets us change from year to year by simply changing the value of the year variable. Try changing the value to 2012 or 2011 and then rerunning the steps in this recipe. It will pull the stats for the year that you chose, assuming that the data is available.
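To see the effect of the separator, here is a short sketch using only base R. The stats_url helper is hypothetical (it is not part of the recipe) and simply wraps the URL pattern from steps 2 and 5 so that the group and year can both be varied.

```r
year <- 2013

# The default separator is a single space, which ends up inside the string.
with_spaces <- paste("year=season_", year, "&sort=530")

# sep = "" (or paste0) joins the pieces with nothing in between.
no_spaces  <- paste("year=season_", year, "&sort=530", sep = "")
same_thing <- paste0("year=season_", year, "&sort=530")

# Hypothetical helper (not from the recipe): build the stats URL for a
# given group ("Offense" or "Defense") and year.
stats_url <- function(group, year) {
  paste0("http://sports.yahoo.com/nfl/stats/byteam?group=", group,
         "&cat=Total&conference=NFL&year=season_", year,
         "&sort=530&old_category=Total&old_group=", group)
}

print(with_spaces)
print(no_spaces)
print(stats_url("Defense", 2012))
```

The first string contains spaces around the year, while the second and third do not.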

Another useful R function used in this section is apply(). We use it to convert several columns to numeric with a single line of code. Its second argument selects the direction: 2 applies the function to each column, and 1 applies it to each row. The apply() function is not limited to changing field types; it works with many mathematical operations as well. For example, if we wanted to take the mean of columns 2 through 13 in the defense data frame after converting them to numeric types, we would use the following command:

means <- apply(defense[,2:13],2,mean)
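As a small runnable illustration of the same idea, the sketch below uses a made-up three-row data frame in place of the real defense data (the team names and numbers are assumptions). It also shows colMeans(), a base R shortcut for the column-mean case.

```r
# Made-up numbers standing in for a few defense columns.
toy <- data.frame(
  Team = c("Team A", "Team B", "Team C"),
  Pts  = c("20.5", "17.0", "24.5"),
  Yds  = c("330", "310", "350"),
  stringsAsFactors = FALSE
)

# Convert the stat columns to numeric, as in the recipe.
toy[, 2:3] <- apply(toy[, 2:3], 2, as.numeric)

# MARGIN = 2 applies the function to each column.
col_means <- apply(toy[, 2:3], 2, mean)
col_max   <- apply(toy[, 2:3], 2, max)

# colMeans() is a base R shortcut for the column-mean case.
col_means_alt <- colMeans(toy[, 2:3])

print(col_means)
print(col_max)
```

Both approaches return a named numeric vector with one entry per column.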