There are many places on the Internet to obtain football data, including various websites that track schedules, scores, and statistics. When looking for datasets, the main things we take into consideration are the usefulness, quality, and format of the data. For this project, we will pull data from http://sports.yahoo.com/ as it is a well-established source and has the necessary stats in a relatively well-organized format, which will only require some light cleaning.
If you've installed and loaded the packages listed in the Introduction section of this chapter and set your working directory to the location where you want to save your files, you should have everything you need to continue.
Perform the following steps to acquire and clean the data:
Set the year variable to 2013:
year <- 2013
Create the url variable, passing sep = "" so that paste() does not insert spaces around the year:
url <- paste("http://sports.yahoo.com/nfl/stats/byteam?group=Offense&cat=Total&conference=NFL&year=season_", year, "&sort=530&old_category=Total&old_group=Offense", sep = "")
offense <- readHTMLTable(url, encoding = "UTF-8", colClasses="character")[[7]]
This will create a data frame named offense that contains the offensive stats for all 32 teams, as shown in the following screenshot:
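# Drop the even-numbered columns (2 through 28)
offense <- offense[,-c(2,4,6,8,10,12,14,16,18,20,22,24,26,28)]
# Store the first column as character and columns 2 through 13 as numeric
offense[,1] <- as.character(offense[,1])
offense[,2:13] <- apply(offense[,2:13],2,as.numeric)
# Convert column 14 (TOP) from minutes:seconds to total seconds
offense[,14] <- as.numeric(substr(offense[,14], 1, 2))*60 + as.numeric(substr(offense[,14], 4, 6))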
The last column, labeled TOP, is the time of possession, or the average amount of time the team was on offense per game. It was originally formatted as minutes:seconds, so we converted it to the total number of seconds per game during which the team's offense has possession of the ball.
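As a small illustration of the minutes:seconds conversion, the following sketch works on made-up values rather than real team data. It splits each value on the colon instead of relying on fixed character positions, which is an alternative to the substr() approach used in the recipe and should also cope with single-digit minutes. The names top, parts, and top_seconds are hypothetical.

```r
# Hypothetical sample values in minutes:seconds form (not real team data)
top <- c("31:45", "28:07", "30:00")

# Split each value into its minutes and seconds parts
parts <- strsplit(top, ":", fixed = TRUE)

# Convert each pair to a total number of seconds
top_seconds <- sapply(parts, function(p) {
  as.numeric(p[1]) * 60 + as.numeric(p[2])
})

# The fixed-position approach from the recipe, for comparison
top_seconds_substr <- as.numeric(substr(top, 1, 2)) * 60 +
  as.numeric(substr(top, 4, 6))

print(top_seconds)
```

For two-digit minutes, both approaches should give the same result.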
Now, our offense data is clean and formatted properly, as shown in the following screenshot:
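Repeat the process for the defense data, starting with the URL (again passing sep = "" so that paste() does not insert spaces around the year):
url <- paste("http://sports.yahoo.com/nfl/stats/byteam?group=Defense&cat=Total&conference=NFL&year=season_", year, "&sort=530&old_category=Total&old_group=Defense", sep = "")
Use the readHTMLTable function to pull the data:
defense <- readHTMLTable(url, encoding = "UTF-8", colClasses="character")[[7]]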
When we scrape a web page with the readHTMLTable function, it reads every table on the page and returns them as a list. We add [[7]] at the end because the table we want to pull data from is the seventh element in that list. For fun, try changing this number to see what the other page elements look like when they are read by the function.
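To see how the [[ ]] index selects one table from the result, here is a minimal sketch that parses a small, made-up HTML string instead of the live page. The HTML content and the names html, doc, and tables are assumptions for illustration only; the real page has more tables, and the one we want may sit at a different position.

```r
library(XML)

# Hypothetical page with two tables (not the real Yahoo markup)
html <- "<html><body>
<table>
  <tr><th>Link</th><th>Section</th></tr>
  <tr><td>Home</td><td>NFL</td></tr>
  <tr><td>Scores</td><td>NFL</td></tr>
</table>
<table>
  <tr><th>Team</th><th>Yds</th></tr>
  <tr><td>A</td><td>350.5</td></tr>
  <tr><td>B</td><td>410.2</td></tr>
</table>
</body></html>"

# Parse the string, then read every table into a list of data frames
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# How many tables were found, and what the second one looks like
print(length(tables))
print(tables[[2]])
```

Running length() on the result of readHTMLTable for the real page is a quick way to find out how many elements there are to explore.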
The following screenshot shows the data:
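Clean the defense data using the same steps as before, substituting defense for offense. Also, note that since the time of possession does not apply to defense, it is not included in the defense data:
# Drop the even-numbered columns (2 through 26)
defense <- defense[,-c(2,4,6,8,10,12,14,16,18,20,22,24,26)]
# Store the first column as character and columns 2 through 13 as numeric
defense[,1] <- as.character(defense[,1])
defense[,2:13] <- apply(defense[,2:13],2,as.numeric)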
Now, our defense data is also clean and formatted, as shown in the following screenshot:
The paste() function in R is used to concatenate strings. For those who are new to manipulating data, concatenation means joining things together end to end. By default, paste() puts a space between the pieces; passing sep = "" joins them with nothing in between, which is what a URL needs. We used this function because we wanted to embed the year that we want to pull into the URL for the web page. This lets us change from year to year by simply changing the value of the year variable. Try changing the value to 2012 or 2011 and then rerunning the steps in this recipe. It will automatically pull the stats for the year that you chose, assuming that the data is available.
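The effect of the separator is easy to check on a short string. The sketch below compares the default behavior of paste() with sep = "" and with paste0(), and then builds the offense URL for several years. The helper build_stats_url is a hypothetical name introduced only for this illustration; it is not part of the recipe, and whether the site still serves data at these addresses is not verified here.

```r
year <- 2013

# Default separator is a single space
with_space <- paste("season_", year)
# Empty separator joins the pieces directly
no_space <- paste("season_", year, sep = "")
# paste0() is shorthand for paste(..., sep = "")
no_space0 <- paste0("season_", year)

# Hypothetical helper that builds the stats URL for a group and year
build_stats_url <- function(group, year) {
  paste0("http://sports.yahoo.com/nfl/stats/byteam?group=", group,
         "&cat=Total&conference=NFL&year=season_", year,
         "&sort=530&old_category=Total&old_group=", group)
}

# One URL per season
urls <- sapply(c(2011, 2012, 2013), function(y) build_stats_url("Offense", y))
print(urls)
```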
Another useful R function used in this section is apply(). We use it to format several columns as numeric with a single line of code. Its second argument, 2, tells apply() to work column by column (1 would mean row by row). The apply() function can do this with many mathematical operations as well, and not just for changing field types. For example, if we wanted to take the mean for columns 2 through 13 in the defense data frame after converting them to numeric types, we would use the following command:
means <- apply(defense[,2:13],2,mean)
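The same pattern can be tried on a tiny data frame without downloading anything. The following sketch uses made-up numbers and hypothetical names (stats, col_means); it converts two character columns to numeric with apply() and then takes the column means.

```r
# Hypothetical three-team data frame with stats stored as text
stats <- data.frame(
  team = c("A", "B", "C"),
  yds  = c("350.5", "410.2", "298.1"),
  pts  = c("24", "31", "17"),
  stringsAsFactors = FALSE
)

# Convert columns 2 and 3 from character to numeric, column by column
stats[, 2:3] <- apply(stats[, 2:3], 2, as.numeric)

# Mean of each numeric column
col_means <- apply(stats[, 2:3], 2, mean)
print(col_means)
```

For column means specifically, base R also offers colMeans(), which should give the same values here.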
The paste() function, available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/paste.html
The apply() function, available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/apply.html
The XML package, available at http://cran.r-project.org/web/packages/XML/XML.pdf