We already have all the data we need to identify the parent state of each airport. The dt dataset includes the geo-coordinates of the locations, and we have already managed to render the states as polygons with the map function. Actually, this latter function can also return the underlying dataset without rendering a plot:
> str(map_data <- map('state', plot = FALSE, fill = TRUE))
List of 4
 $ x    : num [1:15599] -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ y    : num [1:15599] 30.4 30.4 30.4 30.3 30.3 ...
 $ range: num [1:4] -124.7 -67 25.1 49.4
 $ names: chr [1:63] "alabama" "arizona" "arkansas" "california" ...
 - attr(*, "class")= chr "map"
So we have around 16,000 points describing the boundaries of the US states, but this map data is more detailed than we actually need; see, for example, the names of the polygons starting with washington:
> grep('^washington', map_data$names, value = TRUE)
[1] "washington:san juan island" "washington:lopez island"
[3] "washington:orcas island"    "washington:whidbey island"
[5] "washington:main"
In short, the non-contiguous parts of a state are stored as separate polygons. So let's save a list of the state names by dropping the string after the colon:
> states <- sapply(strsplit(map_data$names, ':'), '[[', 1)
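To make this idiom transparent, here is the same one-liner applied to a couple of hand-made polygon names (the input vector below is made up for illustration):

```r
## split each polygon name at the colon and keep only the part before it
polygon_names <- c('washington:main', 'washington:orcas island', 'alabama')
parts <- strsplit(polygon_names, ':')  # a list of character vectors
prefixes <- sapply(parts, '[[', 1)     # take the first element of each
prefixes                               # "washington" "washington" "alabama"
```

Names without a colon are returned unchanged, as strsplit simply yields a one-element vector for those.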
We will use this list as the basis of aggregation from now on. Let's transform this map dataset into another class of object, so that we can use the powerful features of the sp package. We will use the maptools package to do this transformation:
> library(maptools)
> us <- map2SpatialPolygons(map_data, IDs = states,
+   proj4string = CRS("+proj=longlat +datum=WGS84"))
An alternative way of getting the state polygons is to load them directly instead of transforming them from other data formats as described earlier. To this end, you may find the raster package especially useful for downloading free map shapefiles from gadm.org via the getData function. Although these maps are way too detailed for such a simple task, you can always simplify them, for example with the gSimplify function of the rgeos package.
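A minimal sketch of this alternative route might look like the following; note that it needs a working internet connection to reach gadm.org, and the tolerance value passed to gSimplify is an arbitrary example, not a recommendation:

```r
library(raster)
library(rgeos)
## download the state-level (level 1) administrative borders of the USA
## from gadm.org -- this call requires internet access
usa <- getData('GADM', country = 'USA', level = 1)
## the GADM polygons are very detailed, so simplify them with an
## (arbitrarily chosen) tolerance before plotting
usa_simple <- gSimplify(usa, tol = 0.05, topologyPreserve = TRUE)
plot(usa_simple)
```

Keep in mind that gSimplify returns plain SpatialPolygons, so the attached data slot of the downloaded object is dropped in the process.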
So we have just created an object called us, which includes the polygons of map_data for each state with the given projection. This object can be shown on a map just as we did previously, although you should use the general plot method instead of the map function:
> plot(us)
Besides this, however, the sp package supports many other powerful features. For example, it's very easy to identify the polygons overlaying the provided points via the over function. As this function name conflicts with the one found in the grDevices package, it's better to refer to the function along with its namespace, using a double colon:
> library(sp)
> dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')], dt,
+   proj4string = CRS("+proj=longlat +datum=WGS84"))
> str(sp::over(us, dtp))
'data.frame': 49 obs. of  8 variables:
 $ Dest     : chr  "BHM" "PHX" "XNA" "LAX" ...
 $ N        : int  2736 5096 1172 6064 164 NA NA 2699 3085 7886 ...
 $ Cancelled: int  39 29 34 33 1 NA NA 35 11 141 ...
 $ Distance : int  562 1009 438 1379 926 NA NA 1208 787 689 ...
 $ TimeVar  : num  10.1 13.61 9.47 15.16 13.82 ...
 $ ArrDelay : num  8.696 2.166 6.896 8.321 -0.451 ...
 $ lon      : num  -86.8 -112.1 -94.3 -118.4 -107.9 ...
 $ lat      : num  33.6 33.4 36.3 33.9 38.5 ...
What happened here? First, we passed the coordinates and the whole dataset to the SpatialPointsDataFrame function, which stored our data as spatial points with the given longitude and latitude values. Next, we called the over function to left-join the values of dtp to the US states.
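Note that, without further arguments, over returns only the first matching point per polygon. The semantics resemble those of match in base R, sketched below on hypothetical data:

```r
## match reports only the *first* hit, just like over keeps one point
airports <- c(BHM = 'alabama', LAX = 'california', SFO = 'california')
## asking which airport falls into 'california' yields the first one only
names(airports)[match('california', airports)]  # "LAX"
```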
An alternative way of identifying the state of a given airport is to ask the Google Maps API for more detailed information. By changing the default output argument of the geocode function, we can get all the address components of the matched spatial object, which of course include the state as well. Take a look, for example, at the following code snippet:
> geocode('LAX', output = 'all')$results[[1]]$address_components
Based on this, you might want to get a similar output for all airports and filter the list for the short name of the state. The rlist package would be extremely useful in this task, as it offers some very convenient ways of manipulating lists in R.
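As a rough sketch of that filtering step in plain base R (the rlist equivalents, such as list.filter, read even nicer), using a mocked-up list that imitates the structure of the API response rather than a real call:

```r
## mocked-up address components resembling the Google Maps API response
components <- list(
  list(long_name = 'Los Angeles', short_name = 'Los Angeles',
       types = list('locality', 'political')),
  list(long_name = 'California', short_name = 'CA',
       types = list('administrative_area_level_1', 'political')))
## keep the component describing the state and extract its short name
state <- Filter(function(x)
    'administrative_area_level_1' %in% unlist(x$types), components)
sapply(state, '[[', 'short_name')  # "CA"
```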
The only problem here is that we matched only one airport to each state, which is definitely not okay. See, for example, the fourth row of the earlier output: it shows LAX as the matched airport for california (returned by states[4]), although there are many other airports there as well.
To overcome this issue, we can do at least two things. First, we can use the returnList argument of the over function to return all matched rows of dtp, and we will then post-process that data:
> str(sapply(sp::over(us, dtp, returnList = TRUE),
+   function(x) sum(x$Cancelled)))
 Named int [1:49] 51 44 34 97 23 0 0 35 66 149 ...
 - attr(*, "names")= chr [1:49] "alabama" "arizona" "arkansas" ...
So we created and called an anonymous function that sums up the Cancelled values of the data.frame in each element of the list returned by over.
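The very same pattern on a self-contained toy list of data.frames (the numbers are made up):

```r
## a named list of data.frames, similar in shape to what
## over(..., returnList = TRUE) produces: one data.frame per state
by_state <- list(
  alabama    = data.frame(Dest = 'BHM', Cancelled = 39),
  california = data.frame(Dest = c('LAX', 'SFO'), Cancelled = c(33, 64)))
## sum the Cancelled column within each element of the list
sapply(by_state, function(x) sum(x$Cancelled))  # alabama: 39, california: 97
```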
Another, probably cleaner, approach is to redefine dtp to include only the related values, and pass a function to over to do the summary:
> dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')],
+   dt[, 'Cancelled', drop = FALSE],
+   proj4string = CRS("+proj=longlat +datum=WGS84"))
> str(cancels <- sp::over(us, dtp, fn = sum))
'data.frame': 49 obs. of  1 variable:
 $ Cancelled: int 51 44 34 97 23 NA NA 35 66 149 ...
Either way, we have a vector to merge back to the US state names:
> val <- cancels$Cancelled[match(states, row.names(cancels))]
And let's update all missing values to zero, as the number of cancelled flights in a state without any airports is not missing data, but exactly zero for sure:
> val[is.na(val)] <- 0
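To see the last two steps at once, here they are on a tiny made-up example:

```r
## cancellation counts only for the states that had at least one airport
counts <- data.frame(Cancelled = c(39, 97),
                     row.names = c('alabama', 'california'))
state_names <- c('alabama', 'california', 'nevada')
## states without a matching row yield NA ...
v <- counts$Cancelled[match(state_names, row.names(counts))]
## ... which stands for exactly zero cancelled flights
v[is.na(v)] <- 0
v  # 39 97 0
```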