Finding polygon overlays of point data

We already have all the data we need to identify the parent state of each airport. The dt dataset includes the geo-coordinates of the locations, and we managed to render the states as polygons with the map function. Actually, this latter function can return the underlying dataset without rendering a plot:

> str(map_data <- map('state', plot = FALSE, fill = TRUE))
List of 4
 $ x    : num [1:15599] -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ y    : num [1:15599] 30.4 30.4 30.4 30.3 30.3 ...
 $ range: num [1:4] -124.7 -67 25.1 49.4
 $ names: chr [1:63] "alabama" "arizona" "arkansas" "california" ...
 - attr(*, "class")= chr "map"

So we have around 16,000 points describing the boundaries of the US states, but this map data is more detailed than we actually need (see for example the name of the polygons starting with Washington):

> grep('^washington', map_data$names, value = TRUE)
[1] "washington:san juan island" "washington:lopez island"
[3] "washington:orcas island"    "washington:whidbey island"
[5] "washington:main"

In short, the non-connecting parts of a state are defined as separate polygons. To this end, let's save a list of the state names without the string after the colon:

> states <- sapply(strsplit(map_data$names, ':'), '[[', 1)

We will use this list as the basis of aggregation from now on. Let's transform this map dataset into another class of object, so that we can use the powerful features of the sp package. We will use the maptools package to do this transformation:

> library(maptools)
> us <- map2SpatialPolygons(map_data, IDs = states,
+    proj4string = CRS("+proj=longlat +datum=WGS84"))

Note

An alternative way of getting the state polygons might be to directly load those instead of transforming from other data formats as described earlier. To this end, you may find the raster package especially useful to download free map shapefiles from gadm.org via the getData function. Although these maps are way too detailed for such a simple task, you can always simplify those—for example, with the gSimplify function of the rgeos package.

So we have just created an object called us, which includes the polygons of map_data for each state with the given projection. This object can be shown on a map just like we did previously, although you should use the general plot method instead of the map function:

> plot(us)
Finding polygon overlays of point data

Besides this, however, the sp package supports so many powerful features! For example, it's very easy to identify the overlay polygons of the provided points via the over function. As this function name conflicts with the one found in the grDevices package, it's better to refer to the function along with the namespace using a double colon:

> library(sp)
> dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')], dt,
+   proj4string = CRS("+proj=longlat +datum=WGS84"))
> str(sp::over(us, dtp))
'data.frame':  49 obs. of  8 variables:
 $ Dest     : chr  "BHM" "PHX" "XNA" "LAX" ...
 $ N        : int  2736 5096 1172 6064 164 NA NA 2699 3085 7886 ...
 $ Cancelled: int  39 29 34 33 1 NA NA 35 11 141 ...
 $ Distance : int  562 1009 438 1379 926 NA NA 1208 787 689 ...
 $ TimeVar  : num  10.1 13.61 9.47 15.16 13.82 ...
 $ ArrDelay : num  8.696 2.166 6.896 8.321 -0.451 ...
 $ lon      : num  -86.8 -112.1 -94.3 -118.4 -107.9 ...
 $ lat      : num  33.6 33.4 36.3 33.9 38.5 ...

What happened here? First, we passed the coordinates and the whole dataset to the SpatialPointsDataFrame function, which stored our data as spatial points with the given longitude and latitude values. Next, we called the over function to left-join the values of dtp to the US states.

Note

An alternative way of identifying the state of a given airport is to ask for more detailed information from the Google Maps API. By changing the default output argument of the geocode function, we can get all address components for the matched spatial object, which of course includes the state as well. Look for example at the following code snippet:

geocode('LAX','all')$results[[1]]$address_components

Based on this, you might want to get a similar output for all airports and filter the list for the short name of the state. The rlist package would be extremely useful in this task, as it offers some very convenient ways of manipulating lists in R.

The only problem here is that we matched only one airport to the states, which is definitely not okay. See for example the fourth column in the earlier output: it shows LAX as the matched airport for California (returned by states[4]), although there are many others there as well.

To overcome this issue, we can do at least two things. First, we can use the returnList argument of the over function to return all matched rows of dtp, and we will then post-process that data:

> str(sapply(sp::over(us, dtp, returnList = TRUE),
+   function(x) sum(x$Cancelled)))
 Named int [1:49] 51 44 34 97 23 0 0 35 66 149 ...
 - attr(*, "names")= chr [1:49] "alabama" "arizona" "arkansas" ...

So we created and called an anonymous function that will sum up the Cancelled values of the data.frame in each element of the list returned by over.

Another, probably cleaner, approach is to redefine dtp to only include the related values and pass a function to over to do the summary:

> dtp <- SpatialPointsDataFrame(dt[, c('lon', 'lat')],
+    dt[, 'Cancelled', drop = FALSE],
+    proj4string = CRS("+proj=longlat +datum=WGS84"))
> str(cancels <- sp::over(us, dtp, fn = sum))
'data.frame':  49 obs. of  1 variable:
 $ Cancelled: int  51 44 34 97 23 NA NA 35 66 149 ...

Either way, we have a vector to merge back to the US state names:

> val <- cancels$Cancelled[match(states, row.names(cancels))]

And to update all missing values to zero (as the number of cancelled flights in a state without any airport is not missing data, but exactly zero for sure):

> val[is.na(val)] <- 0
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.216.171.107