Chapter 13. Data Around Us

Spatial data, also known as geospatial data, identifies geographic locations, such as natural or constructed features around us. Although all observations have some spatial content, such as the location of the observation, this is often beyond the reach of most data analysis tools due to the complex nature of spatial information; alternatively, the spatial dimension might not seem that interesting (at first sight) for the given research topic.

On the other hand, analyzing spatial data can reveal some very important underlying structures of the data, and it is well worth spending time visualizing the differences and similarities between close or far data points.

In this chapter, we are going to help with this and will use a variety of R packages to:

  • Retrieve geospatial information from the Internet
  • Visualize points and polygons on a map
  • Compute some spatial statistics

Geocoding

As in the previous chapters, we will use the hflights dataset to demonstrate how one can deal with data bearing spatial information. To this end, let's aggregate our dataset, just like we did in Chapter 12, Analyzing Time-series, but instead of generating daily data, let's view the aggregated characteristics of the airports. For the sake of performance, we will use the data.table package again as introduced in Chapter 3, Filtering and Summarizing Data and Chapter 4, Restructuring Data:

> library(hflights)
> library(data.table)
> dt <- data.table(hflights)[, list(
+     N         = .N,
+     Cancelled = sum(Cancelled),
+     Distance  = Distance[1],
+     TimeVar   = sd(ActualElapsedTime, na.rm = TRUE),
+     ArrDelay  = mean(ArrDelay, na.rm = TRUE)) , by = Dest]

So we have loaded the hflights dataset and immediately transformed it to a data.table object. At the same time, we aggregated by the destination of the flights to compute:

  • The number of rows
  • The number of cancelled flights
  • The distance (taken from the first flight in each group)
  • The standard deviation of the elapsed time of the flights
  • The arithmetic mean of the arrival delays
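To make the grouped computation more tangible, here is a minimal base R sketch of the same five summaries on a tiny, made-up data frame. The destination codes "AAA" and "BBB" and all values are hypothetical; only the column names mirror hflights. It is meant as an illustration of what the by = Dest aggregation produces, not as a replacement for the data.table call above.

```r
# Hypothetical toy flights table: column names mirror hflights, values are made up
flights <- data.frame(
  Dest              = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  Cancelled         = c(0L, 1L, 0L, 0L, 0L),
  Distance          = c(100L, 100L, 100L, 250L, 250L),
  ActualElapsedTime = c(30, NA, 34, 60, 70),
  ArrDelay          = c(5, NA, -1, 10, 20),
  stringsAsFactors  = FALSE
)

# Split the rows by destination and compute one summary row per group
per_dest <- lapply(split(flights, flights$Dest), function(g) {
  data.frame(
    Dest      = g$Dest[1],
    N         = nrow(g),                              # number of rows
    Cancelled = sum(g$Cancelled),                     # number of cancelled flights
    Distance  = g$Distance[1],                        # distance of the first flight
    TimeVar   = sd(g$ActualElapsedTime, na.rm = TRUE),
    ArrDelay  = mean(g$ArrDelay, na.rm = TRUE),
    stringsAsFactors = FALSE
  )
})

# Bind the per-group rows into a single data frame
toy <- do.call(rbind, per_dest)
rownames(toy) <- NULL
print(toy)
```

Note that the missing values of the cancelled flight are dropped by na.rm = TRUE, so the standard deviation and the mean are based on the completed flights only.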

The resulting R object looks like this:

> str(dt)
Classes 'data.table' and 'data.frame': 116 obs. of 6 variables:
 $ Dest     : chr  "DFW" "MIA" "SEA" "JFK" ...
 $ N        : int  6653 2463 2615 695 402 6823 4893 5022 6064 ...
 $ Cancelled: int  153 24 4 18 1 40 40 27 33 28 ...
 $ Distance : int  224 964 1874 1428 3904 305 191 140 1379 862 ...
 $ TimeVar  : num  10 12.4 16.5 19.2 15.3 ...
 $ ArrDelay : num  5.961 0.649 9.652 9.859 10.927 ...
 - attr(*, ".internal.selfref")=<externalptr>

So we have 116 observations, one for each destination airport, and five variables describing them. Although this seems to be a spatial dataset, we have no geospatial identifiers that a computer can understand per se, so let's fetch the geocodes of these airports from the Google Maps API via the ggmap package. First, let's see how it works when we are looking for the geo-coordinates of Houston:

> library(ggmap)
> (h <- geocode('Houston, TX'))
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Houston,+TX&sensor=false
       lon      lat
1 -95.3698 29.76043

So the geocode function can return the matched latitude and longitude of the string we sent to Google. Now let's do the very same thing for all flight destinations:

> dt[, c('lon', 'lat') := geocode(Dest)]

Well, this took some time, as we had to make 116 separate queries to the Google Maps API. Please note that, at the time of writing, Google limited you to 2,500 queries a day without authentication, so do not run this on a large dataset. Newer versions of ggmap require registering a Google API key with register_google before geocode can be used. There is a helper function in the package, called geocodeQueryCheck, which can be used to check the remaining number of free queries for the day.
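When the strings to be geocoded contain duplicates, one way to save queries is to look up each distinct value only once and map the results back. The following sketch illustrates the idea in base R; fake_geocode is a hypothetical stand-in that returns made-up coordinates and counts its calls, so that the example runs without any network access. In real code, the call to geocode would take its place.

```r
# Hypothetical stand-in for a geocoder: counts calls instead of hitting the network
counter <- new.env()
counter$n <- 0L
fake_geocode <- function(address) {
  counter$n <- counter$n + 1L
  # Made-up coordinates derived from the string length, for illustration only
  data.frame(lon = -nchar(address), lat = nchar(address))
}

addresses <- c("Houston, TX", "Dallas, TX", "Houston, TX", "Houston, TX")

# Query each distinct address only once ...
uniq   <- unique(addresses)
coords <- do.call(rbind, lapply(uniq, fake_geocode))

# ... then map the results back to the original order
result <- data.frame(
  address = addresses,
  coords[match(addresses, uniq), ],
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(result)
print(counter$n)  # number of lookups actually made
```

In our aggregated dataset, every destination already appears only once, so this would not reduce the 116 queries; the trick pays off when geocoding a raw column with many repeated values.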

Some of the methods and functions that we plan to use in some later sections of this chapter do not support data.table, so let's fall back to the traditional data.frame format and also print the structure of the current object:

> str(setDF(dt))
'data.frame':  116 obs. of  8 variables:
 $ Dest     : chr  "DFW" "MIA" "SEA" "JFK" ...
 $ N        : int  6653 2463 2615 695 402 6823 4893 5022 6064 ...
 $ Cancelled: int  153 24 4 18 1 40 40 27 33 28 ...
 $ Distance : int  224 964 1874 1428 3904 305 191 140 1379 862 ...
 $ TimeVar  : num  10 12.4 16.5 19.2 15.3 ...
 $ ArrDelay : num  5.961 0.649 9.652 9.859 10.927 ...
 $ lon      : num  -97 136.5 -122.3 -73.8 -157.9 ...
 $ lat      : num  32.9 34.7 47.5 40.6 21.3 ...

This was pretty quick and easy, wasn't it? Now that we have the longitude and latitude values of all the airports, we can try to show these points on a map.
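Before plotting, it might be worth checking whether the returned coordinates look plausible, as short strings such as airport codes can be ambiguous for a geocoder. The sketch below copies the first four rows from the str output above and flags the points outside a bounding box. The box limits are an assumption chosen for this illustration (a deliberately generous rectangle around the United States), not something defined by ggmap or hflights.

```r
# Coordinates copied from the first four rows of the str() output above
airports <- data.frame(
  Dest = c("DFW", "MIA", "SEA", "JFK"),
  lon  = c(-97, 136.5, -122.3, -73.8),
  lat  = c(32.9, 34.7, 47.5, 40.6),
  stringsAsFactors = FALSE
)

# Assumed, deliberately generous bounding box around the United States
lon_range <- c(-180, -60)
lat_range <- c(15, 75)

# TRUE for points that fall inside the box on both axes
inside <- airports$lon >= lon_range[1] & airports$lon <= lon_range[2] &
          airports$lat >= lat_range[1] & airports$lat <= lat_range[2]

# Rows falling outside the box deserve a manual look
suspicious <- airports[!inside, ]
print(suspicious)
```

Here, the row for MIA is flagged, as its longitude of 136.5 lies in the eastern hemisphere. A possible remedy, not tested here, is to send a more specific string (for example, the code followed by the word "airport") to the geocoder.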
