Joining geometries with tabular data

In this section, we are going to join an attribute table of a spatial vector layer with plain tabular data (as opposed to joining with the attribute table of another layer based on spatial location). In spatial analysis practice, we often have, at hand, a spatial layer and supplementary tabular data as separate objects, while we would like to work with them in combination. For instance, in the present example, USA Census data regarding county population sizes (a CSV file) will be linked to the county layer defining county geometries, in order to calculate county population densities. To do the latter, we need to know both the population size (from the census table) and area size (from the vector layer) for each county, and the only way to do that is to join both datasets.

The intermediate steps we will perform are as follows:

  1. Read the USA Census data.
  2. Subset the portion of the data we are interested in.
  3. Prepare a common key to join the census data with the county layer.
  4. Join!
  5. Calculate the population density.

Our first step is to read the USA Census data into R.

Note

The USA Census data was downloaded from the United States Census Bureau available at https://www.census.gov/popest/data/counties/totals/2012/CO-EST2012-alldata.html.

The CSV file, CO-EST2012-Alldata.csv, is provided on the book's website in its original form. The following expression reads its contents and assigns it to a data.frame object named dat:

> dat = read.csv("C:\Data\CO-EST2012-Alldata.csv")

The expression colnames(dat) reveals that we have as many as 52 variables in this table. To save space, only the first 15 values are printed here:

> colnames(dat)[1:15]
 [1] "SUMLEV"            "REGION"            "DIVISION"         
 [4] "STATE"             "COUNTY"            "STNAME"           
 [7] "CTYNAME"           "CENSUS2010POP"     "ESTIMATESBASE2010"
[10] "POPESTIMATE2010"   "POPESTIMATE2011"   "POPESTIMATE2012"  
[13] "NPOPCHG_2010"      "NPOPCHG_2011"      "NPOPCHG_2012"

As the accompanying file CO-EST2012-alldata.pdf informs, the first seven variables are geographical identifiers such as STATE (state FIPS code) or CTYNAME (county name), while the other 45 are measured variables related to demography, such as CENSUS2010POP (April 4, 2010 resident Census 2010 population) or BIRTHS2011 (births between July 1, 2010 to June 30, 2011). For our example, we will use only one of the measured variables (CENSUS2010POP), which is found in the 8th column. When joining the data with the county layer, however, we will also require the STATE column (the fourth column) and COUNTY column (the fifth column). Using the latter two columns, we will create a FIPS code column in dat that we will use to join dat to the attribute table of county.

We could subset the three columns of interest by simply using the expression dat[,c(4,5,8)]. However, it is recommended you refer to the column names themselves rather than to their numeric indices, to make the code more general (so that it will still work if the column order is altered):

> selected_cols = c("STATE", "COUNTY", "CENSUS2010POP")
> dat = dat[, colnames(dat) %in% selected_cols]

Note that here, we first created a vector with the column names of interest (selected_cols), used it to create a logical vector pointing to the indices of the respective columns (colnames(dat) %in% selected_cols), and finally used this logical vector as the column index of dat to create the subset.

The first few rows of dat now appear as follows:

> head(dat)
  STATE COUNTY CENSUS2010POP
1     1      0       4779736
2     1      1         54571
3     1      3        182265
4     1      5         27457
5     1      7         22915
6     1      9         57322

For convenience, we can convert the column names to lowercase using the tolower function:

> colnames(dat) = tolower(colnames(dat))
> colnames(dat)
[1] "state"         "county"        "census2010pop"

Examining the table will also reveal that it contains subtotal entries, for entire states, in addition to the county entries. These entries are marked with the value 0 in the county column. Since we are interested in counties, not states, these entries need to be removed:

> dat = dat[dat$county != 0, ]

With the first and second steps now complete, we are going to create a key with which the county layer and the dat table will be joined. The least problematic option is to use the FIPS codes as a key since county names may be slightly different among datasets.

We already have a FIPS column in the county layer. For example, the FIPS codes of the first 10 features are as follows:

> county$FIPS[1:10]
 [1] "09005" "09003" "09013" "09015" "06093" "06015" "06049"
 [8] "09011" "09001" "09007"

These five-digit codes encompass the state code (digits 1 to 2) and county code (digits 3 to 5). As we can see, when the state or county code is fewer than two or three digits, it is preceded by zeros. For example, the first county polygon has the FIPS code 09005, which means its state code is 9 (state of Connecticut) and county code is 5 (Litchfield County). However, in the dat table, the state and county FIPS codes are kept as separate numeric values without leading zeros (9 and 5). In order to get matching values with the county layer, we need to perform the following steps:

  1. Add leading zeros in the state column to get uniform two-digit codes.
  2. Add leading zeros in the county column to get uniform three-digit codes.
  3. Paste the state and county columns together to get the FIPS codes.

The first two steps can be performed using the formatC function. This function can deal with several formatting tasks including the one we need—padding the values of a vector with leading zeros, to obtain a common character length. In this particular case, we need to specify three arguments: the vector to work upon, the required final character count (the width parameter), and the format modifier "0" (the flag parameter, where "0" marks the pad with leading zeros scenario; see ?formatC):

> dat$state = formatC(dat$state, width = 2, flag = "0")
> dat$county = formatC(dat$county, width = 3, flag = "0")

We can examine the first few values of the modified state and county columns to make sure the expected outcome was obtained:

> dat$state[1:10]
 [1] "01" "01" "01" "01" "01" "01" "01" "01" "01" "01"
> dat$county[1:10]
 [1] "001" "003" "005" "007" "009" "011" "013" "015" "017" "019"

Note that the numeric values have been automatically converted to characters since a numeric value cannot have leading zeros.

Finally, we will paste the state and county codes into county FIPS codes and assign them to a new column called FIPS:

> dat$FIPS = paste0(dat$state, dat$county)

Now, we have in dat$FIPS an identical format as seen in county$FIPS:

> dat$FIPS[1:10]
 [1] "01001" "01003" "01005" "01007" "01009" "01011" "01013"
 [8] "01015" "01017" "01019"

We are ready to move on to the fourth step—joining the attribute table of county with the dat table. In fact, we are going to use a subset of dat, containing only two columns: FIPS (since the join will be based upon it) and census2010pop (since these are the data we are interested in) because the state and county columns are of no use to us at this stage. The join operation is done using the join function from the plyr package (see Chapter 3, Working with Tables):

> library(plyr)
> county@data = join(county@data,
+ dat[, colnames(dat) %in% c("FIPS", "census2010pop")],
+ by = "FIPS")

Tip

Using county@data as the first argument in join and the type of join that is being used ("left", by default), ensures that all entries in county@data are preserved in their original order (regardless of whether they have a match in dat). This is extremely important since the rows of county@data correspond to the county polygons. Thus, changing the order of rows would result in a discrepancy between the attribute table and the spatial features. Concerning vector layers, operations such as join(county@data,x,type="left") are safe to perform, while manual modification of the @data component involving altering the row order (for example, deleting a single row) should be generally avoided.

As a result, we now have the matching census2010pop entries in the attribute table of county:

> head(county@data)
       NAME_1     NAME_2 TYPE_2  FIPS      area census2010pop
1 Connecticut Litchfield County 09005  2451.876        189927
2 Connecticut   Hartford County 09003  1941.110        894014
3 Connecticut    Tolland County 09013  1077.789        152691
4 Connecticut    Windham County 09015  1350.476        118428
5  California   Siskiyou County 06093 16416.572         44900
6  California  Del Norte County 06015  2626.707         28610

We can check and see that only one entry in the county layer could not be matched with a census2010pop entry from dat (and thus, has NA in the census2010pop column). This entry corresponds to Clifton Forge City, Virginia:

> county@data[is.na(county$census2010pop),
+ c("NAME_1", "NAME_2")]
       NAME_1             NAME_2
2591 Virginia Clifton Forge City

Our fifth and final step will be to calculate population densities for each county, by dividing the population size (the census2010pop column) by county area (the area column). The result can be assigned to a new column, named density, in the attribute table of county:

> county$density = county$census2010pop / county$area

Examining the attribute table shows that the new column, holding average population density per km2, has indeed been added to the county layer:

> head(county@data)
       NAME_1     NAME_2 TYPE_2  FIPS      area census2010pop
1 Connecticut Litchfield County 09005  2451.876        189927
2 Connecticut   Hartford County 09003  1941.110        894014
3 Connecticut    Tolland County 09013  1077.789        152691
4 Connecticut    Windham County 09015  1350.476        118428
5  California   Siskiyou County 06093 16416.572         44900
6  California  Del Norte County 06015  2626.707         28610
     density
1  77.461920
2 460.568482
3 141.670638
4  87.693548
5   2.735041
6  10.891965

Preparing a map of population densities per county is postponed until we reach Chapter 9, Advanced Visualization of Spatial Data.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.17.157.6