Adding geographical information

The main purpose of this chapter is to look at the geographical distribution of wages across the US. Mapping this out requires us to first have a map. Fortunately, maps of the US, at both the state and county levels, are available in the maps package, and the data required to draw the maps can be extracted. In this recipe, we will align our employment data with the map data so that the correct values are represented at the right locations on the map.

Getting ready

We already have the area dataset imported into R, so we are ready to go.

How to do it…

The following steps will guide you through the process of creating your first map in R:

  1. Let's first look at the data in area:
    head(area)
    

    The output is shown in the following screenshot:


    We see that there is something called area_fips here. Federal Information Processing Standards (FIPS) codes are used by the Census Bureau to designate counties and other geographical areas in the US.

  2. We want to capitalize the first letter of each word in the names, following the usual naming conventions. We'll write a small function to do this:
    simpleCap <- function(x){
      if(!is.na(x)){
        s <- strsplit(x, ' ')[[1]]
        paste(toupper(substring(s, 1, 1)), substring(s, 2),
              sep='', collapse=' ')
      } else {NA}
    }
    
  3. The maps package contains two datasets that we will use; they are county.fips and state.fips. We will first do some transformations. If we look at county.fips, we notice that the FIPS code there is missing a leading 0 on the left for some of the codes. All the codes in our employment data comprise five digits:
    > data(county.fips)
    > head(county.fips)
      fips        polyname
    1 1001 alabama,autauga
    2 1003 alabama,baldwin
    3 1005 alabama,barbour
    4 1007    alabama,bibb
    5 1009  alabama,blount
    6 1011 alabama,bullock
    
  4. The stringr package will help us out here:
    library(stringr)
    county.fips$fips <- str_pad(county.fips$fips, width=5, pad="0")
    
  5. We want to separate the county names from the polyname column in county.fips. We'll get the state names from state.fips in a minute:
    county.fips$polyname <- as.character(county.fips$polyname)
    county.fips$county <- sapply(
      gsub('[a-z ]+,([a-z ]+)','\\1',county.fips$polyname),
      simpleCap)
    county.fips <- unique(county.fips)
    
  6. The state.fips data involves a lot of details:
    > data(state.fips)
    

    The output is shown in the following screenshot:

  7. We'll again pad the fips column with a 0, if necessary, so that they have two digits, and capitalize the state names from polyname to create a new state column. The code is similar to the one we used for the county.fips data:
    state.fips$fips <- str_pad(state.fips$fips, width=2, pad="0", 
    side='left')
    state.fips$state <- as.character(state.fips$polyname)
    state.fips$state <- gsub("([a-z ]+):[a-z ']+",'\\1',state.fips$state)
    state.fips$state <- sapply(state.fips$state, simpleCap)
    
  8. We make sure that we have unique rows. We need to be careful here, since we only need uniqueness in the fips, abb, and state values, not in the other columns, so we select just those columns before calling unique:
    mystatefips <- unique(state.fips[,c('fips','abb','state')])
    

    The unique function, when applied to a data.frame object, returns the unique rows of the object. You might be used to using unique on a single vector to find the unique elements in the vector.

  9. We get a list of the lower 48 state names. We will filter our data to look only at these states:
    lower48 <- setdiff(unique(state.fips$state),c('Hawaii','Alaska'))
    

    Note

    The setdiff set operation looks for all the elements in the first set that are not in the second set.

  10. Finally, we put all this information together into a single dataset, myarea:
    myarea <- merge(area, county.fips, by.x='area_fips', by.y='fips', all.x=T)
    myarea$state_fips <- substr(myarea$area_fips, 1, 2)
    myarea <- merge(myarea, mystatefips, by.x='state_fips', by.y='fips', all.x=T)
    
  11. Lastly, we join the geographical information with our dataset, and filter it to keep only data on the lower 48 states:
    ann2012full <- left_join(ann2012full, myarea)
    ann2012full <- filter(ann2012full, state %in% lower48)
    
  12. We now store the final dataset in an R data (rda) file on disk. This provides an efficient storage mechanism for R objects:
    save(ann2012full, file='data/ann2012full.rda', compress=T)
    
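The behavior of unique on a data.frame and of setdiff, used in steps 8 and 9, can be seen on a small toy data frame (the values below are invented for illustration, not taken from the recipe's data):

```r
# Toy data frame with a duplicated row (illustrative values only)
df <- data.frame(fips  = c('01', '01', '02'),
                 state = c('Alabama', 'Alabama', 'Alaska'),
                 stringsAsFactors = FALSE)

unique(df)                    # drops the duplicated row, leaving 2 rows
setdiff(df$state, 'Alaska')   # elements of the first set not in the second
```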

How it works…

The 12 steps of this recipe covered quite a bit of material, so let's dive into some of the details, starting with step 2. The simpleCap function is an example of a function in R. We use functions to encapsulate repeated tasks, reducing code duplication and ensuring that errors have a single point of origin. If we merely repeat code, changing the input values manually, we can easily make errors in transcription, break hidden assumptions, or accidentally overwrite important variables. Further, if we want to modify the code, we have to do it manually at every duplicate location. This is tedious and error-prone, and so we make functions, a best practice that we strongly encourage you to follow.

The simpleCap function uses three functions: strsplit, toupper and substring. The strsplit function splits strings (or a vector of strings) whenever it finds the string fragment to split on (in our case, ' ' or a space). The substring function extracts substrings from strings between the character locations specified. Specifying only one character location implies extracting from this location to the end of the string. The toupper function changes the case of a string from lowercase to uppercase. The reverse operation is done by tolower.
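The three functions can be seen working together on a sample string (the input 'new york' here is just an illustration):

```r
s <- strsplit('new york', ' ')[[1]]    # splits into c("new", "york")
toupper(substring(s, 1, 1))            # first letter of each word, uppercased
substring(s, 2)                        # the remainder of each word
paste(toupper(substring(s, 1, 1)), substring(s, 2),
      sep = '', collapse = ' ')        # reassembled as "New York"
```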

As we saw in step 3, packages often have example data bundled with them; county.fips and state.fips are two such datasets bundled into the maps package.

The stringr package, used in step 4, is another package by Dr. Wickham, which provides string manipulation functions. Here, we use str_pad, which pads a string with a character (here, 0) to give the string a particular width.
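For instance, padding a four-digit FIPS code to five characters (1001, Autauga County, appears in the county.fips output in step 3):

```r
library(stringr)

str_pad(1001, width = 5, pad = '0')   # pads on the left by default: "01001"
str_pad('02', width = 2, pad = '0')   # already wide enough, so unchanged
```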

In step 5, we use the inbuilt regular expression (regex) capabilities in R. We won't discuss regular expressions in depth here. The gsub function takes a pattern, a replacement, and the string to operate on, and substitutes every match of the pattern with the replacement. Here, the pattern we're looking for comprises one or more letters or spaces ([a-z ]+), then a comma, and then one or more letters or spaces. The second set of letters and spaces is what we want to keep, so we put parentheses around it. The replacement '\\1' is a backreference: it says to replace the entire match with whatever the first parenthesized group captured. This replacement happens for every element of the polyname field.
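Applied to one of the polyname values shown in step 3, the pattern and backreference work like this:

```r
# Keep only the county part after the comma; '\\1' refers to the
# parenthesized group ([a-z ]+) in the pattern
gsub('[a-z ]+,([a-z ]+)', '\\1', 'alabama,autauga')   # "autauga"
```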

Since we want to capitalize every element in polyname, we could use a for loop, but we choose the more concise sapply instead. Every element is passed through the simpleCap function, and is thus capitalized, in steps 5 and 7.
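A minimal sketch of this sapply pattern, using the simpleCap function from step 2 on two made-up names:

```r
simpleCap <- function(x){
  if(!is.na(x)){
    s <- strsplit(x, ' ')[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2),
          sep = '', collapse = ' ')
  } else {NA}
}

# sapply applies the function to each element and simplifies the
# result to a character vector
sapply(c('new york', 'rhode island'), simpleCap, USE.NAMES = FALSE)
# c("New York", "Rhode Island")
```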

In step 10, we join the area, county.fips, and mystatefips datasets together. We use the merge function rather than left_join, since the variables we want to join on have different names in the two data.frame objects; the merge function in base R allows this flexibility. To ensure a left join, we specify all.x=TRUE.
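The differing key names and the all.x behavior can be sketched with two toy data frames (the values here are invented for illustration):

```r
left  <- data.frame(area_fips = c('01001', '99999'),
                    avg_wage  = c(100, 200),
                    stringsAsFactors = FALSE)
right <- data.frame(fips   = '01001',
                    county = 'Autauga',
                    stringsAsFactors = FALSE)

# by.x/by.y name the join columns on each side; all.x=TRUE keeps every
# row of the left table, filling unmatched rows with NA (a left join)
merge(left, right, by.x = 'area_fips', by.y = 'fips', all.x = TRUE)
```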

In step 11, we join the myarea data frame to our ann2012full dataset. We then use the filter function to subset the data, restricting it to data from the lower 48 states. The filter function is from the dplyr package. We'll speak about the functionality of dplyr in the next recipe.

See also
