The main purpose of this chapter is to look at the geographical distribution of wages across the US. Mapping this out requires us to first have a map. Fortunately, maps of the US, at both the state and county levels, are available in the maps package, and the data required to draw the maps can be extracted. In this recipe, we will align our employment data with the map data so that the correct data is represented at the right location on the map.
The following steps will guide you through the process of creating your first map in R:
1. Let's first take a look at the data in area:

   head(area)
The output is shown in the following screenshot:
We see that there is a column called area_fips here. Federal Information Processing Standards (FIPS) codes are used by the Census Bureau to designate counties and other geographical areas in the US.
2. We want to capitalize the county and state names, so we write a small function to do this:

   simpleCap <- function(x){
     if(!is.na(x)){
       s <- strsplit(x, ' ')[[1]]
       paste(toupper(substring(s, 1, 1)), substring(s, 2),
             sep = '', collapse = ' ')
     } else {
       NA
     }
   }
3. The maps package contains two datasets that we will use: county.fips and state.fips. We will first do some transformations. If we look at county.fips, we notice that the FIPS code there is missing a leading 0 on the left for some of the codes. All the codes in our employment data comprise five digits:

   > data(county.fips)
   > head(county.fips)
      fips        polyname
   1  1001 alabama,autauga
   2  1003 alabama,baldwin
   3  1005 alabama,barbour
   4  1007    alabama,bibb
   5  1009  alabama,blount
   6  1011 alabama,bullock
4. The stringr package will help us out here:

   county.fips$fips <- str_pad(county.fips$fips, width=5, pad="0")
5. We want to separate out the county names from the polyname column in county.fips. We'll get the state names from state.fips in a minute:

   county.fips$polyname <- as.character(county.fips$polyname)
   county.fips$county <- sapply(
     gsub('[a-z ]+,([a-z ]+)', '\\1', county.fips$polyname),
     simpleCap)
   county.fips <- unique(county.fips)
6. The state.fips data involves a lot of details:

   > data(state.fips)
The output is shown in the following screenshot:
7. We will again pad the codes in the fips column with a 0, if necessary, so that they have two digits, and capitalize the state names from polyname to create a new state column. The code is similar to the one we used for the county.fips data:

   state.fips$fips <- str_pad(state.fips$fips, width=2, pad="0", side='left')
   state.fips$state <- as.character(state.fips$polyname)
   state.fips$state <- gsub("([a-z ]+):[a-z ']+", '\\1', state.fips$state)
   state.fips$state <- sapply(state.fips$state, simpleCap)
8. We keep only one row for each unique combination of the fips, abb, and state values:

   mystatefips <- unique(state.fips[, c('fips','abb','state')])
The unique function, when applied to a data.frame object, returns the unique rows of the object. You might be used to using unique on a single vector to find the unique elements in the vector.
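As a quick illustration of this behavior on rows (using a made-up two-column data frame, not the recipe's real data):

```r
df <- data.frame(fips = c('01', '01', '02'),
                 abb  = c('AL', 'AL', 'AK'))

# unique() on a data.frame keeps one copy of each distinct row
unique(df)   # two rows remain: ('01','AL') and ('02','AK')
```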
9. We want to restrict our analysis to the lower 48 states, so we exclude Hawaii and Alaska:

   lower48 <- setdiff(unique(state.fips$state), c('Hawaii','Alaska'))
10. We put all of this information together into a single dataset, myarea:

   myarea <- merge(area, county.fips, by.x='area_fips', by.y='fips', all.x=T)
   myarea$state_fips <- substr(myarea$area_fips, 1, 2)
   myarea <- merge(myarea, mystatefips, by.x='state_fips', by.y='fips', all.x=T)
11. Finally, we join the geographical information with our dataset, and restrict it to the lower 48 states:

   ann2012full <- left_join(ann2012full, myarea)
   ann2012full <- filter(ann2012full, state %in% lower48)
12. We will store the final dataset in an R data (rda) file on disk. This provides an efficient storage mechanism for R objects:

   save(ann2012full, file='data/ann2014full.rda', compress=T)
The 12 steps of this recipe covered quite a bit of material, so let's dive into some of the details, starting with step 2. The simpleCap function is an example of a function in R. We use functions to encapsulate repeated tasks, reducing code duplication and ensuring that errors have a single point of origin. If we merely repeat code, changing the input values manually, we can easily make errors in transcription, break hidden assumptions, or accidentally overwrite important variables. Further, if we want to modify the code, we have to do it manually at every duplicate location. This is tedious and error-prone, and so we make functions, a best practice that we strongly encourage you to follow.
The simpleCap function uses three functions: strsplit, toupper, and substring. The strsplit function splits strings (or a vector of strings) wherever it finds the string fragment to split on (in our case, ' ', or a space). The substring function extracts substrings from strings between the character locations specified. Specifying only one character location implies extracting from that location to the end of the string. The toupper function changes the case of a string from lowercase to uppercase. The reverse operation is done by tolower.
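To see these pieces working together, here is the function from step 2 again (repeated so the snippet is self-contained), applied to a sample name:

```r
simpleCap <- function(x){
  if(!is.na(x)){
    s <- strsplit(x, ' ')[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2),
          sep = '', collapse = ' ')
  } else {
    NA
  }
}

simpleCap('district of columbia')   # "District Of Columbia"
simpleCap(NA)                       # NA is passed through unchanged
```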
As we saw in step 3, packages often have example data bundled with them. county.fips and state.fips are examples of datasets that have been bundled into the maps package.
The stringr package, used in step 4, is another package by Dr. Wickham, which provides string manipulation functions. Here, we use str_pad, which pads a string with a character (here, 0) to give the string a particular width.
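A couple of quick examples of the padding (assuming stringr is installed); strings already at the target width are left alone:

```r
library(stringr)

str_pad('1001', width = 5, pad = '0')               # "01001"
str_pad('1', width = 2, pad = '0', side = 'left')   # "01"
str_pad('12345', width = 5, pad = '0')              # "12345", unchanged
```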
In step 5, we use the regular expression (regex) capabilities built into R. We won't talk about regular expressions too much here. The gsub function looks for the first pattern in the string supplied as the third argument, and substitutes the second argument wherever the pattern matches. Here, the pattern we're looking for comprises one or more letters or spaces ([a-z ]+), then a comma, and then one or more letters or spaces. The second set of letters and spaces is what we want to keep, so we put parentheses around it. The \1 pattern (escaped as '\\1' in R source code) says to replace the entire match with the first parenthesized group. This replacement happens for every element of the polyname field.
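Here are both patterns from the recipe applied to single sample values, so the effect of the backreference is visible:

```r
# Keep only the part after the comma (the county name)
gsub('[a-z ]+,([a-z ]+)', '\\1', 'alabama,autauga')      # "autauga"

# The analogous state.fips pattern keeps the part before a colon
gsub("([a-z ]+):[a-z ']+", '\\1', 'massachusetts:main')  # "massachusetts"
```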
Since we want capitalization for every element in polyname, we could use a for loop, but choose to use the more efficient sapply instead. Every element in polyname is passed through the simpleCap function, and is thus capitalized, in step 7.
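As a sketch of this pattern, using a minimal stand-in capitalizer (cap_first here is hypothetical; the recipe itself uses simpleCap):

```r
# A minimal capitalizer standing in for simpleCap
cap_first <- function(x) paste0(toupper(substring(x, 1, 1)), substring(x, 2))

# sapply applies the function to each element and returns a (named) vector
sapply(c('alabama', 'ohio'), cap_first)   # "Alabama" "Ohio"
```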
In step 10, we join the area, county.fips, and mystatefips datasets together. We use the merge function rather than left_join, since the variables we want to join on have different names in the different data.frame objects. The merge function in the R standard library allows this flexibility. To ensure a left join, we specify all.x=TRUE.
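A minimal sketch with toy data frames (not the recipe's real ones) showing how by.x/by.y join on differently named key columns, and how all.x=TRUE keeps unmatched rows:

```r
left  <- data.frame(area_fips = c('01001', '01003', '99999'),
                    wage = c(100, 200, 300))
right <- data.frame(fips   = c('01001', '01003'),
                    county = c('Autauga', 'Baldwin'))

# Left join: every row of 'left' survives; unmatched rows get NA county
m <- merge(left, right, by.x = 'area_fips', by.y = 'fips', all.x = TRUE)
```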
In step 11, we join the myarea data frame to our ann2012full dataset. We then use the filter function to subset the data, restricting it to data from the lower 48 states. The filter function is from the dplyr package. We'll speak about the functionality in dplyr in the next recipe.
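A toy illustration of filter with %in% (assuming dplyr is installed; the data frame here is made up for the example):

```r
library(dplyr)

df <- data.frame(state = c('Ohio', 'Alaska', 'Texas', 'Hawaii'))
keep <- setdiff(df$state, c('Hawaii', 'Alaska'))

# Keep only rows whose state appears in the allowed set
filter(df, state %in% keep)   # rows for Ohio and Texas
```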
There is an article describing the stringr library, available at http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf.