We created datasets that contain the data we need to visualize average pay and employment by county and state. In this recipe, we will visualize the geographical distribution of pay by shading the appropriate areas of the map with a color that maps to a particular value or range of values. This is commonly referred to as a choropleth map; this visualization type has become increasingly popular over the last few years as it has become much simpler to make such maps, especially online. Other geographic visualizations overlay a marker or some other shape to denote data; they do not need to fill shapes that have geographically meaningful boundaries.
After the last recipe, you should be ready to use the datasets we created to visualize geographical distributions. We will use the ggplot2 package to generate our visualizations. We will also use the RColorBrewer package, which provides "palettes" of colors that are visually appealing. If you don't currently have RColorBrewer, install it using install.packages('RColorBrewer', repos='http://cran.r-project.org').
The following steps walk you through the creation of this geospatial data visualization:
1. The ggplot2 package provides a convenient function, map_data, to extract the map outlines we need from data bundled in the maps package:
    library(ggplot2)
    library(RColorBrewer)
    state_df <- map_data('state')
    county_df <- map_data('county')
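2. Rename the region and subregion columns to state and county, and capitalize their values so that they match our employment data:
    # simpleCap is assumed to be the capitalization helper defined in the previous recipe
    transform_mapdata <- function(x){
      names(x)[5:6] <- c('state','county')
      for(u in c('state','county')){
        x[,u] <- sapply(x[,u],simpleCap)
      }
      return(x)
    }
    state_df <- transform_mapdata(state_df)
    county_df <- transform_mapdata(county_df)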
3. The data.frame objects, state_df and county_df, contain the latitude and longitude of points. These are our primary graphical data and need to be joined with the data we created in the previous recipe, which contains what is in effect the color information for the map:
    library(dplyr)  # provides left_join
    chor <- left_join(state_df, d.state, by='state')
    ggplot(chor, aes(long,lat,group=group))+
      geom_polygon(aes(fill=wage))+
      geom_path(color='black',size=0.2)+
      scale_fill_brewer(palette='PuRd')+
      theme(axis.text.x=element_blank(),
            axis.text.y=element_blank(),
            axis.ticks.x=element_blank(),
            axis.ticks.y=element_blank())
This gives us the following figure that depicts the distribution of average annual pay by state:
4. Repeat the plot at the county level by joining county_df with d.cty and overlaying the state boundaries:
    # with no 'by' argument, left_join matches on all columns the two tables share
    chor <- left_join(county_df, d.cty)
    ggplot(chor, aes(long,lat, group=group))+
      geom_polygon(aes(fill=wage))+
      geom_path(color='white',alpha=0.5,size=0.2)+
      geom_polygon(data=state_df, color='black',fill=NA)+
      scale_fill_brewer(palette='PuRd')+
      labs(x='',y='', fill='Avg Annual Pay')+
      theme(axis.text.x=element_blank(),
            axis.text.y=element_blank(),
            axis.ticks.x=element_blank(),
            axis.ticks.y=element_blank())
This produces the following figure showing the geographical distribution of average annual pay by county:
It is evident from the preceding figure that there are well-paying jobs in western North Dakota, Wyoming, and northwestern Nevada, most likely driven by new oil exploration opportunities in these areas. The more obvious urban and coastal areas also show up quite nicely.
Let's look at how the preceding four steps work. The map_data function is provided by ggplot2 to extract map data from the maps package. In addition to county and state, it can also extract data for the france, italy, nz, usa, world, and world2 maps provided by the maps package.
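As a quick illustration of the other maps, the following minimal sketch extracts the france map and inspects its columns. It assumes that both ggplot2 and maps are installed; the name france_df is ours and is used only for illustration:

```r
library(ggplot2)  # map_data() also requires the maps package to be installed

# Extract the outline data for one of the other bundled maps
france_df <- map_data('france')

# The result has the same column layout as state_df and county_df,
# with region and subregion in the fifth and sixth positions
names(france_df)
head(france_df)
```

The same call with 'world' or 'nz' returns a data frame with the same columns, which is why a single helper function can transform both the state and county data in step 2.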
The columns that contain state and county information in county_df and state_df are originally named region and subregion. In step 2, we need to change their names to state and county, respectively, to make joining this data with our employment data easier. We also capitalize the names of the states and counties to conform to the way we formatted the data in our employment dataset.
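The renaming and capitalization in step 2 can be tried on a small toy data frame before running it on the real map data. In this sketch, the simpleCap definition is an assumption (adapted from the example on R's ?toupper help page) standing in for the helper from the previous recipe, and the toy rows are made up to mimic the shape of map_data output:

```r
# Assumed helper: capitalizes the first letter of each word in a string
simpleCap <- function(x) {
  s <- strsplit(x, ' ')[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = '', collapse = ' ')
}

# Toy rows shaped like map_data('county') output; the names sit in columns 5 and 6
toy <- data.frame(
  long = c(-86.5, -73.9), lat = c(32.3, 40.7),
  group = c(1, 2), order = c(1, 2),
  region = c('alabama', 'new york'),
  subregion = c('autauga', 'new york'),
  stringsAsFactors = FALSE
)

# Same two operations as transform_mapdata: rename, then capitalize
names(toy)[5:6] <- c('state', 'county')
for (u in c('state', 'county')) {
  toy[, u] <- sapply(toy[, u], simpleCap)
}
toy[, c('state', 'county')]
```

Because a join matches keys exactly, a mismatch in capitalization between the map data and the employment data would leave wage missing for the affected areas, so it is worth checking a few names like this first.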
For the creation of the map in step 3, we create the plotting dataset by joining state_df and d.state using the name of the state. We then use ggplot to draw the map of the US and fill in each state with a color corresponding to the level of wage, the discretized average annual pay created in the previous recipe. To elaborate, we establish that the data for the plot comes from chor, and we draw polygons (geom_polygon) based on the latitude and longitude of the borders of each state, filling them with a color depending on how high wage is, and then we draw the actual boundaries of the states (geom_path) in black. We specify that we will use a color palette that starts at white, goes through purple, and has red corresponding to the highest level of wage. The remainder of the code handles formatting: specifying labels and removing axis annotations and ticks from the plot.
For step 4, the code is essentially the same as step 3, except that we draw polygons for the boundaries of the counties rather than the states. We add a layer to draw the state boundaries in black (geom_polygon(data=state_df, color='black', fill=NA)), in addition to the county boundaries in white.
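The layering used in step 4 can be sketched on toy geometry, without the real map data. Here three unit squares stand in for counties and their enclosing rectangle stands in for a state; all of the data and the object names counties, state, and p are made up for illustration:

```r
library(ggplot2)

# Toy "counties": three unit squares side by side, each with a discrete wage level
counties <- data.frame(
  long  = c(0, 1, 1, 0,  1, 2, 2, 1,  2, 3, 3, 2),
  lat   = rep(c(0, 0, 1, 1), times = 3),
  group = rep(1:3, each = 4),
  wage  = factor(rep(c('low', 'mid', 'high'), each = 4),
                 levels = c('low', 'mid', 'high'))
)

# Toy "state": the rectangle that encloses all three counties
state <- data.frame(
  long  = c(0, 3, 3, 0),
  lat   = c(0, 0, 1, 1),
  group = 1
)

p <- ggplot(counties, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = wage)) +                          # filled county shapes
  geom_path(color = 'white', alpha = 0.5) +                 # county outlines
  geom_polygon(data = state, color = 'black', fill = NA) +  # state outline drawn on top
  scale_fill_brewer(palette = 'PuRd')
p
```

Layers are drawn in the order they are added, so the unfilled state polygon goes last to keep its black border visible over the county fills.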
The ggplot2 documentation is available at http://www.ggplot2.org