Visualizing geographical distributions of pay

We created datasets that contain the data we need to visualize average pay and employment by county and state. In this recipe, we will visualize the geographical distribution of pay by shading the appropriate areas of the map with a color that maps to a particular value or range of values. This is commonly referred to as a choropleth map; this visualization type has become increasingly popular over the last few years as it has become much simpler to make such maps, especially online. Other geographic visualizations instead overlay a marker or some other shape to denote data, so they do not need to fill shapes that have geographically meaningful boundaries.

Getting ready

After the last recipe, you should be ready to use the datasets we created to visualize geographical distributions. We will use the ggplot2 package to generate our visualizations. We will also use the RColorBrewer package, which provides "palettes" of colors that are visually appealing. If you don't currently have RColorBrewer, install it using install.packages('RColorBrewer', repos='http://cran.r-project.org').
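Before starting, it can help to confirm that the packages this recipe relies on are available. The following is a minimal sketch rather than part of the original recipe. The package list is an assumption: it adds maps (which map_data reads from) and dplyr (which provides left_join) to the two packages named above, in case the previous recipe did not already install them.

```r
# Packages this recipe appears to rely on (the list itself is an assumption).
needed <- c('ggplot2', 'RColorBrewer', 'maps', 'dplyr')

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message('Install these first: ', paste(missing_pkgs, collapse = ', '))
} else {
  message('All packages are available.')
}
```

Anything reported as missing can be installed with install.packages, exactly as shown for RColorBrewer.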

How to do it…

The following steps walk you through the creation of this geospatial data visualization:

  1. We first need to get some data on the map itself. The ggplot2 package provides a convenient function, map_data, to extract this from data bundled in the maps package:
    library(ggplot2)
    library(RColorBrewer)
    state_df <- map_data('state')
    county_df <- map_data('county')
    
  2. We now do a bit of transforming to make this data conform to our data:
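    # simpleCap is not defined in this recipe; it must already exist in the session
    transform_mapdata <- function(x){
       # columns 5 and 6 are originally named region and subregion
       names(x)[5:6] <- c('state','county')
       for(u in c('state','county')){
         x[,u] <- sapply(x[,u],simpleCap)
       }
       return(x)
    }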
    state_df <- transform_mapdata(state_df)
    county_df <- transform_mapdata(county_df)
    
  3. The data.frame objects, state_df and county_df, contain the latitude and longitude of points. These are our primary graphical data and need to be joined with the data we created in the previous recipe, which contains what is in effect the color information for the map:
    chor <- left_join(state_df, d.state, by='state')
    ggplot(chor, aes(long,lat,group=group))+
      geom_polygon(aes(fill=wage))+
      geom_path(color='black',size=0.2)+
      scale_fill_brewer(palette='PuRd')+
      theme(axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks.x=element_blank(), axis.ticks.y=element_blank())
    

    This gives us the following figure that depicts the distribution of average annual pay by state:

  4. We can similarly create a visualization of the average annual pay by county, which gives us much more granular information about the geographical distribution of wages:
    chor <- left_join(county_df, d.cty)
    ggplot(chor, aes(long,lat, group=group))+
      geom_polygon(aes(fill=wage))+
      geom_path( color='white',alpha=0.5,size=0.2)+
      geom_polygon(data=state_df, color='black',fill=NA)+
      scale_fill_brewer(palette='PuRd')+
      labs(x='',y='', fill='Avg Annual Pay')+
      theme(axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks.x=element_blank(), axis.ticks.y=element_blank())
    

    This produces the following figure showing the geographical distribution of average annual pay by county:

    It is evident from the preceding figure that there are well-paying jobs in western North Dakota, Wyoming, and northwestern Nevada, most likely driven by new oil exploration opportunities in these areas. The more obvious urban and coastal areas also show up quite nicely.
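The joins in steps 3 and 4 are worth a quick sanity check, because any state or county whose name does not match exactly ends up with a missing wage and is drawn without a palette color. The following is a minimal sketch on made-up data, assuming the dplyr package is loaded; toy_map and toy_pay are hypothetical stand-ins for state_df and d.state, not the real datasets.

```r
library(dplyr)

# Hypothetical stand-in for state_df: four polygon vertices in two states.
toy_map <- data.frame(
  long  = c(-83.0, -82.9, -111.0, -110.9),
  lat   = c(40.0, 40.1, 39.0, 39.1),
  group = c(1, 1, 2, 2),
  order = 1:4,
  state = c('Ohio', 'Ohio', 'Utah', 'Utah'),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for d.state: pay data exists for only one state.
toy_pay <- data.frame(
  state = 'Ohio',
  wage  = factor('(40000,50000]'),
  stringsAsFactors = FALSE
)

# left_join keeps every vertex of toy_map, in its original row order,
# and attaches wage where the state name matches.
toy_chor <- left_join(toy_map, toy_pay, by = 'state')

# States that found no match have NA in wage.
unmatched <- unique(toy_chor$state[is.na(toy_chor$wage)])
print(unmatched)
```

Keeping the row order matters here: geom_polygon connects vertices in the order they appear, so a join that reshuffled rows would distort the shapes.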

How it works…

Let's look at how the preceding four steps work. The map_data function is provided by ggplot2 to extract map data from the maps package. In addition to county and state, it can also extract data for the france, italy, nz, usa, world, and world2 maps provided by the maps package.
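To see what map_data returns, it may help to pull one of the other maps and inspect it. This is a small illustrative sketch, assuming ggplot2 and maps are installed; the choice of the nz map and the name nz_df are arbitrary.

```r
library(ggplot2)  # map_data() reads its data from the maps package

# Extract the New Zealand map as a data frame of polygon vertices.
nz_df <- map_data('nz')

# Each row is one vertex: its coordinates, the polygon it belongs to (group),
# its drawing order, and the region/subregion names.
names(nz_df)
head(nz_df)
```

The same six columns appear for the state and county maps, which is why step 2 can refer to the region and subregion columns by position.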

The columns that contain state and county information in county_df and state_df are originally named region and subregion. In step 2, we need to change their names to state and county, respectively, to make joining this data with our employment data easier. We also capitalize the names of the states and counties to conform to the way we formatted the data in our employment dataset.
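The renaming and capitalization can be tried on a tiny data frame without loading any map data. In the sketch below, simple_cap_sketch is a hypothetical stand-in for the simpleCap helper, which this recipe uses but does not define, so its exact behavior in the original is an assumption. The toy rows are made up.

```r
# Hypothetical stand-in for simpleCap: capitalize the first letter of each word.
simple_cap_sketch <- function(x) {
  if (is.na(x)) return(NA_character_)
  s <- strsplit(x, ' ')[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = '', collapse = ' ')
}

# Made-up rows with the same six-column layout that map_data produces.
toy <- data.frame(
  long      = c(-102.5, -102.4),
  lat       = c(46.1, 46.2),
  group     = c(1, 1),
  order     = 1:2,
  region    = c('north dakota', 'north dakota'),
  subregion = c('adams', NA),
  stringsAsFactors = FALSE
)

# Same idea as transform_mapdata: rename columns 5 and 6, then capitalize.
names(toy)[5:6] <- c('state', 'county')
for (u in c('state', 'county')) {
  toy[[u]] <- vapply(toy[[u]], simple_cap_sketch, character(1), USE.NAMES = FALSE)
}
toy_out <- toy
print(toy_out)
```

The NA check matters for the state map, where the subregion column is mostly missing.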

For the creation of the map in step 3, we create the plotting dataset by joining state_df and d.state using the name of the state. We then use ggplot to draw the map of the US and fill in each state with a color corresponding to its level of wage, the discretized average annual pay created in the previous recipe. To elaborate, we establish that the data for the plot comes from chor, and we draw polygons (geom_polygon) based on the latitude and longitude of the borders of each state, filling them with a color depending on how high wage is, and then we draw the actual boundaries of the states (geom_path) in black. We specify a color palette that starts near white, goes through purple, and has red corresponding to the highest level of wage. The remainder of the code formats the plot by removing the axis annotations and ticks.
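Because scale_fill_brewer works with discrete values, each level of wage receives one color from the palette. The sketch below shows that pairing directly with RColorBrewer. The pay values and break points are made up for illustration, and wage_sketch is a hypothetical stand-in for the wage variable, whose actual bins come from the previous recipe.

```r
library(RColorBrewer)

# Made-up average annual pay values.
pay <- c(31000, 42000, 55000, 68000, 90000)

# Discretize pay into ordered bins (these break points are an assumption).
wage_sketch <- cut(pay, breaks = c(0, 40000, 60000, 80000, Inf), dig.lab = 6)

# Ask for one PuRd color per bin; the palette runs from light to dark.
pal <- brewer.pal(nlevels(wage_sketch), 'PuRd')
names(pal) <- levels(wage_sketch)
print(pal)
```

Running display.brewer.pal(4, 'PuRd') draws the same colors as swatches, which is a quick way to preview a palette before using it in a map.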

For step 4, the code is essentially the same as step 3, except that we draw polygons for the boundaries of the counties rather than the states. We add a layer to draw the state boundaries in black (geom_polygon(data=state_df, color='black', fill=NA)), in addition to the county boundaries in white.
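The layering can be seen in isolation on a few made-up squares. This is only an illustrative sketch, assuming ggplot2 is installed; toy_counties and toy_outline are hypothetical data, not the recipe's county_df and state_df.

```r
library(ggplot2)

# Three made-up square "counties" side by side, each with a wage level.
toy_counties <- data.frame(
  long  = c(0, 1, 1, 0,  1, 2, 2, 1,  2, 3, 3, 2),
  lat   = rep(c(0, 0, 1, 1), times = 3),
  group = rep(1:3, each = 4),
  wage  = factor(rep(c('low', 'mid', 'high'), each = 4),
                 levels = c('low', 'mid', 'high'))
)

# One made-up "state" outline enclosing all three counties.
toy_outline <- data.frame(
  long  = c(0, 3, 3, 0),
  lat   = c(0, 0, 1, 1),
  group = 1
)

p <- ggplot(toy_counties, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = wage), color = 'white') +               # filled counties
  geom_polygon(data = toy_outline, color = 'black', fill = NA) +  # outline on top
  scale_fill_brewer(palette = 'PuRd')
# print(p) draws the plot.
```

Layers are drawn in the order they are added, so the unfilled outline sits on top of the filled counties without hiding them.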

See also
