Visualizing our data

When dealing with geographic data, as we are here, it is immensely valuable to be able to plot that information. One way of doing that is with something called a choropleth map. A choropleth is essentially a geographic heat map. We are going to build a choropleth to create a heat map of average rental price by ZIP code.

The first thing we need for this is each property's ZIP code. Unfortunately for us, our dataset does not contain ZIP code information. We do, however, have the address for the properties. With a little help from the Google Maps API, we can retrieve this information.

Currently, the Google Maps API is a paid API. At the time of writing, the rate is a reasonable $5 per 1,000 calls, and Google also gives you a $200 credit each month, which covers about 40,000 calls at that rate. You can sign up for a free trial before billing starts, and you won't be billed unless you explicitly give the okay. Since there really is no free alternative out there, we'll go ahead and sign up for an account. The steps are as follows:

  1. Go to the Google Maps Geocoding API page at https://developers.google.com/maps/documentation/geocoding/intro.

  2. Click on GET STARTED in the upper right-hand corner. You'll next be prompted to create a project. Give it any name you like.

  3. Then enable billing.

  4. Next, enable your API keys.

  5. Once this is completed and you have your API keys, head back to the front page to enable the Geocoding API. Click on APIs in the left-hand side pane.

  6. Then, under Unused APIs, click Geocoding API.

Once all of this is complete and you have your API keys, install the googlemaps Python package. That can be done from your command line with pip install -U googlemaps.

Let's continue on now with this API in our Jupyter Notebook. We'll import our new mapping API and test it out:

import googlemaps 
 
gmaps = googlemaps.Client(key='YOUR_API_KEY_GOES_HERE') 
 
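# Parentheses let the expression span two lines; the second line would otherwise be a separate statement
ta = (df.loc[3,['address']].values[0] + ' '
      + df.loc[3,['neighborhood']].values[0].split(', ')[-1])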
 
ta 

The preceding code results in the following output:

Okay, so essentially, all we did in the preceding code was import and initialize our googlemaps client, and then piece together a usable address from one of our apartments. Let's now pass that address to the Google Maps API:

geocode_result = gmaps.geocode(ta) 
 
geocode_result 

The preceding code generates the following output:

Remember, we are looking to extract just the ZIP code here. The ZIP code is embedded in the JSON, but it will take a bit of work to extract due to the formatting of this response JSON object. Let's do that now:

for piece in geocode_result[0]['address_components']: 
    if 'postal_code' in piece['types'] : 
        print(piece['short_name']) 

The preceding code results in the following output:

It looks like we're getting the information we want. There is one caveat, however. Looking deeper into the address column, we can see that occasionally, a full address is not given. This will result in no ZIP code coming back. We'll just have to deal with that later. For now, let's build a function to retrieve the ZIP codes, as follows:

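import re
import numpy as np

def get_zip(row):
    """Return the ZIP code for one listing, or np.nan if none can be found."""
    try:
        addy = row['address'] + ' ' + row['neighborhood'].split(', ')[-1]
        print(addy)  # show progress while the API calls run
        # Only geocode addresses that start with a street number, such as '123 Main'
        if not re.match(r'^\d+\s\w', addy):
            return np.nan
        geocode_result = gmaps.geocode(addy)
        for piece in geocode_result[0]['address_components']:
            if 'postal_code' in piece['types']:
                return piece['short_name']
        return np.nan  # the response contained no postal code
    except Exception:
        # Missing fields, empty results, and API errors all count as "no ZIP"
        return np.nan


df['zip'] = df.apply(get_zip, axis=1)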

There's a fair bit of code in the preceding snippet, so let's talk about what's going on here.

First, at the bottom, you see that we are running an apply method on our DataFrame. Because we have set axis=1, each row of the df DataFrame will be passed into our function. Within the function, we piece together an address to send to the Google Maps Geocoding API. We use a regex to limit our calls to only those addresses that start with a street number. We then iterate over the JSON response to parse out the ZIP code. If we find a ZIP code, we return it; otherwise, we return np.nan, a null value. Note that this function will take some time to run, as we have to make many hundreds of calls and then parse each response.
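The two steps described above, the regex filter and the search through address_components, can be tried offline without spending any API calls. The following is a minimal sketch: the helper names starts_with_street_number and extract_postal_code are hypothetical, and sample_result is a hand-built stand-in with made-up values, not a real API response.

```python
import re

# Digits, one whitespace character, then a word character: '100 Example...'
STREET_NUMBER = re.compile(r'^\d+\s\w')


def starts_with_street_number(address):
    """Return True if the address begins with a street number."""
    return bool(STREET_NUMBER.match(address))


def extract_postal_code(geocode_result):
    """Return the short_name of the first postal_code component, or None."""
    if not geocode_result:
        return None
    for piece in geocode_result[0].get('address_components', []):
        if 'postal_code' in piece.get('types', []):
            return piece.get('short_name')
    return None


# Hand-built stand-in for a geocoding response (illustrative values only)
sample_result = [{
    'address_components': [
        {'long_name': '100', 'short_name': '100',
         'types': ['street_number']},
        {'long_name': 'Example Street', 'short_name': 'Example St',
         'types': ['route']},
        {'long_name': '10001', 'short_name': '10001',
         'types': ['postal_code']},
    ]
}]

print(starts_with_street_number('100 Example St Chelsea'))
print(starts_with_street_number('Example St at 5th Ave'))
print(extract_postal_code(sample_result))
print(extract_postal_code([]))
```

Only the first address would be sent to the API; the second is a cross-street description with no leading street number, so it would be skipped and recorded as a null value.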

Once that completes, we will have a DataFrame that now has the ZIP code for those properties that had a proper address provided. Let's take a look and see how many that actually is:

df[df['zip'].notnull()].count() 

The preceding code generated the following output:

So, we lost quite a bit of our data, but nevertheless, what we have now is more useful in many ways, so we will continue on.

First, since it takes so long to retrieve all the ZIP code data, let's now store what we have so that we can always retrieve it later if necessary, and not have to make all those API calls again. We do that with the following code:

df.to_csv('apts_with_zip.csv') 
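One thing to watch when reading the saved file back later: pandas infers column types again, so a ZIP column with missing values may not come back as text. The following minimal sketch shows the effect using an in-memory buffer and made-up rows in place of the real file; the column names mirror ours, but the data is purely illustrative.

```python
import io

import pandas as pd

# Made-up rows standing in for the real listings; one has no ZIP code
sample = pd.DataFrame({
    'address': ['100 Example St', 'Example St at 5th Ave'],
    'rent': [3000, 2500],
    'zip': ['10001', None],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Read back twice: once with inferred types, once forcing zip to text
buffer.seek(0)
inferred = pd.read_csv(buffer)
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'zip': str})

print(inferred['zip'].dtype)
print(as_text['zip'].tolist())
```

Passing dtype={'zip': str} keeps the ZIP codes as strings, which matters if they are later matched against ZIP codes stored as text elsewhere.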

Let's also store just the data with the ZIP code information in a new DataFrame. We will call that one zdf:

zdf = df[df['zip'].notnull()].copy() 

Finally, let's do an aggregation by ZIP code to see what the average rental price is by ZIP:

# Parentheses allow the method chain to continue on the next line
zdf_mean = (zdf.groupby('zip')['rent'].mean().to_frame('avg_rent')
            .sort_values(by='avg_rent', ascending=False).reset_index())
zdf_mean 

The preceding code generates the following output:

We can see this jibes with our earlier finding that the Lincoln Center area had the highest mean rental prices, since 10069 is in the Lincoln Center region.

Let's now move on to visualizing this information.
