Geocoding the addresses

Now that we know how to read and write data, let's loop over the addresses from the file and store the results in another .csv file. For that, we'll write another function that geocodes the addresses one by one. One reason to do so is to make the code more robust: currently, if something goes wrong with a specific request (say, the address is not found), our geocode function will raise an error, halting the whole process and potentially losing all previously geocoded data. Arguably, a better way is to keep the script running and store the dataset, reporting the issues and the corresponding rows of the original dataset separately. So, let's catch errors and append them, together with the corresponding rows, to a separate list. If there are no issues but no results either, we'll print the address and move on to the next one.

The process will take some time—at least a second for each row, and then some. To keep us informed and entertained while we wait, let's add a progress bar. For that, we'll use another popular library, tqdm, that does exactly that. The library is very easy to use. To get our progress bar, the only thing we need is to loop over a tqdm object, initiated with our original iterable as an argument. Take a look at the following:

>>> from tqdm import tqdm
>>> collection = ['Apple', 'Banana', 'Orange']

>>> for fruit in tqdm(collection):
...     pass

100%|██████████| 3/3 [00:01<00:00, 0.20s/it]

Here, we have a collection of three strings. To add a progress bar to a loop, we create a tqdm object from our collection and loop over that object instead, as if it were the original collection. Easy!
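One detail worth knowing: tqdm works out the length of the bar from the iterable itself, which it cannot do for a generator. The following is a minimal sketch of that case, using tqdm's total and desc arguments. The read_rows generator and its contents are hypothetical, and the fallback in the except branch is only there so that the sketch runs (without a bar) where tqdm is not installed:

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm installed (no bar is shown).
    def tqdm(iterable, **kwargs):
        return iterable


def read_rows():
    """Hypothetical generator standing in for rows read lazily from a file."""
    for name in ['Apple', 'Banana', 'Orange']:
        yield {'name': name}


# A generator has no length, so we tell tqdm how many items to expect.
seen = [row['name'] for row in tqdm(read_rows(), total=3, desc='fruits')]
print(seen)
```

Without total, tqdm would still count the iterations, but it could not show a percentage or an estimate of the remaining time.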

Now, let's break the function we want to write into chunks and go over them one by one. First of all, we declare the function itself, specifying the data; the key in each dictionary to use as the address; and lastly, a Boolean argument, verbose (that is, whether we want the function to report what is happening under the hood). After a docstring, we create two lists: one for the geocoded rows and one for the rows that raised errors:

def geocode_bulk(data, column='address', verbose=False):
    '''assuming data is an iterable of dicts, will attempt
        to geocode each, treating {column} as an address. 
        Returns 2 iterables - result and errored rows'''
    result, errors = [], []

Now we can build the loop. As planned, let's wrap data in a tqdm object and loop over it. Within the loop, we'll try to run the geocode and check the result. If no result is found, the row is added to the results list as is and, if we're in verbose mode, the event is printed. If Nominatim found something, we merge the first result with our initial information and store that in our results instead:

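# this chunk sits inside the loop: for row in tqdm(data):
try:
    search = nominatim_geocode(row[column], limit=1)
    if len(search) == 0: # no location found
        result.append(row)
        if verbose:
            print(f"Can't find anything for {row[column]}")
    else:
        info = search[0] # most "important" result
        info.update(row) # merge two dicts
        result.append(info)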

As we don't want to lose all our progress because of an error, we use except so that errors for a particular address (troubles with the internet connection, for example) will lead to an empty result for this specific address but won't cause the whole of the code to fail. In this case, we'll add an error message to the record and pass it to the errors list:

except Exception as e:
    if verbose:
        print(e)
    row['error'] = e
    errors.append(row)
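To see the pattern in isolation, here is a minimal, offline sketch of the same try/except logic. The fake_geocode function is a hypothetical stand-in for nominatim_geocode that makes no network calls, and the addresses are made up. As a small variation on the code above, it stores the error as text and copies the row instead of modifying it; both are assumptions about what is convenient when writing the rows to a .csv file later, not part of the original function:

```python
def fake_geocode(address):
    """Hypothetical stand-in for nominatim_geocode; makes no network calls."""
    if address == 'broken':
        raise ConnectionError('simulated network failure')
    if address == 'nowhere':
        return []
    return [{'lat': '0.0', 'lon': '0.0'}]


rows = [{'address': 'somewhere'}, {'address': 'broken'}, {'address': 'nowhere'}]
result, errors = [], []

for row in rows:
    try:
        search = fake_geocode(row['address'])
        # keep the row either way; add coordinates only if something was found
        result.append({**search[0], **row} if search else row)
    except Exception as e:
        # store the message as text and leave the original row untouched
        errors.append({**row, 'error': str(e)})

print(len(result), len(errors))
```

The failing address ends up in errors with its message, while the other two rows are processed as usual, so a single bad request does not cost us the rest of the run.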

Finally, we report the total number of errors, if in verbose mode, and return two lists. Here is the function as a whole:

from tqdm import tqdm

def geocode_bulk(data, column='address', verbose=False):
    '''assuming data is an iterable of dicts, will attempt to 
       geocode each, treating {column} as an address. 
       Returns 2 iterables - result and errored rows'''
    result, errors = [], []

    for row in tqdm(data):
        try:
            search = nominatim_geocode(row[column], limit=1)
            if len(search) == 0: # no location found:
                result.append(row)
                if verbose:
                    print(f"Can't find anything for {row[column]}")
                    
            else:
                info = search[0] # most "important" result
                info.update(row) # merge two dicts
                result.append(info) 
        except Exception as e:
            if verbose:
                print(e)
            row['error'] = e
            errors.append(row)
    
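    if len(errors) > 0 and verbose:
        # every row ends up in exactly one list, so this works even if data has no len()
        total = len(result) + len(errors)
        print(f'{len(errors)}/{total} rows failed')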

    return result, errors

Shall we try it out? It seems to be working: it took us about 14 seconds to geocode the cities by name:

result, errors = geocode_bulk(cities, column='name', verbose=True)

100%|██████████| 10/10 [00:14<00:00,  1.40s/it]

As a result, we now have two lists: one with the rows that were processed without errors, including the latitude and longitude of every address that Nominatim found, and another with problematic entries. If there are any errored rows, we can make one more attempt to geocode them, or investigate the causes and tweak either the code or the data.
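One more attempt on the errored rows could look like the following sketch. The retry_errors helper is hypothetical: it strips the error key from each failed row and passes the rows to whatever bulk function it is given. To keep the example runnable offline, it is demonstrated with fake_bulk, a made-up stand-in that mimics the return value of geocode_bulk:

```python
def retry_errors(errors, bulk_fn, **kwargs):
    """Strip the 'error' key from each failed row and pass the rows to bulk_fn again."""
    rows = [{k: v for k, v in row.items() if k != 'error'} for row in errors]
    return bulk_fn(rows, **kwargs)


# Hypothetical stand-in for geocode_bulk so the sketch runs offline.
def fake_bulk(data, column='address', verbose=False):
    result, errors = [], []
    for row in data:
        if row[column] == 'still broken':
            errors.append({**row, 'error': 'simulated failure'})
        else:
            result.append({**row, 'lat': '0.0', 'lon': '0.0'})
    return result, errors


failed = [
    {'name': 'Springfield', 'error': 'timeout'},
    {'name': 'still broken', 'error': 'timeout'},
]
recovered, still_failing = retry_errors(failed, fake_bulk, column='name')
print(len(recovered), len(still_failing))
```

With the real function, the call would be retry_errors(errors, geocode_bulk, column='name'). Rows that fail a second time are probably worth inspecting by hand.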

The code we just wrote is rather opinionated, as we made many assumptions. For example, it uses only one column for geocoding, takes only the first geocode result, and can be very verbose; you might want to tailor it to your own needs or write different versions for different projects.
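As an example of such tailoring, the single-column assumption can be worked around without touching geocode_bulk at all, by preparing the data first. The combine_columns helper below is a hypothetical sketch, and the column names and addresses are made up for illustration. It joins several columns into one new key, skipping empty values:

```python
def combine_columns(data, columns, target='address', sep=', '):
    """Return copies of the rows with a new key joining the given columns."""
    combined = []
    for row in data:
        # skip missing or empty values so we don't get stray separators
        parts = [str(row[c]).strip() for c in columns if row.get(c)]
        combined.append({**row, target: sep.join(parts)})
    return combined


rows = [
    {'street': '1 Main St', 'city': 'Springfield', 'state': ''},
    {'street': '', 'city': 'Portland', 'state': 'OR'},
]
prepared = combine_columns(rows, ['street', 'city', 'state'])
print(prepared[0]['address'])
print(prepared[1]['address'])
```

The prepared rows can then be passed to geocode_bulk with the default column='address'.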

Let's now talk about how to store those useful functions so that we can use them (and we will) in the future.
