Parsing locations

Let's start with the Location column. As you remember, the data in this column is supposed to represent the location where the battle took place. In many cases, the value was stored as a Wikipedia GeoMarker, which includes latitude/longitude coordinates. Here is what the raw value of this marker looks like:

>>> battles['Location'].iloc[10]
'Warsaw, Poland52°13′48″N 21°00′39″E\ufeff / \ufeff52.23000°N 21.01083°E\ufeff / 52.23000; 21.01083Coordinates: 52°13′48″N 21°00′39″E\ufeff / \ufeff52.23000°N 21.01083°E\ufeff / 52.23000; 21.01083'

Note that this geotag contains a nice latitude/longitude pair (with minutes and seconds) as well as its float representation, which is easier to use. In fact, the very same coordinates are repeated at the very end in their simplest form, and that's what we'll extract.

Let's write our first pattern. Usually, it is easiest to write a draft pattern that matches our example string, and then work from there, adapting, relaxing, and tightening the pattern where needed. Here is our attempt: a slash and a space, and then two groups separated by a semicolon, each containing either numeric characters or a period (which we escape with a backslash):

pattern = r'/ ([\d\.]+); ([\d\.]+)'

It is usually easier to tailor the pattern in an interactive way. Our favorite tool for the job is Pythex (https://pythex.org/), an online console for interactive regex testing, tailored specifically for Python-flavored regex (yes, there are some differences).
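
The same check can also be done locally with Python's built-in re module. The snippet below is a minimal sketch: the string is a shortened copy of the Warsaw value shown earlier, and the pattern is assumed to be in its backslash-escaped form (\d for a digit, \. for a literal period).

```python
import re

# Shortened copy of the raw Warsaw value; the repeated tail is cut short.
sample = (
    'Warsaw, Poland52°13′48″N 21°00′39″E\ufeff / '
    '\ufeff52.23000°N 21.01083°E\ufeff / 52.23000; 21.01083'
    'Coordinates: 52°13′48″N 21°00′39″E'
)

# A slash and a space, then two groups of digits/periods separated by "; ".
pattern = r'/ ([\d\.]+); ([\d\.]+)'

# re.search returns the first place in the string where the pattern matches.
match = re.search(pattern, sample)
latitude, longitude = match.groups()
print(latitude, longitude)  # 52.23000 21.01083
```

Note that the earlier slashes in the string are skipped: each is followed by a \ufeff character rather than a digit, so only the plain decimal pair matches.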

Let's test this pattern:

battles.head(10).Location.str.extract(pattern)

It works! You may want to go over the addresses and check that the ones with no coordinates extracted indeed don't contain any. We can store the results in two new columns:
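
One way to run that check without reading every row is to look for values where nothing was extracted but the text still contains a degree sign. The sketch below uses a small, made-up Series in place of battles.Location (the three values are invented for illustration, not taken from the dataset) and assumes the backslash-escaped form of the pattern.

```python
import pandas as pd

# Toy stand-in for battles.Location; these values are invented.
locations = pd.Series([
    'Warsaw, Poland52°13′48″N 21°00′39″E\ufeff / 52.23000; 21.01083',
    'Near Smolensk, Russia',
    'Somewhere 48°51′N 2°21′E',  # has coordinates, but not as "lat; lon"
])

pattern = r'/ ([\d\.]+); ([\d\.]+)'

# With two unnamed groups, extract returns columns 0 and 1 (NaN if no match).
extracted = locations.str.extract(pattern)

# Rows with no match that still contain a degree sign deserve a manual look.
no_match = extracted[0].isnull()
suspicious = locations[no_match & locations.str.contains('°', regex=False)]
print(suspicious)
```

If this selection comes back empty on the real column, the pattern has not missed any value that carries coordinates in another format.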

battles[['Latitude', 'Longitude']] = battles.Location.str.extract(pattern)

Note that both columns are still strings, but now they can be converted into floats:

for col in 'Latitude', 'Longitude':
    battles[col] = battles[col].astype(float)

Still, many locations did not have coordinates to start with. But how many? Let's check the percentage of empty cells in Latitude:

>>> 100 * (battles['Latitude'].isnull().sum() / len(battles))
78.2312925170068

That is, 78% of our rows have no coordinates, which is too many! Most of those cells do still have an address as a string, though. Let's try to geocode them using the nominatim_geocode function that we wrote earlier in this book.
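
Before geocoding, it helps to isolate the rows that actually need it: those with no latitude but a non-empty location string. Here is a minimal sketch on a made-up frame with the same column names (the rows are invented, and the geocoding call itself is left out, since the signature of nominatim_geocode is not repeated in this section):

```python
import pandas as pd

# Toy stand-in for the battles frame; the rows are invented.
battles = pd.DataFrame({
    'Location': [
        'Warsaw, Poland / 52.23000; 21.01083',
        'Near Smolensk, Russia',
        None,
    ],
    'Latitude': [52.23, None, None],
    'Longitude': [21.01083, None, None],
})

# Rows where the regex found nothing...
missing_coords = battles['Latitude'].isnull()
# ...but which still carry some text that a geocoder could work with.
has_address = battles['Location'].fillna('').str.strip() != ''

to_geocode = battles.loc[missing_coords & has_address, 'Location']
print(to_geocode)
print(f'{100 * missing_coords.mean():.1f}% of rows lack coordinates')
```

Passing only to_geocode to the geocoder avoids re-querying locations that already have coordinates and skips rows with nothing to look up.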
