How it works...

After reading the data, we decide how many variables the dataset contains. Here, we chose to split the Geolocation column into four variables, but we could just as well have used two, latitude and longitude, with a negative sign to distinguish west from east and south from north.
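The two-variable alternative can be sketched as follows. This is only an illustration: the small DataFrame is sample data built inline, the column names are assumed to match those assigned in step 2, and the sign convention (north and east positive) is an assumption rather than something the recipe prescribes.

```python
import pandas as pd

# Illustrative sample standing in for the four-column result of the recipe
geolocations = pd.DataFrame({
    'latitude': [29.7604, 32.7767],
    'latitude direction': ['N', 'N'],
    'longitude': [95.3698, 96.7970],
    'longitude direction': ['W', 'W'],
})

# Assumed convention: north/east are positive, south/west are negative
lat_sign = geolocations['latitude direction'].map({'N': 1, 'S': -1})
lon_sign = geolocations['longitude direction'].map({'E': 1, 'W': -1})

# Fold each direction into the sign of its coordinate
signed = pd.DataFrame({
    'latitude': geolocations['latitude'] * lat_sign,
    'longitude': geolocations['longitude'] * lon_sign,
})
print(signed)
```

This keeps the same information in two columns instead of four, at the cost of making the direction implicit.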

There are a few ways to parse the Geolocation column with the methods from the str accessor. The easiest is the split method. We pass it a simple regular expression: any character (the period) followed by a space. Wherever this pattern matches, a split is made and a new column is formed. The first match occurs at the end of the latitude, where a space follows the degree character. The matched characters are discarded and do not appear in the resulting columns. The next match is the comma and space directly after the latitude direction, and the last is the degree character and space at the end of the longitude.
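A minimal sketch of this split is shown below. The two rows are illustrative sample data, and the column names are the ones this section describes for step 2; the exact call in the recipe may differ in detail.

```python
import pandas as pd

# Illustrative sample in the same shape as the recipe's data
cities = pd.DataFrame({
    'City': ['Houston', 'Dallas'],
    'Geolocation': ['29.7604° N, 95.3698° W', '32.7767° N, 96.7970° W'],
})

# '. ' has more than one character, so pandas treats it as a regular
# expression by default: any single character followed by a space.
# expand=True spreads the pieces into separate columns.
geolocations = cities['Geolocation'].str.split(pat='. ', expand=True)
geolocations.columns = ['latitude', 'latitude direction',
                        'longitude', 'longitude direction']
print(geolocations)
```

Each row yields three matches ("° ", ", ", and "° "), so the result has four columns of strings.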

A total of three splits are made, resulting in four columns. The second line in step 2 provides them with meaningful names. Even though the resulting latitude and longitude columns appear to be floats, they are not. They were originally parsed from an object column and therefore remain object data types. Step 3 uses a dictionary to map the column names to their new types.
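The dictionary-based conversion can be sketched like this. The DataFrame is illustrative sample data holding strings, as the split would produce, and the choice of 'float' for both coordinate columns is an assumption about what step 3 maps them to.

```python
import pandas as pd

# Illustrative sample: every value is still a string after splitting
geolocations = pd.DataFrame({
    'latitude': ['29.7604', '32.7767'],
    'latitude direction': ['N', 'N'],
    'longitude': ['95.3698', '96.7970'],
    'longitude direction': ['W', 'W'],
})

# Map only the columns that need a new type; the others are left alone
converted = geolocations.astype({'latitude': 'float', 'longitude': 'float'})
print(converted.dtypes)
```

Columns not named in the dictionary keep their original type, which is why the direction columns remain strings.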

Instead of using a dictionary, which would require a lot of typing if you had many column names, you can use the function to_numeric to attempt to convert each column to either integer or float. To apply this function iteratively over each column, use the apply method with the following:

>>> geolocations.apply(pd.to_numeric, errors='ignore')
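To my knowledge, errors='ignore' is deprecated as of pandas 2.2, so the call above may warn or fail on newer releases. A possible substitute, sketched here on illustrative sample data, is a small helper that catches the conversion error itself; the name to_numeric_if_possible is hypothetical and not part of pandas.

```python
import pandas as pd

# Illustrative sample: every value is still a string after splitting
geolocations = pd.DataFrame({
    'latitude': ['29.7604', '32.7767'],
    'latitude direction': ['N', 'N'],
    'longitude': ['95.3698', '96.7970'],
    'longitude direction': ['W', 'W'],
})

def to_numeric_if_possible(column):
    """Hypothetical helper: convert a column, or return it unchanged."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        # Non-numeric columns such as the directions are left as they are
        return column

converted = geolocations.apply(to_numeric_if_possible)
print(converted.dtypes)
```

The effect is intended to match the original one-liner: numeric-looking columns become floats and the rest are returned untouched.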

Step 4 concatenates the city to the front of this new DataFrame to complete the process of making tidy data.
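The concatenation might look like the following sketch. Both inputs are illustrative sample data, and pd.concat is one reasonable way to do this; the recipe's step 4 may use a different call.

```python
import pandas as pd

# Illustrative sample for the original data and the parsed coordinates
cities = pd.DataFrame({'City': ['Houston', 'Dallas']})
geolocations = pd.DataFrame({
    'latitude': [29.7604, 32.7767],
    'latitude direction': ['N', 'N'],
    'longitude': [95.3698, 96.7970],
    'longitude direction': ['W', 'W'],
})

# Place City first, then the four parsed columns, aligning rows on the index
tidy = pd.concat([cities['City'], geolocations], axis='columns')
print(tidy)
```

Because both objects share the same index, each city lines up with its own coordinates.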
