Working with real data

Let's now try using pandas on real data. In Chapter 7, Scraping Data from the Web with Beautiful Soup 4, we collected a huge dataset of WWII battles and operations—including casualties, armies, dates, and locations. We never explored what is inside the dataset, though, and usually, this kind of data requires intensive processing. Now, let's see what we'll be able to do with this data.

As you may recall, we stored the dataset as a nested .json file. pandas can read JSON files of several structures, but it won't flatten nested data points on its own. Unnesting them is a straightforward task (you could write a recursive function, for example), so we won't discuss it in detail. If you want, you can check the 0_json_to_table.ipynb notebook in this chapter's folder on GitHub at the following link: https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications/tree/master/Chapter11. The only new operation there is the pandas.io.json.json_normalize function (exposed as pandas.json_normalize since pandas 1.0), which expects a list of dictionaries, each representing a row, and flattens their nested properties by concatenating the keys (in our case, the nested belligerents, casualties, strengths, and leader elements). We stored the resulting data as a set of CSV files, each representing a different theater of war (see Chapter11/data/...csv in the repository). Note that no processing other than unnesting was undertaken.
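To make the flattening step concrete, here is a minimal sketch of what json_normalize does to nested records. The records, field names, and values are invented for illustration and are not taken from the scraped dataset; the sketch also assumes pandas 1.0 or later, where the function is available as pd.json_normalize.

```python
import pandas as pd

# Hypothetical records: names and numbers are illustrative only.
records = [
    {
        "name": "Battle A",
        "casualties": {"allies": 100, "axis": 150},
        "leaders": {"allies": "X"},
    },
    {
        "name": "Battle B",
        "casualties": {"allies": 20, "axis": 35},
        # No "leaders" key here: the flattened column will hold NaN.
    },
]

# Each dictionary becomes one row; nested keys are joined with "." by default.
flat = pd.json_normalize(records)

print(sorted(flat.columns))
print(flat)
```

Nested keys end up as column names such as casualties.allies, and a record that lacks a nested key simply gets a missing value in that column.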

With this done, we can now look closer at the data we collected. Let's dive into one of the CSV files and see what we're working with:

import pandas as pd

df = pd.read_csv('./data/Eastern Front.csv')

This reads the CSV file into a DataFrame, df, which we can then inspect.
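A first look at a freshly loaded DataFrame usually covers its shape, the first few rows, the column types, and the number of missing values. The sketch below shows those calls on a small in-memory stand-in for the CSV file, so that it runs without the repository; the column names and rows are made up and do not reflect the real file.

```python
import io

import pandas as pd

# Stand-in for './data/Eastern Front.csv'; columns and rows are hypothetical.
csv_text = """name,casualties.allies,casualties.axis
Battle A,100,150
Battle B,20,
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (number of rows, number of columns)
print(df.head())        # first rows of the table
print(df.dtypes)        # type pandas inferred for each column
print(df.isna().sum())  # count of missing values per column
```

With the real file, the same four calls apply directly to the df loaded above; checking the inferred types and missing values early shows how much cleaning the data still needs.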
