Beginning with pandas

Of course, not all data (or data analysis) is numeric. To address that gap, and inspired by the R language's dataframe objects, Wes McKinney created another package, pandas, in 2008. While it relies heavily on NumPy for numeric computations, its core interface objects are dataframes (2-dimensional multitype tables) and series (1-dimensional labeled arrays). Unlike NumPy matrices, dataframes don't require all data to be of the same type; you can mix numeric values with Booleans, strings, datetimes, and arbitrary Python objects. A dataframe does, however, require (and enforce) a uniform data type vertically, within each column. Unlike NumPy, pandas also allows dataframe columns and rows to have arbitrary numeric or string names, or even hierarchical, multilevel indices.
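As a minimal sketch of those ideas, the snippet below builds a small dataframe with a different type in each column and string labels on the rows. The column names and values are invented purely for illustration and are not taken from the book's datasets.

```python
import pandas as pd

# Illustrative values only; the column names and numbers are made up.
df = pd.DataFrame(
    {
        "name": ["Alpha", "Beta", "Gamma"],   # strings
        "population": [1.5, 2.25, 0.75],      # floats
        "is_capital": [True, False, False],   # Booleans
        "founded": pd.to_datetime(["1900-01-01", "1950-06-15", "2000-12-31"]),
    },
    index=["a", "b", "c"],  # rows can carry string labels
)

print(df.dtypes)             # one data type per column
col = df["population"]       # a single column is a series
print(type(col).__name__)    # Series
print(df.loc["b", "name"])   # label-based lookup of one cell
```

Each column keeps its own type, while a row can hold values of several types at once.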

pandas also supports simple grouping and aggregation of data, SQL-style merging of tables, time-based transformations, plotting, and many other operations. It makes reading and writing dozens of different formats, from CSV files and SQL databases to JSON and HDF5/Arrow binaries, a breeze. As a result, it is extremely powerful and remains the de facto standard for most data analysis in Python.
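To make the SQL-style merging and the format support a little more concrete, here is a small sketch that joins two hypothetical tables and round-trips the result through CSV. The table contents are invented, and an in-memory buffer stands in for a real file so the snippet can run anywhere.

```python
import io

import pandas as pd

# Hypothetical tables; names and values are invented for illustration.
cities = pd.DataFrame({"city": ["A", "B", "C"], "country_code": ["X", "X", "Y"]})
countries = pd.DataFrame({"country_code": ["X", "Y"], "country": ["Xland", "Yland"]})

# Similar in spirit to a SQL LEFT JOIN on country_code
merged = cities.merge(countries, on="country_code", how="left")

# Round-trip through CSV using an in-memory buffer instead of a file on disk
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Other formats follow the same reader and writer pattern, for example read_json and to_json.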

Let's showcase pandas with a simple example:

Here, we'll read a CSV file with geocoded cities we created in Chapter 6, First Script – Geocoding with Web API:

>>> import pandas as pd

>>> df = pd.read_csv('../Chapter06/geocoded.csv')
>>> len(df)  # number of rows in the table
10

Next, we'll filter data to only the cities in the Eastern Hemisphere (positive longitude):

>>> eastern = df[df.lon > 0]  # rows with positive longitude (Eastern Hemisphere)
>>> len(eastern)
8

Now, we calculate how many cities there are and what their mean population is for each country:

>>> result = eastern.groupby('country').agg({'population':'mean', 'icon':'count'})
>>> result.rename(columns={'icon':'cities'}, inplace=True)
>>> result

             population  cities
country
China           22.6825       2
India           25.2725       2
Indonesia       32.2700       1
Japan           38.0500       1
Philippines     24.6500       1
South Korea     24.2100       1
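The same filter-and-aggregate steps can also be written with named aggregation, which sets the output column names directly and so avoids the separate rename call. The sketch below is one possible variant, not the book's code. It runs on a tiny stand-in table whose country names and numbers are entirely invented, because the geocoded CSV is not reproduced here.

```python
import pandas as pd

# Stand-in for the geocoded table; every value here is invented.
df = pd.DataFrame(
    {
        "country": ["Aland", "Aland", "Bland", "Cland"],
        "population": [10.0, 20.0, 5.0, 7.0],
        "lon": [12.5, 30.0, 100.0, -45.0],
        "icon": ["city", "city", "city", "city"],
    }
)

eastern = df[df["lon"] > 0]  # keep rows with positive longitude

# Named aggregation: output column = (input column, aggregation function)
result = eastern.groupby("country").agg(
    population=("population", "mean"),
    cities=("icon", "count"),
)

print(result)
```

Here the row with negative longitude is dropped before grouping, so its country does not appear in the result.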

We'll finally store our results as a new CSV file:

>>> result.to_csv('aggregation.csv')

That's it! Note that pandas also plays well with Jupyter—all tables are nicely rendered as HTML tables!

Working with pandas requires the same type of thinking as with NumPy—you should try to avoid loops at all costs. Most of the time, there are predefined ways to do what you want, written by others. The resulting code may be somewhat less readable and expressive than pure Python—but will be way faster.
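To illustrate the point about loops, the following sketch computes the same derived column twice: once with an explicit Python loop and once as a single column-wise expression. The column names and numbers are made up for the example.

```python
import pandas as pd

# Made-up numbers, purely for illustration.
df = pd.DataFrame({"population": [1.5, 2.0, 3.5], "area": [10.0, 4.0, 7.0]})

# Loop version: works, but processes one row at a time in Python
density_loop = []
for _, row in df.iterrows():
    density_loop.append(row["population"] / row["area"])

# Vectorized version: one expression applied to whole columns at once
df["density"] = df["population"] / df["area"]

print(df)
```

Both approaches give the same numbers; on large tables the vectorized form is typically much faster because the work happens inside NumPy rather than in the Python interpreter.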

One of the most popular spin-offs from pandas is the geopandas package, which offers a pandas-like interface for geospatial visualization and analysis. It represents collections of geospatial objects (points, lines, or polygons) as a special kind of dataframe. We'll work with geopandas in Chapter 12, Data Exploration and Visualization.

So far, we've reviewed two fundamental packages: NumPy and pandas. Both provide serious power for reading, processing, and operating on data, be it numeric arrays or tables of mixed data types. On top of those complex and fast data structures, yet another layer of packages, such as SciPy, SimPy, and scikit-learn, lets you run complex algorithms. You can think of them as a bunch of textbooks on core mathematical, physical, and general-purpose scientific equations and models, all brought to life as a set of Python packages.
