Beginning with pandas

Of course, not all data (or data analysis) is numeric. To address that gap, and inspired by the R language's dataframe objects, another package, pandas, was created by Wes McKinney in 2008. While it relies heavily on NumPy for numeric computations, its core interface objects are dataframes (2-dimensional multitype tables) and series (1-dimensional arrays). Unlike NumPy matrices, dataframes don't require all data to be of the same type. On the contrary, they allow you to mix numeric values with Booleans, strings, datetimes, and any other arbitrary Python objects. pandas does, however, require (and enforce) that the data type be uniform vertically, that is, within each column. Compared to NumPy, it also allows dataframe columns and rows to have arbitrary numeric or string names, or even hierarchical, multilevel indices.
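As a minimal sketch of this idea, the following snippet builds a small dataframe by hand. The city names, column names, and values are hypothetical and serve only to show that each column keeps its own type while the rows carry string labels:

```python
import pandas as pd

# Hypothetical data: every column has one type, but types differ between columns
df = pd.DataFrame(
    {
        "population": [37.4, 28.5, 25.6],        # floats (made-up values)
        "is_capital": [True, True, False],       # Booleans
        "last_census": pd.to_datetime(
            ["2020-10-01", "2011-03-01", "2020-11-01"]
        ),                                       # datetimes
    },
    index=["Tokyo", "Delhi", "Shanghai"],        # string row labels
)

print(df.dtypes)                        # one dtype per column
print(df.loc["Tokyo", "population"])    # look up a cell by row and column name
```

Here, `df.dtypes` reports a separate type for each column, and `df.loc` addresses a cell by its labels rather than by position.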

pandas also offers simple grouping and aggregation of data, merging of tables à la SQL, time-based transformations, plotting, and many other tools. It also makes reading from and writing to dozens of different formats a breeze, from CSV files and SQL databases to JSON and HDF5/Arrow binaries. As a result, it is extremely powerful and remains the de facto standard for most data analysis in Python.
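To give a feel for the SQL-style merging and the aggregation mentioned here, the following sketch joins two small tables and sums a column per group. Both tables, their column names, and their values are assumptions made up for illustration:

```python
import pandas as pd

# Hypothetical tables; population values are made-up whole numbers
cities = pd.DataFrame(
    {
        "city": ["Tokyo", "Osaka", "Delhi"],
        "country_code": ["JP", "JP", "IN"],
        "population": [37, 19, 28],
    }
)
countries = pd.DataFrame(
    {"country_code": ["JP", "IN"], "country": ["Japan", "India"]}
)

# Equivalent in spirit to a SQL LEFT JOIN on country_code
merged = cities.merge(countries, on="country_code", how="left")

# Equivalent in spirit to GROUP BY country with SUM(population)
totals = merged.groupby("country")["population"].sum()
print(totals)
```

The `how` argument of `merge` selects the join type (`'left'`, `'right'`, `'inner'`, or `'outer'`), much like the join keywords in SQL.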

Let's showcase pandas with a simple example:

  1. Here, we'll read a CSV file with the geocoded cities we created in Chapter 6, First Script – Geocoding with Web API:
>>> import pandas as pd

>>> df = pd.read_csv('../Chapter06/geocoded.csv')
>>> len(df) # number of rows in the table
10
  2. Next, we'll filter the data to only the cities in the Eastern Hemisphere (positive longitude):
>>> eastern = df[df.lon > 0] # those with positive longitude (Eastern Hemisphere)
>>> len(eastern)
8
  3. Now, we calculate how many cities there are and what their mean population is for each country:
>>> result = eastern.groupby('country').agg({'population':'mean', 'icon':'count'})
>>> result.rename(columns={'icon':'cities'}, inplace=True)
>>> result

             population  cities
country
China           22.6825       2
India           25.2725       2
Indonesia       32.2700       1
Japan           38.0500       1
Philippines     24.6500       1
South Korea     24.2100       1
  4. Finally, we'll store our results as a new CSV file:
>>> result.to_csv('aggregation.csv')

That's it! Note that pandas also plays well with Jupyter: all tables are nicely rendered as HTML tables!
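If you don't have the CSV file from the earlier chapter at hand, the same steps can be sketched on a small in-memory table. The rows below are a hypothetical stand-in for the real file: the `country`, `population`, and `lon` column names follow the example above, while the `city` column and all values are made up. This variant also uses named aggregation, which sets the output column names directly instead of renaming them afterward:

```python
import io

import pandas as pd

# Hypothetical stand-in for geocoded.csv (made-up rows)
csv_text = """city,country,population,lon
A,Japan,30.0,139.7
B,India,20.0,77.2
C,India,10.0,72.9
D,Brazil,15.0,-46.6
"""

df = pd.read_csv(io.StringIO(csv_text))

# Keep only rows with positive longitude (Eastern Hemisphere)
eastern = df[df["lon"] > 0]

# Named aggregation: output_column=(input_column, function)
result = eastern.groupby("country").agg(
    population=("population", "mean"),
    cities=("city", "count"),
)
print(result)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
result.to_csv(buffer)
```

Passing a file path to `to_csv` instead of the buffer would write the result to disk, as in the last step above.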

Working with pandas requires the same type of thinking as with NumPy: you should try to avoid loops at all costs. Most of the time, there are predefined ways to do what you want, written by others. The resulting code may be somewhat less readable and expressive than pure Python, but it will be way faster.
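To illustrate the difference, here is a small sketch that computes the same derived column twice, first with an explicit loop and then with a single column-wise expression. The dataframe and its `population` and `area` columns are hypothetical:

```python
import pandas as pd

# Hypothetical data, chosen so that the divisions come out exact
df = pd.DataFrame({"population": [10.0, 20.0, 30.0], "area": [2.0, 4.0, 5.0]})

# Loop version: works, but processes one row at a time in the Python interpreter
density_loop = []
for _, row in df.iterrows():
    density_loop.append(row["population"] / row["area"])

# Vectorized version: one expression applied to whole columns at once
df["density"] = df["population"] / df["area"]

print(df)
```

Both versions yield the same numbers. On a table this small the speed difference is negligible, but it typically grows with the number of rows.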

One of the most popular spin-offs from pandas is the geopandas package, which offers a pandas-like interface for geospatial visualization and analysis. It represents collections of geospatial objects (points, lines, or polygons) as a special kind of dataframe. We'll work with geopandas in Chapter 12, Data Exploration and Visualization.

So far, we've reviewed two fundamental packages: NumPy and pandas. Both of them provide serious power in reading, processing, and operating on data, be it numeric arrays or tables of different data types. On top of those complex and fast data structures, yet another layer of packages allows the running of complex algorithms: packages such as SciPy, SimPy, and scikit-learn. You can think of them as a bunch of textbooks on core mathematical, physical, and general-purpose scientific equations and models, all brought to life as a set of Python packages.
