Chapter 8. Combining and Reshaping Data

In Chapter 7, Tidying Up Your Data, we examined how to clean up our data to get it ready for analysis. Everything we did focused on working within the data of a single DataFrame or Series object while keeping the structure of the data within those objects the same. Once the data is tidied up, we will likely need either to combine multiple sets of data or to reorganize the structure of the data by moving data in and out of indexes.

This chapter covers two general categories of topics: combining and reshaping data. The first two sections cover the capabilities pandas provides for combining data from multiple pandas objects. Data can be combined by concatenation, where two sets of data are simply joined along either axis without regard to relationships in the data. Alternatively, data can be combined using relationships in the data through a pandas capability referred to as merging, which provides join operations similar to those in many relational databases.
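As a quick preview of the difference between the two approaches, the following is a minimal sketch rather than one of the chapter's own examples. The DataFrame names (`left`, `right`) and their contents are hypothetical and were made up purely for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration only
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# Concatenation: rows are appended along axis 0 and columns are aligned
# by label, but the values in 'key' play no role in how rows are combined
stacked = pd.concat([left, right], ignore_index=True)

# Merging: rows are matched on the values of the shared 'key' column,
# much like an inner join in a relational database
joined = pd.merge(left, right, on='key', how='inner')
```

Here `stacked` keeps all six rows and fills in NaN where a column does not exist in the originating object, whereas `joined` keeps only the rows whose `key` value appears in both objects.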

The remaining sections examine the three primary means of reshaping data in pandas: pivoting, stacking and unstacking, and melting. Pivoting allows us to restructure pandas data similarly to how spreadsheets pivot data, by creating new index levels and moving data into columns based upon values (or vice versa). Stacking and unstacking are similar to pivoting, but allow us to pivot data organized with multiple levels of indexes. Finally, melting allows us to restructure data into unique ID-variable-measurement combinations that are required for many statistical analyses.
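The three reshaping operations can be previewed with a small sketch. The data below (`long_df` with its `day`, `sensor`, and `reading` columns) is hypothetical and exists only to show the shape changes; it is not taken from the chapter's examples.

```python
import pandas as pd

# Hypothetical long-form data: one row per (day, sensor) measurement
long_df = pd.DataFrame({
    'day': ['mon', 'mon', 'tue', 'tue'],
    'sensor': ['s1', 's2', 's1', 's2'],
    'reading': [1.0, 2.0, 3.0, 4.0]})

# Pivoting: the values of 'sensor' become column labels, 'day' the index
wide = long_df.pivot(index='day', columns='sensor', values='reading')

# Stacking: the column labels move into an inner index level, giving
# a Series indexed by (day, sensor)
stacked = wide.stack()

# Unstacking: the inner index level moves back out into the columns
unstacked = stacked.unstack()

# Melting: wide data becomes ID-variable-measurement rows again
melted = wide.reset_index().melt(
    id_vars='day', var_name='sensor', value_name='reading')
```

In this sketch `wide` has one row per day and one column per sensor, `stacked` carries a two-level index, and `melted` returns to one row per day and sensor combination.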

Specifically, in this chapter we will examine the following concepts of combining and reshaping pandas data:

  • Concatenation
  • Merging and joining
  • Pivots
  • Stacking/unstacking
  • Melting
  • The potential performance benefits of stacked data

Setting up the IPython notebook

To run the examples in this chapter, we will need the following imports and settings.

In [1]:
   # import pandas, numpy and datetime
   import numpy as np
   import pandas as pd
   import datetime

   # Set some pandas options for controlling output
   pd.set_option('display.notebook_repr_html', False)
   pd.set_option('display.max_columns', 10)
   pd.set_option('display.max_rows', 10)