In Chapter 7, Tidying Up Your Data, we examined how to clean up our data to get it ready for analysis. Everything we did focused on working within the data of a single DataFrame or Series object while keeping the structure of the data within those objects unchanged. Once the data is tidied up, we will likely need either to combine multiple sets of data or to reorganize the structure of the data by moving data in and out of indexes.
This chapter covers two general categories of topics: combining and reshaping data. The first two sections cover the capabilities that pandas provides for combining data from multiple pandas objects. Data can be combined by concatenation, where it is simply joined along either axis without regard to relationships in the data. Alternatively, data can be combined using relationships in the data through a pandas capability referred to as merging, which provides join operations similar to those in many relational databases.
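As a preview of the difference between the two approaches, the following minimal sketch concatenates and then merges two small DataFrame objects. The frames `left` and `right` and their column names are made up for illustration and are not part of the chapter's data.

```python
import pandas as pd

# Two small illustrative frames (hypothetical data) sharing a "key" column
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Concatenation along the row axis: rows are simply appended one after
# the other; columns missing from one frame are filled with NaN
stacked = pd.concat([left, right], axis=0, ignore_index=True)

# Concatenation along the column axis: frames are placed side by side,
# lined up by index label only, so both "key" columns are kept
side_by_side = pd.concat([left, right], axis=1)

# Merging: rows are matched on the values of "key", as in a database
# inner join, so only keys present in both frames survive
merged = pd.merge(left, right, on="key", how="inner")

print(stacked)
print(side_by_side)
print(merged)
```

Concatenation keeps every row from both inputs, while the inner merge keeps only the rows whose `key` value appears in both frames.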
The remaining sections examine the three primary means of reshaping data in pandas: pivoting, stacking and unstacking, and melting. Pivoting allows us to restructure pandas data much as spreadsheets pivot data, by creating new index levels and moving data into columns based upon values (or vice versa). Stacking and unstacking are similar to pivoting, but allow us to pivot data organized with multiple levels of indexes. Finally, melting allows us to restructure data into unique ID-variable-measurement combinations that are required for many statistical analyses.
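The three reshaping operations can be sketched on a tiny data set as follows. The frame `long_df` and its column names (`date`, `sensor`, `reading`) are hypothetical and serve only to show the shape of each result.

```python
import pandas as pd

# Hypothetical "long" data: one row per (date, sensor) measurement
long_df = pd.DataFrame({
    "date": ["2014-01-01", "2014-01-01", "2014-01-02", "2014-01-02"],
    "sensor": ["s1", "s2", "s1", "s2"],
    "reading": [1.0, 2.0, 3.0, 4.0],
})

# Pivoting: values of "sensor" become columns, "date" becomes the index
wide = long_df.pivot(index="date", columns="sensor", values="reading")

# Stacking: the column labels move into an inner index level, giving a
# Series indexed by (date, sensor)
stacked = wide.stack()

# Unstacking: the inner index level moves back out into columns
unstacked = stacked.unstack()

# Melting: the wide frame is turned back into
# ID-variable-measurement rows
melted = wide.reset_index().melt(
    id_vars="date", var_name="sensor", value_name="reading"
)

print(wide)
print(stacked)
print(melted)
```

In this sketch, pivoting and melting are roughly inverse operations, as are stacking and unstacking.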
Specifically, in this chapter we will examine the following concepts of combining and reshaping pandas data:
To run the examples in this chapter, we will need the following imports and settings.
In [1]: # import pandas, numpy and datetime
        import numpy as np
        import pandas as pd
        import datetime

        # Set some pandas options for controlling output
        pd.set_option('display.notebook_repr_html', False)
        pd.set_option('display.max_columns', 10)
        pd.set_option('display.max_rows', 10)