Organizing data with pandas

The pandas library is built on the top of the numpy library. Therefore, in order to use pandas, you need to import numpy first. The pandas library introduces many additional data structures and functions. Let's start our pandas tour with the panda Series object. This is a one-dimensional array, like a numpy array; however, you can define an explicitly named index, and refer to those names to retrieve the data, not just to the positional index. Therefore, a pandas Series object already looks like a tuple in the relational model, or a row in a table. The following code imports both packages, numpy and pandas. Then it defines a simple pandas Series, without an explicit index. The Series looks like a simple single-dimensional array, and you can refer to elements through the positional index:

import numpy as np 
import pandas as pd 
ser1 = pd.Series([1, 2, 3, 4]) 
ser1[1:3]

Here are the results. I retrieved the second and the third element, position 1 and 2, with a zero-based positional index, as shown here:

    1    2
    2    3

Now I will create a Series with an explicitly named index, as shown in the following code:

ser1 = pd.Series([1, 2, 3, 4], 
                 index = ['a', 'b', 'c', 'd']) 
ser1['b':'c']

As you can see from the last example, you can refer to elements using the names of the index, which serve as column names in a SQL Server row. And here is the result:

    b    2
    c    3

You can create a Series from the dictionary object as well, as the following code shows:

dict1 = {'a': 1, 
         'b': 2, 
         'c': 3, 
         'd': 4} 
ser1 = pd.Series(dict1)

Imagine you have multiple Series with the same structure stacked vertically. This is the pandas DataFrame object. It looks and behaves like the R data frame. You use the pandas DataFrame object to store and analyze tabular data from relational sources, or to export the result to the tabular destinations, like SQL Server. The following code creates two dictionaries, then a pandas Series from each one, and then a pandas DataFrame from both Series:

age_dict = {'John': 35, 'Mick': 75, 'Diane': 42} 
age = pd.Series(age_dict) 
weight_dict = {'John': 88.3, 'Mick': 72.7, 'Diane': 57.1} 
weight = pd.Series(weight_dict) 
people = pd.DataFrame({'age': age, 'weight': weight}) 
people

The resulting data frame is:

           age  weight
    Diane   42    57.1
    John    35    88.3
    Mick    75    72.7

You can see that the Series were joined by the common index values. The result looks like a table and is therefore suitable for exchanging data and results with relational systems.

You can do a lot of manipulations with a data frame. For the beginning, I am extracting some meta-data, and creating a projection by extracting a single column, as shown in the following code:

people.index 
people.columns 
people['age']

Here are the results:

    Index(['Diane', 'John', 'Mick'], dtype='object')
    Index(['age', 'weight'], dtype='object')
    Diane    42
    John     35
    Mick     75

You can add a calculated column to a data frame. For example, the following code adds a column to the people data frame I am using here:

people['WeightDivAge'] = people['weight'] / people['age'] 
people

The result is:

           age  weight  WeightDivAge
    Diane   42    57.1      1.359524
    John    35    88.3      2.522857
    Mick    75    72.7      0.969333

You can transform columns to rows with the T function of the pandas DataFrame, as shown in the following code:

people.T

You can see the transformed data frame from the result:

                      Diane       John       Mick
    age           42.000000  35.000000  75.000000
    weight        57.100000  88.300000  72.700000
    WeightDivAge   1.359524   2.522857   0.969333

You can refer to the elements of a data frame by position or by column and index names. A data frame is still a matrix and not a relational table, so the zero-based positional index is also accessible. You can even use expressions in order to filter a data frame. Here are some examples:

people.iloc[0:2, 0:2] 
people.loc['Diane':'John', 'age':'weight'] 
people.loc[people.age > 40, ['age', 'WeightDivAge']]

The results of the previous operations are:

           age  weight
    Diane   42    57.1
    John    35    88.3
           age  weight
    Diane   42    57.1
    John    35    88.3
           age  WeightDivAge
    Diane   42      1.359524
    Mick    75      0.969333

From many other operations possible with data frames, let me just expose the joins. You can join two data frames with the pd.merge() method. The following code creates two data frames and joins them using the same index:

df1 = pd.DataFrame({'Person': ['Mary', 'Dejan', 'William', 'Milos'], 
                    'BirthYear': [1978, 1962, 1993, 1982]}) 
df2 = pd.DataFrame({'Person': ['Mary', 'Milos', 'Dejan', 'William'], 
                    'Group': ['Accounting', 'Development', 'Training', 'Training']}) 
pd.merge(df1, df2)

Here is the result:

       BirthYear   Person        Group
    0       1978     Mary   Accounting
    1       1962    Dejan     Training
    2       1993  William     Training
    3       1982    Milos  Development

Note you could always use an explicit key name, or even different key names, one from the left and one from the right data frame involved in the join, as long as the key values match. Here is an example of using an explicit key name:

pd.merge(df1, df2, on = 'Person')

The result is, of course, the same as the previous one. This was a one-to-one join. You can also do one-to-many joins, as the following example shows:

df3 = pd.DataFrame({'Group': ['Accounting', 'Development', 'Training'], 
                    'Supervisor': ['Carol', 'Pepe', 'Shasta']}) 
pd.merge(df2, df3)

The joined result is:

             Group   Person Supervisor
    0   Accounting     Mary      Carol
    1  Development    Milos       Pepe
    2     Training    Dejan     Shasta
    3     Training  William     Shasta

Finally, you can also perform many-to-many joins, as the last example in this section shows:

df4 = pd.DataFrame({'Group': ['Accounting', 'Accounting', 
                              'Development', 'Development', 
                              'Training'], 
                    'Skills': ['math', 'spreadheets',  
                               'coding', 'architecture', 
                               'presentation']}) 
pd.merge(df2, df4)

The result of this example is:

             Group   Person        Skills
    0   Accounting     Mary          math
    1   Accounting     Mary   spreadheets
    2  Development    Milos        coding
    3  Development    Milos  architecture
    4     Training    Dejan  presentation
    5     Training  William  presentation

I guess this chapter was quite terse so far, and also exhaustive. Unfortunately, all of this knowledge is needed to get to the data frames, which were our target from the beginning of the chapter. Therefore, it is finally time to do some more interesting things, namely analyze the data.

Table of Contents for Organizing data with pandas

Create new playlist

Sign In

Sign Up

Table of Contents for
Organizing data with pandas