Merging and joining data

pandas allows the merging of pandas objects with database-like join operations using the pd.merge() function and the .merge() method of a DataFrame object. These joins are high performance and are performed in memory. A merge combines the data of two pandas objects by finding matching values in one or more columns or row indexes. It then returns a new object that represents a combination of the data from both based on relational-database-like join semantics applied to those values.

Merges are useful as they allow us to model each type of data with a single DataFrame (one of the rules of tidy data) while still being able to relate data in different DataFrame objects using values that exist in both sets of data.

An overview of merges

A practical and probably canonical example is that of looking up customer names for orders. To demonstrate this in pandas, we will use the following two DataFrame objects, where one represents a list of customer details, and the other represents the orders made by customers and the date each order was made. They will be related to each other using the CustomerID column present in each:

In [21]:
   # these are our customers
   customers = {'CustomerID': [10, 11],
                'Name': ['Mike', 'Marcia'],
                'Address': ['Address for Mike',
                            'Address for Marcia']}
   customers = pd.DataFrame(customers)
   customers

Out[21]:
              Address  CustomerID    Name
   0    Address for Mike          10    Mike
   1  Address for Marcia          11  Marcia

In [22]:
   # and these are the orders made by our customers
   # they are related to customers by CustomerID
   orders = {'CustomerID': [10, 11, 10],
             'OrderDate': [datetime.date(2014, 12, 1),
                           datetime.date(2014, 12, 1),
                           datetime.date(2014, 12, 1)]}
   orders = pd.DataFrame(orders)
   orders

Out[22]:
      CustomerID   OrderDate
   0          10  2014-12-01
   1          11  2014-12-01
   2          10  2014-12-01

Now suppose we would like to ship the orders to the customers. We would need to merge the orders data with the customers detail data to determine the address for each order. In pandas, this can be easily performed with the following statement:

In [23]:
   # merge customers and orders so we can ship the items
   customers.merge(orders)

Out[23]:
                 Address  CustomerID    Name   OrderDate
   0    Address for Mike          10    Mike  2014-12-01
   1    Address for Mike          10    Mike  2014-12-01
   2  Address for Marcia          11  Marcia  2014-12-01

pandas has done something almost magical here by accomplishing this with such a simple piece of code. It has recognized that our customers and orders objects both have a column named CustomerID. With this knowledge, it uses the common values found in that column of both DataFrame objects to relate the data and form the merged result based on inner join semantics.

To be even more detailed, what pandas has specifically done is the following (an explicit equivalent of the merge is sketched just after this list):

  1. It determines the columns in both customers and orders with common labels. These columns are treated as the keys to perform the join.
  2. It creates a new DataFrame whose columns are the labels from the keys identified in step 1, followed by all of the non-key labels from both objects.
  3. It matches the values in the key columns of both DataFrame objects.
  4. It then creates a row in the result for each set of matching values.
  5. It then copies the data from those matching rows in each source object into the respective row and columns of the result.
  6. It assigns a new Int64Index to the result.
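
The same result can be produced by naming the key column and the type of join explicitly instead of relying on the defaults. The following is a minimal sketch (not one of the numbered examples) using the customers and orders objects created earlier:

   # explicit equivalent of customers.merge(orders):
   # relate the objects on the common CustomerID column
   # using inner join semantics
   pd.merge(customers, orders, on='CustomerID', how='inner')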

The join in a merge can use values from multiple columns. To demonstrate, the following creates two DataFrame objects and performs the merge where pandas decides to use the values in the key1 and key2 columns of both objects:

In [24]:
   # data to be used in the remainder of this section's examples
   left_data = {'key1': ['a', 'b', 'c'],
                'key2': ['x', 'y', 'z'],
                'lval1': [0, 1, 2]}
   right_data = {'key1': ['a', 'b', 'c'],
                 'key2': ['x', 'a', 'z'],
                 'rval1': [6, 7, 8]}
   left = pd.DataFrame(left_data, index=[0, 1, 2])
   right = pd.DataFrame(right_data, index=[1, 2, 3])
   left

Out[24]:
     key1 key2  lval1
   0    a    x      0
   1    b    y      1
   2    c    z      2

In [25]:
   right

Out[25]:
     key1 key2  rval1
   1    a    x      6
   2    b    a      7
   3    c    z      8

In [26]:
   # demonstrate merge without specifying columns to merge
   # this will implicitly merge on all common columns
   left.merge(right)

Out[26]:
     key1 key2  lval1  rval1
   0    a    x      0      6
   1    c    z      2      8

This merge identifies key1 and key2 columns in common in both DataFrame objects and hence uses them for the merge. The matching tuples of values in both DataFrame objects for these columns are (a, x) and (c, z) and therefore this results in two rows of values.

To explicitly specify which column to use to relate the objects, use the on parameter. The following performs a merge using only the values in the key1 column of both DataFrame objects:

In [27]:
   # demonstrate merge using an explicit column
   # the column passed to on needs to exist in both DataFrame objects
   left.merge(right, on='key1')

Out[27]:
     key1 key2_x  lval1 key2_y  rval1
   0    a      x      0      x      6
   1    b      y      1      a      7
   2    c      z      2      z      8

Comparing this result to the previous example, as only the values in the key1 column were used to relate the data in the two objects, the result now has three rows as there are matching a, b, and c values in that single column of both objects.

The on parameter can also be given a list of column names. The following reverts to using both the key1 and key2 columns, and the result is identical to the earlier example where those two columns were implicitly identified by pandas:

In [28]:
   # merge explicitly using two columns
   left.merge(right, on=['key1', 'key2'])

Out[28]:
     key1 key2  lval1  rval1
   0    a    x      0      6
   1    c    z      2      8

The columns specified with on need to exist in both DataFrame objects. If you would like to merge based on columns with different names in each object, you can use the left_on and right_on parameters, passing the name or names of columns to each respective parameter.
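
As a brief, hypothetical sketch of this (the renamed column below is not part of the book's data), suppose the right object named its first key column r_key1 instead of key1:

   # hypothetical example: rename right's key1 so the key columns
   # no longer share a name
   right_renamed = right.rename(columns={'key1': 'r_key1'})
   # relate left.key1 to right_renamed.r_key1
   left.merge(right_renamed, left_on='key1', right_on='r_key1')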

To perform a merge with the labels of the row indexes of the two DataFrame objects, use the left_index=True and right_index=True parameters (both need to be specified):

In [29]:
   # join on the row index labels of both DataFrame objects
   pd.merge(left, right, left_index=True, right_index=True)

Out[29]:
     key1_x key2_x  lval1 key1_y key2_y  rval1
   1      b      y      1      a      x      6
   2      c      z      2      b      a      7

This has identified that the index labels in common are 1 and 2, so the resulting DataFrame has two rows with those labels in its index. pandas then creates a column in the result for every column in both objects and copies the values.

As both DataFrame objects have columns with identical names (key1 and key2), the overlapping columns in the result have the _x and _y suffixes appended to identify the DataFrame they originated from: _x for left and _y for right. You can specify different suffixes using the suffixes parameter, passing a two-item sequence.
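
As a minimal sketch (not one of the numbered examples), the previous merge on the row indexes could name its own suffixes as follows:

   # same merge on the row indexes, but with explicit suffixes
   # for the overlapping column names
   pd.merge(left, right, left_index=True, right_index=True,
            suffixes=('_left', '_right'))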

Specifying the join semantics of a merge operation

The default type of join performed by pd.merge() is an inner join. To use a different method of joining, specify it using the how parameter of the pd.merge() function (or the .merge() method). The valid options are:

  • inner: This is the intersection of keys from both DataFrame objects
  • outer: This is the union of keys from both DataFrame objects
  • left: This only uses keys from the left DataFrame
  • right: This only uses keys from the right DataFrame

As we have seen, an inner join is the default and will return a merge of the data from both DataFrame objects only where the values match.

An outer join, in contrast, returns both the merged matched rows and the unmatched rows from the left and right DataFrame objects, with NaN filled in for the unmatched values. The following code demonstrates an outer join:

In [30]:
   # outer join, merges all matched data, 
   # and fills unmatched items with NaN
   left.merge(right, how='outer')

Out[30]:
     key1 key2  lval1  rval1
   0    a    x      0      6
   1    b    y      1    NaN
   2    c    z      2      8
   3    b    a    NaN      7

A left join will return the merge of the rows that satisfy the join of the values in the specified columns, and also returns the unmatched rows from only left:

In [31]:
   # left join: keeps all rows from left, merging in matched data
   # from right and filling unmatched items with NaN
   # rows with labels 0 and 2 match on key1 and key2;
   # the row with label 1 comes only from left
   left.merge(right, how='left')

Out[31]:
     key1 key2  lval1  rval1
   0    a    x      0      6
   1    b    y      1    NaN
   2    c    z      2      8

A right join will return the merge of the rows that satisfy the join of the values in the specified columns, and also returns the unmatched rows from only right:

In [32]:
   # right join: keeps all rows from right, merging in matched data
   # from left and filling unmatched items with NaN
   # rows with labels 0 and 1 match on key1 and key2;
   # the row with label 2 comes only from right
   left.merge(right, how='right')

Out[32]:
     key1 key2  lval1  rval1
   0    a    x      0      6
   1    c    z      2      8
   2    b    a    NaN      7

The pandas library also provides a .join() method that can be used to perform a join using the index labels of the two DataFrame objects (instead of values in columns). Note that if the columns in the two DataFrame objects do not have unique column names, you must specify suffixes using the lsuffix and rsuffix parameters (automatic suffixing is not performed). The following code demonstrates both the join and specification of suffixes:

In [33]:
   # join left with right (the default method for .join() is a left join)
   # and since these DataFrame objects have duplicate column names
   # we must specify lsuffix and rsuffix
   left.join(right, lsuffix='_left', rsuffix='_right')

Out[33]:
     key1_left key2_left  lval1 key1_right key2_right  rval1
   0         a         x      0        NaN        NaN    NaN
   1         b         y      1          a          x      6
   2         c         z      2          b          a      7

The default type of join performed by .join() is a left join. Note that this differs from the default of the .merge() method, which is an inner join. To change to an inner join, specify how='inner', as is demonstrated in the following example:

In [34]:
   # join left with right with an inner join
   left.join(right, lsuffix='_left', rsuffix='_right', how='inner')

Out[34]:
     key1_left key2_left  lval1 key1_right key2_right  rval1
   1         b         y      1          a          x      6
   2         c         z      2          b          a      7

Notice that this result is roughly equivalent to the earlier result from In [29], except that the columns in the result have slightly different names.

It is also possible to perform right and left joins, but they lead to results similar to previous examples, so they will be omitted for brevity.

Pivoting

Data is often stored in a stacked format, which is also referred to as record format; this is common in databases, .csv files, and Excel spreadsheets. In a stacked format, the data is often not normalized and has repeated values in many columns, or values that should logically exist in other tables (violating another concept of tidy data).

Take the following data, which represents a stream of data from an accelerometer on a mobile device (provided with the data from the sample code):

In [35]:
   # read in the accelerometer data
   sensor_readings = pd.read_csv("data/accel.csv")
   sensor_readings

Out[35]:
       interval axis  reading
   0          0    X      0.0
   1          0    Y      0.5
   2          0    Z      1.0
   3          1    X      0.1
   4          1    Y      0.4
   ..       ...  ...      ...
   7          2    Y      0.3
   8          2    Z      0.8
   9          3    X      0.3
   10         3    Y      0.2
   11         3    Z      0.7

   [12 rows x 3 columns]

An issue with this organization of the data is: how does one go about determining the readings for a specific axis? This can be done naively with Boolean selection:

In [36]:
   # extract X-axis readings
   sensor_readings[sensor_readings['axis'] == 'X']

Out[36]:
      interval axis  reading
   0         0    X      0.0
   3         1    X      0.1
   6         2    X      0.2
   9         3    X      0.3

An issue here is: what if you want to know the values for all axes at a given time, not just the x axis? You can perform a selection for each value of axis, but that is repetitive code and does not handle the scenario of new axis values being inserted into the DataFrame without a change to the code.

A better representation would be one where the columns represent the unique values of the variable. To convert to this form, use the DataFrame object's .pivot() method:

In [37]:
   # pivot the data: interval becomes the index, the columns are
   # the distinct values of axis, and the readings become the values
   sensor_readings.pivot(index='interval',
                         columns='axis',
                         values='reading')

Out[37]:
   axis        X    Y    Z
   interval               
   0         0.0  0.5  1.0
   1         0.1  0.4  0.9
   2         0.2  0.3  0.8
   3         0.3  0.2  0.7

This has taken all of the distinct values from the axis column and pivoted them into columns of the new DataFrame, while filling in the values for the new columns from the appropriate rows and columns of the original DataFrame. This new DataFrame makes it very easy to identify the X, Y, and Z sensor readings at each time interval.
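
As a brief sketch of using this result (the variable name pivoted below is introduced here for illustration and is not part of the original example), all of the axis readings at a single interval can now be retrieved with one row selection:

   # keep the pivoted readings in a variable for reuse
   pivoted = sensor_readings.pivot(index='interval',
                                   columns='axis',
                                   values='reading')
   # all of the axis readings at interval 1
   pivoted.loc[1]
   # just the Y-axis reading at interval 2
   pivoted.loc[2, 'Y']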
