Joining DataFrames

To demonstrate joining, we will use two CSV files: dest.csv and tips.csv. The use case behind it is that we are running a taxi company. Every time a passenger is dropped off at his or her destination, we add a row to the dest.csv file with the employee number of the driver and the destination:

EmpNr,Dest
5,The Hague
3,Amsterdam
9,Rotterdam

Sometimes drivers get a tip, so we want that registered in the tips.csv file (if this doesn't seem realistic, please feel free to come up with your own story):

EmpNr,Amount
5,10
9,5
7,2.5

Database-like joins in pandas can be done with either the merge() function or the join() DataFrame method. The join() method joins on indices by default, which might not be what you want. In SQL—a relational database query language—we have the inner join, left outer join, right outer join, and full outer join.

Note

An inner join selects rows from two tables, if and only if values match, for columns specified in the join condition. Outer joins do not require a match and can potentially return more rows. Please refer to the following Wikipedia page on joins: http://en.wikipedia.org/wiki/Join_%28SQL%29.

All these join types are supported by pandas, but we will only take a look at inner joins and full outer joins.

  • A join on the employee number with the merge() function is performed as follows:
    print "Merge() on key
    ", pd.merge(dests, tips, on='EmpNr')

    This gives an inner join as the outcome:

    Merge() on key
       EmpNr       Dest  Amount
    0      5  The Hague      10
    1      9  Rotterdam       5
    
    [2 rows x 3 columns]
    
  • Joining with the join() method requires providing suffixes for the left and right operands:
    print "Dests join() tips
    ", dests.join(tips, lsuffix='Dest', rsuffix='Tips')

    This method call joins index values so that the result is different from a SQL inner join:

    Dests join() tips
       EmpNrDest       Dest  EmpNrTips  Amount
    0          5  The Hague          5    10.0
    1          3  Amsterdam          9     5.0
    2          9  Rotterdam          7     2.5
    
    [3 rows x 4 columns]
    
  • An even more explicit way to execute an inner join with merge() is as follows:
    print "Inner join with merge()
    ", pd.merge(dests, tips, how='inner')

    The output is as follows:

    Inner join with merge()
       EmpNr       Dest  Amount
    0      5  The Hague      10
    1      9  Rotterdam       5
    
    [2 rows x 3 columns]
    

    To make this a full outer join requires only a small change:

    print "Outer join
    ", pd.merge(dests, tips, how='outer')

    The outer join adds rows with NaN values:

    Outer join
       EmpNr       Dest  Amount
    0      5  The Hague    10.0
    1      3  Amsterdam     NaN
    2      9  Rotterdam     5.0
    3      7        NaN     2.5
    
    [4 rows x 3 columns]
    

In a relational database query, these values would have been set to NULL. The demo code is in the join_demo.py file of this book's code bundle:

import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import random_integers
import numpy as np

seed(42)

df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot',
   'cold', 'hot', 'cold'],
   'Food' : ['soup', 'soup', 'icecream', 'chocolate',
   'icecream', 'icecream', 'soup'],
   'Price' : 10 * rand(7), 'Number' : random_integers(1, 9, size=(7,))})

print "df :3
", df[:3]
print "Concat Back together
", pd.concat([df[:3], df[3:]])

print "Appending rows
", df[:3].append(df[5:])

dests = pd.read_csv('dest.csv')
print "Dests
", dests

tips = pd.read_csv('tips.csv')
print "Tips
", tips

print "Merge() on key
", pd.merge(dests, tips, on='EmpNr')
print "Dests join() tips
", dests.join(tips, lsuffix='Dest', rsuffix='Tips')

print "Inner join with merge()
", pd.merge(dests, tips, how='inner')
print "Outer join
", pd.merge(dests, tips, how='outer')
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.37.196