pandas allows the merging of pandas objects with database-like join operations using the pd.merge() function and the .merge() method of a DataFrame object. These joins are high performance and are performed in memory. A merge combines the data of two pandas objects by finding matching values in one or more columns or row indexes. It then returns a new object that represents a combination of the data from both, based on relational-database-like join semantics applied to those values.
Merges are useful as they allow us to model a single DataFrame for each type of data (one of the rules of having tidy data) while still being able to relate data in different DataFrame objects using values existing in both sets of data.
A practical and probably canonical example is looking up customer names from orders. To demonstrate this in pandas, we will use the following two DataFrame objects, where one represents a list of customer details, and the other represents the orders made by customers and the day each order was made. They will be related to each other using the CustomerID column in each:
In [21]: # these are our customers
         customers = {'CustomerID': [10, 11],
                      'Name': ['Mike', 'Marcia'],
                      'Address': ['Address for Mike',
                                  'Address for Marcia']}
         customers = pd.DataFrame(customers)
         customers

Out[21]:
              Address  CustomerID    Name
0    Address for Mike          10    Mike
1  Address for Marcia          11  Marcia

In [22]: # and these are the orders made by our customers
         # they are related to customers by CustomerID
         orders = {'CustomerID': [10, 11, 10],
                   'OrderDate': [datetime.date(2014, 12, 1),
                                 datetime.date(2014, 12, 1),
                                 datetime.date(2014, 12, 1)]}
         orders = pd.DataFrame(orders)
         orders

Out[22]:
   CustomerID   OrderDate
0          10  2014-12-01
1          11  2014-12-01
2          10  2014-12-01
Now suppose we would like to ship the orders to the customers. We would need to merge the orders data with the customers detail data to determine the address for each order. In pandas, this can be easily performed with the following statement:
In [23]: # merge customers and orders so we can ship the items
         customers.merge(orders)

Out[23]:
              Address  CustomerID    Name   OrderDate
0    Address for Mike          10    Mike  2014-12-01
1    Address for Mike          10    Mike  2014-12-01
2  Address for Marcia          11  Marcia  2014-12-01
pandas has done something magical for us here by accomplishing this with such a simple piece of code. What pandas has done is realize that our customers and orders objects both have a column named CustomerID. With this knowledge, it uses the common values found in that column of both DataFrame objects to relate the data and form the merged result based on inner join semantics.
To be even more detailed, what pandas has specifically done is the following:

1. It determines the columns in both customers and orders with common labels. These columns are treated as the keys to perform the join.
2. It creates a new DataFrame whose columns are the labels from the keys identified in step 1, followed by all of the non-key labels from both objects.
3. It matches values in the key columns of both DataFrame objects.
4. It then copies the data from the matched rows into the result and assigns a new Int64Index to the result.

The join in a merge can use values from multiple columns. To demonstrate, the following creates two DataFrame objects and performs a merge where pandas decides to use the values in the key1 and key2 columns of both objects:
In [24]: # data to be used in the remainder of this section's examples
         left_data = {'key1': ['a', 'b', 'c'],
                      'key2': ['x', 'y', 'z'],
                      'lval1': [0, 1, 2]}
         right_data = {'key1': ['a', 'b', 'c'],
                       'key2': ['x', 'a', 'z'],
                       'rval1': [6, 7, 8]}
         left = pd.DataFrame(left_data, index=[0, 1, 2])
         right = pd.DataFrame(right_data, index=[1, 2, 3])
         left

Out[24]:
  key1 key2  lval1
0    a    x      0
1    b    y      1
2    c    z      2

In [25]: right

Out[25]:
  key1 key2  rval1
1    a    x      6
2    b    a      7
3    c    z      8

In [26]: # demonstrate merge without specifying columns to merge
         # this will implicitly merge on all common columns
         left.merge(right)

Out[26]:
  key1 key2  lval1  rval1
0    a    x      0      6
1    c    z      2      8
This merge identifies the key1 and key2 columns in common in both DataFrame objects and hence uses them for the merge. The matching tuples of values in both DataFrame objects for these columns are (a, x) and (c, z), and therefore the result has two rows of values.
To explicitly specify which columns to use to relate the objects, use the on parameter. The following performs a merge using only the values in the key1 column of both DataFrame objects:
In [27]: # demonstrate merge using an explicit column
         # on needs the value to be in both DataFrame objects
         left.merge(right, on='key1')

Out[27]:
  key1 key2_x  lval1 key2_y  rval1
0    a      x      0      x      6
1    b      y      1      a      7
2    c      z      2      z      8
Comparing this result to the previous example, as only the values in the key1 column were used to relate the data in the two objects, the result now has three rows, as there are matching a, b, and c values in that single column of both objects.
The on parameter can also be given a list of column names. The following reverts to using both the key1 and key2 columns, with a result identical to the earlier example where those two columns were implicitly identified by pandas:
In [28]: # merge explicitly using two columns
         left.merge(right, on=['key1', 'key2'])

Out[28]:
  key1 key2  lval1  rval1
0    a    x      0      6
1    c    z      2      8
The columns specified with on need to exist in both DataFrame objects. If you would like to merge based on columns with different names in each object, you can use the left_on and right_on parameters, passing the name or names of columns to each respective parameter.
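As a brief sketch of left_on and right_on (the cust_id and customer column names here are hypothetical, not from the earlier examples):

```python
import pandas as pd

# two hypothetical DataFrame objects whose key columns have different names
buyers = pd.DataFrame({'cust_id': [10, 11],
                       'name': ['Mike', 'Marcia']})
sales = pd.DataFrame({'customer': [10, 10, 11],
                      'amount': [5, 3, 8]})

# relate cust_id in the left object to customer in the right
merged = buyers.merge(sales, left_on='cust_id', right_on='customer')
print(merged)
```

Note that both key columns survive into the result under their original names, since pandas has no reason to collapse differently named columns into one.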
To perform a merge using the labels of the row indexes of the two DataFrame objects, use the left_index=True and right_index=True parameters (both need to be specified):
In [29]: # join on the row indices of both matrices
         pd.merge(left, right, left_index=True, right_index=True)

Out[29]:
  key1_x key2_x  lval1 key1_y key2_y  rval1
1      b      y      1      a      x      6
2      c      z      2      b      a      7
This has identified that the index labels in common are 1 and 2, so the resulting DataFrame has two rows with these values and labels in the index. pandas then creates a column in the result for every column in both objects and copies the values.
As both DataFrame objects had columns with identical names, key1 and key2, the columns in the result have the _x and _y suffixes appended to them to identify the DataFrame they originated from: _x is for left and _y for right. You can change these suffixes using the suffixes parameter, passing a two-item sequence.
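A minimal sketch of the suffixes parameter, repeating the index-based merge from above but with more readable suffixes:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['a', 'b', 'c'],
                     'key2': ['x', 'y', 'z'],
                     'lval1': [0, 1, 2]}, index=[0, 1, 2])
right = pd.DataFrame({'key1': ['a', 'b', 'c'],
                      'key2': ['x', 'a', 'z'],
                      'rval1': [6, 7, 8]}, index=[1, 2, 3])

# replace the default _x / _y suffixes with _left / _right
result = pd.merge(left, right, left_index=True, right_index=True,
                  suffixes=('_left', '_right'))
print(result.columns.tolist())
```

Only the columns that collide in name receive a suffix; lval1 and rval1 are unique to their objects and so are left untouched.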
The default type of join performed by pd.merge() is an inner join. To use another join method, specify it using the how parameter of the pd.merge() function (or the .merge() method). The valid options are:
- inner: This is the intersection of keys from both DataFrame objects
- outer: This is the union of keys from both DataFrame objects
- left: This only uses keys from the left DataFrame
- right: This only uses keys from the right DataFrame
As we have seen, an inner join is the default and returns a merge of the data from both DataFrame objects only where the values match.
An outer join, in contrast, returns both the merge of the matched rows and the unmatched rows from both the left and right DataFrame objects, with NaN filled in for the unmatched portions. The following code demonstrates an outer join:
In [30]: # outer join, merges all matched data,
         # and fills unmatched items with NaN
         left.merge(right, how='outer')

Out[30]:
  key1 key2  lval1  rval1
0    a    x      0      6
1    b    y      1    NaN
2    c    z      2      8
3    b    a    NaN      7
A left join will return the merge of the rows that satisfy the join of the values in the specified columns, and also returns the unmatched rows from only left:
In [31]: # left join, merges all matched data, and only fills unmatched
         # items from the left dataframe with NaN filled for the
         # unmatched items in the result
         # rows with labels 0 and 2 match on key1 and key2
         # the row with label 1 is from left
         left.merge(right, how='left')

Out[31]:
  key1 key2  lval1  rval1
0    a    x      0      6
1    b    y      1    NaN
2    c    z      2      8
A right join will return the merge of the rows that satisfy the join of the values in the specified columns, and also returns the unmatched rows from only right:
In [32]: # right join, merges all matched data, and only fills unmatched
         # items from the right with NaN filled for the unmatched items
         # in the result
         # rows with labels 0 and 1 match on key1 and key2
         # the row with label 2 is from right
         left.merge(right, how='right')

Out[32]:
  key1 key2  lval1  rval1
0    a    x      0      6
1    c    z      2      8
2    b    a    NaN      7
The pandas library also provides a .join() method that can be used to perform a join using the index labels of the two DataFrame objects (instead of values in columns). Note that if the columns in the two DataFrame objects do not have unique names, you must specify suffixes using the lsuffix and rsuffix parameters (automatic suffixing is not performed). The following code demonstrates both the join and the specification of suffixes:
In [33]: # join left with right (default method is outer)
         # and since these DataFrame objects have duplicate column names
         # we just specify lsuffix and rsuffix
         left.join(right, lsuffix='_left', rsuffix='_right')

Out[33]:
  key1_left key2_left  lval1 key1_right key2_right  rval1
0         a         x      0        NaN        NaN    NaN
1         b         y      1          a          x      6
2         c         z      2          b          a      7
The default type of join performed by .join() is an outer join. Note that this differs from the default of the .merge() method, which is inner. To change to an inner join, specify how='inner', as demonstrated in the following example:
In [34]: # join left with right with an inner join
         left.join(right, lsuffix='_left', rsuffix='_right',
                   how='inner')

Out[34]:
  key1_left key2_left  lval1 key1_right key2_right  rval1
1         b         y      1          a          x      6
2         c         z      2          b          a      7
Notice that this is roughly equivalent to the earlier result from In [29], except that the result has columns with slightly different names.
It is also possible to perform right and left joins, but they lead to results similar to previous examples, so they will be omitted for brevity.
Data is often stored in a stacked format, which is also referred to as record format; this is common in databases, .csv files, and Excel spreadsheets. In a stacked format, the data is often not normalized and has repeated values in many columns, or values that should logically exist in other tables (violating another concept of tidy data).
Take the following data, which represents a stream of data from an accelerometer on a mobile device (provided with the data from the sample code):
In [35]: # read in accelerometer data
         sensor_readings = pd.read_csv("data/accel.csv")
         sensor_readings

Out[35]:
    interval axis  reading
0          0    X      0.0
1          0    Y      0.5
2          0    Z      1.0
3          1    X      0.1
4          1    Y      0.4
..       ...  ...      ...
7          2    Y      0.3
8          2    Z      0.8
9          3    X      0.3
10         3    Y      0.2
11         3    Z      0.7

[12 rows x 3 columns]
An issue with this data as it is organized is: how does one go about determining the readings for a specific axis? This can be naively done with Boolean selections:
In [36]: # extract X-axis readings
         sensor_readings[sensor_readings['axis'] == 'X']

Out[36]:
   interval axis  reading
0         0    X      0.0
3         1    X      0.1
6         2    X      0.2
9         3    X      0.3
An issue here is: what if you want to know the values for all axes at a given time, not just the x axis? You can perform a selection for each value of the axis, but that is repetitive code and does not handle the scenario of new axis values being inserted into the DataFrame without a change to the code.
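To see the limitation concretely, here is the same Boolean-selection approach applied per interval (using a small inline stand-in for the accelerometer data, since data/accel.csv is not reproduced here):

```python
import pandas as pd

# a small stand-in for the accelerometer data read from data/accel.csv
sensor_readings = pd.DataFrame({
    'interval': [0, 0, 0, 1, 1, 1],
    'axis': ['X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'reading': [0.0, 0.5, 1.0, 0.1, 0.4, 0.9]})

# all axes at interval 1 -- still three stacked rows,
# not one row with a value per axis
at_interval_1 = sensor_readings[sensor_readings['interval'] == 1]
print(at_interval_1)
```

The selection works, but each reading still occupies its own row, so comparing axes at one instant requires further reshaping.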
A better representation would be one where columns represent the unique variable values. To convert to this form, use the DataFrame object's .pivot() method:
In [37]: # pivot the data. Interval becomes the index, the columns are
         # the current axes values, and use the readings as values
         sensor_readings.pivot(index='interval',
                               columns='axis',
                               values='reading')

Out[37]:
axis        X    Y    Z
interval
0         0.0  0.5  1.0
1         0.1  0.4  0.9
2         0.2  0.3  0.8
3         0.3  0.2  0.7
This has taken all of the distinct values from the axis column and pivoted them into columns on the new DataFrame, while filling in values for the new columns from the appropriate rows and columns of the original DataFrame. This new DataFrame demonstrates that it is now very easy to identify the X, Y, and Z sensor readings at each time interval.
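Once pivoted, both access patterns become a single indexing operation. A short sketch (again using an inline stand-in for the CSV data):

```python
import pandas as pd

sensor_readings = pd.DataFrame({
    'interval': [0, 0, 0, 1, 1, 1],
    'axis': ['X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'reading': [0.0, 0.5, 1.0, 0.1, 0.4, 0.9]})

pivoted = sensor_readings.pivot(index='interval',
                                columns='axis',
                                values='reading')

# all three axis readings at interval 1 come back as a single row
print(pivoted.loc[1])

# a single axis across all intervals is just a column
print(pivoted['X'])
```

Either way, no per-axis Boolean selection is needed, and a new axis value appearing in the data would simply become another column after the pivot.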