Merging and joining

There are various functions that can be used to merge and join pandas' data structures, which include the following functions:

  • concat
  • append

The concat function

The concat function is used to join multiple pandas' data structures along a specified axis and possibly perform union or intersection operations along other axes. The following command explains the concat function:

concat(objs, axis=0, , join='outer', join_axes=None, ignore_index=False,
       keys=None, levels=None, names=None, verify_integrity=False)

The synopsis of the elements of concat function are as follows:

  • The objs function: A list or dictionary of Series, DataFrame, or Panel objects to be concatenated.
  • The axis function: The axis along which the concatenation should be performed. 0 is the default value.
  • The join function: The type of join to perform when handling indexes on other axes. The 'outer' function is the default.
  • The join_axes function: This is used to specify exact indexes for the remaining indexes instead of doing outer/inner join.
  • The keys function: This specifies a list of keys to be used to construct a MultiIndex.

For an explanation of the remaining options, please refer to the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html.

Here is an illustration of the workings of concat using our stock price examples from earlier chapters:

In [53]: stockDataDF=pd.read_csv('./tech_stockprices.csv').set_index(['Symbol']);stockDataDF
Out[53]:
     Closing price  EPS  Shares Outstanding(M) P/E Market Cap(B) Beta
Symbol
    AAPL   501.53    40.32  892.45         12.44   447.59    0.84
    AMZN   346.15    0.59   459.00         589.80  158.88    0.52
    FB     61.48     0.59   2450.00        104.93  150.92    NaN
    GOOG   1133.43   36.05  335.83         31.44   380.64    0.87
    TWTR   65.25    -0.30   555.20         NaN     36.23     NaN
    YHOO   34.90     1.27   1010.00        27.48   35.36     0.66

We now take various slices of the data:

In [83]: A=stockDataDF.ix[:4, ['Closing price', 'EPS']]; A
Out[83]:  Closing price  EPS
          Symbol
          AAPL     501.53      40.32
          AMZN     346.15     0.59
          FB      61.48     0.59
          GOOG    1133.43    36.05

In [84]: B=stockDataDF.ix[2:-2, ['P/E']];B
Out[84]:         P/E
        Symbol
        FB     104.93
        GOOG   31.44

In [85]: C=stockDataDF.ix[1:5, ['Market Cap(B)']];C
Out[85]:         Market Cap(B)
        Symbol
        AMZN   158.88
        FB     150.92
        GOOG   380.64
        TWTR   36.23

Here, we perform a concatenation by specifying an outer join, which concatenates and performs a union on all the three data frames, and includes entries that do not have values for all the columns by inserting NaN for such columns:

In [86]: pd.concat([A,B,C],axis=1) # outer join
Out[86]:  Closing price  EPS    P/E   Market Cap(B)
       AAPL   501.53     40.32  NaN   NaN
       AMZN   346.15     0.59   NaN   158.88
       FB     61.48      0.59   104.93 150.92
       GOOG   1133.43    36.05  31.44 380.64
       TWTR   NaN        NaN    NaN    36.23 

We can also specify an inner join that does the concatenation, but only includes rows that contain values for all the columns in the final data frame by throwing out rows with missing columns, that is, it takes the intersection:

In [87]: pd.concat([A,B,C],axis=1, join='inner') # Inner join
Out[87]:        Closing price  EPS  P/E   Market Cap(B)
         Symbol
         FB      61.48    0.59 104.93  150.92
         GOOG    1133.43   36.05   31.44   380.64

The third case enables us to use the specific index from the original DataFrame to join on:

In [102]: pd.concat([A,B,C], axis=1, join_axes=[stockDataDF.index])
Out[102]:       Closing price  EPS    P/E   Market Cap(B)
         Symbol
         AAPL   501.53     40.32  NaN   NaN
         AMZN   346.15     0.59   NaN   158.88
         FB     61.48      0.59  104.93 150.92
         GOOG   1133.43    36.05  31.44 380.64
         TWTR   NaN        NaN    NaN    36.23
         YHOO   NaN        NaN    NaN    NaN

In this last case, we see that the row for YHOO was included even though it wasn't contained in any of the slices that were concatenated. In this case, however, the values for all the columns are NaN. Here is another illustration of concat, but this time, it is on random statistical distributions. Note that in the absence of an axis argument, the default axis of concatenation is 0:

In[135]: np.random.seed(100)
        normDF=pd.DataFrame(np.random.randn(3,4));normDF
Out[135]:    0    1      2    3
         0  -1.749765  0.342680  1.153036  -0.252436
         1   0.981321  0.514219  0.221180  -1.070043
         2  -0.189496  0.255001 -0.458027   0.435163

In [136]: binomDF=pd.DataFrame(np.random.binomial(100,0.5,(3,4)));binomDF
Out[136]:    0  1  2  3
          0  57  50  57     50
          1  48  56  49     43
          2  40  47  49     55

In [137]: poissonDF=pd.DataFrame(np.random.poisson(100,(3,4)));poissonDF
Out[137]:  0  1  2  3
         0  93  96  96  89
         1  76  96  104  103
         2  96  93  107   84

In [138]: rand_distribs=[normDF,binomDF,poissonDF]
In [140]: rand_distribsDF=pd.concat(rand_distribs,keys=['Normal', 'Binomial', 'Poisson']);rand_distribsDF
Out[140]:         0        1       2          3
  Normal     0  -1.749765   0.342680  1.153036  -0.252436
       1   0.981321   0.514219  0.221180  -1.070043
       2  -0.189496   0.255001 -0.458027   0.435163
  Binomial 0   57.00       50.00     57.00      50.00
       1   48.00       56.00     49.00      43.00
       2   40.00       47.00     49.00      55.00
  Poisson  0   93.00       96.00     96.00      89.00
       1   76.00       96.00    104.00     103.00
       2   96.00       93.00    107.00      84.00

Using append

The append function is a simpler version of concat that concatenates along axis=0. Here is an illustration of its use where we slice out the first two rows and first three columns of the stockData DataFrame:

In [145]: stockDataA=stockDataDF.ix[:2,:3]
            stockDataA
Out[145]:  Closing price  EPS   Shares Outstanding(M)
  Symbol
  AAPL     501.53   40.32   892.45
  AMZN     346.15   0.59   459.00

And the remaining rows:

In [147]: stockDataB=stockDataDF[2:]
         stockDataB
Out[147]:
     Closing price EPS Shares Outstanding(M)  P/E  Market Cap(B) Beta
Symbol
FB   61.48         0.59  2450.00          104.93 150.92   NaN
GOOG   1133.43    36.05   335.83          31.44  380.64   0.87
TWTR     65.25    -0.30   555.20           NaN     36.23   NaN
YHOO     34.90  1.27  1010.00       27.48  35.36   0.66

Now, we use append to combine the two data frames from the preceding commands:

In [161]:stockDataA.append(stockDataB)
Out[161]:
         Beta Closing price EPS MarketCap(B) P/E    Shares Outstanding(M)
         Symbol
         AMZN  NaN    346.15    0.59  NaN   NaN    459.00
         GOOG  NaN    1133.43   36.05  NaN   NaN    335.83
         FB    NaN    61.48     0.59  150.92 104.93 2450.00
         YHOO  27.48  34.90     1.27  35.36   0.66   1010.00
         TWTR  NaN    65.25    -0.30  36.23   NaN    555.20
         AAPL  12.44  501.53    40.32  0.84   447.59 892.45

In order to maintain the order of columns similar to the original DataFrame, we can apply the reindex_axis function:

In [151]: stockDataA.append(stockDataB).reindex_axis(stockDataDF.columns, axis=1)
Out[151]:
          Closing price EPS Shares Outstanding(M)  P/E Market Cap(B) Beta
          Symbol
          AAPL   501.53  40.32  892.45         NaN  NaN      NaN
          AMZN   346.15   0.59  459.00         NaN  NaN      NaN
          FB     61.48     0.59  2450.00       104.93  150.92      NaN
          GOOG   1133.43  36.05  335.83        31.44  380.64     0.87
          TWTR   65.25  -0.30  555.20         NaN   36.23      NaN
          YHOO   34.90     1.27  1010.00       27.48  35.36     0.66

Note that for the first two rows, the value of the last two columns is NaN, since the first DataFrame only contained the first three columns. The append function does not work in places, but it returns a new DataFrame with the second DataFrame appended to the first.

Appending a single row to a DataFrame

We can append a single row to a DataFrame by passing a series or dictionary to the append method:

In [152]: 
algos={'search':['DFS','BFS','Binary Search','Linear'],
        'sorting': ['Quicksort','Mergesort','Heapsort','Bubble Sort'],
       'machine learning':['RandomForest','K Nearest Neighbor','Logistic Regression','K-Means Clustering']}
algoDF=pd.DataFrame(algos);algoDF
Out[152]: machine learning    search      sorting
        0    RandomForest        DFS      Quicksort
        1    K Nearest Neighbor   BFS      Mergesort
        2    Logistic Regression  Binary Search Heapsort
        3    K-Means Clustering   Linear       Bubble Sort

In [154]: 
moreAlgos={'search': 'ShortestPath'  , 'sorting': 'Insertion Sort',
            'machine learning': 'Linear Regression'}
    algoDF.append(moreAlgos,ignore_index=True)
Out[154]: machine learning    search      sorting
       0    RandomForest        DFS      Quicksort
       1    K Nearest Neighbor    BFS      Mergesort
       2    Logistic Regression Binary Search Heapsort
       3    K-Means Clustering  Linear       Bubble Sort
       4    Linear Regression   ShortestPath  Insertion Sort

In order for this to work, you must pass the ignore_index=True argument so that the index [0,1,2,3] in algoDF is ignored.

SQL-like merging/joining of DataFrame objects

The merge function is used to obtain joins of two DataFrame objects similar to those used in SQL database queries. The DataFrame objects are analogous to SQL tables. The following command explains this:

merge(left, right, how='inner', on=None, left_on=None,
      right_on=None, left_index=False, right_index=False, 
      sort=True, suffixes=('_x', '_y'), copy=True)

Following is the synopsis of merge function:

  • The left argument: This is the first DataFrame object
  • The right argument: This is the second DataFrame object
  • The how argument: This is the type of join and can be inner, outer, left, or right. The default is inner.
  • The on argument: This shows the names of columns to join on as join keys.
  • The left_on and right_on arguments : This shows the left and right DataFrame column names to join on.
  • The left_index and right_index arguments: This has a Boolean value. If this is True, use the left or right DataFrame index/row labels to join on.
  • The sort argument: This has a Boolean value. The default True setting results in a lexicographical sorting. Setting the default value to False may improve performance.
  • The suffixes argument: The tuple of string suffixes to be applied to overlapping columns. The defaults are '_x' and '_y'.
  • The copy argument: The default True value causes data to be copied from the passed DataFrame objects.

The source of the preceding information can be found at http://pandas.pydata.org/pandas-docs/stable/merging.html.

Let us start to examine the use of merge by reading the U.S. stock index data into a DataFrame:

In [254]: USIndexDataDF=pd.read_csv('./us_index_data.csv')
          USIndexDataDF
Out[254]:    TradingDate  Nasdaq   S&P 500  Russell 2000  DJIA
         0   2014/01/30   4123.13  1794.19       1139.36  15848.61
         1   2014/01/31   4103.88  1782.59   1130.88  15698.85
         2   2014/02/03   3996.96  1741.89   1094.58  15372.80
         3   2014/02/04   4031.52  1755.20   1102.84  15445.24
         4   2014/02/05   4011.55  1751.64   1093.59  15440.23
         5   2014/02/06   4057.12  1773.43   1103.93  15628.53

The source of this information can be found at http://finance.yahoo.com.

We can obtain slice1 of the data for rows 0 and 1 and the Nasdaq and S&P 500 columns by using the following command:

In [255]: slice1=USIndexDataDF.ix[:1,:3]
            slice1
Out[255]:   TradingDate  Nasdaq         S&P 500
         0       2014/01/30  4123.13   1794.19
         1       2014/01/31  4103.88   1782.59

We can obtain slice2 of the data for rows 0 and 1 and the Russell 2000 and DJIA columns by using the following command:

In [256]: slice2=USIndexDataDF.ix[:1,[0,3,4]]
       slice2
Out[256]:   TradingDate  Russell 2000    DJIA
         0       2014/01/30  1139.36     15848.61
         1       2014/01/31  1130.88     15698.85

We can obtain slice3 of the data for rows 1 and 2 and the Nasdaq and S&P 500 columns by using the following command:

In [248]: slice3=USIndexDataDF.ix[[1,2],:3]
          slice3
Out[248]:   TradingDate      Nasdaq    S&P 500
         1  2014/01/31       4103.88   1782.59
         2  2014/02/03       3996.96   1741.89

We can now merge slice1 and slice2 as follows:

In [257]: pd.merge(slice1,slice2)
Out[257]:   TradingDate  Nasdaq	S&P 500  Russell 2000  DJIA
         0  2014/01/30   4123.13  1794.19   1139.36     15848.61
         1  2014/01/31     4103.88  1782.59   1130.88     15698.85

As you can see, this results in a combination of the columns in slice1 and slice2. Since the on argument was not specified, the intersection of the columns in slice1 and slice2 was used which is TradingDate as the join column, and the rest of the columns from slice1 and slice2 were used to produce the output.

Note that in this case, passing a value for how has no effect on the result since the values of the TradingDate join key match for slice1 and slice2.

We now merge slice3 and slice2 specifying inner as the value of the how argument:

In [258]: pd.merge(slice3,slice2,how='inner')
Out[258]:   TradingDate  Nasdaq	     S&P 500  Russell 2000   DJIA
         0  2014/01/31   4103.88   1782.59    1130.88      15698.85

The slice3 argument has values 2014/01/31 and 2014/02/03 unique values for TradingDate, and slice2 has values 2014/01/30 and 2014/01/31 unique values for TradingDate.

The merge function uses the intersection of these values, which is 2014/01/31. This results in the single row result. Here, we specify outer as the value of the how argument:

In [269]: pd.merge(slice3,slice2,how='outer')
Out[269]:   TradingDate  Nasdaq     S&P 500  Russell 2000  DJIA
         0  2014/01/31  4103.88   1782.59   1130.88    15698.85
         1  2014/02/03  3996.96   1741.89   NaN          NaN
         2  2014/01/30      NaN   NaN   1139.36    15848.61

Specifying outer uses all the keys (union) from both DataFrames, which gives the three rows specified in the preceding output. Since not all the columns are present in the two DataFrames, the columns from the other DataFrame are NaN for each row in a DataFrame that is not part of the intersection.

Now, we specify how='left' as shown in the following command:

In [271]: pd.merge(slice3,slice2,how='left')
Out[271]:  TradingDate  Nasdaq   S&P 500  Russell 2000   DJIA
         0  2014/01/31   4103.88  1782.59  1130.88         15698.85
         1  2014/02/03  3996.96   1741.89  NaN         NaN

Here, we see that the keys from the left DataFrame slice3 are used for the output. For columns that are not available in slice3, that is Russell 2000 and DJIA, NaN are used for the row with TradingDate as 2014/02/03. This is equivalent to a SQL left outer join.

We specify how='right' in the following command:

In [270]: pd.merge(slice3,slice2,how='right')
Out[270]:   TradingDate  Nasdaq   S&P 500  Russell 2000  DJIA
         0  2014/01/31  4103.88   1782.59  1130.88  15698.85
         1  2014/01/30  NaN      NaN  1139.36  15848.61

This is the corollary to the how='left' keys from the right DataFrame slice2 that are used. Therefore, rows with TradingDate as 2014/01/31 and 2014/01/30 are in the result. For columns that are not in slice2Nasdaq and S&P 500NaN are used.

This is equivalent to a SQL right outer join. For a simple explanation of how SQL joins work, please refer to http://bit.ly/1yqR9vw.

The join function

The DataFrame.join function is used to combine two DataFrames that have different columns with nothing in common. Essentially, this does a longitudinal join of two DataFrames. Here is an example:

In [274]: slice_NASD_SP=USIndexDataDF.ix[:3,:3]
          slice_NASD_SP
Out[274]:   TradingDate  Nasdaq  S&P 500
         0  2014/01/30    4123.13  1794.19
         1  2014/01/31    4103.88  1782.59
         2  2014/02/03    3996.96  1741.89
         3  2014/02/04    4031.52  1755.20

In [275]: slice_Russ_DJIA=USIndexDataDF.ix[:3,3:]
          slice_Russ_DJIA
Out[275]:   Russell 2000   DJIA
      0    1139.36       15848.61
      1    1130.88       15698.85
      2    1094.58       15372.80
      3    1102.84       15445.24

Here, we call the join operator, as follows:

In [276]: slice_NASD_SP.join(slice_Russ_DJIA)
Out[276]: TradingDate  Nasdaq  S&P 500  Russell 2000   DJIA
        0  2014/01/30  4123.13  1794.19   1139.36    15848.61
        1  2014/01/31  4103.88  1782.59   1130.88    15698.85
        2  2014/02/03  3996.96  1741.89   1094.58    15372.80
        3  2014/02/04  4031.52  1755.20   1102.84    15445.24

In this case, we see that the result is a combination of the columns from the two Dataframes. Let us see what happens when we try to use join with two DataFrames that have a column in common:

In [272]: slice1.join(slice2)
------------------------------------------------------------
Exception                  Traceback (most recent call last)
...

Exception: columns overlap: Index([u'TradingDate'], dtype=object)

This results in an exception due to overlapping columns. You can find more information on using merge, concat, and join operations in the official documentation page at http://pandas.pydata.org/pandas-docs/stable/merging.html.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.149.228.138