Concatenating data

Concatenation in pandas is the process of either adding rows to the end of an existing Series or DataFrame object or adding additional columns to a DataFrame. In pandas, concatenation is performed via the pandas function pd.concat(). The function will perform the operation on a specific axis and as we will see, will also perform any required set logic involved in aligning along that axis.

The general syntax to concatenate data is to pass a list of objects to pd.concat(). The following performs a concatenation of two Series objects:

In [2]:
   # two Series objects to concatenate
   s1 = pd.Series(np.arange(0, 3))
   s2 = pd.Series(np.arange(5, 8))
   s1

Out[2]:
   0    0
   1    1
   2    2
   dtype: int64

In [3]:
   s2

Out[3]:
   0    5
   1    6
   2    7
   dtype: int64

In [4]:
   # concatenate them
   pd.concat([s1, s2])

Out[4]:
   0    0
   1    1
   2    2
   0    5
   1    6
   2    7
   dtype: int64

Two DataFrame objects can also be similarly concatenated.

In [5]:
   # create two DataFrame objects to concatenate
   # using the same index labels and column names, 
   # but different values
   df1 = pd.DataFrame(np.arange(9).reshape(3, 3), 
                      columns=['a', 'b', 'c'])
   #df2 has 9 .. 18
   df2 = pd.DataFrame(np.arange(9, 18).reshape(3, 3), 
                      columns=['a', 'b', 'c'])
   df1

Out[5]:
      a  b  c
   0  0  1  2
   1  3  4  5
   2  6  7  8

In [6]:
   df2

Out[6]:
       a   b   c
   0   9  10  11
   1  12  13  14
   2  15  16  17

In [7]:
   # do the concat
   pd.concat([df1, df2])

Out[7]:
       a   b   c
   0   0   1   2
   1   3   4   5
   2   6   7   8
   0   9  10  11
   1  12  13  14
   2  15  16  17

The process of concatenating the two DataFrame objects will first identify the set of columns formed by aligning the labels in the columns, effectively determining the union of the column names. The resulting DataFrame object will then consist of that set of columns, and columns with identical names will not be duplicated.

Rows will be then be added to the result, in the order of the each of the objects passed to pd.concat(). If a column in the result does not exist in the object being copied, NaN values will be filled in those locations. Duplicate row index labels can occur.

The following demonstrates the alignment of two DataFrame objects during concatenation that both have columns in common (a and c) and also have distinct columns (b in df1, and d in df2):

In [8]:
   # demonstrate concatenating two DataFrame objects with
   # different columns
   df1 = pd.DataFrame(np.arange(9).reshape(3, 3), 
                      columns=['a', 'b', 'c'])
   df2 = pd.DataFrame(np.arange(9, 18).reshape(3, 3), 
                      columns=['a', 'c', 'd'])
   df1

Out[8]:
      a  b  c
   0  0  1  2
   1  3  4  5
   2  6  7  8

In [9]:
   df2

Out[9]:
       a   c   d
   0   9  10  11
   1  12  13  14
   2  15  16  17

In [10]:
   # do the concat, NaN values will be filled in for
   # the d column for df1 and b column for df2
   pd.concat([df1, df2])

Out[10]:
       a   b   c   d
   0   0   1   2 NaN
   1   3   4   5 NaN
   2   6   7   8 NaN
   0   9 NaN  10  11
   1  12 NaN  13  14
   2  15 NaN  16  17

It is possible to give each group of data in the result its own name using the keys parameter. This creates a hierarchical index on the DataFrame object that lets you refer to each group of data independently via the DataFrame objects' .ix property. This is convenient if you later need to determine where data in the concatenated DataFrame object came from.

The following sample demonstrates this concept by assigning names to each original DataFrame object and then retrieving the rows that originated in the df2 object, which are keyed with the label 'df2'. The following code demonstrates this labeling and also retrieves just the rows that originated in df2:

In [11]:
   # concat the two objects, but create an index using the
   # given keys 
   c = pd.concat([df1, df2], keys=['df1', 'df2'])
   # note the labeling of the rows in the output
   c

Out[11]:
           a   b   c   d
   df1 0   0   1   2 NaN
       1   3   4   5 NaN
       2   6   7   8 NaN
   df2 0   9 NaN  10  11
       1  12 NaN  13  14
       2  15 NaN  16  17

In [12]:
   # we can extract the data originating from
   # the first or second source DataFrame
   c.ix['df2']

Out[12]:
       a   b   c   d
   0   9 NaN  10  11
   1  12 NaN  13  14
   2  15 NaN  16  17

The pd.concat() function also allows you to specify the axis on which to apply the concatenation. The following concatenates the two DataFrame objects along the columns axis:

In [13]:
   # concat df1 and df2 along columns
   # aligns on row labels, has duplicate columns
   pd.concat([df1, df2], axis=1)

Out[13]:
      a  b  c   a   c   d
   0  0  1  2   9  10  11
   1  3  4  5  12  13  14
   2  6  7  8  15  16  17

Note that the result now contains duplicate columns. The concatenation first aligns by the row index labels of each DataFrame object, and then fills in the columns from the first DataFrame object and then the second. The columns are not aligned and result in duplicate values.

The same rules of alignment and filling of NaN values apply in this case, except that they are applied to the rows' index labels. The following demonstrates a concatenation along the columns axis with two DataFrame objects that have row index labels in common (2 and 3) along with disjoint rows (0 in df1 and 4 in df3). Additionally, some of the columns in df3 overlap with df1 (a) as well as being disjoint (d):

In [14]:
   # a new DataFrame to merge with df1
   # this has two common row labels (2, 3) 
   # common columns (a) and one disjoint column
   # in each (b in df1 and d in df2)
   df3 = pd.DataFrame(np.arange(20, 26).reshape(3, 2), 
                      columns=['a', 'd'], 
                      index=[2, 3, 4])
   df3

Out[14]:
       a   d
   2  20  21
   3  22  23
   4  24  25

In [15]:
   # concat them. Alignment is along row labels
   # columns first from df1 and then df3, with duplicates.
   # NaN filled in where those columns do not exist in the source
   pd.concat([df1, df3], axis=1)

Out[15]:
       a   b   c   a   d
   0   0   1   2 NaN NaN
   1   3   4   5 NaN NaN
   2   6   7   8  20  21
   3 NaN NaN NaN  22  23
   4 NaN NaN NaN  24  25

A concatenation of two or more DataFrame objects actually performs an outer join operation along the index labels on the axis opposite to the one specified. This makes the result of the concatenation similar to having performed a union of those index labels, and then data is filled based on the alignment of those labels to the source objects.

The type of join can be changed to an inner join and can be performed by specifying join='inner' as the parameter. The inner join then logically performs an intersection instead of a union. The following demonstrates this and results in a single row because 2 is the only row index label in common:

In [16]:
   # do an inner join instead of outer
   # results in one row
   pd.concat([df1, df3], axis=1, join='inner')

Out[16]:
      a  b  c   a   d
   2  6  7  8  20  21

It is also possible to use label groups of data along the columns using the keys parameter when applying the concatenation along axis=1.

In [17]:
   # add keys to the columns
   df = pd.concat([df1, df2], 
                  axis=1,
                  keys=['df1', 'df2'])
   df

Out[17]:
     df1       df2        
       a  b  c   a   c   d
   0   0  1  2   9  10  11
   1   3  4  5  12  13  14
   2   6  7  8  15  16  17

The different groups can be accessed using the .ix process and slicing:

In [18]:
   # retrieve the data that originated from the 
   # DataFrame with key 'df2'
   df.ix[:, 'df2']

Out[18]:
      a   c   d
   0   9  10  11
   1  12  13  14
   2  15  16  17

A DataFrame (and Series) object also contains an .append() method, which will concatenate the two specified DataFrame objects along the row index labels.

In [19]:
   # append does a concatenate along axis=0 
   # duplicate row index labels can result
   df1.append(df2)

Out[19]:
       a   b   c   d
   0   0   1   2 NaN
   1   3   4   5 NaN
   2   6   7   8 NaN
   0   9 NaN  10  11
   1  12 NaN  13  14
   2  15 NaN  16  17

As with a concatenation on axis=1, the index labels in the rows are copied without consideration of the creation of duplicates, and the columns labels are joined in a manner which ensures no duplicate column name is included in the result. If you would like to ensure that the resulting index does not have duplicates but preserves all of the rows, you can use the ignore_index=True parameter. This essentially returns the same result except with new Int64Index:

In [20]:
   # remove duplicates in the result index by ignoring the 
   # index labels in the source DataFrame objects
   df1.append(df2, ignore_index=True)

Out[20]:
       a   b   c   d
   0   0   1   2 NaN
   1   3   4   5 NaN
   2   6   7   8 NaN
   3   9 NaN  10  11
   4  12 NaN  13  14
   5  15 NaN  16  17
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.172.200