Concatenation in pandas is the process of either adding rows to the end of an existing Series
or DataFrame
object or adding additional columns to a DataFrame
. In pandas, concatenation is performed via the pandas function pd.concat()
. The function will perform the operation on a specific axis and as we will see, will also perform any required set logic involved in aligning along that axis.
The general syntax to concatenate data is to pass a list of objects to pd.concat()
. The following performs a concatenation of two Series
objects:
In [2]: # two Series objects to concatenate s1 = pd.Series(np.arange(0, 3)) s2 = pd.Series(np.arange(5, 8)) s1 Out[2]: 0 0 1 1 2 2 dtype: int64 In [3]: s2 Out[3]: 0 5 1 6 2 7 dtype: int64 In [4]: # concatenate them pd.concat([s1, s2]) Out[4]: 0 0 1 1 2 2 0 5 1 6 2 7 dtype: int64
Two DataFrame
objects can also be similarly concatenated.
In [5]: # create two DataFrame objects to concatenate # using the same index labels and column names, # but different values df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c']) #df2 has 9 .. 18 df2 = pd.DataFrame(np.arange(9, 18).reshape(3, 3), columns=['a', 'b', 'c']) df1 Out[5]: a b c 0 0 1 2 1 3 4 5 2 6 7 8 In [6]: df2 Out[6]: a b c 0 9 10 11 1 12 13 14 2 15 16 17 In [7]: # do the concat pd.concat([df1, df2]) Out[7]: a b c 0 0 1 2 1 3 4 5 2 6 7 8 0 9 10 11 1 12 13 14 2 15 16 17
The process of concatenating the two DataFrame
objects will first identify the set of columns formed by aligning the labels in the columns, effectively determining the union of the column names. The resulting DataFrame
object will then consist of that set of columns, and columns with identical names will not be duplicated.
Rows will be then be added to the result, in the order of the each of the objects passed to pd.concat().
If a column in the result does not exist in the object being copied, NaN
values will be filled in those locations. Duplicate row index labels can occur.
The following demonstrates the alignment of two DataFrame
objects during concatenation that both have columns in common (a
and c
) and also have distinct columns (b
in df1
, and d
in df2
):
In [8]: # demonstrate concatenating two DataFrame objects with # different columns df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c']) df2 = pd.DataFrame(np.arange(9, 18).reshape(3, 3), columns=['a', 'c', 'd']) df1 Out[8]: a b c 0 0 1 2 1 3 4 5 2 6 7 8 In [9]: df2 Out[9]: a c d 0 9 10 11 1 12 13 14 2 15 16 17 In [10]: # do the concat, NaN values will be filled in for # the d column for df1 and b column for df2 pd.concat([df1, df2]) Out[10]: a b c d 0 0 1 2 NaN 1 3 4 5 NaN 2 6 7 8 NaN 0 9 NaN 10 11 1 12 NaN 13 14 2 15 NaN 16 17
It is possible to give each group of data in the result its own name using the keys
parameter. This creates a hierarchical index on the DataFrame
object that lets you refer to each group of data independently via the DataFrame
objects' .ix
property. This is convenient if you later need to determine where data in the concatenated DataFrame
object came from.
The following sample demonstrates this concept by assigning names to each original DataFrame
object and then retrieving the rows that originated in the df2
object, which are keyed with the label 'df2'
. The following code demonstrates this labeling and also retrieves just the rows that originated in df2
:
In [11]: # concat the two objects, but create an index using the # given keys c = pd.concat([df1, df2], keys=['df1', 'df2']) # note the labeling of the rows in the output c Out[11]: a b c d df1 0 0 1 2 NaN 1 3 4 5 NaN 2 6 7 8 NaN df2 0 9 NaN 10 11 1 12 NaN 13 14 2 15 NaN 16 17 In [12]: # we can extract the data originating from # the first or second source DataFrame c.ix['df2'] Out[12]: a b c d 0 9 NaN 10 11 1 12 NaN 13 14 2 15 NaN 16 17
The pd.concat()
function also allows you to specify the axis on which to apply the concatenation. The following concatenates the two DataFrame
objects along the columns axis:
In [13]: # concat df1 and df2 along columns # aligns on row labels, has duplicate columns pd.concat([df1, df2], axis=1) Out[13]: a b c a c d 0 0 1 2 9 10 11 1 3 4 5 12 13 14 2 6 7 8 15 16 17
Note that the result now contains duplicate columns. The concatenation first aligns by the row index labels of each DataFrame
object, and then fills in the columns from the first DataFrame
object and then the second. The columns are not aligned and result in duplicate values.
The same rules of alignment and filling of NaN
values apply in this case, except that they are applied to the rows' index labels. The following demonstrates a concatenation along the columns axis with two DataFrame
objects that have row index labels in common (2
and 3
) along with disjoint rows (0
in df1
and 4
in df3
). Additionally, some of the columns in df3
overlap with df1 (a
) as well as being disjoint (d
):
In [14]: # a new DataFrame to merge with df1 # this has two common row labels (2, 3) # common columns (a) and one disjoint column # in each (b in df1 and d in df2) df3 = pd.DataFrame(np.arange(20, 26).reshape(3, 2), columns=['a', 'd'], index=[2, 3, 4]) df3 Out[14]: a d 2 20 21 3 22 23 4 24 25 In [15]: # concat them. Alignment is along row labels # columns first from df1 and then df3, with duplicates. # NaN filled in where those columns do not exist in the source pd.concat([df1, df3], axis=1) Out[15]: a b c a d 0 0 1 2 NaN NaN 1 3 4 5 NaN NaN 2 6 7 8 20 21 3 NaN NaN NaN 22 23 4 NaN NaN NaN 24 25
A concatenation of two or more DataFrame
objects actually performs an outer join operation along the index labels on the axis opposite to the one specified. This makes the result of the concatenation similar to having performed a union of those index labels, and then data is filled based on the alignment of those labels to the source objects.
The type of join can be changed to an inner join and can be performed by specifying join='inner'
as the parameter. The inner join then logically performs an intersection instead of a union. The following demonstrates this and results in a single row because 2
is the only row index label in common:
In [16]: # do an inner join instead of outer # results in one row pd.concat([df1, df3], axis=1, join='inner') Out[16]: a b c a d 2 6 7 8 20 21
It is also possible to use label groups of data along the columns using the keys
parameter when applying the concatenation along axis=1
.
In [17]: # add keys to the columns df = pd.concat([df1, df2], axis=1, keys=['df1', 'df2']) df Out[17]: df1 df2 a b c a c d 0 0 1 2 9 10 11 1 3 4 5 12 13 14 2 6 7 8 15 16 17
The different groups can be accessed using the .ix
process and slicing:
In [18]: # retrieve the data that originated from the # DataFrame with key 'df2' df.ix[:, 'df2'] Out[18]: a c d 0 9 10 11 1 12 13 14 2 15 16 17
A DataFrame
(and Series
) object also contains an .append()
method, which will concatenate the two specified DataFrame
objects along the row index labels.
In [19]: # append does a concatenate along axis=0 # duplicate row index labels can result df1.append(df2) Out[19]: a b c d 0 0 1 2 NaN 1 3 4 5 NaN 2 6 7 8 NaN 0 9 NaN 10 11 1 12 NaN 13 14 2 15 NaN 16 17
As with a concatenation on axis=1
, the index labels in the rows are copied without consideration of the creation of duplicates, and the columns labels are joined in a manner which ensures no duplicate column name is included in the result. If you would like to ensure that the resulting index does not have duplicates but preserves all of the rows, you can use the ignore_index=True
parameter. This essentially returns the same result except with new Int64Index
:
In [20]: # remove duplicates in the result index by ignoring the # index labels in the source DataFrame objects df1.append(df2, ignore_index=True) Out[20]: a b c d 0 0 1 2 NaN 1 3 4 5 NaN 2 6 7 8 NaN 3 9 NaN 10 11 4 12 NaN 13 14 5 15 NaN 16 17
18.226.172.200