The structure and content of a DataFrame
can be mutated in several ways. Rows and columns can be added and removed, and data within either can be modified to take on new values. Additionally, columns, as well as index labels, can also be renamed. Each of these will be described in the following sections.
A column can be renamed using the .rename()
method of the DataFrame
. The Book Value
column is inconvenient since it has a space, so we will rename it to BookValue
:
In [48]: # rename the Book Value column to not have a space # this returns a copy with the column renamed df = sp500.rename(columns= {'Book Value': 'BookValue'}) # print first 2 rows df[:2] Out[48]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573
This has returned a new DataFrame
object with the renamed column and data copied from the original DataFrame
. We can verify that the original DataFrame
did not have its column names modified:
In [49]: # verify the columns in the original did not change sp500.columns Out[49]: Index([u'Sector', u'Price', u'Book Value'], dtype='object')
To modify the DataFrame
without making a copy, we can use the inplace=True
parameter to .rename()
:
In [50]: # this changes the column in-place sp500.rename(columns= {'Book Value': 'BookValue'}, inplace=True) # we can see the column is changed sp500.columns Out[50]: Index([u'Sector', u'Price', u'BookValue'], dtype='object')
A convenient side effect of this, is that the DataFrame
now has a .BookValue
attribute as before renaming the column, the space prevented attribute-based access of the column:
In [51]: # and now we can use .BookValue sp500.BookValue[:5] Out[51]: Symbol MMM 26.668 ABT 15.573 ABBV 2.954 ACN 8.326 ACE 86.897 Name: BookValue, dtype: float64
Columns can be added to a DataFrame
using several methods. The simplest way is by merging a new Series
into the DataFrame
object, along the index using the []
operator assigning the Series
to a new column, with a name not already in the .columns
index. Note that this will modify the DataFrame
in-place and not result in a copy.
Alignment of data is important to understanding this process, as pandas does not simply concatenate the Series
to the DataFrame
. pandas will first align the data in the DataFrame
with the Series
using the index from both objects, and fill in the data from the Series
into the new DataFrame
at the appropriate index labels.
To demonstrate this, we will add a purely demonstrative column called TwicePrice
which adds a new column with a calculated value of 2.0 *
the Price
column. Since this modifies the DataFrame
object in-place, we will also make a copy and then add the column to the copy, so as to leave the original unmodified:
In [52]: # make a copy copy = sp500.copy() # add a new column to the copy copy['TwicePrice'] = sp500.Price * 2 copy[:2] Out[52]: Sector Price BookValue TwicePrice Symbol MMM Industrials 141.14 26.668 282.28 ABT Health Care 39.60 15.573 79.20
This process is actually selecting the Price
column out of the sp500
object, then creating another Series
with each value of the Price
multiplied by two. The DataFrame
then aligns this new Series
by label, copies the data at the appropriate labels, and adds the column at the end of the columns index.
If you want to add the column at a different location in the DataFrame
object, instead of at the rightmost position, use the .insert()
method of the DataFrame
. The following code inserts the TwicePrice
column between Price
and BookValue
:
In [53]: copy = sp500.copy() # insert sp500.Price * 2 as the # second column in the DataFrame copy.insert(1, 'TwicePrice', sp500.Price * 2) copy[:2] Out[53]: Sector TwicePrice Price BookValue Symbol MMM Industrials 282.28 141.14 26.668 ABT Health Care 79.20 39.60 15.573
It is important to remember that this is not simply inserting a column into the DataFrame
. The alignment process used here is performing a left join of the DataFrame
and the Series
by their index labels, and then creating the column and populating the data in the appropriate cells in the DataFrame
from matching entries in the Series
. If an index label in the DataFrame
is not matched in the Series
, the value used will be NaN
. Items in the Series
that do not have a matching label will be ignored.
The following example demonstrates this operation:
In [54]: # extract the first four rows and just the Price column rcopy = sp500[0:3][['Price']].copy() rcopy Out[54]: Price Symbol MMM 141.14 ABT 39.60 ABBV 53.95 In [55]: # create a new Series to merge as a column # one label exists in rcopy (MSFT), and MMM does not s = pd.Series( {'MMM': 'Is in the DataFrame', 'MSFT': 'Not in the DataFrame'} ) s Out[55]: MMM Is in the DataFrame MSFT Not in the DataFrame dtype: object In [56]: # add rcopy into a column named 'Comment' rcopy['Comment'] = s rcopy Out[56]: Price Comment Symbol MMM 141.14 Is in the DataFrame ABT 39.60 NaN ABBV 53.95 NaN
The labels for ABT
and ABBV
were not found in rcopy
and therefore, the values in the result are NaN
. MMM
is the only value in both, so the value from rcopy
is put in the result.
In general, assignment of a Series
to a column using the []
operator will either create a new column if the column does not already exist, or replace the contents of a column if it already exists. To demonstrate replacement, the following code replaces the Price
column with the result of the multiplication, instead of creating a new column:
In [57]: copy = sp500.copy() # replace the Price column data with the new values # instead of adding a new column copy.Price = sp500.Price * 2 copy[:5] Out[57]: Sector Price BookValue Symbol MMM Industrials 282.28 26.668 ABT Health Care 79.20 15.573 ABBV Health Care 107.90 2.954 ACN Information Technology 159.58 8.326 ACE Financials 205.82 86.897
To emphasize that this is also doing an alignment, we can change the sample slightly. The following code only utilizes the prices from three of the first four rows. This will force the result to not align values for 497 of the symbols, resulting in NaN
values:
In [58]: # copy all 500 rows copy = sp500.copy() # this just copies the first 2 rows of prices prices = sp500.iloc[[3, 1, 0]].Price.copy() # examine the extracted prices prices Out[58]: Symbol ACN 79.79 ABT 39.60 MMM 141.14 Name: Price, dtype: float64 In [59]: # now replace the Prices column with prices copy.Price = prices # it's not really simple insertion, it is alignment # values are put in the correct place according to labels copy Out[59]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care NaN 2.954 ACN Information Technology 79.79 8.326 ACE Financials NaN 86.897 ... ... ... ... YHOO Information Technology NaN 12.768 YUM Consumer Discretionary NaN 5.147 ZMH Health Care NaN 37.181 ZION Financials NaN 30.191 ZTS Health Care NaN 2.150 [500 rows x 3 columns]
Columns can be deleted from a DataFrame
by using the del
keyword, the pop(column)
method of the DataFrame,
or by calling the drop()
method of the DataFrame
.
The behavior of each of these differs slightly:
del
will simply delete the Series
from the DataFrame
(in-place)pop()
will both delete the Series
and return the Series
as a result (also in-place)drop(labels, axis=1)
will return a new DataFrame
with the column(s) removed (the original DataFrame
object is not modified)The following code demonstrates using del
to delete the BookValue
column from a copy of the sp500
data:
In [60]: # Example of using del to delete a column # make a copy of a subset of the data frame copy = sp500[:2].copy() copy Out[60]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 In [61]: # delete the BookValue column # deletion is in-place del copy['BookValue'] copy Out[61]: Sector Price Symbol MMM Industrials 141.14 ABT Health Care 39.60
The following code demonstrates using the .pop()
method to remove a column:
In [62]: # Example of using pop to remove a column from a DataFrame # first make a copy of a subset of the data frame # pop works in-place copy = sp500[:2].copy() # this will remove Sector and return it as a series popped = copy.pop('Sector') # Sector column removed in-place copy Out[62]: Price BookValue Symbol MMM 141.14 26.668 ABT 39.60 15.573 In [63]: # and we have the Sector column as the result of the pop popped Out[63]: Symbol MMM Industrials ABT Health Care Name: Sector, dtype: object
The .drop()
method can be used to remove both rows and columns. To use it to remove a column, specify axis=1
:
In [64]: # Example of using drop to remove a column # make a copy of a subset of the DataFrame copy = sp500[:2].copy() # this will return a new DataFrame with 'Sector' removed # the copy DataFrame is not modified afterdrop = copy.drop(['Sector'], axis = 1) afterdrop Out[64]: Price BookValue Symbol MMM 141.14 26.668 ABT 39.60 15.573
Rows can be added to a DataFrame
object via several different operations:
DataFrame
to anotherDataFrame
objectsAppending is performed using the .append()
method of the DataFrame
. The process of appending returns a new DataFrame
with the data from the original DataFrame
added first, and the rows from the second. Appending does not perform alignment and can result in duplicate index values.
The following code demonstrates appending two DataFrame
objects extracted from the sp500
data. The first DataFrame
consists of rows 0
, 1
and 2
, and the second consists of rows 10
, 11
and 2
. Row 2
(with label ABBV
) is included in both to demonstrate creation of duplicate index labels. The code is as follows:
In [65]: # copy the first three rows of sp500 df1 = sp500.iloc[0:3].copy() # copy 10th and 11th rows df2 = sp500.iloc[[10, 11, 2]] # append df1 and df2 appended = df1.append(df2) # the result is the rows of the first followed by # those of the second appended Out[65]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954 A Health Care 56.18 16.928 GAS Utilities 52.98 32.462 ABBV Health Care 53.95 2.954
The set of columns of the DataFrame
objects being appended do not need to be the same. The resulting DataFrame
will consist of the union of the columns in both and where either did not have a column, NaN
will be used as the value. The following code demonstrates this by creating a third DataFrame
using the same index as df1
, but having a single column with a unique column name:
In [66]: # DataFrame using df1.index and just a PER column # also a good example of using a scalar value # to initialize multiple rows df3 = pd.DataFrame(0.0, index=df1.index, columns=['PER']) df3 Out[66]: PER Symbol MMM 0 ABT 0 ABBV 0 In [67]: # append df1 and df3 # each has three rows, so 6 rows is the result # df1 had no PER column, so NaN for those rows # df3 had no BookValue, Price or Sector, so NaN values df1.append(df3) Out[67]: BookValue PER Price Sector Symbol MMM 26.668 NaN 141.14 Industrials ABT 15.573 NaN 39.60 Health Care ABBV 2.954 NaN 53.95 Health Care MMM NaN 0 NaN NaN ABT NaN 0 NaN NaN ABBV NaN 0 NaN NaN
To append without forcing the index to be taken from either DataFrame
, you can use the ignore_index=True
parameter. This is useful when the index values are not of significant meaning, and you just want concatenated data with sequentially increasing integers as indexes:
In [68]: # ignore index labels, create default index df1.append(df3, ignore_index=True) Out[68]: BookValue PER Price Sector 0 26.668 NaN 141.14 Industrials 1 15.573 NaN 39.60 Health Care 2 2.954 NaN 53.95 Health Care 3 NaN 0 NaN NaN 4 NaN 0 NaN NaN 5 NaN 0 NaN NaN
A DataFrame can be concatenated to another using the pd.concat()
function. This function functions similarly to the .append()
method, but also adds the ability to specify an axis (appending can be row or column based), as well as being able to perform several join operations between the objects. Also, the function takes a list of pandas objects to concatenate, so you can concatenate more than two objects in a single call.
The default operation of pd.concat()
on two DataFrame
objects operates in the same way as the .append()
method. This can be demonstrated by reconstructing the two datasets from the earlier example and concatenating them. This is shown in the following example:
In [69]: # copy the first three rows of sp500 df1 = sp500.iloc[0:3].copy() # copy 10th and 11th rows df2 = sp500.iloc[[10, 11, 2]] # pass them as a list pd.concat([df1, df2]) Out[69]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954 A Health Care 56.18 16.928 GAS Utilities 52.98 32.462 ABBV Health Care 53.95 2.954
Actually, pandas calculates the sorted union of distinct column names across all supplied objects and uses those as the columns, and then appends data along the rows for each object in the order given in the list.
A slight variant of this example adds an additional column to one of the DataFrame
objects and then performs the concatenation:
In [70]: # copy df2 df2_2 = df2.copy() # add a column to df2_2 that is not in df1 df2_2.insert(3, 'Foo', pd.Series(0, index=df2.index)) # see what it looks like df2_2 Out[70]: Sector Price BookValue Foo Symbol A Health Care 56.18 16.928 0 GAS Utilities 52.98 32.462 0 ABBV Health Care 53.95 2.954 0 In [71]: # now concatenate pd.concat([df1, df2_2]) Out[71]: BookValue Foo Price Sector Symbol MMM 26.668 NaN 141.14 Industrials ABT 15.573 NaN 39.60 Health Care ABBV 2.954 NaN 53.95 Health Care A 16.928 0 56.18 Health Care GAS 32.462 0 52.98 Utilities ABBV 2.954 0 53.95 Health Care
Duplicate index labels still result, as the rows are copied verbatim from the source objects. However, note the NaN
values in the rows originating from df1
, since it does not have a Foo
column.
Using the keys
parameter, it is possible to differentiate the pandas objects from which the rows originated. The following code adds a level to the index which represents the source object:
In [72]: # specify keys r = pd.concat([df1, df2_2], keys=['df1', 'df2']) r Out[72]: BookValue Foo Price Sector Symbol df1 MMM 26.668 NaN 141.14 Industrials ABT 15.573 NaN 39.60 Health Care ABBV 2.954 NaN 53.95 Health Care df2 A 16.928 0 56.18 Health Care GAS 32.462 0 52.98 Utilities ABBV 2.954 0 53.95 Health Care
We can change the axis of the concatenation to work along the columns by specifying axis=1
, which will calculate the sorted union of the distinct index labels from the rows and then append columns and their data from the specified objects.
To demonstrate, the following splits the sp500 data into two DataFrame
objects, each with a different set of columns, and then concatenates along axis=1
:
In [73]: # first three rows, columns 0 and 1 df3 = sp500[:3][[0, 1]] df3 Out[73]: Sector Price Symbol MMM Industrials 141.14 ABT Health Care 39.60 ABBV Health Care 53.95 In [74]: # first three rows, column 2 df4 = sp500[:3][[2]] df4 Out[74]: BookValue Symbol MMM 26.668 ABT 15.573 ABBV 2.954 In [75]: # put them back together pd.concat([df3, df4], axis=1) Out[75]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954
We can further examine this operation by adding a column to the second DataFrame
that has a duplicate name to a column in the first. The result will have duplicate columns, as the columns are blindly appended without regard to already existing columns:
In [76]: # make a copy of df4 df4_2 = df4.copy() # add a column to df4_2, that is also in df3 df4_2.insert(1, 'Sector', pd.Series(1, index=df4_2.index)) df4_2 Out[76]: BookValue Sector Symbol MMM 26.668 1 ABT 15.573 1 ABBV 2.954 1 In [77]: # demonstrate duplicate columns pd.concat([df3, df4_2], axis=1) Out[77]: Sector Price BookValue Sector Symbol MMM Industrials 141.14 26.668 1 ABT Health Care 39.60 15.573 1 ABBV Health Care 53.95 2.954 1
To be very specific, pandas is performing an outer join along the labels of the specified axis. An inner join can be specified using the join='inner'
parameter, which changes the operation from being a sorted union of distinct labels to the distinct values of the intersection of the labels. To demonstrate, the following selects two subsets of the financial data with one row in common and performs an inner join:
In [78]: # first three rows and first two columns df5 = sp500[:3][[0, 1]] df5 Out[78]: Sector Price Symbol MMM Industrials 141.14 ABT Health Care 39.60 ABBV Health Care 53.95 In [79]: # row 2 through 4 and first two columns df6 = sp500[2:5][[0,1]] df6 Out[79]: Sector Price Symbol ABBV Health Care 53.95 ACN Information Technology 79.79 ACE Financials 102.91 In [80]: # inner join on index labels will return in only one row pd.concat([df5, df6], join='inner', axis=1) Out[80]: Sector Price Sector Price Symbol ABBV Health Care 53.95 Health Care 53.95
Rows can also be added to a DataFrame
through the .loc
property. This technique is referred to as setting with enlargement. The parameter for .loc
specifies the index label where the row is to be placed. If the label does not exist, the values are appended to the DataFrame
using the given index label. If it does exist, then the values in the specified row are replaced.
The following example takes a subset of sp500
and adds a row with the label FOO
:
In [81]: # get a small subset of the sp500 # make sure to copy the slice to make a copy ss = sp500[:3].copy() # create a new row with index label FOO # and assign some values to the columns via a list ss.loc['FOO'] = ['the sector', 100, 110] ss Out[81]: Sector Price BookValue MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954 FOO the sector 100.00 110.000
Note that the change is made in place. If FOO
already exists as an index label, then the column data would be replaced. This is one of the means of updating data in a DataFrame
in-place, as .loc
not only retrieves row(s), but also lets you modify the results that are returned.
It is also possible to add columns in this manner. The following code demonstrates by adding a new column to a subset of sp500
using .loc
. Note that to accomplish this, we use the colon in the rows' position to select all rows to be included to add the new column and value:
In [82]: # copy of subset / slice ss = sp500[:3].copy() # add the new column initialized to 0 ss.loc[:,'PER'] = 0 # take a look at the results ss Out[82]: Sector Price BookValue PER Symbol MMM Industrials 141.14 26.668 0 ABT Health Care 39.60 15.573 0 ABBV Health Care 53.95 2.954 0
Removing rows from a DataFrame
object is normally performed using one of three techniques:
.drop()
methodTechnically, only the .drop()
method removes rows in-place on the source object. The other techniques either create a copy without specific rows, or a view into the rows that are not to be dropped. Details of each are given in the following sections.
To remove rows from a DataFrame
by the index label, you can use the .drop()
method of the DataFrame
. The .drop()
method takes a list of index labels and will return a copy of the DataFrame
with the rows for the specified labels removed. The source DataFrame
remains unmodified. The code is as follows:
In [83]: # get a copy of the first 5 rows of sp500 ss = sp500[:5].copy() ss Out[83]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954 ACN Information Technology 79.79 8.326 ACE Financials 102.91 86.897 In [84]: # drop rows with labels ABT and ACN afterdrop = ss.drop(['ABT', 'ACN']) afterdrop Out[84]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABBV Health Care 53.95 2.954 ACE Financials 102.91 86.897 In [85]: # note that ss is not modified ss Out[85]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954 ACN Information Technology 79.79 8.326 ACE Financials 102.91 86.897
Boolean selection can be used to remove rows from a DataFrame
by creating a new DataFrame
without the desired rows. Suppose we want to remove rows where Price
is greater than 300
. The process to do this, is to first determine which rows match that criteria, and then to select the rows that do not. The following code selects those rows and lets us know how many of them there are:
In [86]: # determine the rows where Price > 300 selection = sp500.Price > 300 # to make the output shorter, report the # of rows returned (500), # and the sum of those where Price > 300 (which is 10) "{0} {1}".format(len(selection), selection.sum()) Out[86]: '500 10'
We now know both the rows that match this criteria (the 10
with True
values) and those that do not (the other 490). To remove the rows now, select out the complement of the previous result. This gives us a new DataFrame
containing only the rows where we had a False
value from the previous selection:
In [87]: # select the complement withPriceLessThan300 = sp500[~selection] withPriceLessThan300 Out[87]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954 ACN Information Technology 79.79 8.326 ACE Financials 102.91 86.897 ... ... ... ... YHOO Information Technology 35.02 12.768 YUM Consumer Discretionary 74.77 5.147 ZMH Health Care 101.84 37.181 ZION Financials 28.43 30.191 ZTS Health Care 30.53 2.150 [490 rows x 3 columns]
Slicing is also often used to remove records from a DataFrame
. It is a process similar to Boolean selection, where we select out all of the rows, except for the ones you want deleted.
Suppose we want to remove all but the first three records from sp500
. The slice to perform this task is [:3]
:
In [88]: # get only the first three rows onlyFirstThree = sp500[:3] onlyFirstThree Out[88]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954
Remember, that this result is a slice. Therefore, it is a view into the DataFrame
. Data has not been removed from the sp500
object. Changes to these three rows will change the data in sp500
. To prevent this from occurring, the proper action is to make a copy of the slice, as follows:
In [89]: # first three, but a copy of them onlyFirstThree = sp500[:3].copy() onlyFirstThree Out[89]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954
Scalar values in a DataFrame
can be changed by assignment of the new value to the result of the value lookup using the .ix
, .iloc
and .loc
attributes. These three attributes can all be passed both a row and column selectors, and the result can be assigned a new value that will be made in the original DataFrame
.
.ix
can be used by passing either the index label of the row and the name of the column, or the integer location of the row and column. The following code makes a copy of the first three rows of sp500
and then demonstrates changing the Price
on the MMM
and ABBV
securities:
In [90]: # get a subset / copy of the data subset = sp500[:3].copy() subset Out[90]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954 In [91]: # change scalar by label on row and column subset.ix['MMM', 'Price'] = 0 subset Out[91]: Sector Price BookValue Symbol MMM Industrials 0.00 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954
Using .ix
is generally frowned upon when using production code, as it has performance issues over using the .loc
and .iloc
attributes, which have a higher performance, with .iloc
being the highest performance of those two.
The following code replicates the .ix
example, but uses .loc
with the index and column labels and uses different values to differentiate the result slightly:
In [92]: subset = sp500[:3].copy() subset.loc['MMM', 'Price'] = 10 subset.loc['ABBV', 'Price'] = 20 subset Out[92]: Sector Price BookValue Symbol MMM Industrials 10.0 26.668 ABT Health Care 39.6 15.573 ABBV Health Care 20.0 2.954
.loc
may suffer from lower performance, as compared to .iloc
, due to the possibility of needing to map the label values into locations. The following example gets the location of the specific row and column that is desired to be changed and then uses .iloc
to execute the change (the examples only change one price for brevity):
In [93]: # subset of the first three rows subset = sp500[:3].copy() # get the location of the Price column price_loc = sp500.columns.get_loc('Price') # get the location of the MMM row abt_row_loc = sp500.index.get_loc('ABT') # change the price subset.iloc[abt_row_loc, price_loc] = 1000 subset Out[93]: Sector Price BookValue Symbol MMM Industrials 141.14 26.668 ABT Health Care 1000.00 15.573 ABBV Health Care 53.95 2.954
This may be look like overkill for this small example. But if this is where code is being executed frequently, such as in a loop or in response to market changes, looking up the locations once and always using .loc
with those values, will give significant performance gains over the other options.
3.147.79.84