Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Modifying the structure and content of DataFrame

The structure and content of a DataFrame can be mutated in several ways. Rows and columns can be added and removed, and data within either can be modified to take on new values. Additionally, columns, as well as index labels, can also be renamed. Each of these will be described in the following sections.

Renaming columns

A column can be renamed using the .rename() method of the DataFrame. The Book Value column is inconvenient since it has a space, so we will rename it to BookValue:

In [48]:
   # rename the Book Value column to not have a space
   # this returns a copy with the column renamed
   df = sp500.rename(columns=
                     {'Book Value': 'BookValue'})
   # print first 2 rows
   df[:2]

Out[48]:
                Sector   Price  BookValue
   Symbol                                
   MMM     Industrials  141.14     26.668
   ABT     Health Care   39.60     15.573

This has returned a new DataFrame object with the renamed column and data copied from the original DataFrame. We can verify that the original DataFrame did not have its column names modified:

In [49]:
   # verify the columns in the original did not change
   sp500.columns

Out[49]:
   Index([u'Sector', u'Price', u'Book Value'], dtype='object')

To modify the DataFrame without making a copy, we can use the inplace=True parameter to .rename():

In [50]:
   # this changes the column in-place
   sp500.rename(columns=                  
                {'Book Value': 'BookValue'},                   
                inplace=True)
   # we can see the column is changed
   sp500.columns

Out[50]:
   Index([u'Sector', u'Price', u'BookValue'], dtype='object')

A convenient side effect of this, is that the DataFrame now has a .BookValue attribute as before renaming the column, the space prevented attribute-based access of the column:

In [51]:
   # and now we can use .BookValue
   sp500.BookValue[:5]

Out[51]:
   Symbol
   MMM       26.668
   ABT       15.573
   ABBV       2.954
   ACN        8.326
   ACE       86.897
   Name: BookValue, dtype: float64

Adding and inserting columns

Columns can be added to a DataFrame using several methods. The simplest way is by merging a new Series into the DataFrame object, along the index using the [] operator assigning the Series to a new column, with a name not already in the .columns index. Note that this will modify the DataFrame in-place and not result in a copy.

Alignment of data is important to understanding this process, as pandas does not simply concatenate the Series to the DataFrame. pandas will first align the data in the DataFrame with the Series using the index from both objects, and fill in the data from the Series into the new DataFrame at the appropriate index labels.

To demonstrate this, we will add a purely demonstrative column called TwicePrice which adds a new column with a calculated value of 2.0 * the Price column. Since this modifies the DataFrame object in-place, we will also make a copy and then add the column to the copy, so as to leave the original unmodified:

In [52]:
   # make a copy
   copy = sp500.copy()
   # add a new column to the copy
   copy['TwicePrice'] = sp500.Price * 2
   copy[:2]

Out[52]:
                Sector   Price  BookValue  TwicePrice
   Symbol                                            
   MMM     Industrials  141.14     26.668      282.28
   ABT     Health Care   39.60     15.573       79.20

This process is actually selecting the Price column out of the sp500 object, then creating another Series with each value of the Price multiplied by two. The DataFrame then aligns this new Series by label, copies the data at the appropriate labels, and adds the column at the end of the columns index.

If you want to add the column at a different location in the DataFrame object, instead of at the rightmost position, use the .insert() method of the DataFrame. The following code inserts the TwicePrice column between Price and BookValue:

In [53]:
   copy = sp500.copy()
   # insert sp500.Price * 2 as the 
   # second column in the DataFrame
   copy.insert(1, 'TwicePrice', sp500.Price * 2)
   copy[:2]

Out[53]:
                Sector  TwicePrice   Price  BookValue
   Symbol                                            
   MMM     Industrials      282.28  141.14     26.668
   ABT     Health Care       79.20   39.60     15.573

It is important to remember that this is not simply inserting a column into the DataFrame. The alignment process used here is performing a left join of the DataFrame and the Series by their index labels, and then creating the column and populating the data in the appropriate cells in the DataFrame from matching entries in the Series. If an index label in the DataFrame is not matched in the Series, the value used will be NaN. Items in the Series that do not have a matching label will be ignored.

The following example demonstrates this operation:

In [54]:
   # extract the first four rows and just the Price column
   rcopy = sp500[0:3][['Price']].copy()
   rcopy

Out[54]:
            Price
   Symbol        
   MMM     141.14
   ABT      39.60
   ABBV     53.95

In [55]:
   # create a new Series to merge as a column
   # one label exists in rcopy (MSFT), and MMM does not
   s = pd.Series(
                 {'MMM': 'Is in the DataFrame', 
                  'MSFT': 'Not in the DataFrame'} )
   s

Out[55]:
   MMM      Is in the DataFrame
   MSFT    Not in the DataFrame
   dtype: object

In [56]:
   # add rcopy into a column named 'Comment'
   rcopy['Comment'] = s
   rcopy

Out[56]:
            Price              Comment
   Symbol                             
   MMM     141.14  Is in the DataFrame
   ABT      39.60                  NaN
   ABBV     53.95                  NaN

The labels for ABT and ABBV were not found in rcopy and therefore, the values in the result are NaN. MMM is the only value in both, so the value from rcopy is put in the result.

Replacing the contents of a column

In general, assignment of a Series to a column using the [] operator will either create a new column if the column does not already exist, or replace the contents of a column if it already exists. To demonstrate replacement, the following code replaces the Price column with the result of the multiplication, instead of creating a new column:

In [57]:
   copy = sp500.copy()
   # replace the Price column data with the new values
   # instead of adding a new column
   copy.Price = sp500.Price * 2
   copy[:5]

Out[57]:
                           Sector   Price  BookValue
   Symbol                                           
   MMM                Industrials  282.28     26.668
   ABT                Health Care   79.20     15.573
   ABBV               Health Care  107.90      2.954
   ACN     Information Technology  159.58      8.326
   ACE                 Financials  205.82     86.897

To emphasize that this is also doing an alignment, we can change the sample slightly. The following code only utilizes the prices from three of the first four rows. This will force the result to not align values for 497 of the symbols, resulting in NaN values:

In [58]:
   # copy all 500 rows
   copy = sp500.copy()
   # this just copies the first 2 rows of prices
   prices = sp500.iloc[[3, 1, 0]].Price.copy()
   # examine the extracted prices
   prices

Out[58]:
   Symbol
   ACN        79.79
   ABT        39.60
   MMM       141.14
   Name: Price, dtype: float64

In [59]:
   # now replace the Prices column with prices
   copy.Price = prices
   # it's not really simple insertion, it is alignment
   # values are put in the correct place according to labels
   copy

Out[59]:
                           Sector   Price  BookValue
   Symbol                                           
   MMM                Industrials  141.14     26.668
   ABT                Health Care   39.60     15.573
   ABBV               Health Care     NaN      2.954
   ACN     Information Technology   79.79      8.326
   ACE                 Financials     NaN     86.897
   ...                        ...     ...        ...
   YHOO    Information Technology     NaN     12.768
   YUM     Consumer Discretionary     NaN      5.147
   ZMH                Health Care     NaN     37.181
   ZION                Financials     NaN     30.191
   ZTS                Health Care     NaN      2.150

   [500 rows x 3 columns]

Deleting columns in a DataFrame

Columns can be deleted from a DataFrame by using the del keyword, the pop(column) method of the DataFrame, or by calling the drop() method of the DataFrame.

The behavior of each of these differs slightly:

del will simply delete the Series from the DataFrame (in-place)
pop() will both delete the Series and return the Series as a result (also in-place)
drop(labels, axis=1) will return a new DataFrame with the column(s) removed (the original DataFrame object is not modified)

The following code demonstrates using del to delete the BookValue column from a copy of the sp500 data:

In [60]:
   # Example of using del to delete a column
   # make a copy of a subset of the data frame
   copy = sp500[:2].copy()
   copy

   Out[60]:
                Sector   Price  BookValue
   Symbol                                
   MMM     Industrials  141.14     26.668
   ABT     Health Care   39.60     15.573

In [61]:
   # delete the BookValue column
   # deletion is in-place
   del copy['BookValue']
   copy

Out[61]:
               Sector   Price
   Symbol                     
   MMM     Industrials  141.14
   ABT     Health Care   39.60

The following code demonstrates using the .pop() method to remove a column:

In [62]:
   # Example of using pop to remove a column from a DataFrame
   # first make a copy of a subset of the data frame
   # pop works in-place
   copy = sp500[:2].copy()
   # this will remove Sector and return it as a series
   popped = copy.pop('Sector')
   # Sector column removed in-place
   copy

Out[62]:
            Price  BookValue
   Symbol                   
   MMM     141.14     26.668
   ABT      39.60     15.573

In [63]:
   # and we have the Sector column as the result of the pop
   popped

Out[63]:
   Symbol
   MMM       Industrials
   ABT       Health Care
   Name: Sector, dtype: object

The .drop() method can be used to remove both rows and columns. To use it to remove a column, specify axis=1:

In [64]:
   # Example of using drop to remove a column 
   # make a copy of a subset of the DataFrame
   copy = sp500[:2].copy()
   # this will return a new DataFrame with 'Sector' removed
   # the copy DataFrame is not modified
   afterdrop = copy.drop(['Sector'], axis = 1)
   afterdrop

Out[64]:
            Price  BookValue
   Symbol                   
   MMM     141.14     26.668
   ABT      39.60     15.573

Adding rows to a DataFrame

Rows can be added to a DataFrame object via several different operations:

Appending a DataFrame to another
Concatenation of two DataFrame objects
Setting with enlargement

Appending rows with .append()

Appending is performed using the .append() method of the DataFrame. The process of appending returns a new DataFrame with the data from the original DataFrame added first, and the rows from the second. Appending does not perform alignment and can result in duplicate index values.

The following code demonstrates appending two DataFrame objects extracted from the sp500 data. The first DataFrame consists of rows 0, 1 and 2, and the second consists of rows 10, 11 and 2. Row 2 (with label ABBV) is included in both to demonstrate creation of duplicate index labels. The code is as follows:

In [65]:
   # copy the first three rows of sp500
   df1 = sp500.iloc[0:3].copy()
   # copy 10th and 11th rows
   df2 = sp500.iloc[[10, 11, 2]]
   # append df1 and df2
   appended = df1.append(df2)
   # the result is the rows of the first followed by 
   # those of the second
   appended

Out[65]:
                Sector   Price  BookValue
   Symbol                                
   MMM     Industrials  141.14     26.668
   ABT     Health Care   39.60     15.573
   ABBV    Health Care   53.95      2.954
   A       Health Care   56.18     16.928
   GAS       Utilities   52.98     32.462
   ABBV    Health Care   53.95      2.954

The set of columns of the DataFrame objects being appended do not need to be the same. The resulting DataFrame will consist of the union of the columns in both and where either did not have a column, NaN will be used as the value. The following code demonstrates this by creating a third DataFrame using the same index as df1, but having a single column with a unique column name:

In [66]:
   # DataFrame using df1.index and just a PER column
   # also a good example of using a scalar value
   # to initialize multiple rows
   df3 = pd.DataFrame(0.0, 
                      index=df1.index,
                      columns=['PER'])
   df3

Out[66]:
           PER
   Symbol     
   MMM       0
   ABT       0
   ABBV      0

In [67]:
   # append df1 and df3
   # each has three rows, so 6 rows is the result
   # df1 had no PER column, so NaN for those rows
   # df3 had no BookValue, Price or Sector, so NaN values
   df1.append(df3)

Out[67]:
           BookValue  PER   Price       Sector
   Symbol                                     
   MMM        26.668  NaN  141.14  Industrials
   ABT        15.573  NaN   39.60  Health Care
   ABBV        2.954  NaN   53.95  Health Care
   MMM           NaN    0     NaN          NaN
   ABT           NaN    0     NaN          NaN
   ABBV          NaN    0     NaN          NaN

To append without forcing the index to be taken from either DataFrame, you can use the ignore_index=True parameter. This is useful when the index values are not of significant meaning, and you just want concatenated data with sequentially increasing integers as indexes:

In [68]:
   # ignore index labels, create default index
   df1.append(df3, ignore_index=True)

Out[68]:
      BookValue  PER   Price       Sector
   0     26.668  NaN  141.14  Industrials
   1     15.573  NaN   39.60  Health Care
   2      2.954  NaN   53.95  Health Care
   3        NaN    0     NaN          NaN
   4        NaN    0     NaN          NaN
   5        NaN    0     NaN          NaN

Concatenating DataFrame objects with pd.concat()

A DataFrame can be concatenated to another using the pd.concat() function. This function functions similarly to the .append() method, but also adds the ability to specify an axis (appending can be row or column based), as well as being able to perform several join operations between the objects. Also, the function takes a list of pandas objects to concatenate, so you can concatenate more than two objects in a single call.

The default operation of pd.concat() on two DataFrame objects operates in the same way as the .append() method. This can be demonstrated by reconstructing the two datasets from the earlier example and concatenating them. This is shown in the following example:

In [69]:
   # copy the first three rows of sp500
   df1 = sp500.iloc[0:3].copy()
   # copy 10th and 11th rows
   df2 = sp500.iloc[[10, 11, 2]]
   # pass them as a list
   pd.concat([df1, df2])

Out[69]:
               Sector   Price  BookValue
   Symbol                                
   MMM     Industrials  141.14     26.668
   ABT     Health Care   39.60     15.573
   ABBV    Health Care   53.95      2.954
   A       Health Care   56.18     16.928
   GAS       Utilities   52.98     32.462
   ABBV    Health Care   53.95      2.954

Actually, pandas calculates the sorted union of distinct column names across all supplied objects and uses those as the columns, and then appends data along the rows for each object in the order given in the list.

A slight variant of this example adds an additional column to one of the DataFrame objects and then performs the concatenation:

In [70]:
   # copy df2
   df2_2 = df2.copy()
   # add a column to df2_2 that is not in df1
   df2_2.insert(3, 'Foo', pd.Series(0, index=df2.index))
   # see what it looks like
   df2_2

Out[70]:
               Sector  Price  BookValue  Foo
   Symbol                                    
   A       Health Care  56.18     16.928    0
   GAS       Utilities  52.98     32.462    0
   ABBV    Health Care  53.95      2.954    0

In [71]:
   # now concatenate
   pd.concat([df1, df2_2])

Out[71]:
           BookValue  Foo   Price       Sector
   Symbol                                     
   MMM        26.668  NaN  141.14  Industrials
   ABT        15.573  NaN   39.60  Health Care
   ABBV        2.954  NaN   53.95  Health Care
   A          16.928    0   56.18  Health Care
   GAS        32.462    0   52.98    Utilities
   ABBV        2.954    0   53.95  Health Care

Duplicate index labels still result, as the rows are copied verbatim from the source objects. However, note the NaN values in the rows originating from df1, since it does not have a Foo column.

Using the keys parameter, it is possible to differentiate the pandas objects from which the rows originated. The following code adds a level to the index which represents the source object:

In [72]:
   # specify keys
   r = pd.concat([df1, df2_2], keys=['df1', 'df2'])
   r

Out[72]:
               BookValue  Foo   Price       Sector
       Symbol                                     
   df1 MMM        26.668  NaN  141.14  Industrials
       ABT        15.573  NaN   39.60  Health Care
       ABBV        2.954  NaN   53.95  Health Care
   df2 A          16.928    0   56.18  Health Care
       GAS        32.462    0   52.98    Utilities
       ABBV        2.954    0   53.95  Health Care

We can change the axis of the concatenation to work along the columns by specifying axis=1, which will calculate the sorted union of the distinct index labels from the rows and then append columns and their data from the specified objects.

To demonstrate, the following splits the sp500 data into two DataFrame objects, each with a different set of columns, and then concatenates along axis=1:

In [73]:
   # first three rows, columns 0 and 1
   df3 = sp500[:3][[0, 1]]
   df3

Out[73]:
                Sector   Price
   Symbol                     
   MMM     Industrials  141.14
   ABT     Health Care   39.60
   ABBV    Health Care   53.95

In [74]:
   # first three rows, column 2
   df4 = sp500[:3][[2]]
   df4

Out[74]:
           BookValue
   Symbol           
   MMM        26.668
   ABT        15.573
   ABBV        2.954

In [75]:
   # put them back together
   pd.concat([df3, df4], axis=1)

Out[75]:
                Sector   Price  BookValue
   Symbol                                
   MMM     Industrials  141.14     26.668
   ABT     Health Care   39.60     15.573
   ABBV    Health Care   53.95      2.954

We can further examine this operation by adding a column to the second DataFrame that has a duplicate name to a column in the first. The result will have duplicate columns, as the columns are blindly appended without regard to already existing columns:

In [76]:
   # make a copy of df4
   df4_2 = df4.copy()
   # add a column to df4_2, that is also in df3
   df4_2.insert(1, 'Sector', pd.Series(1, index=df4_2.index))
   df4_2

Out[76]:
           BookValue  Sector
   Symbol                   
   MMM        26.668       1
   ABT        15.573       1
   ABBV        2.954       1

In [77]:
   # demonstrate duplicate columns
   pd.concat([df3, df4_2], axis=1)

Out[77]:
                Sector   Price  BookValue  Sector
   Symbol                                        
   MMM     Industrials  141.14     26.668       1
   ABT     Health Care   39.60     15.573       1
   ABBV    Health Care   53.95      2.954       1

To be very specific, pandas is performing an outer join along the labels of the specified axis. An inner join can be specified using the join='inner' parameter, which changes the operation from being a sorted union of distinct labels to the distinct values of the intersection of the labels. To demonstrate, the following selects two subsets of the financial data with one row in common and performs an inner join:

In [78]:
   # first three rows and first two columns
   df5 = sp500[:3][[0, 1]]
   df5

Out[78]:
                Sector   Price
   Symbol                     
   MMM     Industrials  141.14
   ABT     Health Care   39.60
   ABBV    Health Care   53.95

In [79]:
   # row 2 through 4 and first two columns
   df6 = sp500[2:5][[0,1]]
   df6

Out[79]:
                           Sector   Price
   Symbol                                
   ABBV               Health Care   53.95
   ACN     Information Technology   79.79
   ACE                 Financials  102.91

In [80]:
   # inner join on index labels will return in only one row
   pd.concat([df5, df6], join='inner', axis=1)

Out[80]:
                Sector  Price       Sector  Price
   Symbol                                        
   ABBV    Health Care  53.95  Health Care  53.95

Adding rows (and columns) via setting with enlargement

Rows can also be added to a DataFrame through the .loc property. This technique is referred to as setting with enlargement. The parameter for .loc specifies the index label where the row is to be placed. If the label does not exist, the values are appended to the DataFrame using the given index label. If it does exist, then the values in the specified row are replaced.

The following example takes a subset of sp500 and adds a row with the label FOO:

In [81]:
   # get a small subset of the sp500 
   # make sure to copy the slice to make a copy
   ss = sp500[:3].copy()
   # create a new row with index label FOO
   # and assign some values to the columns via a list
   ss.loc['FOO'] = ['the sector', 100, 110]
   ss

Out[81]:
              Sector   Price  BookValue
   MMM   Industrials  141.14     26.668
   ABT   Health Care   39.60     15.573
   ABBV  Health Care   53.95      2.954
   FOO    the sector  100.00    110.000

Note that the change is made in place. If FOO already exists as an index label, then the column data would be replaced. This is one of the means of updating data in a DataFrame in-place, as .loc not only retrieves row(s), but also lets you modify the results that are returned.

It is also possible to add columns in this manner. The following code demonstrates by adding a new column to a subset of sp500 using .loc. Note that to accomplish this, we use the colon in the rows' position to select all rows to be included to add the new column and value:

In [82]:
   # copy of subset / slice
   ss = sp500[:3].copy()
   # add the new column initialized to 0
   ss.loc[:,'PER'] = 0
   # take a look at the results
   ss

Out[82]:
                Sector   Price  BookValue  PER
   Symbol                                     
   MMM     Industrials  141.14     26.668    0
   ABT     Health Care   39.60     15.573    0
   ABBV    Health Care   53.95      2.954    0

Removing rows from a DataFrame

Removing rows from a DataFrame object is normally performed using one of three techniques:

Using the .drop() method
Boolean selection
Selection using a slice

Technically, only the .drop() method removes rows in-place on the source object. The other techniques either create a copy without specific rows, or a view into the rows that are not to be dropped. Details of each are given in the following sections.

Removing rows using .drop()

To remove rows from a DataFrame by the index label, you can use the .drop() method of the DataFrame. The .drop() method takes a list of index labels and will return a copy of the DataFrame with the rows for the specified labels removed. The source DataFrame remains unmodified. The code is as follows:

In [83]:
   # get a copy of the first 5 rows of sp500
   ss = sp500[:5].copy()
   ss

Out[83]:
                           Sector   Price  BookValue
   Symbol                                           
   MMM                Industrials  141.14     26.668
   ABT                Health Care   39.60     15.573
   ABBV               Health Care   53.95      2.954
   ACN     Information Technology   79.79      8.326
   ACE                 Financials  102.91     86.897

In [84]:
   # drop rows with labels ABT and ACN
   afterdrop = ss.drop(['ABT', 'ACN'])
   afterdrop

Out[84]:
                Sector   Price  BookValue
   Symbol                                
   MMM     Industrials  141.14     26.668
   ABBV    Health Care   53.95      2.954
   ACE      Financials  102.91     86.897

In [85]:
   # note that ss is not modified
   ss

Out[85]:
                           Sector   Price  BookValue
   Symbol                                           
   MMM                Industrials  141.14     26.668
   ABT                Health Care   39.60     15.573
   ABBV               Health Care   53.95      2.954
   ACN     Information Technology   79.79      8.326
   ACE                 Financials  102.91     86.897

Removing rows using Boolean selection

Boolean selection can be used to remove rows from a DataFrame by creating a new DataFrame without the desired rows. Suppose we want to remove rows where Price is greater than 300. The process to do this, is to first determine which rows match that criteria, and then to select the rows that do not. The following code selects those rows and lets us know how many of them there are:

In [86]:
   # determine the rows where Price > 300
   selection = sp500.Price > 300
   # to make the output shorter, report the # of rows returned (500), 
   # and the sum of those where Price > 300 (which is 10)
   "{0} {1}".format(len(selection), selection.sum())

Out[86]:
   '500 10'

We now know both the rows that match this criteria (the 10 with True values) and those that do not (the other 490). To remove the rows now, select out the complement of the previous result. This gives us a new DataFrame containing only the rows where we had a False value from the previous selection:

In [87]:
   # select the complement
   withPriceLessThan300 = sp500[~selection]
   withPriceLessThan300

Out[87]:
                           Sector   Price  BookValue
   Symbol                                           
   MMM                Industrials  141.14     26.668
   ABT                Health Care   39.60     15.573
   ABBV               Health Care   53.95      2.954
   ACN     Information Technology   79.79      8.326
   ACE                 Financials  102.91     86.897
   ...                        ...     ...        ...
   YHOO    Information Technology   35.02     12.768
   YUM     Consumer Discretionary   74.77      5.147
   ZMH                Health Care  101.84     37.181
   ZION                Financials   28.43     30.191
   ZTS                Health Care   30.53      2.150

   [490 rows x 3 columns]

Removing rows using a slice

Slicing is also often used to remove records from a DataFrame. It is a process similar to Boolean selection, where we select out all of the rows, except for the ones you want deleted.

Suppose we want to remove all but the first three records from sp500. The slice to perform this task is [:3]:

In [88]:
   # get only the first three rows
   onlyFirstThree = sp500[:3]
   onlyFirstThree

Out[88]:
               Sector   Price  BookValue
   Symbol                                
   MMM     Industrials  141.14     26.668
   ABT     Health Care   39.60     15.573
   ABBV    Health Care   53.95      2.954

Remember, that this result is a slice. Therefore, it is a view into the DataFrame. Data has not been removed from the sp500 object. Changes to these three rows will change the data in sp500. To prevent this from occurring, the proper action is to make a copy of the slice, as follows:

In [89]:
   # first three, but a copy of them
   onlyFirstThree = sp500[:3].copy()
   onlyFirstThree

Out[89]:
                Sector   Price  BookValue
   Symbol                                
   MMM     Industrials  141.14     26.668
   ABT     Health Care   39.60     15.573
   ABBV    Health Care   53.95      2.954

Changing scalar values in a DataFrame

Scalar values in a DataFrame can be changed by assignment of the new value to the result of the value lookup using the .ix, .iloc and .loc attributes. These three attributes can all be passed both a row and column selectors, and the result can be assigned a new value that will be made in the original DataFrame.

.ix can be used by passing either the index label of the row and the name of the column, or the integer location of the row and column. The following code makes a copy of the first three rows of sp500 and then demonstrates changing the Price on the MMM and ABBV securities:

In [90]:
   # get a subset / copy of the data
   subset = sp500[:3].copy()
   subset

Out[90]:
                Sector   Price  BookValue
   Symbol                                
   MMM     Industrials  141.14     26.668
   ABT     Health Care   39.60     15.573
   ABBV    Health Care   53.95      2.954

In [91]:
   # change scalar by label on row and column
   subset.ix['MMM', 'Price'] = 0
   subset

Out[91]:
                Sector  Price  BookValue
   Symbol                               
   MMM     Industrials   0.00     26.668
   ABT     Health Care  39.60     15.573
   ABBV    Health Care  53.95      2.954

Using .ix is generally frowned upon when using production code, as it has performance issues over using the .loc and .iloc attributes, which have a higher performance, with .iloc being the highest performance of those two.

The following code replicates the .ix example, but uses .loc with the index and column labels and uses different values to differentiate the result slightly:

In [92]:
   subset = sp500[:3].copy()
   subset.loc['MMM', 'Price'] = 10
   subset.loc['ABBV', 'Price'] = 20
   subset

Out[92]:
                Sector  Price  BookValue
   Symbol                               
   MMM     Industrials   10.0     26.668
   ABT     Health Care   39.6     15.573
   ABBV    Health Care   20.0      2.954

.loc may suffer from lower performance, as compared to .iloc, due to the possibility of needing to map the label values into locations. The following example gets the location of the specific row and column that is desired to be changed and then uses .iloc to execute the change (the examples only change one price for brevity):

In [93]:
   # subset of the first three rows
   subset = sp500[:3].copy()
   # get the location of the Price column
   price_loc = sp500.columns.get_loc('Price')
   # get the location of the MMM row
   abt_row_loc = sp500.index.get_loc('ABT')
   # change the price
   subset.iloc[abt_row_loc, price_loc] = 1000
   subset

Out[93]:
                Sector    Price  BookValue
   Symbol                                 
   MMM     Industrials   141.14     26.668
   ABT     Health Care  1000.00     15.573
   ABBV    Health Care    53.95      2.954

This may be look like overkill for this small example. But if this is where code is being executed frequently, such as in a loop or in response to market changes, looking up the locations once and always using .loc with those values, will give significant performance gains over the other options.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Modifying the structure and content of DataFrame

Create new playlist

Sign In

Sign Up