To clarify, *add* data.
—Edward R. Tufte
In This Chapter
The Pandas DataFrame, which is built on top of the NumPy array, is probably the most commonly used data structure. DataFrames are like supercharged spreadsheets in code. They are one of the primary tools used in data science. This chapter looks at creating DataFrames, manipulating DataFrames, accessing data in DataFrames, and manipulating that data.
A Pandas DataFrame, like a spreadsheet, is made up of columns and rows. Each column is a pandas.Series object. A DataFrame is, in some ways, similar to a two-dimensional NumPy array, with labels for the columns and index. Unlike a NumPy array, however, a DataFrame can contain different data types. You can think of a pandas.Series object as a one-dimensional NumPy array with labels. The pandas.Series object, like a NumPy array, can contain only one data type. The pandas.Series object supports many of the same methods you have seen with arrays, such as min(), max(), mean(), and median().
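For example, these summary methods work on a small pandas.Series directly (a minimal sketch; the values here are arbitrary):

```python
import pandas as pd

# A Series is like a labeled one-dimensional array with a single dtype.
ages = pd.Series([43, 23, 78, 56], index=['a', 'b', 'c', 'd'])

print(ages.min())     # 23
print(ages.max())     # 78
print(ages.mean())    # 50.0
print(ages.median())  # 49.5
```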
The usual convention is to import the Pandas package aliased as pd
:
import pandas as pd
You can create DataFrames with data from many sources, including dictionaries and lists and, more commonly, by reading files. You can create an empty DataFrame by using the DataFrame
constructor:
df = pd.DataFrame()
print(df)
Empty DataFrame
Columns: []
Index: []
As a best practice, though, DataFrames should be initialized with data.
You can create DataFrames from a list of dictionaries or from a dictionary, where each key is a column label with the values for that key holding the data for the column. Listing 9.1 shows how to create a DataFrame by creating a list of data for each column and then creating a dictionary with the column names as keys and these lists as the values. The listing shows how to then pass this dictionary to the DataFrame
constructor to construct the DataFrame.
first_names = ['shanda', 'rolly', 'molly', 'frank', 'rip', 'steven', 'gwen', 'arthur']
last_names = ['smith', 'brocker', 'stein', 'bach', 'spencer', 'de wilde', 'mason', 'davis']
ages = [43, 23, 78, 56, 26, 14, 46, 92]
data = {'first': first_names, 'last': last_names, 'ages': ages}
participants = pd.DataFrame(data)
The resulting DataFrame, participants
, looks as follows in Colab or in a Jupyter notebook:
| | first | last | ages |
|---|---|---|---|
| 0 | shanda | smith | 43 |
| 1 | rolly | brocker | 23 |
| 2 | molly | stein | 78 |
| 3 | frank | bach | 56 |
| 4 | rip | spencer | 26 |
| 5 | steven | de wilde | 14 |
| 6 | gwen | mason | 46 |
| 7 | arthur | davis | 92 |
Note
In this chapter, DataFrame tables that result from a code example will be presented as a table after the code.
You can see the column labels across the top, the data in each row, and the index labels to the left.
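The list-of-dictionaries form mentioned above also works: each dictionary becomes one row, with its keys as column labels. A minimal sketch using two of the participants:

```python
import pandas as pd

# Each dictionary supplies one row; keys become column labels.
rows = [
    {'first': 'shanda', 'last': 'smith', 'ages': 43},
    {'first': 'rolly', 'last': 'brocker', 'ages': 23},
]
df = pd.DataFrame(rows)
print(df.columns.tolist())  # ['first', 'last', 'ages']
```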
You can create a list of lists, with each sublist containing the data for one row, in the order of the columns:
data = [["shanda", "smith", 43],
        ["rolly", "brocker", 23],
        ["molly", "stein", 78],
        ["frank", "bach", 56],
        ["rip", "spencer", 26],
        ["steven", "de wilde", 14],
        ["gwen", "mason", 46],
        ["arthur", "davis", 92]]
Then you can use this as the data argument:
participants = pd.DataFrame(data)
participants
You get the same result as when creating a DataFrame from a dictionary:
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | shanda | smith | 43 |
| 1 | rolly | brocker | 23 |
| 2 | molly | stein | 78 |
| 3 | frank | bach | 56 |
| 4 | rip | spencer | 26 |
| 5 | steven | de wilde | 14 |
| 6 | gwen | mason | 46 |
| 7 | arthur | davis | 92 |
Notice that the resulting DataFrame has been created with integer column names. This is the default if no column names are supplied. You can supply column names explicitly as a list of strings:
column_names = ['first', 'last', 'ages']
Similarly, you can supply index labels as a list:
index_labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
These labels are then used during initialization, using the parameters columns
and index
:
participants = pd.DataFrame(data, columns=column_names, index=index_labels)
| | first | last | ages |
|---|---|---|---|
| a | shanda | smith | 43 |
| b | rolly | brocker | 23 |
| c | molly | stein | 78 |
| d | frank | bach | 56 |
| e | rip | spencer | 26 |
| f | steven | de wilde | 14 |
| g | gwen | mason | 46 |
| h | arthur | davis | 92 |
While creating DataFrames from dictionaries and lists is possible, the vast majority of the time you will create DataFrames from existing data sources. Files are the most common of these data sources. Pandas supplies functions for creating DataFrames from files for many common file types, including CSV, Excel, HTML, JSON, and SQL database connections.
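All these readers (pd.read_csv, pd.read_excel, pd.read_json, pd.read_html, pd.read_sql) follow the same pattern: you point them at a source and get back a DataFrame. As a quick sketch, read_csv can also parse an in-memory buffer, which is handy for experimenting without a file on disk:

```python
import io
import pandas as pd

# read_csv accepts a path or any file-like object; here an
# in-memory buffer stands in for a file.
csv_text = "first,last,ages\nshanda,smith,43\nrolly,brocker,23\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 3)
```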
Say that you want to open a CSV file from the FiveThirtyEight website, https://data.fivethirtyeight.com, under the data set college_majors
. After you unzip and upload the CSV file to Colab, you open it by simply supplying its path to the Pandas read_csv function:
college_majors = pd.read_csv('/content/all-ages.csv')
college_majors
| | Major | Major_category | Total | Unemployment_rate |
|---|---|---|---|---|
| 0 | GENERAL AGRICULTURE | Agriculture & Natural Resources | 128148 | 0.026147 |
| 1 | AGRICULTURE PRODUCTION AND MANAGEMENT | Agriculture & Natural Resources | 95326 | 0.028636 |
| 2 | AGRICULTURAL ECONOMICS | Agriculture & Natural Resources | 33955 | 0.030248 |
| ... | ... | ... | ... | ... |
| 170 | MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION | Business | 102753 | 0.052679 |
| 171 | HISTORY | Humanities & Liberal Arts | 712509 | 0.065851 |
| 172 | UNITED STATES HISTORY | Humanities & Liberal Arts | 17746 | 0.073500 |
Pandas uses the data in the CSV file to determine column labels and column types.
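One quick way to confirm what Pandas inferred is the dtypes attribute. A minimal sketch on a single-row frame with the same shape as the college_majors data:

```python
import pandas as pd

df = pd.DataFrame({'Major': ['GENERAL AGRICULTURE'],
                   'Total': [128148],
                   'Unemployment_rate': [0.026147]})

# Strings are inferred as object, whole numbers as int64,
# and decimals as float64.
print(df.dtypes)
```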
Once you have data loaded into a DataFrame, you should take a look at it. Pandas offers numerous ways of accessing data in a DataFrame. You can look at data by rows, columns, individual cells, or some combination of these. You can also extract data based on its value.
Note
When I first load data that I am unfamiliar with, I start by taking a peek at the top few rows and checking summary statistics on the data. Looking at the top rows of a DataFrame gives me a sense of what the new data looks like and allows me to confirm that the data is what I expect.
To see the top rows of a DataFrame, you can use the head
method, which returns the top five rows:
college_majors.head()
| | Major | Major_category | Total | Unemployment_rate |
|---|---|---|---|---|
| 0 | GENERAL AGRICULTURE | Agriculture & Natural Resources | 128148 | 0.026147 |
| 1 | AGRICULTURE PRODUCTION AND MANAGEMENT | Agriculture & Natural Resources | 95326 | 0.028636 |
| 2 | AGRICULTURAL ECONOMICS | Agriculture & Natural Resources | 33955 | 0.030248 |
| 3 | ANIMAL SCIENCES | Agriculture & Natural Resources | 103549 | 0.042679 |
| 4 | FOOD SCIENCE | Agriculture & Natural Resources | 24280 | 0.049188 |
The head
method takes an optional argument, which specifies the number of rows to return. You would specify the top three rows like this:
college_majors.head(3)
| | Major | Major_category | Total | Unemployment_rate |
|---|---|---|---|---|
| 0 | GENERAL AGRICULTURE | Agriculture & Natural Resources | 128148 | 0.026147 |
| 1 | AGRICULTURE PRODUCTION AND MANAGEMENT | Agriculture & Natural Resources | 95326 | 0.028636 |
| 2 | AGRICULTURAL ECONOMICS | Agriculture & Natural Resources | 33955 | 0.030248 |
The tail
method works in a similar way to head
but returns rows from the bottom. It also takes an optional argument that specifies the number of rows to return:
college_majors.tail()
| | Major | Major_category | Total | Unemployment_rate |
|---|---|---|---|---|
| 168 | HOSPITALITY MANAGEMENT | Business | 200854 | 0.051447 |
| 169 | MANAGEMENT INFORMATION SYSTEMS AND STATISTICS | Business | 156673 | 0.043977 |
| 170 | MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION | Business | 102753 | 0.052679 |
| 171 | HISTORY | Humanities & Liberal Arts | 712509 | 0.065851 |
| 172 | UNITED STATES HISTORY | Humanities & Liberal Arts | 17746 | 0.073500 |
Once I’ve taken a look at some rows from a DataFrame, I like to get a sense of the shape of the data. One tool for doing this is the DataFrame describe
method, which produces various descriptive statistics about the data. You can call describe
with no arguments, as shown here:
college_majors.describe()
| | Total | Unemployment_rate |
|---|---|---|
| count | 1.730000e+02 | 173.000000 |
| mean | 2.302566e+05 | 0.057355 |
| std | 4.220685e+05 | 0.019177 |
| min | 2.396000e+03 | 0.000000 |
| 25% | 2.428000e+04 | 0.046261 |
| 50% | 7.579100e+04 | 0.054719 |
| 75% | 2.057630e+05 | 0.069043 |
| max | 3.123510e+06 | 0.156147 |
This method calculates the count, mean, standard deviation, minimum, maximum, and quantiles for columns containing numeric data. It accepts optional arguments to control which data types are processed and the ranges of the quantiles returned. To change the quantiles, you use the percentiles
argument:
college_majors.describe(percentiles=[0.1, 0.9])
| | Total | Unemployment_rate |
|---|---|---|
| count | 1.730000e+02 | 173.000000 |
| mean | 2.302566e+05 | 0.057355 |
| std | 4.220685e+05 | 0.019177 |
| min | 2.396000e+03 | 0.000000 |
| 10% | 9.775600e+03 | 0.037053 |
| 50% | 7.579100e+04 | 0.054719 |
| 90% | 6.739758e+05 | 0.080062 |
| max | 3.123510e+06 | 0.156147 |
This example specifies percentiles for 10% and 90% rather than the default 25% and 75%. Note that 50% is inserted regardless of the argument.
If you want to see statistics calculated from nonnumeric columns, you can specify which data types are processed. You do this by using the include keyword. The value passed to this keyword should be a sequence of data types. In Pandas, strings are stored with the data type object, so the following includes columns with string data types:

college_majors.describe(include=[object])

(Older code often passes np.object here, but that alias was deprecated in NumPy 1.20 and later removed; use the built-in object instead.) You can also pass the string name of the data type, which for strings is 'object':

college_majors.describe(include=['object'])
So, for strings, you get the count, the number of unique values, the top value, and the frequency of this top value:
| | Major | Major_category |
|---|---|---|
| count | 173 | 173 |
| unique | 173 | 16 |
| top | GEOSCIENCES | Engineering |
| freq | 1 | 29 |
You can pass the string all
instead of a list of data types to produce statistics for all the columns:
college_majors.describe(include='all')
| | Major | Major_category | Total | Unemployment_rate |
|---|---|---|---|---|
| count | 173 | 173 | 1.730000e+02 | 173.000000 |
| unique | 173 | 16 | NaN | NaN |
| top | GEOSCIENCES | Engineering | NaN | NaN |
| freq | 1 | 29 | NaN | NaN |
| mean | NaN | NaN | 2.302566e+05 | 0.057355 |
| std | NaN | NaN | 4.220685e+05 | 0.019177 |
| min | NaN | NaN | 2.396000e+03 | 0.000000 |
| 25% | NaN | NaN | 2.428000e+04 | 0.046261 |
| 50% | NaN | NaN | 7.579100e+04 | 0.054719 |
| 75% | NaN | NaN | 2.057630e+05 | 0.069043 |
| max | NaN | NaN | 3.123510e+06 | 0.156147 |
Note
Where a statistic is not appropriate for a data type, such as the standard deviation for a string, the not-a-number value NaN is inserted.
In case you want to exclude certain data types rather than specify which ones to include, Pandas supplies the exclude
argument, which takes the same types of arguments as include
:
college_majors.describe(exclude=['int'])
Once you have taken an initial peek at a frame using head
or tail
and gotten a sense of the shape of the data using describe
, you can start looking at the data and individual columns, rows, or cells.
This is the participants
DataFrame from earlier in the chapter:
participants
|
first |
last |
ages |
a |
shanda |
smith |
43 |
b |
rolly |
brocker |
23 |
c |
molly |
stein |
78 |
d |
frank |
bach |
56 |
e |
rip |
spencer |
26 |
f |
steven |
de wilde |
14 |
g |
gwen |
mason |
46 |
h |
arthur |
davis |
92 |
To access columns or rows in Pandas DataFrames, you can use bracket syntax. This syntax is convenient and easy to read, which makes it great for interactive sessions where you are exploring and playing with data.
To access a single column, you supply the column name as an argument in brackets, much as you would a dictionary key:
participants['first']
a    shanda
b     rolly
c     molly
d     frank
e       rip
f    steven
g      gwen
h    arthur
Name: first, dtype: object
You can see that this returns the data for the column along with its index labels, its name, and its data type. If a column name does not contain spaces, dashes, or other special characters, and if it is not the same as an existing attribute of the DataFrame, you can access the column as an attribute.
For example, here is how you access the ages
column:
participants.ages
a    43
b    23
c    78
d    56
e    26
f    14
g    46
h    92
Name: ages, dtype: int64
This would not work with the columns first
or last
, as these already exist as attributes of the DataFrame.
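The same clash happens with any column whose name matches an existing DataFrame attribute or method. A quick sketch using count, which is always a DataFrame method, shows why brackets are the safe choice:

```python
import pandas as pd

# 'count' is already a DataFrame method, so a column with that
# name cannot be reached through attribute access.
df = pd.DataFrame({'count': [1, 2, 3]})

print(callable(df.count))    # True: attribute access finds the method
print(df['count'].tolist())  # [1, 2, 3]: brackets reach the column
```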
To access multiple columns, you specify the column label as a list:
participants[['last', 'first']]
| | last | first |
|---|---|---|
| a | smith | shanda |
| b | brocker | rolly |
| c | stein | molly |
| d | bach | frank |
| e | spencer | rip |
| f | de wilde | steven |
| g | mason | gwen |
| h | davis | arthur |
This returns a DataFrame with only the requested columns.
The bracket syntax is overloaded to allow you to grab rows as well as columns. To specify rows, you use a slice as an argument. If the slice uses integers, then those integers represent the row numbers to return. To return rows 3, 4, and 5 of the DataFrame participants
, for example, you can use the slice 3:6
:
participants[3:6]
| | first | last | ages |
|---|---|---|---|
| d | frank | bach | 56 |
| e | rip | spencer | 26 |
| f | steven | de wilde | 14 |
You can also slice using index labels. When you use labels to slice, the last value is included. So to get rows a
, b
, and c
, you slice using a:c
:
participants['a':'c']
| | first | last | ages |
|---|---|---|---|
| a | shanda | smith | 43 |
| b | rolly | brocker | 23 |
| c | molly | stein | 78 |
You can indicate which rows to return by using a list of Booleans. The list should have one Boolean per row: True
for the desired rows, and False
for the others. The following example returns the second, third, and sixth rows:
mask = [False, True, True, False, False, True, False, False]
participants[mask]
| | first | last | ages |
|---|---|---|---|
| b | rolly | brocker | 23 |
| c | molly | stein | 78 |
| f | steven | de wilde | 14 |
The bracket syntax provides a very convenient and easy-to-read way to access data. It is often used in interactive sessions when experimenting with and exploring DataFrames, but it is not optimized for performance with large data sets. The recommended way to index into DataFrames in production code or for large data sets is to use the DataFrame loc
and iloc
indexers. These indexers use a bracket syntax very similar to what you have seen here. The loc
indexer indexes using labels, and iloc
uses index positions.
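The difference between the two indexers is easy to see side by side on a small frame (a sketch; the same rules apply to participants):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])

# loc uses labels, and label slices include the endpoint.
print(df.loc['a':'b', 'x'].tolist())  # [10, 20]

# iloc uses positions, and position slices exclude the endpoint.
print(df.iloc[0:2, 0].tolist())       # [10, 20]
```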
With the loc
indexer, you can supply a single label, and the values for that row will be returned. To get the values from the row labeled c
, for example, you simply supply c
as an argument:
participants.loc['c']
first    molly
last     stein
ages        78
Name: c, dtype: object
You can provide a slice of labels, and once again, the last label is included:
participants.loc['c':'f']
| | first | last | ages |
|---|---|---|---|
| c | molly | stein | 78 |
| d | frank | bach | 56 |
| e | rip | spencer | 26 |
| f | steven | de wilde | 14 |
Or you can provide a sequence of Booleans:
mask = [False, True, True, False, False, True, False, False]
participants.loc[mask]
| | first | last | ages |
|---|---|---|---|
| b | rolly | brocker | 23 |
| c | molly | stein | 78 |
| f | steven | de wilde | 14 |
An optional second argument can indicate which columns to return. If you want to return all the rows for the column first
, for example, you specify all rows with a slice, a comma, and the column label:
participants.loc[:, 'first']
a    shanda
b     rolly
c     molly
d     frank
e       rip
f    steven
g      gwen
h    arthur
Name: first, dtype: object
You could provide a list of column labels:
participants.loc[:'c', ['ages', 'last']]
| | ages | last |
|---|---|---|
| a | 43 | smith |
| b | 23 | brocker |
| c | 78 | stein |
Or you could provide a list of Booleans:
participants.loc[:'c', [False, True, True]]
| | last | ages |
|---|---|---|
| a | smith | 43 |
| b | brocker | 23 |
| c | stein | 78 |
The iloc
indexer enables you to use index positions to select rows and columns. Much as you’ve seen before with brackets, you can use a single value to specify a single row:
participants.iloc[3]
first    frank
last      bach
ages        56
Name: d, dtype: object
Or you can specify multiple rows by using a slice:
participants.iloc[1:4]
| | first | last | ages |
|---|---|---|---|
| b | rolly | brocker | 23 |
| c | molly | stein | 78 |
| d | frank | bach | 56 |
You can, optionally, indicate which column to return by using a second slice:
participants.iloc[1:4, :2]
| | first | last |
|---|---|---|
| b | rolly | brocker |
| c | molly | stein |
| d | frank | bach |
A powerful feature of DataFrames is the ability to select data based on values. You can use comparison operators with columns to see which values meet some condition. For example, if you want to see which rows of the college_majors
DataFrame have the value Humanities & Liberal Arts
as a major category, you can use the equality operator (==
):
college_majors.Major_category == 'Humanities & Liberal Arts'
0      False
1      False
2      False
3      False
       ...
169    False
170    False
171     True
172     True
Name: Major_category, Length: 173, dtype: bool
This produces a pandas.Series
object that contains True
for every row that matches the condition. A series of Booleans is mildly interesting, but the real power comes when you combine it with an indexer to filter the results. Remember that loc
returns rows for every True
value of an input sequence. You can make a condition based on a comparison operator and a column, for example, as shown here for the greater-than operator and the column Total:
total_mask = college_majors.loc[:, 'Total'] > 1200000
You can use the result as a mask to select only the rows that meet this condition:
top_majors = college_majors.loc[total_mask]
top_majors
| | Major | Major_category | Total | Unemployment_rate |
|---|---|---|---|---|
| 25 | GENERAL EDUCATION | Education | 1438867 | 0.043904 |
| 28 | ELEMENTARY EDUCATION | Education | 1446701 | 0.038359 |
| 114 | PSYCHOLOGY | Psychology & Social Work | 1484075 | 0.069667 |
| 153 | NURSING | Health | 1769892 | 0.026797 |
| 158 | GENERAL BUSINESS | Business | 2148712 | 0.051378 |
| 159 | ACCOUNTING | Business | 1779219 | 0.053415 |
| 161 | BUSINESS MANAGEMENT AND ADMINISTRATION | Business | 3123510 | 0.058865 |
You can use the min() method to check that every row in the resulting DataFrame meets the condition:

top_majors.Total.min()
1438867
Now say that you want to see which major categories have the lowest unemployment rates. You can use describe on a single column as well as on a full DataFrame. If you use describe on the column Unemployment_rate, for example, you can see that the cutoff for the bottom quartile (the 25% row) is 0.046261:
college_majors.Unemployment_rate.describe()
count    173.000000
mean       0.057355
std        0.019177
min        0.000000
25%        0.046261
50%        0.054719
75%        0.069043
max        0.156147
Name: Unemployment_rate, dtype: float64
You can create a mask for all rows with an unemployment rate less than or equal to this:
employ_rate_mask = college_majors.loc[:, 'Unemployment_rate'] <= 0.046261
And you can use this mask to produce a DataFrame with only these rows:
employ_rate_majors = college_majors.loc[employ_rate_mask]
Then you can use the pandas.Series
object’s unique
method to see which major categories are in the resulting DataFrame:
employ_rate_majors.Major_category.unique()
array(['Agriculture & Natural Resources', 'Education', 'Engineering',
       'Biology & Life Science', 'Computers & Mathematics',
       'Humanities & Liberal Arts', 'Physical Sciences', 'Health',
       'Business'], dtype=object)
All these categories have at least one row with an unemployment rate that meets the condition.
You can use the three Boolean operators AND (&
), OR (|
), and NOT (~
) with the results of your conditions. You can use &
or |
to combine conditions and create more complex ones. You can use ~
to create a mask that is the opposite of your condition.
For example, you can use AND to create a new mask based on the previous ones to see which major categories of the most popular majors have a low unemployment rate. To do this, you use the &
operator between your existing masks to produce a new one:
total_rate_mask = employ_rate_mask & total_mask
total_rate_mask
0      False
1      False
2      False
3      False
4      False
       ...
168    False
169    False
170    False
171    False
172    False
Length: 173, dtype: bool
By looking at the resulting DataFrame, you can see which of the most popular majors have the lowest unemployment rates:
college_majors.loc[total_rate_mask]
| | Major | Major_category | Total | Unemployment_rate |
|---|---|---|---|---|
| 25 | GENERAL EDUCATION | Education | 1438867 | 0.043904 |
| 28 | ELEMENTARY EDUCATION | Education | 1446701 | 0.038359 |
| 153 | NURSING | Health | 1769892 | 0.026797 |
You can use the ~ operator with your employment rate mask to create a DataFrame whose rows all have an unemployment rate above the bottom-quartile cutoff:

lower_rate_mask = ~employ_rate_mask
lower_rate_majors = college_majors.loc[lower_rate_mask]
You can check this work by using the min method on the Unemployment_rate column to see that the smallest remaining value is above the cutoff:

lower_rate_majors.Unemployment_rate.min()
0.046261360999999994
To select all the rows that either fit the top majors condition or the employment rate condition, you can use the |
operator:
college_majors.loc[total_mask | employ_rate_mask]
The resulting DataFrame contains all the rows that fit either condition.
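A quick sanity check on toy data (not the college_majors file) confirms how | behaves: every row that satisfies at least one of the masks appears in the result:

```python
import pandas as pd

df = pd.DataFrame({'Total': [100, 2000, 50],
                   'rate': [0.01, 0.09, 0.02]})

total_mask = df['Total'] > 1000   # [False, True, False]
rate_mask = df['rate'] <= 0.02    # [True, False, True]

# OR keeps rows matching either condition; here, all three rows.
either = df.loc[total_mask | rate_mask]
print(len(either))  # 3
```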
Once you have the data you need in a DataFrame, you might want to change the DataFrame. You can rename columns or indexes, you can add new columns and rows, and you can delete columns and rows.
Changing the label of a column is simple using the DataFrame rename
method. This is how you can use the DataFrame columns attribute to look at the current column names:
participants.columns
Index(['first', 'last', 'ages'], dtype='object')
You can then rename the columns of your choice by providing a dictionary mapping each old column name to the new one. For example, here is how you change the label of the column ages
to Age
:
participants.rename(columns={'ages': 'Age'})
| | first | last | Age |
|---|---|---|---|
| a | shanda | smith | 43 |
| b | rolly | brocker | 23 |
| c | molly | stein | 78 |
| d | frank | bach | 56 |
| e | rip | spencer | 26 |
| f | steven | de wilde | 14 |
| g | gwen | mason | 46 |
| h | arthur | davis | 92 |
By default, the rename
method returns a new DataFrame using the new column labels. So, if you check your original DataFrame's column names again, you see the old column name:
participants.columns
Index(['first', 'last', 'ages'], dtype='object')
This is how many DataFrame methods work (preserving the original state). Many of these methods offer an optional inplace
argument, which, if set to True
, changes the original DataFrame:
participants.rename(columns={'ages': 'Age'}, inplace=True)
participants.columns
Index(['first', 'last', 'Age'], dtype='object')
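An alternative to inplace=True, and arguably a clearer one, is simply to assign the returned DataFrame back to the same variable. A sketch of the same rename done that way, on a minimal version of the frame:

```python
import pandas as pd

participants = pd.DataFrame({'first': ['shanda'], 'ages': [43]})

# rename returns a new DataFrame; assigning it back to the variable
# has the same net effect as inplace=True.
participants = participants.rename(columns={'ages': 'Age'})
print(participants.columns.tolist())  # ['first', 'Age']
```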
You can use the indexer syntax to create new columns. To do so, you simply assign a list of values to a new column label, using the same bracket syntax you use to access an existing column:
participants['Zip Code'] = [94702, 97402, 94223, 94705, 97503, 94705, 94111, 95333]
participants
| | first | last | Age | Zip Code |
|---|---|---|---|---|
| a | shanda | smith | 43 | 94702 |
| b | rolly | brocker | 23 | 97402 |
| c | molly | stein | 78 | 94223 |
| d | frank | bach | 56 | 94705 |
| e | rip | spencer | 26 | 97503 |
| f | steven | de wilde | 14 | 94705 |
| g | gwen | mason | 46 | 94111 |
| h | arthur | davis | 92 | 95333 |
You can use operations between columns such as string addition to create values for a new column. If you decide you want to add a column with participants’ full names, you can construct the values from the existing columns for their first and last names:
participants['Full Name'] = (
    participants.loc[:, 'first'] + participants.loc[:, 'last']
)
participants
| | first | last | Age | Zip Code | Full Name |
|---|---|---|---|---|---|
| a | shanda | smith | 43 | 94702 | shandasmith |
| b | rolly | brocker | 23 | 97402 | rollybrocker |
| c | molly | stein | 78 | 94223 | mollystein |
| d | frank | bach | 56 | 94705 | frankbach |
| e | rip | spencer | 26 | 97503 | ripspencer |
| f | steven | de wilde | 14 | 94705 | stevende wilde |
| g | gwen | mason | 46 | 94111 | gwenmason |
| h | arthur | davis | 92 | 95333 | arthurdavis |
You can update a column by using the same syntax. For example, if you decide that the values in the full name column should have a white space between the names, you can just assign new values by using the same column name:
participants['Full Name'] = (
    participants.loc[:, 'first'] + ' ' + participants.loc[:, 'last']
)
participants
| | first | last | Age | Zip Code | Full Name |
|---|---|---|---|---|---|
| a | shanda | smith | 43 | 94702 | shanda smith |
| b | rolly | brocker | 23 | 97402 | rolly brocker |
| c | molly | stein | 78 | 94223 | molly stein |
| d | frank | bach | 56 | 94705 | frank bach |
| e | rip | spencer | 26 | 97503 | rip spencer |
| f | steven | de wilde | 14 | 94705 | steven de wilde |
| g | gwen | mason | 46 | 94111 | gwen mason |
| h | arthur | davis | 92 | 95333 | arthur davis |
Pandas gives you many ways to change data in a DataFrame. You can set values by using the same indexers you used before. You can do operations on whole DataFrames or on individual columns. And you can apply functions to change elements in a column or create new values from multiple rows or columns.
To change data using an indexer, you select the location where you want the new data to reside in the same way you select data to view, and then you assign a new value. To change arthur in row h to Paul, for example, you can use loc:

participants.loc['h', 'first'] = 'Paul'
participants
| | first | last | Age | Zip Code | Full Name |
|---|---|---|---|---|---|
| a | shanda | smith | 43 | 94702 | shanda smith |
| b | rolly | brocker | 23 | 97402 | rolly brocker |
| c | molly | stein | 78 | 94223 | molly stein |
| d | frank | bach | 56 | 94705 | frank bach |
| e | rip | spencer | 26 | 97503 | rip spencer |
| f | steven | de wilde | 14 | 94705 | steven de wilde |
| g | gwen | mason | 46 | 94111 | gwen mason |
| h | Paul | davis | 92 | 95333 | arthur davis |
Alternatively, you can use iloc to set the age of frank, in row d (position 3), to 99:

participants.iloc[3, 2] = 99
participants
| | first | last | Age | Zip Code | Full Name |
|---|---|---|---|---|---|
| a | shanda | smith | 43 | 94702 | shanda smith |
| b | rolly | brocker | 23 | 97402 | rolly brocker |
| c | molly | stein | 78 | 94223 | molly stein |
| d | frank | bach | 99 | 94705 | frank bach |
| e | rip | spencer | 26 | 97503 | rip spencer |
| f | steven | de wilde | 14 | 94705 | steven de wilde |
| g | gwen | mason | 46 | 94111 | gwen mason |
| h | Paul | davis | 92 | 95333 | arthur davis |
This should seem fairly intuitive if you think of it as a variation on the indexed assignment you have used with lists and dictionaries.
Earlier in this chapter, you used operations between columns to construct values for a new column. You can also use in-place operators such as +=
, -=
, and /=
, to change values in a column. To subtract 1 from the age of each participant, for example, you can use the -=
operator:
participants.Age -= 1
participants
| | first | last | Age | Zip Code | Full Name |
|---|---|---|---|---|---|
| a | shanda | smith | 42 | 94702 | shanda smith |
| b | rolly | brocker | 22 | 97402 | rolly brocker |
| c | molly | stein | 77 | 94223 | molly stein |
| d | frank | bach | 98 | 94705 | frank bach |
| e | rip | spencer | 25 | 97503 | rip spencer |
| f | steven | de wilde | 13 | 94705 | steven de wilde |
| g | gwen | mason | 45 | 94111 | gwen mason |
| h | Paul | davis | 91 | 95333 | arthur davis |
The replace Method

The replace method finds and replaces values across a DataFrame. For example, you can use it to replace the name rolly with Smiley:
participants.replace('rolly', 'Smiley')
| | first | last | Age | Zip Code | Full Name |
|---|---|---|---|---|---|
| a | shanda | smith | 42 | 94702 | shanda smith |
| b | Smiley | brocker | 22 | 97402 | rolly brocker |
| c | molly | stein | 77 | 94223 | molly stein |
| d | frank | bach | 98 | 94705 | frank bach |
| e | rip | spencer | 25 | 97503 | rip spencer |
| f | steven | de wilde | 13 | 94705 | steven de wilde |
| ... | ... | ... | ... | ... | ... |
This method also works with regular expressions. Here is how you construct a regular expression that matches words starting with s and replaces the s with S (note the backreference \2, which keeps the rest of each matched word):

participants.replace(r'(s)([a-z]+)', r'S\2', regex=True)
| | first | last | Age | Zip Code | Full Name |
|---|---|---|---|---|---|
| a | Shanda | Smith | 42 | 94702 | Shanda Smith |
| b | rolly | brocker | 22 | 97402 | rolly brocker |
| c | molly | Stein | 77 | 94223 | molly Stein |
| d | frank | bach | 98 | 94705 | frank bach |
| e | rip | Spencer | 25 | 97503 | rip Spencer |
| f | Steven | de wilde | 13 | 94705 | Steven de wilde |
| g | gwen | mason | 45 | 94111 | gwen mason |
| h | Paul | davis | 91 | 95333 | arthur davis |
Both DataFrames and the pandas.Series
object have an apply()
method that can call a function on values. In the case of a pandas.Series
object, the apply()
method calls a function of your choosing on every value in the pandas.Series
object individually.
Say that you define a function that capitalizes any string passed to it:
def cap_word(w):
    return w.capitalize()
Then, if you pass it as an argument to apply()
on the column first
, it capitalizes each first name:
participants.loc[:, 'first'].apply(cap_word)
a    Shanda
b     Rolly
c     Molly
d     Frank
e       Rip
f    Steven
g      Gwen
h      Paul
Name: first, dtype: object
In the case of a DataFrame, apply
takes a row as an argument, enabling you to produce new values from the columns of that row. Say that you define a function that uses values from the columns first
and Age
:
def say_hello(row):
    return f'{row["first"]} is {row["Age"]} years old.'
You can then apply the function to the whole DataFrame:
participants.apply(say_hello, axis=1)
a     shanda is 42 years old.
b      rolly is 22 years old.
c      molly is 77 years old.
d      frank is 98 years old.
e        rip is 25 years old.
f     steven is 13 years old.
g       gwen is 45 years old.
h       Paul is 91 years old.
dtype: object
You can use this method to call a function across rows or across columns. You use the axis
argument to indicate whether your function should expect a row or a column.
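A small sketch makes the axis distinction concrete (toy data; the same rules apply to participants):

```python
import pandas as pd

df = pd.DataFrame({'Age': [42, 22], 'Zip Code': [94702, 97402]})

# With the default axis=0, the function receives one column (a Series)
# at a time; here it reduces each column to its maximum.
print(df.apply(lambda col: col.max()))

# With axis=1, the function receives one row at a time instead.
print(df.apply(lambda row: row['Age'] + 1, axis=1).tolist())  # [43, 23]
```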
If you are working with DataFrames in Colab, you should try running this snippet:
%load_ext google.colab.data_table
This makes the output of your DataFrames interactive and enables you to filter and select interactively.
A Pandas DataFrame is a powerful tool for working with data in a spreadsheet-like environment. You can create DataFrames from many sources, but creating a DataFrame from a file is the most common. You can extend DataFrames with new columns and rows. You can access the data itself by using powerful indexers, which you can also use to set data. DataFrames provide a great way to explore and manipulate data.
Use this table to answer the following questions:
| Sample Size (mg) | %P | %Q |
|---|---|---|
| 0.24 | 40 | 60 |
| 2.34 | 34 | 66 |
| 0.0234 | 12 | 88 |
1. Create a DataFrame representing this table.
2. Add a new column labeled Total Q that contains the amount of Q (in mg) for each sample.
3. Divide the columns %P and %Q by 100.