9

Pandas

To clarify, *add* data.

Edward R. Tufte

In This Chapter

The Pandas DataFrame, which is built on top of the NumPy array, is probably the most commonly used data structure. DataFrames are like supercharged spreadsheets in code. They are one of the primary tools used in data science. This chapter looks at creating DataFrames, manipulating DataFrames, accessing data in DataFrames, and manipulating that data.

About DataFrames

A Pandas DataFrame, like a spreadsheet, is made up of columns and rows. Each column is a pandas.Series object. A DataFrame is, in some ways, similar to a two-dimensional NumPy array, with labels for the columns and index. Unlike a NumPy array, however, a DataFrame can contain different data types. You can think of a pandas.Series object as a one-dimensional NumPy array with labels. The pandas.Series object, like a NumPy array, can contain only one data type. The pandas.Series object can use many of the same methods you have seen with arrays, such as min(), max(), mean(), and median().

The usual convention is to import the Pandas package aliased as pd:

import pandas as pd
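As a quick illustration of the Series-with-labels idea (a minimal sketch with made-up temperature values), a Series behaves like a labeled one-dimensional array:

```python
import pandas as pd

# A Series pairs a one-dimensional array of values with index labels.
temperatures = pd.Series([61, 65, 58], index=['mon', 'tue', 'wed'])

print(temperatures['tue'])   # access by label: 65
print(temperatures.mean())   # array-style statistics
```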

Creating DataFrames

You can create DataFrames with data from many sources, including dictionaries and lists and, more commonly, by reading files. You can create an empty DataFrame by using the DataFrame constructor:

df = pd.DataFrame()
print(df)
Empty DataFrame
Columns: []
Index: []

As a best practice, though, DataFrames should be initialized with data.

Creating a DataFrame from a Dictionary

You can create DataFrames from a dictionary, where each key is a column label and the values for that key hold the data for the column, or from a list of dictionaries, where each dictionary holds the data for one row. Listing 9.1 creates a list of data for each column, builds a dictionary with the column names as keys and these lists as the values, and passes this dictionary to the DataFrame constructor to construct the DataFrame.

Listing 9.1 Creating a DataFrame from a Dictionary

first_names = ['shanda', 'rolly', 'molly', 'frank',
               'rip', 'steven', 'gwen', 'arthur']

last_names = ['smith', 'brocker', 'stein', 'bach',
              'spencer', 'de wilde', 'mason', 'davis']

ages = [43, 23, 78, 56, 26, 14, 46, 92]
data = {'first':first_names,
        'last':last_names,
        'ages':ages}

participants = pd.DataFrame(data)

The resulting DataFrame, participants, looks as follows in Colab or in a Jupyter notebook:

 

    first      last  ages
0  shanda     smith    43
1   rolly   brocker    23
2   molly     stein    78
3   frank      bach    56
4     rip   spencer    26
5  steven  de wilde    14
6    gwen     mason    46
7  arthur     davis    92

Note

In this chapter, DataFrame tables that result from a code example will be presented as a table after the code.

You can see the column labels across the top, the data in each row, and the index labels to the left.
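The list-of-dictionaries form mentioned earlier works as well: each dictionary supplies one row, with the keys acting as column labels. A short sketch using two of the same sample rows:

```python
import pandas as pd

# Each dictionary becomes one row; keys become the column labels.
rows = [{'first': 'shanda', 'last': 'smith', 'ages': 43},
        {'first': 'rolly', 'last': 'brocker', 'ages': 23}]

participants = pd.DataFrame(rows)
print(participants)
```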

Creating a DataFrame from a List of Lists

You can create a list of lists, with each sublist containing the data for one row, in the order of the columns:

data = [["shanda", "smith", 43],
        ["rolly", "brocker", 23],
        ["molly", "stein", 78],
        ["frank", "bach", 56],
        ["rip", "spencer", 26],
        ["steven", "de wilde", 14],
        ["gwen", "mason", 46],
        ["arthur", "davis", 92]]

Then you can use this as the data argument:

participants = pd.DataFrame(data)
participants

You get the same result as when creating a DataFrame from a dictionary:

 

        0         1   2
0  shanda     smith  43
1   rolly   brocker  23
2   molly     stein  78
3   frank      bach  56
4     rip   spencer  26
5  steven  de wilde  14
6    gwen     mason  46
7  arthur     davis  92

Notice that the resulting DataFrame has been created with integer column names. This is the default if no column names are supplied. You can supply column names explicitly as a list of strings:

column_names = ['first', 'last', 'ages']

Similarly, you can supply index labels as a list:

index_labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

These labels are then used during initialization, using the parameters columns and index:

participants = pd.DataFrame(data,
                            columns=column_names,
                            index=index_labels)

 

    first      last  ages
a  shanda     smith    43
b   rolly   brocker    23
c   molly     stein    78
d   frank      bach    56
e     rip   spencer    26
f  steven  de wilde    14
g    gwen     mason    46
h  arthur     davis    92

Creating a DataFrame from a File

While creating DataFrames from dictionaries and lists is possible, the vast majority of the time you will create DataFrames from existing data sources. Files are the most common of these data sources. Pandas supplies functions for creating DataFrames from files for many common file types, including CSV, Excel, HTML, JSON, and SQL database connections.

Say that you want to open a CSV file from the college-majors data set on the FiveThirtyEight website, https://data.fivethirtyeight.com. After you unzip and upload the CSV file to Colab, you open it by simply supplying its path to the Pandas read_csv function:

college_majors = pd.read_csv('/content/all-ages.csv')
college_majors

 

                                               Major                   Major_category   Total  Unemployment_rate
0                                GENERAL AGRICULTURE  Agriculture & Natural Resources  128148           0.026147
1              AGRICULTURE PRODUCTION AND MANAGEMENT  Agriculture & Natural Resources   95326           0.028636
2                             AGRICULTURAL ECONOMICS  Agriculture & Natural Resources   33955           0.030248
..                                               ...                              ...     ...                ...
170  MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION                         Business  102753           0.052679
171                                          HISTORY        Humanities & Liberal Arts  712509           0.065851
172                            UNITED STATES HISTORY        Humanities & Liberal Arts   17746           0.073500

Pandas uses the data in the CSV file to determine column labels and column types.
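You can confirm what Pandas inferred by checking the DataFrame's dtypes attribute. The sketch below uses a small inline stand-in for the CSV file (via io.StringIO) so it runs without the download:

```python
import io
import pandas as pd

# A two-row stand-in for the downloaded all-ages.csv file.
csv_text = """Major,Major_category,Total,Unemployment_rate
GENERAL AGRICULTURE,Agriculture & Natural Resources,128148,0.026147
AGRICULTURAL ECONOMICS,Agriculture & Natural Resources,33955,0.030248
"""

college_majors = pd.read_csv(io.StringIO(csv_text))
print(college_majors.dtypes)   # strings become object; numbers become int64/float64
```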

Interacting with DataFrame Data

Once you have data loaded into a DataFrame, you should take a look at it. Pandas offers numerous ways of accessing data in a DataFrame. You can look at data by rows, columns, individual cells, or some combination of these. You can also extract data based on its value.

Note

When I first load data that I am unfamiliar with, I start by taking a peek at the top few rows and checking summary statistics on the data. Looking at the top rows of a DataFrame gives me a sense of what the new data looks like and allows me to confirm that the data is what I expect.

Heads and Tails

To see the top rows of a DataFrame, you can use the head method, which returns the top five rows:

college_majors.head()

 

                                   Major                   Major_category   Total  Unemployment_rate
0                    GENERAL AGRICULTURE  Agriculture & Natural Resources  128148           0.026147
1  AGRICULTURE PRODUCTION AND MANAGEMENT  Agriculture & Natural Resources   95326           0.028636
2                 AGRICULTURAL ECONOMICS  Agriculture & Natural Resources   33955           0.030248
3                        ANIMAL SCIENCES  Agriculture & Natural Resources  103549           0.042679
4                           FOOD SCIENCE  Agriculture & Natural Resources   24280           0.049188

The head method takes an optional argument, which specifies the number of rows to return. You would specify the top three rows like this:

college_majors.head(3)

 

                                   Major                   Major_category   Total  Unemployment_rate
0                    GENERAL AGRICULTURE  Agriculture & Natural Resources  128148           0.026147
1  AGRICULTURE PRODUCTION AND MANAGEMENT  Agriculture & Natural Resources   95326           0.028636
2                 AGRICULTURAL ECONOMICS  Agriculture & Natural Resources   33955           0.030248

The tail method works in a similar way to head but returns rows from the bottom. It also takes an optional argument that specifies the number of rows to return:

college_majors.tail()

 

                                               Major             Major_category   Total  Unemployment_rate
168                           HOSPITALITY MANAGEMENT                   Business  200854           0.051447
169    MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                   Business  156673           0.043977
170  MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION                   Business  102753           0.052679
171                                          HISTORY  Humanities & Liberal Arts  712509           0.065851
172                            UNITED STATES HISTORY  Humanities & Liberal Arts   17746           0.073500

Descriptive Statistics

Once I’ve taken a look at some rows from a DataFrame, I like to get a sense of the shape of the data. One tool for doing this is the DataFrame describe method, which produces various descriptive statistics about the data. You can call describe with no arguments, as shown here:

college_majors.describe()

 

              Total  Unemployment_rate
count  1.730000e+02         173.000000
mean   2.302566e+05           0.057355
std    4.220685e+05           0.019177
min    2.396000e+03           0.000000
25%    2.428000e+04           0.046261
50%    7.579100e+04           0.054719
75%    2.057630e+05           0.069043
max    3.123510e+06           0.156147

This method calculates the count, mean, standard deviation, minimum, maximum, and quantiles for columns containing numeric data. It accepts optional arguments to control which data types are processed and the ranges of the quantiles returned. To change the quantiles, you use the percentiles argument:

college_majors.describe(percentiles=[0.1, 0.9])

 

              Total  Unemployment_rate
count  1.730000e+02         173.000000
mean   2.302566e+05           0.057355
std    4.220685e+05           0.019177
min    2.396000e+03           0.000000
10%    9.775600e+03           0.037053
50%    7.579100e+04           0.054719
90%    6.739758e+05           0.080062
max    3.123510e+06           0.156147

This example specifies percentiles for 10% and 90% rather than the default 25% and 75%. Note that 50% is inserted regardless of the argument.

If you want to see statistics calculated from nonnumeric columns, you can specify which data types are processed. You do this by using the include keyword. The value passed to this keyword should be a sequence of data types. In Pandas, strings are stored with the data type object, so the following includes columns with string data types (older code often spells this np.object, but that NumPy alias is deprecated in recent versions; the built-in object works everywhere):

college_majors.describe(include=[object])

You can also refer to a data type by its string name, which for object is 'object'. The following returns the same statistics:

college_majors.describe(include=['object'])

So, for strings, you get the count, the number of unique values, the top value, and the frequency of this top value:

 

              Major Major_category
count           173            173
unique          173             16
top     GEOSCIENCES    Engineering
freq              1             29

You can pass the string all instead of a list of data types to produce statistics for all the columns:

college_majors.describe(include='all')

 

              Major Major_category         Total  Unemployment_rate
count           173            173  1.730000e+02         173.000000
unique          173             16           NaN                NaN
top     GEOSCIENCES    Engineering           NaN                NaN
freq              1             29           NaN                NaN
mean            NaN            NaN  2.302566e+05           0.057355
std             NaN            NaN  4.220685e+05           0.019177
min             NaN            NaN  2.396000e+03           0.000000
25%             NaN            NaN  2.428000e+04           0.046261
50%             NaN            NaN  7.579100e+04           0.054719
75%             NaN            NaN  2.057630e+05           0.069043
max             NaN            NaN  3.123510e+06           0.156147

Note

Where a statistic is not appropriate for a data type, such as the standard deviation for a string, the not-a-number value NaN is inserted.

In case you want to exclude certain data types rather than specify which ones to include, Pandas supplies the exclude argument, which takes the same types of arguments as include:

college_majors.describe(exclude=['int'])

Accessing Data

Once you have taken an initial peek at a frame using head or tail and gotten a sense of the shape of the data using describe, you can start looking at the data and individual columns, rows, or cells.

This is the participants DataFrame from earlier in the chapter:

participants

 

    first      last  ages
a  shanda     smith    43
b   rolly   brocker    23
c   molly     stein    78
d   frank      bach    56
e     rip   spencer    26
f  steven  de wilde    14
g    gwen     mason    46
h  arthur     davis    92

Bracket Syntax

The simplest way to access columns or rows in a Pandas DataFrame is bracket syntax. This syntax is great for interactive sessions where you are exploring and playing with data; for production code, the loc and iloc indexers introduced later in this chapter are generally preferred.

To access a single column, you supply the column name as an argument in brackets, much as you would a dictionary key:

participants['first']
a    shanda
b     rolly
c     molly
d     frank
e       rip
f    steven
g      gwen
h    arthur
Name: first, dtype: object

You can see that this returns the data for the column along with its index labels, its name, and its data type. If a column name does not contain spaces or special characters, and if it is not the same as an existing attribute or method of the DataFrame, you can access the column as an attribute.

For example, here is how you access the ages column:

participants.ages
a    43
b    23
c    78
d    56
e    26
f    14
g    46
h    92
Name: ages, dtype: int64

This would not work with the columns first or last, as DataFrames already have first and last methods.

To access multiple columns, you specify the column label as a list:

participants[['last', 'first']]

 

       last   first
a     smith  shanda
b   brocker   rolly
c     stein   molly
d      bach   frank
e   spencer     rip
f  de wilde  steven
g     mason    gwen
h     davis  arthur

This returns a DataFrame with only the requested columns.

The bracket syntax is overloaded to allow you to grab rows as well as columns. To specify rows, you use a slice as an argument. If the slice uses integers, then those integers represent the row numbers to return. To return rows 3, 4, and 5 of the DataFrame participants, for example, you can use the slice 3:6:

participants[3:6]

 

    first      last  ages
d   frank      bach    56
e     rip   spencer    26
f  steven  de wilde    14

You can also slice using index labels. When you use labels to slice, the last value is included. So to get rows a, b, and c, you slice using a:c:

participants['a':'c']

 

    first     last  ages
a  shanda    smith    43
b   rolly  brocker    23
c   molly    stein    78

You can indicate which rows to return by using a list of Booleans. The list should have one Boolean per row: True for the desired rows, and False for the others. The following example returns the second, third, and sixth rows:

mask = [False, True, True, False, False, True, False, False]
participants[mask]

 

    first      last  ages
b   rolly   brocker    23
c   molly     stein    78
f  steven  de wilde    14

The bracket syntax provides a very convenient and easy-to-read way to access data. It is often used in interactive sessions when experimenting with and exploring DataFrames, but it is not optimized for performance with large data sets. The recommended way to index into DataFrames in production code or for large data sets is to use the DataFrame loc and iloc indexers. These indexers use a bracket syntax very similar to what you have seen here. The loc indexer indexes using labels, and iloc uses index positions.

Optimized Access by Label

With the loc indexer, you can supply a single label, and the values for that row will be returned. To get the values from the row labeled c, for example, you simply supply c as an argument:

participants.loc['c']
first    molly
last     stein
ages        78
Name: c, dtype: object

You can provide a slice of labels, and once again, the last label is included:

participants.loc['c':'f']

 

    first      last  ages
c   molly     stein    78
d   frank      bach    56
e     rip   spencer    26
f  steven  de wilde    14

Or you can provide a sequence of Booleans:

mask = [False, True, True, False, False, True, False, False]
participants.loc[mask]

 

    first      last  ages
b   rolly   brocker    23
c   molly     stein    78
f  steven  de wilde    14

An optional second argument can indicate which columns to return. If you want to return all the rows for the column first, for example, you specify all rows with a slice, a comma, and the column label:

participants.loc[:, 'first']
a    shanda
b     rolly
c     molly
d     frank
e       rip
f    steven
g      gwen
h    arthur
Name: first, dtype: object

You could provide a list of column labels:

participants.loc[:'c', ['ages', 'last']]

 

   ages     last
a    43    smith
b    23  brocker
c    78    stein

Or you could provide a list of Booleans:

participants.loc[:'c', [False, True, True]]

 

      last  ages
a    smith    43
b  brocker    23
c    stein    78

Optimized Access by Index

The iloc indexer enables you to use index positions to select rows and columns. Much as you’ve seen before with brackets, you can use a single value to specify a single row:

participants.iloc[3]
first    frank
last      bach
ages        56
Name: d, dtype: object

Or you can specify multiple rows by using a slice:

participants.iloc[1:4]

 

   first     last  ages
b  rolly  brocker    23
c  molly    stein    78
d  frank     bach    56

You can, optionally, indicate which columns to return by using a second slice:

participants.iloc[1:4, :2]

 

   first     last
b  rolly  brocker
c  molly    stein
d  frank     bach

Masking and Filtering

A powerful feature of DataFrames is the ability to select data based on values. You can use comparison operators with columns to see which values meet some condition. For example, if you want to see which rows of the college_majors DataFrame have the value Humanities & Liberal Arts as a major category, you can use the equality operator (==):

college_majors.Major_category == 'Humanities & Liberal Arts'
0      False
1      False
2      False
3      False
       ...
169    False
170    False
171     True
172     True
Name: Major_category, Length: 173, dtype: bool

This produces a pandas.Series object that contains True for every row that matches the condition. A series of Booleans is mildly interesting, but the real power comes when you combine it with an indexer to filter the results. Remember that loc returns rows for every True value of an input sequence. You can build a condition from a comparison operator and a column, as shown here for the greater-than operator and the column Total:

total_mask = college_majors.loc[:, 'Total'] > 1200000

You can use the result as a mask to select only the rows that meet this condition:

top_majors = college_majors.loc[total_mask]
top_majors

 

                                       Major            Major_category    Total  Unemployment_rate
25                         GENERAL EDUCATION                 Education  1438867           0.043904
28                      ELEMENTARY EDUCATION                 Education  1446701           0.038359
114                               PSYCHOLOGY  Psychology & Social Work  1484075           0.069667
153                                  NURSING                    Health  1769892           0.026797
158                         GENERAL BUSINESS                  Business  2148712           0.051378
159                               ACCOUNTING                  Business  1779219           0.053415
161  BUSINESS MANAGEMENT AND ADMINISTRATION                   Business  3123510           0.058865

You can use the min() method to check whether the resulting DataFrame meets the condition:

top_majors.Total.min()
1438867

Now say that you want to see which major categories have the lowest unemployment rates. You can use describe on a single column as well as on a full DataFrame. If you use describe on the column Unemployment_rate, for example, you can see that the cutoff for the bottom quartile (the 25% row) is 0.046261:

college_majors.Unemployment_rate.describe()
count    173.000000
mean       0.057355
std        0.019177
min        0.000000
25%        0.046261
50%        0.054719
75%        0.069043
max        0.156147
Name: Unemployment_rate, dtype: float64

You can create a mask for all rows with an unemployment rate less than or equal to this:

employ_rate_mask = college_majors.loc[:, 'Unemployment_rate'] <= 0.046261

And you can use this mask to produce a DataFrame with only these rows:

employ_rate_majors = college_majors.loc[employ_rate_mask]

Then you can use the pandas.Series object’s unique method to see which major categories are in the resulting DataFrame:

employ_rate_majors.Major_category.unique()
array(['Agriculture & Natural Resources', 'Education', 'Engineering',
       'Biology & Life Science', 'Computers & Mathematics',
       'Humanities & Liberal Arts', 'Physical Sciences', 'Health',
       'Business'], dtype=object)

All these categories have at least one row with an unemployment rate that meets the condition.

Pandas Boolean Operators

You can use the three Boolean operators AND (&), OR (|), and NOT (~) with the results of your conditions. You can use & or | to combine conditions and create more complex ones. You can use ~ to create a mask that is the opposite of your condition.
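When you write the comparisons inline rather than saving them in mask variables first, wrap each comparison in parentheses: & and | bind more tightly than comparison operators such as > and <=. A sketch on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Total': [1500000, 50000, 2000000],
                   'Unemployment_rate': [0.03, 0.02, 0.09]})

# Without the parentheses around each comparison, this raises an error.
selected = df.loc[(df['Total'] > 1200000) & (df['Unemployment_rate'] <= 0.05)]
print(selected)   # only the first row satisfies both conditions
```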

For example, you can use AND to create a new mask based on the previous ones to see which major categories of the most popular majors have a low unemployment rate. To do this, you use the & operator between your existing masks to produce a new one:

total_rate_mask = employ_rate_mask & total_mask
total_rate_mask
0      False
1      False
2      False
3      False
4      False
      ...
168    False
169    False
170    False
171    False
172    False
Length: 173, dtype: bool

By looking at the resulting DataFrame, you can see which of the most popular majors have the lowest unemployment rates:

college_majors.loc[total_rate_mask]

 

                    Major Major_category    Total  Unemployment_rate
25      GENERAL EDUCATION      Education  1438867           0.043904
28   ELEMENTARY EDUCATION      Education  1446701           0.038359
153               NURSING         Health  1769892           0.026797

You can use the ~ operator with your unemployment rate mask to create a DataFrame whose rows all have an unemployment rate above the bottom quartile:

lower_rate_mask = ~employ_rate_mask
lower_rate_majors = college_majors.loc[lower_rate_mask]

You can check this work by using the min method on the Unemployment_rate column to see that its smallest value is above the bottom-quartile cutoff:

lower_rate_majors.Unemployment_rate.min()
0.046261360999999994

To select all the rows that fit either the top-majors condition or the unemployment rate condition, you can use the | operator:

college_majors.loc[total_mask | employ_rate_mask]

The resulting DataFrame contains all the rows that fit either condition.

Manipulating DataFrames

Once you have the data you need in a DataFrame, you might want to change the DataFrame. You can rename columns or indexes, you can add new columns and rows, and you can delete columns and rows.
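Deleting uses the DataFrame drop method, and appending rows is commonly done with pd.concat. Here is a brief sketch on a throwaway DataFrame (like rename, drop returns a new DataFrame rather than modifying the original):

```python
import pandas as pd

df = pd.DataFrame({'first': ['shanda', 'rolly'], 'ages': [43, 23]},
                  index=['a', 'b'])

without_b = df.drop('b')                  # drop a row by index label
without_ages = df.drop(columns=['ages'])  # drop a column

# Append a row by concatenating a one-row DataFrame.
new_row = pd.DataFrame({'first': ['gwen'], 'ages': [46]}, index=['c'])
extended = pd.concat([df, new_row])
print(extended)
```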

Changing the label of a column is simple with the DataFrame rename method. First, use the DataFrame columns attribute to look at the current column names:

participants.columns
Index(['first', 'last', 'ages'], dtype='object')

You can then rename the columns of your choice by providing a dictionary mapping each old column name to the new one. For example, here is how you change the label of the column ages to Age:

participants.rename(columns={'ages': 'Age'})

 

    first      last  Age
a  shanda     smith   43
b   rolly   brocker   23
c   molly     stein   78
d   frank      bach   56
e     rip   spencer   26
f  steven  de wilde   14
g    gwen     mason   46
h  arthur     davis   92

By default, the rename method returns a new DataFrame using the new column labels. So, if you check your original DataFrame's column names again, you see the old column name:

participants.columns
Index(['first', 'last', 'ages'], dtype='object')

This is how many DataFrame methods work (preserving the original state). Many of these methods offer an optional inplace argument, which, if set to True, changes the original DataFrame:

participants.rename(columns={'ages':'Age'}, inplace=True)
participants.columns

Index(['first', 'last', 'Age'], dtype='object')

You can use the indexer syntax to create new columns. To do so, you simply assign a list of values to a column name that does not yet exist:

participants['Zip Code'] = [94702, 97402, 94223, 94705,
                            97503, 94705, 94111, 95333]
participants

 

    first      last  Age  Zip Code
a  shanda     smith   43     94702
b   rolly   brocker   23     97402
c   molly     stein   78     94223
d   frank      bach   56     94705
e     rip   spencer   26     97503
f  steven  de wilde   14     94705
g    gwen     mason   46     94111
h  arthur     davis   92     95333

You can use operations between columns such as string addition to create values for a new column. If you decide you want to add a column with participants’ full names, you can construct the values from the existing columns for their first and last names:

participants['Full Name'] = ( participants.loc[:, 'first'] +
                              participants.loc[:, 'last'] )

participants

 

    first      last  Age  Zip Code       Full Name
a  shanda     smith   43     94702     shandasmith
b   rolly   brocker   23     97402    rollybrocker
c   molly     stein   78     94223      mollystein
d   frank      bach   56     94705       frankbach
e     rip   spencer   26     97503      ripspencer
f  steven  de wilde   14     94705  stevende wilde
g    gwen     mason   46     94111       gwenmason
h  arthur     davis   92     95333     arthurdavis

You can update a column by using the same syntax. For example, if you decide that the values in the full name column should have a white space between the names, you can just assign new values by using the same column name:

participants['Full Name'] = ( participants.loc[:, 'first'] +
                              ' ' +
                              participants.loc[:, 'last'] )
participants

 

    first      last  Age  Zip Code        Full Name
a  shanda     smith   43     94702     shanda smith
b   rolly   brocker   23     97402    rolly brocker
c   molly     stein   78     94223      molly stein
d   frank      bach   56     94705       frank bach
e     rip   spencer   26     97503      rip spencer
f  steven  de wilde   14     94705  steven de wilde
g    gwen     mason   46     94111       gwen mason
h  arthur     davis   92     95333     arthur davis

Manipulating Data

Pandas gives you many ways to change data in a DataFrame. You can set values by using the same indexers you used before. You can do operations on whole DataFrames or on individual columns. And you can apply functions to change elements in a column or create new values from multiple rows or columns.

To change data using an indexer, you select the location where you want the new data to reside in the same way you select data to view, and then you assign a new value. To change arthur in row h to Paul, for example, you can use loc:

participants.loc['h', 'first'] = 'Paul'
participants

 

    first      last  Age  Zip Code        Full Name
a  shanda     smith   43     94702     shanda smith
b   rolly   brocker   23     97402    rolly brocker
c   molly     stein   78     94223      molly stein
d   frank      bach   56     94705       frank bach
e     rip   spencer   26     97503      rip spencer
f  steven  de wilde   14     94705  steven de wilde
g    gwen     mason   46     94111       gwen mason
h    Paul     davis   92     95333     arthur davis

Alternatively, you can use iloc with positions. The following sets the value at row position 3 (the row labeled d) and column position 2 (the Age column), changing frank's age to 99:

participants.iloc[3, 2] = 99
participants

 

    first      last  Age  Zip Code        Full Name
a  shanda     smith   43     94702     shanda smith
b   rolly   brocker   23     97402    rolly brocker
c   molly     stein   78     94223      molly stein
d   frank      bach   99     94705       frank bach
e     rip   spencer   26     97503      rip spencer
f  steven  de wilde   14     94705  steven de wilde
g    gwen     mason   46     94111       gwen mason
h    Paul     davis   92     95333     arthur davis

This should seem fairly intuitive if you think of it as a variation on the indexed assignment you have used with lists and dictionaries.

Earlier in this chapter, you used operations between columns to construct values for a new column. You can also use in-place operators such as +=, -=, and /= to change values in a column. To subtract 1 from the age of each participant, for example, you can use the -= operator:

participants.Age -= 1
participants

 

    first      last  Age  Zip Code        Full Name
a  shanda     smith   42     94702     shanda smith
b   rolly   brocker   22     97402    rolly brocker
c   molly     stein   77     94223      molly stein
d   frank      bach   98     94705       frank bach
e     rip   spencer   25     97503      rip spencer
f  steven  de wilde   13     94705  steven de wilde
g    gwen     mason   45     94111       gwen mason
h    Paul     davis   91     95333     arthur davis

The replace Method

The replace method finds and replaces values across a DataFrame. For example, you can use it to replace the name rolly with Smiley:

participants.replace('rolly', 'Smiley')

 

    first      last  Age  Zip Code        Full Name
a  shanda     smith   42     94702     shanda smith
b  Smiley   brocker   22     97402    rolly brocker
c   molly     stein   77     94223      molly stein
d   frank      bach   98     94705       frank bach
e     rip   spencer   25     97503      rip spencer
f  steven  de wilde   13     94705  steven de wilde
g    gwen     mason   45     94111       gwen mason
h    Paul     davis   91     95333     arthur davis

This method also works with regular expressions. Here is how you construct a regular expression that matches words starting with s (anchored with the word boundary \b) and replaces the lowercase s with S:

participants.replace(r'\b(s)([a-z]+)', r'S\2', regex=True)

 

    first      last  Age  Zip Code        Full Name
a  Shanda     Smith   42     94702     Shanda Smith
b   rolly   brocker   22     97402    rolly brocker
c   molly     Stein   77     94223      molly Stein
d   frank      bach   98     94705       frank bach
e     rip   Spencer   25     97503      rip Spencer
f  Steven  de wilde   13     94705  Steven de wilde
g    gwen     mason   45     94111       gwen mason
h    Paul     davis   91     95333     arthur davis

Both DataFrames and the pandas.Series object have an apply() method that can call a function on values. In the case of a pandas.Series object, the apply() method calls a function of your choosing on every value in the pandas.Series object individually.

Say that you define a function that capitalizes any string passed to it:

def cap_word(w):
    return w.capitalize()

Then, if you pass it as an argument to apply() on the column first, it capitalizes each first name:

participants.loc[:, 'first'].apply(cap_word)
a    Shanda
b     Rolly
c     Molly
d     Frank
e       Rip
f    Steven
g      Gwen
h      Paul
Name: first, dtype: object

In the case of a DataFrame, apply takes a row as an argument, enabling you to produce new values from the columns of that row. Say that you define a function that uses values from the columns first and Age:

def say_hello(row):
    return f'{row["first"]} is {row["Age"]} years old.'

You can then apply the function to the whole DataFrame:

participants.apply(say_hello, axis=1)
a    shanda is 42 years old.
b     rolly is 22 years old.
c     molly is 77 years old.
d     frank is 98 years old.
e       rip is 25 years old.
f    steven is 13 years old.
g      gwen is 45 years old.
h      Paul is 91 years old.
dtype: object

You can use this method to call a function across rows or across columns. You use the axis argument to indicate whether your function should expect a row or a column.
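With axis=0 (the default), apply passes each column to the function instead. For example, this sketch computes the spread (max minus min) of each numeric column of a small stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Age': [43, 23, 78],
                   'Zip Code': [94702, 97402, 94223]})

# axis=0: the function receives one column (a Series) at a time.
spans = df.apply(lambda col: col.max() - col.min(), axis=0)
print(spans)
```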

Interactive Display

If you are working with DataFrames in Colab, you should try running this snippet:

%load_ext google.colab.data_table

This makes the output of your DataFrames interactive and enables you to filter and select interactively.

Summary

A Pandas DataFrame is a powerful tool for working with data in a spreadsheet-like environment. You can create DataFrames from many sources, but creating a DataFrame from a file is the most common. You can extend DataFrames with new columns and rows. You can access the data itself by using powerful indexers, which you can also use to set data. DataFrames provide a great way to explore and manipulate data.

Questions

Use this table to answer the following questions:

Sample Size (mg)   %P   %Q
            0.24   40   60
            2.34   34   66
          0.0234   12   88

1.   Create a DataFrame representing this table.

2.   Add a new column labeled Total Q that contains the amount of Q (in mg) for each sample.

3.   Divide the columns %P and %Q by 100.
