Chapter 1 introduced the Pandas DataFrame
and Series
objects. These data structures resemble the primitive Python data containers (lists and dictionaries) for indexing and labeling, but have additional features that make working with data easier.
The concept map for this chapter can be found in Figure A.2.
Use functions to create and load manual data
Describe the Series
object
Describe the DataFrame
object
Identify basic operations on Series
objects
Identify basic operations on DataFrame
objects
Perform conditional subsetting, fancy slicing, and indexing
Use methods to save data
Whether you are manually inputting data or creating a small test example, knowing how to create DataFrames
without loading data from a file is a useful skill. It is especially helpful when you are asking a question about a StackOverflow error.
The Pandas Series
is a one-dimensional container (i.e., Python Iterable
), similar to the built-in Python list
. It is the data type that represents each column of the DataFrame
. Table 1.1 lists the possible dtypes
for Pandas DataFrame
columns. Each value in a DataFrame
column must be stored as the same dtype
. For example, if a column contains the number 1
and the sequence of letters (i.e., string) "pizza"
, the entire dtype
of the column will be a string (Pandas will call this an object dtype
).
Since a DataFrame
can be thought of as a dictionary of Series
objects, where each key
is the column name and the value
is the Series
, we can conclude that a Series
is very similar to a Python list
, except that each element must be the same dtype
. Those who have used the numpy
library will realize this is the same behavior as demonstrated by the ndarray
.
The easiest way to create a Series
is to pass in a Python list
. If we pass in a list of mixed types, the most common representation of both will be used. Typically the dtype
will be object
.
import pandas as pd
s = pd.Series(['banana', 42])
print(s)
0 banana
1 42
dtype: object
Notice that the “row number” is shown on the left of the Series
. This is actually the index
for the series. It is similar to the row name and row index we saw in Section 1.3.2 for DataFrames
. It implies that we can actually assign a “name” to values in our series.
# manually assign index values to a series
# by passing a Python list
s = pd.Series(
data=["Wes McKinney", "Creator of Pandas"],
index=["Person", "Who"],
)
print(s)
Person Wes McKinney
Who Creator of Pandas
dtype: object
As mentioned in Chapter 1, a DataFrame
can be thought of as a dictionary of Series
objects. This is why dictionaries are the most common way of creating a DataFrame
. The key
represents the column name, and the values
are the contents of the column.
scientists = pd.DataFrame(
{
"Name": ["Rosaline Franklin", "William Gosset"],
"Occupation": ["Chemist", "Statistician"],
"Born": ["1920-07-25", "1876-06-13"],
"Died": ["1958-04-16", "1937-10-16"],
"Age": [37, 61],
}
)
print(scientists)
Name Occupation Born Died Age
0 Rosaline Franklin Chemist 1920-07-25 1958-04-16 37
1 William Gosset Statistician 1876-06-13 1937-10-16 61
If we look at the documentation for DataFrame
1, we see that we can use the columns
parameter or specify the column order. If we want to use the name
column for the row index, we can use the index
parameter.
1. DataFrame
documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
scientists = pd.DataFrame(
data={
"Occupation": ["Chemist", "Statistician"],
"Born": ["1920-07-25", "1876-06-13"],
"Died": ["1958-04-16", "1937-10-16"],
"Age": [37, 61],
},
index=["Rosaline Franklin", "William Gosset"],
columns=["Occupation", "Born", "Died", "Age"],
)
print(scientists)
Occupation Born Died Age
Rosaline Franklin Chemist 1920-07-25 1958-04-16 37
William Gosset Statistician 1876-06-13 1937-10-16 61
In Section 1.3.2.1, we saw how the slicing method affects the type
of the result. If we use .loc[]
to subset the first row of our scientists DataFrame
, we will get a Series
object back.
First, let’s re-create our example DataFrame
.
# create our example dataframe
# with a row index label
scientists = pd.DataFrame(
data={
"Occupation": ["Chemist", "Statistician"],
"Born": ["1920-07-25", "1876-06-13"],
"Died": ["1958-04-16", "1937-10-16"],
"Age": [37, 61],
},
index=["Rosaline Franklin", "William Gosset"],
columns=["Occupation", "Born", "Died", "Age"],
)
print(scientists)
Occupation Born Died Age
Rosaline Franklin Chemist 1920-07-25 1958-04-16 37
William Gosset Statistician 1876-06-13 1937-10-16 61
Select a scientist by the row index label.
# select by row index label
first_row = scientists.loc['William Gosset']
print(type(first_row))
<class 'pandas.core.series.Series'>
print(first_row)
Occupation Statistician
Born 1876-06-13
Died 1937-10-16
Age 61
Name: William Gosset, dtype: object
When a series is printed (i.e., the string representation), the index is printed as the first “column,” and the values are printed as the second “column.” There are many attributes and methods associated with a Series
object.2 Two examples of attributes are .index
and .values
.
2. Series
documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html
print(first_row.index)
Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')
print(first_row.values)
['Statistician' '1876-06-13' '1937-10-16' 61]
Table 2.1 Some of the Attributes Within a Series
Series attributes | Description |
---|---|
.loc | Subset using index value |
.iloc | Subset using index position |
.dtype or dtypes | The type of the |
.T | Transpose of the series |
.shape | Dimensions of the data |
.size | Number of elements in the |
.values |
|
An example of a Series
method is .keys()
, which is an alias for the .index
attribute.
print(first_row.keys())
Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')
By now, you might have questions about the syntax for .index
, .values
, and .keys()
. More information about attributes and methods is found in Appendix P on classes. Attributes can be thought of as features of an object (in this example, our object is a Series
). Methods can be thought of as some calculation or operation that is performed on an object. The subsetting syntax for .loc[]
and .iloc[]
(from Section 1.3.2) consists of all attributes. This is why the syntax does not rely on a set of round parentheses, ( )
, but rather a set of square brackets, [ ]
, for subsetting. Since .keys()
is a method, if we wanted to get the first key (which is also the first index), we would use the square brackets after the method call. Some attributes for the series are listed in Table 2.1.
# get the first index using an attribute
print(first_row.index[0])
Occupation
# get the first index using a method
print(first_row.keys()[0])
Occupation
The Pandas data structure known as Series
is very similar to the numpy.ndarray
(Appendix O). In turn, many methods and functions that operate on a ndarray
will also operate on a Series
. A Series
may sometimes be referred to as a “vector.”
Let’s first get a series of the Age
column from our scientists
dataframe.
# get the 'Age' column
ages = scientists['Age']
print(ages)
Rosaline Franklin 37
William Gosset 61
Name: Age, dtype: int64
NumPy is a scientific computing library that typically deals with numeric vectors. Since a Series
can be thought of as an extension to the numpy.ndarray
, there is an overlap of attributes and methods. When we have a vector of numbers, there are common calculations we can perform.3
3. Descriptive statistics: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics
# calculate the mean
print(ages.mean())
49.0
# calculate the minimum
print(ages.min())
37
# calculate the maximum
print(ages.max())
61
# calculate the standard deviation
print(ages.std())
16.97056274847714
The .mean()
, .min()
, .max()
, and .std()
are also methods in the numpy.ndarray
.4 Some Series
methods are listed in Table 2.2.
4. NumPy ndarrary
documentation: https://numpy.org/doc/stable/reference/arrays.ndarray.html
Chapter 1 showed how we can use specific indices to subset our data. Only rarely, however, will we know the exact row or column index to subset the data. Typically you are looking for values that meet (or don’t meet) a particular calculation or observation.
To explore this process, let’s use a larger data set.
scientists = pd.read_csv('data/scientists.csv')
We just saw how we can calculate basic descriptive metrics of vectors. The .describe()
method will calculate multiple descriptive statistics in a single method call.
ages = scientists['Age']
print(ages)
Table 2.2 Some of the Methods That Can Be Performed on a Series
Series Methods | Description |
---|---|
| Concatenates two or more |
| Calculate a correlation with another |
| Calculate a covariance with another |
| Calculate summary statistics7 |
| Returns a |
| Determines whether a |
| Get values of the |
| Draw a histogram |
| Checks whether values are contained in a |
| Returns the minimum value |
| Returns the maximum value |
| Returns the arithmetic mean |
| Returns the median |
| Returns the mode(s) |
| Returns the value at a given quantile |
| Replaces values in the |
| Returns a random sample of values from the |
| Sorts values |
| Converts a |
| Returns the transpose |
| Returns a |
5. Missing values will be automatically dropped.
6. Missing values will be automatically dropped.
7. Missing values will be automatically dropped.
0 37
1 61
2 90
3 66
4 56
5 45
6 41
7 77
Name: Age, dtype: int64
# get basic stats
print(ages.describe())
count 8.000000
mean 59.125000
std 18.325918
min 37.000000
25% 44.000000
50% 58.500000
75% 68.750000
max 90.000000
Name: Age, dtype: float64
# mean of all ages
print(ages.mean())
59.125
What if we wanted to subset our ages by identifying those above the mean?
print(ages[ages > ages.mean()])
1 61
2 90
3 66
7 77
Name: Age, dtype: int64
Let’s tease out this statement and look at what ages > ages.mean()
returns.
print(ages > ages.mean())
0 False
1 True
2 True
3 True
4 False
5 False
6 False
7 True
Name: Age, dtype: bool
print(type(ages > ages.mean()))
<class 'pandas.core.series.Series'>
This statement returns a Series
with a .dtype
of bool
. In other words, we can not only subset values using labels and indices, but also supply a vector of boolean values. Python has many functions and methods. Depending on how they are implemented, they may return labels, indices, or booleans. Keep this point in mind as you learn new methods and seek to piece together various parts for your work.
If we liked, we could manually supply a vector of bool
s to subset our data.
# get index 0, 1, 4, 5, and 7
manual_bool_values = [
True, # 0
True, # 1
False, # 2
False, # 3
True, # 4
True, # 5
False, # 6
True, # 7
]
print(ages[manual_bool_values])
0 37
1 61
4 56
5 45
7 77
Name: Age, dtype: int64
If you’re familiar with programming, you would find it strange that ages > ages.mean()
returns a vector without any for
loops (Appendix J). Many of the methods that work on Series
(and also DataFrames
) are “vectorized,” meaning that they work on the entire vector simultaneously. This approach makes the code easier to read, and typically, optimizations are available to make calculations faster.
If you perform an operation between two vectors of the same length, the resulting vector will be an element-by-element calculation of the vectors.
print(ages + ages)
0 74
1 122
2 180
3 132
4 112
5 90
6 82
7 154
Name: Age, dtype: int64
print(ages * ages)
0 1369
1 3721
2 8100
3 4356
4 3136
5 2025
6 1681
7 5929
Name: Age, dtype: int64
When you perform an operation on a vector using a scalar, the scalar will be recycled across all the elements in the vector.
print(ages + 100)
0 137
1 161
2 190
3 166
4 156
5 145
6 141
7 177
Name: Age, dtype: int64
print(ages * 2)
0 74
1 122
2 180
3 132
4 112
5 90
6 82
7 154
Name: Age, dtype: int64
When you are working with vectors of different lengths, the behavior will depend on the type()
of the vectors. With a Series
, the vectors will perform an operation matched by the index. The rest of the resulting vector will be filled with a “missing” value, denoted with NaN
, signifying “not a number” (Chapter 9).
This type of behavior, which is called broadcasting, differs between languages. Broadcasting in Pandas refers to how operations are calculated between arrays with different shapes.
print(ages + pd.Series([1, 100]))
0 38.0
1 161.0
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
dtype: float64
With other types()
, the shapes must match.
import numpy as np
# this will cause an error
print(ages + np.array([1, 100]))
ValueError: operands could not be broadcast together with shapes (8,) (2,)
What’s convenient in Pandas is how data alignment is almost always automatic. If possible, things will always align themselves with the index label when actions are performed.
# ages as they appear in the data
print(ages)
0 37
1 61
2 90
3 66
4 56
5 45
6 41
7 77
Name: Age, dtype: int64
rev_ages = ages.sort_index(ascending=False)
print(rev_ages)
7 77
6 41
5 45
4 56
3 66
2 90
1 61
0 37
Name: Age, dtype: int64
If we perform an operation using ages
and rev_ages
, it will still be conducted on an element-by-element basis, but the vectors will be aligned first before the operation is carried out.
# reference output to show index label alignment
print(ages * 2)
0 74
1 122
2 180
3 132
4 112
5 90
6 82
7 154
Name: Age, dtype: int64
# note how we get the same values
# even though the vector is reversed
print(ages + rev_ages)
0 74
1 122
2 180
3 132
4 112
5 90
6 82
7 154
Name: Age, dtype: int64
The DataFrame
is the most common Pandas object. It can be thought of as Python’s way of storing spreadsheet-like data. Many of the features of the Series
data structure carry over into the DataFrame
.
There are 3 main parts to a Pandas DataFrame
object the .index
, .columns
, and .values
. These refer to the row name, column names, and data values, respectively.
scientists.index
RangeIndex(start=0, stop=8, step=1)
scientists.columns
Index(['Name', 'Born', 'Died', 'Age', 'Occupation'], dtype='object')
scientists.values
array([['Rosaline Franklin', '1920-07-25', '1958-04-16', 37, 'Chemist'],
['William Gosset', '1876-06-13', '1937-10-16', 61, 'Statistician'],
['Florence Nightingale', '1820-05-12', '1910-08-13', 90, 'Nurse'],
['Marie Curie', '1867-11-07', '1934-07-04', 66, 'Chemist'],
['Rachel Carson', '1907-05-27', '1964-04-14', 56, 'Biologist'],
['John Snow', '1813-03-15', '1858-06-16', 45, 'Physician'],
['Alan Turing', '1912-06-23', '1954-06-07', 41,
'Computer Scientist'],
['Johann Gauss', '1777-04-30', '1855-02-23', 77, 'Mathematician']],
dtype=object)
The .values
comes in handy when you don’t want all the row index label information, and really just want the base numpy
representation of the data.
Just as we were able to subset a Series
with a boolean vector, so can we subset a DataFrame
with a bool
.
# boolean vectors will subset rows
print(scientists.loc[scientists['Age'] > scientists['Age'].mean()])
Name Born Died Age Occupation
1 William Gosset 1876-06-13 1937-10-16 61 Statistician
2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse
3 Marie Curie 1867-11-07 1934-07-04 66 Chemist
7 Johann Gauss 1777-04-30 1855-02-23 77 Mathematician
Table 2.3 summarizes the various subsetting methods.
Table 2.3 Table of DataFrame Subsetting Methods
Syntax | Selection Result |
---|---|
|
|
|
|
| Row by row index label (row name) |
| Multiple rows by index label |
| Row by row number |
| Multiple rows by row number |
| Row based on |
| Multiple rows based on |
| Rows based on slicing notation |
Pandas supports broadcasting because the Series
and DataFrame
objects are built on top of the numpy
library.8 Broadcasting describes what happens when performing operations between array-like objects. These behaviors depend on the type of object, its length, and any labels associated with the object.
8. NumPy
Library: http://www.numpy.org/
First, let’s create a subset of our dataframes.
first_half = scientists[:4]
second_half = scientists[4:]
print(first_half)
Name Born Died Age Occupation
0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist
1 William Gosset 1876-06-13 1937-10-16 61 Statistician
2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse
3 Marie Curie 1867-11-07 1934-07-04 66 Chemist
print(second_half)
Name Born Died Age Occupation
4 Rachel Carson 1907-05-27 1964-04-14 56 Biologist
5 John Snow 1813-03-15 1858-06-16 45 Physician
6 Alan Turing 1912-06-23 1954-06-07 41 Computer Scientist
7 Johann Gauss 1777-04-30 1855-02-23 77 Mathematician
When we perform an action on a dataframe with a scalar, it will try to apply the operation on each cell of the dataframe. In this example, numbers will be multiplied by 2, and strings will be doubled (this is Python’s normal behavior with strings).
# multiply by a scalar
print(scientists * 2)
Name Born
0 Rosaline FranklinRosaline Franklin 1920-07-251920-07-25
1 William GossetWilliam Gosset 1876-06-131876-06-13
2 Florence NightingaleFlorence Nightingale 1820-05-121820-05-12
3 Marie CurieMarie Curie 1867-11-071867-11-07
4 Rachel CarsonRachel Carson 1907-05-271907-05-27
5 John SnowJohn Snow 1813-03-151813-03-15
6 Alan TuringAlan Turing 1912-06-231912-06-23
7 Johann GaussJohann Gauss 1777-04-301777-04-30
Died Age Occupation
0 1958-04-161958-04-16 74 ChemistChemist
1 1937-10-161937-10-16 122 StatisticianStatistician
2 1910-08-131910-08-13 180 NurseNurse
3 1934-07-041934-07-04 132 ChemistChemist
4 1964-04-141964-04-14 112 BiologistBiologist
5 1858-06-161858-06-16 90 PhysicianPhysician
6 1954-06-071954-06-07 82 Computer ScientistComputer Scientist
7 1855-02-231855-02-23 154 MathematicianMathematician
If your dataframes are all numeric values and you want to “add” the values on a cell-by-cell basis, you can use the .add()
method. The automatic alignment can be better seen in Chapter 6, when we concatenate dataframes together.
Now that we know various ways of subsetting and slicing our data (see Table 2.3), we should be able to alter our data objects.
The type
of the Born
and Died
columns is object
, meaning they are strings or a sequence of characters.
print(scientists.dtypes)
Name object
Born object
Died object
Age int64
Occupation object
dtype: object
We can convert the strings to a proper datetime
type so we can perform common date and time operations (e.g., take differences between dates or calculate a person’s age). You can provide your own format
if you have a date that has a specific format. A list of format
variables can be found in the Python datetime
module documentation.9 More examples with datetimes can be found in Chapter 12. The format of our date looks like “YYYY-MM-DD,” so we can use the %Y-%m-%d
format.
9. datetime
module documentation: https://docs.python.org/3.10/library/datetime.html#strftime-and-strptime-behavior
# format the 'Born' column as a datetime
born_datetime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
print(born_datetime)
0 1920-07-25
1 1876-06-13
2 1820-05-12
3 1867-11-07
4 1907-05-27
5 1813-03-15
6 1912-06-23
7 1777-04-30
Name: Born, dtype: datetime64[ns]
# format the 'Died' column as a datetime
died_datetime = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')
If we wanted, we could create a new set of columns that contain the datetime
representations of the object
(string) dates. The below example uses python’s multiple assignment syntax (Appendix N).
scientists['born_dt'], scientists['died_dt'] = (
born_datetime,
died_datetime
)
print(scientists.head())
Name Born Died Age Occupation
0 Rosaline Franklin 1920-07-25 1958-04-16 37 Chemist
1 William Gosset 1876-06-13 1937-10-16 61 Statistician
2 Florence Nightingale 1820-05-12 1910-08-13 90 Nurse
3 Marie Curie 1867-11-07 1934-07-04 66 Chemist
4 Rachel Carson 1907-05-27 1964-04-14 56 Biologist
born_dt died_dt
0 1920-07-25 1958-04-16
1 1876-06-13 1937-10-16
2 1820-05-12 1910-08-13
3 1867-11-07 1934-07-04
4 1907-05-27 1964-04-14
print(scientists.shape)
(8, 7)
print(scientists.dtypes)
Name object
Born object
Died object
Age int64
Occupation object
born_dt datetime64[ns]
died_dt datetime64[ns]
dtype: object
We can also assign a new value directly to the existing column. The example in this section shows how to randomize the contents of a column. More complex calculations that involve multiple columns can be seen in Chapter 5, in the discussion of the .apply()
method.
First, let’s look at the original Age
values.
print(scientists['Age'])
0 37
1 61
2 90
3 66
4 56
5 45
6 41
7 77
Name: Age, dtype: int64
Now let’s shuffle the values.
# the frac=1 tells pandas to randomly select 100% of the values
# the random_state makes the randomization the same each time
scientists["Age"] = scientists["Age"].sample(frac=1, random_state=42)
For long bits of code we can wrap the code around round parentheses ( )
to break up the code into multiple lines. We will be using this convention for longer bits of code in this book (Appendix D.1).
# the previous line of code is equivalent to
scientists['Age'] = (
scientists['Age']
.sample(frac=1, random_state=42)
)
print(scientists['Age'])
0 37
1 61
2 90
3 66
4 56
5 45
6 41
7 77
Name: Age, dtype: int64
If you notice, that we tried to randomly shuffle the column, but when we assigned the values back into the dataframe, it reverted back to the original order. That’s because Pandas will try to automatically join on the .index
values on many operations, for this example to get around this problem we need to remove that .index
information. One way of doing that, is to assign just the .values
of the shuffled values that does not have any .index
value associated with it.
scientists['Age'] = (
scientists['Age']
.sample(frac=1, random_state=42)
.values # remove the index so it doesn't auto align the values
)
print(scientists['Age'])
0 61
1 45
2 37
3 77
4 90
5 56
6 66
7 41
Name: Age, dtype: int64
We can recalculate the “real” age using datetime
arithmetic. More information about datetime
can be found in Chapter 12.
# subtracting dates will give us number of days
scientists['age_days'] = (
scientists['died_dt'] - scientists['born_dt']
)
print(scientists)
Name Born Died Age
0 Rosaline Franklin 1920-07-25 1958-04-16 61
1 William Gosset 1876-06-13 1937-10-16 45
2 Florence Nightingale 1820-05-12 1910-08-13 37
3 Marie Curie 1867-11-07 1934-07-04 77
4 Rachel Carson 1907-05-27 1964-04-14 90
5 John Snow 1813-03-15 1858-06-16 56
6 Alan Turing 1912-06-23 1954-06-07 66
7 Johann Gauss 1777-04-30 1855-02-23 41
Occupation born_dt died_dt age_days
0 Chemist 1920-07-25 1958-04-16 13779 days
1 Statistician 1876-06-13 1937-10-16 22404 days
2 Nurse 1820-05-12 1910-08-13 32964 days
3 Chemist 1867-11-07 1934-07-04 24345 days
4 Biologist 1907-05-27 1964-04-14 20777 days
5 Physician 1813-03-15 1858-06-16 16529 days
6 Computer Scientist 1912-06-23 1954-06-07 15324 days
7 Mathematician 1777-04-30 1855-02-23 28422 days
# we can convert the value to just the year
# using the astype method
scientists['age_years'] = (
scientists['age_days']
.astype('timedelta64[Y]')
)
print(scientists)
Name Born Died Age
0 Rosaline Franklin 1920-07-25 1958-04-16 61
1 William Gosset 1876-06-13 1937-10-16 45
2 Florence Nightingale 1820-05-12 1910-08-13 37
3 Marie Curie 1867-11-07 1934-07-04 77
4 Rachel Carson 1907-05-27 1964-04-14 90
5 John Snow 1813-03-15 1858-06-16 56
6 Alan Turing 1912-06-23 1954-06-07 66
7 Johann Gauss 1777-04-30 1855-02-23 41
Occupation born_dt died_dt age_days age_years
0 Chemist 1920-07-25 1958-04-16 13779 days 37.0
1 Statistician 1876-06-13 1937-10-16 22404 days 61.0
2 Nurse 1820-05-12 1910-08-13 32964 days 90.0
3 Chemist 1867-11-07 1934-07-04 24345 days 66.0
4 Biologist 1907-05-27 1964-04-14 20777 days 56.0
5 Physician 1813-03-15 1858-06-16 16529 days 45.0
6 Computer Scientist 1912-06-23 1954-06-07 15324 days 41.0
7 Mathematician 1777-04-30 1855-02-23 28422 days 77.0
.assign()
Another way you can assign and modify columns is with the .assign()
method. This has the benefit of using method chaining (Appendix R). Let’s redo the age_years
column creation, but this time using '.assign()
.
scientists = scientists.assign(
# new columns on the left of the equal sign
# how to calculate values on the right of the equal sign
# separate new columns with a comma
age_days_assign=scientists['died_dt'] - scientists['born_dt'],
age_year_assign=scientists['age_days'].astype('timedelta64[Y]')
)
print(scientists)
Name Born Died Age
0 Rosaline Franklin 1920-07-25 1958-04-16 61
1 William Gosset 1876-06-13 1937-10-16 45
2 Florence Nightingale 1820-05-12 1910-08-13 37
3 Marie Curie 1867-11-07 1934-07-04 77
4 Rachel Carson 1907-05-27 1964-04-14 90
5 John Snow 1813-03-15 1858-06-16 56
6 Alan Turing 1912-06-23 1954-06-07 66
7 Johann Gauss 1777-04-30 1855-02-23 41
Occupation born_dt died_dt age_days age_years
0 Chemist 1920-07-25 1958-04-16 13779 days 37.0
1 Statistician 1876-06-13 1937-10-16 22404 days 61.0
2 Nurse 1820-05-12 1910-08-13 32964 days 90.0
3 Chemist 1867-11-07 1934-07-04 24345 days 66.0
4 Biologist 1907-05-27 1964-04-14 20777 days 56.0
5 Physician 1813-03-15 1858-06-16 16529 days 45.0
6 Computer Scientist 1912-06-23 1954-06-07 15324 days 41.0
7 Mathematician 1777-04-30 1855-02-23 28422 days 77.0
age_days_assign age_year_assign
0 13779 days 37.0
1 22404 days 61.0
2 32964 days 90.0
3 24345 days 66.0
4 20777 days 56.0
5 16529 days 45.0
6 15324 days 41.0
7 28422 days 77.0
You can look into the .assign()
documentation for more examples.10 Since this is only showing a simple example of how to use the method to assign new values. Effectively using .assign()
will require you to know about lambda
functions, which we will cover in Chapter 5.
10. .assign()
documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
To drop a column, we can either select all the columns we want to by using the column subsetting techniques (Section 1.3.1), or select columns to drop with the .drop()
method on our dataframe.11
11. DataFrame .drop()
method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
# all the current columns in our data
print(scientists.columns)
Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'born_dt',
'died_dt', 'age_days', 'age_years', 'age_days_assign',
'age_year_assign'], dtype='object')
# drop the shuffled age column
# you provide the axis=1 argument to drop column-wise
scientists_dropped = scientists.drop(['Age'], axis="columns")
# columns after dropping our column
print(scientists_dropped.columns)
Index(['Name', 'Born', 'Died', 'Occupation', 'born_dt', 'died_dt',
'age_days', 'age_years', 'age_days_assign',
'age_year_assign'],
dtype='object')
In our examples so far, we have been importing data. It is also common practice to export or save data sets while processing them. Data sets are either saved out as final cleaned versions of data or in intermediate steps. Both of these outputs can be used for analysis or as input to another part of the data processing pipeline.
Python has a way to pickle
data. This is Python’s way of serializing and saving data in a binary format. Reading pickle data is also backwards compatible. pickle
files are usually saved with an extension of .p
, .pkl
, or .pickle
. We will see how to save and load pickle
data below.
Many of the export methods for a Series
are also available for a DataFrame
. Those readers who have experience with numpy
will know that a .save()
method is available for ndarrays
. This method has been deprecated, and the replacement is to use the .to_pickle
method.
names = scientists['Name']
print(names)
0 Rosaline Franklin
1 William Gosset
2 Florence Nightingale
3 Marie Curie
4 Rachel Carson
5 John Snow
6 Alan Turing
7 Johann Gauss
Name: Name, dtype: object
# pass in a string to the path you want to save
names.to_pickle('output/scientists_names_series.pickle')
The pickle
output is in a binary format. If you try to open it in a text editor, you will see a bunch of garbled characters.
If the object you are saving is an intermediate step in a set of calculations that you want to save, or if you know that your data will stay in the Python world, saving objects to a pickle
will be optimized for Python and disk storage space. However, this approach means that people who do not use Python will not be able to read the data.
The same method can be used on DataFrame
objects.
scientists.to_pickle('output/scientists_df.pickle')
pickle
dataTo read pickle
data, we can use the pd.read_pickle()
function.
# for a Series
series_pickle = pd.read_pickle(
"output/scientists_names_series.pickle"
)
print(series_pickle)
0 Rosaline Franklin
1 William Gosset
2 Florence Nightingale
3 Marie Curie
4 Rachel Carson
5 John Snow
6 Alan Turing
7 Johann Gauss
Name: Name, dtype: object
# for a DataFrame
dataframe_pickle = pd.read_pickle('output/scientists_df.pickle')
print(dataframe_pickle)
Name Born Died Age
0 Rosaline Franklin 1920-07-25 1958-04-16 61
1 William Gosset 1876-06-13 1937-10-16 45
2 Florence Nightingale 1820-05-12 1910-08-13 37
3 Marie Curie 1867-11-07 1934-07-04 77
4 Rachel Carson 1907-05-27 1964-04-14 90
5 John Snow 1813-03-15 1858-06-16 56
6 Alan Turing 1912-06-23 1954-06-07 66
7 Johann Gauss 1777-04-30 1855-02-23 41
Occupation born_dt died_dt age_days age_years
0 Chemist 1920-07-25 1958-04-16 13779 days 37.0
1 Statistician 1876-06-13 1937-10-16 22404 days 61.0
2 Nurse 1820-05-12 1910-08-13 32964 days 90.0
3 Chemist 1867-11-07 1934-07-04 24345 days 66.0
4 Biologist 1907-05-27 1964-04-14 20777 days 56.0
5 Physician 1813-03-15 1858-06-16 16529 days 45.0
6 Computer Scientist 1912-06-23 1954-06-07 15324 days 41.0
7 Mathematician 1777-04-30 1855-02-23 28422 days 77.0
age_days_assign age_year_assign
0 13779 days 37.0
1 22404 days 61.0
2 32964 days 90.0
3 24345 days 66.0
4 20777 days 56.0
5 16529 days 45.0
6 15324 days 41.0
7 28422 days 77.0
Again, the pickle
files are saved with an extension of .p
, .pkl
, or .pickle
.
Comma-separated values (CSV) are the most flexible data storage type. For each row, the column information is separated with a comma. The comma is not the only type of delimiter, however. Some files are delimited by a tab (TSV) or even a semicolon. The main reason why CSVs are a preferred data format when collaborating and sharing data is because any program can open this kind of data structure. It can even be opened in a text editor. However, the universal storage format does come at a price. CSV files are usually slower and take up more disk space when compared to other binary formats.
The Series
and DataFrame
have a .to_csv()
method to write a CSV file. The documentation for Series
12 and DataFrame
13 identifies many different ways you can modify the resulting CSV file. For example, if you wanted to save a TSV file because there are commas in your data, you can change the sep
parameter (Appendix O).
12. Saving a Series
to CSV: https://pandas.pydata.org/docs/reference/api/pandas.Series.to_csv.html
13. Saving a DataFrame
to CSV: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
By default, the .index
of a DataFrame gets written to the CSV file. This creates a file where the first column does not have a name, and only holds the row numbers of the dataframe being saved. This extraneous column in the CSV becomes problematic when you try to read the CSV back into Pandas. So we typically put in the index=False
parameter when saving CSV files to avoid this problem.
# do not write the row names in the CSV output
scientists.to_csv('output/scientists_df_no_index.csv', index=False)
Importing CSV files was illustrated in Section 1.2. This operation uses the pd.read_csv()
function. In the documentation, you can see there are various ways to read in a CSV
.14 Look at Appendix O if you need more information on using function parameters.
14. pd.read_csv()
documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Excel, which is probably the most commonly used data type (or the second most commonly used, after CSVs), has a bad reputation within the data science community, mainly because colors and other superfluous information can easily find its way into the data set, not to mention one-off calculations that ruin the rectangular structure of a data set. Some other reasons are listed at the very beginning of this chapter. The goal of this book isn’t to bash Excel, but to teach you about a reasonable alternative tool for data analytics. In short, the more of your work you can do in a scripting language, the easier it will be to scale up to larger projects, catch and fix mistakes, and collaborate. However, Excel’s popularity and market share are unrivaled. Excel has its own scripting language if you absolutely have to work in it. This will allow you to work with data in a more predictable and reproducible manner.
The Series
data structure does not have an explicit .to_excel()
method. If you have a Series
that needs to be exported to an Excel file, one option is to convert the Series
into a one-column DataFrame
.
Before saving and reading Excel files, make sure you have the openpyxl
library installed (using pip install openpyxl
See Appendix B).
print(names)
0 Rosaline Franklin
1 William Gosset
2 Florence Nightingale
3 Marie Curie
4 Rachel Carson
5 John Snow
6 Alan Turing
7 Johann Gauss
Name: Name, dtype: object
# convert the Series into a DataFrame
# before saving it to an Excel file
names_df = names.to_frame()
# save to an excel file
names_df.to_excel(
'output/scientists_names_series_df.xls', engine='openpyxl'
)
From the preceding example, you can see how to export a DataFrame
to an Excel file. The documentation shows several ways to further fine-tune the output.15 For example, you can output data to a specific “sheet” using the sheet_name
parameter.
15. .to_excel()
documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
# saving a DataFrame into Excel format
scientists.to_excel(
"output/scientists_df.xlsx",
sheet_name="scientists",
index=False
)
The format called “feather” is used to save DataFrame
s into a binary object that can also be loaded into other languages (e.g., R). The main benefit of this approach is that it is faster than writing and reading a CSV file between the languages. See the Pandas .to_feather()
16 and feather file format documentation17 for more information on storing for backwards compatibility.
16. Pandas to_feather()
documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_feather.html
17. Feather file format documentation: https://arrow.apache.org/docs/python/feather.html
The feather
formatter is installed via conda install -c conda-forge pyarrow
or pip install pyarrow
. More on installing packages are described in Appendix B.
You can use the .to_feather()
method on a dataframe to save the feather
objects.
# save to feather file
scientists.to_feather('output/scientists.feather')
# read feather file
sci_feather = pd.read_feather('output/scientists.feather')
print(sci_feather)
Name Born Died Age
0 Rosaline Franklin 1920-07-25 1958-04-16 61
1 William Gosset 1876-06-13 1937-10-16 45
2 Florence Nightingale 1820-05-12 1910-08-13 37
3 Marie Curie 1867-11-07 1934-07-04 77
4 Rachel Carson 1907-05-27 1964-04-14 90
5 John Snow 1813-03-15 1858-06-16 56
6 Alan Turing 1912-06-23 1954-06-07 66
7 Johann Gauss 1777-04-30 1855-02-23 41
Occupation born_dt died_dt age_days age_years
0 Chemist 1920-07-25 1958-04-16 13779 days 37.0
1 Statistician 1876-06-13 1937-10-16 22404 days 61.0
2 Nurse 1820-05-12 1910-08-13 32964 days 90.0
3 Chemist 1867-11-07 1934-07-04 24345 days 66.0
4 Biologist 1907-05-27 1964-04-14 20777 days 56.0
5 Physician 1813-03-15 1858-06-16 16529 days 45.0
6 Computer Scientist 1912-06-23 1954-06-07 15324 days 41.0
7 Mathematician 1777-04-30 1855-02-23 28422 days 77.0
age_days_assign age_year_assign
0 13779 days 37.0
1 22404 days 61.0
2 32964 days 90.0
3 24345 days 66.0
4 20777 days 56.0
5 16529 days 45.0
6 15324 days 41.0
7 28422 days 77.0
Feather files are part of the Apache Arrow project.18 One of the main goals of Arrow is to have a memory storage format for dataframe objects that work across multiple programming languages without having to convert types for each of them.
18. Apache Arrow: https://arrow.apache.org/docs/index.html
]
Arrow has its own Pandas integration19 to convert Pandas DataFrame
objects to Arrow objects (from_pandas()
20) and from Arrow objects to Pandas DataFrame
objects (to_pandas()
21). Once the data is in an Arrow format, it can much more efficiently be used in other programming languages.
19. Arrow Pandas integration: https://arrow.apache.org/docs/python/pandas.html
20. Arrow from_pandas()
: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas
21. Arrow to_pandas()
: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas
The Pandas Series
and DataFrame
objects also have a .to_dict()
method. This converts the object into a Python dictionary object. This format is particularly useful if you have a DataFrame
or Series
and you want to use the data from outside Pandas.
Let’s create a smaller subset of the scientist
data so all the dictionary data will display properly
# first 2 rows of data
sci_sub_dict = scientists.head(2)
# convert the dataframe into a dictionary
sci_dict = sci_sub_dict.to_dict()
# using the pretty print library to print the dictionary
import pprint
pprint.pprint(sci_dict)
{'Age': {0: 61, 1: 45},
'Born': {0: '1920-07-25', 1: '1876-06-13'},
'Died': {0: '1958-04-16', 1: '1937-10-16'},
'Name': {0: 'Rosaline Franklin', 1: 'William Gosset'},
'Occupation': {0: 'Chemist', 1: 'Statistician'},
'age_days': {0: Timedelta('13779 days 00:00:00'),
1: Timedelta('22404 days 00:00:00')},
'age_days_assign': {0: Timedelta('13779 days 00:00:00'),
1: Timedelta('22404 days 00:00:00')},
'age_year_assign': {0: 37.0, 1: 61.0},
'age_years': {0: 37.0, 1: 61.0},
'born_dt': {0: Timestamp('1920-07-25 00:00:00'),
1: Timestamp('1876-06-13 00:00:00')},
'died_dt': {0: Timestamp('1958-04-16 00:00:00'),
1: Timestamp('1937-10-16 00:00:00')}}
Once the dictionary output is created, we can read it back into Pandas.
# read in the dictionary object back into a dataframe
sci_dict_df = pd.DataFrame.from_dict(sci_dict)
print(sci_dict_df)
Name Born Died Age Occupation
0 Rosaline Franklin 1920-07-25 1958-04-16 61 Chemist
1 William Gosset 1876-06-13 1937-10-16 45 Statistician
born_dt died_dt age_days age_years age_days_assign
0 1920-07-25 1958-04-16 13779 days 37.0 13779 days
1 1876-06-13 1937-10-16 22404 days 61.0 22404 days
age_year_assign
0 37.0
1 61.0
JSON data is another common plain text file format. The benefit of using the .to_jsion()
is that it can convert dates and times for you to read back into Pandas. By using orient='records'
we can either pass in the variable or copy and paste from the output to load it back into Pandas. The indent=2
allows the output to print a bit nicer to the screen (and book).
# convert the dataframe into a dictionary
sci_json = sci_sub_dict.to_json(
orient='records', indent=2, date_format="iso"
)
pprint.pprint(sci_json)
('[
'
' {
'
' "Name":"Rosaline Franklin",
'
' "Born":"1920-07-25",
'
' "Died":"1958-04-16",
'
' "Age":61,
'
' "Occupation":"Chemist",
'
' "born_dt":"1920-07-25T00:00:00.000Z",
'
' "died_dt":"1958-04-16T00:00:00.000Z",
'
' "age_days":"P13779DT0H0M0S",
'
' "age_years":37.0,
'
' "age_days_assign":"P13779DT0H0M0S",
'
' "age_year_assign":37.0
'
' },
'
' {
'
' "Name":"William Gosset",
'
' "Born":"1876-06-13",
'
' "Died":"1937-10-16",
'
' "Age":45,
'
' "Occupation":"Statistician",
'
' "born_dt":"1876-06-13T00:00:00.000Z",
'
' "died_dt":"1937-10-16T00:00:00.000Z",
'
' "age_days":"P22404DT0H0M0S",
'
' "age_years":61.0,
'
' "age_days_assign":"P22404DT0H0M0S",
'
' "age_year_assign":61.0
'
' }
'
']')
# copy the string to re-create the dataframe
sci_json_df = pd.read_json(
('[
'
' {
'
' "Name":"Rosaline Franklin",
'
' "Born":"1920-07-25",
'
' "Died":"1958-04-16",
'
' "Age":61,
'
' "Occupation":"Chemist",
'
' "born_dt":"1920-07-25T00:00:00.000Z",
'
' "died_dt":"1958-04-16T00:00:00.000Z",
'
' "age_days":"P13779DT0H0M0S",
'
' "age_years":37.0,
'
' "age_days_assign":"P13779DT0H0M0S",
'
' "age_year_assign":37.0
'
' },
'
' {
'
' "Name":"William Gosset",
'
' "Born":"1876-06-13",
'
' "Died":"1937-10-16",
'
' "Age":45,
'
' "Occupation":"Statistician",
'
' "born_dt":"1876-06-13T00:00:00.000Z",
'
' "died_dt":"1937-10-16T00:00:00.000Z",
'
' "age_days":"P22404DT0H0M0S",
'
' "age_years":61.0,
'
' "age_days_assign":"P22404DT0H0M0S",
'
' "age_year_assign":61.0
'
' }
'
']'),
orient="records"
)
print(sci_json_df)
Name Born Died Age Occupation
0 Rosaline Franklin 1920-07-25 1958-04-16 61 Chemist
1 William Gosset 1876-06-13 1937-10-16 45 Statistician
born_dt died_dt
0 1920-07-25T00:00:00.000Z 1958-04-16T00:00:00.000Z
1 1876-06-13T00:00:00.000Z 1937-10-16T00:00:00.000Z
age_days age_years age_days_assign age_year_assign
0 P13779DT0H0M0S 37 P13779DT0H0M0S 37
1 P22404DT0H0M0S 61 P22404DT0H0M0S 61
Notice how the dates are all different from the original values? That’s because we choose to convert the dates into ISO 8601 string format.
print(sci_json_df.dtypes)
Name object
Born object
Died object
Age int64
Occupation object
born_dt object
died_dt object
age_days object
age_years int64
age_days_assign object
age_year_assign int64
dtype: object
If we want the original datetime object back, we need to convert that representation back into a date.
sci_json_df["died_dt_json"] = pd.to_datetime(sci_json_df["died_dt"])
print(sci_json_df)
Name Born Died Age Occupation
0 Rosaline Franklin 1920-07-25 1958-04-16 61 Chemist
1 William Gosset 1876-06-13 1937-10-16 45 Statistician
born_dt died_dt
0 1920-07-25T00:00:00.000Z 1958-04-16T00:00:00.000Z
1 1876-06-13T00:00:00.000Z 1937-10-16T00:00:00.000Z
age_days age_years age_days_assign age_year_assign
0 P13779DT0H0M0S 37 P13779DT0H0M0S 37
1 P22404DT0H0M0S 61 P22404DT0H0M0S 61
died_dt_json
0 1958-04-16 00:00:00+00:00
1 1937-10-16 00:00:00+00:00
print(sci_json_df.dtypes)
Name object
Born object
Died object
Age int64
Occupation object
born_dt object
died_dt object
age_days object
age_years int64
age_days_assign object
age_year_assign int64
died_dt_json datetime64[ns, UTC]
dtype: object
Working with dates and times is always tricky. We talk more about them in Chapter 12.
There are many ways Pandas can export and import data. Indeed, .to_pickle()
, .to_csv()
, .to_excel()
, .to_feather()
, .to_dict()
are only some of the data formats that can make their way into Pandas DataFrame
s. Table 2.4 lists some of these other output formats.
Table 2.4 DataFrame
Export Methods
Export Method | Description |
---|---|
| Save data into the system clipboard for pasting |
| Convert data into a regular “dense” |
| Convert data into a Python |
| Convert data into a Google BigQuery table |
| Save data into a hierarchal data format (HDF) |
| Save data into a portable JSON-like binary |
| Convert data into a HTML table |
| Convert data into a JSON string |
| Convert data into a LATEX tabular environment |
| Convert data into a record array |
| Show |
| Convert data into a |
| Save data into a SQL database |
| Convert data into a Stata |
This chapter went into a little more detail about how the Pandas Series
and DataFrame
objects work in Python. There were some simpler examples of data cleaning shown, along with a few common ways to export data to share with others. Chapter 1 and Chapter 2 should give you a good basis on how Pandas works as a library.
The next chapter covers the basics of plotting in Python and Pandas. Data visualization is not only used at the end of an analysis to plot results, but also is heavily utilized throughout the entire data pipeline.
3.14.252.56