2

Pandas Data Structures Basics

Chapter 1 introduced the Pandas DataFrame and Series objects. These data structures resemble the primitive Python data containers (lists and dictionaries) for indexing and labeling, but have additional features that make working with data easier.

Learning Objectives

The concept map for this chapter can be found in Figure A.2.

  • Use functions to create and load manual data

  • Describe the Series object

  • Describe the DataFrame object

  • Identify basic operations on Series objects

  • Identify basic operations on DataFrame objects

  • Perform conditional subsetting, fancy slicing, and indexing

  • Use methods to save data

2.1 Create Your Own Data

Whether you are manually inputting data or creating a small test example, knowing how to create DataFrames without loading data from a file is a useful skill. It is especially helpful when you are asking a question about a StackOverflow error.

2.1.1 Create a Series

The Pandas Series is a one-dimensional container (i.e., Python Iterable), similar to the built-in Python list. It is the data type that represents each column of the DataFrame. Table 1.1 lists the possible dtypes for Pandas DataFrame columns. Each value in a DataFrame column must be stored as the same dtype. For example, if a column contains the number 1 and the sequence of letters (i.e., string) "pizza", the entire dtype of the column will be a string (Pandas will call this an object dtype).

Since a DataFrame can be thought of as a dictionary of Series objects, where each key is the column name and the value is the Series, we can conclude that a Series is very similar to a Python list, except that each element must be the same dtype. Those who have used the numpy library will realize this is the same behavior as demonstrated by the ndarray.

The easiest way to create a Series is to pass in a Python list. If we pass in a list of mixed types, the most common representation of both will be used. Typically the dtype will be object.

import pandas as pd

s = pd.Series(['banana', 42])
print(s)
0    banana
1        42
dtype: object

Notice that the “row number” is shown on the left of the Series. This is actually the index for the series. It is similar to the row name and row index we saw in Section 1.3.2 for DataFrames. It implies that we can actually assign a “name” to values in our series.

# manually assign index values to a series
# by passing a Python list
s = pd.Series(
data=["Wes McKinney", "Creator of Pandas"],
index=["Person", "Who"],
)

print(s)
Person         Wes McKinney
Who       Creator of Pandas
dtype: object

2.1.2 Create a DataFrame

As mentioned in Chapter 1, a DataFrame can be thought of as a dictionary of Series objects. This is why dictionaries are the most common way of creating a DataFrame. The key represents the column name, and the values are the contents of the column.

scientists = pd.DataFrame(
  {
    "Name": ["Rosaline Franklin", "William Gosset"],
    "Occupation": ["Chemist", "Statistician"],
    "Born": ["1920-07-25", "1876-06-13"],
    "Died": ["1958-04-16", "1937-10-16"],
    "Age": [37, 61],
  }
)

print(scientists)
                Name    Occupation        Born        Died  Age
0  Rosaline Franklin       Chemist  1920-07-25  1958-04-16   37
1     William Gosset  Statistician  1876-06-13  1937-10-16   61

If we look at the documentation for DataFrame1, we see that we can use the columns parameter or specify the column order. If we want to use the name column for the row index, we can use the index parameter.

1. DataFrame documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

scientists = pd.DataFrame(
  data={
    "Occupation": ["Chemist", "Statistician"],
    "Born": ["1920-07-25", "1876-06-13"],
    "Died": ["1958-04-16", "1937-10-16"],
    "Age": [37, 61],
  },
  index=["Rosaline Franklin", "William Gosset"],
  columns=["Occupation", "Born", "Died", "Age"],
)
print(scientists)
                    Occupation        Born        Died  Age
Rosaline Franklin      Chemist  1920-07-25  1958-04-16   37
William Gosset    Statistician  1876-06-13  1937-10-16   61

2.2 The Series

In Section 1.3.2.1, we saw how the slicing method affects the type of the result. If we use .loc[] to subset the first row of our scientists DataFrame, we will get a Series object back.

First, let’s re-create our example DataFrame.

# create our example dataframe
# with a row index label
scientists = pd.DataFrame(

  data={
    "Occupation": ["Chemist", "Statistician"],
    "Born": ["1920-07-25", "1876-06-13"],
    "Died": ["1958-04-16", "1937-10-16"],
    "Age": [37, 61],
  },
  index=["Rosaline Franklin", "William Gosset"],
  columns=["Occupation", "Born", "Died", "Age"],
)

print(scientists)
                    Occupation        Born        Died  Age
Rosaline Franklin      Chemist  1920-07-25  1958-04-16   37
William Gosset    Statistician  1876-06-13  1937-10-16   61

Select a scientist by the row index label.

# select by row index label
first_row = scientists.loc['William Gosset']
print(type(first_row))
<class 'pandas.core.series.Series'>
print(first_row)
Occupation    Statistician
Born            1876-06-13
Died            1937-10-16
Age                     61
Name: William Gosset, dtype: object

When a series is printed (i.e., the string representation), the index is printed as the first “column,” and the values are printed as the second “column.” There are many attributes and methods associated with a Series object.2 Two examples of attributes are .index and .values.

2. Series documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

print(first_row.index)
Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')
print(first_row.values)
['Statistician' '1876-06-13' '1937-10-16' 61]

Table 2.1 Some of the Attributes Within a Series

Series attributes

Description

.loc

Subset using index value

.iloc

Subset using index position

.dtype or dtypes

The type of the Series contents

.T

Transpose of the series

.shape

Dimensions of the data

.size

Number of elements in the Series

.values

ndarray or ndarray-like of the Series

An example of a Series method is .keys(), which is an alias for the .index attribute.

print(first_row.keys())
Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')

By now, you might have questions about the syntax for .index, .values, and .keys(). More information about attributes and methods is found in Appendix P on classes. Attributes can be thought of as features of an object (in this example, our object is a Series). Methods can be thought of as some calculation or operation that is performed on an object. The subsetting syntax for .loc[] and .iloc[] (from Section 1.3.2) consists of all attributes. This is why the syntax does not rely on a set of round parentheses, ( ), but rather a set of square brackets, [ ], for subsetting. Since .keys() is a method, if we wanted to get the first key (which is also the first index), we would use the square brackets after the method call. Some attributes for the series are listed in Table 2.1.

# get the first index using an attribute
print(first_row.index[0])
Occupation
# get the first index using a method
print(first_row.keys()[0])
Occupation

2.2.1 The Series Is ndarray-like

The Pandas data structure known as Series is very similar to the numpy.ndarray (Appendix O). In turn, many methods and functions that operate on a ndarray will also operate on a Series. A Series may sometimes be referred to as a “vector.”

2.2.1.1 Series Methods

Let’s first get a series of the Age column from our scientists dataframe.

# get the 'Age' column
ages = scientists['Age']
print(ages)
Rosaline Franklin    37
William Gosset       61
Name: Age, dtype: int64

NumPy is a scientific computing library that typically deals with numeric vectors. Since a Series can be thought of as an extension to the numpy.ndarray, there is an overlap of attributes and methods. When we have a vector of numbers, there are common calculations we can perform.3

3. Descriptive statistics: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics

# calculate the mean
print(ages.mean())
49.0
# calculate the minimum
print(ages.min())
37
# calculate the maximum
print(ages.max())
61
# calculate the standard deviation
print(ages.std())
16.97056274847714

The .mean(), .min(), .max(), and .std() are also methods in the numpy.ndarray.4 Some Series methods are listed in Table 2.2.

4. NumPy ndarrary documentation: https://numpy.org/doc/stable/reference/arrays.ndarray.html

2.2.2 Boolean Subsetting: Series

Chapter 1 showed how we can use specific indices to subset our data. Only rarely, however, will we know the exact row or column index to subset the data. Typically you are looking for values that meet (or don’t meet) a particular calculation or observation.

To explore this process, let’s use a larger data set.

scientists = pd.read_csv('data/scientists.csv')

We just saw how we can calculate basic descriptive metrics of vectors. The .describe() method will calculate multiple descriptive statistics in a single method call.

        ages = scientists['Age']
        print(ages)

Table 2.2 Some of the Methods That Can Be Performed on a Series

Series Methods

Description

.append()

Concatenates two or more Series

.corr()

Calculate a correlation with another Series5

.cov()

Calculate a covariance with another Series6

.describe()

Calculate summary statistics7

.drop_duplicates()

Returns a Series without duplicates

.equals()

Determines whether a Series has the same elements

.get_values()

Get values of the Series; same as the values attribute

.hist()

Draw a histogram

.isin()

Checks whether values are contained in a Series

.min()

Returns the minimum value

.max()

Returns the maximum value

.mean()

Returns the arithmetic mean

.median()

Returns the median

.mode()

Returns the mode(s)

.quantile()

Returns the value at a given quantile

.replace()

Replaces values in the Series with a specified value

.sample()

Returns a random sample of values from the Series

.sort_values()

Sorts values

.to_frame()

Converts a Series to a DataFrame

.transpose()

Returns the transpose

.unique()

Returns a numpy.ndarray of unique values

5. Missing values will be automatically dropped.

6. Missing values will be automatically dropped.

7. Missing values will be automatically dropped.

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64
# get basic stats
print(ages.describe())
count    8.000000
mean    59.125000
std     18.325918

min     37.000000
25%     44.000000
50%     58.500000
75%     68.750000
max     90.000000
Name: Age, dtype: float64
# mean of all ages
print(ages.mean())
59.125

What if we wanted to subset our ages by identifying those above the mean?

print(ages[ages > ages.mean()])
1    61
2    90
3    66
7    77
Name: Age, dtype: int64

Let’s tease out this statement and look at what ages > ages.mean() returns.

print(ages > ages.mean())
0    False
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: Age, dtype: bool
print(type(ages > ages.mean()))
<class 'pandas.core.series.Series'>

This statement returns a Series with a .dtype of bool. In other words, we can not only subset values using labels and indices, but also supply a vector of boolean values. Python has many functions and methods. Depending on how they are implemented, they may return labels, indices, or booleans. Keep this point in mind as you learn new methods and seek to piece together various parts for your work.

If we liked, we could manually supply a vector of bools to subset our data.

# get index 0, 1, 4, 5, and 7
manual_bool_values = [

  True,  # 0
  True,  # 1
  False, # 2
  False, # 3
  True,  # 4
  True,  # 5
  False, # 6
  True,  # 7
]
print(ages[manual_bool_values])
0    37
1    61
4    56
5    45
7    77
Name: Age, dtype: int64

2.2.3 Operations Are Automatically Aligned and Vectorized (Broadcasting)

If you’re familiar with programming, you would find it strange that ages > ages.mean() returns a vector without any for loops (Appendix J). Many of the methods that work on Series (and also DataFrames) are “vectorized,” meaning that they work on the entire vector simultaneously. This approach makes the code easier to read, and typically, optimizations are available to make calculations faster.

2.2.3.1 Vectors of the Same Length

If you perform an operation between two vectors of the same length, the resulting vector will be an element-by-element calculation of the vectors.

print(ages + ages)
0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
print(ages * ages)
0    1369
1    3721
2    8100
3    4356

4    3136
5    2025
6    1681
7    5929
Name: Age, dtype: int64
2.2.3.2 Vectors With Integers (Scalars)

When you perform an operation on a vector using a scalar, the scalar will be recycled across all the elements in the vector.

print(ages + 100)
0    137
1    161
2    190
3    166
4    156
5    145
6    141
7    177
Name: Age, dtype: int64
print(ages * 2)
0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
2.2.3.3 Vectors With Different Lengths

When you are working with vectors of different lengths, the behavior will depend on the type() of the vectors. With a Series, the vectors will perform an operation matched by the index. The rest of the resulting vector will be filled with a “missing” value, denoted with NaN, signifying “not a number” (Chapter 9).

This type of behavior, which is called broadcasting, differs between languages. Broadcasting in Pandas refers to how operations are calculated between arrays with different shapes.

print(ages + pd.Series([1, 100]))
0     38.0
1    161.0
2      NaN
3      NaN

4      NaN
5      NaN
6      NaN
7      NaN
dtype: float64

With other types(), the shapes must match.

import numpy as np
# this will cause an error
print(ages + np.array([1, 100]))
ValueError: operands could not be broadcast together with shapes (8,) (2,)
2.2.3.4 Vectors With Common Index Labels (Automatic Alignment)

What’s convenient in Pandas is how data alignment is almost always automatic. If possible, things will always align themselves with the index label when actions are performed.

# ages as they appear in the data
print(ages)
0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64
rev_ages = ages.sort_index(ascending=False)
print(rev_ages)
7    77
6    41
5    45
4    56
3    66
2    90
1    61
0    37
Name: Age, dtype: int64

If we perform an operation using ages and rev_ages, it will still be conducted on an element-by-element basis, but the vectors will be aligned first before the operation is carried out.

# reference output to show index label alignment
print(ages * 2)
0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
# note how we get the same values
# even though the vector is reversed
print(ages + rev_ages)
0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64

2.3 The DataFrame

The DataFrame is the most common Pandas object. It can be thought of as Python’s way of storing spreadsheet-like data. Many of the features of the Series data structure carry over into the DataFrame.

2.3.1 Parts of a DataFrame

There are 3 main parts to a Pandas DataFrame object the .index, .columns, and .values. These refer to the row name, column names, and data values, respectively.

scientists.index
RangeIndex(start=0, stop=8, step=1)
scientists.columns
Index(['Name', 'Born', 'Died', 'Age', 'Occupation'], dtype='object')
scientists.values

array([['Rosaline Franklin', '1920-07-25', '1958-04-16', 37, 'Chemist'],
       ['William Gosset', '1876-06-13', '1937-10-16', 61, 'Statistician'],
       ['Florence Nightingale', '1820-05-12', '1910-08-13', 90, 'Nurse'],
       ['Marie Curie', '1867-11-07', '1934-07-04', 66, 'Chemist'],
       ['Rachel Carson', '1907-05-27', '1964-04-14', 56, 'Biologist'],
       ['John Snow', '1813-03-15', '1858-06-16', 45, 'Physician'],
       ['Alan Turing', '1912-06-23', '1954-06-07', 41,
        'Computer Scientist'],
       ['Johann Gauss', '1777-04-30', '1855-02-23', 77, 'Mathematician']],
      dtype=object)

The .values comes in handy when you don’t want all the row index label information, and really just want the base numpy representation of the data.

2.3.2 Boolean Subsetting: DataFrames

Just as we were able to subset a Series with a boolean vector, so can we subset a DataFrame with a bool.

# boolean vectors will subset rows
print(scientists.loc[scientists['Age'] > scientists['Age'].mean()])

                   Name        Born        Died  Age     Occupation
1        William Gosset  1876-06-13  1937-10-16   61   Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90          Nurse
3           Marie Curie  1867-11-07  1934-07-04   66        Chemist
7          Johann Gauss  1777-04-30  1855-02-23   77  Mathematician

Table 2.3 summarizes the various subsetting methods.

Table 2.3 Table of DataFrame Subsetting Methods

Syntax

Selection Result

df[column_name]

Series

df[[column1, column2, ... ]]

DataFrame

df.loc[row_label]

Row by row index label (row name)

df.loc[[label1, label2, ...]]

Multiple rows by index label

df.iloc[row_number]

Row by row number

df.iloc[[row1, row2, ...]]

Multiple rows by row number

df[bool]

Row based on bool

df[[bool1, bool2, ...]]

Multiple rows based on bool

df[start:stop:step]

Rows based on slicing notation

2.3.3 Operations Are Automatically Aligned and Vectorized (Broadcasting)

Pandas supports broadcasting because the Series and DataFrame objects are built on top of the numpy library.8 Broadcasting describes what happens when performing operations between array-like objects. These behaviors depend on the type of object, its length, and any labels associated with the object.

8. NumPy Library: http://www.numpy.org/

First, let’s create a subset of our dataframes.

first_half = scientists[:4]
second_half = scientists[4:]
print(first_half)
                   Name        Born        Died    Age    Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16     37       Chemist
1        William Gosset  1876-06-13  1937-10-16     61  Statistician
2  Florence Nightingale  1820-05-12  1910-08-13     90         Nurse
3           Marie Curie  1867-11-07  1934-07-04     66       Chemist
print(second_half)
            Name        Born        Died  Age          Occupation
4  Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5      John Snow  1813-03-15  1858-06-16   45           Physician
6    Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7   Johann Gauss  1777-04-30  1855-02-23   77       Mathematician

When we perform an action on a dataframe with a scalar, it will try to apply the operation on each cell of the dataframe. In this example, numbers will be multiplied by 2, and strings will be doubled (this is Python’s normal behavior with strings).

# multiply by a scalar
print(scientists * 2)
                                       Name                  Born  
0        Rosaline FranklinRosaline Franklin  1920-07-251920-07-25
1              William GossetWilliam Gosset  1876-06-131876-06-13
2  Florence NightingaleFlorence Nightingale  1820-05-121820-05-12
3                    Marie CurieMarie Curie  1867-11-071867-11-07
4                Rachel CarsonRachel Carson  1907-05-271907-05-27
5                        John SnowJohn Snow  1813-03-151813-03-15
6                    Alan TuringAlan Turing  1912-06-231912-06-23
7                  Johann GaussJohann Gauss  1777-04-301777-04-30


                   Died  Age                            Occupation
0  1958-04-161958-04-16   74                        ChemistChemist
1  1937-10-161937-10-16  122              StatisticianStatistician
2  1910-08-131910-08-13  180                            NurseNurse
3  1934-07-041934-07-04  132                        ChemistChemist
4  1964-04-141964-04-14  112                    BiologistBiologist
5  1858-06-161858-06-16   90                    PhysicianPhysician
6  1954-06-071954-06-07   82  Computer ScientistComputer Scientist
7  1855-02-231855-02-23  154            MathematicianMathematician

If your dataframes are all numeric values and you want to “add” the values on a cell-by-cell basis, you can use the .add() method. The automatic alignment can be better seen in Chapter 6, when we concatenate dataframes together.

2.4 Making Changes to Series and DataFrames

Now that we know various ways of subsetting and slicing our data (see Table 2.3), we should be able to alter our data objects.

2.4.1 Add Additional Columns

The type of the Born and Died columns is object, meaning they are strings or a sequence of characters.

print(scientists.dtypes)
Name          object
Born          object
Died          object
Age            int64
Occupation    object
dtype: object

We can convert the strings to a proper datetime type so we can perform common date and time operations (e.g., take differences between dates or calculate a person’s age). You can provide your own format if you have a date that has a specific format. A list of format variables can be found in the Python datetime module documentation.9 More examples with datetimes can be found in Chapter 12. The format of our date looks like “YYYY-MM-DD,” so we can use the %Y-%m-%d format.

9. datetime module documentation: https://docs.python.org/3.10/library/datetime.html#strftime-and-strptime-behavior

# format the 'Born' column as a datetime
born_datetime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
print(born_datetime)

0    1920-07-25
1    1876-06-13
2    1820-05-12
3    1867-11-07
4    1907-05-27
5    1813-03-15
6    1912-06-23
7    1777-04-30
Name: Born, dtype: datetime64[ns]
# format the 'Died' column as a datetime
died_datetime = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')

If we wanted, we could create a new set of columns that contain the datetime representations of the object (string) dates. The below example uses python’s multiple assignment syntax (Appendix N).

scientists['born_dt'], scientists['died_dt'] = (
  born_datetime,
  died_datetime
)

print(scientists.head())
                   Name        Born        Died  Age    Occupation  
0     Rosaline Franklin  1920-07-25  1958-04-16   37       Chemist
1        William Gosset  1876-06-13  1937-10-16   61  Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90         Nurse
3           Marie Curie  1867-11-07  1934-07-04   66       Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56     Biologist
     born_dt     died_dt
0 1920-07-25  1958-04-16
1 1876-06-13  1937-10-16
2 1820-05-12  1910-08-13
3 1867-11-07  1934-07-04
4 1907-05-27  1964-04-14
print(scientists.shape)
(8, 7)
print(scientists.dtypes)
Name                  object
Born                  object
Died                  object
Age                    int64

Occupation            object
born_dt       datetime64[ns]
died_dt       datetime64[ns]
dtype: object

2.4.2 Directly Change a Column

We can also assign a new value directly to the existing column. The example in this section shows how to randomize the contents of a column. More complex calculations that involve multiple columns can be seen in Chapter 5, in the discussion of the .apply() method.

First, let’s look at the original Age values.

print(scientists['Age'])
0  37
1  61
2  90
3  66
4  56
5  45
6  41
7  77
Name: Age, dtype: int64

Now let’s shuffle the values.

# the frac=1 tells pandas to randomly select 100% of the values
# the random_state makes the randomization the same each time
scientists["Age"] = scientists["Age"].sample(frac=1, random_state=42)

For long bits of code we can wrap the code around round parentheses ( ) to break up the code into multiple lines. We will be using this convention for longer bits of code in this book (Appendix D.1).

# the previous line of code is equivalent to
scientists['Age'] = (
  scientists['Age']
  .sample(frac=1, random_state=42)
)
print(scientists['Age'])
0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

If you notice, that we tried to randomly shuffle the column, but when we assigned the values back into the dataframe, it reverted back to the original order. That’s because Pandas will try to automatically join on the .index values on many operations, for this example to get around this problem we need to remove that .index information. One way of doing that, is to assign just the .values of the shuffled values that does not have any .index value associated with it.

scientists['Age'] = (
scientists['Age']
  .sample(frac=1, random_state=42)
  .values # remove the index so it doesn't auto align the values
)

print(scientists['Age'])
0    61
1    45
2    37
3    77
4    90
5    56
6    66
7    41
Name: Age, dtype: int64

We can recalculate the “real” age using datetime arithmetic. More information about datetime can be found in Chapter 12.

# subtracting dates will give us number of days
scientists['age_days'] = (
  scientists['died_dt'] - scientists['born_dt']
)

print(scientists)
                   Name        Born        Died  Age  
0     Rosaline Franklin  1920-07-25  1958-04-16   61
1        William Gosset  1876-06-13  1937-10-16   45
2  Florence Nightingale  1820-05-12  1910-08-13   37

3           Marie Curie  1867-11-07  1934-07-04   77
4         Rachel Carson  1907-05-27  1964-04-14   90
5             John Snow  1813-03-15  1858-06-16   56
6           Alan Turing  1912-06-23  1954-06-07   66
7          Johann Gauss  1777-04-30  1855-02-23   41

            Occupation    born_dt    died_dt   age_days
0              Chemist 1920-07-25 1958-04-16 13779 days
1         Statistician 1876-06-13 1937-10-16 22404 days
2                Nurse 1820-05-12 1910-08-13 32964 days
3              Chemist 1867-11-07 1934-07-04 24345 days
4            Biologist 1907-05-27 1964-04-14 20777 days
5            Physician 1813-03-15 1858-06-16 16529 days
6   Computer Scientist 1912-06-23 1954-06-07 15324 days
7        Mathematician 1777-04-30 1855-02-23 28422 days
# we can convert the value to just the year
# using the astype method
scientists['age_years'] = (
  scientists['age_days']
  .astype('timedelta64[Y]')
)
print(scientists)
                   Name        Born        Died  Age  
0     Rosaline Franklin  1920-07-25  1958-04-16   61
1        William Gosset  1876-06-13  1937-10-16   45
2  Florence Nightingale  1820-05-12  1910-08-13   37
3           Marie Curie  1867-11-07  1934-07-04   77
4         Rachel Carson  1907-05-27  1964-04-14   90
5             John Snow  1813-03-15  1858-06-16   56
6           Alan Turing  1912-06-23  1954-06-07   66
7          Johann Gauss  1777-04-30  1855-02-23   41
           Occupation    born_dt    died_dt   age_days  age_years
0             Chemist 1920-07-25 1958-04-16 13779 days       37.0
1        Statistician 1876-06-13 1937-10-16 22404 days       61.0
2               Nurse 1820-05-12 1910-08-13 32964 days       90.0
3             Chemist 1867-11-07 1934-07-04 24345 days       66.0
4           Biologist 1907-05-27 1964-04-14 20777 days       56.0
5           Physician 1813-03-15 1858-06-16 16529 days       45.0
6  Computer Scientist 1912-06-23 1954-06-07 15324 days       41.0
7       Mathematician 1777-04-30 1855-02-23 28422 days       77.0

2.4.3 Modifying Columns with .assign()

Another way you can assign and modify columns is with the .assign() method. This has the benefit of using method chaining (Appendix R). Let’s redo the age_years column creation, but this time using '.assign().

scientists = scientists.assign(
  # new columns on the left of the equal sign
  # how to calculate values on the right of the equal sign
  # separate new columns with a comma
  age_days_assign=scientists['died_dt'] - scientists['born_dt'],
  age_year_assign=scientists['age_days'].astype('timedelta64[Y]')
)

print(scientists)
                   Name        Born       Died  Age  
0     Rosaline Franklin  1920-07-25  1958-04-16  61
1        William Gosset  1876-06-13  1937-10-16  45
2  Florence Nightingale  1820-05-12  1910-08-13  37
3           Marie Curie  1867-11-07  1934-07-04  77
4         Rachel Carson  1907-05-27  1964-04-14  90
5             John Snow  1813-03-15  1858-06-16  56
6           Alan Turing  1912-06-23  1954-06-07  66
7          Johann Gauss  1777-04-30  1855-02-23  41
           Occupation    born_dt    died_dt   age_days  age_years  
0             Chemist 1920-07-25 1958-04-16 13779 days       37.0
1        Statistician 1876-06-13 1937-10-16 22404 days       61.0
2               Nurse 1820-05-12 1910-08-13 32964 days       90.0
3             Chemist 1867-11-07 1934-07-04 24345 days       66.0
4           Biologist 1907-05-27 1964-04-14 20777 days       56.0
5           Physician 1813-03-15 1858-06-16 16529 days       45.0
6  Computer Scientist 1912-06-23 1954-06-07 15324 days       41.0
7       Mathematician 1777-04-30 1855-02-23 28422 days       77.0
  age_days_assign  age_year_assign
0      13779 days             37.0
1      22404 days             61.0
2      32964 days             90.0
3      24345 days             66.0
4      20777 days             56.0
5      16529 days             45.0
6      15324 days             41.0
7      28422 days             77.0

You can look into the .assign() documentation for more examples.10 Since this is only showing a simple example of how to use the method to assign new values. Effectively using .assign() will require you to know about lambda functions, which we will cover in Chapter 5.

10. .assign() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html

2.4.4 Dropping Values

To drop a column, we can either select all the columns we want to by using the column subsetting techniques (Section 1.3.1), or select columns to drop with the .drop() method on our dataframe.11

11. DataFrame .drop() method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

# all the current columns in our data
print(scientists.columns)
Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'born_dt',
       'died_dt', 'age_days', 'age_years', 'age_days_assign',
       'age_year_assign'], dtype='object')
# drop the shuffled age column
# you provide the axis=1 argument to drop column-wise
scientists_dropped = scientists.drop(['Age'], axis="columns")
# columns after dropping our column
print(scientists_dropped.columns)
Index(['Name', 'Born', 'Died', 'Occupation', 'born_dt', 'died_dt',
       'age_days', 'age_years', 'age_days_assign',
       'age_year_assign'],
      dtype='object')

2.5 Exporting and Importing Data

In our examples so far, we have been importing data. It is also common practice to export or save data sets while processing them. Data sets are either saved out as final cleaned versions of data or in intermediate steps. Both of these outputs can be used for analysis or as input to another part of the data processing pipeline.

2.5.1 Pickle

Python has a way to pickle data. This is Python’s way of serializing and saving data in a binary format. Reading pickle data is also backwards compatible. pickle files are usually saved with an extension of .p, .pkl, or .pickle. We will see how to save and load pickle data below.

2.5.1.1 Series

Many of the export methods for a Series are also available for a DataFrame. Those readers who have experience with numpy will know that a .save() method is available for ndarrays. This method has been deprecated, and the replacement is to use the .to_pickle method.

names = scientists['Name']
print(names)
0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object
# pass in a string to the path you want to save
names.to_pickle('output/scientists_names_series.pickle')

The pickle output is in a binary format. If you try to open it in a text editor, you will see a bunch of garbled characters.

If the object you are saving is an intermediate step in a set of calculations that you want to save, or if you know that your data will stay in the Python world, saving objects to a pickle will be optimized for Python and disk storage space. However, this approach means that people who do not use Python will not be able to read the data.

2.5.1.2 DataFrame

The same method can be used on DataFrame objects.

scientists.to_pickle('output/scientists_df.pickle')
2.5.1.3 Read pickle data

To read pickle data, we can use the pd.read_pickle() function.

# for a Series
series_pickle = pd.read_pickle(
    "output/scientists_names_series.pickle"
)
print(series_pickle)
0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object
# for a DataFrame
dataframe_pickle = pd.read_pickle('output/scientists_df.pickle')
print(dataframe_pickle)
                   Name        Born        Died  Age 
0     Rosaline Franklin  1920-07-25  1958-04-16   61
1        William Gosset  1876-06-13  1937-10-16   45
2  Florence Nightingale  1820-05-12  1910-08-13   37
3           Marie Curie  1867-11-07  1934-07-04   77
4         Rachel Carson  1907-05-27  1964-04-14   90
5             John Snow  1813-03-15  1858-06-16   56
6           Alan Turing  1912-06-23  1954-06-07   66
7          Johann Gauss  1777-04-30  1855-02-23   41
           Occupation    born_dt    died_dt   age_days  age_years  
0             Chemist 1920-07-25 1958-04-16 13779 days       37.0
1        Statistician 1876-06-13 1937-10-16 22404 days       61.0
2               Nurse 1820-05-12 1910-08-13 32964 days       90.0
3             Chemist 1867-11-07 1934-07-04 24345 days       66.0
4           Biologist 1907-05-27 1964-04-14 20777 days       56.0
5           Physician 1813-03-15 1858-06-16 16529 days       45.0
6  Computer Scientist 1912-06-23 1954-06-07 15324 days       41.0
7       Mathematician 1777-04-30 1855-02-23 28422 days       77.0
  age_days_assign  age_year_assign
0      13779 days             37.0
1      22404 days             61.0
2      32964 days             90.0
3      24345 days             66.0
4      20777 days             56.0
5      16529 days             45.0
6      15324 days             41.0
7      28422 days             77.0

Again, the pickle files are saved with an extension of .p, .pkl, or .pickle.

2.5.2 Comma-Separated Values (CSV)

Comma-separated values (CSV) are the most flexible data storage type. For each row, the column information is separated with a comma. The comma is not the only type of delimiter, however. Some files are delimited by a tab (TSV) or even a semicolon. The main reason why CSVs are a preferred data format when collaborating and sharing data is because any program can open this kind of data structure. It can even be opened in a text editor. However, the universal storage format does come at a price. CSV files are usually slower and take up more disk space when compared to other binary formats.

The Series and DataFrame have a .to_csv() method to write a CSV file. The documentation for Series12 and DataFrame13 identifies many different ways you can modify the resulting CSV file. For example, if you wanted to save a TSV file because there are commas in your data, you can change the sep parameter (Appendix O).

12. Saving a Series to CSV: https://pandas.pydata.org/docs/reference/api/pandas.Series.to_csv.html

13. Saving a DataFrame to CSV: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

By default, the .index of a DataFrame gets written to the CSV file. This creates a file where the first column does not have a name, and only holds the row numbers of the dataframe being saved. This extraneous column in the CSV becomes problematic when you try to read the CSV back into Pandas. So we typically put in the index=False parameter when saving CSV files to avoid this problem.

# do not write the row names in the CSV output
scientists.to_csv('output/scientists_df_no_index.csv', index=False)
2.5.2.1 Import CSV Data

Importing CSV files was illustrated in Section 1.2. This operation uses the pd.read_csv() function. In the documentation, you can see there are various ways to read in a CSV.14 Look at Appendix O if you need more information on using function parameters.

14. pd.read_csv() documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

2.5.3 Excel

Excel, which is probably the most commonly used data type (or the second most commonly used, after CSVs), has a bad reputation within the data science community, mainly because colors and other superfluous information can easily find its way into the data set, not to mention one-off calculations that ruin the rectangular structure of a data set. Some other reasons are listed at the very beginning of this chapter. The goal of this book isn’t to bash Excel, but to teach you about a reasonable alternative tool for data analytics. In short, the more of your work you can do in a scripting language, the easier it will be to scale up to larger projects, catch and fix mistakes, and collaborate. However, Excel’s popularity and market share are unrivaled. Excel has its own scripting language if you absolutely have to work in it. This will allow you to work with data in a more predictable and reproducible manner.

2.5.3.1 Series

The Series data structure does not have an explicit .to_excel() method. If you have a Series that needs to be exported to an Excel file, one option is to convert the Series into a one-column DataFrame.

Before saving and reading Excel files, make sure you have the openpyxl library installed (using pip install openpyxl See Appendix B).

print(names)
0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object
# convert the Series into a DataFrame
# before saving it to an Excel file
names_df = names.to_frame()

# save to an excel file
names_df.to_excel(
  'output/scientists_names_series_df.xls', engine='openpyxl'
)
2.5.3.2 DataFrames

From the preceding example, you can see how to export a DataFrame to an Excel file. The documentation shows several ways to further fine-tune the output.15 For example, you can output data to a specific “sheet” using the sheet_name parameter.

15. .to_excel() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html

# saving a DataFrame into Excel format
scientists.to_excel(
  "output/scientists_df.xlsx",
  sheet_name="scientists",
  index=False
)

2.5.4 Feather

The format called “feather” is used to save DataFrames into a binary object that can also be loaded into other languages (e.g., R). The main benefit of this approach is that it is faster than writing and reading a CSV file between the languages. See the Pandas .to_feather()16 and feather file format documentation17 for more information on storing for backwards compatibility.

16. Pandas to_feather() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_feather.html

17. Feather file format documentation: https://arrow.apache.org/docs/python/feather.html

The feather formatter is installed via conda install -c conda-forge pyarrow or pip install pyarrow. More on installing packages are described in Appendix B.

You can use the .to_feather() method on a dataframe to save the feather objects.

# save to feather file
scientists.to_feather('output/scientists.feather')

# read feather file
sci_feather = pd.read_feather('output/scientists.feather')

print(sci_feather)
                   Name        Born        Died  Age  
0     Rosaline Franklin  1920-07-25  1958-04-16   61
1        William Gosset  1876-06-13  1937-10-16   45
2  Florence Nightingale  1820-05-12  1910-08-13   37
3           Marie Curie  1867-11-07  1934-07-04   77
4         Rachel Carson  1907-05-27  1964-04-14   90
5             John Snow  1813-03-15  1858-06-16   56
6           Alan Turing  1912-06-23  1954-06-07   66
7          Johann Gauss  1777-04-30  1855-02-23   41
           Occupation    born_dt    died_dt   age_days  age_years  
0             Chemist 1920-07-25 1958-04-16 13779 days       37.0
1        Statistician 1876-06-13 1937-10-16 22404 days       61.0
2               Nurse 1820-05-12 1910-08-13 32964 days       90.0
3             Chemist 1867-11-07 1934-07-04 24345 days       66.0
4           Biologist 1907-05-27 1964-04-14 20777 days       56.0
5           Physician 1813-03-15 1858-06-16 16529 days       45.0
6  Computer Scientist 1912-06-23 1954-06-07 15324 days       41.0
7       Mathematician 1777-04-30 1855-02-23 28422 days       77.0
  age_days_assign  age_year_assign
0      13779 days             37.0
1      22404 days             61.0
2      32964 days             90.0
3      24345 days             66.0
4      20777 days             56.0
5      16529 days             45.0
6      15324 days             41.0
7      28422 days             77.0

2.5.5 Arrow

Feather files are part of the Apache Arrow project.18 One of the main goals of Arrow is to have a memory storage format for dataframe objects that work across multiple programming languages without having to convert types for each of them.

18. Apache Arrow: https://arrow.apache.org/docs/index.html]

Arrow has its own Pandas integration19 to convert Pandas DataFrame objects to Arrow objects (from_pandas()20) and from Arrow objects to Pandas DataFrame objects (to_pandas()21). Once the data is in an Arrow format, it can much more efficiently be used in other programming languages.

19. Arrow Pandas integration: https://arrow.apache.org/docs/python/pandas.html

20. Arrow from_pandas(): https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas

21. Arrow to_pandas(): https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas

2.5.6 Dictionary

The Pandas Series and DataFrame objects also have a .to_dict() method. This converts the object into a Python dictionary object. This format is particularly useful if you have a DataFrame or Series and you want to use the data from outside Pandas.

Let’s create a smaller subset of the scientist data so all the dictionary data will display properly

# first 2 rows of data
sci_sub_dict = scientists.head(2)

# convert the dataframe into a dictionary
sci_dict = sci_sub_dict.to_dict()

# using the pretty print library to print the dictionary
import pprint
pprint.pprint(sci_dict)
{'Age': {0: 61, 1: 45},
 'Born': {0: '1920-07-25', 1: '1876-06-13'},
 'Died': {0: '1958-04-16', 1: '1937-10-16'},
 'Name': {0: 'Rosaline Franklin', 1: 'William Gosset'},
 'Occupation': {0: 'Chemist', 1: 'Statistician'},
 'age_days': {0: Timedelta('13779 days 00:00:00'),
              1: Timedelta('22404 days 00:00:00')},
 'age_days_assign': {0: Timedelta('13779 days 00:00:00'),
                     1: Timedelta('22404 days 00:00:00')},

 'age_year_assign': {0: 37.0, 1: 61.0},
 'age_years': {0: 37.0, 1: 61.0},
 'born_dt': {0: Timestamp('1920-07-25 00:00:00'),
             1: Timestamp('1876-06-13 00:00:00')},
 'died_dt': {0: Timestamp('1958-04-16 00:00:00'),
             1: Timestamp('1937-10-16 00:00:00')}}

Once the dictionary output is created, we can read it back into Pandas.

# read in the dictionary object back into a dataframe
sci_dict_df = pd.DataFrame.from_dict(sci_dict)
print(sci_dict_df)
                Name        Born        Died  Age    Occupation  
0  Rosaline Franklin  1920-07-25  1958-04-16   61       Chemist
1     William Gosset  1876-06-13  1937-10-16   45  Statistician
     born_dt    died_dt   age_days  age_years age_days_assign  
0 1920-07-25 1958-04-16 13779 days       37.0      13779 days
1 1876-06-13 1937-10-16 22404 days       61.0      22404 days
  age_year_assign
0            37.0
1            61.0

2.5.7 JSON (JavaScript Objectd Notation)

JSON data is another common plain text file format. The benefit of using the .to_jsion() is that it can convert dates and times for you to read back into Pandas. By using orient='records' we can either pass in the variable or copy and paste from the output to load it back into Pandas. The indent=2 allows the output to print a bit nicer to the screen (and book).

# convert the dataframe into a dictionary
sci_json = sci_sub_dict.to_json(
  orient='records', indent=2, date_format="iso"
)
pprint.pprint(sci_json)
('[
'
 '  {
'
 '    "Name":"Rosaline Franklin",
'
 '    "Born":"1920-07-25",
'
 '    "Died":"1958-04-16",
'
 '    "Age":61,
'
 '    "Occupation":"Chemist",
'
 '    "born_dt":"1920-07-25T00:00:00.000Z",
'
 '    "died_dt":"1958-04-16T00:00:00.000Z",
'
 '    "age_days":"P13779DT0H0M0S",
'
 '    "age_years":37.0,
'
 '    "age_days_assign":"P13779DT0H0M0S",
'
 '    "age_year_assign":37.0
'
 '  },
'
 '  {
'
 '    "Name":"William Gosset",
'
 '    "Born":"1876-06-13",
'
 '    "Died":"1937-10-16",
'
 '    "Age":45,
'
 '    "Occupation":"Statistician",
'
 '    "born_dt":"1876-06-13T00:00:00.000Z",
'
 '    "died_dt":"1937-10-16T00:00:00.000Z",
'
 '    "age_days":"P22404DT0H0M0S",
'
 '    "age_years":61.0,
'
 '    "age_days_assign":"P22404DT0H0M0S",
'
 '    "age_year_assign":61.0
'
 '  }
'
 ']')
# copy the string to re-create the dataframe
sci_json_df = pd.read_json(
  ('[
'
 '  {
'
 '    "Name":"Rosaline Franklin",
'
 '    "Born":"1920-07-25",
'
 '    "Died":"1958-04-16",
'
 '    "Age":61,
'
 '    "Occupation":"Chemist",
'
 '    "born_dt":"1920-07-25T00:00:00.000Z",
'
 '     "died_dt":"1958-04-16T00:00:00.000Z",
'
 '    "age_days":"P13779DT0H0M0S",
'
 '    "age_years":37.0,
'
 '    "age_days_assign":"P13779DT0H0M0S",
'
 '    "age_year_assign":37.0
'
 '  },
'
 '  {
'

 '    "Name":"William Gosset",
'
 '    "Born":"1876-06-13",
'
 '    "Died":"1937-10-16",
'
 '    "Age":45,
'
 '    "Occupation":"Statistician",
'
 '    "born_dt":"1876-06-13T00:00:00.000Z",
'
 '    "died_dt":"1937-10-16T00:00:00.000Z",
'
 '    "age_days":"P22404DT0H0M0S",
'
 '    "age_years":61.0,
'
 '    "age_days_assign":"P22404DT0H0M0S",
'
 '    "age_year_assign":61.0
'
 '  }
'
 ']'),
  orient="records"
)
print(sci_json_df)
                Name        Born        Died   Age    Occupation 
0  Rosaline Franklin  1920-07-25  1958-04-16    61       Chemist
1     William Gosset  1876-06-13  1937-10-16    45  Statistician
                    born_dt                   died_dt  
0  1920-07-25T00:00:00.000Z  1958-04-16T00:00:00.000Z
1  1876-06-13T00:00:00.000Z  1937-10-16T00:00:00.000Z
         age_days  age_years age_days_assign  age_year_assign
0  P13779DT0H0M0S         37  P13779DT0H0M0S               37
1  P22404DT0H0M0S         61  P22404DT0H0M0S               61

Notice how the dates are all different from the original values? That’s because we choose to convert the dates into ISO 8601 string format.

print(sci_json_df.dtypes)
Name               object
Born               object
Died               object
Age                 int64
Occupation         object
born_dt            object
died_dt            object
age_days           object
age_years           int64
age_days_assign    object
age_year_assign     int64
dtype: object

If we want the original datetime object back, we need to convert that representation back into a date.

sci_json_df["died_dt_json"] = pd.to_datetime(sci_json_df["died_dt"])

print(sci_json_df)
                Name        Born        Died  Age    Occupation  
0  Rosaline Franklin  1920-07-25  1958-04-16   61       Chemist
1     William Gosset  1876-06-13  1937-10-16   45  Statistician

                    born_dt                   died_dt  
0  1920-07-25T00:00:00.000Z  1958-04-16T00:00:00.000Z
1  1876-06-13T00:00:00.000Z  1937-10-16T00:00:00.000Z

         age_days  age_years  age_days_assign  age_year_assign  
0  P13779DT0H0M0S         37   P13779DT0H0M0S               37
1  P22404DT0H0M0S         61   P22404DT0H0M0S               61

                died_dt_json
0  1958-04-16 00:00:00+00:00
1  1937-10-16 00:00:00+00:00
print(sci_json_df.dtypes)
Name                             object
Born                             object
Died                             object
Age                               int64
Occupation                       object
born_dt                          object
died_dt                          object
age_days                         object
age_years                         int64
age_days_assign                  object
age_year_assign                   int64
died_dt_json        datetime64[ns, UTC]
dtype: object

Working with dates and times is always tricky. We talk more about them in Chapter 12.

2.5.8 Other Data Output Types

There are many ways Pandas can export and import data. Indeed, .to_pickle(), .to_csv(), .to_excel(), .to_feather(), .to_dict() are only some of the data formats that can make their way into Pandas DataFrames. Table 2.4 lists some of these other output formats.

Table 2.4 DataFrame Export Methods

Export Method

Description

.to_clipboard()

Save data into the system clipboard for pasting

.to_dense()

Convert data into a regular “dense” DataFrame

.to_dict()

Convert data into a Python dict

.to_gbq()

Convert data into a Google BigQuery table

.to_hdf()

Save data into a hierarchal data format (HDF)

.to_msgpack()

Save data into a portable JSON-like binary

.to_html()

Convert data into a HTML table

.to_json()

Convert data into a JSON string

.to_latex()

Convert data into a LATEX tabular environment

.to_records()

Convert data into a record array

.to_string()

Show DataFrame as a string for stdout

.to_sparse()

Convert data into a SparceDataFrame

.to_sql()

Save data into a SQL database

.to_stata()

Convert data into a Stata dta file

Conclusion

This chapter went into a little more detail about how the Pandas Series and DataFrame objects work in Python. There were some simpler examples of data cleaning shown, along with a few common ways to export data to share with others. Chapter 1 and Chapter 2 should give you a good basis on how Pandas works as a library.

The next chapter covers the basics of plotting in Python and Pandas. Data visualization is not only used at the end of an analysis to plot results, but also is heavily utilized throughout the entire data pipeline.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.142.136.226