© Randy Betancourt, Sarah Chen 2019
R. Betancourt and S. Chen, Python for SAS Users, https://doi.org/10.1007/978-1-4842-5001-3_3

3. pandas Library

Randy Betancourt1  and Sarah Chen2
(1)
Chadds Ford, PA, USA
(2)
Livingston, NJ, USA
 

In this chapter we introduce the pandas library, which provides the DataFrame data structure. The pandas library evolved from the arrays in Python’s NumPy package used in mathematical and scientific computing applications, and it opens a new world of possibilities for data analysis. The main structures in the pandas library are the Series and the DataFrame, which organize dissimilar types of data (strings, integers, and floats) into a single structure and make it easy to apply methods or functions to all or portions of the data.

Some of the key features of the pandas library include
  • Provides two array-like data structures, Series and DataFrame

  • Easy detection and handling of missing data

  • Easy slicing and subsetting operations

  • Merging and joining of multiple DataFrames

  • Automatic index objects for both rows and columns with multi-level/hierarchical indexing

  • Date range capabilities for custom date frequencies used in time series analysis

  • Loaders for input/output accepting a wide range of tabular data sources

  • Wide range of built-in analysis methods

  • Moving window statistics for rolling averages, rolling standard deviations, and so on

In many ways pandas DataFrames are very similar to and behave much like SAS datasets. See Table 3-1.
Table 3-1

pandas and SAS Nomenclature

pandas     SAS
---------  -----------
DataFrame  Dataset
Row        Observation
Column     Variable
Groupby    By-Group
NaN        . (period)
Slice      Subset

For the remainder of the book, pandas is central to understanding Python in a data analysis context. pandas is not part of the Python Standard Library and must be imported as shown in Listing 3-1. For our introduction to pandas, we import the numpy and pandas libraries. We also import the numpy.random.randn random number generator to draw a sample from the “standard normal” distribution.1
>>> import numpy as np
>>> import pandas as pd
>>> from numpy.random import randn
Listing 3-1

Importing Libraries

The first two import statements are followed by the keyword as and an identifier functioning as shorthand for calling methods and functions associated with the library. The third import statement loads the randn random number generator from the numpy.random library. The numpy.random library provides an extensive set of generators for creating distributions and permutations. On line 3 the from statement imports just the randn generator feature. Later, if we need to draw a sample from a binomial distribution, the import statement to use is
from numpy.random import binomial
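A quick sketch of the binomial generator (the arguments here are illustrative), drawing five samples of the number of successes in ten trials with success probability 0.5:

```python
import numpy as np
from numpy.random import binomial

# Seed for reproducibility, as with randn() later in this chapter
np.random.seed(54321)

# 5 draws of the number of successes in 10 trials, p = 0.5
draws = binomial(10, 0.5, size=5)
print(draws)
```

Each draw is an integer between 0 and 10, the possible success counts for ten trials.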

We begin by examining the Series data structure followed by a detailed examination of the DataFrame. A Series is a one-dimensional array container for a collection of either numerics or strings (but not both together). In a Series the numerics must all be the same data type. A DataFrame is a container for Series. DataFrames may contain strings and numerics with different data types.
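A minimal sketch of this distinction, using invented values: mixing integer and floating-point values in a Series upcasts all elements to a single type, while a DataFrame keeps one type per column so strings and numerics can coexist:

```python
import pandas as pd

# Mixing integers and a float in a Series upcasts everything to float64
s = pd.Series([1, 2, 3.5])
print(s.dtype)

# A DataFrame holds one type per column: object and int64 here
df = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})
print(df.dtypes)
```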

Column Types

Before examining the details for Series and DataFrames, an overview of column types is useful. Previously in Chapter 2, “Python Types and Formatting,” we discussed Python types. The pandas library is built on top of Python’s NumPy library and inherits many of the NumPy types. Table 3-2 lists the column types encountered in this book.
Table 3-2

Common pandas Types

pandas Data Type  Python Data Type  Usage
----------------  ----------------  -----
object            str               Sequence of text
int64             int               Integer value
float64           float             Floating-point value
bool              bool              True/False
datetime64        N/A               Date and time values
timedelta         N/A               Difference between two datetimes
category          N/A               Corresponds to statistical categorical variables, e.g., gender

In general, we do not need to concern ourselves with types. Occasionally a program raises an error, or returns unexpected results, because of an operation on mismatched types. An example is summing numerical values from two different columns where, unbeknownst to you, both columns have a type of “object”. Rather than summing the values, the result is concatenated values, since the plus (+) symbol is used for string concatenation. Clearly not the results you are expecting. Later in the book, we encounter the DataFrame astype() method for converting types.
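This pitfall is easy to reproduce with a small invented DataFrame whose numeric-looking columns are actually type object:

```python
import pandas as pd

# Two numeric-looking columns that are actually stored as strings (object)
df = pd.DataFrame({'a': ['1', '2'], 'b': ['3', '4']})

# The + operator concatenates object columns instead of summing them
concat = df['a'] + df['b']
print(concat.tolist())

# astype() converts the columns so that + performs arithmetic
summed = df['a'].astype('int64') + df['b'].astype('int64')
print(summed.tolist())
```

The first result is ['13', '24'] (string concatenation); after astype() the result is [4, 6].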

In the following examples, we encounter the types object, int64, and float64. In Chapter 7, “Date and Time,” we discuss the datetime64 and timedelta types. The pandas type object is similar to a SAS character variable used to manipulate string values. The int64 and float64 are pandas types for integers and floating-point numbers, respectively. Both of these types are similar to a SAS numeric.

Series

From a SAS context, a Series can be thought of as a one-dimensional array with labels. A Series includes a default index used as a key to locate element values. Let’s begin by constructing a Series with ten randomly generated numeric elements. Listing 3-2 illustrates the construction of a Series by calling the numpy.random function randn(). The randn() function draws a sample from the “standard normal” distribution. In order to make the Series reproducible, the random number generator is seeded with an initial value of 54321.

In this example, s1 is the Series name containing ten numeric elements with a type of float64. Notice the assignment statement on the second line. pd.Series is being called using pd as the identifier assigned following the keyword as from the import statement in Listing 3-1.

The print() function is combined with the head() method with an argument of 5 to display the first five elements in the Series. Series are automatically indexed beginning at position 0. This index position is displayed to the left of the element values.
>>> np.random.seed(54321)
>>> s1 = pd.Series(randn(10))
>>> print(s1.head(5))
0    0.223979
1    0.744591
2   -0.334269
3    1.389172
4   -2.296095
dtype: float64
Listing 3-2

Create Series of Random Values

Contrast the Series created in Listing 3-2 with Listing 3-3.
4 data _null_;
5    call streaminit(54321);
6
7  array s2 {10};
8     do i = 1 to 10;
9       s2{i} = rand("Normal");
10      if i <= 5 then put s2{i};
11    end;
12 run;
OUTPUT:
-1.364866914
1.9696792198
0.5123294653
-0.597981097
-0.895650739
Listing 3-3

SAS Array of Random Values

Unlike most other programming languages, an array in the SAS language is not a data structure. Rather, an array in SAS is used as a convenience to iterate over groups of similar variables or values assigned to the array elements. While our analogy is imperfect, it remains useful for contrasting the two languages.

In Listing 3-3 we use a _NULL_ Data Step with a DO/END loop to generate ten random values from the normal random number generator. Similar to the preceding Python example, the CALL STREAMINIT routine is used to supply a seed value to enable a reproducible stream of random numbers.

Line 7
array s2 {10};

defines a one-dimensional array called s2 containing ten elements.

Line 9
s2{i} = rand("Normal");

iterates over the array, where i is the loop index variable, calling the SAS rand("Normal") function and loading the generated values into the ten array elements.

Listing 3-4 illustrates the creation of a Series with a user-defined index. Elements from a Series can be returned by their index value, as illustrated here.

Index values are assigned to the Series using the Python list ([ ]) structure as a container. A Python list is a mutable structure for holding a collection of items. In this example, the syntax
index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
creates a Python list called index to hold these values.
>>> np.random.seed(54321)
>>> s2 = pd.Series(randn(10), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
>>> print(s2.head(5))
a    0.223979
b    0.744591
c   -0.334269
d    1.389172
e   -2.296095
dtype: float64
Listing 3-4

Series Index Labels

Series elements are returned either by their default index position or by a user-defined index value shown in Listing 3-5. In this example, the first element of the Series is returned.
>>> print(s2[0])
0.22397889127879958
>>> print(s2['a'])
0.22397889127879958
Listing 3-5

Returning Series Elements

In contrast, arrays in SAS are indexed starting at position 1 by default, so element retrieval uses a non-zero start position. The same logic is shown using SAS in Listing 3-6.
4 data _null_;
5   call streaminit(54321);
6
7 array s2 {10} ;
8    do i = 1 to 10;
9      s2{i} = rand("Normal");
10     if i = 1 then put s2{i};
11   end;
12 run;
OUTPUT:
-1.364866914
Listing 3-6

SAS Array Indexed Retrieval

The DO/END block executes once based on the IF statement and writes the array element value to the SAS log.

Retrieval of values from a Series follows the string-slicing pattern presented in Chapter 2, “Python Types and Formatting.” Listing 3-7 illustrates this syntax.
>>> print(s2[:3])
a    0.223979
b    0.744591
c   -0.334269
dtype: float64
>>> print(s2[:'c'])
a    0.223979
b    0.744591
c   -0.334269
dtype: float64
Listing 3-7

Series Returning First Three Elements

The value to the left of the colon (:) separator is the start position for the Series’ index location, and values to the right identify the stop position for the element location. An empty value at the start position defaults to the first element (position 0), and an empty value for the stop value defaults to the last element (Series length –1).
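One subtlety worth noting: positional slicing excludes the stop position, while label-based slicing includes the stop label. A small invented Series makes the difference visible:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Positional slicing excludes the stop position: elements 0, 1, 2
pos = s[:3]

# Label-based slicing includes the stop label 'd'
lab = s['b':'d']

print(pos.tolist())
print(lab.tolist())
```

Here s[:3] returns [10, 20, 30], while s['b':'d'] returns [20, 30, 40], including the element labeled 'd'.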

Listing 3-8 is the SAS program with similar logic. It uses an IF statement within the DO/END block to return the first three array elements and print their values to the SAS log.
4 data _null_;
5   call streaminit(54321);
6
7 array s2 {10} ;
8    do i = 1 to 10;
9      s2{i} = rand("Uniform");
10     if i <= 3 then put s2{i};
11    end;
12 run;
OUTPUT:
0.4322317771
0.5977982974
0.7785986471
Listing 3-8

Array Returning First Three Elements

pandas allows mathematical operations to be combined with Boolean comparisons as a condition for returning elements from a Series. This is illustrated in Listing 3-9.
>>> s2[s2 < s2.mean()]
2   -0.334269
4   -2.296095
7   -0.082760
8   -0.651688
9   -0.016022
dtype: float64
Listing 3-9

Series Mathematical Operation

In the preceding example, the Boolean < operator is used to return any element in the Series whose value is less than the arithmetic mean value for all the Series elements.
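Boolean comparisons can also be combined. A brief sketch using the same seeded Series, where each condition is parenthesized and joined with the & operator:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(54321)
s2 = pd.Series(randn(10))

# Combine comparisons with & (and) or | (or);
# each condition must be parenthesized
subset = s2[(s2 > s2.mean()) & (s2 < 1)]
print(subset)
```

Every element returned is above the Series mean and below 1.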

DataFrames

You can think of a pandas DataFrame as a collection of Series arranged into a relational-like structure with labels. There are a number of different constructor methods for creating DataFrames. DataFrames have a number of different readers to load values from a wide range of input sources, including .csv files, DBMS tables, web scraping, REST APIs, and SAS datasets (.sas7bdat files), to name a few. These DataFrame readers are similar to the SAS/ACCESS product line.

The following is a basic introduction to the DataFrame read_csv() reader method. Chapter 6, “pandas Readers and Writers,” covers this topic in more detail. This example uses the publicly available UK Accidents Report Data from January 1, 2015, to December 31, 2015, available from

https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data

Listing 3-10 illustrates the syntax.

Note

Since the time of this writing, the data available at this web site has changed, and the remaining examples utilize a copy of this data at

https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/uk_accidents.csv

>>> import pandas as pd
>>> file_loc = "C:\\Data\\uk_accidents.csv"
>>> df = pd.read_csv(file_loc)
Listing 3-10

DataFrame read_csv Method, Example 1

Line 2 of this example defines the file_loc object to identify the path for reading the .csv file from the local filesystem. Note the doubled backslashes (\\) used to escape the backslashes in the Windows pathname. Without the doubling, a syntax error is raised. Line 3 constructs the DataFrame df by calling the read_csv() method. Unlike SAS, the read operation is silent, with no status information returned after the read completes unless an error is raised.
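An alternative to doubling the backslashes is a raw string literal, which tells Python not to treat the backslash as an escape character:

```python
# These two path spellings are equivalent on Windows:
# escaped backslashes versus a raw string literal
path1 = "C:\\Data\\uk_accidents.csv"
path2 = r"C:\Data\uk_accidents.csv"
print(path1 == path2)
```

One caveat: a raw string literal cannot end with a single backslash, so trailing path separators still need escaping.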

Listing 3-11 illustrates reading the same .csv file directly without needing to download the file manually.
>>> df = pd.read_csv("https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/uk_accidents.csv")
>>> print(df.shape)
(266776, 27)
Listing 3-11

DataFrame read_csv Method, Example 2

The remainder of the examples in this chapter are dependent on executing Listing 3-11.

The print() function displays the value from the DataFrame shape attribute to indicate the number of rows and columns in the DataFrame.

Listing 3-12 illustrates the syntax for PROC IMPORT from Base SAS software to read the UK accident .csv file stored on GitHub.
4  filename git_csv temp;
5  proc http
6     url="https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/uk_accidents.csv"
7     method="GET"
8     out=git_csv;
9
10 proc import datafile = git_csv
11      dbms=csv
12      out=uk_accidents;
13  run;
NOTE: 266776 records were read from the infile GIT_CSV.
      The minimum record length was 65.
      The maximum record length was 77.
NOTE: The dataset WORK.UK_ACCIDENTS has 266776 observations and 27 variables.
266776 rows created in WORK.UK_ACCIDENTS from GIT_CSV.
NOTE: WORK.UK_ACCIDENTS dataset was successfully created.
Listing 3-12

SAS PROC IMPORT

The filename statement on line 4 defines the fileref git_csv as a temporary location on the local filesystem to output the .csv file. The syntax
proc import datafile = git_csv

reads the .csv file downloaded by PROC HTTP .

DataFrame Validation

pandas provides a number of methods to begin the validation process. As an example, the shape attribute returns a DataFrame’s row and column counts. This is illustrated in Listing 3-13.
>>> print(df.shape)
(266776, 27)
Listing 3-13

DataFrame Characteristics

In this example, the DataFrame df has 266,776 rows and 27 columns.

In order to understand how the data is physically organized, use the info() method to display information such as column names, column count, index structure, and so on.

Listing 3-14 illustrates the use of the info() method to return detailed information on the contents of a DataFrame. The DataFrame info() method returns output similar to that from PROC CONTENTS for understanding physical characteristics of a SAS dataset.
>>> print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266776 entries, 0 to 266775
Data columns (total 27 columns):
Accident_Severity               266776 non-null int64
Number_of_Vehicles              266776 non-null int64
Number_of_Casualties            266776 non-null int64
Day_of_Week                     266776 non-null int64
Time                            266752 non-null object
Road_Type                       266776 non-null int64
Speed_limit                     266776 non-null int64
Junction_Detail                 266776 non-null int64
Light_Conditions                266776 non-null int64
Weather_Conditions              266776 non-null int64
Road_Surface_Conditions         266776 non-null int64
Urban_or_Rural_Area             266776 non-null int64
Vehicle_Reference               266776 non-null int64
Vehicle_Type                    266776 non-null int64
Skidding_and_Overturning        266776 non-null int64
Was_Vehicle_Left_Hand_Drive_    266776 non-null int64
Sex_of_Driver                   266776 non-null int64
Age_of_Driver                   266776 non-null int64
Engine_Capacity__CC_            266776 non-null int64
Propulsion_Code                 266776 non-null int64
Age_of_Vehicle                  266776 non-null int64
Casualty_Class                  266776 non-null int64
Sex_of_Casualty                 266776 non-null int64
Age_of_Casualty                 266776 non-null int64
Casualty_Severity               266776 non-null int64
Car_Passenger                   266776 non-null int64
Date                            266776 non-null object
dtypes: int64(25), object(2)
memory usage: 55.0+ MB
None
Listing 3-14

Examining the DataFrame’s Organization

The first line of output identifies the DataFrame as belonging to the object class pandas.core.frame.DataFrame.

Next, information about the RangeIndex object is presented. The RangeIndex object is used as a row label. Like a SAS dataset, DataFrames can be subset by rows, by columns, or some combination of both. DataFrame row labels are identifiers for performing certain row-based subsetting operations similar to the behavior of the SAS automatic variable _N_. Chapter 5, “Data Management,” covers this topic in more detail.

When a DataFrame is constructed and no column or collection of columns is defined as the index, then by default the RangeIndex object is created using a start position of 0 and an end position of DataFrame length –1. Alternatively, a DataFrame can be constructed designating a column or collection of columns as the index using the syntax
df = pd.read_csv('d:\\data\\Customers.csv', index_col = ['Last_Name', 'First_Name'])

Further, the DataFrame index can be altered dynamically. The DataFrame method set_index() is used to designate any column or collection of columns as the index after the DataFrame is created.
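A minimal sketch of set_index(), using an invented DataFrame (the column names are illustrative only):

```python
import pandas as pd

# A small invented DataFrame
df = pd.DataFrame({'Last_Name': ['Smith', 'Jones'],
                   'Age': [34, 29]})

# Designate Last_Name as the index after the DataFrame is created
df = df.set_index('Last_Name')

# Rows can now be located by the Last_Name label
print(df.loc['Smith', 'Age'])
```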

The bulk of the output from the info() method displays the column attribute information. The last column from the info() method output displays the DataFrame column types. In this example, there are 25 numeric columns with type int64 and 2 columns with type object, similar to a character variable in a SAS dataset.

The dtype attribute is used to return the type for any DataFrame column as shown in Listing 3-15.
>>> df['Date'].dtype
dtype('O')
Listing 3-15

DataFrame dtype Attribute

This rather sparse output indicates the Date column in the df DataFrame is type object; in other words the column values are strings. Of course, in order to do any meaningful time-based analysis with the Date column, it must have datetime64 as its type.

In Listing 3-10 and Listing 3-11, the read_csv() method uses default type mappings when constructing the DataFrame from the .csv file. In Chapter 6, “pandas Readers and Writers,” we illustrate in detail how to take advantage of date parsers in order to load datetime values stored as strings directly into DataFrame columns with a datetime64 type.

For example, the syntax
file_loc = "C:\\Data\\uk_accidents.csv"
df = pd.read_csv(file_loc, parse_dates=['Date'])

uses the parse_dates= list to designate that the .csv column named Date be read using the default date parser, resulting in a DataFrame column with type datetime64 without the need for an intermediate conversion step.
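The effect of parse_dates= can be sketched without downloading the file by reading a small in-memory .csv (the values here are invented stand-ins for the accident data):

```python
import io
import pandas as pd

# A small in-memory .csv standing in for the accident file
csv = io.StringIO("Date,Accident_Severity\n1/9/2015,3\n2/23/2015,3\n")

# parse_dates= converts the Date column to datetime64 during the read
df = pd.read_csv(csv, parse_dates=['Date'])
print(df['Date'].dtype)
```

Without parse_dates=, the Date column would be read as type object.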

DataFrame Inspection

Inspection refers to the task of displaying portions of the values contained in a DataFrame to increase understanding of the data contents. In SAS this is accomplished with features like PROC PRINT and PROC SQL. The head() and tail() methods display by default the first five rows and the last five rows, respectively, of a DataFrame. If you are familiar with Unix, you will recognize that these DataFrame method names derive from the head command, which displays the first lines of a file, and the tail command, which displays the last lines. You can pass an integer argument to the tail() method, for example:
df.tail(24)

displays the last 24 rows of the df DataFrame.

Listing 3-16 illustrates use of an integer argument of 10 to return the last ten rows.
>>> df.tail(10)
        Accident_Severity     ...            Date
266766                  2     ...       8/30/2015
266767                  3     ...      11/29/2015
266768                  3     ...      11/29/2015
266769                  3     ...      11/29/2015
266770                  3     ...       7/26/2015
266771                  3     ...       7/26/2015
266772                  3     ...      12/31/2015
266773                  3     ...       7/28/2015
266774                  3     ...       7/28/2015
266775                  3     ...       7/15/2015
[10 rows x 27 columns]
Listing 3-16

DataFrame tail() method

Notice the first column of output from the tail(10) method displays the RangeIndex values. As an analogy, SAS uses the FIRSTOBS= and OBS= dataset options with most procedures as a means to determine which observations to include for processing.

For example, consider Listing 3-17.
4 proc print data = uk_accidents (firstobs = 266767);
5   var Accident_Severity Date;
6 run;
NOTE: There were 10 observations read from the dataset WORK.UK_ACCIDENTS.
Listing 3-17

PROC PRINT Firstobs =

The FIRSTOBS= dataset option is set to 266767 in order to display the last ten observations in the SAS dataset WORK.uk_accidents created in Listing 3-12. The SAS Note indicates ten rows were read from the dataset. Figure 3-1 displays the output.
Figure 3-1

Print Last Ten OBS

Notice that the last row label for the DataFrame in Listing 3-16, DataFrame tail function is 266775, while the last observation number in the SAS output is 266776. Recall the default row index value for a DataFrame starts at 0, and the SAS automatic variable _N_ starts at 1.

As one would expect, the head(5) function returns the first five rows from a DataFrame as illustrated in Listing 3-18.
>>> df.head(5)
   Accident_Severity    ...           Date
0                  3    ...       1/9/2015
1                  3    ...       1/9/2015
2                  3    ...      2/23/2015
3                  3    ...      2/23/2015
4                  3    ...      2/23/2015
[5 rows x 27 columns]
Listing 3-18

DataFrame head() Function

The syntax for displaying the first five observations in a SAS dataset is
proc print data=uk_accidents (obs = 5);
Listing 3-19 illustrates the DataFrame describe() function to report measures of central tendency, dispersion, and shape for all numeric types. Due to space limitations, only the first two columns of output are displayed. The describe() function display is similar to output produced with PROC MEANS in SAS.
>>> df.describe()
       Accident_Severity  Number_of_Vehicles      ...
count      266776.000000       266776.000000      ...
mean            2.871529            4.103994      ...
std             0.361889            4.388495      ...
min             1.000000            1.000000      ...
25%             3.000000            2.000000      ...
50%             3.000000            2.000000      ...
75%             3.000000            3.000000      ...
max             3.000000           37.000000      ...
[8 rows x 25 columns]
Listing 3-19

DataFrame describe() Function

Listing 3-20 returns output similar to that of the describe() function.
4 proc means data = uk_accidents;
5    var Accident_Severity Number_of_Vehicles;
6 run;
Listing 3-20

PROC MEANS

The output from Listing 3-20 is shown in Figure 3-2.
Figure 3-2

PROC MEANS Output

DataFrames have a variety of methods to slice and dice (subset) data. We will review these features in detail in Chapter 4, “Indexing and GroupBy.” For now we introduce the slicing operator [[ ]] to select a set of rows and/or columns from a DataFrame.

Listing 3-21 illustrates the [[ ]] slicing operator returning columns by labels chained with the head() function to display the first ten rows.
>>> df[['Sex_of_Driver', 'Time']].head(10)
   Sex_of_Driver   Time
0              1  19:00
1              1  19:00
2              1  18:30
3              2  18:30
4              1  18:30
5              1  17:50
6              1  17:50
7              1   7:05
8              1   7:05
9              1  12:30
Listing 3-21

DataFrame Column Slicing

In SAS, the syntax for displaying the first ten observations for the Sex_of_Driver and Time variables is
proc print data=uk_accidents (obs = 10);
   var Sex_of_Driver Time;

As one would expect, there are a large number of statistical and visualization techniques that can be applied to all or portions of a DataFrame to derive meaningful information and insights. One such example is the hist() method to render a histogram.

Listing 3-22 displays the Python code used to create the histogram in Figure 3-3.
>>> import matplotlib.pyplot as plt
>>> df.hist(column='Sex_of_Driver', grid=False)
>>> plt.show()
Listing 3-22

Histogram of Accident Rates by Gender

Figure 3-3

Histogram of Accident Rates by Gender

From the supplied metadata, we know the variable Sex_of_Driver value of 1 maps to males, 2 to females, 3 to not known and –1 to data missing or out of range.

From this simple histogram, we see males have an accident rate over twice that of females.

Missing Data

One of the more challenging aspects of data analysis is the treatment of missing data. Data may be missing for a wide range of reasons. It may be that the data is not available or was not collected. Sometimes values are recorded inaccurately. Whatever the reasons for encountering missing data, we need a good strategy for dealing with this challenge.

pandas uses two built-in sentinel values for handling missing data: the Python None object and NaN (not a number). The None object is often used to indicate missing values in Python code, and it serves as the missing value indicator for DataFrame columns with a type of object (character strings). In contrast, NaN is a special floating-point value used to indicate missing values for the float64 type. This means NaN is specific to floating-point values, with no equivalent for strings, integers, or other pandas types.

However, this is generally not an issue. To hold missing data in an integer column, the column is upcast to type float64. pandas performs this conversion transparently when you assign the None object into a column with type int64.

As an example consider Listing 3-23 and Listing 3-24 together.
>>> import pandas as pd
>>> df1 = pd.DataFrame([['cold', 9],
...                     ['warm', 4],
...                     [None  , 4]],
...                     columns=['Strings', 'Integers'])
>>> print(df1)
  Strings  Integers
0    cold         9
1    warm         4
2    None         4
>>> print(df1.dtypes)
Strings     object
Integers     int64
Listing 3-23

Construct df1 DataFrame

Listing 3-23 creates the df1 DataFrame with two columns labeled Strings and Integers. The type for the Strings column is object and the type for the column Integers is int64.

The first print() function displays the df1 DataFrame. The second print() function returns the column types. Since all values for the Integers column are integer values, the type for the column is naturally int64. On row 2 the value for the Strings column is missing as indicated by the None object for its value.

Now consider Listing 3-24. This example illustrates how pandas automatically upcast the DataFrame’s Integers column from int64 to float64 in order to accommodate updating the value 9 to None, indicating a missing value.
>>> df1.loc[df1.Integers == 9, 'Integers'] = None
>>> print(df1)
  Strings  Integers
0    cold       NaN
1    warm       4.0
2    None       4.0
>>> print(df1.dtypes)
Strings      object
Integers    float64
Listing 3-24

Update df1 DataFrame

The first line of this example uses the DataFrame loc() indexer and Boolean logic to locate the row where the Integers column value is 9 and update this value to missing using the None object. The corresponding SAS Data Step syntax to update a numerical value to missing is
if integers = 9 then integers = .;

In Listing 3-23 the type of the Integers column is int64. As a result of the in-place update of the value 9 to the Python None object, the type of this column is now float64. Also notice how the output from the print() function in Listing 3-24 displays the updated value as NaN, the missing value indicator for float64, even though the None object was assigned. This is a result of the column type being automatically upcast to accommodate missing values, with NaN used as the missing data indicator for columns with type float64.
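pandas detects both sentinels uniformly. A brief sketch using the pd.isnull() function on scalar values:

```python
import pandas as pd

# isnull() reports both missing value sentinels as missing
print(pd.isnull(None))          # the None object
print(pd.isnull(float('nan')))  # the NaN floating-point value

# Ordinary values, including zero, are not missing
print(pd.isnull(0))
```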

To further the understanding of missing value treatment by pandas, consider Listing 3-25 to create the df2 DataFrame .
>>> import pandas as pd
>>> df2 = pd.DataFrame([['cold','slow', None, 2.7, 6.6, 3.1],
...                   ['warm', 'medium', 4.2, 5.1, 7.9, 9.1],
...                   ['hot', 'fast', 9.4, 11.0, None, 6.8],
...                   ['cool', None, None, None, 9.1, 8.9],
...                   ['cool', 'medium', 6.1, 4.3, 12.2, 3.7],
...                   [None, 'slow', None, 2.9, 3.3, 1.7],
...                   [None, 'slow', None, 2.9, 3.3, 1.7]],
...                   columns=['Temp', 'Speed', 'Measure1', 'Measure2', 'Measure3', 'Measure4'])
>>> print(df2)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       NaN       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       NaN       6.8
3  cool    None       NaN       NaN       9.1       8.9
4  cool  medium       6.1       4.3      12.2       3.7
5  None    slow       NaN       2.9       3.3       1.7
6  None    slow       NaN       2.9       3.3       1.7
>>> print(df2.dtypes)
Temp         object
Speed        object
Measure1    float64
Measure2    float64
Measure3    float64
Measure4    float64
Listing 3-25

pandas Missing Values

In this example, the Python None object is used to indicate NAs, or missing values, for columns with type object (character data) and columns with type float64 (numeric data). Even though we use the Python None object to indicate a missing value for columns Measure1–Measure4, the sentinel value NaN is used to represent these missing values. That is because columns Measure1–Measure4 have a type of float64.

Missing Value Detection

With this basic knowledge of missing values for pandas, the next step is understanding missing value detection along with missing value replacement. Table 3-3 lists the functions applied to a pandas DataFrame or Series to detect and replace missing values.
Table 3-3

pandas Missing Replacement Functions

Function   Action Taken
---------  ------------
isnull()   Generates a Boolean mask indicating missing values
notnull()  Opposite of isnull()
dropna()   Returns a filtered copy of the original DataFrame
fillna()   Returns a copy of the original DataFrame with missing values filled or imputed

Each of these functions is examined in detail in the following text.
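A minimal sketch of dropna() and fillna() from Table 3-3, using a small invented DataFrame with one entirely missing row:

```python
import pandas as pd

# An invented DataFrame with missing values in the second row
df = pd.DataFrame({'Temp': ['cold', None, 'hot'],
                   'Measure1': [1.5, None, 9.4]})

# dropna() returns a filtered copy without rows containing missing values
dropped = df.dropna()
print(dropped)

# fillna() returns a copy with missing values replaced per column
filled = df.fillna({'Temp': 'unknown', 'Measure1': 0.0})
print(filled)
```

dropna() removes the row containing missing values, leaving two rows; fillna() keeps all three rows with the missing entries replaced.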

In SAS there are a number of different approaches to use for missing value detection and replacement. One approach is to write a Data Step that iterates over variables and uses IF/THEN logic to test for and replace missing values, reading the entire dataset observation by observation.

Python permits a similar approach using the isnull() function illustrated in Listing 3-26.
>>> for col_name in df2.columns:
...      print (col_name, end="---->")
...      print (sum(df2[col_name].isnull()))
...
Temp---->2
Speed---->1
Measure1---->4
Measure2---->1
Measure3---->1
Measure4---->0
Listing 3-26

Count Missing Values Iteratively

In this example, a Python for loop iterates over the columns in the df2 DataFrame (created in Listing 3-25) and uses the sum() method combined with the isnull() method to return a count of missing values for each column. While this example returns the correct results, there is a more Pythonic approach for achieving the same results.

As an aside, if you find yourself devising programming algorithms in Python using iterative methods, stop and take time to do research. Chances are a method or function for the object, in this case, a DataFrame, already exists.

Listing 3-27 is a case in point. Rather than iterating over the columns in df2 DataFrame shown in Listing 3-26, a more direct and efficient approach is to use the DataFrame isnull() method combined with the sum() method for the entire df2 DataFrame.
>>> df2.isnull().sum()
Temp        2
Speed       1
Measure1    4
Measure2    1
Measure3    1
Measure4    0
Listing 3-27

isnull Returning Missing Count

Listing 3-26 uses three lines of Python code to iterate over the df2 columns and call the isnull() method combined with the sum() method. In contrast, Listing 3-27 requires only one line of code.

An approach to identifying missing values with SAS is illustrated in Listing 3-28. In this example, PROC FORMAT bins all values into missing and non-missing “buckets.” By default SAS uses a period (.) for missing numeric values and a blank (ASCII 32) to indicate missing for character variables. Therefore, a user-defined format is needed for both numeric and character variables.
4  data df;
5     infile cards;
6     input temp $4.
7           speed $7.
8       @14 measure1
9       @18 measure2
10      @23 measure3
11      @28 measure4 ;
12  list;
13  datalines;
RULE:      ----+----1----+----2----+----3----+----4----+----5
14         cold slow    .   2.7  6.6  3.1
15         warm medium  4.2 5.1  7.9  9.1
16         hot  fast    9.4 11.0 .    6.8
17         cool         .   .    9.1  8.9
18         cool medium  6.1 4.3  12.2 3.7
19              slow    .   2.9  3.3  1.7
20              slow    .   2.9  3.3  1.7
NOTE: The dataset WORK.DF has 7 observations and 6 variables.
21  ;;;;
22  proc print;
NOTE: There were 7 observations read from the dataset WORK.DF.
23  proc format;
24    value $missfmt ' '='Missing' other='Not Missing';
NOTE: Format $MISSFMT has been output.
25    value  missfmt  . ='Missing' other='Not Missing';
NOTE: Format MISSFMT has been output.
26 proc freq data=df;
27    format _CHARACTER_ $missfmt.;
28    tables _CHARACTER_ / missing missprint nocum nopercent;
29
30    format _NUMERIC_ missfmt.;
31    tables _NUMERIC_ / missing missprint nocum nopercent;
32 run;
Listing 3-28

PROC FORMAT to Identify Missing Values

The user-defined $missfmt. format bins character variable values into two buckets: those with a blank value (ASCII 32) to indicate a missing value labeled “Missing” and all other values labeled “Not Missing”. Similarly, the user-defined missfmt. format bins numeric variables into two buckets: those with a period (.) labeled “Missing” and all other values labeled “Not Missing”.

Figure 3-4 displays the output from PROC PRINT. The character variables temp and speed use blanks (ASCII 32) to indicate missing values. The numeric variables measure1–measure4 use a period (.) to indicate missing values.
../images/440803_1_En_3_Chapter/440803_1_En_3_Fig4_HTML.jpg
Figure 3-4

SAS Dataset df

Figure 3-5 displays output from PROC FREQ calling the user-defined $missfmt. format for binning missing/non-missing character variables and the missfmt. format for binning missing/non-missing numeric variables. The MISSING option on the TABLES statement is required in order for PROC FREQ to include missing values in the output.
../images/440803_1_En_3_Chapter/440803_1_En_3_Fig5_HTML.jpg
Figure 3-5

SAS Missing Value Detection with PROC FREQ

isnull() Method
Calling the isnull() method returns a DataFrame of Boolean values. This is illustrated in Listing 3-29.
>>> print(df2)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       NaN       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       NaN       6.8
3  cool    None       NaN       NaN       9.1       8.9
4  cool  medium       6.1       4.3      12.2       3.7
5  None    slow       NaN       2.9       3.3       1.7
6  None    slow       NaN       2.9       3.3       1.7
>>> df2.isnull()
    Temp  Speed  Measure1  Measure2  Measure3  Measure4
0  False  False      True     False     False     False
1  False  False     False     False     False     False
2  False  False     False     False      True     False
3  False   True      True      True     False     False
4  False  False     False     False     False     False
5   True  False      True     False     False     False
6   True  False      True     False     False     False
Listing 3-29

isnull() Boolean Mask

The isnull() method returns a DataFrame of Boolean values with None objects and NaN’s returned as True while non-missing values are returned as False. Unlike SAS, empty or blank values for DataFrame columns with an object type are not considered missing.

The notnull() method is the inverse of the isnull() method. It produces a Boolean mask where null or missing values are returned as False and non-missing values are returned as True.
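As a brief sketch (with df2 rebuilt from the values shown in this section), the Boolean mask returned by notnull() can be used to subset a DataFrame to the rows where a given column is non-missing:

```python
import pandas as pd

# df2 rebuilt from the values shown in this section
df2 = pd.DataFrame(
    [['cold', 'slow', None, 2.7, 6.6, 3.1],
     ['warm', 'medium', 4.2, 5.1, 7.9, 9.1],
     ['hot', 'fast', 9.4, 11.0, None, 6.8],
     ['cool', None, None, None, 9.1, 8.9],
     ['cool', 'medium', 6.1, 4.3, 12.2, 3.7],
     [None, 'slow', None, 2.9, 3.3, 1.7],
     [None, 'slow', None, 2.9, 3.3, 1.7]],
    columns=['Temp', 'Speed', 'Measure1', 'Measure2',
             'Measure3', 'Measure4'])

# Boolean mask: True where Temp is non-missing
mask = df2['Temp'].notnull()

# Keep only the rows where Temp has a value (rows 0 through 4)
print(df2[mask])
```

This row-wise filtering style is analogous to a subsetting IF statement in a SAS Data Step.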

The behavior of missing values for DataFrames used in mathematical operations and functions is similar to SAS’ behavior. In the case of DataFrames
  • Row-wise operations on missing data propagate missing values.

  • For methods and functions, missing values are treated as zero.

  • If all data values are missing, the results from methods and functions will be 0.

To illustrate, consider the following examples. Listing 3-30 illustrates the syntax for creating a new column labeled Sum_M3_M4 by summing columns Measure3 and Measure4 row-wise.
>>> df2['Sum_M3_M4'] = df2['Measure3'] + df2['Measure4']
>>> print(df2[['Measure3', 'Measure4', 'Sum_M3_M4']])
   Measure3  Measure4  Sum_M3_M4
0       6.6       3.1        9.7
1       7.9       9.1       17.0
2       NaN       6.8        NaN
3       9.1       8.9       18.0
4      12.2       3.7       15.9
5       3.3       1.7        5.0
6       3.3       1.7        5.0
Listing 3-30

Addition Data with NaN’s

In a row-wise arithmetic operation, missing values (NaNs) are propagated. Also observe the print() function used to display a subset of columns from DataFrame df2. Chapter 5, “Data Management,” details indexing and slicing operations for DataFrames. The syntax
print(df2[['Measure3', 'Measure4', 'Sum_M3_M4']])

is a form of DataFrame slicing [[ ]], in this case, columns Measure3, Measure4, and Sum_M3_M4 are returned as a DataFrame, then passed to the print() method.

This is analogous to the var statement in PROC PRINT for SAS:
proc print data=df2;
   var measure3 measure4 sum_m3_m4;

Also recall that variable names are case-sensitive in Python and case-insensitive in SAS.

Listing 3-31 returns the arithmetic sum for the column labeled Sum_M3_M4. The sum() method is applied to the column Sum_M3_M4, and in this operation, a NaN is mapped to zero. In general Python methods will ignore missing data; however, that is not always the case.
>>> print(df2[['Sum_M3_M4']])
   Sum_M3_M4
0        9.7
1       17.0
2        NaN
3       18.0
4       15.9
5        5.0
6        5.0
>>> df2['Sum_M3_M4'].sum()
70.6
Listing 3-31

Sum Function with NaN’s, Example 1

In this example, the sum() method is combined with the column df2['Sum_M3_M4'] returning the value 70.6. Strictly speaking, df2['Sum_M3_M4'] is a Series.

Listing 3-32 illustrates this same behavior with respect to missing values.
4  proc sql;
5    create table df_sum as
6       select *
7              , (measure3 + measure4) as Sum_M3_M4
8      from df;
NOTE: Table WORK.DF_SUM created, with 7 rows and 7 columns.
9
10      select Sum_M3_M4
11      from df_sum;
NOTE: Writing HTML Body file: sashtml.htm
12
13      select 'Sum Function Applied to Measure 5'
14             ,sum(Sum_M3_M4)
15      from df_sum;
16 quit;
Listing 3-32

SAS Sum Function with Missing

The PROC SQL SELECT statement creates the variable Sum_M3_M4 on the SAS dataset df_sum. Observe how this sum operation is analogous to Listing 3-31. Like the Python example, the addition SAS performs with
(measure3 + measure4) as Sum_M3_M4

as the second variable in the SELECT statement propagates a missing value. For observation 3, measure3 is missing and measure4 has a value of 6.8 resulting in a missing value for the variable Sum_M3_M4 displayed as the first piece of output in Figure 3-6.

The second SELECT statement displays the values for the variable Sum_M3_M4. The third SELECT statement uses the SUM function to add the values in the column Sum_M3_M4. The output is displayed as the second piece of output in Figure 3-6.
../images/440803_1_En_3_Chapter/440803_1_En_3_Fig6_HTML.jpg
Figure 3-6

SAS SUM Function with Missing Values

Consider Listing 3-33. As discussed previously, the default behavior for the sum() method is to map the NaN value on row 2 to the value 0. Listing 3-33 illustrates how to override this default behavior.
>>> print(df2[['Sum_M3_M4']])
   Sum_M3_M4
0        9.7
1       17.0
2        NaN
3       18.0
4       15.9
5        5.0
6        5.0
>>> df2['Sum_M3_M4'].sum(skipna=False)
nan
Listing 3-33

Sum Method with NaN’s, Example 2

The parameter skipna=False forces the sum() method to return a NaN if a value of NaN is encountered in the operation.
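The skipna= parameter is not unique to sum(); other reduction methods such as mean() accept it as well. A short sketch using the Sum_M3_M4 values shown above:

```python
import pandas as pd

s = pd.Series([9.7, 17.0, None, 18.0, 15.9, 5.0, 5.0], name='Sum_M3_M4')

# Default skipna=True: the NaN is excluded, so the mean is 70.6 / 6
print(s.mean())

# skipna=False: any NaN in the input propagates to the result
print(s.mean(skipna=False))
```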

Listing 3-34 illustrates adding the column labeled Measure5 to DataFrame df2 with all values set to missing using the None object. Applying the sum() method to Measure5 returns a value of zero (0) since all values in this column are missing.
>>> df2['Measure5'] = None
>>> print(df2['Measure5'])
0    None
1    None
2    None
3    None
4    None
5    None
6    None
Name: Measure5, dtype: object
>>> df2['Measure5'].sum()
0
Listing 3-34

Sum Method with All Missing Values

In contrast, the SAS SUM function returns missing (.) when all of its input values are missing, unless a zero is included in the argument list. Consider Listing 3-35.
4   data _null_;
5     do i= 1 to 7;
6        measure5 = .;
7        put measure5;
8        sum_m5 = sum(measure5);
9        sum_0_m5 = sum(0, measure5);
10    end;
11    put 'SUM(measure5) Returns:' sum_m5;
12    put 'SUM(0, measure5) Returns:' sum_0_m5;
OUTPUT:
.
.
.
.
.
.
.
SUM(measure5) Returns:.
SUM(0, measure5) Returns:0
Listing 3-35

SAS Sum Function with All Missing Values

A _NULL_ Data Step iterates seven times, setting the variable measure5 to missing on each iteration and printing the default SAS sentinel missing value for numerics, a period (.), to the log. The statement on line 8
sum_m5 = sum(measure5);
illustrates the default behavior of the SAS SUM function returning missing (.) since all values for the variable measure5 are set to missing. The syntax
sum_0_m5 = sum(0, measure5);

illustrates overriding this default behavior and having the SUM function return zero (0) when all input values to the function are missing.

Dropping Missing Values

One method for dealing with missing values is to drop the row or column in which the missing value is found. This approach is illustrated in Listing 3-36.
>>> df2['Measure5'] = 0
>>> df3 = df2.dropna()
>>> print(df3)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
1  warm  medium       4.2       5.1       7.9       9.1
4  cool  medium       6.1       4.3      12.2       3.7
Listing 3-36

dropna() Method

Calling the dropna() method without any parameters drops those rows where one or more values are missing, regardless of the column's type. The default behavior for dropna() is to operate along rows, or axis 0, and return a new DataFrame, leaving the original DataFrame unmodified. In this example, DataFrame df3 is derived from DataFrame df2 with the rows containing missing values dropped. Alternatively, the default inplace=False parameter can be set to True in order to perform an in-place update on the original DataFrame.
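The dropna() method also accepts a how= parameter. As a sketch (with df2 rebuilt from the values shown in this section), how='any', the default, drops a row if any value is missing, while how='all' drops a row only when every value is missing:

```python
import pandas as pd

# df2 rebuilt from the values shown in this section
df2 = pd.DataFrame(
    [['cold', 'slow', None, 2.7, 6.6, 3.1],
     ['warm', 'medium', 4.2, 5.1, 7.9, 9.1],
     ['hot', 'fast', 9.4, 11.0, None, 6.8],
     ['cool', None, None, None, 9.1, 8.9],
     ['cool', 'medium', 6.1, 4.3, 12.2, 3.7],
     [None, 'slow', None, 2.9, 3.3, 1.7],
     [None, 'slow', None, 2.9, 3.3, 1.7]],
    columns=['Temp', 'Speed', 'Measure1', 'Measure2',
             'Measure3', 'Measure4'])

# how='any' (the default): only the 2 fully populated rows survive
print(len(df2.dropna(how='any')))

# how='all': no row is entirely missing, so all 7 rows survive
print(len(df2.dropna(how='all')))
```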

Listing 3-37 illustrates an in-place update for the dropna() method dropping rows with missing values.
>>> import pandas as pd
>>> df4 = pd.DataFrame([['cold','slow', None, 2.7, 6.6, 3.1],
...                   ['warm', 'medium', 4.2, 5.1, 7.9, 9.1],
...                   ['hot', 'fast', 9.4, 11.0, None, 6.8],
...                   ['cool', None, None, None, 9.1, 8.9],
...                   ['cool', 'medium', 6.1, 4.3, 12.2, 3.7],
...                   [None, 'slow', None, 2.9, 3.3, 1.7],
...                   [None, 'slow', None, 2.9, 3.3, 1.7]],
...                   columns=['Temp', 'Speed', 'Measure1',
                      'Measure2', 'Measure3', 'Measure4'])
>>> print(df4)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       NaN       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       NaN       6.8
3  cool    None       NaN       NaN       9.1       8.9
4  cool  medium       6.1       4.3      12.2       3.7
5  None    slow       NaN       2.9       3.3       1.7
6  None    slow       NaN       2.9       3.3       1.7
>>> df4.dropna(inplace=True)
>>> print(df4)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
1  warm  medium       4.2       5.1       7.9       9.1
4  cool  medium       6.1       4.3      12.2       3.7
Listing 3-37

dropna() Update in Place

Unlike Listing 3-36, this example does not make an assignment to create a new DataFrame. Instead, the original DataFrame df4 is updated in place with the inplace=True parameter.

The SAS CMISS function can be used to detect and then delete observations containing one or more missing values as illustrated in Listing 3-38.
4   data df4;
5      set df;
6      if cmiss(of _all_) then delete;
7
NOTE: There were 7 observations read from the dataset WORK.DF.
NOTE: The dataset WORK.DF4 has 2 observations and 6 variables.
8   proc print data = df4;
9   run;
Listing 3-38

CMISS Delete Observation with Any Missing Values

The argument to the CMISS function uses the automatic SAS variable list _ALL_ to designate all variables in the dataset. In this case, the OF keyword for the CMISS function is required; otherwise, the code produces the error
Cannot use _all_ as a variable name.
Figure 3-7 displays the resulting SAS dataset df4.
../images/440803_1_En_3_Chapter/440803_1_En_3_Fig7_HTML.jpg
Figure 3-7

Results from CMISS Function

The dropna() method also works along a column axis. DataFrames refer to rows as axis 0 and columns as axis 1. The default behavior for the dropna() method is to operate along axis 0 or rows.

Listing 3-39 illustrates dropping any column containing missing values.
>>> print(df2)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       NaN       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       NaN       6.8
3  cool    None       NaN       NaN       9.1       8.9
4  cool  medium       6.1       4.3      12.2       3.7
5  None    slow       NaN       2.9       3.3       1.7
6  None    slow       NaN       2.9       3.3       1.7
>>> df2.dropna(axis=1)
   Measure4
0       3.1
1       9.1
2       6.8
3       8.9
4       3.7
5       1.7
6       1.7
Listing 3-39

dropna() Along Axis 1

This example returns a DataFrame by dropping any column containing one or more missing values. The same results can be accomplished with the following syntax.
df2.dropna(axis = 'columns')

The parameter axis = 'columns' is an alternative to axis = 1.
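The dropna() method also accepts a subset= parameter that restricts which columns are examined for missing values. A sketch, again using the df2 values shown in this section:

```python
import pandas as pd

# df2 rebuilt from the values shown in this section
df2 = pd.DataFrame(
    [['cold', 'slow', None, 2.7, 6.6, 3.1],
     ['warm', 'medium', 4.2, 5.1, 7.9, 9.1],
     ['hot', 'fast', 9.4, 11.0, None, 6.8],
     ['cool', None, None, None, 9.1, 8.9],
     ['cool', 'medium', 6.1, 4.3, 12.2, 3.7],
     [None, 'slow', None, 2.9, 3.3, 1.7],
     [None, 'slow', None, 2.9, 3.3, 1.7]],
    columns=['Temp', 'Speed', 'Measure1', 'Measure2',
             'Measure3', 'Measure4'])

# Drop a row only when Temp is missing; missing values in the
# other columns are ignored, so rows 5 and 6 are dropped
out = df2.dropna(subset=['Temp'])
print(out)
```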

Listing 3-40 illustrates dropping any variable containing one or more missing values. One approach is to
  • Create an ODS output table using PROC FREQ NLEVELS option to identify those variables in the SAS dataset containing one or more missing values. Figure 3-8 displays the results of this step.

  • Execute a PROC SQL SELECT INTO : statement to create the Macro variable &drop containing the names of variables with missing values, separated by a blank (ASCII 32). The FROM clause identifies the output dataset created by the ODS OUTPUT statement with PROC FREQ.

  • A SAS Data Step creates the output dataset df2, in this case, by reading the input dataset df using the SAS dataset option DROP followed by the Macro variable &drop containing the DROP list created in the PROC SQL step. Figure 3-9 displays the resulting SAS dataset.

4 ods output nlevels=nlvs;
5   proc freq data=df nlevels;
6      tables _all_;
7 ods select nlevels;
NOTE: The dataset WORK.NLVS has 6 observations and 4 variables.
NOTE: There were 7 observations read from the dataset WORK.DF.
8 proc print data=nlvs;run;
NOTE: There were 6 observations read from the dataset WORK.NLVS.
9  proc sql noprint ;
10    select tablevar into :drop separated by ' '
11        from nlvs
12     where NMissLevels ne 0;
13 quit;
14
15 data df2;
16    set df(drop=&drop);
NOTE: There were 7 observations read from the dataset WORK.DF.
NOTE: The dataset WORK.DF2 has 7 observations and 1 variables.
17  proc print data=df2;run;
NOTE: There were 7 observations read from the dataset WORK.DF2.
Listing 3-40

Dropping SAS Variables with Missing Values

The SAS statements
ods output nlevels=nlvs;
   proc freq data=df nlevels;
      tables _all_;
ods select nlevels;

creates a SAS dataset called nlvs. PROC FREQ with the NLEVELS option executes to produce the desired output. By default, PROC FREQ produces a number of different pieces of tabular output. The statement ODS SELECT NLEVELS selects the NLEVELS object from PROC FREQ and outputs its values as a series of observations and variables into the SAS dataset nlvs.

Figure 3-8 illustrates the results of executing the syntax
proc print data=nlvs;
run;
../images/440803_1_En_3_Chapter/440803_1_En_3_Fig8_HTML.jpg
Figure 3-8

ODS Table Output from PROC FREQ

The PROC SQL step uses the
select tablevar into :drop separated by ' '
to build the Macro variable &drop containing names of the variables to be dropped separated by a blank (ASCII 32). The WHERE clause
NMissLevels ne 0

selects those variable names containing one or more missing values. The variable tablevar holds the names of the variables in the SAS dataset df.

Figure 3-9 displays the PROC PRINT output for the SAS dataset df2 created by dropping the variables with missing values held in the Macro variable &drop.
../images/440803_1_En_3_Chapter/440803_1_En_3_Fig9_HTML.jpg
Figure 3-9

Results from Dropping Variables with Missing Values

Listing 3-41 illustrates the removal of duplicate rows in a DataFrame.
>>> df6 = df2.drop_duplicates()
>>> print(df6)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       NaN       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       NaN       6.8
3  cool    None       NaN       NaN       9.1       8.9
4  cool  medium       6.1       4.3      12.2       3.7
5  None    slow       NaN       2.9       3.3       1.7
Listing 3-41

drop_duplicates()

This example creates the DataFrame df6 by applying the drop_duplicates() method to the df2 DataFrame. In DataFrame df2, rows 5 and 6 have identical values, so the duplicate row 6 is dropped.
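The drop_duplicates() method also accepts subset= and keep= parameters. As a sketch (df2 rebuilt from the values shown in Listing 3-41), subset= restricts the comparison to the named columns, and keep='last' retains the final occurrence of each duplicate instead of the first:

```python
import pandas as pd

# df2 rebuilt from the values shown in this section
df2 = pd.DataFrame(
    [['cold', 'slow', None, 2.7, 6.6, 3.1],
     ['warm', 'medium', 4.2, 5.1, 7.9, 9.1],
     ['hot', 'fast', 9.4, 11.0, None, 6.8],
     ['cool', None, None, None, 9.1, 8.9],
     ['cool', 'medium', 6.1, 4.3, 12.2, 3.7],
     [None, 'slow', None, 2.9, 3.3, 1.7],
     [None, 'slow', None, 2.9, 3.3, 1.7]],
    columns=['Temp', 'Speed', 'Measure1', 'Measure2',
             'Measure3', 'Measure4'])

# Compare rows on Temp only, keeping the last occurrence of each value;
# missing values are treated as equal to one another for this comparison
dedup = df2.drop_duplicates(subset=['Temp'], keep='last')
print(dedup)
```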

PROC SORT in SAS has a number of different options for handling duplicate observations in a SAS dataset. Listing 3-42 illustrates the NODUPRECS option for PROC SORT.
4       proc sort data = df
5          out = df6
6          noduprecs;
7       by measure1;
NOTE: 7 observations were read from "WORK.df"
NOTE: 1 duplicate observation were found
NOTE: Dataset "WORK.df6" has 6 observation(s) and 6 variable(s)
8       proc print data=df6;run;
NOTE: 6 observations were read from "WORK.df6"
Listing 3-42

Drop Duplicate Records with SAS

The NODUPRECS option identifies observations with identical values for all variables and removes them from the output dataset.

As you can see from the preceding examples, the dropna() method drops a fair amount of “good” data. Rather than dropping an entire row or column whenever a missing value is encountered, the dropna() method accepts the thresh= parameter, which specifies the minimum number of non-missing values a row or column must contain in order to be kept. Listing 3-43 illustrates this feature.
>>> print(df2)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       NaN       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       NaN       6.8
3  cool    None       NaN       NaN       9.1       8.9
4  cool  medium       6.1       4.3      12.2       3.7
5  None    slow       NaN       2.9       3.3       1.7
6  None    slow       NaN       2.9       3.3       1.7
>>> df7 = df2.dropna(thresh=4)
>>> print(df7)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       NaN       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       NaN       6.8
4  cool  medium       6.1       4.3      12.2       3.7
5  None    slow       NaN       2.9       3.3       1.7
6  None    slow       NaN       2.9       3.3       1.7
Listing 3-43

dropna(thresh=4)

The example creates the DataFrame df7 by setting the thresh= parameter to 4. The thresh=4 parameter iterates through all rows and keeps each row that has at least four non-missing values. Row 3 is dropped since it contains only three non-missing values.

Imputation

Of course, rather than dropping an entire row or column, missing values can be replaced or imputed using mathematical and statistical functions. The fillna() method returns a Series or DataFrame by replacing missing values with derived values. One approach is to replace all missing numerical values (NaN) with zeros. Listing 3-44 illustrates the fillna() method.
>>> print(df2)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       NaN       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       NaN       6.8
3  cool    None       NaN       NaN       9.1       8.9
4  cool  medium       6.1       4.3      12.2       3.7
5  None    slow       NaN       2.9       3.3       1.7
6  None    slow       NaN       2.9       3.3       1.7
>>> df8 = df2.fillna(0)
>>> print(df8)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       0.0       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       0.0       6.8
3  cool       0       0.0       0.0       9.1       8.9
4  cool  medium       6.1       4.3      12.2       3.7
5     0    slow       0.0       2.9       3.3       1.7
6     0    slow       0.0       2.9       3.3       1.7
Listing 3-44

Replacing NaN’s with Zeros

In this example, the fillna() method is applied to the entire DataFrame. The None objects in columns Temp and Speed are replaced with zero (0), and the NaN's in columns Measure1–Measure4 are replaced with the numeric value zero (0).

Listing 3-45 illustrates mapping NaN's for the numeric columns only.
>>> print(df2)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       NaN       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       NaN       6.8
3  cool    None       NaN       NaN       9.1       8.9
4  cool  medium       6.1       4.3      12.2       3.7
5  None    slow       NaN       2.9       3.3       1.7
6  None    slow       NaN       2.9       3.3       1.7
>>> df9 = df2[['Measure1', 'Measure2', 'Measure3', 'Measure4']].fillna(0)
>>> print(df9)
   Measure1  Measure2  Measure3  Measure4
0       0.0       2.7       6.6       3.1
1       4.2       5.1       7.9       9.1
2       9.4      11.0       0.0       6.8
3       0.0       0.0       9.1       8.9
4       6.1       4.3      12.2       3.7
5       0.0       2.9       3.3       1.7
6       0.0       2.9       3.3       1.7
Listing 3-45

fillna() Method Over Column Subset

The fillna(0) method with a parameter value of zero (0) replaces NaN's in columns Measure1–Measure4 with a numeric zero (0).

Listing 3-46 passes different replacement values for each column in a single DataFrame; the fillna() method accepts a Dictionary of values as a parameter.

A Python Dictionary is a collection of key:value pairs where keys have unique values within the Dictionary. A pair of braces creates an empty Dictionary: { }. Placing a comma-separated list of key:value pairs within the braces adds key:value pairs to the Dictionary.
>>> df10 = df2.fillna({
...     'Temp'  : 'cold',
...     'Speed' : 'slow',
...     'Measure1' : 0,
...     'Measure2' : 0,
...     'Measure3' : 0,
...     'Measure4' : 0,
... })
>>> print(df10)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       0.0       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       0.0       6.8
3  cool    slow       0.0       0.0       9.1       8.9
4  cool  medium       6.1       4.3      12.2       3.7
5  cold    slow       0.0       2.9       3.3       1.7
6  cold    slow       0.0       2.9       3.3       1.7
Listing 3-46

fillna() Method Using a Dictionary

In the Temp column, the string 'cold' replaces the None object, and in the Speed column, the string 'slow' replaces the None object. In columns Measure1–Measure4, NaN's are mapped to zero (0).
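Beyond constant replacement values, pandas also provides forward and backward filling, which propagates the nearest non-missing value. A sketch with the df2 values (ffill() carries the last valid observation down each column; bfill() works upward):

```python
import pandas as pd

# df2 rebuilt from the values shown in this section
df2 = pd.DataFrame(
    [['cold', 'slow', None, 2.7, 6.6, 3.1],
     ['warm', 'medium', 4.2, 5.1, 7.9, 9.1],
     ['hot', 'fast', 9.4, 11.0, None, 6.8],
     ['cool', None, None, None, 9.1, 8.9],
     ['cool', 'medium', 6.1, 4.3, 12.2, 3.7],
     [None, 'slow', None, 2.9, 3.3, 1.7],
     [None, 'slow', None, 2.9, 3.3, 1.7]],
    columns=['Temp', 'Speed', 'Measure1', 'Measure2',
             'Measure3', 'Measure4'])

# Forward fill: each NaN/None takes the last valid value above it;
# row 0's Measure1 stays NaN because nothing precedes it
filled = df2.ffill()
print(filled)
```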

Another imputation method is to replace NaN's with the arithmetic mean from a column having few or no missing values. This assumes the columns have nearly equal measures of dispersion. Listing 3-47 illustrates this approach.
>>> print(df2)
   Temp   Speed  Measure1  Measure2  Measure3  Measure4
0  cold    slow       NaN       2.7       6.6       3.1
1  warm  medium       4.2       5.1       7.9       9.1
2   hot    fast       9.4      11.0       NaN       6.8
3  cool    None       NaN       NaN       9.1       8.9
4  cool  medium       6.1       4.3      12.2       3.7
5  None    slow       NaN       2.9       3.3       1.7
6  None    slow       NaN       2.9       3.3       1.7
>>> df11 = df2[["Measure1", "Measure2", "Measure3"]].fillna(df2.Measure4.mean())
>>> print(df11)
   Measure1  Measure2  Measure3
0       5.0       2.7       6.6
1       4.2       5.1       7.9
2       9.4      11.0       5.0
3       5.0       5.0       9.1
4       6.1       4.3      12.2
5       5.0       2.9       3.3
6       5.0       2.9       3.3
Listing 3-47

fillna() with Arithmetic Mean

In this example, NaN’s in columns Measure1–Measure3 are replaced with the arithmetic mean derived from the column Measure4, which is 5.0.

Listing 3-48 illustrates the same logic in SAS.
4 data df2;
5   infile cards dlm=',';
6   input Temp $
7         Speed $
8          Measure1
9          Measure2
10         Measure3
11         Measure4 ;
12 list;
13 datalines;
RULE:      ----+----1----+----2----+----3----+----4----+----5-
14         cold, slow, ., 2.7, 6.6, 3.1
15         warm, medium, 4.2, 5.1, 7.9, 9.1
16         hot, fast, 9.4, 11.0, ., 6.8
17         cool, , ., ., 9.1, 8.9
18         cool, medium, 6.1, 4.3, 12.2, 3.7
19          , slow, ., 2.9, 3.3, 1.7
20          , slow, ., 2.9, 3.3, 1.7
NOTE: The dataset WORK.DF2 has 7 observations and 6 variables.
21 ;;;;
22 proc sql noprint;
23    select mean(Measure4) into :M4_mean
24    from df2;
25  quit;
26 data df11(drop=i);
27    set df2;
28 array x {3} Measure1-Measure3;
29   do i = 1 to 3;
30      if x(i) = . then x(i) = &M4_mean;
31   end;
32 format Measure1-Measure4 8.1;
NOTE: There were 7 observations read from the dataset WORK.DF2.
38 proc print data = df11; run;
NOTE: There were 7 observations read from the dataset WORK.DF11.
Listing 3-48

Replace Missing Values with Arithmetic Mean

This example uses the PROC SQL SELECT INTO : syntax to create the Macro variable &M4_mean. The SAS Data Step in lines 26 to 31 creates an ARRAY with the variables Measure1–Measure3 as elements. The DO/END block iterates over the ARRAY executing the statement
if x(i) = . then x(i) = &M4_mean;

If any observation value for the variables Measure1–Measure3 equals missing, it is assigned the value from the Macro variable &M4_mean, which is 5.0.

The results from this program are displayed in Figure 3-10.
../images/440803_1_En_3_Chapter/440803_1_En_3_Fig10_HTML.jpg
Figure 3-10

Replace Missing Using PROC SQL
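Listing 3-47 fills three columns with the mean of a fourth. A common variation, sketched below with the df2 values from this section, is to fill each column's NaN's with that column's own mean:

```python
import pandas as pd

# df2 rebuilt from the values shown in this section
df2 = pd.DataFrame(
    [['cold', 'slow', None, 2.7, 6.6, 3.1],
     ['warm', 'medium', 4.2, 5.1, 7.9, 9.1],
     ['hot', 'fast', 9.4, 11.0, None, 6.8],
     ['cool', None, None, None, 9.1, 8.9],
     ['cool', 'medium', 6.1, 4.3, 12.2, 3.7],
     [None, 'slow', None, 2.9, 3.3, 1.7],
     [None, 'slow', None, 2.9, 3.3, 1.7]],
    columns=['Temp', 'Speed', 'Measure1', 'Measure2',
             'Measure3', 'Measure4'])

num_cols = ['Measure1', 'Measure2', 'Measure3', 'Measure4']

# fillna() accepts a Series of per-column fill values;
# df2[num_cols].mean() returns the column means indexed by column name
df12 = df2[num_cols].fillna(df2[num_cols].mean())
print(df12)
```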

Summary

In this chapter we introduced the pandas library and the role of the DataFrame, which is essential to using Python for data analysis tasks. We covered critical topics such as column types and how pandas handles missing values, along with methods for detecting and replacing missing values. With this foundation set, we can begin exploring new methods to expand the repertoire of data exploration and analysis.
