How to do it...

Execute the following steps to investigate and deal with missing values in the dataset.

Import the libraries:

import pandas as pd
import missingno
from sklearn.impute import SimpleImputer

Inspect the information about the DataFrame:

X.info()

Running the code results in the following table:

Visualize the nullity of the DataFrame:

missingno.matrix(X)

Running the line of code results in the following plot:

The white bars visible in the columns represent missing values. The line on the right side of the plot describes the shape of data completeness. The two numbers mark the rows with the maximum and minimum nullity in the dataset: there are 23 columns in total, and the row with the most missing values is missing 2, hence the 21.
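To make the arithmetic behind those two numbers concrete, the following sketch counts the non-missing values per row on a small, made-up DataFrame. The `toy` data and its column values are assumptions for illustration only, not the recipe's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 columns, with a few missing entries.
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "sex": ["F", "M", None, "F"],
    "education": ["grad", None, None, "hs"],
    "limit_bal": [1000, 2000, 1500, 3000],
})

# Number of non-missing values in each row (row completeness).
row_completeness = toy.notna().sum(axis=1)

most_complete = int(row_completeness.max())
least_complete = int(row_completeness.min())

# The least complete row has (number of columns - most missing values) entries.
print(most_complete, least_complete)
```

Here the least complete rows have 2 of 4 values present, which mirrors the 23 - 2 = 21 reasoning in the plot description.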

Define columns with missing values per data type:

NUM_FEATURES = ['age']
CAT_FEATURES = ['sex', 'education', 'marriage']
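The two lists are written by hand in the recipe. As a rough alternative, they could be derived from the data, as in the sketch below. It uses a made-up `toy` DataFrame and assumes that categorical columns are stored as strings; if the categories are encoded as numbers, a dtype-based split like this would misclassify them and the manual lists are the safer choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the recipe's dataset.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "limit_bal": [1000, 2000, 1500],
    "sex": ["F", None, "M"],
    "education": ["grad", "hs", None],
    "payment_status": ["paid", "paid", "late"],
})

# Keep only the columns that contain at least one missing value.
with_na = toy.loc[:, toy.isna().any()]

# Split those columns by data type.
num_features = with_na.select_dtypes(include="number").columns.tolist()
cat_features = with_na.select_dtypes(exclude="number").columns.tolist()

print(num_features, cat_features)
```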

Impute numerical features:

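for col in NUM_FEATURES:
    # replace missing values with the column's median
    num_imputer = SimpleImputer(strategy='median')
    # learn the median from the training set only
    num_imputer.fit(X_train[[col]])
    # apply the same median to both the training and test sets
    X_train.loc[:, col] = num_imputer.transform(X_train[[col]])
    X_test.loc[:, col] = num_imputer.transform(X_test[[col]])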

Impute categorical features:

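for col in CAT_FEATURES:
    # replace missing values with the column's most frequent category
    cat_imputer = SimpleImputer(strategy='most_frequent')
    # learn the most frequent category from the training set only
    cat_imputer.fit(X_train[[col]])
    # apply the same category to both the training and test sets
    X_train.loc[:, col] = cat_imputer.transform(X_train[[col]])
    X_test.loc[:, col] = cat_imputer.transform(X_test[[col]])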
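In both loops the imputer is fitted on the training set and then applied to the test set. The sketch below shows what that means on made-up numbers: the fill value stored in `statistics_` comes from the training data alone, so the test set is filled with the training median. The `train` and `test` frames are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not the recipe's dataset.
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 50.0]})
test = pd.DataFrame({"age": [np.nan, 80.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train[["age"]])

# The median is computed from the non-missing training values (20, 30, 50).
learned_median = float(imputer.statistics_[0])

# transform returns a 2D array; ravel flattens the single column.
train_filled = imputer.transform(train[["age"]]).ravel()
test_filled = imputer.transform(test[["age"]]).ravel()

print(learned_median, train_filled, test_filled)
```

The value 80 in the test set has no influence on the fill value, which is the point of fitting on the training data only.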

Verify that there are no missing values:

X_train.info()

We can inspect the output to confirm that there are no missing values left in X_train.
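Reading the `info()` output works for a handful of columns. A programmatic check is a possible complement, and it can be run on the test set as well. The sketch below uses a hypothetical helper, `count_missing`, and made-up data; neither is part of the recipe.

```python
import numpy as np
import pandas as pd


def count_missing(df):
    """Return the total number of missing cells in a DataFrame."""
    return int(df.isna().sum().sum())


# Hypothetical toy data before and after filling the gaps.
before = pd.DataFrame({"age": [25.0, np.nan], "sex": ["F", None]})
after = before.fillna({"age": 30.0, "sex": "F"})

print(count_missing(before), count_missing(after))
```

In the recipe's setting, the same helper could be called on both X_train and X_test after imputation, with a result of zero indicating that nothing was left unfilled.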
