How it works...

In Step 1, we imported the pandas library. Then, we used the shell command head (with the additional argument -n 5) to preview the first five rows of the CSV file. This comes in handy when we want to determine what kind of data we are dealing with, without opening a potentially large text file.
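When head is not available (for example, outside a Unix-like shell), a similar preview can be done in plain Python. The following is a minimal sketch, not the recipe's own code: the preview helper and the demo file are hypothetical and only illustrate reading the first few lines without loading the rest of the file.

```python
import itertools
import tempfile
from pathlib import Path


def preview(path, n=5):
    """Return the first n lines of a text file without reading the rest.

    Hypothetical helper, mimicking `head -n 5`.
    """
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so a large file is never fully read
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Hypothetical demo file with made-up content, created only for this example
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv"
    demo.write_text(
        "\n".join(f"{i},{i * 10}" for i in range(100)), encoding="utf-8"
    )
    first_lines = preview(demo)

print(first_lines)
```

Because the file object is consumed lazily, the cost of the preview does not depend on the size of the file.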

Inspecting a few rows of the dataset raised the following questions:

What is the separator?
Are there any special characters that need to be escaped?
Are there any missing values and, if so, what scheme (NA, empty string, nan, and so on) is used for them?
What variable types (floats, integers, strings) are in the file? Based on that information, we can try to optimize memory usage while loading the file.
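Some of these questions can also be answered programmatically. The sketch below is only an illustration under stated assumptions: the sample text, its column names, and its values are invented, and csv.Sniffer is a heuristic that may guess wrong on unusual files, so its result is worth double-checking against the preview.

```python
import csv
import io

import pandas as pd

# Hypothetical sample, invented for illustration (not taken from the dataset)
sample = (
    "id;limit_bal;education;age\n"
    "0;20000;university;24\n"
    "1;120000;NA;26\n"
    "2;90000;high school;\n"
)

# What is the separator? Let the standard library guess among common candidates.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")

df = pd.read_csv(io.StringIO(sample), sep=dialect.delimiter, index_col=0)

# Are there missing values? By default, pandas treats both "NA" and
# empty fields as missing.
missing_per_column = df.isna().sum()

# What variable types are in the file? Inspect the inferred dtypes.
print(df.dtypes)

# Knowing the types, we can shrink memory usage, for example by
# downcasting an integer column to the smallest type that fits its values.
memory_before = df.memory_usage(deep=True).sum()
df["limit_bal"] = pd.to_numeric(df["limit_bal"], downcast="integer")
memory_after = df.memory_usage(deep=True).sum()
```

Note that a column with missing values, such as age in this sample, is loaded as a float even when all present values are integers.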

In Step 3, we loaded the CSV file by using the pd.read_csv function. When doing so, we indicated that the first column (zero-indexed) contained the index, and that empty strings should be interpreted as missing values. In the last step, we separated the features from the target by using the pop method. It returns the given column, which we assigned to a new variable, and removes that column from the source DataFrame.
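The loading and splitting logic described above can be sketched on a tiny in-memory example. This is an illustration under assumptions, not the recipe's exact code: the CSV text and its values are made up, and the target column name default_payment_next_month is assumed for the sake of the example.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for the real file; the values are made up
# and the target column name is an assumption.
csv_text = (
    ",limit_bal,age,default_payment_next_month\n"
    "0,20000,24,1\n"
    "1,120000,,0\n"
    "2,90000,34,0\n"
)

# index_col=0 uses the first column as the index;
# na_values="" marks empty strings as missing values.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values="")

# pop returns the target column and removes it from the DataFrame in place,
# so we work on a copy to keep the original df intact.
X = df.copy()
y = X.pop("default_payment_next_month")

print(X.columns.tolist())
print(y.tolist())
```

Because pop modifies the DataFrame in place, calling it a second time on the same column raises a KeyError, which is worth keeping in mind when rerunning notebook cells.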

Below, we provide a simplified description of the variables:

limit_bal: The amount of the given credit (NT dollars)
sex: Gender
education: Level of education
marriage: Marital status
age: Age of the customer
payment_status_{month}: Status of payments in one of the previous 6 months
bill_statement_{month}: The amount of bill statements (NT dollars) in one of the previous 6 months
previous_payment_{month}: The amount of previous payments (NT dollars) in one of the previous 6 months

The target variable indicates whether the customer defaulted on the payment in the following month.
