How it works...

In Step 1, we imported the pandas library. Then, we used the shell command head (with the argument -n 5) to preview the first five lines of the CSV file. This comes in handy when we want to determine what kind of data we are dealing with, without opening a potentially large text file.
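Where the head command is not available (for example, on some Windows setups), the same preview can be approximated in pure Python. The following is a minimal sketch, not the recipe's own code: the helper preview_lines and the generated sample file are hypothetical and exist only to make the example self-contained.

```python
import itertools
import os
import tempfile


def preview_lines(path, n=5):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so a large file is never fully loaded
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Hypothetical sample file standing in for the real CSV
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("id,limit_bal,age\n")
    for i in range(10):
        tmp.write(f"{i},{10000 * (i + 1)},{20 + i}\n")
    sample_path = tmp.name

first_rows = preview_lines(sample_path)
print("\n".join(first_rows))

os.remove(sample_path)  # clean up the temporary file
```

As with head, the header counts as one of the five lines, so the preview contains the header plus four data rows.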

Inspecting a few rows of the dataset raised the following questions:

  • What is the separator?
  • Are there any special characters that need to be escaped?
  • Are there any missing values and, if so, what scheme (NA, empty string, nan and so on) is used for them?
  • What variable types (floats, integers, strings) are in the file? Based on that information, we can try to optimize memory usage while loading the file.
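To illustrate the last question, the sketch below loads a small in-memory CSV twice: once with the default types and once with explicitly chosen, narrower types. The sample data, the column values, and the chosen dtypes are assumptions made for illustration and do not come from the actual dataset.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for the real file
csv_text = (
    "id,limit_bal,sex,age\n"
    "0,20000,Female,24\n"
    "1,120000,Female,26\n"
    "2,90000,Male,34\n"
)

# Peek at a couple of rows to see which types pandas infers by default
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview.dtypes)

# Load with default types for comparison
default_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Load again, requesting narrower types for the columns we inspected
dtypes = {"limit_bal": "int32", "sex": "category", "age": "int8"}
df = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype=dtypes)

# Compare the memory used by the numeric columns (in bytes)
numeric_cols = ["limit_bal", "age"]
default_bytes = default_df[numeric_cols].memory_usage(index=False).sum()
compact_bytes = df[numeric_cols].memory_usage(index=False).sum()
print(default_bytes, compact_bytes)
```

Narrower integer types only work when every value fits the chosen range and the column has no missing values, so it is worth checking both before committing to a dtype.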

In Step 3, we loaded the CSV file by using the pd.read_csv function. When doing so, we indicated that the first column (zero-indexed) contained the index, and that empty strings should be interpreted as missing values. In the last step, we separated the features from the target by using the pop method. It returns the given column, which we assigned to a new variable, and removes that column from the source DataFrame.
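The loading and splitting steps described above can be sketched as follows. This is an illustrative example on a tiny in-memory CSV, not the recipe's exact code: the sample rows and the target column name default_payment_next_month are assumptions.

```python
import io

import pandas as pd

# Hypothetical miniature CSV; the second row has an empty value for age
csv_text = (
    ",limit_bal,age,default_payment_next_month\n"
    "0,20000,24,1\n"
    "1,120000,,1\n"
    "2,90000,34,0\n"
)

# index_col=0 uses the first column as the index;
# na_values="" treats empty strings as missing values
df = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values="")

# Work on a copy so the original DataFrame stays intact
X = df.copy()

# pop returns the target column and removes it from X in place
y = X.pop("default_payment_next_month")

print(X.columns.tolist())
print(y.tolist())
```

Because pop modifies the DataFrame in place, copying first is a simple way to keep the full dataset available for later inspection.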

In the following, we provide a simplified description of the variables:

  • limit_bal: The amount of the given credit (NT dollars)
  • sex: Gender
  • education: Level of education
  • marriage: Marital status
  • age: Age of the customer
  • payment_status_{month}: Status of payments in one of the previous 6 months
  • bill_statement_{month}: The amount of bill statements (NT dollars) in one of the previous 6 months
  • previous_payment_{month}: The amount of previous payments (NT dollars) in one of the previous 6 months

The target variable indicates whether the customer defaulted on the payment in the following month.
