Operating our problem

The database contains a column indicating the date of the financial statements for each bank (called the Reporting Period field). Each bank can appear several times in the dataset, once a quarter from December 2002 to December 2016.

However, R does not recognize this field as a date; it is stored as text:

class(database$'Reporting Period')
## [1] "character"

Let's transform this field into a date format:

  1. First, extract the left part of the Reporting Period column. The first 10 characters are extracted in a new variable called Date:
database$Date<-substr(database$'Reporting Period',1,10)
  2. Let's convert this new column into a date using the as.Date command:
database$Date<-as.Date(database$Date,"%m/%d/%Y")
  3. Finally, remove the Reporting Period field as it is no longer relevant:
database$'Reporting Period'<-NULL
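
The three steps above can be tried on a small example before running them on the full dataset. The following is only a minimal sketch: the data frame `toy` and its values are made up for illustration, and the trailing time stamp in the raw strings is an assumption about why only the first 10 characters are kept.

```r
# Toy data: values are hypothetical, not taken from the FDIC dataset
toy <- data.frame(
  `Reporting Period` = c("12/31/2002 12:00:00 AM", "03/31/2003 12:00:00 AM"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Step 1: keep the first 10 characters (the mm/dd/yyyy part)
toy$Date <- substr(toy$`Reporting Period`, 1, 10)

# Step 2: parse the text as a Date
toy$Date <- as.Date(toy$Date, format = "%m/%d/%Y")

# Step 3: drop the original text column
toy$`Reporting Period` <- NULL

class(toy$Date)        # class of the converted column
sum(is.na(toy$Date))   # number of values that failed to parse
```

Counting the missing values after the conversion is a quick way to spot strings that did not match the expected format, because as.Date returns NA for those.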

We have information about all the quarters from 2002 to 2016, but we are only interested in the financial information reported at year-end.

Let's filter the dataset to keep only the December records of each year:

database<-database[as.numeric(format(database$Date, "%m"))==12,]
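
To see what this filter does, here is a small sketch on made-up quarterly dates. The names `toy`, `is_december`, and `toy_year_end` are hypothetical and do not appear in the original code.

```r
# Hypothetical quarterly dates for a single bank
toy <- data.frame(
  Date = as.Date(c("2015-03-31", "2015-06-30", "2015-09-30",
                   "2015-12-31", "2016-12-31"))
)

# Extract the month as a number and flag the December rows
is_december <- as.numeric(format(toy$Date, "%m")) == 12

# drop = FALSE keeps the result a data frame even with a single column
toy_year_end <- toy[is_december, , drop = FALSE]

# Number of year-end rows per year
table(format(toy_year_end$Date, "%Y"))
```

Note that if any dates are NA, logical indexing of this kind returns rows filled with NA for them, so it can be worth checking for missing dates before filtering.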

After the preceding line of code, our database contains 110239 observations:

print("Observations in the filtered dataset:")
## [1] "Observations in the filtered dataset:"
nrow(database)
## [1] 110239

In addition, it contains 1494 variables, as shown in the following code block:

print("Columns in the filtered dataset:")
## [1] "Columns in the filtered dataset:"
ncol(database)
## [1] 1494

At this point, let's save a backup of the workspace:

save.image("Data2.RData")

You can now take a look at all the variables in the dataset:

database_names<-data.frame(colnames(database))

As the number of variables is very high, it is recommended to save the variable names in a CSV file, which can then be opened in Excel:

write.csv(database_names,file="database_names.csv")
rm(database_names)
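
As a possible variation, the sketch below writes the variable names with an explicit column header and without the row-number column that write.csv adds by default. The data frame `toy`, its column names, and the temporary output file are assumptions made for illustration.

```r
# Hypothetical data frame standing in for the full dataset
toy <- data.frame(ID = 1:2, CODE_A = c(10, 20), CODE_B = c(0.5, 0.7))

# One row per variable, with a readable column name
toy_names <- data.frame(variable = colnames(toy), stringsAsFactors = FALSE)

# Write to a temporary file; row.names = FALSE avoids an extra index column
out_file <- tempfile(fileext = ".csv")
write.csv(toy_names, file = out_file, row.names = FALSE)

# Read the file back to confirm its contents
read.csv(out_file, stringsAsFactors = FALSE)
```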

As you can see, many variables in the dataset have names that are a kind of code. The meaning of each variable can be looked up on the FDIC website. This situation is very common, especially in credit risk applications, where the information describes account movements or transactions.

It is important to understand the meaning of the variables, or at least how they are generated. Otherwise, we might include as predictors variables that are very close to the target variable, or variables that will not be available when the model is put into use. At this point, however, the dataset has no apparent target variable. So let's collect the target variable for our problem.
