Understanding the results

First, we create an auxiliary data frame holding the ratio and the default flag, so that we can count the failed banks according to whether the ratio is reported or missing:

missing_analysis<-train[,c("UBPRE628","Default")]

Now we will create a flag to check whether the variable is missing:

missing_analysis$is_miss<-ifelse(is.na(missing_analysis$UBPRE628),"missing_ratio","complete_ratio")

Finally, let's sum the number of defaults in the dataset for both cases, that is, when the ratio is present and when it is missing:

aggregate(missing_analysis$Default, by = list(missing_analysis$is_miss), sum)

##          Group.1   x
## 1 complete_ratio 319
## 2  missing_ratio  14

According to this table, only 14 failed banks have a missing value in this ratio. One might suspect that a bank could intentionally leave a ratio unreported because its value would alert others to the bank's poor economic situation. In this case, however, we don't observe a high proportion of failed banks among the observations with a missing value.
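Since the table above reports counts and not rates, the share of failed banks in each group is a more direct way to compare them. The following is a minimal sketch on made-up data, assuming Default is coded as 0/1; the toy data frame and its values are hypothetical and only illustrate the calculation:

```r
# Hypothetical toy data: ratio values (some missing) and a 0/1 default flag
toy <- data.frame(
  UBPRE628 = c(1.2, NA, 0.8, NA, 1.5, 0.9, NA, 1.1),
  Default  = c(0, 1, 0, 0, 1, 0, 0, 0)
)
toy$is_miss <- ifelse(is.na(toy$UBPRE628), "missing_ratio", "complete_ratio")

# Number of banks and number of defaults in each group
n_banks   <- tapply(toy$Default, toy$is_miss, length)
n_default <- tapply(toy$Default, toy$is_miss, sum)

# Default rate = defaults / banks, per group
summary_table <- data.frame(
  group     = names(n_banks),
  n_banks   = as.vector(n_banks),
  n_default = as.vector(n_default),
  rate      = as.vector(n_default / n_banks)
)
print(summary_table)
```

On the real data, the same calculation would use missing_analysis in place of toy.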

Missing values will be replaced with the mean of the non-missing observations of each ratio, calculated on the training dataset. This means that missing values in the validation dataset are also filled with the mean obtained from the training dataset, not with a mean computed on the validation data. Let's see an example:

train_nomiss<-train
test_nomiss<-test

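for(i in 2:(ncol(train_nomiss)-2))
{
  # Mean of the non-missing training values in column i
  col_mean <- mean(train_nomiss[,i], na.rm = TRUE)
  # Fill the gaps in both samples with the training mean
  train_nomiss[is.na(train_nomiss[,i]), i] <- col_mean
  test_nomiss[is.na(test_nomiss[,i]), i] <- col_mean
}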
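To make the idea concrete, here is a small sketch of the same train-mean imputation on made-up data. The data frames train_toy and test_toy, the column names, and the helper impute_with are all hypothetical and are not part of the original workflow:

```r
# Hypothetical toy samples with gaps in both ratios
train_toy <- data.frame(ratio_a = c(1, NA, 3), ratio_b = c(10, 20, NA))
test_toy  <- data.frame(ratio_a = c(NA, 5),    ratio_b = c(NA, NA))

# Column means are learned from the training sample only
train_means <- colMeans(train_toy, na.rm = TRUE)

# Hypothetical helper: replace NAs in each column with the supplied mean
impute_with <- function(df, means) {
  for (col in names(means)) {
    df[is.na(df[[col]]), col] <- means[[col]]
  }
  df
}

train_filled <- impute_with(train_toy, train_means)
test_filled  <- impute_with(test_toy, train_means)

print(train_filled)
print(test_filled)
```

Note that ratio_b is entirely missing in test_toy, so a mean computed on the test sample alone would not even be defined there. Using the training means avoids this and keeps information from the validation data out of the preprocessing step.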

We can check whether the process has worked using the Amelia package on both training and validation samples (it may take a few minutes). For example, you can check whether there are missing values in the training sample after the process has executed:

missmap(train_nomiss[,2:(ncol(train_nomiss)-2)], main = "Missing values vs observed",col=c("black", "grey"),legend = FALSE)

Here is the output of the preceding code:

Carry out the same checks for the test sample:

missmap(test_nomiss[,2:(ncol(train_nomiss)-2)], main = "Missing values vs observed",col=c("black", "grey"),legend = FALSE)

Again, a new output is displayed:

Both maps are plotted entirely in gray, indicating that no missing values remain in either sample.
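The visual check can be complemented with a quick numeric one that counts the missing values per column. This is a minimal sketch using only base R on a hypothetical data frame, check_toy; on the real data, the same calls would be applied to train_nomiss and test_nomiss:

```r
# Hypothetical toy data frame with one remaining missing value
check_toy <- data.frame(ratio_a = c(1, 2, 3), ratio_b = c(10, NA, 30))

# Number of missing values in each column
na_per_column <- colSums(is.na(check_toy))
print(na_per_column)

# TRUE if at least one missing value remains anywhere in the data frame
any_missing <- anyNA(check_toy)
print(any_missing)
```

After a successful imputation, every count should be zero and anyNA should return FALSE.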

Now we are going to make a new backup of our workspace and remove all the unnecessary tables:

rm(list=setdiff(ls(), c("Model_database","train","test","train_nomiss","test_nomiss")))
save.image("Data7.RData")