Dimensionality reduction techniques

You should bear in mind that principal component analysis assumes a linear transformation of the variables; there are also non-linear dimensionality reduction techniques.

For me, one of the most interesting of these techniques is t-SNE (t-distributed Stochastic Neighbor Embedding), developed by Laurens van der Maaten, who gives this advice:

"As a sanity check, try running PCA on your data to reduce it to two dimensions. If this also gives bad results, then maybe there is not very much nice structure in your data in the first place. If PCA works well but t-SNE doesn’t, I am fairly sure you did something wrong."

Let’s see an example of how t-SNE is applied to our dataset. As usual, it is recommended that you fix a seed:

set.seed(1234)

We will need the Rtsne package, which contains the Rtsne function implementing the algorithm. Its most important parameters are as follows:

  • pca: This establishes whether a principal component analysis is carried out before running t-SNE.
  • perplexity: This is a measure of information (defined as 2 raised to the power of the Shannon entropy). It effectively sets the number of nearest neighbors considered for each observation, which lets the algorithm balance local and global relationships in your data (see the short sketch after this list).
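To make the entropy relationship concrete, here is a minimal sketch (our own illustration, not from the Rtsne package): for a uniform distribution over k neighbors, the perplexity works out to exactly k:

p <- rep(1/30, 30)      # hypothetical conditional probabilities over 30 neighbors
H <- -sum(p * log2(p))  # Shannon entropy in bits
perplexity <- 2^H       # 2^H equals 30 for this uniform example
perplexity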

The code to run the algorithm is as follows:

library(Rtsne)

# Run Barnes-Hut t-SNE on the predictor columns (columns 3 onwards)
tsne <- Rtsne(as.matrix(train[, 3:ncol(train)]), check_duplicates = TRUE,
              pca = TRUE, perplexity = 75, theta = 0.5, dims = 2,
              max_iter = 2000, verbose = TRUE)

This takes a few minutes to finish running. Additional information on how the algorithm works can be found in the package documentation and its references.

The complete dataset has now been reduced to just two vectors:

tsne_vectors <- as.data.frame(tsne$Y)

head(tsne_vectors)
##           V1          V2
## 1  -4.300888 -14.9082526
## 2   4.618766  44.8443129
## 3  21.554283   3.2569812
## 4  45.518532   0.7150365
## 5  12.098218   4.9833460
## 6 -14.510530  31.7903585
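As a quick sanity check (our own addition, not part of the original code), the embedding should contain exactly one two-dimensional point per training observation:

dim(tsne$Y)                        # n rows, 2 columns
nrow(tsne_vectors) == nrow(train)  # should be TRUE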

Let’s plot our train dataset according to these two vectors:

library(ggplot2)

ggplot(tsne_vectors, aes(x = V1, y = V2)) +
  geom_point(size = 0.25) +
  guides(colour = guide_legend(override.aes = list(size = 6))) +
  xlab("") + ylab("") +
  ggtitle("t-SNE") +
  theme_light(base_size = 20) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank()) +
  scale_colour_brewer(palette = "Set2")

The preceding code generates the following plot:

Now let's plot it again, coloring each point by its target value to distinguish failed from non-failed banks:

# Draw an empty canvas, then print each bank's Default label at its
# t-SNE coordinates, colored red or blue by class
plot(tsne$Y, t = 'n', main = "tsne", xlab = "Vector X", ylab = "Vector Y")
text(tsne$Y, labels = as.vector(train$Default),
     col = c('red', 'blue')[as.numeric(train$Default)])

The preceding code generates the following plot:
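If you prefer ggplot2, a similar colored map can be produced by mapping the target to the colour aesthetic. This is a sketch that assumes train$Default can be coerced to a factor and that its rows align with tsne_vectors:

library(ggplot2)

# Color each embedded point by the Default class
ggplot(tsne_vectors, aes(x = V1, y = V2, colour = factor(train$Default))) +
  geom_point(size = 0.25) +
  guides(colour = guide_legend(title = "Default", override.aes = list(size = 6))) +
  theme_light(base_size = 20) +
  scale_colour_brewer(palette = "Set2")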

We can see that many failed banks are placed in the same region of the resulting two-dimensional map. Nevertheless, one of the main weaknesses of t-SNE is its black-box nature: unlike PCA, it does not learn an explicit mapping, so the results cannot be used to make inferences on additional data.
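To make the contrast concrete, PCA learns an explicit rotation that can be applied to unseen observations via predict(), whereas Rtsne only returns coordinates for the rows it was trained on. A minimal sketch, where new_data is a hypothetical matrix of new observations with the same predictor columns:

# PCA: the fitted rotation projects new observations onto the components
pca_fit <- prcomp(train[, 3:ncol(train)], scale. = TRUE)
new_scores <- predict(pca_fit, newdata = new_data)  # new_data is hypothetical

# t-SNE: tsne$Y only holds coordinates for the training rows;
# Rtsne provides no predict() method for embedding new_data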

t-SNE is mainly used for exploratory data analysis, and its output can also serve as an input for clustering algorithms, as sketched below.
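For instance, a simple way to feed the embedding into a clustering algorithm (our own sketch, not part of the original analysis; the choice of four clusters is arbitrary) is to run k-means on the two t-SNE vectors:

set.seed(1234)

# k-means on the two-dimensional t-SNE coordinates
clusters <- kmeans(tsne_vectors, centers = 4, nstart = 25)

# Visualize the resulting groups on the t-SNE map
plot(tsne_vectors$V1, tsne_vectors$V2, col = clusters$cluster,
     pch = 19, cex = 0.4, xlab = "Vector X", ylab = "Vector Y",
     main = "k-means on t-SNE vectors")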

In this real case, and to stay faithful to the analyses and processes used in credit risk, we will set aside the results of PCA and t-SNE and continue with our original dimensionality.

Once we have selected the most predictive variables, we will try to combine them using different algorithms. The aim is to develop a model that predicts the future insolvency of banks with the highest possible accuracy.

Before continuing, let’s save the workspace:

# Keep only the key objects, then save the workspace
rm(list = setdiff(ls(), c("Model_database", "train", "test", "table_iv")))

save.image("~/Data12.RData")
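In a later session, the saved objects can be restored from the same file with:

load("~/Data12.RData")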