Dimensionality reduction

Dimensionality reduction, or feature projection, consists of converting data in a high-dimensional space into a space of fewer dimensions.

High dimensionality increases the computational complexity substantially, and could even increase the risk of overfitting.

Dimensionality reduction techniques are useful for feature extraction as well. In this case, the original variables are transformed into new variables that are combinations of them. These combinations extract and summarize the relevant information of a complex dataset with fewer variables.

Different algorithms exist, with the following being the most important:

  • Principal Component Analysis (PCA)
  • Sammon mapping
  • Singular value decomposition (SVD)
  • Isomap
  • Local linear embedding (LLE)
  • Laplacian eigenmaps
  • t-distributed Stochastic Neighbor Embedding (t-SNE)

Although dimensionality reduction is not very common in cases such as failure prediction models or credit risk, we will see an example of this in our data. 

We will also see the application of PCA and t-SNE, which are the most widely used algorithms.

PCA is a method for extracting the important information in a dataset by applying linear transformations to its variables. Thus, we can define a principal component as a normalized linear combination of the original variables.

The first principal component is the linear combination of variables that captures the maximum variance in the dataset; the larger the variability captured by this component, the more information it contains. In other words, the first component summarizes as much of the information in our data as possible along a single dimension. The second and subsequent principal components are also linear combinations of the original variables, and they capture the remaining variance in the data.
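To make this more concrete, and denoting the standardized original variables as $X_1, \dots, X_p$ (a general formulation, not tied to our dataset), the first principal component can be written as:

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p, \qquad \text{with } \sum_{j=1}^{p} \phi_{j1}^2 = 1$$

where the loadings $\phi_{j1}$ are chosen so that the variance of $Z_1$ is maximized.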

PCA is also used when variables are highly correlated, since one of the main properties of this method is that the correlations among the different components are zero.

Let's see the implementation in R. For that, we use the prcomp function included in the stats package:

pca <- prcomp(train[,3:ncol(train)], retx=TRUE, center=TRUE, scale=TRUE)

Before implementing the PCA approach, the variables should be standardized. This means that each variable should have a mean equal to zero and a standard deviation equal to one.

This can be done using the scale and center options as parameters in the same function. The resulting pca object contains the following elements:

names(pca)
## [1] "sdev" "rotation" "center" "scale" "x"

The center and scale vectors contain the means and standard deviations of the variables used to standardize the data.
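As a quick check (just a sketch, reusing the same train columns passed to prcomp), these stored values should match the column means and standard deviations computed directly:

# The stored centering values are the column means of the original variables
all.equal(pca$center, colMeans(train[, 3:ncol(train)]))

# The stored scaling values are the column standard deviations
all.equal(pca$scale, apply(train[, 3:ncol(train)], 2, sd))

Both comparisons should return TRUE.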

The rotation element contains the loadings of the principal components. We obtain as many principal components as there are variables in the sample.
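To see the relationship between the loadings and the component scores (a small sketch, again assuming the same train columns), the scores stored in pca$x are simply the standardized data multiplied by the rotation matrix:

# Standardize the predictors exactly as prcomp did (center = TRUE, scale = TRUE)
scaled_data <- scale(train[, 3:ncol(train)], center = TRUE, scale = TRUE)

# The difference between the stored scores and the manual projection is numerically zero
max(abs(pca$x - scaled_data %*% pca$rotation))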

Let's print how these components look. For example, the first ten rows of the first four components are shown here:

pca$rotation[1:10,1:4]
##                  PC1          PC2         PC3          PC4
## UBPRE395 -0.05140105  0.027212743  0.01091903 -0.029884263
## UBPRE543  0.13068409 -0.002667109  0.03250766 -0.010948699
## UBPRE586  0.13347952 -0.013729338  0.02583513 -0.030875234
## UBPRFB60  0.17390861 -0.042970061  0.02813868  0.016505787
## UBPRE389  0.07980840  0.069097429  0.08331793  0.064870471
## UBPRE393  0.08976446  0.115336263  0.02076018 -0.012963786
## UBPRE394  0.16230020  0.119853462  0.07177180  0.009503902
## UBPRE396  0.06572403  0.033857693  0.07952204 -0.005602078
## UBPRE417 -0.06109615 -0.060368186 -0.01204455 -0.155802734
## UBPRE419  0.08178735  0.074713474  0.11134947  0.069892907
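We can also verify the property mentioned earlier, namely that the correlations among the different components are zero, by looking at the scores stored in pca$x:

# Correlation matrix of the first four component scores (off-diagonal values are ~0)
round(cor(pca$x[, 1:4]), 4)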

Each of these components explains a proportion of the total variance. The proportion of variance explained by each component is easy to compute as follows:

  1. Let's first calculate the variance of each component:
pca_variances <- pca$sdev^2
  2. Then each variance is divided by the sum of the component variances:
prop_var_explained <- pca_variances/sum(pca_variances)

head(prop_var_explained,10)

## [1] 0.10254590 0.06510543 0.04688792 0.04055387 0.03637036 0.03576523
## [7] 0.02628578 0.02409343 0.02305206 0.02091978

The first principal component explains about 10% of the variance, the second one explains about 6.5%, and so on.
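The same figures can also be obtained with the summary method for prcomp objects, which reports the standard deviation, the proportion of variance, and the cumulative proportion for each component:

# Importance table of the principal components (only the first five shown)
summary(pca)$importance[, 1:5]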

We can observe the variances of the components graphically using this code:

plot(pca, type = "l", main = "Variance of Principal Components")

The preceding code generates the following plot:

Let's now plot the proportion of variance explained by each component:

plot(prop_var_explained, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     type = "b")

The preceding code generates the following plot:

The previous plots are helpful to determine the number of principal components that explain an important part of the total variance.

Consequently, these components could be used for modeling instead of using the full list of variables. It is interesting to plot the cumulative explained variance:

plot(cumsum(prop_var_explained), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b")

The preceding code generates the following plot:

According to the preceding plot, the first 20 components explain about 60% of the total variance in our dataset.

We could opt to use these 20 components to create our model. This approach is not commonly used in credit risk models, so we will not use these transformations.
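For reference only, since we will not follow this route, the following sketch shows how the first 20 component scores could be extracted for modeling. The test object is an assumption here: any new data must contain the same predictor columns and be projected with predict, which reuses the centering, scaling, and rotation learned on the training set:

# Training data: keep the target plus the scores of the first 20 components
train_pca <- data.frame(Default = train$Default, pca$x[, 1:20])

# Hypothetical test set projected onto the same components
test_pca <- predict(pca, newdata = test[, 3:ncol(test)])[, 1:20]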

Nevertheless, it is important to assess what our dataset looks like. In the following plot, we show a graphical representation of our data using the first two components.

Moreover, we classify each bank in the graph in accordance with its corresponding target variable. This time, we use the ggfortify package:

library(ggfortify)

train$Default<-as.factor(train$Default)
autoplot(pca, data = train, colour = 'Default')

The resulting plot classifies the failed and non-failed banks:

It is quite interesting to see the data using only two components. Although these components explain only about 17% of the total variance, failed and non-failed banks are differentiated to some extent.
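As for t-SNE, the other algorithm mentioned at the beginning of this section, the following is a minimal sketch of how it could be applied to the same data. It assumes the Rtsne package is installed, and the perplexity value is only an illustrative choice:

library(Rtsne)

# t-SNE is stochastic, so we fix the seed for reproducibility
set.seed(1234)

# The input must be a numeric matrix; duplicate rows are allowed explicitly
tsne <- Rtsne(as.matrix(train[, 3:ncol(train)]), dims = 2,
              perplexity = 30, check_duplicates = FALSE)

# Plot the two-dimensional embedding, coloured by the target variable
plot(tsne$Y, xlab = "Dimension 1", ylab = "Dimension 2",
     col = train$Default)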
