Removing correlated variables

Correlated variables can over-emphasize the contribution of individual predictors. In regression exercises, this has the effect of inflating the value of R^2, which then does not accurately represent the actual performance of the model. Although many classes of machine learning algorithms are resistant to the effects of correlated variables, the topic deserves mention as it is common in the discipline.

The premise of removing such variables is that redundant variables add no incremental value to a model. For instance, if a dataset contained height in inches and height in meters, these variables would have a correlation of almost exactly 1, and using one of them is just as good as using both. In practical exercises involving variables that we cannot judge intuitively, methods for removing correlated variables can greatly help in simplifying the model.
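As a minimal sketch of this idea (the height values below are made up for illustration), the same quantity expressed in two different units is perfectly correlated:

```r
# Two measurements of the same quantity: height in meters and in inches.
# Because one is a linear transformation of the other, cor() returns 1,
# and keeping both columns adds no information to a model.
height_m  <- c(1.52, 1.68, 1.75, 1.80, 1.91)
height_in <- height_m * 39.37   # same quantity, different unit

cor(height_m, height_in)        # correlation of (almost exactly) 1
```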

The following example illustrates the process of removing correlated variables. The dataset, Pima Indians Diabetes, contains medical measurements for a population of Pima Indian women and an outcome variable called diabetes.

During the course of the examples in successive chapters, we will refer to this dataset often. A high-level overview of the meaning of the different columns in the dataset is as follows:

pregnant   Number of times pregnant
glucose    Plasma glucose concentration (glucose tolerance test)
pressure   Diastolic blood pressure (mm Hg)
triceps    Triceps skin fold thickness (mm)
insulin    2-Hour serum insulin (mu U/ml)
mass       Body mass index (weight in kg/(height in m)^2)
pedigree   Diabetes pedigree function
age        Age (years)
diabetes   Class variable (test for diabetes)

We are interested in finding out whether any of the variables, apart from diabetes (which is our outcome variable), are correlated. If so, it may be useful to remove the redundant variables.

Install the packages mlbench and corrplot in RStudio and execute the commands as shown here:

install.packages("mlbench") 
install.packages("corrplot") 
 
library(corrplot) 
library(mlbench) 

data(PimaIndiansDiabetes)
diab <- PimaIndiansDiabetes

# To produce a correlogram
corrplot(cor(diab[,-ncol(diab)]), method="color", type="upper")

# To get the actual numbers
corrplot(cor(diab[,-ncol(diab)]), method="number", type="upper")

These commands produce a correlogram using the corrplot package; the approach follows the example at http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram:

[Correlogram of the Pima Indians Diabetes variables]

The darker the shade, the higher the correlation. In this case, the plot shows that age and pregnant have a relatively high correlation. We can find the exact values by using method="number" as shown.
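To inspect a single pair of variables rather than the full matrix, cor() can be called on the two columns directly (assuming diab has been created as above):

```r
# Exact correlation coefficient between the age and pregnant columns
cor(diab$age, diab$pregnant)
```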

We can also use the findCorrelation() function from the caret package to find the correlated variables directly, without plotting correlograms:

library(caret)  # findCorrelation() is part of the caret package

correlated_columns <- findCorrelation(cor(diab[,-ncol(diab)]), cutoff = 0.5)
correlated_columns
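findCorrelation() returns the indices of columns whose pairwise correlations exceed the cutoff and are therefore candidates for removal. A minimal sketch of dropping them, assuming diab and correlated_columns from above:

```r
# Drop the flagged predictor columns; diabetes (the last column) is not
# part of the correlation matrix, so it is never flagged and is retained.
diab_reduced <- diab[, -correlated_columns]
ncol(diab_reduced)  # fewer columns than the original diab
```

Note that the indices returned refer to the predictor matrix diab[,-ncol(diab)], which shares its column order with diab, so they can be used to subset diab directly.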