Finding correlation between the features 

In a linear model, correlation between the features increases the variance of the associated parameters (the coefficients of those variables): the stronger the correlation, the larger the variance. The situation is worst when a subset of variables is almost perfectly correlated: in that case, the algorithm used to fit linear models doesn't even work. The intuition is the following: if we want to model the impact of a discount (yes/no) and the weather (rain/no rain) on the ice cream sales of a restaurant, and we run the promotion on every rainy day and only on rainy days, we would have the following design matrix (where Promotion=1 means yes and Weather=1 means rain):

Promotion  Weather
1          1
1          1
0          0
0          0

This is problematic, because every time one of them is 1, the other is 1 as well, so the model cannot identify which variable is driving the sales. The correlation here is exactly 1, and if we try to invert the matrix XᵀX, as fitting the model requires, we won't be able to do so. The only possible solution is to remove one of these variables.
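This degeneracy is easy to reproduce. A minimal numpy sketch, using the design matrix from the example above:

```python
import numpy as np

# Design matrix from the example: the Promotion and Weather columns are identical
X = np.array([[1, 1],
              [1, 1],
              [0, 0],
              [0, 0]], dtype=float)

xtx = X.T @ X                        # the matrix OLS needs to invert
print(np.linalg.matrix_rank(xtx))    # 1 instead of 2: the matrix is singular

try:
    np.linalg.inv(xtx)
except np.linalg.LinAlgError:
    print("X'X cannot be inverted")  # ordinary least squares fails here
```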

The correlation problem might arise not just between two variables but between a variable and a linear combination of other variables. For example, imagine we now have two promotion types, and exactly one of them is executed if and only if the day is rainy. The design matrix would then be as follows:

Promotion A  Promotion B  Weather
1            0            1
0            1            1
0            0            0
0            0            0
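A quick numpy check, using the rows above, makes the problem with this matrix visible:

```python
import numpy as np

# Design matrix from the example: Promotion A, Promotion B, Weather
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 0],
              [0, 0, 0]], dtype=float)

# The Weather column equals Promotion A + Promotion B on every row
print(np.allclose(X[:, 0] + X[:, 1], X[:, 2]))  # True
print(np.linalg.matrix_rank(X))                 # 2: one of the 3 columns is redundant
```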

The correlation between Promotion A + Promotion B and Weather is 1, which is an equivalent situation to the one before. In practice, this case is slightly worse, because the inverse is computed numerically: even if the correlation between the variables (or between a linear combination of variables and a variable) is not exactly 100%, it can introduce numerical instability, up to the point that the inverse won't be properly calculated.

The previous paragraphs describe the degenerate situation where the inverse cannot even be computed, but milder correlations are common and still cause trouble. For instance, if we model the prices of properties in terms of the size of the property and the number of bathrooms, we will find that these two variables are naturally correlated (larger properties have more bathrooms, simply because they are expected to accommodate more people). Without any loss of generality, we will have models with the following structure:

Y = β0 + β1V1 + β2V2 + β3V3 + β4V4 + ε

If V2 is correlated (say, at 70%) with V3 + V4, the standard errors for V2, V3, and V4 will be larger than they would be in the absence of such a correlation. This means the model won't be very sure which variable the effect should be attributed to. It is tempting to simply exclude the V2 variable, but that brings other problems (maybe even worse ones). Using asymptotic statistical theory, it can be shown that excluding from the model a variable that is correlated with a variable we are keeping biases the coefficient of the variable we keep. Why? Imagine we model the sales of a product in terms of the day of the week (so we have a dummy variable for each of the seven days) and a variable that flags whether we ran a discount on a specific day. Imagine discounts are run mostly on Fridays, and we exclude this variable (the one that flags whether a discount was run) from the model. What would happen to the Friday dummy variable?

It would get inflated, because it would capture not only the Friday effect but also part of the discount effect (since most of it is concentrated on Fridays). So we face a delicate balance: if the variable is relevant and correlated, removing it biases our estimates; if we keep it in the model, it inflates the variances of the estimated coefficients (of the correlated variables). Both problems are quite serious: biased coefficients mean that the estimated coefficient will never converge to its true value, and inflated variances mean that, if we get more data, the estimated coefficient can change dramatically. In practice, variables are removed when the correlation lies between 0.7 and 1 or between -0.7 and -1. Usually, combining them is slightly preferred (if the new group makes sense): for example, if we have two promotion types and they are correlated, we can group them into a combined_promotion variable. This avoids the remove/keep dilemma but makes the interpretation much more difficult than before.
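The bias from dropping a relevant correlated variable can be seen in a small simulation. This is only a sketch with assumed numbers (a Friday effect of 2, a discount effect of 5, and discounts on roughly 80% of Fridays); none of these values come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 7 * 500
day = np.tile(np.arange(7), n_days // 7)        # 0 = Monday, ..., 4 = Friday
friday = (day == 4).astype(float)
discount = friday * (rng.random(n_days) < 0.8)  # discounts on ~80% of Fridays
# Assumed ground truth: baseline 10, Friday adds 2, a discount adds 5
sales = 10 + 2 * friday + 5 * discount + rng.normal(0, 1, n_days)

X_full = np.column_stack([np.ones(n_days), friday, discount])
X_omit = np.column_stack([np.ones(n_days), friday])  # discount flag dropped

beta_full, *_ = np.linalg.lstsq(X_full, sales, rcond=None)
beta_omit, *_ = np.linalg.lstsq(X_omit, sales, rcond=None)

print(beta_full[1:])  # close to the true effects (2 and 5)
print(beta_omit[1])   # inflated: the Friday dummy absorbs most of the discount effect
```

With the discount flag omitted, the Friday coefficient lands near 2 + 0.8 × 5 = 6 rather than 2, which is exactly the inflation described above.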
