Variable selection 

A fundamental question when doing linear regression is how to choose the best subset of the candidate variables. Every variable that is added to a model changes the standard errors of the other variables already included. Consequently, the p-values also change, and the order in which variables enter matters. This happens because in general the variables are correlated, causing the coefficients' covariance matrix to change (hence changing the standard errors).

Sandwich estimators use a different formula for the standard errors:

    Var(β̂) = (XᵀX)⁻¹ XᵀΩX (XᵀX)⁻¹

Note the Ω, which is the new element here. This matrix is estimated by the sandwich package. The formula also makes explicit why this is called the sandwich method: the Ω gets sandwiched between two equal expressions.

There are two major criteria that can be used for this: the AIC (Akaike Information Criterion) and the p-values for each variable. There are four possible ways of doing variable selection:

  • Compute all possible models and choose the one that maximizes the adjusted R² or minimizes the AIC. This is the best approach, but usually not practical, since the number of candidate models grows exponentially with the number of variables.
  • Start with an empty model and add the best variable sequentially (forward selection).
  • Start with the saturated model (containing all candidate regressors) and remove the worst variable sequentially (backward elimination).
  • Start with an empty model and alternate the two moves: add the best variable, then remove the worst one, iterating until no addition or removal improves the model (stepwise selection). A sketch of the sequential strategies in R follows this list.
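
The three sequential strategies are implemented by base R's step(), which adds or removes variables according to the AIC; the exhaustive approach of the first bullet is available, for example, via leaps::regsubsets(). A minimal sketch, using the built-in mtcars data as an illustrative assumption:

    # Illustrative sketch on built-in data: model mpg on all other columns
    full  <- lm(mpg ~ ., data = mtcars)   # saturated model
    empty <- lm(mpg ~ 1, data = mtcars)   # intercept-only model

    # Forward selection: start empty, add the variable that lowers AIC most
    fwd  <- step(empty, scope = formula(full), direction = "forward")

    # Backward elimination: start saturated, drop the variable whose
    # removal lowers AIC most
    bwd  <- step(full, direction = "backward")

    # Stepwise: alternate additions and removals until AIC stops improving
    both <- step(empty, scope = formula(full), direction = "both")

Each call prints the AIC at every step, so the order in which variables enter or leave the model can be inspected directly.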
