Parametric approach to dimension reduction

To some extent, we touched upon model-based dimension reduction in Chapter 4, Regression with Automobile Data, in the Logistic regression section, where we implemented regression modelling and, in that process, reduced the data dimension by applying the Akaike Information Criterion (AIC). Criteria such as the Bayesian Information Criterion (BIC) can be used to reduce data dimensions in the same way. As far as model-based methods are concerned, there are two approaches:

  • The forward selection method: In the forward selection method, one variable at a time is added to the model, and the model's goodness-of-fit statistics and error are computed. If the addition of a new dimension reduces the error and increases the model's goodness of fit, that dimension is retained by the model; otherwise, it is removed from the model. This is applicable across different supervised algorithms, such as random forest, logistic regression, neural network, and support vector machine-based implementations. The process of feature selection continues until all the variables have been tested.
  • The backward selection method: In the backward selection method, the model starts with all the variables together. Then one variable at a time is deleted from the model, and the model's goodness-of-fit statistics and error (any pre-defined loss function) are computed. If the deletion of a dimension reduces the error and increases the model's goodness of fit, that dimension is dropped from the model; otherwise, it is kept. Both directions are illustrated in the sketch following this list.
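The following is a minimal sketch of both approaches in R using the base step() function, with the AIC as the goodness-of-fit criterion; the data frame credit and the binary response default are placeholder names, not objects defined earlier in the chapter:

    # Placeholder names: 'credit' is the data frame, 'default' the response.
    full_model <- glm(default ~ ., data = credit, family = binomial)
    null_model <- glm(default ~ 1, data = credit, family = binomial)

    # Forward selection: start from the intercept-only model and add one
    # variable at a time for as long as the AIC keeps improving.
    forward_fit <- step(null_model,
                        scope = list(lower = ~ 1, upper = formula(full_model)),
                        direction = "forward")

    # Backward selection: start from the full model and drop one variable
    # at a time for as long as the AIC keeps improving.
    backward_fit <- step(full_model, direction = "backward")

    # For BIC-based selection, change the penalty from 2 to log(n):
    # step(full_model, direction = "backward", k = log(nrow(credit)))

In both directions, step() stops when no single addition or deletion improves the criterion any further, so every candidate variable effectively gets tested.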

The problem we are discussing in this chapter is a case of supervised learning classification, where the dependent variable is default or no default. The logistic regression method, as discussed in Chapter 4, Regression with Automobile Data, in the Logistic regression section, uses a stepwise dimension reduction procedure to remove unwanted variables from the model. The same exercise can be done on the dataset discussed in this chapter.
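As a sketch (again with the placeholder names credit and default), the same stepwise procedure can be run in both directions at once with stepAIC() from the MASS package:

    library(MASS)
    # Stepwise selection in both directions, starting from the full model.
    fit <- glm(default ~ ., data = credit, family = binomial)
    stepwise_fit <- stepAIC(fit, direction = "both", trace = FALSE)
    summary(stepwise_fit)  # variables retained after stepwise reduction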

Apart from the standard methods of data dimension reduction, there are some simpler heuristics that can be considered, such as the missing value percentage method. In a large dataset with many dimensions, sparsity is a common scenario; before applying any formal dimensionality reduction process, we can often drop many variables simply by computing the percentage of missing values in each one. The missing-value threshold beyond which a variable is dropped has to be decided by the analyst.
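A minimal sketch of this filter, assuming a data frame df and an illustrative 30% cut-off (the threshold is the analyst's choice, as noted above):

    # Proportion of missing values in each column.
    miss_pct <- colMeans(is.na(df))

    # Keep only variables whose missing percentage is at or below 30%.
    df_reduced <- df[, miss_pct <= 0.30, drop = FALSE]

    # Inspect which variables were dropped as too sparse.
    names(df)[miss_pct > 0.30]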
