5
The Significance of Feature Selection Techniques in Machine Learning

N. Bharathi1, B.S. Rishiikeshwer2, T. Aswin Shriram2,

B. Santhi2* and G.R. Brindha2

1Department of CSE, SRM Institute of Science and Technology, Vadapalani, Chennai, Tamil Nadu, India

2School of Computing, SASTRA Deemed University, Thanjavur, Tamil Nadu, India

Abstract

In the current digital era, enormous amounts of raw data are generated, and extracting insight from them is a significant process. The first significant step is to pre-process the available data set. The pre-processed input is then fed to a proper Machine Learning (ML) model to extract the insight or decision. The performance of the model depends heavily on the features given to the model. Without knowledge of the feature selection process, building a perfect model remains in question; proper feature selection is essential for building a precise model. A plethora of techniques is available in the literature for feature extraction and feature selection. Irrelevant features may drastically decrease the performance of the model and increase its complexity. Although features may describe a record effectively, representing the record with a smaller number of features through an optimal approach, while still predicting unseen records precisely, is a complex task. To handle such complexities, appropriate feature selection methods are used. Hence, this chapter concentrates on different feature selection techniques with their merits and limitations. The discussion is supported with case studies using Python. Using the essence of this chapter as a plug-and-play tool, the model builder can design a precise model.

Keywords: Feature selection, pre-processing, machine learning, deep learning, dimension reduction, attribute subset selection

5.1 Introduction

Machine learning algorithms use data sets from different modalities, with labels and without labels. Labeled data sets are used by supervised learning algorithms, in which the data set is viewed as independent variables (input) and a dependent variable (output). The algorithm maps the relationship between the independent and dependent variables (features/attributes/characteristics), and its performance depends on the proper selection of input variables. To extract the most relevant attributes, the feature selection method should be compatible with the machine learning algorithm chosen for an application. Since feature selection contributes significantly to the precision of the application's outcome, focused attention to this area is inevitable. The selected features are validated through proper metrics by observing the model performance. ML models attempt to determine structure in the data, that is, the predictive relationship between the dependent and independent variables. Traditional algorithms rely on handcrafted features, but deep learning architectures have the capacity to extract features from the data themselves. In this chapter, several feature extraction methods, feature selection methods, and their evaluation metrics are discussed. A key challenge in pattern recognition applications is selecting the attributes, drawn from different modalities, that are relevant for classification. In a classification problem, records are collections of attributes in high dimension. Generally, for any application, part of the features are directly relevant to the class, some features can be transformed into a relevant set by applying transformation techniques, and the remaining features are irrelevant. Classifiers are designed to decide the related attributes automatically, but this requires domain knowledge and most of the time depends on the pre-processing stage. Proper features allow ML to build proper models that achieve the intended objective.

5.2 Significance of Pre-Processing

Nowadays, enormous data sets are available in open source. However, most data sets require pre-processing before they can be used for model building. Multifaceted investigation and analysis of large amounts of data consume considerable time, making such analysis unfeasible or impractical. The dimensionality of a data set is the number of input variables or features in the data set. A considerable increase in the number of input features increases the difficulty of developing predictive models; this conceptual challenge is generally referred to as the curse of dimensionality. More input features than necessary may cause poor performance when applying ML algorithms. Though a data set may contain a large number of input features, it can be reduced to an optimized set of features to yield better results with ML algorithms. Dimension reduction is a technique that reduces the number of input features in a data set. It is often used in the visualization of data and also in ML approaches to make further processing, such as classification and regression, simpler.

5.3 Machine Learning System

Input data should be treated before being fed to the ML algorithm. Missing values are a major issue in ML models, and researchers have proposed many imputation techniques.

5.3.1 Missing Values

Removing rows with missing values from the data set affects model performance when the data set is small. In such cases, the missing values need to be replaced with processed values. Some of these processes are as follows (a small sketch follows the list):

  • Numerical data values: fill missing values with 0, or with the mean or median of the column.
  • Categorical data values: fill missing values with the most frequent (highest-frequency) value.
  • The goal of imputation is to infer missing values suited to the nature of the data type; this is referred to as a matrix completion procedure.
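
As a minimal illustration of these options, the following sketch assumes a small pandas DataFrame with one numeric column (age) and one categorical column (city); the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "city": ["Chennai", "Thanjavur", None, "Chennai", "Chennai"],
})

# Numerical column: fill missing values with 0, the mean, or the median
df["age_zero"] = df["age"].fillna(0)
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())

# Categorical column: fill missing values with the most frequent value
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```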

5.3.2 Outliers

Outlier detection is another challenge. Statistical methods and visualization are available for detecting outliers; using a box plot (five-point summary), outliers are easily detected. The success of an ML method depends on feature selection techniques, and a model of an application is only as good as its features. Extracting discriminative information from the given data set for the application is a laborious task. Before applying an ML method, it is suggested to perform exploratory data analysis on the given data set to visualize the nature of the data by plotting it.
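
A small sketch of box-plot-style outlier detection using the five-point summary and the usual 1.5 × IQR whisker rule is given below; the numeric values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric feature with a few extreme values
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95, 110])

# Five-point summary used by the box plot
q1, median, q3 = values.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("Outliers:", values[(values < lower) | (values > upper)].tolist())

# Visual check: points beyond the whiskers are the outliers
values.plot(kind="box")
plt.show()
```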

When we prepare a flexible ML model for an application, we need to pay attention to overfitting. A large gap between training error and testing error indicates overfitting. To handle this challenge, regularization techniques are applied; we should avoid fitting every minor variation in the input.
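
As a hedged illustration of how regularization narrows this gap, the sketch below fits L2-regularized logistic regression on synthetic data; in scikit-learn the parameter C is the inverse regularization strength, and the data set and the chosen values of C are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for any tabular task
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Smaller C means stronger L2 regularization (more shrinkage)
for C in (100.0, 1.0, 0.01):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    gap = model.score(X_tr, y_tr) - model.score(X_te, y_te)
    print(f"C={C}: train/test accuracy gap = {gap:.3f}")
```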

5.3.3 Model Selection

Different algorithms are applied to the given data set to prepare a model for an application. A variety of models with different complexity are constructed, and a suitable model is selected by computing the misclassification rate. A model that merely memorizes the data attains minimal error on the training set; the best model instead focuses on minimizing the generalization error, i.e., the misclassification rate on a large independent test set. The correct model complexity is selected using a validation set. According to the "no free lunch" theorem, the set of assumptions made in model preparation may be well-suited for one domain but work poorly in another. Devising a suitable model by selecting appropriate features through feature engineering is the backbone of model building.
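
One common way to realize this, sketched below under the assumption of a synthetic data set, is to vary a complexity parameter (here the depth of a decision tree, chosen only for illustration) and compare cross-validated accuracy as a proxy for the generalization error.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data set used only for illustration
X, y = make_classification(n_samples=1000, n_features=15, random_state=1)

# Vary model complexity (tree depth) and compare cross-validated accuracy;
# the best-scoring depth approximates the lowest generalization error.
for depth in (1, 3, 5, 10, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"max_depth={depth}: CV accuracy = {score:.3f}")
```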

5.4 Feature Extraction Methods

The reduction in the dimension of a data set can be achieved by removing irrelevant, less relevant, and redundant input variables from the data set. Feature extraction improves the performance of predictive models by reducing the dimension through extracting salient features from the data set [1, 2]. It greatly increases training speed and allows the outcome to be inferred sooner. Various methods of feature extraction exist, and new features are produced by transformations and manipulations of the original input set [3–5]. An example is the extraction of features from images, such as color, texture, shape, and pixel values.

The broad categories of feature extraction, shown in Figure 5.1 and organized by the kind of data reduction they perform, are dimensionality reduction, parametric reduction, non-parametric reduction, and data compression. Dimensionality reduction has three types, and it essentially transforms the data set onto a smaller space to make processing and manipulation easier. Among the three types, wavelet transforms and principal component analysis (PCA) map the data set onto a lower dimension, whereas the third type, attribute subset selection, removes existing irrelevant and redundant data from the data set.

Figure 5.1 Classification of feature extraction methods.

Parametric reduction generates a model to estimate the data set; hence, only the model and its parameters need to be stored rather than the entire data. Non-parametric reductions store visual representations such as histograms and reduced representations such as clustering, sampling, and data cube aggregation. Data compression techniques reduce the data by applying aggregation techniques and reconstruct it through approximation techniques. If some data is lost during reconstruction, the compression is called lossy; if there is no data loss during reconstruction, it is called lossless. The following discussion digs deeper into dimensionality reduction and parametric and non-parametric reductions.

5.4.1 Dimension Reduction

Dimension reduction is a basic technique that is often used in various ML applications to reduce the number of input features. It is quite general: it depends on the nature of the data alone and is not specific to the application in which the data is used. For example, dimensionality reduction projects the data into a smaller-dimensional space, which may or may not help a particular ML application.

Data reduction is achieved by obtaining a reduced data set of much smaller volume than the original data set while maintaining the integrity of the actual data set [6, 7]. The manipulation and visualization of data with the reduced data set is efficient and generates the same, or almost the same, outcomes. Dimension reduction is categorized into two types based on the reduction procedure applied to the input variables: feature selection and feature extraction. The former simply includes or excludes the given features without modifying the input variables, whereas the latter transforms the input features into a lower dimension. Further sections discuss feature selection and feature extraction methods in detail.

5.4.1.1 Attribute Subset Selection

Data sets consist of many attributes, some of which contribute redundant and irrelevant information for the focused application. It is an easy task for a domain expert to decide which attributes need to be included in the prediction; however, it is a time-consuming process when the nature of the data is not known. The aim of attribute subset selection is to identify the minimum number of attributes whose distribution is as close as possible to the probability distribution of the original data set. With n attributes in the original data set, there are 2^n possible subsets of attributes, so the exhaustive approach is time-consuming and costly for determining the optimal subset [8]. Hence, a heuristic approach is followed, which provides accuracy close to that of the original data set [9, 10]. General methods of the heuristic approach are forward selection, backward elimination, a combination of both, and decision tree induction. For feature construction and feature selection, PCA and K-means methods are used, respectively.

5.4.1.1.1 Forward Selection Method

The forward selection method starts with an empty set and proceeds by adding the best attributes one by one from the actual data set to form the reduced data set. Adding the best attribute from the remaining attributes continues until the model reaches the optimum level, producing the model with the best accuracy.

5.4.1.1.2 Backward Elimination Method

The backward elimination method starts with the full set of attributes and proceeds by eliminating the worst attributes one by one, which eventually results in the best predictive model. The combination of forward selection and backward elimination combines the benefits of both methods.
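
Both directions can be sketched with scikit-learn's SequentialFeatureSelector, as below; the estimator, the data set, and the choice of retaining five features are assumptions made only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

# Forward selection: start from an empty set and add the best attribute each step
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="forward", cv=5
).fit(X, y)

# Backward elimination: start from all attributes and drop the worst each step
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="backward", cv=5
).fit(X, y)

print("Forward keeps attributes:", forward.get_support(indices=True))
print("Backward keeps attributes:", backward.get_support(indices=True))
```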

5.4.1.1.3 Decision Tree Induction Method

The decision tree induction method constructs a tree-like structure in which the non-leaf internal nodes indicate decision tests on attributes, the branches indicate the outcomes of those tests, and the leaf nodes indicate the prediction class. In attribute subset selection, new attributes can also be constructed from the existing attributes to improve accuracy. The threshold used to stop the iterations in the above methods may vary based on the method chosen and the application in focus.
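
As a minimal sketch of using an induced tree for attribute subset selection, the example below fits a shallow DecisionTreeClassifier and keeps the attributes with non-zero importance; the data set and the depth limit are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Induce a shallow tree; attributes never used in a split get zero importance
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Keep only the attributes the induced tree actually uses
selected = np.where(tree.feature_importances_ > 0)[0]
print("Attributes used by the induced tree:", selected)
```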

5.4.2 Wavelet Transforms

The data set is transformed using the discrete wavelet transform (DWT), in which the data vector is mapped to another vector of wavelet coefficients. Though the length of each vector is the same, applying the DWT allows the data set to be reduced considerably: only the wavelet coefficients larger than a predefined threshold are stored, and the others are set to zero, resulting in a sparse data set. Obviously, the sparse data set takes less computation time in wavelet space and, with optimization techniques, less memory. This transform also removes noise and supports an efficient data cleaning process.

The regeneration of the data set by applying the inverse DWT also achieves good results in comparison with the discrete Fourier transform (DFT), both in regaining the data and in the smaller space occupied by the compressed data set. Also, a family of DWTs such as Haar, Daubechies, Symlets, Coiflets, Biorthogonal, Gaussian, Mexican hat, Shannon, etc., is available for transforming the data set, unlike the DFT, which has only one transform. In Python, these transforms are provided by the pywt package, and the supported wavelet families can be listed with pywt.families(). The DWT is applied to a data set using a hierarchical pyramid algorithm, which halves the amount of data in each iteration of execution. The DWT can also be applied to hypercubes generated from multidimensional data; the process is simple: transform one dimension at a time and proceed to the next dimension.
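
A short sketch using the pywt package is shown below; the signal, the wavelet (Haar), and the threshold value are arbitrary choices for illustration.

```python
import numpy as np
import pywt

# Wavelet families supported by the pywt package
print(pywt.families())

# Hypothetical 1-D data vector
signal = np.random.randn(128)

# Single-level DWT: splits the data into approximation and detail coefficients,
# each half the length of the input
approx, detail = pywt.dwt(signal, "haar")

# Keep only coefficients above a threshold to obtain a sparse representation
detail_sparse = pywt.threshold(detail, 0.5, mode="hard")

# Regenerate an approximation of the original data with the inverse DWT
reconstructed = pywt.idwt(approx, detail_sparse, "haar")
print(signal[:5])
print(reconstructed[:5])
```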

5.4.3 Principal Components Analysis

PCA preserves the maximum variance and segregates the features into orthogonal components. It projects the data onto the principal eigenvectors of the covariance matrix of the data set.

The n-dimensional data set is reduced to k orthogonal n-dimensional vectors that give an optimal representation of the data set, with k < n. The dimensionality reduction is achieved by projecting the large data set into the smaller space spanned by the principal components. The optimal combination of variables and the extraction of its essential features reduce the dimension [11]. PCA can even reveal relationships that were not previously suspected. The steps involved in data reduction using PCA are as follows (a scikit-learn sketch follows the steps):

Step 1: Normalize the input data so that all attributes participate equally in the data reduction.

Step 2: Identify the principal components, which are unit vectors in each dimension and are orthonormal to one another.

Step 3: Sort the principal components in decreasing order of their contribution to the variance in the data.

Step 4: Eliminate the weaker components with respect to the required reduced data size.

Step 5: Use the principal components to regenerate the original data and check the accuracy of the PCA.
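
These steps can be sketched with scikit-learn, as below; the random data, the choice of k = 4 components, and the use of mean squared error as the reconstruction check are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data set: 200 records with 10 attributes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Step 1: normalize so that every attribute participates equally
X_std = StandardScaler().fit_transform(X)

# Steps 2-4: compute the components (already sorted by explained variance)
# and keep only the strongest k of them
pca = PCA(n_components=4).fit(X_std)
X_reduced = pca.transform(X_std)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Step 5: regenerate an approximation of the data and check the reconstruction error
X_back = pca.inverse_transform(X_reduced)
print("Mean squared reconstruction error:", np.mean((X_std - X_back) ** 2))
```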

5.4.4 Clustering

The tuples in the data set are considered objects and are partitioned into groups called clusters. Clustering techniques are based on how closely the objects are related to each other, and a distance function is used to calculate the similarity between objects. Clusters are described by the largest distance between their objects (the cluster diameter), a cluster core object, attributes, etc. The distance between the centers of multiple clusters, called centroids, also serves as a measure of cluster quality. Data reduction is performed by storing cluster representations instead of the data in the clusters. The efficiency of clustering techniques depends on the nature of the data in the data set.

Clustering techniques are categorized into four different methods: (i) partition methods, (ii) hierarchical methods, (iii) density-based methods, and (iv) grid-based methods. Partition methods form mutually exclusive groups of objects using distance functions and work well on small and medium data sets. Hierarchical methods constitute multiple levels, with splits from top to bottom and merges from bottom to top, like a tree structure. Density-based clustering methods are based on regions where the objects are densely packed or very sparse (outliers). Grid-based methods use a grid data structure to form the clusters and involve fast processing time.
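
As a brief sketch of the partition approach, the example below clusters a synthetic data set with k-means and keeps only the centroids and cluster sizes as the reduced representation; the data and the choice of four clusters are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data set of 1,000 tuples with 5 attributes
X, _ = make_blobs(n_samples=1000, n_features=5, centers=4, random_state=0)

# Partition method: k-means groups the tuples around 4 centroids
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Data reduction: keep the centroids and cluster sizes instead of all tuples
print("Centroids:\n", kmeans.cluster_centers_)
print("Tuples per cluster:", np.bincount(kmeans.labels_))
```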

5.5 Feature Selection

It is very rare to find a data set in which all variables are useful for building a model. Adding redundant variables reduces the capability of the model to generalize and reduces overall accuracy, and adding more variables to a model increases its complexity [12–14]. Not every feature of the given data set needs to be used for creating a model. Feature selection methods are majorly classified into supervised [15] and unsupervised [16–18]. Feature selection enables faster training, reduces the complexity of the model, improves accuracy, and reduces overfitting.

The goal of feature selection is to find the best set of features to build useful models that can be applied to various applications [19–21]. The different techniques classified as supervised are as follows:

  • Filter methods
  • Wrapper methods
  • Embedded methods

5.5.1 Filter Methods

Filter methods are faster, less complex, and much cheaper than other methods, and they are generally used as a pre-processing step [22, 23]. They use univariate statistics instead of cross-validation performance: features are selected based on their scores in various statistical tests. Techniques in filter methods include information gain, the Chi-square test, Fisher's score, ANOVA, and LDA. Filter methods do not remove multicollinearity, so the user has to deal with it before training the models. It is very rare for features to be exactly collinear, but if it happens, there are methods to remove the redundant features before training.
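
A minimal sketch of univariate filter scoring with scikit-learn's SelectKBest follows; the data set and the choice of keeping the top five features are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import (SelectKBest, chi2, f_classif,
                                        mutual_info_classif)

X, y = load_breast_cancer(return_X_y=True)

# Score every feature with a univariate statistic and keep the top 5
for name, score_func in [("chi-square", chi2),
                         ("ANOVA F-test", f_classif),
                         ("mutual information", mutual_info_classif)]:
    selector = SelectKBest(score_func=score_func, k=5).fit(X, y)
    print(name, "keeps features:", selector.get_support(indices=True))
```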

5.5.2 Wrapper Methods

Wrapper methods usually give better accuracy than filter methods. They require some method to search over the possible subsets of features: based on the results of the previous model, the user decides whether to add or remove features from the subset. The Boruta package is one of the best ways to implement this. It adds randomness to the data by creating shuffled copies of all features (shadow features), then applies a random forest and checks the importance of the features in the data set based on their scores. A real feature with a higher Z-score than the shadow features is kept, while features scoring lower are removed as unimportant. The algorithm stops after the specified number of random forest runs or when every feature has been evaluated. Techniques under wrapper methods include forward feature selection, backward feature selection, exhaustive feature selection, and recursive feature elimination.
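
A hedged sketch of one wrapper technique, recursive feature elimination with a random forest, is shown below using scikit-learn's RFE (the Boruta procedure itself is provided by the separate boruta package); the estimator and the target of five features are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)

# Recursive feature elimination: repeatedly fit the model and drop the
# weakest features until only the requested number remains
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=5).fit(X, y)

print("Selected features:", rfe.get_support(indices=True))
print("Feature ranking (1 = selected):", rfe.ranking_)
```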

5.5.3 Embedded Methods

This method brings the benefits of both filter and wrapper methods under a single umbrella by including feature interactions while keeping the computational cost under check. It takes care of each iteration of the model and carefully analyzes and extracts the features that play a major part in that iteration [24]. To analyze the significance of a feature set, techniques such as decision trees and random forests are used.

Random forest is a tree-based ML algorithm that leverages the power of multiple decision trees in making decisions; it combines the output of individual decision trees to generate the final output. In terms of predictive output, random forest outperforms a single decision tree due to its ensemble nature. A decision tree model gives high importance to a particular feature set, whereas a random forest chooses random subsets of features during the process. Therefore, the random forest trumps decision trees in generalizing to data, which makes it more accurate than its counterpart.

Imagine that a teacher has to segregate students into star categories of 1 star, 2 stars, and 3 stars based on academic performance. Consider the basic and most important feature at the start of the decision tree, Grade Score (GS), dividing the tree into students who have a good GS and those who have a bad GS. The second feature to consider is research papers: if a student has a good GS and good research papers, he gets 3 stars, whereas a good GS with no research papers gets 2 stars. A student who has a bad GS and good research papers gets 2 stars, whereas a bad GS with no research papers gets 1 star. As this decision tree shows, the primary priority is given to the GS for the first branching to happen; this clearly tells us that a decision tree always gives priority to a particular set of features. That said, the combination of GS and research papers defines the star rating of the students in the class, which ultimately is the result of this decision tree. This is the process behind one decision tree; a random forest combines the results of many similar decision trees, each built with some randomness, to give the final output.
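
As a brief, hedged comparison of the two, the sketch below contrasts a single decision tree with a random forest on a standard data set; the data set and hyperparameters are arbitrary, so the exact numbers will vary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# The ensemble usually generalizes better than a single tree
print("Tree CV accuracy:  ", cross_val_score(tree, X, y, cv=5).mean())
print("Forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())

# A single tree concentrates importance on a few features, while the forest
# spreads it across the random feature subsets used by its member trees
print("Largest tree importance:  ", tree.fit(X, y).feature_importances_.max())
print("Largest forest importance:", forest.fit(X, y).feature_importances_.max())
```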

Features can be considered telescopes through which the nature of the data is visualized. "Features are the work horses of Machine Learning"; hence, relevant features can be evaluated by constructing models using different sets of features and finalized through model metrics. The individual merits of features are evaluated through the filter approach, and sets of features are evaluated through the wrapper approach.

5.6 Merits and Demerits of Feature Selection

A healthier approach is to evaluate the same transform and model with feature sets of different sizes and pick the number of features that results in the best average performance. For a given data set and model, the best input features are picked using PCA or other dimensionality reduction methods. To demonstrate the merits of PCA, consider 2,000 random samples with 30 features. The PCA components of the data are fed to a logistic regression model, and the cross-validated results are visualized using a box plot. The experiment is run for 30 settings, one per number of retained components, and the optimum number of components is identified by consistently high accuracy. For these simulated 2,000 samples with 30 features, accuracy ranges from 0.504 (first component) to 0.864 (25th component), and the accuracy is consistent from the 25th component onward. Hence, the first 25 components are enough to obtain the best accuracy for this data set.
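
A hedged sketch of this experiment is given below; make_classification stands in for the simulated data, so the exact accuracies quoted above will not be reproduced, but the shape of the curve (accuracy improving with the number of components and then flattening) should be similar.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Simulated data set: 2,000 samples with 30 features
X, y = make_classification(n_samples=2000, n_features=30, n_informative=25,
                           random_state=7)

# Evaluate PCA -> logistic regression for 1..30 retained components
results = []
for n in range(1, 31):
    pipe = Pipeline([("pca", PCA(n_components=n)),
                     ("lr", LogisticRegression(max_iter=1000))])
    results.append(cross_val_score(pipe, X, y, cv=5))

# Box plot of cross-validated accuracy versus the number of components
plt.boxplot(results)
plt.xlabel("Number of PCA components")
plt.ylabel("Cross-validated accuracy")
plt.show()
```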

Even though feature selection techniques have a high impact on the ML modeling area, limitations also exist. There is no general feature selection method that suits all data sets in real time. Based on the nature of the data, whether it is noisy or unstructured, suitable pre-processing techniques must be applied before the feature selection/extraction methods. The choice of selection method depends on the data set (input) and its corresponding expected output (target).

The feature selection method also depends on the data type of the features. For numeric data, feature selection techniques such as PCA or the correlation coefficient can be used, and for categorical data, Chi-square or mutual information statistics can be applied. For the breast cancer data from the UCI repository, the model was built using logistic regression. When all features are used, the accuracy is 75.79%; applying chi-squared selection of four features out of nine yields 74.74%; and using mutual information statistics with the top four features, the accuracy is 76.84%.
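
The comparison can be sketched as below. Scikit-learn's built-in breast cancer data (30 numeric features) is used here only as a runnable stand-in for the nine-attribute categorical UCI data set referenced above, so the printed accuracies will differ from the figures quoted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=1)

# Baseline: logistic regression on all features
full = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print("All features:", full.score(X_te, y_te))

# Top-4 features chosen by chi-square and by mutual information statistics
for name, score_func in [("chi-square", chi2),
                         ("mutual information", mutual_info_classif)]:
    selector = SelectKBest(score_func=score_func, k=4).fit(X_tr, y_tr)
    model = LogisticRegression(max_iter=5000).fit(selector.transform(X_tr), y_tr)
    print(name, "top 4:", model.score(selector.transform(X_te), y_te))
```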

5.7 Conclusion

Effective ML algorithms lead to important applications in this active research area. This chapter paved the way to improve the learning process by focusing on the feature engineering concept and borrowing fruitful ideas from different studies. Sparse coding techniques reduce high-dimensional data to a small number of basis elements and project it into a lower-dimensional space. In spite of the numerous open research challenges, the enhancements made will undoubtedly fine-tune the future of ML and AI systems. Methods such as filter feature selection employ a statistical measure to assign a score to each feature; the features are ranked by score, and selection or rejection of a feature is decided based on that score. Wrapper methods apply search techniques to select a set of features: through the search, different combinations are prepared and compared based on model evaluation. However, repeatedly adding or eliminating features, creating a model, and evaluating it on the validation set becomes very time-intensive and costly. Regularization (penalization) methods come under embedded methods; they add constraints to the optimization of a predictive algorithm that bias the model toward lower complexity. Feature learning algorithms identify common patterns that are essential to discriminate between the class labels.

References

1. Zhang, L., Frank, S., Kim, J., Jin, X., Leach, M., A systematic feature extraction and selection framework for data-driven whole-building automated fault detection and diagnostics in commercial buildings. Build. Environ., 186, 107338, 2020 Dec 1.

2. Sharma, G., Umapathy, K., Krishnan, S., Trends in audio signal feature extraction methods. Appl. Acoust., 158, 107020, 2020 Jan 15.

3. Lu, J., Lai, Z., Wang, H., Chen, Y., Zhou, J., Shen, L., Generalized Embedding Regression: A Framework for Supervised Feature Extraction. IEEE Trans. Neural Networks Learn. Syst., 1–15, 2020 Nov 4.

4. Sarumathi, C.K., Geetha, K., Rajan, C., Improvement in Hadoop performance using integrated feature extraction and machine learning algorithms. Soft Comput., 24, 1, 627–36, 2020 Jan 1.

5. Marques, A.E., Prates, P.A., Pereira, A.F., Oliveira, M.C., Fernandes, J.V., Ribeiro, B.M., Performance Comparison of Parametric and Non-Parametric Regression Models for Uncertainty Analysis of Sheet Metal Forming Processes. Metals, 10, 4, 457, 2020 Apr.

6. Zebari, R., Abdulazeez, A., Zeebaree, D., Zebari, D., Saeed, J., A Comprehensive Review of Dimensionality Reduction Techniques for Feature Selection and Feature Extraction. J. Appl. Sci. Technol. Trends, 1, 2, 56–70, 2020 May 15.

7. Li, M., Wang, H., Yang, L., Liang, Y., Shang, Z., Wan, H., Fast hybrid dimensionality reduction method for classification based on feature selection and grouped feature extraction. Expert Syst. Appl., 150, 2020, https://doi.org/10.1016/j.eswa.2020.113277.

8. Rouzdahman, M., Jovicic, A., Wang, L., Zucherman, L., Abul-Basher, Z., Charoenkitkarn, N., Chignell, M., Data Mining Methods for Optimizing Feature Extraction and Model Selection, in: Proceedings of the 11th International Conference on Advances in Information Technology, 2020 Jul 1, pp. 1–8.

9. Mutlag, W.K., Ali, S.K., Aydam, Z.M., Taher, B.H., Feature Extraction Methods: A Review. J. Phys.: Conf. Ser., 1591, 1, 012028, 2020 Jul 1, IOP Publishing.

10. Koduru, A., Valiveti, H.B., Budati, A.K., Feature extraction algorithms to improve the speech emotion recognition rate. Int. J. Speech Technol., 23, 1, 45–55, 2020 Mar.

11. Garate-Escamilla, A.K., Hassani, A.H., Andres, E., Classification models for heart disease prediction using feature selection and PCA. Inf. Med. Unlocked, 27, 100330, 2020 Apr.

12. Al-Tashi, Q., Abdulkadir, S.J., Rais, H.M., Mirjalili, S., Alhussian, H., Approaches to Multi-Objective Feature Selection: A Systematic Literature Review. IEEE Access, 8, 125076–125096, 2020.

13. Toğaçar, M., Cömert, Z., Ergen, B., Classification of brain MRI using hyper column technique with convolutional neural network and feature selection method. Expert Syst. Appl., 149, 113274, 2020 Jul 1.

14. Haider, F., Pollak, S., Albert, P., Luz, S., Emotion recognition in low-resource settings: An evaluation of automatic feature selection methods. Comput. Speech Lang., 1, 101119, 2020 Jun.

15. Wu, X., Xu, X., Liu, J., Wang, H., Hu, B., Nie, F., Supervised feature selection with orthogonal regression and feature weighting. IEEE Trans. Neural Networks Learn. Syst., 32, 1831–1838, 2020 May 14.

16. Martarelli, N.J. and Nagano, M.S., Unsupervised feature selection based on bio-inspired approaches. Swarm Evol. Comput., 52, 100618, 2020 Feb 1.

17. Pandit, A.A., Pimpale, B., Dubey, S., A Comprehensive Review on Unsupervised Feature Selection Algorithms, in: International Conference on Intelligent Computing and Smart Communication, Springer, Singapore, pp. 255–266, 2020.

18. Zheng, W. and Jin, M., Comparing multiple categories of feature selection methods for text classification. Digit. Scholarsh. Humanit., 35, 1, 208–24, 2020 Apr 1.

19. Açıkoğlu, M. and Tuncer, S.A., Incorporating feature selection methods into a machine learning-based neonatal seizure diagnosis. Med. Hypotheses, 135, 109464, 2020 Feb 1.

20. Al-Kasassbeh, M., Mohammed, S., Alauthman, M., Almomani, A., Feature Selection Using a Machine Learning to Classify a Malware, in: Handbook of Computer Networks and Cyber Security, pp. 889–904, Springer, Cham, 2020.

21. Zheng, W., Zhu, X., Wen, G., Zhu, Y., Yu, H., Gan, J., Unsupervised feature selection by self-paced learning regularization. Pattern Recognit. Lett., 132, 4–11, 2020 Apr 1.

22. Bommert, A., Sun, X., Bischl, B., Rahnenführer, J., Lang, M., Benchmark for filter methods for feature selection in high-dimensional classification data. Comput. Stat. Data Anal., 143, 2020, https://doi.org/10.1016/j.csda.2019.106839.

23. Alirezanejad, M., Enayatifar, R., Motameni, H., Nematzadeh, H., Heuristic filter feature selection methods for medical datasets. Genomics, 112, 2, 1173–81, 2020 Mar 1.

24. Elhariri, E., El-Bendary, N., Taie, S.A., Using Hybrid Filter-Wrapper Feature Selection With Multi-Objective Improved-Salp Optimization for Crack Severity Recognition. IEEE Access, 8, 84290–315, 2020 May 1.

  1. *Corresponding author: [email protected]
  2. Corresponding author: [email protected]