Feature selection using hypothesis testing

Hypothesis testing is a statistical methodology that allows for somewhat more formal statistical testing of individual features. Feature selection via hypothesis testing attempts to select only the best features from a dataset, just as we were doing with our custom correlation chooser, but these tests rely on formalized statistical methods and are interpreted through what are known as p-values.

A hypothesis test is a statistical test that is used to determine whether we may apply a certain condition to an entire population, given a data sample. Based on that sample, the test decides whether or not to reject the null hypothesis in favor of an alternative one. We usually draw this conclusion using a p-value (a probability between 0 and 1), which we compare against a predetermined significance level.
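To make the idea of a p-value concrete, here is a minimal sketch of a two-sample t-test using SciPy; the samples are invented purely for illustration and are not part of our data:

from scipy import stats
import numpy as np

# Two invented samples drawn from populations with different means
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=100)
sample_b = rng.normal(loc=0.5, scale=1.0, size=100)

# Null hypothesis: the two populations have equal means
statistic, p_value = stats.ttest_ind(sample_a, sample_b)

# If the p-value falls below our significance level (commonly 0.05),
# we reject the null hypothesis
print(p_value, p_value < 0.05)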

In the case of feature selection, the hypothesis we wish to test is along the lines of: True or False: this feature has no relevance to the response variable. We want to test this hypothesis for every feature and decide whether the feature holds some significance in predicting the response. In a way, this is how our correlation logic worked: if a column's correlation with the response was too weak, we accepted the hypothesis that the feature has no relevance; if the correlation coefficient was strong enough, we rejected that hypothesis in favor of the alternative, that the feature does have some relevance.

To begin using this on our data, we will have to bring in two new modules, SelectKBest and f_classif, using the following code:

# SelectKBest selects features according to the k highest scores of a given scoring function
from sklearn.feature_selection import SelectKBest

# f_classif models a statistical test known as ANOVA
from sklearn.feature_selection import f_classif

# f_classif can handle features with negative values; not all scoring functions can
# chi2 is a very common classification criterion but only allows non-negative values
# regression has its own set of statistical tests

SelectKBest is basically just a wrapper that keeps a set number of features, those ranked highest according to some criterion. In this case, we will use the p-values from the completed hypothesis tests as the ranking.
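Here is a minimal sketch of how these two modules fit together; the built-in iris dataset is used purely for illustration and is not our dataset:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the k=2 features with the highest ANOVA F-scores
# (which correspond to the lowest p-values)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, '->', X_selected.shape)   # (150, 4) -> (150, 2)
print(selector.pvalues_)                 # one p-value per original feature
print(selector.get_support())            # boolean mask of the kept features

After fitting, the selector exposes the p-value of each feature's hypothesis test through its pvalues_ attribute, so we can inspect exactly why a feature was kept or dropped.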
