Interpreting the p-value

A p-value is a decimal between 0 and 1 that represents the probability of observing data as extreme as ours by chance, assuming the null hypothesis is true. Simply put, the lower the p-value, the better the chance that we can reject the null hypothesis. For our purposes, the smaller a feature's p-value, the better the chance that the feature has some relevance to our response variable and that we should keep it.

For a more in-depth handling of statistical testing, check out Principles of Data Science, https://www.packtpub.com/big-data-and-business-intelligence/principles-data-science, by Packt Publishing.

The big takeaway from this is that the f_classif function will perform an ANOVA test (a type of hypothesis test) on each feature on its own (hence the name univariate testing) and assign that feature a p-value. SelectKBest will then rank the features by that p-value (the lower the better) and keep only the best k (a human input) features. Let's try this out in Python.
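As a minimal sketch of this workflow (the feature matrix X and response y below come from a synthetic make_classification dataset rather than the book's data, and k=5 is an arbitrary choice), SelectKBest with f_classif can be used like so:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a feature matrix X and response vector y
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=42)

# Run a univariate ANOVA test on each feature and keep the 5 best
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

# Inspect the p-value assigned to each feature (lower is better)
print(selector.pvalues_)

print(X_selected.shape)  # (200, 5) -- only the 5 best features remain

The fitted selector exposes the per-feature p-values through its pvalues_ attribute, so you can see exactly why each feature was kept or dropped.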
