Slimming the classifier

It is always worth looking at the actual contributions of the individual features. For logistic regression, we can directly take the learned coefficients (clf.coef_) to get an impression of the feature's impact. The higher the coefficient of a feature is, the more the feature plays a role in determining whether the post is good or not. Consequently, negative coefficients tell us that the higher values for the corresponding features indicate a stronger signal for the post to be classified as bad:

Slimming the classifier

We see that LinkCount and NumExclams have the biggest impact on the overall classification decision, while NumImages and AvgSentLen play a rather minor role. While the feature importance overall makes sense intuitively, it is surprising that NumImages is basically ignored. Normally, answers containing images are always rated high. In reality, however, answers very rarely have images. So although in principal it is a very powerful feature, it is too sparse to be of any value. We could easily drop this feature and retain the same classification performance.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.35.122