Rating the importance of features

What I haven't told you yet is how you pick the features along which to split the data. The preceding root node split the data according to Na <= 0.72, but who told the tree to focus on sodium first? Also, where does the number 0.72 come from anyway?

Apparently, some features might be more important than others. In fact, scikit-learn exposes a feature importance score on the fitted tree through its feature_importances_ attribute: a number between 0 and 1 for each feature, where 0 means the feature was not used in any decision and 1 means it perfectly predicts the target. The feature importances are normalized so that they sum to 1:

In [27]: dtc.feature_importances_
Out[27]: array([ 0.        ,  0.        ,  0.        ,  0.13554217,  0.29718876,
                 0.24096386,  0.        ,  0.32630522,  0.        ,  0.        ])
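
To see these properties in isolation, here is a minimal sketch on a tiny made-up dataset (the numbers and labels are invented for illustration and have nothing to do with the patient data). The third column is constant, so the tree has no way of splitting on it, and we would expect its importance to be 0:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: columns are [binary flag, continuous value, constant]
X = [[0, 1.0, 7],
     [1, 2.0, 7],
     [0, 3.0, 7],
     [1, 4.0, 7],
     [0, 5.0, 7],
     [1, 6.0, 7]]
y = ['A', 'A', 'B', 'C', 'B', 'C']

toy_tree = DecisionTreeClassifier(random_state=42)
toy_tree.fit(X, y)

importances = toy_tree.feature_importances_
print(importances)        # one score per column
print(importances.sum())  # the scores are normalized to sum to 1
```

The constant column can never be used for a split, which is the toy version of a feature that was not used at all in any decisions made.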

If we remind ourselves of the feature names, it will become clear which feature seems to be the most important. A plot might be most informative:

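In [28]: # get_feature_names() was removed in scikit-learn 1.2;
...      # on newer versions, use vec.get_feature_names_out() instead
...      plt.barh(range(10), dtc.feature_importances_, align='center',
...               tick_label=vec.get_feature_names())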

This will result in the following bar plot:

Now, it becomes evident that the most important feature for knowing which drug to administer to patients was actually whether the patient had a normal cholesterol level. Age, sodium levels, and potassium levels were also important. On the other hand, gender and blood pressure did not seem to make any difference at all. However, this does not mean that gender or blood pressure are uninformative. It only means that these features were not picked by the decision tree, likely because another feature would have led to the same splits.
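
The last point is easy to reproduce. The following sketch uses a made-up dataset with two identical columns, so both features are equally informative by construction. Since a single split on either column separates the classes perfectly, we would expect the tree to use only one of them and assign the other an importance of 0:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the two columns are exact copies of each other
X_dup = [[0, 0], [0, 0], [1, 1], [1, 1]]
y_dup = [0, 0, 1, 1]

dup_tree = DecisionTreeClassifier(random_state=0)
dup_tree.fit(X_dup, y_dup)

# One column receives all of the importance, the other none.
# Which of the two is picked can depend on random_state.
dup_importances = dup_tree.feature_importances_
print(dup_importances)
```

An importance of 0 here clearly does not mean that the ignored column carries no information about the target. It only means that the tree did not need it.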

But, hold on. If cholesterol level is so important, why was it not picked as the first feature in the tree (that is, in the root node)? Why would you choose to split on the sodium level first? This is where I need to tell you about that ominous gini label in the diagram earlier.

Feature importances tell us which features are important for classification, but not which class label they are indicative of. For example, we only know that the cholesterol level is important, but we don't know how that led to different drugs being prescribed. In fact, there might not be a simple relationship between features and classes.
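
If we want to know how a feature relates to the predicted classes, we have to look at the tree itself rather than at the importance scores. One way to do this, sketched here on a made-up two-feature dataset (the feature names, values, and drug labels are assumptions for illustration), is scikit-learn's export_text function, which prints the learned decision rules together with the class predicted at each leaf:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: columns are [sodium level, cholesterol is normal (0/1)]
X_rules = [[0.60, 0], [0.65, 1], [0.80, 0], [0.85, 1]]
y_rules = ['drug A', 'drug A', 'drug B', 'drug B']

rule_tree = DecisionTreeClassifier(random_state=42)
rule_tree.fit(X_rules, y_rules)

# Print the decision rules as indented text, one line per node
rules = export_text(rule_tree, feature_names=['Na', 'cholesterol=normal'])
print(rules)
```

Unlike the importance scores, the printed rules show both the threshold used at each split and the class label that each branch leads to.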