Preprocessing the data

For our data to be understood by the decision tree algorithm, we need to convert all categorical features (sex, BP, and cholesterol) into numerical features. What is the best way to do that?

Exactly: we use scikit-learn's DictVectorizer. Like we did in the previous chapter, we feed the dataset that we want to convert to the fit_transform method:

In [10]: from sklearn.feature_extraction import DictVectorizer
... vec = DictVectorizer(sparse=False)
... data_pre = vec.fit_transform(data)
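To see what fit_transform does to a list of dictionaries, here is a minimal sketch on two made-up records. The records and the names records, vec_demo, and encoded are assumptions for illustration only; they are not the chapter's dataset.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records, invented for illustration (not the chapter's data)
records = [
    {'age': 33, 'sex': 'F', 'BP': 'high', 'cholesterol': 'high'},
    {'age': 47, 'sex': 'M', 'BP': 'low', 'cholesterol': 'normal'},
]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
vec_demo = DictVectorizer(sparse=False)
encoded = vec_demo.fit_transform(records)

# Numerical features keep one column each; every string-valued feature
# gets one column per distinct value seen during fitting
print(encoded.shape)
print(sorted(vec_demo.vocabulary_))
```

String-valued entries are expanded into name=value columns, while numerical entries such as age are passed through unchanged.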

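Then, data_pre contains the preprocessed data. If we want to look at the first data point (that is, the first row of data_pre), we match the feature names with the corresponding feature values (in scikit-learn 1.2 and later, get_feature_names has been removed in favor of get_feature_names_out, which returns the same names as a NumPy array):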

In [12]: vec.get_feature_names()
Out[12]: ['BP=high', 'BP=low', 'BP=normal', 'K', 'Na', 'age',
... 'cholesterol=high', 'cholesterol=normal',
... 'sex=F', 'sex=M']
In [13]: data_pre[0]
Out[13]: array([ 1. , 0. , 0. , 0.06, 0.66, 33. , 1. , 0. , 1. , 0. ])

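From this, we can see that all three categorical variables have been encoded using one-hot encoding: blood pressure (BP), cholesterol level (cholesterol), and gender (sex) each get one column per category, with a 1 marking the category that applies and a 0 everywhere else.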
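One way to read such a row is to pair each feature name with its value. The following sketch copies the names and values from the output shown above into plain Python lists; the variable names named_row and active are ours, chosen for illustration.

```python
# Feature names and first-row values, copied from the output above
feature_names = ['BP=high', 'BP=low', 'BP=normal', 'K', 'Na', 'age',
                 'cholesterol=high', 'cholesterol=normal',
                 'sex=F', 'sex=M']
first_row = [1.0, 0.0, 0.0, 0.06, 0.66, 33.0, 1.0, 0.0, 1.0, 0.0]

# Pair every column name with its value
named_row = dict(zip(feature_names, first_row))

# One-hot columns contain '='; a value of 1.0 marks the active category
active = [name for name, value in named_row.items()
          if '=' in name and value == 1.0]
print(active)  # ['BP=high', 'cholesterol=high', 'sex=F']
```

Within each group of one-hot columns, exactly one entry is 1, so the first data point describes a female patient with high blood pressure and high cholesterol.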

To make sure that our data variables are compatible with OpenCV, we need to convert everything into 32-bit floating point values:

In [14]: import numpy as np
... data_pre = np.array(data_pre, dtype=np.float32)
... target = np.array(target, dtype=np.float32)
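The effect of this conversion can be checked by inspecting the dtype attribute. The sketch below uses a small made-up array (demo) as a stand-in for the real data and shows astype as an equivalent way to request np.float32.

```python
import numpy as np

# Hypothetical stand-in for the vectorizer output; NumPy stores Python
# floats as 64-bit values by default
demo = np.array([[1.0, 0.0, 33.0]])
print(demo.dtype)    # float64

# Both calls produce a 32-bit copy of the array
demo32_a = np.array(demo, dtype=np.float32)
demo32_b = demo.astype(np.float32)
print(demo32_a.dtype, demo32_b.dtype)    # float32 float32
```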

Then, all that's left to do is to split the data into training and test sets, like we did in Chapter 3, First Steps in Supervised Learning. Remember that we always want to keep the training and test sets separate. Since we only have 20 data points to work with in this example, we should probably reserve more than 10 percent of the data for testing. A 15-5 split seems appropriate here. We can be explicit and instruct the split function to yield exactly five test samples by passing an integer (rather than a fraction) as test_size:

In [15]: import sklearn.model_selection as ms
... X_train, X_test, y_train, y_test = ms.train_test_split(
...     data_pre, target, test_size=5, random_state=42)
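To convince ourselves that an integer test_size really yields a 15-5 split, we can run the same call on placeholder data of the same size. The arrays X_demo and y_demo below are made up for illustration; only their shapes (20 samples, 10 features) mirror the example.

```python
import numpy as np
import sklearn.model_selection as ms

# Placeholder data with the same shape as in the example:
# 20 samples, 10 features each
X_demo = np.arange(200, dtype=np.float32).reshape(20, 10)
y_demo = np.arange(20, dtype=np.float32)

# An integer test_size is read as an absolute number of test samples
X_tr, X_te, y_tr, y_te = ms.train_test_split(
    X_demo, y_demo, test_size=5, random_state=42)

print(X_tr.shape, X_te.shape)    # (15, 10) (5, 10)
```

Fixing random_state makes the shuffle reproducible, so repeated runs produce the same split.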
