This chapter introduced several important concepts: data cleanup, handling missing and categorical values, using Spark and H2O to train multiclass classification models, and the various evaluation metrics for classification models. Furthermore, the chapter introduced the notion of model ensembles, demonstrated with RandomForest as an ensemble of decision trees.
The reader should now appreciate the importance of data preparation, which plays a key role in every model training and evaluation process. Training and using a model without understanding the modeling context can lead to misleading decisions. Moreover, every model needs to be evaluated with respect to the modeling goal (for example, minimization of false positives). Hence, understanding the trade-offs between the different evaluation metrics for classification models is crucial.
In this chapter, we did not cover all possible modeling tricks for classification models; a few of them remain open for curious readers:
We used a simple strategy to impute missing values in the heart rate column, but there are other possible solutions: for example, mean-value imputation, or combining imputation with an additional binary column that marks the rows that had a missing value. Both strategies can improve the accuracy of the model, and we will use them later in this book.
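The two strategies mentioned above can be sketched in a few lines of plain Python (not the Spark/H2O pipeline used in this chapter; the heart-rate values below are made-up placeholders):

```python
# Illustrative sketch of two imputation strategies for a numeric column
# with missing values (modeled here as None).

def impute_with_mean(values):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_with_indicator(values):
    """Mean-impute and also emit a binary column marking imputed rows,
    so the model can still learn from the fact that a value was missing."""
    indicator = [1 if v is None else 0 for v in values]
    return impute_with_mean(values), indicator

heart_rate = [72.0, None, 88.0, 90.0, None]  # hypothetical sample
filled, was_missing = impute_with_indicator(heart_rate)
print(filled)       # missing entries replaced by the observed mean
print(was_missing)  # [0, 1, 0, 0, 1]
```

In a real Spark pipeline the same idea would be applied column-wise on a DataFrame; the indicator column is then passed to the model as an extra feature.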
Furthermore, Occam's razor suggests preferring a simpler model over a more complex model that provides the same accuracy. Hence, a good idea is to define a hyperparameter space and use an exploration strategy to find the simplest model (for example, fewer trees, shallower depth) that provides the same (or better) accuracy as the models trained in this chapter.
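This exploration strategy can be sketched as follows. The sketch is pure Python; `train_and_evaluate` is a placeholder that in practice would train a RandomForest with the given hyperparameters and return its validation accuracy (the fabricated scores below only serve the illustration):

```python
from itertools import product

def train_and_evaluate(ntrees, depth):
    # Placeholder: stands in for training a model and measuring
    # validation accuracy; returns a fabricated, deterministic score.
    return min(90, 80 + ntrees // 10 + depth) / 100.0

def simplest_best_model(ntrees_grid, depth_grid, tolerance=0.0):
    """Explore the hyperparameter grid and, among all models whose
    accuracy is within `tolerance` of the best one, return the
    simplest: fewest trees, ties broken by smaller depth."""
    results = [
        (ntrees, depth, train_and_evaluate(ntrees, depth))
        for ntrees, depth in product(ntrees_grid, depth_grid)
    ]
    best_acc = max(acc for _, _, acc in results)
    candidates = [r for r in results if r[2] >= best_acc - tolerance]
    return min(candidates, key=lambda r: (r[0], r[1]))

print(simplest_best_model([10, 50, 100], [3, 5, 10]))
```

With a nonzero `tolerance`, the search trades a small amount of accuracy for a markedly simpler model, which is often the better engineering choice.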
To conclude this chapter, it is important to mention that the tree ensemble presented here is a simple instance of the powerful concept of ensembles and super-learners, which we are going to introduce later in this book.