Summary

Supervised learning is the predominant technique used in machine learning applications. The methodology consists of a series of steps beginning with data exploration, data transformation, and data sampling, through feature reduction, model building, and ultimately, model assessment and comparison. Each step of the process involves decision making that must answer key questions: How should we impute missing values? What data sampling strategy should we use? What is the most appropriate algorithm given the amount of noise in the dataset and the prescribed goal of interpretability? This chapter has demonstrated the application of these processes and techniques to a real-world problem: a classification task using the UCI Horse Colic dataset.

Whether the problem is one of classification, where the target is a categorical value, or regression, where it is a real-valued continuous variable, the methodology used for supervised learning is similar. In this chapter, we have used classification for illustration.

The first step is data quality analysis, which includes descriptive statistics of the features and visual analysis using univariate and multivariate feature plots. With the help of various plot types, we can uncover different trends in the data and examine how certain features may or may not correlate with the label values and with each other. Data analysis is followed by data pre-processing, whose techniques include ways to address noise, such as missing data and outliers, as well as ways to prepare the data for modeling through normalization and discretization.
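Although the chapter's experiments are carried out in RapidMiner, these steps translate directly into code. The following is a minimal sketch using pandas and scikit-learn; the small DataFrame and its column names are hypothetical stand-ins for a real dataset such as Horse Colic.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a clinical dataset; None marks missing values.
df = pd.DataFrame({
    "pulse":       [66.0, 88.0, None, 40.0, 164.0],
    "temperature": [38.5, 39.2, 38.3, None, 37.8],
})

print(df.describe())    # descriptive statistics per feature
print(df.isna().sum())  # missing-value counts reveal data quality issues

# Impute missing values with the feature mean, then z-score normalize.
imputed = SimpleImputer(strategy="mean").fit_transform(df)
print(StandardScaler().fit_transform(imputed))
```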

Following pre-processing, we must suitably split the data into train, validation, and test samples. Different sampling strategies may be used depending on the characteristics of the data and the problem at hand, for example, when the data is skewed or when we have a multi-class classification problem. Depending on data size, cross-validation is a common alternative to creating a separate validation set.
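As an illustration of these choices, the sketch below (in scikit-learn, which is our assumption here; the chapter itself uses RapidMiner) performs a stratified split, which preserves the class ratio when the data is skewed, and then uses k-fold cross-validation in place of a separate validation set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Skewed synthetic data: roughly an 80/20 class ratio.
X, y = make_classification(n_samples=300, weights=[0.8], random_state=0)

# stratify=y preserves the class ratio in both train and test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 5-fold cross-validation as an alternative to a fixed validation set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("mean CV accuracy:", scores.mean())
```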

The next step is the culling of irrelevant features. In the filter approach, techniques that use univariate analysis are either entropy-based (Information Gain, Gain Ratio) or based on statistical hypothesis testing (Chi-Squared). The main multivariate methods aim to remove features that are redundant when considered together, or to retain those that correlate most closely with the target label. In the wrapper approach, we use machine learning algorithms themselves to identify the most discriminating features. Finally, some learning techniques have feature selection embedded in the algorithm in the form of a regularization term, typically ridge or lasso; these represent the embedded approach.
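A minimal sketch of all three approaches, assuming scikit-learn and synthetic data as stand-ins for the chapter's RapidMiner operators:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# Filter: univariate chi-squared scores (requires non-negative inputs).
X_pos = MinMaxScaler().fit_transform(X)
X_filter = SelectKBest(chi2, k=4).fit_transform(X_pos, y)

# Wrapper: recursive feature elimination ranks features using a learner.
X_wrap = RFE(LogisticRegression(max_iter=1000),
             n_features_to_select=4).fit_transform(X, y)

# Embedded: an L1 (lasso-style) penalty zeroes out irrelevant coefficients.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("features kept by L1:", (lasso.coef_ != 0).sum())
```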

Modeling techniques are broadly classified into linear, non-linear, and ensemble methods. Among linear algorithms, the type of features can determine which algorithms to use: Linear Regression (numeric features only), Naïve Bayes (numeric or categorical), and Logistic Regression (numeric features only, or categorical transformed to numeric) are the workhorses. The advantages and disadvantages outlined for each method must be understood when choosing between them or when interpreting the results of learning with these models.
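As a sketch only, two of these workhorses can be trained and compared on numeric data with scikit-learn (an assumption; the chapter's models are built in RapidMiner):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit and score each linear workhorse on the held-out test sample.
for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```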

Decision Trees, k-NN, and SVM are non-linear techniques, each with its own strengths and limitations. For example, interpretability is the main advantage of Decision Trees. k-NN is robust in the face of noisy data, but it does poorly with high-dimensional data. SVM suffers from poor interpretability, but shines even when the dataset is small and the number of features is large.
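A minimal sketch of the three non-linear learners side by side, again assuming scikit-learn and synthetic data; the hyperparameters shown (tree depth, k, kernel) are illustrative defaults, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each non-linear learner is fit and scored under identical conditions.
for model in (DecisionTreeClassifier(max_depth=5),
              KNeighborsClassifier(n_neighbors=5),
              SVC(kernel="rbf")):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```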

By combining a number of different models, ensemble methods can leverage the strengths of each. Bagging and boosting are both techniques in which the ensemble generalizes better than the base learner it is built from, and both are popular in many applications.
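The improvement over the base learner can be seen directly in a sketch like the one below, which bags and boosts the same shallow Decision Tree (scikit-learn and synthetic data are assumptions; the chapter uses RapidMiner's ensemble operators):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Compare a single shallow tree against bagged and boosted ensembles
# built from the same base learner.
for model in (DecisionTreeClassifier(max_depth=3),
              BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                                n_estimators=50),
              AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                                 n_estimators=50)):
    print(type(model).__name__,
          cross_val_score(model, X, y, cv=5).mean())
```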

Finally, what are the strategies and methods that can be used in evaluating model performance and comparing models to each other? The role of validation sets or cross-validation is essential to the ability to generalize over unseen data. Performance evaluation metrics derived from the confusion matrix are used universally to evaluate classifiers; some are used more commonly in certain domains and disciplines than others. ROC, Gain, and Lift curves are useful visual representations of the range of model performance as the classification threshold is varied. When comparing models in pairs, several tests based on statistical hypothesis testing are used: Wilcoxon and McNemar's are two non-parametric tests, while the paired t-test is an example of a parametric method. Likewise, when comparing multiple algorithms, a common non-parametric test that does not make assumptions about the data distribution is Friedman's test. ANOVA, a parametric test, assumes normally distributed metrics and equal variances.
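As a minimal sketch of this evaluation workflow, assuming scikit-learn, scipy, and synthetic data (and noting that applying the Wilcoxon test to per-fold scores is one common, if debated, practice):

```python
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Confusion matrix and a threshold-independent metric (area under ROC).
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# Non-parametric paired comparison of two models over the same folds.
a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
b = cross_val_score(GradientBoostingClassifier(), X, y, cv=10)
print("Wilcoxon p-value:", wilcoxon(a, b).pvalue)
```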

The final sections of the chapter present the process undertaken using the RapidMiner tool to develop and evaluate models that classify test data from the UCI Horse Colic dataset. Three experiments are designed to compare and contrast the performance of models under different data pre-processing conditions: without handling missing data, with replacement of missing data using standard techniques, and finally, with feature selection following null replacement. In each experiment, we apply several linear, non-linear, and ensemble methods. As part of the overall process, we illustrate how the modeling environment is used. The results yield revealing conclusions, giving us insights into the data as well as demonstrating the relative strengths and weaknesses of the various classes of techniques in different situations. We conclude that the data is highly non-linear and that ensemble learning demonstrates clear advantages over the other techniques.
