Chapter 2. Practical Approach to Real-World Supervised Learning

The ability to learn from observations accompanied by marked targets or labels, usually in order to make predictions about unseen data, is known as supervised machine learning. If the targets are categories, the problem is one of classification, and if they are numeric values, it is called regression. In effect, what is being attempted is to infer the function that maps the data to the target. Supervised machine learning is used extensively in a wide variety of machine learning applications, whenever labeled data is available or the labels can be added manually.

The core assumption of supervised machine learning is that the patterns that are learned from the data used in training will manifest themselves in yet unseen data.

In this chapter, we will discuss the steps used to explore, analyze, and pre-process the data before proceeding to training models. We will then introduce different modeling techniques ranging from simple linear models to complex ensemble models. We will present different evaluation metrics and validation criteria that allow us to compare model performance. Some of the discussions are accompanied by brief mathematical explanations that should help express the concepts more precisely and whet the appetite of the more mathematically inclined readers. In this chapter, we will focus on classification as a method of supervised learning, but the principles apply to both classification and regression, the two broad applications of supervised learning.

Beginning with this chapter, we will introduce tools to help illustrate how the concepts presented in each chapter are used to solve machine learning problems. Nothing reinforces the understanding of newly learned material better than the opportunity to apply that material to a real-world problem directly. In the process, we often gain a clearer and more relatable understanding of the subject than what is possible with passive absorption of the theory alone. If the opportunity to learn new tools is part of the learning, so much the better! To meet this goal, we will introduce a classification dataset familiar to most data science practitioners and use it to solve a classification problem while highlighting the process and methodologies that guide the solution.

In this chapter, we will use RapidMiner and Weka for building the process by which we learn from a single well-known dataset. The workflows and code are available on the website for readers to download, execute, and modify.

RapidMiner is a GUI-based Java framework that makes it very easy to conduct a data science project, end-to-end, from within the tool. It has a simple drag-and-drop interface to build process workflows to ingest and clean data, explore and transform features, perform training using a wide selection of machine learning algorithms, do validation and model evaluation, apply your best models to test data, and more. It is an excellent tool to learn how to make the various parts of the process work together and produce rapid results. Weka is another GUI-based framework and it has a Java API that we will use to illustrate more of the coding required for performing analysis.

The major topics that we will cover in this chapter are:

  • Data quality analysis
  • Descriptive data analysis
  • Visualization analysis
  • Data transformation and preprocessing
  • Data sampling
  • Feature relevance analysis and dimensionality reduction
  • Model building
  • Model assessment, evaluation, and comparison
  • Detailed case study—Horse Colic Classification

Formal description and notation

We would like to introduce some notation and formal definitions for the terms used in supervised learning. We will follow this notation throughout the rest of the book unless otherwise specified, and extend it as appropriate when new concepts are encountered. The notation will provide a precise and consistent language to describe the terms of art and enable a more rapid and efficient comprehension of the subject.

  • Instance: Every observation is a data instance. Normally, the variable X is used to represent the input space. Each data instance is a vector of many variables (also called features or attributes) and is referred to as x (bold indicating vector representation) of dimension d, where d denotes the number of features in each instance. The features are represented as x = (x1,x2,…xd)T, where each component is a scalar corresponding to the value of a feature when that feature is numeric.
  • Label: The label (also called target) is the dependent variable of interest, generally denoted by y. In classification, values of the label are well-defined categories in the problem domain; they need not be numeric or things that can be ordered. In regression, the label is real-valued.
  • Binary classification: Where the target takes only two values, it is mathematically represented as:

    y ∈ {1,–1}

  • Regression, where the target can take any value in the real number domain, is represented as:

    y ∈ ℝ

  • Dataset: Generally, the dataset is denoted by D and consists of individual data instances and their labels. The instances are normally represented as the set {x1,x2,…xn}. The labels for each instance are represented as the set y = {y1,y2,…yn}. The entire labeled dataset is represented as paired elements in a set, given by D = {(x1, y1),(x2, y2),…(xn, yn)}, where xi ∈ ℝ^d for real-valued features.
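
As a concrete illustration of this notation, a toy binary classification dataset with n = 3 instances and d = 2 numeric features could be written as D = {((1.2, 0.5)T, 1), ((0.3, 2.1)T, –1), ((1.8, 1.1)T, 1)}, where each xi ∈ ℝ^2 and each yi ∈ {1,–1}. (The values here are invented purely to show the notation.)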

Data quality analysis

There are limitations to what can be learned from data that suffers from poor quality. Problems with quality can include, among other factors, noisy data, missing values, and errors in labeling. Therefore, the first step is to understand the data before us in order that we may determine how to address any data quality issues. Are the outliers merely noise or indicative of interesting anomalies in the population? Should missing data be handled the same way for all features? How should sparse features be treated? These and similar questions present themselves at the very outset.

If we're fortunate, we receive a cleansed, accurately labeled dataset accompanied by documentation describing the data elements, the data's pedigree, and what, if any, transformations were already applied to the data. Such a dataset would be ready to be split into train, validation, and test samples, using methods described in the section on data sampling. However, if the data has not been cleansed and is not suitable to be partitioned for our purposes, we must first prepare it in a principled way before sampling can begin. (The significance of partitioning the data is explained later in this chapter, in a section dedicated to train, validation, and test sets.)

In the following sections, we will discuss the data quality analysis and transformation steps that are needed before we can analyze the features.

Descriptive data analysis

The complete data sample (including train, validation, and test) should be analyzed and summarized for the following characteristics. In cases where the data is not already split into train, validation, and test sets, the data transformation task needs to make sure that the samples have similar characteristics and statistics. This is of paramount importance to ensure that the trained model can generalize over unseen data, as we will learn in the section on data sampling.

Basic label analysis

The first step of analysis is understanding the distribution of labels in different sets as well as in the data as a whole. This helps to determine whether, for example, there is imbalance in the distribution of the target variable, and if so, whether it is consistent across all the samples. Thus, the very first step is usually to find out how many examples in the training and test sets belong to each class.
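
As a minimal sketch of this step using the Weka Java API (the file name horse-colic.arff is a placeholder, and we assume the label is the last attribute):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LabelDistribution {
        public static void main(String[] args) throws Exception {
            // Load the data; DataSource also reads CSV and other supported formats
            Instances data = DataSource.read("horse-colic.arff");
            data.setClassIndex(data.numAttributes() - 1); // assume the label is last

            // Count the instances belonging to each class value
            int[] counts = new int[data.numClasses()];
            for (int i = 0; i < data.numInstances(); i++) {
                counts[(int) data.instance(i).classValue()]++;
            }
            for (int c = 0; c < counts.length; c++) {
                System.out.printf("%s: %d (%.1f%%)%n", data.classAttribute().value(c),
                        counts[c], 100.0 * counts[c] / data.numInstances());
            }
        }
    }

Running the same count separately over the training and test partitions shows at a glance whether any class imbalance is consistent across the samples.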

Basic feature analysis

The next step is to calculate statistics for each feature, such as the following:

  • Number of unique values
  • Number of missing values: May include counts grouped by different missing value surrogates (NA, null, ?, and so on).
  • For categorical features: Counts across feature categories, counts across feature categories by label category, the most frequently occurring category (mode), the mode by label category, and so on.
  • For numeric features: Minimum, maximum, median, standard deviation, variance, and so on.

Feature analysis gives basic insights into missing values and noise that can affect the learning process or the choice of algorithm.
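
A rough sketch of how these statistics can be gathered with Weka's AttributeStats follows (again assuming the placeholder file used earlier; note that AttributeStats does not report the median directly):

    import weka.core.AttributeStats;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FeatureSummary {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("horse-colic.arff"); // placeholder path
            for (int i = 0; i < data.numAttributes(); i++) {
                AttributeStats stats = data.attributeStats(i);
                System.out.println(data.attribute(i).name()
                        + " | distinct: " + stats.distinctCount
                        + " | missing: " + stats.missingCount);
                if (data.attribute(i).isNominal()) {
                    // Counts per category, in declaration order
                    for (int v = 0; v < stats.nominalCounts.length; v++) {
                        System.out.println("  " + data.attribute(i).value(v)
                                + ": " + stats.nominalCounts[v]);
                    }
                } else if (data.attribute(i).isNumeric()) {
                    System.out.printf("  min=%.2f max=%.2f mean=%.2f stdDev=%.2f%n",
                            stats.numericStats.min, stats.numericStats.max,
                            stats.numericStats.mean, stats.numericStats.stdDev);
                }
            }
        }
    }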

Visualization analysis

Visualization of the data is a broad topic and it is a continuously evolving area in the field of machine learning and data mining. We will only cover some of the important aspects of visualization that help us analyze the data in practice.

Univariate feature analysis

The goal here is to visualize one feature at a time, in relation to the label. The techniques used are as follows:

Categorical features

Stacked bar graphs are a simple way of showing the distribution of each feature category among the labels, when the problem is one of classification.
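
The tabulation underlying a stacked bar graph is simply a cross-count of feature category against label category. A sketch, continuing from the loading code shown earlier (the attribute name "surgery" is a hypothetical example):

    // Cross-tabulate one nominal feature against the class label; each row of
    // this table corresponds to one bar, and its columns to the stacked segments
    int f = data.attribute("surgery").index(); // hypothetical feature name
    int[][] table = new int[data.attribute(f).numValues()][data.numClasses()];
    for (int i = 0; i < data.numInstances(); i++) {
        weka.core.Instance inst = data.instance(i);
        if (!inst.isMissing(f)) {
            table[(int) inst.value(f)][(int) inst.classValue()]++;
        }
    }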

Continuous features

Histograms and box plots are two basic visualization techniques for continuous features.

Histograms have predefined bins whose widths are either fixed intervals or based on some calculation used to split the full range of values of the feature. The number of instances of data that falls within each bin is then counted and the height of the bin is adjusted based on this count. There are variations of histograms such as relative or frequency-based histograms, Pareto histograms, two-dimensional histograms, and so on; each is a slight variation of the concept and permits a different insight into the feature. For those interested in finding out more about these variants, the Wikipedia article on histograms is a great resource.
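
To make the binning concrete, here is a minimal, library-free sketch of fixed-width binning (it assumes at least two distinct values so that the bin width is non-zero):

    // Count how many values fall into each of k equal-width bins over [min, max]
    static int[] histogram(double[] values, int k) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        int[] counts = new int[k];
        double width = (max - min) / k;
        for (double v : values) {
            int bin = (v == max) ? k - 1 : (int) ((v - min) / width); // max goes in the last bin
            counts[bin]++;
        }
        return counts;
    }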

Box plots are a key visualization technique for numeric features as they show distributions in terms of percentiles and outliers.
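
The quantities behind a box plot are easy to compute directly. A small sketch using linear interpolation between order statistics (one of several common percentile conventions) and Tukey's 1.5 × IQR rule for flagging outliers, with invented example values:

    import java.util.Arrays;

    public class BoxPlotStats {
        // Percentile by linear interpolation on a sorted array, p in [0, 1]
        static double percentile(double[] sorted, double p) {
            double pos = p * (sorted.length - 1);
            int lo = (int) Math.floor(pos), hi = (int) Math.ceil(pos);
            return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
        }

        public static void main(String[] args) {
            double[] x = {4.1, 5.0, 5.2, 5.9, 6.3, 7.0, 7.4, 12.8}; // invented values
            Arrays.sort(x);
            double q1 = percentile(x, 0.25), q3 = percentile(x, 0.75);
            double iqr = q3 - q1;
            // Points beyond 1.5 * IQR from the quartiles are drawn as outliers
            System.out.printf("Q1=%.2f median=%.2f Q3=%.2f, fences [%.2f, %.2f]%n",
                    q1, percentile(x, 0.50), q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr);
        }
    }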

Multivariate feature analysis

The idea of multivariate feature analysis is to visualize more than one feature to get insights into relationships between them. Some of the well-known plots are explained here.

  • Scatter plots: An important technique for understanding the relationship between different features, and between features and labels. Typically, two-dimensional scatter plots are used in practice, with numeric features forming the dimensions. Alignment of the data points along some imaginary axis indicates correlation, while a diffuse spread indicates no correlation; a simple way to quantify this is shown in the sketch after this list. Scatter plots can also be useful for identifying clusters in lower-dimensional space. A bubble chart is a variation of a scatter plot in which two features form the dimensional axes and a third is proportional to the size of the data point, giving the plot the appearance of a field of "bubbles". Density charts help visualize even more features together by introducing data point color, background color, and so on, to give additional insights.
  • ScatterPlot Matrix: A scatterplot matrix is an extension of the scatter plot in which pair-wise scatter plots for every feature (and the label) are visualized. It provides an effective way to compare and perform multivariate analysis on high-dimensional data.
  • Parallel Plots: In this visualization, the features are arranged linearly along the x-axis and the range of values of each feature forms the y-axis, so each data instance is rendered as a line connecting its values on the parallel axes. Class labels, if available, are used to color the lines. Parallel plots offer a good way to understand which features are effective in separating the data. Deviation charts are variations of parallel plots where, instead of the actual data points, means and standard deviations are plotted. Andrews plots are another variation in which the data is transformed using a Fourier series and the function values corresponding to each instance are plotted.
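
The correlation that a scatter plot shows visually can be quantified with Pearson's correlation coefficient. A minimal sketch over two feature columns (it assumes neither column is constant, so the denominator is non-zero):

    // Pearson correlation between two equal-length feature columns:
    // +1/-1 indicate perfect linear alignment, 0 indicates no linear correlation
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }

Computing this for every pair of numeric features yields, in numeric form, the same information that a scatterplot matrix presents visually.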