Data transformation and preprocessing

In this section, we will cover the broad topic of data transformation. The main idea is to take the input data and transform it carefully so as to clean it, extract the most relevant information from it, and turn it into a form usable for further analysis and learning. During these transformations, we must use only methods designed so as not to introduce bias or artifacts that would compromise the integrity of the data.

Feature construction

For some datasets, we need to create additional features from the features we are already given. Typically, some form of aggregation is performed using common aggregators such as average, sum, minimum, or maximum to create new features. In financial fraud detection, for example, card fraud datasets usually contain the transactional behavior of accounts over various time periods during which the accounts were active. Performing behavioral synthesis, such as capturing the "Sum of Amounts whenever a Debit transaction occurred, for each Account, over One Day", is an example of feature construction that adds a new dimension to the dataset, built from existing features. In general, designing new features that enhance the predictive power of the data requires domain knowledge and experience with data, making it as much an art as a science.
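As an illustration of this kind of behavioral aggregation, here is a minimal pandas sketch; the column names and the tiny transaction table are invented purely for demonstration.

```python
import pandas as pd

# Hypothetical transaction-level data: one row per transaction
tx = pd.DataFrame({
    "account_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2023-01-01 09:00", "2023-01-01 17:30",
        "2023-01-01 10:15", "2023-01-02 11:00", "2023-01-02 12:45"]),
    "amount": [120.0, 40.0, 300.0, 25.0, 75.0],
    "type": ["DEBIT", "DEBIT", "CREDIT", "DEBIT", "DEBIT"],
})

# Constructed feature: "Sum of Amounts whenever a Debit transaction
# occurred, for each Account, over One Day"
debits = tx[tx["type"] == "DEBIT"]
daily_debit_sum = (
    debits.groupby(["account_id", debits["timestamp"].dt.date])["amount"]
          .sum()
          .rename("daily_debit_sum")
          .reset_index()
)
print(daily_debit_sum)
```

The aggregated table can then be joined back to the account-level records as a new feature column.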

Handling missing values

In real-world datasets, many features often have missing values. In some cases, values are missing due to errors in measurement or lapses in recording, or because they are unavailable for various reasons; for example, individuals may choose not to disclose their age or occupation. Why care about missing values at all? One extreme and not uncommon way to deal with them is to ignore the records that have any missing features, in other words, to retain only examples that are "whole". This approach may severely reduce the size of the dataset when missing features are widespread in the data. As we shall see later, if the system we are dealing with is complex, dataset size can afford us a precious advantage. Besides, there is often predictive value that can be exploited even in the "un-whole" records, despite the missing values, as long as we use appropriate measures to deal with the problem. On the other hand, one may unwittingly be throwing out key information when the omission of the data is itself significant, as in the case of deliberate misrepresentation or obfuscation on a loan application by withholding information that could be used to conclusively establish bona fides.

Suffice it to say that an important step in the learning process is to adopt some systematic way to handle missing values and understand the consequences of the decision in each case. Some algorithms, such as Naïve Bayes, are less sensitive to missing values, but in general it is good practice to handle missing values as a preprocessing step before any form of analysis is done on the data. Here are some of the ways to handle missing values.

  • Replacing by means and modes: When we replace missing values of a continuous feature with the mean of the observed values of that feature, the overall mean clearly remains unchanged. But if the mean is heavily influenced by outliers, a better approach may be to use the mean computed after dropping the outliers, or to use the median or mode instead. Likewise, when a feature is sparsely represented in the dataset, the mean value may not be meaningful. In the case of features with categorical values, replacing the missing value with the category that occurs with the highest frequency in the sample makes for a reasonable choice.
  • Replacing by imputation: When we impute a missing value, we are in effect constructing a classification or regression model of the feature and making a prediction based on the other features in the record in order to classify or estimate the missing value.
  • Nearest Neighbor imputation: For missing values of a categorical feature, we consider the feature in question to be the target and train a KNN model with k taken to be the known number of distinct categories. This model is then used to predict the missing values. (A KNN model is non-parametric and assigns a value to the "incoming" data instance based on a function of its neighbors—the algorithm is described later in this chapter when we talk about non-linear models).
  • Regression-based imputation: In the case of continuous value variables, we use linear models like Linear Regression to estimate the missing data—the principle is the same as for categorical values.
  • User-defined imputation: In many cases, the most suitable value for imputing missing values must come from the problem domain. For instance, a pH value of 7.0 is neutral; higher values are basic and lower values are acidic. It may make more sense to impute a neutral value for pH than either the mean or the median, and this insight is an instance of user-defined imputation. Substituting normal body temperature or resting heart rate for missing vital signs are analogous examples from medicine. A code sketch of several of these strategies follows this list.
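The following is a minimal sketch of mean, median, and mode replacement, plus a user-defined domain value, on a small made-up pandas DataFrame; the column names and values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy data with missing values in both continuous and categorical features
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "income": [50_000, 62_000, np.nan, 58_000, 75_000],
    "occupation": ["clerk", None, "engineer", "clerk", None],
    "ph": [6.8, np.nan, 7.4, np.nan, 7.1],
})

# Mean replacement for a continuous feature
df["income"] = df["income"].fillna(df["income"].mean())

# Median replacement is more robust when outliers distort the mean
df["age"] = df["age"].fillna(df["age"].median())

# Mode (most frequent category) for a categorical feature
df["occupation"] = df["occupation"].fillna(df["occupation"].mode().iloc[0])

# User-defined imputation: a neutral pH of 7.0 from domain knowledge
df["ph"] = df["ph"].fillna(7.0)

print(df)
```

Model-based imputation (the KNN and regression approaches described above) follows the same pattern, but predicts each missing value from the other features in the record.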

Outliers

Handling outliers requires a lot of care and analysis. Outliers can be noise or errors in the data, or they can be anomalous behavior of particular interest. The latter case is treated in depth in Chapter 3, Unsupervised Machine Learning Techniques. Here we assume the former case, that the domain expert is satisfied that the values are indeed outliers in the first sense, that is, noise or erroneously acquired or recorded data that needs to be handled appropriately.

The following are different techniques for detecting outliers in the data:

  • Interquartile Range (IQR): The interquartile range is a measure of variability in the data or, equivalently, of its statistical dispersion. Each numeric feature is sorted by its values in the dataset and the ordered set is then divided into quartiles. The median value is generally used to measure central tendency. The IQR is measured as the difference between the upper and lower quartiles, Q3 - Q1. Data values above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR are generally considered outliers.
  • Distance-based Methods: The most basic form of distance-based methods uses k-Nearest Neighbors (k-NN) and distance metrics to score the data points. The usual parameter is the value k in k-NN and a distance metric such as Euclidean distance. The data points at the farthest distance are considered outliers. There are many variants of these that use local neighborhoods, probabilities, or other factors, which will all be covered in Chapter 3, Unsupervised Machine Learning Techniques. Mixed datasets, which have both categorical and numeric features, can skew distance-based metrics.
  • Density-based methods: Density-based methods calculate the proportion of data points within a given distance D, and if the proportion is less than a specified threshold p, the point is considered an outlier. The parameters p and D are user-defined values; the challenge of selecting these values appropriately presents one of the main hurdles in using these methods in the preprocessing stage.
  • Mathematical transformation of features: With non-normal data, comparing mean values is highly misleading, as when outliers are present. Non-parametric statistics allow us to make meaningful observations about highly skewed data. Transforming such values using the logarithm or square root function tends to normalize the data in many cases, or make them more amenable to statistical tests. These transformations alter the shape of the feature's distribution drastically; the more extreme an outlier, the greater the effect of the log transformation, for example.
  • Handling outliers using robust statistical algorithms in machine learning models: Many classification algorithms, which we discuss in the next section on modeling, implicitly or explicitly handle outliers. Bagging and Boosting variants, which work as meta-learning frameworks, are generally resilient to outliers or noisy data points and may not need a preprocessing step to handle them.
  • Normalization: Many algorithms—distance-based methods are a case in point—are very sensitive to the scale of the features. Preprocessing the numeric features makes sure that all of them are in a well-behaved range. The most well-known techniques of normalization of features are given here:
    • Min-Max Normalization: In this technique, given a target range [L, U], which is typically [0, 1], each feature value x is normalized in terms of the minimum and maximum values of the feature, $x_{\min}$ and $x_{\max}$, using the formula:
      $x' = L + \frac{(x - x_{\min})(U - L)}{x_{\max} - x_{\min}}$
    • Z-Score Normalization: In this technique, also known as standardization, the feature values are transformed so that their mean is 0 and their standard deviation is 1. For each feature f, the mean µ(f) and standard deviation σ(f) are computed, and the feature value x is transformed as:
      $x' = \frac{x - \mu(f)}{\sigma(f)}$
      A code sketch of IQR-based outlier detection and both normalization techniques follows this list.
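To make the IQR rule and the two normalization formulas concrete, here is a small NumPy sketch; the sample values (with 99.0 playing the role of an obvious outlier) are made up for illustration.

```python
import numpy as np

x = np.array([12.0, 14.5, 13.2, 15.1, 14.0, 99.0, 13.7, 12.9])

# IQR-based outlier detection: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("outliers:", x[(x < lower) | (x > upper)])

# Min-max normalization into [L, U] = [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization (standardization): mean 0, standard deviation 1
x_z = (x - x.mean()) / x.std()
print(x_minmax.round(3))
print(x_z.round(3))
```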

Discretization

Many algorithms, for example Bayesian Networks, are only effective with categorical or nominal values. In such cases, it becomes imperative to discretize the numeric features into categories using either supervised or unsupervised methods. Some of the techniques discussed are:

  • Discretization by binning: This technique is also referred to as equal-width discretization. The entire range of each feature f, from $x_{\min}$ to $x_{\max}$, is divided into a predefined number, k, of equal intervals, each having the width $w = \frac{x_{\max} - x_{\min}}{k}$. The "cut points", or discretization boundaries, are:
    $x_{\min} + w, \; x_{\min} + 2w, \; \ldots, \; x_{\min} + (k - 1)w$
    A code sketch of binning- and frequency-based discretization follows this list.
  • Discretization by frequency: This technique is also referred to as equal-frequency discretization. The feature is sorted and then the entire data is discretized into a predefined number, k, of intervals, such that each interval contains the same proportion of instances. Both techniques, discretization by binning and discretization by frequency, suffer from loss of information due to the predefined value of k.
  • Discretization by entropy: Given the labels, the entropy is calculated over the split points where the value changes in an iterative way, so that the bins of intervals are as pure or discriminating as possible. Refer to the Feature evaluation techniques section for entropy-based (information gain) theory and calculations.
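Here is a brief sketch of equal-width and equal-frequency discretization using the pandas cut and qcut functions; the bin labels and the generated values are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic numeric feature
values = pd.Series(np.random.default_rng(0).normal(50, 15, 1000))

# Equal-width discretization: k = 4 bins of identical width (xmax - xmin) / k
equal_width = pd.cut(values, bins=4, labels=["b1", "b2", "b3", "b4"])

# Equal-frequency discretization: k = 4 bins with roughly the same number of instances
equal_freq = pd.qcut(values, q=4, labels=["q1", "q2", "q3", "q4"])

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

Entropy-based discretization additionally uses the class labels to choose cut points that keep each bin as pure as possible, as described above.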

Data sampling

The dataset one receives may often require judicious sampling in order to learn effectively from the data. The characteristics of the data, as well as the goals of the modeling exercise, determine whether sampling is needed and, if so, how to go about it. Before we begin to learn from the data, it is crucially important to create training, validation, and test samples, as explained in this section.

Is sampling needed?

When the dataset is large or noisy, or skewed towards one type, the question as to whether to sample or not to sample becomes important. The answer depends on various aspects such as the dataset itself, the objective and the evaluation criteria used for selecting the models, and potentially other practical considerations. In some situations, algorithms have scalability issues in memory and space, but work effectively on samples, as measured by model performance with respect to the regression or classification goals they are expected to achieve. For example, SVM scales as O(n²) and O(n³) in memory and training times, respectively. In other situations, the data is so imbalanced that many algorithms are not robust enough to handle the skew. In the literature, the step intended to re-balance the distribution of classes in the original data extract by creating new training samples is also called resampling.

Undersampling and oversampling

Datasets exhibiting a marked imbalance in the distribution of classes can be said to contain a distinct minority class. Often, this minority class is the set of instances that we are especially interested in, precisely because its members occur so rarely. For example, in credit card fraud, less than 0.1% of the data belongs to the fraud class. This skewness is not conducive to learning; after all, when we seek to minimize the total error in classification, we give equal weight to all classes, regardless of whether one class is underrepresented compared to another. In binary classification problems, we call the minority class the positive class and the majority class the negative class, a convention that we will follow in the following discussion.

Undersampling of the majority class is a technique commonly used to address skewness in data. Taking credit card fraud as an example, we can create different training samples from the original dataset such that each sample contains all the fraud cases from the original dataset, whereas the non-fraud instances are distributed across all the training samples in some fixed ratios. Thus, in a given training set created by this method, the majority class is now underrepresented relative to the original skewed dataset, effectively balancing out the distribution of classes. Training samples with labeled positive and labeled negative instances in ratios of, say, 1:20 to 1:50 can be created in this way, but care must be taken that the samples of negative instances have characteristics similar to the statistics and distributions of the main dataset. The reason for using multiple training samples, with different proportions of positive and negative instances, is so that any sampling bias that may be present becomes evident.

Alternatively, we may choose to oversample the minority class. As before, we create multiple samples wherein instances from the minority class have been selected by either sampling with replacement or without replacement from the original dataset. When sampling without replacement, there are no replicated instances across samples. With replacement, some instances may be found in more than one sample. After this initial seeding of the samples, we can produce more balanced distributions of classes by random sampling with replacement from within the minority class in each sample until we have the desired ratios of positive to negative instances. Oversampling can be prone to over-fitting as classification decision boundaries tend to become more specific due to replicated values. SMOTE (Synthetic Minority Oversampling Technique) is a technique that alleviates this problem by creating synthetic data points in the interstices of the feature space by interpolating between neighboring instances of the positive class (References [20]).
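The following sketch shows random undersampling of the majority class and random oversampling of the minority class using plain NumPy; the synthetic fraud data, the class sizes, and the 1:20 ratio are assumptions chosen for illustration (libraries such as imbalanced-learn provide SMOTE and related resamplers).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed binary problem: 1 = fraud (positive, minority), 0 = non-fraud
y = np.array([0] * 980 + [1] * 20)
X = rng.normal(size=(1000, 5))

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Undersampling: keep every positive, draw negatives without replacement at a 1:20 ratio
neg_sample = rng.choice(neg_idx, size=20 * len(pos_idx), replace=False)
under_idx = np.concatenate([pos_idx, neg_sample])
X_under, y_under = X[under_idx], y[under_idx]

# Oversampling: replicate positives (sampling with replacement) until classes balance.
# Replicated points make decision boundaries overly specific; SMOTE interpolates
# between neighboring positives instead of copying them.
pos_boost = rng.choice(pos_idx, size=len(neg_idx), replace=True)
over_idx = np.concatenate([neg_idx, pos_boost])
X_over, y_over = X[over_idx], y[over_idx]

print("undersampled class counts:", np.bincount(y_under))
print("oversampled class counts:", np.bincount(y_over))
```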

Stratified sampling

Creating samples so that data with similar characteristics are drawn in the same proportion as they appear in the population is known as stratified sampling. In multi-class classification, if there are N classes, each in a certain proportion, then samples are created such that they represent each class in the same proportion as in the original dataset. Generally, it is good practice to create multiple samples to train and test the models, to validate against sampling biases.
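As one way of doing this in practice, scikit-learn's train_test_split accepts a stratify argument that preserves the class proportions in every split; the three-class synthetic data below is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem with unequal class proportions (70/20/10)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.choice([0, 1, 2], size=1000, p=[0.7, 0.2, 0.1])

# stratify=y keeps the class proportions the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=7)

print("population:", np.bincount(y) / len(y))
print("train:     ", np.bincount(y_train) / len(y_train))
print("test:      ", np.bincount(y_test) / len(y_test))
```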

Training, validation, and test set

The Holy Grail of creating good classification models is to train on a good quality, representative sample (training data), tune the parameters and find effective models (validation data), and finally, estimate the model's performance by its behavior on unseen data (test data).

The central idea behind this logical grouping is to make sure models are validated or tested on data that has not been seen during training; otherwise, a simple "rote learner" can outperform the algorithm. The generalization capability of the learning algorithm must be evaluated on a dataset that is different from the training dataset but comes from the same population (References [11]). Removing too much data from training in order to increase the budget for validation and testing can result in models that suffer from "underfitting", that is, not having enough examples to build patterns that can help in generalization. On the other hand, the extreme choice of allocating all the labeled data for training and not performing any validation or testing can lead to "overfitting", that is, models that fit the examples too faithfully and do not generalize well enough.

Typically, in most machine learning challenges and real-world customer problems, one is given a training set and a testing set upfront for evaluating the performance of the models. In these engagements, the only question is how to validate and find the most effective parameters given the training set. In some engagements, only a labeled dataset is given, and you need to create the training, validation, and testing sets yourself to make sure your models do not overfit or underfit the data.

Three logical processes are needed for modeling, and hence three logical datasets: training, validation, and testing. The purpose of the training dataset is to give labeled data to the learning algorithm to build models. The purpose of the validation set is to assess the effect of different parameter choices by evaluating, on the validation set, models built from the training data. Finally, the best parameters or models are retrained on the combination of the training and validation sets to find an optimum model that is then tested on the blind test set.


Figure 1: Training, Validation, and Test data and how to use them

Two things affect the learning or generalization capability: the choice of the algorithm (and its parameters) and the amount of training data. The ability to generalize can be estimated by various metrics, including the prediction errors. The overall estimate of the unseen error, or risk, of the model is given by:

$E_{out}(G, n) = \text{Noise} + \text{Bias}(G) + \text{Var}(G, n)$

Here, Noise is the stochastic noise; Var(G, n) is called the variance error and is a measure of how susceptible our hypothesis or algorithm (G) is to different datasets; and Bias(G) is called the bias error and represents how far away the best algorithm in the model (the average learner over all possible datasets) is from the optimal one.

Learning curves as shown in Figure 2 and Figure 3—where training and testing errors are plotted keeping either the algorithm with its parameters constant or the training data size constant—give an indication of underfitting or overfitting.

When the model complexity, that is, the algorithm and its parameter choices, is held fixed, plotting error against training data size produces a learning curve. Different algorithms, or the same algorithm with different parameter choices, can exhibit different learning curves on the same data. Figure 2 shows two such algorithms, whose learning curves reflect different bias and variance.


Figure 2: The relationship between training data size and error rate when model complexity is fixed, for different choices of models.

The algorithm or model choice also impacts model performance. A complex algorithm, with more parameters to tune, can result in overfitting, while a simple algorithm with fewer parameters might underfit. The classic figure illustrating model performance against model complexity when the training data size is fixed is as follows:


Figure 3: The relationship between model complexity and error rate, over the training and testing data, when the training data size is fixed.

Validation allows for exploring the parameter space to find the model that generalizes best. Regularization (to be discussed with linear models) and validation are two mechanisms that should be used to prevent overfitting. Sometimes the "k-fold cross-validation" process is used for validation: k samples (folds) of the data are created, (k – 1) of them are used for training and the remaining one for testing, and the process is repeated k times to give an average estimate. The following figure shows 5-fold cross-validation as an example:


Figure 4: 5-fold cross-validation.
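To make the k-fold procedure concrete, here is a minimal sketch using scikit-learn's KFold with a logistic regression model; the model choice and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary classification data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold, repeat
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=3).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

print("per-fold accuracy:", np.round(scores, 3))
print("mean CV accuracy: ", np.mean(scores))
```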

The following are some commonly used techniques to perform data sampling, validation, and learning:

  • Random split of training, validation, and testing: for example, 60%, 20%, 20%. Train on 60%, use 20% for validation, and then combine the training and validation datasets to train a final model that is tested on the remaining 20%. The split may be done randomly, based on time, based on region, and so on. A sketch of this technique follows the list.
  • Training, cross-validation, and testing: Split into training and test sets in a two-to-one ratio, perform validation using cross-validation on the training set, then train on the whole two-thirds and test on the remaining one-third. The split may be done randomly, based on time, based on region, and so on.
  • Training and cross-validation: Used when the training set is small and only model selection can be done without much parameter tuning. Run cross-validation on the whole dataset, choose the best models, and then learn on the entire dataset.
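The sketch below illustrates the first technique: a random 60/20/20 split, parameter selection on the validation set, retraining on the combined training and validation data, and a single final evaluation on the blind test set. The logistic regression model, the candidate values of C, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary classification data
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# 60/20/20 random split: carve off 20% as the blind test set,
# then split the remaining 80% into 60% train and 20% validation (0.25 * 80% = 20%)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

# Use the validation set to pick a parameter (here, the regularization strength C)
best_c, best_acc = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_c, best_acc = c, acc

# Retrain on train + validation with the chosen parameter, then test exactly once
final = LogisticRegression(C=best_c, max_iter=1000).fit(
    np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
print("chosen C:", best_c, " test accuracy:", accuracy_score(y_test, final.predict(X_test)))
```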