CHAPTER 9: Sampling Data and Evaluating Results


In the era of big data, sampling has become a popular and essential part of the data science pipeline. Even if you can get all the available data to fit into a large data structure, doing so may be ill-advised (unless you already have an adequate model at your disposal). Just because the cloud and large computer clusters make it possible doesn’t mean that you should use all the available data as-is. You shouldn’t even need all of your data to see whether a feature holds value; a sample can make the whole process much more efficient.

Equally important is the evaluation of the results of your models. Although there are some obvious techniques we have used already, there is much more to it, as not all data science problems involve the same goals. Besides, when it comes to evaluating results, there are often conflicting objectives which must be considered.

In this chapter we’ll examine the following topics:

  • The various ways we can perform sampling
  • Evaluation metrics for classifiers
  • Evaluation metrics for regression systems
  • K-fold Cross Validation.

Sampling Techniques

Although we have glanced at sampling throughout the book, we’ve merely scratched the surface. Sampling can be a sophisticated process for certain problems. In this section we’ll examine the two main types of sampling and how we can employ Julia for these tasks. In all cases of sampling used in data science, we make use of randomization, as this yields less biased samples.

Basic sampling

This is the type of sampling we have been using so far. It works well for the majority of cases and it is very fast in its implementation, particularly in a language like Julia. Because it is such a simple way to create samples, you don’t need a package for it. However, there are a variety of options available in the StatsBase package, using the sample() function. We recommend you load that function into memory using the import StatsBase.sample command, as we’ll be extending it later with a custom function.

The idea behind basic sampling is to take a random subset of the original data and yield the features and the target variable values that correspond to it. The only parameter we need to use is the number of data points we expect to have in the sample (e.g. 1000). So, if we were to take the magic dataset, for instance, and apply basic sampling, we would type:

In[1]: p = 0.1

    n = round(Int64, 19020*p)

    ind = sample(1:19020, n)

    SampleData1 = data[ind,:]

Here, 19020 is the total number of data points in the magic dataset, so n (the product rounded to the nearest integer) is the number of elements in the sample. Alternatively, you can use the built-in function randperm(n), which provides you with a random permutation of a range from 1 to n. So, in the case of the magic dataset we could type:

In[2]: ind = randperm(19020)[1:n]

    SampleData2 = data[ind,:]

Both of these methods are fast, but the latter is preferable if you don’t want to worry about external packages (that often tend to be difficult to remember when you need them!).
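One thing worth keeping in mind is that sample() draws with replacement by default, so the same data point may appear more than once in the sample. If that is undesirable, you can pass the replace keyword argument of that function, as in the minimal sketch below (the variable names are arbitrary, and data is assumed to hold the magic dataset as in the earlier snippets):

import StatsBase.sample

n = round(Int64, 19020*0.1)                # sample size, 10% of the dataset
ind = sample(1:19020, n, replace = false)  # draw indices without replacement
SampleData1b = data[ind,:]                 # no row of data appears twice in the sample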

Stratified sampling

Stratified sampling is a bit trickier as it aims to preserve the class structure of the dataset, when classes are available. However, you can use any discrete variable instead of a class, if you want to apply this type of sampling to a dataset (e.g. the gender variable for a demographics dataset). The objective is to create a sample with a distribution of a given variable similar to that of the original dataset.

Again, the only parameter we need to address is the number of elements that need to be included in the sample. As this is not a trivial task, we need to make use of an external script (like the sample.jl included in this book, which extends the sample() function of the StatsBase package) as currently there is no package that provides this method. So, applying stratified sampling to the magic dataset would look like this:

In[3]: StratifiedSample = sample(data[:,1:10], data[:,11], n)

This works only because we have extended the sample() function using the sample.jl script. If you were to run this snippet with the default sample() function of the StatsBase package, you would get an error. The output of this function yields two distinct arrays of the sample: one for the features and one for the labels. You could easily merge the two by employing the built-in hcat() function:

In[4]: SampleData3 = hcat(StratifiedSample[1], StratifiedSample[2])

In this example stratified sampling may be overkill, but there are several real-world scenarios where this approach would be appropriate and necessary. If a particular class is very small compared to the other classes, its signal may be weakened or lost altogether through the basic sampling approach.
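For reference, here is a minimal sketch of how stratified sampling could be implemented by hand, in case you’d rather not depend on the sample.jl script. The function name stratified_sample() is our own and this is not necessarily how sample.jl does it; the idea is simply to draw a proportional number of data points from each class (the snippet uses the pre-1.0 Julia syntax found throughout this book, where find() is what later became findall()):

import StatsBase.sample

# X = feature matrix, labels = discrete target variable, n = desired sample size
function stratified_sample(X::Array{Float64,2}, labels::Array, n::Int64)
  N = length(labels)
  ind = Int64[]                               # indices selected for the sample

  for c in unique(labels)
    ind_c = find(labels .== c)                # indices of the data points of class c
    n_c = round(Int64, n*length(ind_c)/N)     # class c's proportional share of the sample
    append!(ind, sample(ind_c, n_c, replace = false))
  end

  return X[ind,:], labels[ind]                # features and labels of the sample
end

Due to the rounding, the resulting sample may end up a point or two larger or smaller than n, which is rarely a problem in practice.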

Performance Metrics for Classification

Once we have completed a classification task, we end up with a vector of predicted labels that correspond to the test set we have run through a classification system. But how close were we to the actual labels? To answer this question in a reliable and insightful way, we employ a series of metrics. Let’s look at each one of them more closely.

Confusion matrix

Although this is not strictly a classification metric, a confusion matrix (or table) is essential for evaluating a classifier’s performance. The matrix should indicate how close the predictions were to the actual labels, and divulge some insight as to where the errors (misclassifications) lie. This will not only allow us to go deeper into the classifier’s performance, but also get a better understanding of the dataset. This is especially true if we run several classifications on the same data, using different parameters and different classifiers. Let us now see how Julia can produce a confusion matrix for a basic classification experiment (so that you can check it by hand, for more in-depth understanding of the methods described in this section). First, let’s create the predictions and the labels’ vectors:

In[5]: t = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]

    y = [1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1]

Now, let’s calculate the corresponding confusion matrix:

In[6]: CM = confusmat(2, t, y)

Out[6]: 2x2 Array{Int64,2}:

  1 3

  2 6

The target variable can be anything (e.g. a string, a character), but for this function of the MLBase package to work, it must be encoded as an array of integers. If you are unsure about how to accomplish this, please refer to Chapter 6.

In this simple example, we can deduce that the classifier has mediocre performance, since there are a lot of cases where class 1 was predicted as class 2 and vice versa. In general, the closer the confusion matrix is to a diagonal matrix, the better the performance of the classifier. Let us see how we can quantify this rule of thumb, using a series of metrics based on the confusion matrix.

Accuracy metrics

Basic accuracy

In the beginning of the book, as well as in the “Caffeine for the Force” case study, we looked briefly at this metric for evaluating the performance of various classifiers. It is the simplest metric for this task and it works well for relatively balanced datasets (i.e. datasets where the classes are of comparable size).

In essence, this metric calculates the proportion of correct classifications over the total number of classifications. By correct classifications, we mean cases where the predictions and the labels are the same. This corresponds to the diagonal of the confusion matrix. So, in our previous example, the accuracy of the classification is (1 + 6) / length(t) = 7/12, or roughly 0.583. The same result can be obtained through the correctrate() function of the MLBase package:

In[7]: correctrate(t, y)

Out[7]: 0.5833333333333334

The same result would be obtained if you reversed the inputs of the function by accident, since Julia has no way of knowing which labels are the ground truth and which are the predicted ones.

Accuracy is often referred to as “accuracy rate” and it always takes a value between 0.0 and 1.0 (usually expressed as a percentage). Usually an accuracy higher than 1/q, where q is the number of classes, is considered satisfactory (although you’d like to get as close to 1.0 as possible). Unfortunately, the accuracy metric performs poorly when the dataset is heavily unbalanced, yielding misleading results. For instance, in a fraud detection scenario, even a horrible classifier is bound to produce accuracy rates close to 1 simply by defaulting to the dominant class, which in this case would be “safe transaction.”
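To see how misleading this can get, consider a made-up fraud detection scenario where only 1% of 1000 transactions are fraudulent (the numbers here are purely illustrative, and the variable names are our own). A “classifier” that labels every transaction as safe still achieves a seemingly great accuracy:

t_fraud = [ones(Int64, 990); fill(2, 10)]    # ground truth: 990 safe (class 1), 10 fraudulent (class 2)
y_fraud = ones(Int64, 1000)                  # a classifier that always predicts "safe"

sum(t_fraud .== y_fraud) / length(t_fraud)   # 0.99, even though not a single fraud case is caught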

Weighted accuracy

Although accuracy on its own is a very blunt metric (considering every hit or miss of exactly the same value), it has a modified version that works well for even the most challenging datasets: weighted accuracy. One benefit of weighted accuracy is that it makes some data points more important than others, as is often the case in unbalanced datasets, where the smaller class usually holds the most value.

In the previous example, if we were more interested in predicting the first class correctly, we would assign higher weights to the elements of the corresponding test set. The weights are normalized. Their exact values don’t matter all that much, since it is the ratio of the weight of one data point to another that affects the overall score. So, in our example, we could apply weighted accuracy by first defining and normalizing the weights:

In[8]: w = [4, 1]

    w ./= sum(w)

Out[8]: 2-element Array{Float64,1}:

    0.8

    0.2

Now, we need to apply them to the calculation of the accuracy rate:

In[9]: 2*sum(diag(CM) .* w) / 12

Out[9]: 0.333333333333333

Note that “2” corresponds to the number of classes and “12” to the total number of classifications. In general, the formula for the weighted accuracy metric would be:

q*sum(diag(CM) .* w) / N

If you plan to use weighted accuracy often, it would be helpful to code it into a function such as the following:

function wa{T <: Real}(t::Array{Int64, 1}, y::Array{Int64, 1}, w::Array{T, 1})
  q = length(unique(t))    # number of classes
  N = length(t)            # total number of classifications
  CM = confusmat(q, t, y)  # confusion matrix (requires the MLBase package)
  return q*sum(diag(CM) .* w) / N
end

In the example, we chose the first class to be much more important than the second one, making the overall performance of the classifier vastly different. Before, the classifier seemed decent; after the application of the weights it seems subpar for this particular problem. If we were to reverse the weights and make the second class more important, the weighted accuracy would swing in the opposite direction, taking a value of about 0.833. It is important to choose the weights carefully if you want the weighted accuracy to provide meaningful and insightful results.
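As a quick sanity check of the wa() function above, you can verify both weightings discussed in this section (assuming t, y, and wa() are already defined as in the previous snippets):

wa(t, y, [0.8, 0.2])   # first class weighted more heavily: roughly 0.333
wa(t, y, [0.2, 0.8])   # second class weighted more heavily: roughly 0.833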

Precision and recall metrics

Accuracy often fails to capture how well a classifier performs for a particular class, as it focuses on delivering the classifier’s overall performance. If we wish to zoom in on a particular class we’ll need to make use of alternative metrics, such as precision and recall.

Precision depicts how reliable a classifier is for predicting a particular class. Mathematically it is defined as the number of hits over the total number of predictions for that particular class: TP / (TP + FP). So, for the given example, the precision of the classifier for class 1 is 1 / (1 + 2) = 0.333. In general, you can use the confusion matrix CM to calculate the precision of class c using the snippet:

CM[c,c] / sum(CM[:,c])

Recall shows how many of the cases of the particular class the classifier got right. Mathematically this is equivalent to the number of hits over the total number of cases for that class, predicted or otherwise: TP / (TP + FN). For our example, the recall of the classifier for class 1 is 1 / (1 + 3) = 0.25. The code to calculate recall in the general case is:

CM[c,c] / sum(CM[c,:])

Clearly, both the precision and the recall for class 1 are not excellent. The precision and recall of class 2 will be larger, since overall the performance of the system is 0.583 and there are only two classes. Of course, to know exactly what they are we’ll need to do the corresponding calculations.
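Since these snippets come up frequently, you may want to wrap them into a couple of one-line functions, here called P() and R() (the names are our own choice and not part of any package):

P(CM::Array{Int64,2}, c::Int64) = CM[c,c] / sum(CM[:,c])   # precision of class c
R(CM::Array{Int64,2}, c::Int64) = CM[c,c] / sum(CM[c,:])   # recall of class c

P(CM, 1), R(CM, 1)   # (0.333..., 0.25), matching the calculations above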

F1 metric

Although precision and recall capture important information about the performance of a classifier for a given class, neither of them provides a good snapshot of the bigger picture. One can easily tweak a classifier so that it has amazing precision or amazing recall, but it is extremely difficult to get a very good score for both (unless you actually improve the classifier or the features significantly). Any metric that took into account both precision and recall would be able to deliver a more holistic view of the classifier’s performance.

One such metric (the most popular one) is the F1 metric, which is expressed as a number between precision and recall, leaning toward whichever one is smaller. This is expressed mathematically using the harmonic mean (which you can read more about at http://bit.ly/29p4FpY), or in terms of the confusion matrix: F1 = 1/(1+(FP+FN)/(2TP)) = 2TP/(2TP+FP+FN). In the aforementioned example the F1 of the classifier would be: 2*1/(2*1+2+3) = 2/7 = 0.286. This general formula for the F1 metric can be expressed in Julia as:

2*CM[c,c] / (sum(CM[:,c]) + sum(CM[c,:]))

Alternatively, you can just calculate the harmonic mean of R(CM,c) and P(CM,c).
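Likewise, the F1 computation can be wrapped into a small function of its own (again, the name F1() is ours):

F1(CM::Array{Int64,2}, c::Int64) = 2*CM[c,c] / (sum(CM[:,c]) + sum(CM[c,:]))

F1(CM, 1)   # roughly 0.286, as calculated above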

Not surprisingly, the F1 metric for class 1 of this classification is not good either, as both the precision and the recall for that class are relatively low. Yet even if one of them were satisfactory, the other one would pull the F1 metric to a low value. As an exercise, you can calculate the F1 metric for the second class of this classification experiment.

Misclassification cost

Rarely do all errors have the same importance. In security applications, for example, if you misclassify a normal person as an intruder, it’s not the end of the world (though that person may get frustrated). If you misclassify an intruder as a normal person, though, you’ll make the headlines–and not in a positive way!

Defining the cost matrix

Mathematically we can express this difference in the importance of errors as different misclassification costs, usually arranged in a matrix. Obviously in such a setup, the diagonal of the matrix would be all zeros (since this corresponds to the correct classifications), with all the remaining elements being some non-negative numbers. This is often referred to as a cost matrix and in our example it could be something like this:

0 10

3  0

You can store this in Julia as:

C = [0 10; 3 0]

Note that this is unrelated to the classifier used and it is usually arbitrarily defined (unless we know how much each type of mistake would cost us). Also, the cost is not necessarily in a particular currency; it is usually the relative size of each cost that is important. Even if you double the whole matrix, it shouldn’t affect your view of the classifier’s performance over a series of classification experiments.

Calculating the total misclassification cost

In order to make use of the cost matrix, we need to combine it with the confusion matrix we calculated earlier. You can do this by applying the following simple piece of code:

In[10]: total_cost = sum(CM .* C)

Out[10]: 36

This number may not mean much to us, but if the cost matrix is populated in a meaningful way, this result would be more insightful. For example, if the numbers in C represented thousands in lost revenue, the above output would mean that this classifier’s imperfect performance would cost us $36,000. When optimizing for the total misclassification cost we want as small a number as possible.

Receiver Operating Characteristic (ROC) Curve and related metrics

ROC Curve

The ROC curve is one of the most common methods for assessing a classifier’s performance for cases with just two classes, particularly if you are new to the field. It doesn’t take any parameters and provides you with an insightful plot that you can share with your supervisors and anyone else who doesn’t need to see all of the details.

Contrary to what its name suggests, it’s not usually a curve, but more like a zig-zag line. It basically shows how well a classifier performs in terms of true positives and false positives, under various conditions. Usually the classifier is applied with a different set of parameters and its performance is noted as a point in the plot. Afterward, all the points are connected, resulting in a plot called the ROC curve. The closer the curve lies to the top-left corner of the plot, the better the classifier is overall.

In Figure 9.1 you can see a typical ROC curve. The dotted line across the plot shows the performance of a naive classifier as a reference. The closer the curve is to that line, the worse the classifier’s performance is.


Figure 9.1. A typical ROC curve of a classifier.

In order to create a ROC curve, though, you’ll need more than just the predicted labels of the classification. You’ll also require the probabilities of each data point’s classification. This is basically a measure of how confident the classifier is for a particular label, and in many cases is the score that defines the label (this score is sometimes referred to as “degree of confidence”). Let us first generate an array p with these probabilities (this would normally come from the classifier as one of its outputs):

In[11]: p = [0.6, 0.55, 0.65, 0.6, 0.7, 0.65, 0.9, 0.75, 0.55, 0.65, 0.8, 0.45]

You can generate ROC curves in Julia using the following code, after you’ve loaded the ROC package:

In[12]: using ROC

In[13]: z = (y .== t)

    rc = ROC.roc(p, z)

Since the roc() function exists in the MLBase package as well, we should make it explicit to Julia that we want to use the function from the ROC package, which is why we call this function by adding “ROC.” in front of it.

Now, let’s create a plot based on the ROC object we have created:

In[14]: plot(rc)


Figure 9.2. The ROC curve of the classifier in our example.

The MLBase package also has a few functions for the ROC curve, but if you want to create a plot of the curve itself, the easiest way is through the ROC package. Still, once you are comfortable with Julia, we recommend you give MLBase a try, as it is much better documented than the ROC package.
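If you’d like a feel for what these packages compute under the hood, here is a minimal sketch that derives the true positive rate (TPR) and false positive rate (FPR) by hand for a range of thresholds on the scores. It illustrates the general ROC logic rather than the exact algorithm of either package, and it assumes the p and z vectors from the previous snippets:

pos = sum(z)            # total number of positive cases
neg = length(z) - pos   # total number of negative cases

for threshold in 0.0:0.1:1.0
  predicted_pos = p .>= threshold      # cases predicted positive at this threshold
  TP = sum(predicted_pos .* z)         # predicted positive and actually positive
  FP = sum(predicted_pos) - TP         # predicted positive but actually negative
  println("threshold = $threshold: TPR = $(TP/pos), FPR = $(FP/neg)")
end

Each (FPR, TPR) pair corresponds to one point of the ROC curve; connecting them gives the zig-zag line shown in the plots above.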

AUC Metric

Since the ROC curve itself doesn’t tell us much about exactly how well a classifier performs, we often need to quantify its insight through a metric. AUC, which stands for area under curve, is the most popular metric as it shows exactly that, taking values between 0 and 1. The higher its value, the better the classifier performed in this particular experiment. Let’s look at how we can calculate AUC using Julia:

In[15]: AUC(rc)

Out[15]: 0.7142857142857143

Usually scores over 0.7 are considered satisfactory. In this fictitious example the classifier performed acceptably based on the AUC metric, though there is plenty of room for improvement. The relatively good performance is partly due to the fact that the classifier was not confident about the cases where it failed to provide a correct prediction. It is important to keep in mind that although this imaginary classifier performed decently overall, it may still not be a good choice for the problem at hand; take the evaluation of AUC with a grain of salt.

Gini Coefficient

This is another metric that stems from the ROC curve; it shows how a classifier’s performance compares to the random classifier represented by the straight line in the ROC plot. It takes values between -1 and 1, with 0 representing a performance as good as a random guess. The Gini Coefficient is calculated by the formula G = 2AUC - 1, where AUC is the area under the ROC curve metric, as we saw previously. Generally, a positive Gini Coefficient is good. However, it is not an absolute metric, since we may be interested in how a classifier performs for a particular class–something that this metric fails to capture.

In Julia there is no function that yields the Gini Coefficient of a classification, but if you want to calculate it you can use the AUC() function as a basis:

In[16]: GC(roc_curve) = 2*AUC(roc_curve) - 1

In[17]: GC(rc)

Out[17]: 0.4285714285714286

Based on this metric, the classifier in this fictitious experiment performs fairly well (about 43% better than chance), though there is still a lot of room for improvement.

Performance Metrics for Regression

Since the regression aspect of machine learning is a whole new animal, it comes with its own set of metrics. This is because the output of a regressor is a continuous variable, just like the target variable that it tries to predict. None of the classification metrics would work properly (if at all).

Regression performance metrics basically compute how different the prediction is from the target (this difference is also known as error). The lower they are, the better the performance of the regressor. The most widely-used metrics are the mean square error (MSE) and the sum squared error (SSE). There are a few reasons that performance metrics for regression tend to focus on squared errors:

  • There’s no need to worry about the sign of the error, since the square function eliminates it
  • They take care of extreme errors, as these are magnified
  • The squared error function is easy to differentiate, making it ideal for certain optimization methods.

MSE Metric and its variant, RMSE

By taking all the aforementioned squared errors and averaging them, we obtain the MSE metric. If we were to take the square root of this value, we would get the RMSE metric, which is a variant of MSE. RMSE is often used in physics, particularly in signal processing.

In Julia we can compute the MSE and RMSE metric as follows. First we need to get some sample data going:

In[18]: t = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 2.0, 3.0, 2.5, 1.5, 3.0]

    y = [-0.5, 0.6, 1.4, 1.2, 1.9, 2.6, 3.1, 2.4, 2.9, 2.5, 1.3, 2.8]

The first vector (t) has the target values, while the second (y) has the corresponding predictions. Next, we need to calculate the squared errors (se) based on these vectors:

In[19]: se = (t - y).^2

Now, we can calculate MSE and RMSE easily by executing the following code:

In[20]: mse = mean(se)

Out[20]: 0.06583333333333334

In[21]: rmse = sqrt(mse)

Out[21]: 0.2565800719723442

Here are a couple of functions for calculating the MSE and RMSE metrics swiftly. Unfortunately, they aren’t yet contained in a reliable package.

MSE(t::Array{Float64, 1}, y::Array{Float64, 1}) = mean((t-y).^2)

RMSE(t::Array{Float64, 1}, y::Array{Float64, 1}) = sqrt(mean((t-y).^2))

SSE Metric

If we were to add the squared errors instead of taking their average, we would end up with the SSE metric. This is also popular when evaluating the performance of a regression system. Though it correlates perfectly with the MSE metric (so there is no need to calculate both), it may be useful in some cases where you need to calculate performance several times (e.g. through an optimization process), as it is slightly faster to compute. Here is how Julia can help you make use of this metric by using the squared errors vector we calculated previously:

In[22]: sse = sum(se)

Out[22]: 0.79

Should you want to wrap this whole process in a function, you could use something like this:

SSE(t::Array{Float64, 1}, y::Array{Float64, 1}) = sum((t-y).^2)

Other metrics

All the performance metrics for regression are basically variants of the Euclidean distance between the predicted targets and the actual ones. So it is not difficult to create your own performance metrics, if you want to dig deeper into the evaluation of your regressor’s performance.

For example, instead of the vanilla-flavored mean used in MSE, you could use the harmonic or even the reverse harmonic mean (which is not canon in statistics, despite its utility) on the squared errors. These two alternatives are, respectively, milder and harsher on the extreme errors made by the regression system being assessed. Here are the corresponding functions for these metrics:

function harmean{T<:Real}(X::Array{T,1}, tol::Float64 = 0.1)
  # tol = tolerance parameter for avoiding "division by 0" errors
  return length(X) / sum(1./(X + tol)) - tol
end

function revharmean{T<:Real}(X::Array{T,1}, tol::Float64 = 0.1)
  # Reverse Harmonic Mean
  return maximum(X) + minimum(X) - harmean(X, tol)
end

function HSE{T<:Real}(y::Array{T,1}, t::Array{T,1})
  # Harmonic mean of Squared Error
  SE = (y-t).^2
  return harmean(SE)
end

function RHSE{T<:Real}(y::Array{T,1}, t::Array{T,1})
  # Reverse Harmonic mean of Squared Error
  SE = (y-t).^2
  return revharmean(SE)
end

Feel free to explore other alternatives as well. At the end of the day, the data science field is work in progress, so who is to say that the best performance metric for regression isn’t yet to be discovered? Besides, it may be the case that a custom performance metric works better for your data science problem.

K-fold Cross Validation (KFCV)

KFCV is an evaluation strategy that allows you to check whether a particular machine learning system achieves good generalization by performing a series of experiments on it. The experiments make use of the whole dataset by partitioning it into K roughly equal parts, which are used as the test set in K different iterations. In order to obtain useful results from KFCV in the case of classification problems, we often must perform stratified sampling.

The logic of KFCV is that if a machine learning system happens by chance to perform well at a particular sample, it might fool us into believing that it is a reliable one. However, it would be practically impossible to repeat this lucky break again and again, over mutually exclusive test sets. So, if a classifier or regressor has a lucky run, it may influence the results of a single experiment, but not of the whole series of experiments dictated by KFCV.

The selection of the K parameter in KFCV can be tricky. There is no “right answer” since it depends, as usual, on the problem at hand. Usually a K value of 10 provides adequate results (which is why it is the default option in the corresponding script for this task).

If we were to choose the maximum possible value for K (i.e. the total number of data points in the dataset), we would end up with a special evaluation technique that is referred to as Leave-One-Out validation. This is particularly useful for very small datasets, where sampling the data doesn’t make much sense as it would compromise the signal in the data. Although well-documented, this particular evaluation method is rarely used in data science (you could say that data scientists tend to “leave it out” of their toolbox!).

If you want to learn more about KFCV, we recommend studying a set of slides prepared by Professor Gutierrez-Osuna; they can be found at http://bit.ly/1gOxGwp. Also, even though you can use any kind of sampling for KFCV when dealing with classification problems, we strongly recommend that you go with stratified sampling; it is easy to get a biased sample at some point, after repeating the process K times.

Applying KFCV in Julia

You can use Julia to apply K-fold Cross Validation to your data using the provided script KFCV.jl (unfortunately, there isn’t a decent implementation of the KFCV method in any of the available packages at this time). Here is how you can do this for the magic dataset, using K = 10.

First you need to make sure that the KFCV.jl file is in your working directory. Then you can go ahead and load it:

In[23]: include("KFCV.jl")

Finally, you can apply it on your data as follows:

In[24]: P, T, PT, TT = KFCV(data[:,1:(end-1)], data[:,end], 10);

The semicolon at the end is optional, but can help keep your notebook more usable (if you don’t use the semicolon, Julia will flood your notebook with data that you don’t need to view).

The outputs of this function are all arrays consisting of 10 elements each. Respectively, they are as follows (a minimal sketch of how such a partitioning could be implemented is given right after this list):

  • P: The training inputs (features) for each KFCV experiment
  • T: The training outputs (labels) for each KFCV experiment
  • PT: The testing inputs (features) for each KFCV experiment
  • TT: The testing outputs (labels) for each KFCV experiment.
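Since the KFCV.jl script itself is not listed in this chapter, below is a minimal sketch of how such a partitioning function could be put together with the built-in randperm() function. The function name kfcv_partition() is our own and this is an illustration of the general idea, not the actual contents of KFCV.jl (which, for classification problems, would preferably rely on stratified sampling, as noted earlier):

# X = features, t = target variable, K = number of folds (defaults to 10)
function kfcv_partition(X::Array{Float64,2}, t::Array, K::Int64 = 10)
  N = size(X, 1)
  ind = randperm(N)    # shuffle the data point indices (in Julia 1.x, randperm lives in the Random stdlib)
  P = Any[]; T = Any[]; PT = Any[]; TT = Any[]

  for k = 1:K
    test = ind[(round(Int64, (k-1)*N/K) + 1):round(Int64, k*N/K)]  # indices of fold k
    train = setdiff(ind, test)                                     # all the remaining indices
    push!(P, X[train,:]);  push!(T, t[train])    # training inputs and outputs
    push!(PT, X[test,:]);  push!(TT, t[test])    # testing inputs and outputs
  end

  return P, T, PT, TT
end

Setting K equal to N in such a function would give you the Leave-One-Out strategy mentioned earlier.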

KFCV tips

Although KFCV is better than any single test you can run on a machine learning system (even with stratified sampling), it is not a panacea. There are times when a classifier or a regressor will perform relatively well on a series of KFCV experiments simply by chance; this is why we recommend running enough KFCV rounds to yield statistically significant results. However, as the data you use for developing your model gets larger, the need for repeated KFCV runs diminishes.

Also, if you want to compare two classifiers or two regressors to each other, you are better off using the same KFCV partitioning for both systems. This is particularly important if both machine learning systems are of about the same level of performance.

Summary

Sampling:

  • Sampling provides a smaller dataset that is more practical to work with, while at the same time preserving (to some extent) the signals of the original dataset.
  • Basic sampling is indifferent to the class structure and works fine with relatively balanced datasets. It is the most common way of sampling for regression problems.
  • Stratified sampling takes into account the number of elements in each class and produces a sample that has the same class distribution. It is particularly suitable for unbalanced datasets and for K-fold Cross Validation.
  • Evaluate results by determining the performance of a classifier or a regressor.

Classification:

  • There are several metrics for evaluating a classifier, the most important of which are accuracy (basic and weighted), precision, recall, F1, total cost (based on a cost matrix), and the ROC curve (along with its related metrics). Most of these are related to the confusion matrix of a particular classification.
  • Confusion matrix is a q x q matrix showing how well a classifier performs in predicting the correct labels and where the various errors lie (q = number of classes). The elements in the diagonal correspond to all the correct classifications.
  • Accuracy is a basic metric that takes the ratio of the number of the correct classifications over the total number of predictions. It takes values between 0 and 1 (the higher the better). It works well for balanced datasets.
  • Weighted accuracy is a variant that takes into account the importance of each class and shifts the value of the metric accordingly. This is good for unbalanced datasets, or datasets where we care mainly about the predictions related to certain classes. The weights are defined arbitrarily.
  • Precision is a metric that shows how reliable a classifier is for a particular class.
  • Recall is a measure that depicts how many of the elements of a particular class are classified correctly.
  • F1 is the harmonic mean of precision and recall, expressing the performance of a classifier for a given class.
  • Cost matrix is a q x q matrix showing how much each misclassification costs (its diagonal is full of zeros). This is very similar to the weighted accuracy and is used to calculate the overall cost of all the errors made by a given classifier for a particular dataset. The costs are defined in an arbitrary way.
  • An ROC curve is a plot that shows how well a classifier does overall, when it comes to problems with two classes.
  • The AUC metric (short for area under curve) is a measure of a binary classifier’s performance, based on the ROC curve. It takes values between 0 and 1 (the higher the better).
  • The Gini Coefficient is another metric that is based on the ROC curve and shows how well a binary classifier performs compared to random guessing. It takes values between -1 and 1 (the higher the better), with any positive value being better than random classification.

Regression:

  • Regression is evaluated using some variant of the distance of the predicted values to the actual values. The most common metrics used are mean square error (MSE), and sum of square error (SSE).
  • Square error is a vector consisting of the errors the regressor has made for the various data points, each raised to the power of two. The errors themselves are the differences between the actual values and the predicted values.
  • MSE is the arithmetic average of the squared errors in a regression problem.
  • RMSE is the square root of the MSE in a regression problem.
  • SSE is the sum of all the squared errors in a regression problem. This is equivalent to the square of the Euclidean distance between the predictions and the actual values, expressed as vectors.

K-fold Cross Validation (KFCV):

  • This evaluation strategy involves several runs of a classification or regression experiment in such a way that we obtain a holistic view of its performance, ensuring that the results are not because of the particular training and testing sets used.
  • With KFCV, the dataset is partitioned into K approximately equal parts. Each one of them is used as a test set in the K experiments that follow.
  • The value of K in KFCV depends on the dataset size. Larger datasets require a small K, while smaller datasets are better off with a larger K.
  • The Leave-One-Out evaluation strategy is a special case of KFCV, where the K parameter is equal to the size of the dataset. This is suitable for very small datasets and is not often used in data science.

Chapter Challenge

  1. Can you use stratified sampling for a dataset with a continuous variable as a target?
  2. Which sampling method would you use for a heavily unbalanced dataset? Why?
  3. What aspect of sampling yields the least bias in the sample created?
  4. Is it possible to have a normalized version of the total cost (based on the misclassification cost approach), taking values between 0 and 1? How would you go about implementing this?
  5. Could you use a ROC curve for a three-class problem?
  6. Would it be possible to use KFCV for any problem? Give an example to justify your answer.