Chapter 10. Improving Models and Data Extraction

How do you go about improving upon a simple machine learning algorithm such as Naive Bayesian Classifiers, SVMs, or really any method? That is what we will delve into in this chapter, by talking about four major ways of improving models:

  • Feature selection

  • Feature transformation

  • Ensemble learning

  • Bootstrapping

I’ll outline the benefits of each of these methods but in general they reduce entanglement, overcome the curse of dimensionality, and reduce correction cascades and sensitivity to data changes.

Each of these methods has its pros and cons and should be used with a purpose in mind. Sometimes a problem is complex enough that tweaking and improvement at this level are warranted; other times they are not. That is a judgment call that depends on the business context.

Debate Club

I’m not sure if this is common throughout the world, but in the United States, debate club is a high school fixture. For those of you who haven’t heard of this, it’s a simple idea: high schoolers will take polarizing issues and debate their side. This serves as a great way for students who want to become lawyers to try out their skills arguing for a case.

The fascinating thing about this is just how rigorous and disciplined these kids are. Usually they study all kinds of facts to put together a dossier of important points to make. Sometimes they argue for a side they don’t agree with but they do so with conviction.

Why am I telling you this? These debate club skills are the key to making machine learning algorithms (and in many cases any algorithm) work better:

  • Collecting factual and important data

  • Arguing different points of view in multiple ways

As you can imagine, if we collect important and relevant data to feed into our models, and try different methods or approaches to the same problem, we can iteratively improve as we find the best model combination.

This gets us into what we will be talking about: picking better data or arguing for solutions more effectively.

Picking Better Data

In this section we’ll be discussing how to pick better data. Basically, we want to find the most compact, simplest set of data that supports what we are trying to solve. Intuitively, that sounds like wanting the data that supports our conclusion, which is a bit of cart before the horse; regardless, there are two great methods for improving the data we use: feature selection and feature transformation.

This sounds like a great idea, but what is the motivation behind picking better data?

Generally speaking, machine learning methods work better with fewer dimensions that are well correlated with the target. As we have discussed, with many dimensions models can become overfit, entangled, or track the wrong signal. We don’t want to under- or overfit our data, so finding the best set of features to map is the best use of our time.

Feature Selection

Let’s think about some data that doesn’t make a whole lot of sense. Say we want to measure weather data and want to be able to predict temperature given three variables: “Matt’s Coffee Consumption,” “Ice Cream Consumption,” and “Season” (see Table 10-1 and Figure 10-1).

Table 10-1. Weather data for Seattle
Average temperature (°F) | Matt’s coffee consumption (cups) | Ice cream consumption (scoops) | Month
47 | 4.1 | 2 | Jan
50 | 4 | 2 | Feb
54 | 4 | 3 | Mar
58 | 4 | 3 | Apr
65 | 4 | 3 | May
70 | 4 | 3 | Jun
76 | 4 | 4 | Jul
76 | 4 | 4 | Aug
71 | 4 | 4 | Sep
60 | 4 | 3 | Oct
51 | 4 | 2 | Nov
46 | 4.1 | 2 | Dec

Figure 10-1. A graph comparing my consumption of ice cream (in scoops) and coffee (in cups) with the temperature

Obviously you can see that I generally drink about 4 cups of coffee a day. I tend to eat more ice cream in the summertime and it’s generally hotter around that time.

But what can we do with this data? Given N dimensions there are N choose K possible subsets of size K, so we can form an enormous number of feature combinations of various sizes.

At this point we want to reduce the number of dimensions we are looking at but don’t know where to start. In general we want to minimize the redundancy of our data while maximizing the relevancy. As you can imagine, this is a tradeoff: if we keep all the data, we know we haven’t dropped anything relevant, but we almost certainly carry redundancy, especially if we have lots and lots of dimensions; if we cut dimensions, we reduce redundancy but risk throwing away something relevant.

We have talked about this before as an entanglement problem: too many dimensions pointing at the same thing.

In general, redundancy and relevancy are calculated using the same metrics and on a spectrum:

  • Correlation

  • Mutual information

  • Distance from some point (Euclidean distance from reference)

So they actually end up measuring the same thing. How do we solve this?

Let’s first take a step back and think about what would happen if we just looked at all possibilities.

Exhaustive Search

Let’s imagine that in this case we want to find the best possible dimensions to train on. We could realistically just search through all possibilities. In this case we have three dimensions which would equate to seven models (123, 12, 13, 23, 1, 2, 3). From here we could say that we want to find the model that has the highest accuracy (Figure 10-2).

tmlp 1002
Figure 10-2. Exhaustive search for best features

This unfortunately doesn’t scale as you go up in dimensions. If, for instance, you have 10 dimensions, the number of possible subsets, from all 10 dimensions down to a single dimension, is 2^10 – 1 = 1,023. This can be seen in Pascal’s triangle (Figure 10-3) as a sum of combinations:

\binom{10}{10} + \binom{10}{9} + \cdots + \binom{10}{1}
Figure 10-3. Pascal’s triangle

Pascal’s triangle shows all combinations for a given row. Since each row sums up to 2^n, all we need to do is subtract 1, so we don’t account for zero dimensions.

So as you add dimensions you would have to account for 2^n – 1 possible data sets. If you had 3,000 dimensions (which would be a good reason to use feature selection), you would have roughly a trecentillion (10^903) models to run through!
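
Before we give up on the idea entirely, here is a minimal sketch of what that brute-force search could look like, assuming X is a NumPy feature matrix and y is the target vector. It scores each subset with scikit-learn’s LinearRegression on training R², purely for illustration; cross-validated error would be the more honest scorer.

from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def exhaustive_feature_search(X, y):
    # Try every nonempty subset of columns and keep the best-scoring one
    best_score, best_subset = -np.inf, None
    for k in range(1, X.shape[1] + 1):
        for subset in combinations(range(X.shape[1]), k):
            cols = list(subset)
            model = LinearRegression().fit(X[:, cols], y)
            score = model.score(X[:, cols], y)  # training R^2, for brevity
            if score > best_score:
                best_score, best_subset = score, cols
    return best_subset, best_score

With the three weather dimensions this loops over the same seven subsets listed above; with thousands of dimensions the loop never finishes, which is exactly the problem.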

Surely there is a better way. We don’t need to try every model. Instead, what if we just randomly selected features?

Random Feature Selection

A lot of the time random feature selection will be useful enough for our purposes. Reducing the features by half, or by some other fraction, is an excellent way of reducing overfitting. The added benefit is that you really don’t have to think about it much; you just try a random selection of a certain percentage of the features.

Say, for instance, you want to reduce the features by 25%. You could randomly drop a quarter of them and see how the model performs in terms of accuracy, precision, or recall. This is a simple way of selecting features, but there is one major downside: what if training the model is slow? You are still brute-forcing your way to finding features, arbitrarily picking a number and hoping for the best. Surely there is a better way.
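
As a rough sketch, random feature selection can be as simple as picking columns at random; here X is assumed to be a NumPy feature matrix, and the keep_fraction name and 75% default are just illustrative choices.

import numpy as np

rng = np.random.default_rng(42)

def random_feature_subset(X, keep_fraction=0.75):
    # Keep a random fraction of the columns and drop the rest
    n_features = X.shape[1]
    n_keep = max(1, int(round(n_features * keep_fraction)))
    keep = rng.choice(n_features, size=n_keep, replace=False)
    return X[:, keep], keep

You would then train and evaluate on the reduced matrix, and repeat with different random subsets until one performs acceptably.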

A Better Feature Selection Algorithm

Instead of relying on random feature selection, let’s think a little more in terms of what we want to improve with our model. We want to increase relevancy while reducing redundancy. Relevancy measures how strongly the dimension in question relates to the classification, whereas redundancy measures how much the dimension overlaps with all of the other dimensions. For both relevancy and redundancy you usually use either correlation or mutual information.

Correlation is useful for data that is continuous in nature and not nominal. By contrast, mutual information gives us a measure of the information shared between the two dimensions in question and also works for discrete or nominal data.

Using our earlier example, correlation would look like the results in Table 10-2 for relevancy and Table 10-3 for redundancy.

Table 10-2. Relevancy using correlation
Dimension | Correlation to temperature
Matt’s coffee consumption | –0.58
Ice cream | 0.93
Month | 0.16

Table 10-3. Redundancy using correlation
Dimension | Matt’s coffee consumption | Ice cream | Month
Matt’s coffee consumption | 1 | –0.54 | 0
Ice cream | –0.54 | 1 | 0.14
Month | 0 | 0.14 | 1

As you can see from these two tables, ice cream is highly correlated with temperature and my coffee consumption is somewhat negatively correlated with temperature; the month seems irrelevant. Intuitively we would think month would make a huge difference, but since it runs on a modular clock it’s hard to model using linear approximations. The redundancy is more interesting. Taken out of context my coffee consumption and month seem to have low redundancy, while coffee and ice cream seem more redundant.
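
If you want to see where these numbers come from, a short sketch with NumPy and scikit-learn should reproduce (up to rounding) the correlations in Tables 10-2 and 10-3 and also estimate mutual information, the alternative measure mentioned earlier; the mutual information values are nearest-neighbor estimates and will vary slightly.

import numpy as np
from sklearn.feature_selection import mutual_info_regression

# The columns of Table 10-1
temperature = np.array([47, 50, 54, 58, 65, 70, 76, 76, 71, 60, 51, 46])
coffee = np.array([4.1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4.1])
ice_cream = np.array([2, 2, 3, 3, 3, 3, 4, 4, 4, 3, 2, 2])
month = np.arange(1, 13)

features = np.c_[coffee, ice_cream, month]

# Relevancy: correlation of each feature with temperature (Table 10-2)
relevancy = [np.corrcoef(col, temperature)[0, 1] for col in features.T]

# Redundancy: correlations among the features themselves (Table 10-3)
redundancy = np.corrcoef(features.T)

# Mutual information as the alternative measure
mi = mutual_info_regression(features, temperature)

print(relevancy)
print(redundancy)
print(mi)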

So what can we do with this data? Next I’m going to introduce a significant algorithm that brings this all together.

Minimum Redundancy Maximum Relevance Feature Selection

To bring all of these competing ideas together into one unified algorithm, there is minimum redundancy maximum relevance (mRMR) feature selection, which aims to maximize relevancy while minimizing redundancy. We can set this up as a maximization (or, equivalently, a minimization) problem and solve it using NumPy and SciPy.

In this formulation, x_i indicates whether dimension i is selected, c_i is that dimension’s relevancy to the target, and a_{ij} is the pairwise redundancy between dimensions i and j. We then just optimize the following objective (the code below minimizes its negative):

Equation 10-1. mRMR definition

\max\left(\mathrm{Relevancy} - \mathrm{Redundancy}\right)

Equation 10-2. Relevancy definition

\mathrm{Relevancy} = \frac{\sum_{i=1}^{n} c_i x_i}{\sum_{i=1}^{n} x_i}

Equation 10-3. Redundancy definition

\mathrm{Redundancy} = \frac{\sum_{i,j=1}^{n} a_{ij} x_i x_j}{\left(\sum_{i=1}^{n} x_i\right)^2}

More importantly in code we have:

from scipy.optimize import minimize
import numpy as np

# Columns: temperature (°F), coffee (cups), ice cream (scoops), month number
matrix = np.array([
    [47, 4.1, 2, 1],
    [50, 4, 2, 2],
    [54, 4, 3, 3],
    [58, 4, 3, 4],
    [65, 4, 3, 5],
    [70, 4, 3, 6],
    [76, 4, 4, 7],
    [76, 4, 4, 8],
    [71, 4, 4, 9],
    [60, 4, 3, 10],
    [51, 4, 2, 11],
    [46, 4.1, 2, 12]
])

# Correlation matrix of all four columns
corrcoef = np.corrcoef(np.transpose(matrix))

# Relevancy: correlation of each feature with temperature (column 0)
relevancy = np.transpose(corrcoef)[0][1:]

# Set initial guess to all dimensions switched on
x0 = [1, 1, 1]

# Minimize the redundancy minus the relevancy
fun = lambda x: (sum(corrcoef[i + 1, j + 1] * x[i] * x[j]
                     for i in range(len(x))
                     for j in range(len(x))) / (sum(x) ** 2)
                 - sum(relevancy * x) / sum(x))

res = minimize(fun, x0, bounds=((0, 1), (0, 1), (0, 1)))

res.x

array([ 0.29820206,  1.        ,  0.1621906 ])

This gives us almost exactly what we expected: my ice cream consumption models the temperature quite well. For bonus points you could use an integer programming method to get the values to be either 0 or 1, but for these purposes it’s obvious which features should be selected to improve the model.
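
As a rough stand-in for integer programming, you could simply threshold the continuous weights returned above; the 0.5 cutoff here is an arbitrary choice.

import numpy as np

# The weights returned by the optimizer above (res.x)
weights = np.array([0.29820206, 1.0, 0.1621906])
selected = (weights > 0.5).astype(int)
print(selected)  # [0 1 0]: keep only the ice cream dimension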

Feature Transformation and Matrix Factorization

We’ve actually already covered feature transformation quite well in the previous chapters. For instance, clustering and the kernel trick are both feature transformation methods, effectively taking a set of data and projecting it into a new space, whether it’s a cluster number or an expanded way of looking at the data. In this section, though, we’ll talk about another set of feature transformation algorithms that are rooted in linear algebra. These are generally used to factor a matrix down to a smaller size, and they can be used to improve models.

To understand feature transformation, let’s take a look at a few algorithms that transform a matrix into a new, more compressed or more verbose version of itself: principal component analysis and independent component analysis.

Principal Component Analysis

Principal component analysis (PCA) has been around for a long time. The algorithm simply finds the direction of greatest variance in the data and takes that as the first principal component. This is very similar to how regression works, in that it determines the best direction to map the data onto. Imagine you have a noisy data set that looks like Figure 10-4.

Figure 10-4. Graphical PCA from Gaussian

As you can see, the data has a definite direction: up and to the right. If we were to determine the principal component, it would be that direction, because the variance is maximal along it. The second principal component ends up being orthogonal to the first, and by repeating the process we can reduce our dimensions by projecting the data onto these principal directions.
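
As a quick illustration rather than a full implementation, here is how PCA might look on synthetic data shaped like Figure 10-4, using scikit-learn; the data and parameters are invented for the example.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Noisy 2-D points stretched "up and to the right," as in Figure 10-4
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.3, size=500)
data = np.c_[x, y]

pca = PCA(n_components=2)
pca.fit(data)

print(pca.components_[0])             # direction of maximal variance
print(pca.explained_variance_ratio_)  # variance captured by each component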

Another way of thinking about PCA is how it relates to faces. When you apply PCA to a set of faces, an odd result happens known as the Eigenfaces (see Figure 10-5).

Figure 10-5. Eigenfaces (Source: AT&T Laboratories)

While these look quite odd, it is fascinating that what comes out is really an average face summed up over all of the training data. Instead of implementing PCA now, we’ll wait until the next section where we implement an algorithm known as independent component analysis (ICA), which actually relies on PCA as well.

Independent Component Analysis

Imagine you are at a party and your friend is coming over to talk to you. Near you is someone you hate who won’t shut up, and on the other side of the room is a washing machine that keeps making noise (see Figure 10-6).

Figure 10-6. Cocktail party example

You want to know what your friend has been up to, so you listen to her closely. Being human, you are adept at separating out sounds like the washing machine and that loudmouth you hate. But how could we do that with data?

Let’s say that instead of listening to your friend, you only had a recording and wanted to filter out all of the noise in the background. How would you do something like that? You’d use an algorithm called ICA.

Technically, ICA minimizes mutual information, or the information shared between the two variables. This makes intuitive sense: find me the signals in the aggregate that are different.
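
To make the cocktail-party idea concrete, here is a small sketch using scikit-learn’s FastICA on two synthetic signals standing in for your friend and the washing machine; the signals and mixing matrix are invented for illustration.

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
friend = np.sin(2 * t)                    # the voice you want to hear
washing_machine = np.sign(np.sin(3 * t))  # square-wave background noise
sources = np.c_[friend, washing_machine]

# Two "microphones," each hearing a different mixture of the two sources
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.0]])
observed = sources @ mixing.T

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)   # estimated independent signals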

Compared to our face recognition example in Figure 10-5, what does ICA extract? Well, unlike Eigenfaces, it extracts features of a face, like noses, eyes, and hair.

PCA and ICA are useful for transforming data into more succinct, more informative representations (see Figure 10-7). We can then feed that data into our models to give them more useful and relevant information, which improves them beyond what cross-validation alone can do.

Figure 10-7. ICA extraction example

Now that we know about feature transformation and feature selection, let’s discuss what we can do in terms of arguing better for a classification or regression point.

Ensemble Learning

Up until this point we have discussed selecting dimensions as well as transforming dimensions into new ones. Both of these approaches can be quite useful when improving models or the data we are using. But there is yet another way of improving our models: ensemble learning.

Ensemble learning is a simple concept: build multiple models and aggregate them together. We have already encountered this with random forests in Chapter 5.

A common example of ensemble learning is actually weather forecasting. When you hear a forecast for the next week, you are most likely hearing an aggregation of multiple weather models. For instance, the European model (ECMWF) might predict rain and the US model (GFS) might not. Meteorologists take both of these models, determine which one is most likely to be right, and deliver that information during the evening news.

When aggregating multiple models, there are two general methods of ensemble learning: bagging, a naive method; and boosting, a more elegant one.

Bagging

Bagging, or bootstrap aggregation, is a very useful technique. The idea is simple: take a training set and generate new training sets from it.

Let’s say we have a training set of data that is 1,000 items long and we resample it into 50 training sets of 100 items apiece. (Because we sample with replacement, these 50 training sets will overlap, which is okay as long as no two sets are identical.) From here we could feed each one into its own model.

Now at this point we have 50 different models telling us 50 different answers. Like the weather report just mentioned, we can either find the one we like the most or do something simpler, like average all of them.

This is what bootstrap aggregating does: it averages the results of all of the models built from resamples of the same training set. The amazing thing about bagging is that in practice it ends up improving models substantially, because averaging has a tendency to cancel out some of the outliers.
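
A minimal sketch of bagging, assuming X_train, y_train, and X_test are NumPy arrays with labels coded as –1/+1 and decision trees as the base model, might look like this; for regression you would average the predictions instead of voting.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_models=50, seed=0):
    rng = np.random.default_rng(seed)
    all_preds = []
    for _ in range(n_models):
        # Sample rows with replacement to build a new training set
        idx = rng.integers(0, len(X_train), size=len(X_train))
        model = DecisionTreeClassifier()
        model.fit(X_train[idx], y_train[idx])
        all_preds.append(model.predict(X_test))
    # Majority vote across the models (labels are assumed to be -1/+1)
    return np.sign(np.mean(all_preds, axis=0))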

But should we stop here? Bagging seems like a bit of a lucky trick and also not very elegant. Another ensemble learning tool is even more powerful: boosting.

Boosting

Instead of splitting the training data across multiple models, we can use another method, boosting, to find the best weighting scheme for a training set.

Given a binary classification model like SVMs, decision trees, Naive Bayesian Classifiers, or others, we can boost the training data to actually improve the results.

Assuming that you have a similar training set to what we just described with 1,000 data points, we usually operate under the premise that all data points are important or that they are of equal importance. Boosting takes the same idea and starts with the assumption that all data points are equal. But we intuitively know that not all training points are the same. What if we were able to optimally weight each input based on what is most relevant?

That is what boosting aims to do. Many algorithms can do boosting but the most popular is AdaBoost.

To use AdaBoost we first need to fix up the training data just a bit. There is a requirement that all training data answers are either 1 or –1. So, for instance, with spam classification we would say that spam is 1 and not spam is –1. Once we have changed our data to reflect that, we can introduce a special error function:

E(f(x), y, i) = e^{-y_i f(x_i)}

This function is quite interesting. Table 10-4 shows all four cases.

Table 10-4. Error function in all cases
f(x) | y | e^{-y_i f(x_i)}
1 | 1 | 1/e
–1 | 1 | e
1 | –1 | e
–1 | –1 | 1/e

As you can see, when f(x) and y are equal, the error is minimal (1/e), but when they disagree it is much higher (e).
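
You can confirm the four cases in Table 10-4 with a couple of lines of NumPy:

import numpy as np

for f, y in [(1, 1), (-1, 1), (1, -1), (-1, -1)]:
    print(f, y, np.exp(-y * f))  # prints 1/e, e, e, 1/e respectively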

From here we can run a number of iterations and descend on a better weighting scheme using this algorithm:

  • Choose a hypothesis function (either SVMs, Naive Bayesian Classifiers, or something else)

    • Using that hypothesis, sum up the weights of the points that were misclassified:

      \epsilon = \sum_{h(x_i) \neq y_i} w_i

    • Choose a learning rate based on the error rate:

      \alpha = \frac{1}{2} \ln\left(\frac{1 - \epsilon}{\epsilon}\right)

  • Add to the ensemble:

    F_t(x) = F_{t-1}(x) + \alpha_t h_t(x)

  • Update the weights:

    w_{i,t+1} = w_{i,t} e^{-y_i \alpha_t h_t(x_i)}

    for all weights

  • Renormalize the weights by making sure they add up to 1

What this does is converge on the best possible weighting scheme for the training data. It can be shown that this is a minimization problem over a convex set of functions.

This meta-heuristic can be excellent at improving mediocre results from any weak classifier, whether Naive Bayesian Classification or something like decision trees.
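
To make the steps concrete, here is a compact from-scratch sketch of AdaBoost that uses depth-1 decision trees from scikit-learn as the weak hypothesis; it assumes X and y are NumPy arrays with labels coded as –1/+1.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)  # start with all points weighted equally
    ensemble = []            # list of (alpha, weak hypothesis) pairs

    for _ in range(rounds):
        # Choose a hypothesis: a decision stump trained on the current weights
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)

        # Sum up the weights of the misclassified points
        epsilon = w[pred != y].sum()
        if epsilon == 0 or epsilon >= 0.5:
            break

        # Learning rate based on the error rate
        alpha = 0.5 * np.log((1 - epsilon) / epsilon)
        ensemble.append((alpha, stump))

        # Update and renormalize the weights
        w = w * np.exp(-y * alpha * pred)
        w = w / w.sum()

    return ensemble

def adaboost_predict(ensemble, X):
    scores = sum(alpha * stump.predict(X) for alpha, stump in ensemble)
    return np.sign(scores)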

Conclusion

You’ve learned a few different tricks of the trade for improving existing models: feature selection, feature transformation, ensemble learning, and bagging. In one big graphic it looks something like Figure 10-8.

Figure 10-8. Feature improvement in one model

As you can see, ensemble learning and bagging mostly focus on building many models and trying out different ideas, while feature selection and feature transformation are about modifying and studying the training data.
