CHAPTER 8: Manipulating the Fabric of the Data Space

Contrary to what many people think, more data is not always better. This is particularly true when it comes to additional features (or dimensions). Take for example Sherlock Holmes, the famous fictional detective, who managed to solve the most baffling cases. Although his observation skills were unparalleled, a big part of his success was due to his ability to focus on the most essential aspects of the problem, eliminating all the unnecessary information. This is the human equivalent of dimensionality reduction in all its glory.

Of course, keeping all the features in a dataset can be a good thing in certain cases. Most of the benchmark datasets used to test a new machine learning system are dense in feature sets, with each one of their features contributing to the end result. In data science, though, the datasets worked with are often very large, not just in the number of data points but also in their dimensionality. This makes the signal seem like a needle in a haystack of noise. A good data scientist takes this haystack and removes as much of the hay as possible before starting the modeling work, making the job much easier.

There are many ways to accomplish this “haystack thinning.” Some methods entail merging features together into more meaningful meta-features, while others involve examining and evaluating the original features and keeping only those that seem most relevant.

Another way to distinguish these methods is whether they use the features by themselves, or in tandem with the target variable. These two families of approaches are usually referred to as unsupervised and supervised, respectively.

In this chapter we will cover the following topics:

  • Principal Component Analysis as one of the popular methods of reducing dimensionality
  • Feature evaluation and selection as a worthwhile alternative to dimensionality reduction
  • Additional dimensionality reduction approaches.

Let us now examine some specific methods based on these approaches to dimensionality reduction.

Principal Components Analysis (PCA)

One of the most popular methods used for dimensionality reduction is PCA, a method developed in statistics. Its premise is that because information in a dataset is expressed as variance, the axes upon which most of this variance is measured must be the ones richest in information; therefore it is these axes we should focus on. This method has the following advantages:

  • PCA requires no assumptions about the distribution(s) of the dataset.
  • PCA has a high compression rate (with good fidelity when information is uncompressed).
  • There is no need for labels in the dataset as PCA works using only the inputs.
  • PCA has a strong mathematical background.
  • There is no need to normalize the data before applying PCA (although features on very different scales can dominate the leading components, so scaling is often still advisable).

The output of this method is a feature set consisting of a number of meta-features, ordered by importance (with the first one being the most information-rich). The exact number of meta-features, also referred to as “principal components,” is usually determined by a parameter of the method. As you would expect, the higher this number, the less information is lost in the process. However, adding more meta-features follows the law of diminishing returns; this is why, in most cases, we focus on a small number of meta-features.
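To make this more tangible, here is a minimal sketch of PCA computed from scratch on some hypothetical random data. This is purely illustrative (the variable names are ours); later in the chapter we use the ready-made implementation in the MultivariateStats package instead.

A = randn(1000, 5)                          # hypothetical data: 1000 points, 5 features
Ac = A .- mean(A, 1)                        # center each feature (on recent Julia: mean(A, dims = 1))
C = Ac' * Ac / (size(A, 1) - 1)             # covariance matrix of the features
vals, vecs = eig(C)                         # eigen-decomposition (on recent Julia: F = eigen(C); F.values, F.vectors)
order = sortperm(vals, rev = true)          # axes sorted by the variance they capture
explained = cumsum(vals[order]) / sum(vals) # cumulative proportion of variance explained
Z = Ac * vecs[:, order[1:2]]                # the data expressed in the top two principal components

The explained vector is what the diminishing returns argument refers to: by construction, each additional component adds less and less to it.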

Fortunately, the implementation of PCA in Julia is optimized so as to find the best number of meta-features on its own, making the implementation of the method even more straightforward. Of course, you can always add an extra parameter for a more aggressive dimensionality reduction, if you so prefer. We’ll look into the specifics of this method shortly.

PCA is mainly used for visualization purposes, and is popular among all kinds of analysts due to its simplicity. However, the method is not free of drawbacks, including:

  • The resulting meta-features are linear combinations of the original ones that often lack interpretability
  • PCA doesn’t scale well, making it impractical for very large datasets containing thousands of features.

Regardless of the inherent weaknesses of this method, it is a powerful tool that is also easy to use. Should you wish to learn more about it, we suggest that you read about it in any statistics book, or check out L.I. Smith’s comprehensive online tutorial at http://bit.ly/MU8WV7.

One last thing to keep in mind about PCA is that it is ideal for reconstructing the original feature set, if needed, as the error would be minimal. This is why it is often used for image processing with much success. Let us now look at how Julia can make use of this technique with one of our datasets.

Applying PCA in Julia

As we saw in previous chapters, the OnlineNewsPopularity dataset is rather large when it comes to features, and not all of them seem to be relevant to the target variable, which is the number of shares a given news article receives. Julia can help us reduce the size of the feature set to make the dataset easier to handle. One way to do this is through the use of PCA, which can be found in the MultivariateStats package; we first need to install and load it:

In[1]: Pkg.add("MultivariateStats")

In[2]: using MultivariateStats

We’ll omit the outputs for most of the commands in the examples of this chapter, showing only the cases where we need to comment on them. Once we have installed the MultivariateStats package, we’ll load the data from the corresponding .csv file:

In[3]: cd("d:\\data\\OnlineNewsPopularity")        #1

In[4]: X = readcsv("OnlineNewsPopularity.csv")[2:end,2:end]  #2

#1 you’ll need to update this to the corresponding folder on your computer

#2 get the numeric data only

Since the header info (first row) and the identifier attribute (first column) aren’t necessary for this example, we’ll load only the rest of the dataset. Now it’s time to convert the data to floats and split it into a training and a testing set. Although the splitting is not entirely necessary to run the PCA method, it’s the best way to perform the dimensionality reduction in practice. This is because the test set is not always part of the original dataset, often becoming available at later stages of the project, long after we have created our models.

For the purposes of this example we will convert the output variable into a binary one, to demonstrate how dimensionality reduction works for classification problems. This process, however, does not affect the PCA method at all, since PCA does not make use of the target variable in any way.

In[9]: N, n = size(X)
       n -= 1          #1
       I = Array(Float64, N, n)
       O = X[:, end]

       for j = 1:n
           for i = 1:N
               I[i,j] = Float64(X[i,j])
           end
       end
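The train/test split variables ind and z, as well as the binary target O_bin, are used in the next input but are created in code not shown here (O_bin is built explicitly in In[19] later in the chapter). Below is a minimal sketch of what that preparation might look like, assuming a random 80/20 split; the 80% figure is our assumption, inferred from the 31,715 training rows shown in Out[12] further down.

ind = randperm(N)         # hypothetical: a random ordering of the row indexes
z = round(Int, 0.8 * N)   # hypothetical: number of data points that go to the training set
O_bin = (O .>= 10000)     # binary target ("viral" articles), equivalent to what In[19] builds later on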

In[10]: Itr = I[ind[1:z],:]
        Ite = I[ind[(z+1):end],:]
        Otr = O_bin[ind[1:z]]          #2
        Ote = O_bin[ind[(z+1):end]]    #2

In[11]: M = fit(PCA, Itr'; maxoutdim = 10)    #3

Out[11]: PCA(indim = 59, outdim = 4, principalratio = 0.99367)

#1 remove the output column from the column count (so n is now the number of features)

#2 this will be useful later on

#3 limit the total number of variables in reduced feature set to 10

At this point we have the dimensionality reduction framework, based on the training data, stored in the object called M. Notice that in the output Julia provides a brief summary of this mathematical model. We’ve learned that indim, the number of dimensions that go into the model, is 59 (= n), while outdim, the number of dimensions that come out of the model, is just 4. This may appear strange since we asked for 10, but the method aims to yield the optimum number of variables within our parameters: 10 is merely the maximum number of variables that can come out of the model. Since Julia could cover the default target proportion of the variance (99%) with fewer components, the PCA framework created just four.

The principalratio variable indicates what proportion of the overall variance of the data is covered by the selected (four) principal components, which in this case is a bit over 99%. This means that less than 1% of the variance cannot be expressed by the reduced feature set, which is truly impressive. This type of variance is called residual variance. By adding more features to the PCA model we could limit the residual variance further, but remember the law of diminishing returns mentioned earlier: every additional principal component will be less and less useful, adding less and less to the principalratio figure.
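If you want to inspect how the explained variance is distributed across the individual components, the package provides a few accessor functions for the PCA model. The names below are the ones listed in the MultivariateStats documentation; treat this as a sketch and check the documentation of your package version.

principalvars(M)             # variance captured by each of the four principal components
tprincipalvar(M) / tvar(M)   # proportion of the total variance captured (the principalratio above)
tresidualvar(M)              # the residual variance left out of the reduced feature set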

We have managed to reduce the feature set by (59 - 4) / 59 ≈ 93%, in terms of features, with minimal loss of information (as expressed in the overall variance). Let’s see now how we can apply this framework to the data we need to work with in the later parts of the pipeline:

In[12]: Jtr = transform(M, Itr')'

Out[12]: 31715x4 Array{Float64,2}:
         -1.1484e5   24619.6    -675.339  -29135.1
          95645.0   -57635.1    8347.65    16974.2
         -53198.3   -93375.1   36341.9     43627.4
         ⋮

Here we’ve taken the feature set of the original training data and applied the PCA model using the transform() function of this package. This yields a matrix consisting of four features, as expected. We had to transpose the feature matrix before applying the transform() function, since it expects the data to be formatted a bit differently. We can do the same with the feature set of the testing data as follows:

In[13]: Jte = transform(M, Ite')'

And there you have it. The dataset has been quickly and painlessly reduced in dimensionality. Of course the new features are a bit counter-intuitive, but rest assured that most of the information contained in the original features is still there. You can learn more about the intricacies of the PCA-related objects in the package’s documentation, which is well-maintained: http://bit.ly/28OuXEk.
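Since PCA lends itself to reconstructing the original feature set (as mentioned earlier), the package also offers a reconstruct() function that maps the meta-features back to the original space. A quick sketch of how you could gauge the information loss this way (the variable name Itr_rec is ours):

Itr_rec = reconstruct(M, Jtr')'    # map the 4 meta-features back to the 59 original features
mean(abs(Itr - Itr_rec))           # average absolute reconstruction error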

Independent Component Analysis (ICA): the most popular alternative to PCA

Although PCA is great if you need to reconstruct the original feature set from the reduced one, we rarely have the need to do this in most data science applications. Usually it’s more valuable to have independent features, particularly if we are using certain statistical models that have feature independence as an assumption. One way to accomplish a dimensionality reduction that allows for independence is through the ICA method.

At the core of the ICA technique lies the mutual information metric, a handy measure of how much information two features share. It is a rough estimate of how much value (measured in bits) knowing one feature adds to our knowledge of another. Features that share little mutual information are free of redundancy, so collectively they are as informative as possible.

ICA uses this metric to guide its outputs: it tries to create feature combos (meta-features) that share as little mutual information with each other as possible, so that together they carry the maximum amount of distinct information. This results in statistically independent features that are also non-Gaussian in nature. ICA is a theoretically sound approach to dimensionality reduction and has gained a lot of traction in the past decade, so it’s definitely worth having in your toolbox.
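If you want to get a feel for the metric itself, here is a minimal sketch of mutual information (in bits) between two discrete vectors, estimated from their empirical frequencies. The function is our own illustration, not part of any package, and it is not how ICA computes things internally.

function mutual_information(x::Array, y::Array)
    N = length(x)
    mi = 0.0
    for xv in unique(x), yv in unique(y)
        pxy = sum((x .== xv) .* (y .== yv)) / N   # joint probability of this pair of values
        px = sum(x .== xv) / N                    # marginal probability of xv
        py = sum(y .== yv) / N                    # marginal probability of yv
        if pxy > 0
            mi += pxy * log2(pxy / (px * py))     # contribution of this pair of values
        end
    end
    return mi
end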

As for implementation, the same package that supports PCA also provides tools for applying the ICA method to your data. The use of the main functions is identical to that of the corresponding functions in PCA; you still build the model using fit() and apply it using transform(). You just need to use ICA instead of PCA as the first parameter of fit(), in order to create the ICA model. You can learn more about ICA from the documentation webpage of the MultivariateStats package: http://bit.ly/29PLfdT.
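For instance, building an ICA model with (say) ten independent components from the same training data would look roughly like the following. The number of components is a required argument of fit() for ICA; treat this as a sketch and consult the package documentation for the available keyword options.

M_ica = fit(ICA, Itr', 10)           # 10 = number of independent components to extract
Jtr_ica = transform(M_ica, Itr')'    # apply the model to the training data, as with PCA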

Feature Evaluation and Selection

The PCA and ICA methods fall under the umbrella of unsupervised approaches, as they don’t require a target variable in order to function. This has its advantages (e.g. they can work on all kinds of datasets and can add a lot to data exploration), but if there is additional information present, it makes sense to take it into account when reducing dimensionality. To accomplish this, we need to perform feature evaluation as a preliminary stage of the dimensionality reduction process, and then make a selection of the most promising features (or extracted features). These are often referred to as supervised approaches.

Overview of the methodology

In the majority of data science projects, the objective is to predict a particular variable (the target). It logically follows that the more closely a feature is related to that variable, the more useful it will be for the models we create. Let’s say you’re trying to evaluate the effectiveness of a health diagnostic method that aims to predict whether a patient is healthy or not. You probably wouldn’t use that person’s favorite TV show as a feature, but you might use the number of hours of TV they watch per day.

The feature evaluation and selection approach to dimensionality reduction ensures that only relevant features make the cut, and that all noisy ones are removed. So, in this example, we would compare all the features we have against the target variable and keep the ones that relate most closely to it, while discarding all the ones that are statistically independent of it.

There are various methods to evaluate a feature, as we have briefly seen in the previous chapter. We can use one of the following:

  • Fisher’s Discriminant Ratio (FDR)
  • Index of Discernibility (latest open-source version: DID)
  • Similarity metrics, such as:
      • Correlation
      • Cosine similarity
      • Jaccard and other binary similarity metrics.

Although uncommon, it is possible to use combinations of these metrics for a more accurate evaluation of the features at hand. The method(s) you choose will depend on the data available. Specifically, depending on your features and your target variable, you might use the following:

  • Features are floats, target is a float: cosine similarity or some other non-binary similarity metric, such as Pearson’s Correlation
  • Features are floats, target is a discrete variable: DID, FDR, or Mutual Information
  • Features are binary, target is a float: this combination isn’t possible! Your problem must be incorrectly modeled.
  • Features are binary, target is a discrete variable: Jaccard similarity, binary similarity, or a symmetric variant of either of these.

Once you have decided which metric to use for evaluation, you need to set a threshold beyond which a feature will be accepted in the reduced feature set. This is usually arbitrary and depends on the problem at hand. Naturally, the higher the threshold, the smaller the reduced feature set will be. Alternatively, you can set the number of features you wish to have in the reduced feature set (let’s call this number K). In this case you need to sort all the features based on the evaluation metric and select those with the K highest values.
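To make the top-K variant concrete, here is a minimal sketch in Julia, assuming the evaluation scores of the features are stored in a vector V (as they will be in the examples that follow) and that we take absolute values because similarity scores can be negative:

K = 20                                              # number of features to keep (arbitrary)
best_features = sortperm(abs(V), rev = true)[1:K]   # indexes of the K highest-scoring features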

Using Julia for feature evaluation and selection with cosine similarity

Now let’s take our OnlineNewsPopularity dataset once more and see how we can perform dimensionality reduction using the feature evaluation and selection approach. In particular, we’ll examine how we can harness the information of the continuous target variable (float type) in how we assess the individual features.

First, we’ll need to define the cosine similarity function that we need for assessing the features:

In[14]: function cossim(x::Array, y::Array)
            # Cosine Similarity Function
            nx = norm(x)
            ny = norm(y)

            if (nx == 0) | (ny == 0)
                return NaN
            else
                return dot(x,y) / (nx * ny)
            end
        end

This is a quick and dirty way of building the metric, but it works fine for the purposes of this example and shows how Julia can be flexible, when needed. If we wanted to take a few extra steps, we could build a more error-proof version of the function. This would involve ensuring that the metric would be applied only to data that would make sense to use (e.g. all numeric vectors). You can improve this method so that Julia will reject irrelevant inputs (e.g. arrays of characters) as a practice exercise.

Now that our evaluation method is in place, we can use it to assess the features in our dataset. First, let’s initialize the vector containing the feature scores in relation to the target variable:

In[15]: V = Array(Float64, n)

Some people prefer to initialize a vector with zeros, using the zeros() function, but Array() is faster and makes more sense in this case, as we go on to populate every single one of the elements of V:

In[16]: for i = 1:n
            V[i] = cossim(Itr[:,i], Otr)
        end

After a bit of waiting, we should obtain a populated vector containing the evaluation of each of our 59 features. We can preview the first few of these numbers and calculate the average cosine similarity of the whole dataset as follows:

In[17]: abs(V[1:5]), mean(abs(V))

Out[17]: ([0.2052250978962261,0.2325465827311343,0.17498156181682142,0.03124251651392406,0.03769500008408716],0.14725725376572263)

Since cosine similarity (like many other similarity measures) takes both positive and negative values, it makes sense to use the absolute value (the abs() function). A feature with a strongly negative similarity is still valuable, since it carries a signal, just as in the case of correlation.

Based on the above output, the cosine similarities of the features are low, with an average value of approximately 0.15. This warns us that predicting the popularity of a news article given this data will not be a trivial problem; we will need to make our threshold relatively low, say 0.2:

In[17]: th = 0.2

   feat_ind = (abs(V) .>= th)

   sum(feat_ind)

Out[17]: 20

In this case we find 20 features with a cosine similarity of at least 0.2. Should we need fewer or more, we can adjust the threshold accordingly. To find out exactly which features made the cut, we can use the find() command:

In[18]: show(find(feat_ind))

   [1,2,7,11,12,23,24,26,27,42,44,46,47,48,49,50,52,53,54,58]
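To actually apply the selection, you only need to index the feature matrices with feat_ind; for example (the _red names are just illustrative):

Itr_red = Itr[:, feat_ind]    # reduced training feature set (20 features)
Ite_red = Ite[:, feat_ind]    # reduced testing feature set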

The feature selection method is bound to yield somewhat different results when you run it, since the random subset of data points used will be different. Of course, you could use all the data points, but that would amount to cheating: information from the test set would leak into the feature selection, making performance look good even if the model is bad.

Using Julia for feature evaluation and selection with DID

Now let’s try performing feature evaluation and selection for a dataset where the target variable is discrete (i.e. binary or nominal). In this example we will make use of the Distance-based Index of Discernibility method (DID for short), using a binary target variable based on the number of shares.

The DID metric is recommended for various reasons. First of all, it is distribution agnostic and therefore applicable to all kinds of data. It’s also fast to calculate. Finally, it’s intuitive and easy to interpret. Incidentally, transforming a continuous variable (here, the target) into a binary one, or some other kind of discrete variable, is commonplace in data science; it is one of the ways a versatile data scientist examines the data from various angles.

We can arbitrarily define this binary variable as being “true” when shares are equal to or greater than 10,000, and “false” otherwise. One interpretation of this variable would be the articles’ viral status. Considering that these are news articles, even 10,000 shares could be enough to consider them viral. Let’s put Julia to work to make this happen:

In[19]: viral = (O .>= 10000)   # cases of "viral" articles (a separate name, so we don’t overwrite the index permutation ind)
        O_bin = falses(N)       # initialize binary Array
        O_bin[viral] = true

Next, we can separate the now binary outputs into training and testing, using the same randomized sampling we did previously (by employing the same index array ind and split point z as before):

In[20]: Otr = Array(O_bin[ind[1:z]])
        Ote = Array(O_bin[ind[(z+1):end]])

Now, we need to load the essential functions in order to apply the DID metric:

In[21]: include("C:\\users\\Zacharias\\Documents\\Julia\\DID.jl")

Be sure to adjust the path before you attempt to load the DID script. Finally, we can evaluate all the features individually, using the same logic as before:

In[22]: for i = 1:n
            V[i] = DID(Itr[:,i], Otr)[1]
        end

If you look closely at the code of the DID.jl script, you’ll notice that DID() requires three inputs, the last one being the sample size. If you are not too particular, just use the default value of 10,000 data points, which is plenty for most datasets and quite robust. However, because a different sample is used every time you run the metric, you can expect some variance in its output (which shouldn’t affect the results much). Also note that DID() yields two outputs: the overall score (a Float64 variable) and the intra-class discernibility matrix, showing how discernible each pair of classes is (a Float64 matrix). Although insightful, the second output is not required for most applications, which is why we have the [1] index at the end of the line where the DID() function is used.

Like we did in the cosine similarity example, let’s take a look at what these feature evaluations look like:

In[23]: V[1:5], mean(V)

Out[23]: ([0.5068131108916679,0.22245264761577588,0.04160899480072447,0.16882573283040755,0.0754911973187614],0.19675244105321787)

As DID scores are by definition between 0 and 1, there is no need to apply the abs() function on them. It seems that the features are not particularly powerful for predicting whether a news article becomes viral; generally, DID scores below 0.5 are not great for two-class problems. Therefore, if we are to perform some meaningful dimensionality reduction, we need to adjust our expectations accordingly. Let’s try using a threshold value of 0.35:

In[24]: th = 0.35

    feat_ind = (V .>= th)

    sum(feat_ind)

Out[24]: 10

We’ve found ten features that appear promising. Let’s look at which indexes they correspond to:

In[25]: show(find(feat_ind)) # features in reduced feature set

    [1,14,25,31,32,33,42,54,56,58]

The two methods yield a different list of selected features, even though we applied them to the exact same subset of the dataset. The reason for this is twofold. First, different metrics provide different evaluations of the features. Second, the target variables in the two cases are inherently different (one is a float and the other is a Boolean), capturing different aspects of the information in the original dataset.

Pros and cons of the feature evaluation and selection approach

This set of methods, although widely used by the most knowledgeable data scientists, is not all fun and games. Whether it is worthwhile for your data is not always obvious. This decision is made easier by carefully considering the various advantages and disadvantages of this methodology, which are shown in Table 8.1.

Pros:

  • Speed
  • Very few parameters
  • Easily scalable
  • Interpretability

Cons:

  • Doesn’t take into account dependencies among the features of the dataset
  • No mathematical proof behind it
  • Requires target variable

Table 8.1. Pros and cons of the feature evaluation and selection approach to dimensionality reduction.

Other Dimensionality Reduction Techniques

There are a few other options for dimensionality reduction that are worth considering. Two of the most important ones are the following:

  • Genetic algorithms
  • Discernibility-based approach.

Both of these methods are sophisticated and often computationally expensive. However, this is the price to pay for building a robust reduced feature set that closely approximates the optimum. (Although theoretically possible, in practice it is unrealistic to obtain the optimum selection of features.)

Overview of the alternative dimensionality reduction methods

We’ll briefly discuss two of the most powerful and sophisticated dimensionality reduction methods: those based on genetic algorithms and those based on discernibility. Both of these approaches make use of groups of features at a time, aiming to capture the information contained in the relationships among the various features of the dataset. As these are more advanced topics, we will not go into much detail, but we will provide you with references where you can learn more.

Genetic algorithms

The genetic algorithm (GA) approach has been around for many years and is one of the most popular families of AI algorithms founded in biomimicry. In summary, GA involves a series of optimization algorithms that work on discrete variables. So, if you want to optimize a bunch of nominal parameters of a system and you have a continuous measure of how good each configuration is (referred to as the “fitness function”), GA is a promising approach. Since the dimensionality reduction problem (when expressed as a feature selection task) is inherently a discrete optimization problem, the GA-based approach makes perfect sense.
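To make this more concrete, here is a minimal, self-contained sketch of a GA for feature selection. It does not use the GeneticAlgorithms package discussed next; the function and parameter names are ours, and evaluate() stands for whatever fitness function you choose (e.g. the cross-validated performance of a model trained on the selected features).

function ga_feature_selection(evaluate::Function, n::Int;
                              pop_size::Int = 30, generations::Int = 50,
                              mutation_rate::Float64 = 0.05)
    # each chromosome is a Boolean mask over the n available features
    population = [rand(n) .< 0.5 for _ in 1:pop_size]
    best_mask, best_score = population[1], -Inf

    for g = 1:generations
        scores = [evaluate(m) for m in population]

        # keep track of the best chromosome seen so far
        top = indmax(scores)             # use argmax(scores) on recent Julia versions
        if scores[top] > best_score
            best_score, best_mask = scores[top], copy(population[top])
        end

        # tournament selection, one-point crossover, and mutation
        new_population = similar(population)
        for i = 1:pop_size
            a, b = rand(1:pop_size), rand(1:pop_size)
            p1 = scores[a] >= scores[b] ? population[a] : population[b]
            c, d = rand(1:pop_size), rand(1:pop_size)
            p2 = scores[c] >= scores[d] ? population[c] : population[d]
            cut = rand(1:n)
            child = vcat(p1[1:cut], p2[(cut+1):end])
            for j = 1:n                  # random bit flips (mutation)
                if rand() < mutation_rate
                    child[j] = !child[j]
                end
            end
            new_population[i] = child
        end
        population = new_population
    end
    return best_mask, best_score
end

For the feature selection use case, n would be the number of features in your dataset, and the returned Boolean mask would play the same role as the feat_ind variable did in the earlier examples.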

Your best option for working with GA is the GeneticAlgorithms package (http://bit.ly/29lG0l2). Using this package, your abcde array (the chromosome in the package’s examples) would be a Boolean vector indicating which of the available features are selected. As for the fitness function, you can use any similarity metric you wish that can take a set of features as an input. You can define the details of this in the function called fitness().

GA is a specialized AI topic which requires a bit of studying if you wish to understand and implement it. A great place to start is on Professor Obitko’s site: http://bit.ly/29p3Tt9.

Discernibility-based approach

Unlike GA (or any other sophisticated approach out there), the discernibility-based method is straightforward and doesn’t require any background whatsoever, while its parameters are few and self-explanatory.

Unlike the individual feature evaluation use of the DID metric, the strategy discussed in this section involves the evaluation of several features at once–a trait most feature evaluation metrics don’t share. The use of the metric in this setting goes as follows:

  1. Evaluate the whole feature set.
  2. Establish one of the following termination parameters:
     a. the proportion of the discernibility score that you want to maintain in the reduced feature set, OR
     b. the total number of features.
  3. Apply a search strategy to optimize the reduced feature set’s DID (a sketch of strategy a follows this list), such as:
     a. start from the most promising feature and build up the feature set by adding and removing features, OR
     b. start from the whole feature set and remove the most needless features.
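As a rough illustration of search strategy (a) above, here is a simplified, forward-selection-only sketch. It assumes that the DID() function from the DID.jl script can accept a matrix of several features as its first argument (an assumption on our part; the earlier examples only passed it single features), and the function name forward_select_did is ours.

function forward_select_did(I::Matrix{Float64}, O, K::Int)
    selected = Int[]                  # indexes of the chosen features
    remaining = collect(1:size(I, 2))
    while length(selected) < K
        best_f, best_score = 0, -Inf
        for f in remaining
            candidate = I[:, sort([selected; f])]   # evaluate the group as a whole (assumes DID accepts a matrix)
            score = DID(candidate, O)[1]
            if score > best_score
                best_score, best_f = score, f
            end
        end
        push!(selected, best_f)
        filter!(x -> x != best_f, remaining)
    end
    return sort(selected)
end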

Should you wish to explore this approach in more detail, you can read up on the topic in Chapter 5 of the material found at http://bit.ly/28O5qMY. Although the discernibility concept and its original implementations predate the DID metric, they all follow the same framework. Therefore, all feature evaluation methods that applied to the SID and HID metrics will also apply to DID and any other discernibility metrics.

When to use a sophisticated dimensionality reduction method

It is clear that no dimensionality reduction approach is a panacea, since they each have a different use case. This is why it is crucial that you become familiar with all of them. Overall, the sophisticated methods we described are good for complex cases where there is substantial non-linearity. GAs are particularly good for very large feature sets, but their plethora of parameters may intimidate some data scientists. If you have the time to experiment with various settings and understand the dataset well enough, go ahead and try this approach. If, however, you just want to get a working solution to the problem without becoming a digital geneticist of sorts, the discernibility-based approach would be a fine alternative.

The methods described here are just a few of the most robust ones out there. There are several other worthy approaches available, so we recommend you remain on the lookout for alternatives. Even if you never use any of the sophisticated dimensionality reduction methods, it is good to know that they exist and that if your default approaches fail, there is something else you can use.

Summary

  • Dimensionality reduction is often an essential part of data science, as it condenses the dataset at hand, making efficient use of data analysis methods. As a bonus, the reduced dataset takes up less storage space and fewer resources in general.
  • There are two main approaches to dimensionality reduction: using only the features (unsupervised), and using the features in combination with the target variable (supervised).

Unsupervised dimensionality reduction methods:

  • The most popular unsupervised dimensionality reduction technique is principal components analysis (PCA). This statistical method works well with a small to moderate number of features. It maximizes the variance of the dataset explained by the extracted meta-features.
  • Independent component analysis (ICA) is a worthy alternative to PCA; similarly, it does not utilize the target variable. ICA aims for meta-features that share as little mutual information with each other as possible (making them statistically independent), instead of maximizing the explained variance.

Supervised dimensionality reduction methods:

  • The supervised dimensionality reduction methods can be further categorized into basic (involving individual evaluation of features) and sophisticated (involving evaluation of features in groups).
  • Basic methods evaluate each feature, depending on the type of the target variable, and select the ones with the highest scores:
      • Continuous target: Cosine Similarity, Pearson Correlation, or some other similarity metric
      • Discrete target: Fisher Discriminant Ratio, Index of Discernibility, or Mutual Information
  • Sophisticated methods evaluate groups of features, again depending on the type of the target variable:
      • Continuous target: Genetic Algorithms (GA) based approach
      • Discrete target: Index of Discernibility or GA based approach
  • There is no silver bullet when it comes to the dimensionality reduction method to use. Your choice depends on your data and the available resources.

Chapter Challenge

  1. You are given a dataset consisting of one million rows and ten thousand features. Would you perform dimensionality reduction? If so, what method would you use? Why?
  2. A project you are working on involves a dataset of one million rows and 500 features. Would you perform dimensionality reduction? If so, what method would you use? Why?
  3. After a lot of refining of your original gene expression data, you have created a dataset consisting of 200 rows and five thousand features. What dimensionality reduction would you use on this dataset? Why?
  4. You are given a dataset consisting of ten thousand rows and 20 features. Would you use dimensionality reduction on this dataset? Why?
  5. A statistics researcher has developed a dimensionality reduction method that guarantees to find the best possible reduced feature set, by examining all possible combinations. Would you use this method for a data science project? Why?
  6. What is the main advantage of feature evaluation based approaches to dimensionality reduction, over the popular statistical ones (e.g. PCA)?
  7. A brilliant researcher has developed a method that can process data with high efficiency and performance, as long as the inputs are statistically independent of each other. Which dimensionality reduction method would you use to turn your dataset into a manageable size? Why?
  8. You are given a dataset consisting of one million features and 100 million rows. Many of the features are correlated to each other. You have plenty of time to come up with an insight about the dataset, and the objective is to have the highest accuracy possible in the model that runs the reduced dataset. Which approach would you take? Why?