The curse of dimensionality

An increase in the number of dimensions of a dataset means that there are more entries in the vector of features that represents each observation in the corresponding Euclidean space. We measure distances in a vector space using the Euclidean distance, also known as the L2 norm, which we previously applied to the vector of linear regression coefficients to train a regularized Ridge Regression model.

The Euclidean distance between two n-dimensional vectors with Cartesian coordinates p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) is computed using the familiar formula developed by Pythagoras:

d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

Hence, each new dimension adds a non-negative term to the sum, so that the distance increases with the number of dimensions for distinct vectors. In other words, as the number of features grows for a given number of observations, the feature space becomes increasingly sparse; that is, less dense or emptier. On the flip side, the lower data density requires more observations to keep the average distance between data points the same.
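This effect is easy to verify numerically. The following minimal sketch (illustrative only, not code from the book's repository) draws two random points and shows that their Euclidean distance never decreases as we take more of their coordinates into account:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two random points in up to 100 dimensions; we measure their distance
# using only the first d coordinates, for growing d.
p, q = rng.uniform(size=100), rng.uniform(size=100)
distances = [np.linalg.norm(p[:d] - q[:d]) for d in range(1, 101)]

# Each added coordinate contributes a non-negative squared difference,
# so the distance is non-decreasing in the number of dimensions.
assert all(d1 <= d2 for d1, d2 in zip(distances, distances[1:]))
print(f'distance in 1 dimension: {distances[0]:.2f}, in 100 dimensions: {distances[-1]:.2f}')
```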

The following chart shows how many data points we need to maintain the average distance among 10 observations uniformly distributed on a line. The number increases exponentially from 10^1 in a single dimension to 10^2 in two and 10^3 in three dimensions, as the data needs to expand by a factor of 10 each time we add a new dimension.
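The arithmetic behind the chart is straightforward to reproduce; the short sketch below (not the book's code) prints the 10^d data points required to keep the spacing constant in d dimensions:

```python
# Points needed to keep the spacing of 10 uniformly distributed points per axis
# constant as we add dimensions: 10 per axis, that is, 10**d in d dimensions.
points_per_axis = 10
for d in (1, 2, 3):
    print(f'{d} dimension(s): {points_per_axis ** d:,} data points')
```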

The curse_of_dimensionality notebook in the GitHub repo folder for this section simulates how the average and minimum distances between data points increase as the number of dimensions grows:

The simulation draws features in the range [0, 1] from uncorrelated uniform or correlated normal distributions, and gradually increases the number of features to 2,500. The average distance between data points increases to over 11 times the feature range for features drawn from the normal distribution, and to over 20 times in the (extreme) case of uncorrelated uniform distribution.
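A minimal version of such a simulation might look as follows. This is an illustrative sketch rather than the notebook's code, and the sample size of 500 observations is an arbitrary choice:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
n_points = 500  # arbitrary sample size for illustration

# Uncorrelated uniform features in [0, 1]: the average and minimum pairwise
# Euclidean distances grow with the number of features.
for n_features in (2, 10, 100, 1000, 2500):
    X = rng.uniform(size=(n_points, n_features))
    distances = pdist(X)  # condensed vector of pairwise Euclidean distances
    print(f'{n_features:>5} features | mean: {distances.mean():5.1f} | min: {distances.min():5.1f}')
```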

When the distance between observations grows, supervised machine learning becomes more difficult because predictions for new samples are less likely to be based on learning from similar training features. Put differently, the number of possible unique rows grows exponentially as the number of features increases, which makes it much harder to sample the space efficiently.

Similarly, the complexity of the functions learned by flexible algorithms that make fewer assumptions about the actual relationship grows exponentially with the number of dimensions.

Flexible algorithms include the tree-based models we saw in Chapter 10, Decision Trees and Random Forests, and Chapter 11, Gradient Boosting Machines, and the deep neural networks that we will cover from Chapter 17, Deep Learning, onward. The variance of these algorithms increases as they get more opportunity to overfit to noise in more dimensions, resulting in poor generalization performance.

In practice, features are correlated, often substantially so, or do not exhibit much variation. For these reasons, dimensionality reduction helps to compress the data without losing much of the signal and to combat the curse while also economizing on memory. In these cases, it complements the use of regularization to manage prediction error due to variance and model complexity.
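As a rough illustration of how much correlated features can be compressed, the sketch below applies principal component analysis as implemented in scikit-learn to synthetic, highly correlated data, keeping only as many components as needed to explain 99 percent of the variance. The data-generating choices (1,000 observations, 100 features driven by five latent signals) are assumptions made for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic data: 100 observed features that are noisy mixtures of only
# five latent signals, so the features are highly correlated.
latent = rng.normal(size=(1000, 5))
X = latent @ rng.normal(size=(5, 100)) + 0.1 * rng.normal(size=(1000, 100))

# Keep the smallest number of principal components that explains 99% of the variance.
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X)
print(f'compressed {X.shape[1]} features into {X_reduced.shape[1]} components')
```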

The critical question that we take on in the following section then becomes: what are the best ways to find a lower-dimensional representation of the data that loses as little information as possible?
