Dimensionality reduction

In linear algebra terms, the features of a dataset create a vector space whose dimensionality corresponds to the number of linearly independent columns (assuming there are more observations than features). Two columns are linearly dependent when they are perfectly correlated, so that one can be computed from the other using only the linear operations of addition and multiplication by a scalar.

In other words, they are parallel vectors that represent the same direction or axis rather than different ones, and so they constitute only a single dimension. Similarly, if one variable is a linear combination of several others, then it is an element of the vector space created by those columns and does not add a new dimension of its own.
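To make this concrete, the following NumPy sketch (an illustration, not an example from the text) appends a redundant column to a random matrix and checks its rank:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
X = rng.normal(size=(100, 3))            # 3 independent features, 100 observations

# Append a fourth column that is a linear combination of the first two
X = np.column_stack([X, 2 * X[:, 0] - 0.5 * X[:, 1]])

print(X.shape)                           # (100, 4): four columns ...
print(np.linalg.matrix_rank(X))          # ... but rank 3: only three dimensions
```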

The number of dimensions of a dataset matters because each new dimension can add signal about an outcome. However, there is also a downside known as the curse of dimensionality: as the number of independent features grows while the number of observations remains constant, the average distance between data points also grows, and the density of the feature space drops exponentially.
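The following sketch illustrates this effect under a simple assumption: a fixed number of points drawn uniformly from the unit hypercube, with pairwise distances computed using SciPy's pdist. The average distance grows steadily with the number of dimensions:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(seed=42)
n_points = 500                           # number of observations stays constant

for n_dims in (2, 10, 100, 1000):
    points = rng.uniform(size=(n_points, n_dims))
    print(f'{n_dims:>5} dims: mean pairwise distance = {pdist(points).mean():.2f}')
```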

The implications for machine learning are dramatic because prediction becomes much harder when observations are more distant and, hence, more different from each other. The next section addresses the resulting challenges.

Dimensionality reduction seeks to represent the information in the data more efficiently by using fewer features. To this end, algorithms either project the data to a lower-dimensional space while discarding uninformative variation, or identify a lower-dimensional subspace or manifold on or near which the data lives.

A manifold is a space that locally resembles Euclidean space. One-dimensional manifolds include lines and circles (but not figures of eight, due to the crossing point). The manifold hypothesis maintains that high-dimensional data often resides in a lower-dimensional space that, if identified, permits a faithful representation of the data in this subspace.
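As an illustration of the manifold hypothesis, the following sketch uses scikit-learn's Swiss roll dataset, a two-dimensional surface rolled up in three dimensions, and locally linear embedding (one of several possible manifold learners) to recover a two-dimensional representation:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1000, random_state=42)
print(X.shape)                                    # (1000, 3): the ambient space

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=20, random_state=42)
X_unrolled = lle.fit_transform(X)
print(X_unrolled.shape)                           # (1000, 2): the underlying manifold
```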

Dimensionality reduction thus compresses the data by finding a different, smaller set of variables that capture what matters most in the original features to minimize the loss of information. Compression helps counter the curse of dimensionality, economizes on memory, and permits the visualization of salient aspects of higher-dimensional data that is otherwise very difficult to explore.
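As a hypothetical example of such compression, the following sketch applies scikit-learn's PCA to the digits dataset and keeps only the components needed to explain 95 percent of the variance (an arbitrary threshold chosen for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                       # 1,797 images with 64 pixel features
pca = PCA(n_components=0.95)                 # keep components explaining 95% of variance
X_compressed = pca.fit_transform(X)

print(X.shape, '->', X_compressed.shape)     # far fewer features retained
print(f'variance kept: {pca.explained_variance_ratio_.sum():.1%}')
```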
