ranging from web search engines and social network analysis to the analysis of astronomical
and biological data.
Importantly, the structural and noise properties of matrices that arise in machine
learning and data analysis applications are typically very different from those of matrices
that arise in scientific computing and NLA. This has led researchers to revisit traditional
problems in light of new requirements and to consider novel algorithmic approaches to
many traditional matrix problems. One of the more remarkable trends in recent years is
a new paradigm that arose in theoretical computer science (TCS) and that involves the
use of randomization as a computational resource for the design and analysis of algorithms
for fundamental matrix problems. Randomized NLA (RandNLA) is the interdisciplinary
research area that exploits randomization as a computational resource to develop improved
algorithms for large-scale linear algebra problems, for example, matrix multiplication,
linear regression, and low-rank matrix approximation [49]. In this chapter, we will discuss
RandNLA, highlighting how many of the most interesting RandNLA developments for
problems related to improved low-rank matrix approximation boil down to exploiting a
particular structural property of Euclidean vector spaces. This structural property is of
interest in and of itself (for researchers interested in linear algebra per se), but it is also of
interest more generally (for researchers interested in using linear algebra) since it highlights
strong connections between algorithms for many seemingly unrelated matrix problems.
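To fix ideas before proceeding, the following minimal NumPy sketch illustrates the basic randomized range-finder approach to low-rank approximation. The function name, the Gaussian test matrix, and the oversampling parameter are illustrative choices on our part, not a prescription from the literature cited below:

```python
import numpy as np

def randomized_low_rank(A, k, p=10, rng=None):
    """Illustrative rank-k approximation of A via a random range finder.

    Multiply A by a random Gaussian test matrix, orthonormalize the
    result to capture A's dominant column space, then solve a small
    dense problem. The oversampling parameter p is a common heuristic.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))    # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)             # orthonormal basis for the sampled range
    B = Q.T @ A                                # small (k+p) x n problem
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_b[:, :k]
    return U, s[:k], Vt[:k, :]                 # A is approximately U @ diag(s) @ Vt
```

Note that the only randomness enters through the test matrix Omega; everything downstream is standard deterministic dense linear algebra on a much smaller problem.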
9.2 Overview
As background, we note that early work in RandNLA focused on low-rank approximation
problems and led to results that were primarily of theoretical interest in idealized models of
data access [23–25,34,57,63]. An overview of RandNLA for readers not familiar with the area
has recently been provided [49]. Subsequent work on very overdetermined linear regression
problems, for example, least-squares regression problems with an input matrix A ∈ R^{m×n}
with m ≫ n, led to several remarkable successes for RandNLA: theoretical results for worst-
case inputs for the running time in the RAM model that improve upon the 200-year-old
Gaussian elimination [17,26,31,53,56]; high-quality implementations that are competitive
with or better than traditional deterministic implementations, for example, as provided by
LAPACK, on a single machine [2,18,62]; and high-quality implementations in parallel and
distributed environments on up to terabyte-sized input matrices [16,54,55,70,71].
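As a concrete, deliberately simplified illustration of the sketch-and-solve paradigm behind these regression results, the following NumPy sketch applies a dense Gaussian subspace embedding to a very overdetermined problem and solves the resulting smaller problem. The sketch-size heuristic is ours, and the implementations cited above use structured or sparse sketches rather than a dense Gaussian one:

```python
import numpy as np

def sketched_least_squares(A, b, sketch_rows=None, rng=None):
    """Approximate argmin_x ||Ax - b|| for tall A (m >> n) by sketch-and-solve.

    A dense Gaussian sketch S is the simplest data-oblivious subspace
    embedding; the sketch size below is an illustrative heuristic.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    m, n = A.shape
    if sketch_rows is None:
        sketch_rows = max(4 * n, 50)           # illustrative oversampling
    S = rng.standard_normal((sketch_rows, m)) / np.sqrt(sketch_rows)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x
```

With a dense Gaussian sketch, forming S @ A itself costs O(m n · sketch_rows), so this sketch conveys the idea rather than the stated speedups; fast transforms and sparse embeddings are what reduce that cost in practice.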
As has been described in detail, for example, in [49], both the more theoretical and
the more applied successes of RandNLA for these very overdetermined linear regression
problems were achieved by using, implicitly or explicitly, the so-called statistical leverage
scores of the tall input matrix A.∗ In some cases, the use of leverage scores was explicit, in
that one used exact or approximate leverage scores to construct a nonuniform importance
sampling probability distribution with respect to which to sample rows from the input
matrix, thereby constructing a data-aware subspace embedding [26,28]. In other cases, the
use of leverage scores was implicit, in that one performed a random projection, thereby
implementing a data-oblivious subspace embedding [31,68].† In both cases, the improved
∗ The statistical leverage scores of a tall matrix A ∈ R^{m×n} with m ≫ n are equal to the diagonal elements
of the projection matrix onto the column span of A [14,48,50]. Thus, they capture a subtle but important
structural property of the Euclidean vector space from which the data were drawn.
† Random projections can be applied in more general metric spaces, but in a Euclidean vector space, a
random projection essentially amounts to rotating to a random basis, where the leverage scores are uniform
and thus where uniform sampling can be applied [49].
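To make the footnoted definition concrete, the following minimal NumPy sketch computes exact leverage scores via a thin QR factorization and uses them to form the kind of data-aware row-sampling embedding described above. All function names here are ours; computing exact scores this way costs as much as solving the original least-squares problem directly, which is why the fast approximation algorithms of [26,28] matter in practice:

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores of a tall matrix A (m >> n).

    Per the footnoted definition, the i-th score is the i-th diagonal
    entry of the projection onto A's column span, i.e., the squared
    norm of the i-th row of any orthonormal basis Q for that span.
    """
    Q, _ = np.linalg.qr(A)                     # thin QR: Q is m x n with orthonormal columns
    return np.sum(Q**2, axis=1)                # diagonal of Q @ Q.T

def leverage_sample(A, b, s, rng=None):
    """Data-aware embedding: sample s rows with probability proportional to leverage."""
    rng = np.random.default_rng(0) if rng is None else rng
    ell = leverage_scores(A)
    p = ell / ell.sum()                        # importance sampling distribution
    idx = rng.choice(A.shape[0], size=s, replace=True, p=p)
    w = 1.0 / np.sqrt(s * p[idx])              # rescale to keep the estimator unbiased
    return w[:, None] * A[idx], w * b[idx]     # sketched (A, b)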