9
Structural Properties Underlying High-Quality
Randomized Numerical Linear Algebra Algorithms
Michael W. Mahoney and Petros Drineas
CONTENTS
9.1 Introduction
9.2 Overview
9.3 Our Main Technical Result
    9.3.1 Statement of the Main Technical Result
    9.3.2 Popular Special Case
    9.3.3 Proof of the Main Technical Result
9.4 Applications of Our Main Technical Result
    9.4.1 CSSP: Theory
    9.4.2 CSSP: Data Analysis and Machine Learning
    9.4.3 Random Projections for Low-Rank Matrix Approximations
    9.4.4 Improved Results for Nyström-Based Machine Learning
9.5 Conclusions
Acknowledgments
References
9.1 Introduction
In recent years, the amount of data that has been generated and recorded has grown
enormously, and data are now seen to be at the heart of modern economic activity,
innovation, and growth. See, for example, the report by the McKinsey Global Institute [51],
which identifies ways in which Big Data have transformed the modern world, as well as
the report by the National Research Council [19], which discusses reasons for and technical
challenges in massive data analysis. In many cases, these so-called Big Data are modeled as
matrices, basically since an m×n matrix A provides a natural mathematical structure with
which to encode information about m objects, each of which is described by n features.
As a result, while linear algebra algorithms have been of interest for decades in areas
such as numerical linear algebra (NLA) and scientific computing, in recent years, there
has been renewed interest in developing matrix algorithms that are appropriate for the
analysis of large datasets that are represented in the form of matrices. For example, tools
such as the singular value decomposition (SVD) and the related principal components
analysis (PCA) [38] permit the low-rank approximation of a matrix, and they have
had a profound impact in diverse areas of science and engineering. They have also been
studied extensively in large-scale machine learning and data analysis applications, in settings
ranging from web search engines and social network analysis to the analysis of astronomical
and biological data.
Importantly, the structural and noise properties of matrices that arise in machine
learning and data analysis applications are typically very different than those of matrices
that arise in scientific computing and NLA. This has led researchers to revisit traditional
problems in light of new requirements and to consider novel algorithmic approaches to
many traditional matrix problems. One of the more remarkable trends in recent years is
a new paradigm that arose in theoretical computer science (TCS) and that involves the
use of randomization as a computational resource for the design and analysis of algorithms
for fundamental matrix problems. Randomized NLA (RandNLA) is the interdisciplinary
research area that exploits randomization as a computational resource to develop improved
algorithms for large-scale linear algebra problems, for example, matrix multiplication,
linear regression, and low-rank matrix approximation [49]. In this chapter, we will discuss
RandNLA, highlighting how many of the most interesting RandNLA developments for
problems related to improved low-rank matrix approximation boil down to exploiting a
particular structural property of Euclidean vector spaces. This structural property is of
interest in and of itself (for researchers interested in linear algebra per se), but it is also of
interest more generally (for researchers interested in using linear algebra) since it highlights
strong connections between algorithms for many seemingly unrelated matrix problems.
9.2 Overview
As background, we note that early work in RandNLA focused on low-rank approximation
problems and led to results that were primarily of theoretical interest in idealized models of
data access [23–25,34,57,63]. An overview of RandNLA for readers not familiar with the area
has recently been provided [49]. Subsequent work on very overdetermined linear regression
problems, for example, least-squares regression problems with an input matrix $A \in \mathbb{R}^{m \times n}$,
with $m \gg n$, led to several remarkable successes for RandNLA: theoretical results for worst-case
inputs for the running time in the RAM model that improve upon the 200-year-old
Gaussian elimination [17,26,31,53,56]; high-quality implementations that are competitive
with or better than traditional deterministic implementations, for example, as provided by
LAPACK, on a single machine [2,18,62]; and high-quality implementations in parallel and
distributed environments on up to terabyte-sized input matrices [16,54,55,70,71].
As has been described in detail, for example, in [49], both the more theoretical and
the more applied successes of RandNLA for these very overdetermined linear regression
problems were achieved by using, implicitly or explicitly, the so-called statistical leverage
scores of the tall input matrix A. (The statistical leverage scores of a tall matrix $A \in \mathbb{R}^{m \times n}$
with $m \gg n$ are equal to the diagonal elements of the projection matrix onto the column span
of $A$ [14,48,50]. Thus, they capture a subtle but important structural property of the Euclidean
vector space from which the data were drawn.) In some cases, the use of leverage scores was explicit, in
that one used exact or approximate leverage scores to construct a nonuniform importance
sampling probability distribution with respect to which to sample rows from the input
matrix, thereby constructing a data-aware subspace embedding [26,28]. In other cases, the
use of leverage scores was implicit, in that one performed a random projection, thereby
implementing a data-oblivious subspace embedding [31,68]. (Random projections can be applied
in more general metric spaces, but in a Euclidean vector space a random projection essentially
amounts to rotating to a random basis, where the leverage scores are uniform and thus where
uniform sampling can be applied [49].) In both cases, the improved theoretical and practical
results for overdetermined linear regression problems were obtained by coupling the original,
rather theoretical, RandNLA ideas more closely with structural properties of the input data [49].
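To make the definition of the leverage scores concrete, the following short Python sketch (ours, not taken from the references above; it assumes NumPy and a tall matrix small enough to factor exactly) computes them as the squared row norms of an orthonormal basis for the column span of A, which equal the diagonal entries of the projection matrix onto that span.

```python
import numpy as np

def leverage_scores(A):
    """Exact statistical leverage scores of a tall matrix A (m x n, m >> n).

    They are the diagonal entries of the projection matrix onto the column
    span of A, i.e., the squared Euclidean row norms of an orthonormal
    basis U for that span.
    """
    U, _, _ = np.linalg.svd(A, full_matrices=False)  # thin SVD: U spans col(A)
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
scores = leverage_scores(A)
print(scores.sum())  # sums to rank(A), here 20: a standard sanity check
```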
In parallel with these successes on overdetermined regression problems, there have
also been several impressive successes on applying RandNLA methods to a wide range
of seemingly different low-rank matrix approximation problems. (By low-rank matrix approximation
problems, we informally mean problems where the input is a general matrix $A \in \mathbb{R}^{m \times n}$,
where both $m$ and $n$ are large, and a rank parameter $k \ll \min\{m, n\}$; the output is a
low-rank approximation to $A$, not necessarily the optimal one that is computed via the SVD.)
For example, consider the
following problems (which are described in more detail in Section 9.4):
• The column subset selection problem (CSSP), in which one seeks to select the most
informative subset of exactly k columns from a matrix.
• The problem of using random projections to compute low-rank matrix approximations
faster than traditional SVD-based or QR-based deterministic methods for worst-case
input matrices and/or for inputs that are typical in scientific computing applications.
• The problem of developing improved Nyström-based low-rank matrix approximations
of symmetric positive definite matrices.
• The problem of developing improved machine learning and data analysis methods to
identify interesting features in the data (feature selection).
These problems often arise in very different research areas, and they are—at least
superficially—quite different. Relatedly, it can be difficult to tell what—if anything—improved
algorithms for one problem mean for the possibility of improved algorithms for the others.
In this chapter, we highlight and discuss a particular deterministic structural property of
Euclidean vector spaces that underlies the recent improvements in RandNLA algorithms for
all of the above low-rank matrix approximation problems; see Lemma 9.1 for a statement
of this result. (Although this structural property is central to all of the above problems,
its role is typically obscured, since it is often secondary to the main result of interest in a
given paper, and thus it is hidden deep within the analysis of each of the superficially
different methods that use it. This property was first introduced by Boutsidis et al. [10] in
the context of the CSSP; it was subsequently used by Halko et al. [44] to simplify the
description of several related random projection algorithms; and it was then used—typically
knowingly, but sometimes unknowingly—by many researchers working on these and other
problems.) This structural property characterizes the interaction between the singular
subspaces of the input matrix A and any (deterministic or randomized) sketching matrix.
In particular, this structural property is deterministic, in the sense that it is a statement
about the (fixed) input data A and not the (randomized) algorithm. Moreover, it holds for
arbitrary matrices A, that is, matrices that have an arbitrarily large number of columns and
rows and not necessarily just tall-and-thin or short-and-fat matrices A, as was the case in
the over- or underdetermined least-squares problems. The structural property thus applies
most directly to problems where one is interested in low-rank approximation with respect
to a low-rank space of dimension $k \ll \min\{m, n\}$.
In RandNLA, the sketching matrix is typically either a matrix representing the operation
of sampling columns or rows from the input matrix, or a matrix representing the random
projection operation. In that case, this structural property has an interpretation in terms
of how the sampling or projection operation interacts with the subspaces defined by the top
and bottom part of the spectrum of A. We emphasize, however, that this structural property
holds more generally: in particular, it holds for any (deterministic or randomized) sketching
matrix and thus it is a property of independent interest. For example, while it is outside
the scope of this chapter to discuss in detail, one can easily imagine using this property to
derandomize RandNLA algorithms or to develop other deterministic matrix algorithms for
these and related matrix problems or to develop improved heuristics in machine learning
and data analysis applications. In the remainder of this chapter, we highlight this structural
property, stating and presenting an analysis of a more general version of it than has been
previously available. We also describe how it is used in several of the recent improvements
to various RandNLA algorithms for low-rank matrix approximation problems.
9.3 Our Main Technical Result
In this section, we state and prove our main technical result. This technical result is a
structural condition that characterizes the interaction between the singular subspaces of
the input matrix A and any deterministic or randomized sketching matrix.
9.3.1 Statement of the Main Technical Result
Recall that, given a matrix $A \in \mathbb{R}^{m \times n}$, many RandNLA algorithms seek to construct a
sketch of $A$ by post-multiplying $A$ by some sketching matrix $Z \in \mathbb{R}^{n \times r}$, where $r$ is much
smaller than $n$. (For example, $Z$ could represent the action of random sampling or random
projection.) Thus, the resulting matrix $AZ \in \mathbb{R}^{m \times r}$ is a matrix that is much smaller than the
original matrix $A$, and the interesting question is what kind of approximation guarantees
it offers for $A$.
A common approach is to explore how well $AZ$ spans the principal subspace of $A$, and
one metric of accuracy is the error matrix $A - P_{AZ}A$, where $P_{AZ}A$ is the projection of $A$
onto the subspace spanned by the columns of $AZ$. Formally,
$$P_{AZ} = (AZ)(AZ)^{+} = U_{AZ}U_{AZ}^{T}.$$
Recall that $X^{+} \in \mathbb{R}^{n \times m}$ is the Moore–Penrose pseudoinverse of any matrix $X \in \mathbb{R}^{m \times n}$,
and that it can be computed via the SVD of $X$; see [38] for details. Similarly, $U_{AZ} \in \mathbb{R}^{m \times \rho}$
is the matrix of the left singular vectors of $AZ$, where $\rho$ is the rank of $AZ$. The following
structural result offers a means to bound any unitarily invariant norm of the error matrix
$A - P_{AZ}A$.
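To fix ideas, here is a minimal Python/NumPy sketch (ours; the Gaussian choice of Z is purely illustrative and stands in for any sampling or projection construction) that forms the sketch AZ, the projection $P_{AZ}A$ through an orthonormal basis for the range of $AZ$, and the error matrix $A - P_{AZ}A$ whose norm the following lemma bounds.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 500, 300, 20

A = rng.standard_normal((m, n))   # input matrix
Z = rng.standard_normal((n, r))   # sketching matrix (Gaussian, for illustration only)

AZ = A @ Z                                             # the sketch, m x r
U_AZ, _, _ = np.linalg.svd(AZ, full_matrices=False)    # left singular vectors of AZ
P_AZ_A = U_AZ @ (U_AZ.T @ A)                           # projection of A onto range(AZ)
E = A - P_AZ_A                                         # error matrix A - P_{AZ} A

print(np.linalg.norm(E, 'fro'), np.linalg.norm(E, 2))  # Frobenius and spectral errors
```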
Lemma 9.1 Given $A \in \mathbb{R}^{m \times n}$, let $Y \in \mathbb{R}^{n \times k}$ be any matrix such that $Y^{T}Y = I_{k}$. Let
$Z \in \mathbb{R}^{n \times r}$ ($r \geq k$) be any matrix such that $Y^{T}Z$ and $AY$ have full rank. Then, for any
unitarily invariant norm $\xi$, we have that
$$\|A - P_{AZ}A\|_{\xi} \leq \|A - AYY^{T}\|_{\xi} + \|(A - AYY^{T})Z(Y^{T}Z)^{+}\|_{\xi}. \qquad (9.1)$$
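The inequality in Equation 9.1 is easy to check numerically. The following sketch (ours; the orthonormal Y obtained from a QR factorization and the Gaussian Z are arbitrary illustrative choices) evaluates both sides in the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k, r = 200, 150, 10, 25

A = rng.standard_normal((m, n))
Y, _ = np.linalg.qr(rng.standard_normal((n, k)))   # any n x k matrix with Y^T Y = I_k
Z = rng.standard_normal((n, r))                    # any n x r matrix, r >= k

# Left-hand side: || A - P_{AZ} A ||_F
U_AZ, _, _ = np.linalg.svd(A @ Z, full_matrices=False)
lhs = np.linalg.norm(A - U_AZ @ (U_AZ.T @ A), 'fro')

# Right-hand side: || A - A Y Y^T ||_F + || (A - A Y Y^T) Z (Y^T Z)^+ ||_F
R = A - A @ Y @ Y.T
rhs = np.linalg.norm(R, 'fro') + np.linalg.norm(R @ Z @ np.linalg.pinv(Y.T @ Z), 'fro')

print(lhs <= rhs)  # True: the bound of Lemma 9.1 holds
```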
Three comments about this lemma, one regarding $Z$, one regarding $Y$, and one regarding
the interaction between $Z$ and $Y$, are in order.
• Lemma 9.1 holds for any matrix $Z$, regardless of whether $Z$ is constructed deterministically
or randomly. In the context of RandNLA, typical constructions of $Z$ would
represent a random sampling or random projection operation.
• The orthogonal matrix $Y$ in the above lemma is also arbitrary. In the context of
RandNLA, one can think of $Y$ either as $Y = V_k$, where $V_k \in \mathbb{R}^{n \times k}$ is the matrix of the
top $k$ right singular vectors of $A$, or as some other orthogonal matrix that approximates
$V_k$; however, Lemma 9.1 holds more generally.
• As stated in Lemma 9.1, $Y$ must satisfy two conditions: the matrix $Y^{T}Z$ must have full
rank, equal to $k$, since $r \geq k$, and the matrix $AY$ must also have full rank, again equal
to $k$. If $Y = V_k$, then the constraint that $AY$ must have full rank is trivially satisfied,
assuming that $A$ has rank at least $k$. Additionally, the sampling and random projection
approaches that are used in high-quality RandNLA algorithms with sufficiently large
values of $r$ guarantee that the rank condition on $Y^{T}Z$ is satisfied [26,49]. More generally,
though, one could perform an a posteriori check that these two conditions hold.
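Such an a posteriori check is inexpensive relative to forming the sketch itself; a minimal version (ours, using NumPy's numerical rank routine) is the following.

```python
import numpy as np

def lemma_9_1_conditions_hold(A, Y, Z, k):
    """Check the two rank conditions of Lemma 9.1: Y^T Z and A Y have full rank k."""
    return (np.linalg.matrix_rank(Y.T @ Z) == k and
            np.linalg.matrix_rank(A @ Y) == k)
```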
9.3.2 Popular Special Case
Before providing a proof of this structural result, we will now consider a popular special
case of Lemma 9.1. To do so, we will let $Y = V_k \in \mathbb{R}^{n \times k}$, namely the orthogonal matrix of
the top $k$ right singular vectors of $A$. (Actually, any orthogonal matrix spanning that same
subspace would do in this discussion.) For notational convenience, we will let $V_{k,\perp} \in \mathbb{R}^{n \times (\rho - k)}$
(respectively, $\Sigma_{k,\perp} \in \mathbb{R}^{(\rho - k) \times (\rho - k)}$) be the matrix of the bottom $\rho - k$ right singular vectors
(respectively, singular values) of $A$, where $\rho$ now denotes the rank of $A$. Let $A_k \in \mathbb{R}^{m \times n}$ be
the best rank-$k$ approximation to $A$ as computed by the SVD. It is well known that
$$A_k = AV_kV_k^{T}.$$
Assuming that $V_k^{T}Z$ has full rank, then Lemma 9.1 implies that
$$\|A - P_{AZ}A\|_{\xi} \leq \|A - A_k\|_{\xi} + \|(A - A_k)Z(V_k^{T}Z)^{+}\|_{\xi}.$$
Note here that
$$A - A_k = U_{k,\perp}\Sigma_{k,\perp}V_{k,\perp}^{T},$$
and if we drop, using unitary invariance, the matrix $U_{k,\perp}$ from the second norm on the
right-hand side of the above inequality, then we get
$$\|A - P_{AZ}A\|_{\xi} \leq \|A - A_k\|_{\xi} + \|\Sigma_{k,\perp}V_{k,\perp}^{T}Z(V_k^{T}Z)^{+}\|_{\xi}.$$
For the special case of $\xi \in \{2, F\}$, this is exactly the structural condition underlying the
randomized low-rank projection algorithms of [44] that was first introduced in the context
of the CSSP [10]. We summarize the above discussion in the following lemma.
Lemma 9.2 Given $A \in \mathbb{R}^{m \times n}$, let $V_k \in \mathbb{R}^{n \times k}$ be the matrix of the top $k$ right singular
vectors of $A$. Let $Z \in \mathbb{R}^{n \times r}$ ($r \geq k$) be any matrix such that $V_k^{T}Z$ has full rank. Then, for
any unitarily invariant norm $\xi$,
$$\|A - P_{AZ}A\|_{\xi} \leq \|A - A_k\|_{\xi} + \|\Sigma_{k,\perp}V_{k,\perp}^{T}Z(V_k^{T}Z)^{+}\|_{\xi}. \qquad (9.2)$$
Equation 9.2 immediately suggests a proof strategy for bounding the error of RandNLA
algorithms for low-rank matrix approximation: identify a sketching matrix $Z$ such that
$V_k^{T}Z$ has full rank; at the same time, bound the relevant norms of $(V_k^{T}Z)^{+}$ and $V_{k,\perp}^{T}Z$.
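As an illustration of this strategy (ours; a Gaussian Z is used only as a stand-in for the sampling and projection constructions of Section 9.4, and the low-rank-plus-noise test matrix is hypothetical), the following sketch checks the full-rank condition, computes the two norms to be bounded, and compares both sides of Equation 9.2 in the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k, r = 300, 200, 10, 40

# Low-rank-plus-noise test matrix, so that || A - A_k || is small but nonzero.
A = (rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
     + 0.01 * rng.standard_normal((m, n)))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vk, Vk_perp = Vt[:k].T, Vt[k:].T            # top-k and bottom right singular vectors
Sk_perp = np.diag(s[k:])                    # bottom singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]    # best rank-k approximation

Z = rng.standard_normal((n, r))             # sketching matrix (illustrative)
print(np.linalg.matrix_rank(Vk.T @ Z) == k) # full-rank condition of Lemma 9.2

# The two quantities the proof strategy asks us to control.
print(np.linalg.norm(np.linalg.pinv(Vk.T @ Z), 2),
      np.linalg.norm(Sk_perp @ Vk_perp.T @ Z, 'fro'))

# Both sides of Equation 9.2 in the Frobenius norm.
U_AZ, _, _ = np.linalg.svd(A @ Z, full_matrices=False)
lhs = np.linalg.norm(A - U_AZ @ (U_AZ.T @ A), 'fro')
rhs = (np.linalg.norm(A - A_k, 'fro')
       + np.linalg.norm(Sk_perp @ Vk_perp.T @ Z @ np.linalg.pinv(Vk.T @ Z), 'fro'))
print(lhs <= rhs)  # True: the bound of Equation 9.2 holds
```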
Lemma 9.1 generalizes the prior use of Lemma 9.2 in several important ways:
• First, it is not necessary to focus on random sampling matrices or random projection
matrices, but instead, we consider arbitrary sketching matrices $Z$. This was actually
implicit in the analysis of the original version of Lemma 9.2 [10], but it seems worth
making that explicit here. It does, however, require the extra condition that $AY$ also
has full rank.