2
Big-n versus Big-p in Big Data
Norman Matloff
CONTENTS
2.1 No Need for Statistics with Big Data?
2.2 Notation and Notions of Dimensionality
2.3 Toward a Rough Taxonomy of Big Data
2.4 Case of Big-n
    2.4.1 Basic Method
    2.4.2 Validity of the SA Method
    2.4.3 Complexity Analysis
    2.4.4 Examples
2.5 Case of Big-p
    2.5.1 Running Examples: Linear Regression and PCA
    2.5.2 Curse of Dimensionality
        2.5.2.1 Data Sparsity
        2.5.2.2 Variable Sparsity May Ameliorate the COD
    2.5.3 Principle of Extreme Values
    2.5.4 What Theoretical Literature Tells Us
    2.5.5 Example
2.6 Conclusion
References
What do we—or should we—mean by the word big in the phrase Big Data? And to what degree are the problems of Big Data solved, being solved, or facing a poor outlook for solution? These questions will be discussed in this chapter, which examines some analyses in the research literature along with some simple models, with the goal of clarifying some issues for practical but careful data scientists.
Consider the classical setting of n observations in p variables, stored in an n×p matrix A.
Though the popular image of Big Data involves Big-n settings, typical Big Data applications
also involve Big-p. Both Big-n and Big-p will be treated here, in different senses, the former
from a computational point of view and the latter in a statistical context.
2.1 No Need for Statistics with Big Data?
A commonly held myth is that statistical inference is not an issue with Big Data. It is
believed that the sample size n is so large that standard errors of estimates are essentially 0,
and thus the problems of Big Data are only computational in nature, not statistical. But
this view is inaccurate. Standard errors, especially relative to effect size, can indeed be large
even with Big Data.
For a quick, simple example, consider Figure 1.D in [17]. It shows very wide error bars
for some types of graph configurations, in spite of n having a value of over 54 million. One
would find large standard errors in other kinds of analyses on this data, say for coefficients
of dummy variables in a linear regression model.
Similarly, the standard errors of coefficients in linear regression tend to grow as p increases, so for very large p, standard errors do become an issue. Indeed, the familiar problems of multiple comparisons or simultaneous inference worsen as p grows, for large but fixed n.
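To see the effect concretely, here is a small simulation sketch in R (not from the chapter; the design, with i.i.d. N(0,1) predictors and only the first one active, is purely illustrative) showing how the average coefficient standard error grows with p for fixed n:

# Illustrative simulation: coefficient standard errors grow with p at fixed n.
set.seed(1)
n <- 1000
for (p in c(10, 100, 500)) {
  x <- matrix(rnorm(n * p), n, p)
  y <- x[, 1] + rnorm(n)   # only the first predictor has a nonzero coefficient
  fit <- lm(y ~ x)
  cat("p =", p, " mean coef. std. error:",
      mean(summary(fit)$coefficients[, "Std. Error"]), "\n")
}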
Thus, the Big-p problem is just as important as Big-n. Indeed, a central issue will be the
size of p relative to n. Though a plethora of procedures have been developed in recent years
for Big-p settings, we will view these methods from a safe distance, in order to understand
their goals and possible pitfalls.
2.2 Notation and Notions of Dimensionality
Many authors, such as [3,10], like to use the term high-dimensional data rather than Big
Data. This is an allusion to the importance of the Big-p problem. In this section, we develop
our first notions along such lines, to be elaborated upon as the chapter progresses.
The usual setting will be assumed: We have n observations in p variables, stored in an
n × p matrix A. The dimensionality of the problem is then taken to be p.
This formulation is somewhat narrow, in at least two senses, which can be described as
follows:
• Many of today’s applications have less regularity than this. For instance, there may be a varying number of quantities per observation.
• The statistical dimensionality of the problem may be different from, often greater than, the number of variables.
The second item above is especially important. In modeling a distribution as a mixture of k Gaussians, for example, we could easily have k > p. Even more to the point, in nonparametric modeling for regression or classification, the effective dimensionality is infinite.
Nevertheless, the n × p setting will suffice for the discussion here. In some cases, though, p will denote the dimensionality of a parameter space.
2.3 Toward a Rough Taxonomy of Big Data
It would be useful to establish a working definition of Big Data, in terms of the relation between n and p. Let us focus on parametric models here, with θ ∈ R^p denoting the parameter vector to be estimated. For instance, θ could be the vector of coefficients in a linear model.
The results of [14] are very relevant here. For p-dimensional parametric exponential families, Portnoy showed that maximum likelihood estimators will be asymptotically normal in the usual sense if p^2/n → 0 as p, n → ∞. Since, as noted above, there are still reasons to be concerned about standard errors in Big Data, this criterion is appealing, and we might define Big-p as p > √n.
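For instance, with n = 10^8 observations, this criterion would classify any p exceeding √(10^8) = 10^4 as Big-p.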
On the other hand, if one is willing to assume variable sparsity, meaning that most components of θ are zero or small, one can at least obtain consistency results even if p ≫ n [3]. (It will be explained below why we call this concept variable sparsity, i.e., sparsity in the variables.)
But what about nonparametric approaches? Indications here are dismal, as will be discussed in Section 2.5.2, though a possible remedy is proposed there.
In other words, there is no hard-and-fast definition of Big-p. One’s definition depends
on one’s goals and assumptions. There will be an elaboration on this point in the following
sections.
2.4 Case of Big-n
One of the first aspects of Big-n that comes to mind is computation time. Here, we discuss
the elements of a simple yet very powerful method to reduce that computation, which will be
referred to as software alchemy (SA). For further details, including software for the method
and references to work by various authors related to SA, see [13].
2.4.1 Basic Method
Suppose one has a certain estimator one wishes to apply to one’s data. For concreteness, say we are finding an estimated coefficient vector in logistic regression using
the R glm() function. Instead of applying glm() to all n observations, we break the data
into r groups of size k, apply glm() to each group, then average the resulting r estimated
coefficient vectors.
If we have a multicore machine or other parallel system that allows parallel computation
of degree r, then the computation done in this manner can be much faster than if we were
to simply apply glm() to the full dataset.
More formally defined, the method is as follows. Let X_1, X_2, ..., X_n be i.i.d., with common distribution F_X, and let θ be some function of F_X. Both X and θ could be vector-valued. Let θ̂ be an estimator of θ based on n observations, divide the data into r groups, and let k = n/r. Specifically, the first group consists of X_1, ..., X_k, the second comprises X_{k+1}, ..., X_{2k}, and so on. (If r does not evenly divide n, the last group will be of size larger than k.) Now compute the analogs of θ̂ on each group, yielding θ̃_1, θ̃_2, ..., θ̃_r. Finally, take the average of these values:

    θ̄ = (1/r) Σ_{i=1}^{r} θ̃_i    (2.1)

This becomes our estimator.
The utility of this method also extends to distributed databases, such as the Hadoop
Distributed File System. There our data are already stored in chunks, and thus ready for
using the above method.
Note carefully the i.i.d. assumption. In some datasets, there has been some preprocessing
performed that invalidates this assumption. For instance, the data may be stored in an order
determined by the sorted values of one of the variables. This is the case, for example, with
the flight delay data used later in this chapter, which is stored in date order. In such a
setting, the data must be randomized before applying the method.
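For instance, with the data held in an R data frame (here called dat, a hypothetical name), the randomization can be as simple as the following:

# Shuffle the rows before forming the groups, in case the storage order is
# nonrandom (e.g., sorted by date, as with the flight delay data).
dat <- dat[sample(nrow(dat)), ]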
In the parallel processing literature, the term embarrassingly parallel (EP) refers to
applications in which computation may be split into chunks whose computation can be
done independently of each other. For instance, finding the sum of elements of a vector has
the EP property, whereas complex sorting algorithms do not. The above-described method has a big advantage in that it in essence can convert a non-EP statistical problem into an EP one, as the θ̃_i above can be computed independently of each other; hence the term SA: converting non-EP problems to (statistically equivalent) EP ones.
R code implementing SA is available in [13].
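To make the recipe concrete, here is a minimal sketch in base R of the chunk-and-average computation for the logistic regression example. The function name sa_glm, the data frame dat, and the assumption of a 0/1 response column named y are all hypothetical; in practice, the software accompanying [13] should be used.

# Minimal illustrative SA sketch for logistic regression (not the code of [13]).
sa_glm <- function(dat, r) {
  # assumes the rows of dat are already in random order (see above) and that
  # the response is a 0/1 column named y
  n <- nrow(dat)
  groups <- split(seq_len(n), cut(seq_len(n), r, labels = FALSE))
  # the per-group fits are independent of one another, so they could be run
  # in parallel, e.g. via parallel::mclapply()
  chunk_fits <- lapply(groups, function(idx)
    coef(glm(y ~ ., data = dat[idx, ], family = binomial)))
  theta_tilde <- do.call(rbind, chunk_fits)  # r x (p+1) matrix of group estimates
  colMeans(theta_tilde)                      # the averaged estimator, as in (2.1)
}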
2.4.2 Validity of the SA Method
The speedup in computation would of course be useless if θ̄ were less statistically accurate than θ̂. Fortunately, that is not the case. In [13], it was shown that the SA method works well for any asymptotically normal estimator, in the sense that SA achieves the same statistical accuracy as that estimator:

    Fix r while letting n and k go to ∞. If θ̂ is asymptotically multivariate normally distributed with asymptotic covariance matrix Σ, then the same holds for θ̄.

In other words, θ̄ has the same statistical accuracy as that of θ̂, and thus our speedup in computation does not come at the expense of statistical accuracy. This result is intuitively clear, and is proven using characteristic functions in [13].
It is clear that the SA method also works if p is growing, as long as one has asymptotically normal estimators. For this reason, a rough criterion for usability of the method might be taken to be p < √n, the Portnoy criterion discussed above; a more conservative criterion would be to apply the above to the groups, that is, require p < √(n/r). (One sometimes hears that Tukey recommended that one use not more than √n predictor variables in regression analyses, but the author has not found a reference for this.)
Concerning the asymptotic nature of the SA method, note that it is used only in situations in which n is large enough for SA to bring a computational speedup, settings in which n should be large enough for the asymptotics to have taken hold. SA was applied to a variety of estimators in [13], and excellent agreement between θ̄ and θ̂ was found, with the proportional difference always being under 0.001 and in most cases much smaller.
Note that another reason for restricting use of the method to Big-n is bias, which may be
more prominent in smaller samples. See Section 2.5.3 for a discussion of this in the principal
components analysis (PCA) setting.
Since the data groups are independent, it is straightforward to calculate an estimated covariance matrix for θ̄; we simply take the sample covariance matrix of the r quantities θ̃_i, divided by r. The R code available in [13] does this.
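A minimal sketch of that calculation, assuming (hypothetically) that the group estimates θ̃_i have been collected as the rows of a matrix theta_tilde, as in the sketch given earlier:

# Estimated covariance matrix of the averaged estimator: the sample covariance
# matrix of the r group estimates, divided by r.
sa_cov <- function(theta_tilde) {
  cov(theta_tilde) / nrow(theta_tilde)
}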
2.4.3 Complexity Analysis
Suppose the work associated with computing θ̂ is O(n^c). This would be the case, for instance, in computation of any U-statistic, in which one must apply some function to all possible (X_i, X_j) pairs of observations in our data, so that we must do O(n^2) amount of computation. If the r chunks are handled simultaneously on a parallel machine, say multicore, SA reduces the time complexity of an O(n^c) problem to roughly O(n^c/r^c).
Thus, statistical applications with computational time complexity greater than or equal
to O(n) will be major beneficiaries of SA. Examples of such applications investigated in [13]
include quantile regression, estimation of hazard functions with censored data, and so on.
For c > 1, SA brings a superlinear speedup, meaning a speedup that is faster than linear in r; the latter would only reduce the time to O(n^c/r). Another noteworthy point is that a similar analysis shows that we attain a speedup for c > 1 even if the groups are handled serially on a uniprocessor machine, with time r · O(n^c/r^c) = O(n^c/r^(c−1)).
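For a concrete numerical illustration, take a U-statistic with c = 2, n = 10^6, and r = 10. The full computation involves on the order of 10^12 operations, whereas each group involves only (10^5)^2 = 10^10. Handled in parallel on 10 cores, the elapsed work is thus roughly 1/100 of the original; handled serially, the total is 10 × 10^10 = 10^11 operations, still a tenfold reduction.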
However, in some applications, the run-time complexity depends substantially on p, not just n. In the linear regression case, for example, the run time is first O(np^2) to form sums of squares and cross products, and then O(p^3) to perform matrix inversion. Using QR factorization instead of matrix inversion, the latter time may be O(p^2), depending on which quantities are computed, but the point is that the situation is not so simple as considering the time effects of n alone. The case of PCA is similar.
One of the investigations performed in [13], for instance, simulated linear regression computation with n = 200,000 and p = 125. SA yielded speedups of 1.70 and 2.49 with 4 and 16 cores, respectively.
By contrast, the simulations done for quantile regression
exhibited superlinear behavior: For a problem size of n = 50,000 and p = 75, speedups of
8.06 and 25.55 were obtained with 4 and 16 cores.
2.4.4 Examples
In [13], the SA method was investigated via simulation. Here, let us look at some real data,
specifically the airline flight delay data (http://stat-computing.org/dataexpo/2009).
The approximately 500,000 records of the 2008 dataset were used in this example. The
R glm.fit() function was used to predict whether a flight would be more than 10 minutes late,
using the variables ActualElapsedTime, DepDelay, DepTime, Distance, Month, TaxiIn, and
TaxiOut as predictors. A much more sophisticated prediction model could be developed,
say by converting Month to seasonal dummy variables, but here the goal was to illustrate
the SA method.
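As a rough sketch of the kind of fit involved (illustrative only: the data frame name flights2008 is hypothetical, the response here is assumed to be built from the dataset's ArrDelay column, and the chapter's computation actually used glm.fit() together with the SA chunking described above):

# Hypothetical setup; ArrDelay is assumed to be the arrival delay in minutes.
flights2008$Late10 <- as.numeric(flights2008$ArrDelay > 10)
fit <- glm(Late10 ~ ActualElapsedTime + DepDelay + DepTime + Distance +
             Month + TaxiIn + TaxiOut,
           data = flights2008, family = binomial)
coef(fit)["DepDelay"]  # compare with the DepDelay coefficients tabulated below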
The run times in seconds, and the values of the estimated coefficient for the DepDelay variable, were as follows:
Cores Time DepDelay coef.
1 195.593 0.203022181
4 123.331 0.203025159
8 82.160 0.203027800
16 85.454 0.203032682
Substantial speedups were obtained up through 8 cores, though using 16 cores did not
help further. The latter behavior is typical in the parallel processing world; after a certain
point, additional parallelization does not bring benefit, and may actually reduce speedup.
In addition, the estimated DepDelay coefficients obtained using SA were in very close agreement with that obtained by applying glm.fit() to the full data. The same held for the estimated coefficients of the other predictor variables, which are not shown here.
Next, consider ridge regression, using the implementation in the CRAN package ridge.
Here, the ridge parameter λ is set via an algorithm, so computation can be lengthy. The
results have been tabulated as follows:
Cores Time ActualElapsedTime coef.
1 90.173 0.3643521
4 53.336 0.3643433
8 39.584 0.3643375
16 47.696 0.3643163
Again, substantial speedups are obtained, and the SA estimator is virtually indistinguishable from the non-chunked one.
(For the timings in Sections 2.4.3 and 2.4.4, the machine used was a 16-core system with a hyperthreading degree of 2, thus with parallel computation capability approaching 32. The machine was running OpenBLAS for matrix operations.)