2
Big-n versus Big-p in Big Data
Norman Matloff
CONTENTS
2.1 No Need for Statistics with Big Data?
2.2 Notation and Notions of Dimensionality
2.3 Toward a Rough Taxonomy of Big Data
2.4 Case of Big-n
    2.4.1 Basic Method
    2.4.2 Validity of the SA Method
    2.4.3 Complexity Analysis
    2.4.4 Examples
2.5 Case of Big-p
    2.5.1 Running Examples: Linear Regression and PCA
    2.5.2 Curse of Dimensionality
        2.5.2.1 Data Sparsity
        2.5.2.2 Variable Sparsity May Ameliorate the COD
    2.5.3 Principle of Extreme Values
    2.5.4 What Theoretical Literature Tells Us
    2.5.5 Example
2.6 Conclusion
References
What do we—or should we—mean by the word big in the phrase Big Data? And to what degree are the problems of Big Data solved, being solved, or facing a poor outlook for solution? These questions will be discussed in this chapter, which examines some analyses in the research literature along with some simple models, with the goal of clarifying some issues for practical but careful data scientists.
Consider the classical setting of n observations in p variables, stored in an n×p matrix A.
Though the popular image of Big Data involves Big-n settings, typical Big Data applications
also involve Big-p. Both Big-n and Big-p will be treated here, in different senses, the former
from a computational point of view and the latter in a statistical context.
2.1 No Need for Statistics with Big Data?
A commonly held myth is that statistical inference is not an issue with Big Data. It is
believed that the sample size n is so large that standard errors of estimates are essentially 0,
and thus the problems of Big Data are only computational in nature, not statistical. But
this view is inaccurate. Standard errors, especially relative to effect size, can indeed be large
even with Big Data.
For a quick, simple example, consider Figure 1.D in [17]. It shows very wide error bars
for some types of graph configurations, in spite of n having a value of over 54 million. One
would find large standard errors in other kinds of analyses on this data, say for coefficients
of dummy variables in a linear regression model.
Similarly, the standard errors of coefficients in linear regression tend to grow as p increases, so for very large p, standard errors do become an issue. Indeed, the familiar problems of multiple comparisons or simultaneous inference worsen as p grows, for large but fixed n.
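To see the effect concretely, here is a small simulation sketch in R (not from the chapter; the design, with i.i.d. N(0,1) predictors and only the first one active, is purely illustrative) showing how the average coefficient standard error grows with p for fixed n:

# Illustrative simulation: coefficient standard errors grow with p at fixed n.
set.seed(1)
n <- 1000
for (p in c(10, 100, 500)) {
  x <- matrix(rnorm(n * p), n, p)
  y <- x[, 1] + rnorm(n)   # only the first predictor has a nonzero coefficient
  fit <- lm(y ~ x)
  cat("p =", p, " mean coef. std. error:",
      mean(summary(fit)$coefficients[, "Std. Error"]), "\n")
}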
Thus, the Big-p problem is just as important as Big-n. Indeed, a central issue will be the
size of p relative to n. Though a plethora of procedures have been developed in recent years
for Big-p settings, we will view these methods from a safe distance, in order to understand
their goals and possible pitfalls.
2.2 Notation and Notions of Dimensionality
Many authors, such as [3,10], like to use the term high-dimensional data rather than Big
Data. This is an allusion to the importance of the Big-p problem. In this section, we develop
our first notions along such lines, to be elaborated upon as the chapter progresses.
The usual setting will be assumed: We have n observations in p variables, stored in an
n × p matrix A. The dimensionality of the problem is then taken to be p.
This formulation is somewhat narrow, in at least two senses, which can be described as
follows:
• Many of today’s applications have less regularity than this. For instance, there may be a varying number of quantities per observation.
• The statistical dimensionality of the problem may be different from, often greater than, the number of variables.
The second item above is especially important. In modeling a distribution as a mixture of k Gaussians, for example, we could easily have k > p. Even more to the point, in nonparametric modeling for regression or classification, the effective dimensionality is infinite.
Nevertheless, the n × p setting will suffice for the discussion here. In some cases, though, p will denote the dimensionality of a parameter space.
2.3 Toward a Rough Taxonomy of Big Data
It would be useful to establish a working definition of Big Data, in terms of the relation between n and p. Let us focus on parametric models here, with θ ∈ R^p denoting the parameter vector to be estimated. For instance, θ could be the vector of coefficients in a linear model.
The results of [14] are very relevant here. For p-dimensional parametric exponential families, Portnoy showed that maximum likelihood estimators will be asymptotically normal in the usual sense if p^2/n → 0 as p, n → ∞. Since, as noted above, there are still reasons to be concerned about standard errors in Big Data, this criterion is appealing, and we might define Big-p as p > √n.
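For instance, with n = 10^8 observations, this criterion would classify any p exceeding √(10^8) = 10^4 as Big-p.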
On the other hand, if one is willing to assume variable sparsity, meaning that most components of θ are zero or small, one can at least obtain consistency results even if p ≫ n [3]. (It will be explained below why we call this concept variable sparsity, i.e., sparsity in the variables.)
But what about nonparametric approaches? Indications here are dismal, as will be discussed in Section 2.5.2, though a possible remedy is proposed there.
In other words, there is no hard-and-fast definition of Big-p. One’s definition depends
on one’s goals and assumptions. There will be an elaboration on this point in the following
sections.
2.4 Case of Big-n
One of the first aspects of Big-n that comes to mind is computation time. Here, we discuss
the elements of a simple yet very powerful method to reduce that computation, which will be
referred to as software alchemy (SA). For further details, including software for the method
and references to work by various authors related to SA, see [13].
2.4.1 Basic Method
Suppose one has a certain estimator one wishes to apply to one’s data. For concreteness, say we are finding an estimated coefficient vector in logistic regression using
the R glm() function. Instead of applying glm() to all n observations, we break the data
into r groups of size k, apply glm() to each group, then average the resulting r estimated
coefficient vectors.
If we have a multicore machine or other parallel system that allows parallel computation
of degree r, then the computation done in this manner can be much faster than if we were
to simply apply glm() to the full dataset.
More formally defined, the method is as follows. Let X_1, X_2, ..., X_n be i.i.d., with common distribution F_X, and let θ be some function of F_X. Both X and θ could be vector-valued. Let θ̂ be an estimator of θ based on n observations, divide the data into r groups, and let k = n/r. Specifically, the first group consists of X_1, ..., X_k, the second comprises X_{k+1}, ..., X_{2k}, and so on. (If r does not evenly divide n, the last group will be of size larger than k.) Now compute the analogs of θ̂ on each group, yielding θ̃_1, θ̃_2, ..., θ̃_r. Finally, take the average of these values:

    θ̄ = (1/r) Σ_{i=1}^{r} θ̃_i    (2.1)

This becomes our estimator.
The utility of this method also extends to distributed databases, such as the Hadoop
Distributed File System. There our data are already stored in chunks, and thus ready for
using the above method.
Note carefully the i.i.d. assumption. In some datasets, there has been some preprocessing
performed that invalidates this assumption. For instance, the data may be stored in an order
determined by the sorted values of one of the variables. This is the case, for example, with
the flight delay data used later in this chapter, which is stored in date order. In such a
setting, the data must be randomized before applying the method.
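For instance, with the data held in an R data frame (here called dat, a hypothetical name), the randomization can be as simple as the following:

# Shuffle the rows before forming the groups, in case the storage order is
# nonrandom (e.g., sorted by date, as with the flight delay data).
dat <- dat[sample(nrow(dat)), ]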
In the parallel processing literature, the term embarrassingly parallel (EP) refers to
applications in which computation may be split into chunks whose computation can be
done independently of each other. For instance, finding the sum of elements of a vector has
the EP property, whereas complex sorting algorithms do not. The above-described method has a big advantage in that it in essence can convert a non-EP statistical problem into an EP one, as the θ̃_i above can be computed independently of each other; hence the term SA: converting non-EP problems to (statistically equivalent) EP ones.
R code implementing SA is available in [13].
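To make the recipe concrete, here is a minimal sketch in base R of the chunk-and-average computation for the logistic regression example. The function name sa_glm, the data frame dat, and the assumption of a 0/1 response column named y are all hypothetical; in practice, the software accompanying [13] should be used.

# Minimal illustrative SA sketch for logistic regression (not the code of [13]).
sa_glm <- function(dat, r) {
  # assumes the rows of dat are already in random order (see above) and that
  # the response is a 0/1 column named y
  n <- nrow(dat)
  groups <- split(seq_len(n), cut(seq_len(n), r, labels = FALSE))
  # the per-group fits are independent of one another, so they could be run
  # in parallel, e.g. via parallel::mclapply()
  chunk_fits <- lapply(groups, function(idx)
    coef(glm(y ~ ., data = dat[idx, ], family = binomial)))
  theta_tilde <- do.call(rbind, chunk_fits)  # r x (p+1) matrix of group estimates
  colMeans(theta_tilde)                      # the averaged estimator, as in (2.1)
}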
2.4.2 Validity of the SA Method
The speedup in computation would of course be useless if θ̄ were less statistically accurate than θ̂. Fortunately, that is not the case. In [13], it was shown that the SA method works well for any asymptotically normal estimator, in the sense that SA achieves the same statistical accuracy as that estimator:

    Fix r while letting n and k go to ∞. If θ̂ is asymptotically multivariate normally distributed with asymptotic covariance matrix Σ, then the same holds for θ̄.

In other words, θ̄ has the same statistical accuracy as that of θ̂, and thus our speedup in computation does not come at the expense of statistical accuracy. This result is intuitively clear, and is proven using characteristic functions in [13].
It is clear that the SA method also works if p is growing, as long as one has asymptotically normal estimators. For this reason, a rough criterion for usability of the method might be taken to be p < √n, the Portnoy criterion discussed above; a more conservative criterion would be to apply the above to the groups, that is, require p < √(n/r). (One sometimes hears that Tukey recommended that one use not more than √n predictor variables in regression analyses, but the author has not found a reference for this.)
Concerning the asymptotic nature of the SA method, note that it is used only in situations in which n is large enough for SA to bring a computational speedup, settings in which n should be large enough for the asymptotics to have taken hold. SA was applied to a variety of estimators in [13], and excellent agreement between θ̄ and θ̂ was found, with the proportional difference always being under 0.001 and in most cases much smaller.
Note that another reason for restricting use of the method to Big-n is bias, which may be
more prominent in smaller samples. See Section 2.5.3 for a discussion of this in the principal
components analysis (PCA) setting.
Since the data groups are independent, it is straightforward to calculate an estimated covariance matrix for θ̄; we simply take the sample covariance matrix of the r quantities θ̃_i, divided by r. The R code available in [13] does this.
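A minimal sketch of that calculation, assuming (hypothetically) that the group estimates θ̃_i have been collected as the rows of a matrix theta_tilde, as in the sketch given earlier:

# Estimated covariance matrix of the averaged estimator: the sample covariance
# matrix of the r group estimates, divided by r.
sa_cov <- function(theta_tilde) {
  cov(theta_tilde) / nrow(theta_tilde)
}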
2.4.3 Complexity Analysis
Suppose the work associated with computing θ̂ is O(n^c). This would be the case, for instance, in computation of any U-statistic, in which one must apply some function to all possible (X_i, X_j) pairs of observations in our data, so that we must do O(n^2) amount of computation. If the r chunks are handled simultaneously on a parallel machine, say multicore, SA reduces the time complexity of an O(n^c) problem to roughly O(n^c/r^c).
Thus, statistical applications with computational time complexity greater than or equal
to O(n) will be major beneficiaries of SA. Examples of such applications investigated in [13]
include quantile regression, estimation of hazard functions with censored data, and so on.
For c > 1, SA brings a superlinear speedup, meaning a speedup that is faster than linear in r; the latter would only reduce the time to O(n^c/r). Another noteworthy point is that a similar analysis shows that we attain a speedup for c > 1 even if the groups are handled serially on a uniprocessor machine, with time r · O(n^c/r^c) = O(n^c/r^(c−1)).
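For a concrete numerical illustration, take a U-statistic with c = 2, n = 10^6, and r = 10. The full computation involves on the order of 10^12 operations, whereas each group involves only (10^5)^2 = 10^10. Handled in parallel on 10 cores, the elapsed work is thus roughly 1/100 of the original; handled serially, the total is 10 × 10^10 = 10^11 operations, still a tenfold reduction.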
However, in some applications, the run-time complexity depends substantially on p, not just n. In the linear regression case, for example, the run time is first O(np^2) to form sums of squares and cross products, and then O(p^3) to perform matrix inversion. Using QR factorization instead of matrix inversion, the latter time may be O(p^2), depending on which quantities are computed, but the point is that the situation is not so simple as considering the time effects of n alone. The case of PCA is similar.
One of the investigations performed in [13], for instance, simulated linear regression computation with n = 200,000 and p = 125. SA yielded speedups of 1.70 and 2.49 with 4 and 16 cores, respectively.
By contrast, the simulations done for quantile regression
exhibited superlinear behavior: For a problem size of n = 50,000 and p = 75, speedups of
8.06 and 25.55 were obtained with 4 and 16 cores.
2.4.4 Examples
In [13], the SA method was investigated via simulation. Here, let us look at some real data,
specifically the airline flight delay data (http://stat-computing.org/dataexpo/2009).
The approximately 500,000 records of the 2008 dataset were used in this example. The
R glm.fit() function was used to predict whether a flight would be more than 10 minutes late,
using the variables ActualElapsedTime, DepDelay, DepTime, Distance, Month, TaxiIn, and
TaxiOut as predictors. A much more sophisticated prediction model could be developed,
say by converting Month to seasonal dummy variables, but here the goal was to illustrate
the SA method.
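As a rough sketch of the kind of fit involved (illustrative only: the data frame name flights2008 is hypothetical, the response here is assumed to be built from the dataset's ArrDelay column, and the chapter's computation actually used glm.fit() together with the SA chunking described above):

# Hypothetical setup; ArrDelay is assumed to be the arrival delay in minutes.
flights2008$Late10 <- as.numeric(flights2008$ArrDelay > 10)
fit <- glm(Late10 ~ ActualElapsedTime + DepDelay + DepTime + Distance +
             Month + TaxiIn + TaxiOut,
           data = flights2008, family = binomial)
coef(fit)["DepDelay"]  # compare with the DepDelay coefficients tabulated below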
The run times in seconds, and the values of the estimated coefficient for the DepDelay variable, were as follows:
Cores Time DepDelay coef.
1 195.593 0.203022181
4 123.331 0.203025159
8 82.160 0.203027800
16 85.454 0.203032682
Substantial speedups were obtained up through 8 cores, though using 16 cores did not
help further. The latter behavior is typical in the parallel processing world; after a certain
point, additional parallelization does not bring benefit, and may actually reduce speedup.
In addition, the estimated DepDelay coefficients obtained using SA were in very close agreement with that obtained by applying glm.fit() to the full data. The same held for the estimated coefficients of the other predictor variables, which are not shown here.
Next, consider ridge regression, using the implementation in the CRAN package ridge.
Here, the ridge parameter λ is set via an algorithm, so computation can be lengthy. The
results have been tabulated as follows:
Cores Time ActualElapsedTime coef.
1 90.173 0.3643521
4 53.336 0.3643433
8 39.584 0.3643375
16 47.696 0.3643163
Again, substantial speedups are obtained, and the SA estimator is virtually indistinguishable from the non-chunked one.
(For the timings in Sections 2.4.3 and 2.4.4, the machine used was a 16-core system with a hyperthreading degree of 2, thus with parallel computation capability approaching 32. The machine was running OpenBLAS for matrix operations.)