A linear but biased estimator of β_j is therefore given by

\[
\hat{b}_j = \frac{(Z^{(j)})^T Y}{(Z^{(j)})^T X^{(j)}}
\quad \text{with} \quad
E[\hat{b}_j] = \beta_j + \sum_{k \neq j} P_{jk}\,\beta_k,
\]
where
\[
P_{jk} = \frac{(Z^{(j)})^T X^{(k)}}{(Z^{(j)})^T X^{(j)}}.
\]
In the low-dimensional case, we can find a Z^(j) that is orthogonal to all X^(k), k ≠ j, by defining Z^(j) as the residuals when regressing X^(j) versus X^(k), k ≠ j (leading to an unbiased estimator). In fact, this is how the classical OLS estimator can be obtained. This is not possible anymore in the high-dimensional situation. However, we can use a lasso projection by defining Z^(j) as the residuals of a lasso regression of X^(j) versus all other predictor variables X^(k), k ≠ j, leading to the bias-corrected estimator
\[
\hat{b}_j = \frac{(Z^{(j)})^T Y}{(Z^{(j)})^T X^{(j)}} - \sum_{k \neq j} P_{jk}\,\hat{\beta}_{\mathrm{Lasso},k},
\]
where β̂_Lasso is the lasso estimator when regressing Y versus all predictors in X. In contrast to β̂_Lasso, the estimator b̂ is not sparse anymore, that is, the lasso estimator got de-sparsified.
Asymptotic theory is available for the de-sparsified estimator. Under suitable regularity conditions, it holds that for large n (and p),

\[
\frac{\hat{b}_j - \beta_j}{\sigma \sqrt{\Omega_{jj}}} \approx \mathcal{N}(0, 1),
\qquad \text{where} \qquad
\Omega_{jj} = \frac{(Z^{(j)})^T Z^{(j)}}{\big((X^{(j)})^T Z^{(j)}\big)^2},
\]
and similarly for the multivariate counterpart. See [29] for details, including the uniformity of the convergence.
A major drawback of the lasso projection approach is the computation of the p
different lasso regressions for the regularized projection step. However, the problem is highly
parallelizable by construction.
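To make the construction concrete, here is a minimal sketch of the projection step for a single coordinate j, written with the glmnet package rather than hdi, for a design matrix x and response y as used in the examples below; the noise-level estimate sigma.hat is a placeholder that would have to be supplied separately (e.g., from a scaled lasso).

library(glmnet)

## De-sparsified lasso estimate for a single coordinate j (sketch).
j <- 1
n <- nrow(x); p <- ncol(x)

## Z^(j): residuals of a lasso regression of X^(j) on all other predictors.
cv.node <- cv.glmnet(x[, -j], x[, j])
z.j <- x[, j] - as.numeric(predict(cv.node, newx = x[, -j], s = "lambda.min"))

## Lasso of Y on all predictors, used for the bias correction.
cv.y <- cv.glmnet(x, y)
beta.lasso <- as.numeric(coef(cv.y, s = "lambda.min"))[-1]   # drop intercept

## Bias-corrected (de-sparsified) estimate of beta_j.
denom <- sum(z.j * x[, j])
P.jk <- as.numeric(crossprod(z.j, x)) / denom                # P_j1, ..., P_jp
b.j <- sum(z.j * y) / denom - sum(P.jk[-j] * beta.lasso[-j])

## Approximate standard error and two-sided p-value (sigma.hat assumed given).
omega.jj <- sum(z.j^2) / denom^2
se.j <- sigma.hat * sqrt(omega.jj)
p.value <- 2 * pnorm(-abs(b.j / se.j))

Repeating the nodewise regression for all j = 1,...,p gives the p lasso problems mentioned above, which can be run in parallel.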
Lasso Projection in R
We use the function lasso.proj in the R-package hdi. It shares a very similar
interface and output structure with the function ridge.proj. We get a fitted object
using
> fit.lasso <- lasso.proj(x, y)
Again, p-values for the (standard) null hypotheses β_j = 0 are stored in pval and their multiple-testing counterparts in pval.corr. It is also possible to do simultaneous tests, for example, for G = {1, 2, 3},
> fit.lasso$groupTest(group = 1:3)
and to get the confidence intervals using the function confint.
The method can be parallelized by setting the option parallel = TRUE (and using
a suitable value for ncores). In addition, user-defined score vectors can be supplied
through the argument Z.
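For illustration, a parallel fit might be requested as follows (ncores = 4 is an arbitrary choice and should be adapted to the available machine):

> fit.lasso <- lasso.proj(x, y, parallel = TRUE, ncores = 4)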
17.3.3 Extension to Generalized Linear Models
Conceptually, the presented methods can be extended to generalized linear models by using a weighted squared-error approach [5]. The idea is to first apply a (standard) ℓ_1-penalized maximum likelihood (lasso) estimator and use its solution to construct the corresponding weighted least-squares problem (in the spirit of the iteratively re-weighted least-squares algorithm, IRLS). The presented methods can then be applied to the re-weighted problem.
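To illustrate the reweighting step for a binary response, the following sketch (a minimal illustration under the usual logistic-regression setup, not the implementation of [5]) constructs the weighted least-squares problem from an initial ℓ_1-penalized fit obtained with glmnet; the linear-model methods above could then be applied to the rescaled design xw and working response zw.

library(glmnet)

## Initial l1-penalized logistic fit (y is binary, coded 0/1).
cv.bin <- cv.glmnet(x, y, family = "binomial")
eta <- as.numeric(predict(cv.bin, newx = x, s = "lambda.min"))   # linear predictor
mu  <- 1 / (1 + exp(-eta))                                       # fitted probabilities

## IRLS-type weights and working response at the lasso solution.
w <- mu * (1 - mu)
z <- eta + (y - mu) / w

## Weighted least-squares problem: rescale design and response row-wise.
xw <- sqrt(w) * x
zw <- sqrt(w) * z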
Extensions to Generalized Linear Models in R
Both methods have a family argument. If we have a binary outcome Y, we can use
> fit.ridge <- ridge.proj(x, y, family = "binomial")
to fit a logistic regression model with a binary response y using the ridge projection
method.
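Since both functions accept the family argument, the analogous call for the lasso projection method should be:

> fit.lasso <- lasso.proj(x, y, family = "binomial")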
17.3.4 Covariance Test
Another recent method that is also available in R is the so-called covariance test [15] and its
extension, the spacing test [26] (requiring fewer assumptions and with exact finite sample
results). The idea is to do sequential inference along the lasso solution path (similar as
in forward stepwise regression) by performing a (conditional) test whenever a predictor
enters the path. A heuristic to extend the methods to generalized linear models (GLMs) is
available, again through the IRLS approach.
Compared to the previous methods, the covariance test is a conditional test. Although theory, and more importantly software, yields p-values on the level of individual variables, the interpretation is now different: a p-value corresponds to a lasso step rather than to the individual variable entering at that step; for a new dataset, we might see a different variable at the same step. See also the discussion of [15] and the corresponding rejoinder. An analogous example of this different philosophy can be found in the rejoinder of [15]: when trying to determine the test error of the k-step lasso solution by cross-validation, we would run the lasso on different subsets of the dataset and average the observed test errors. A different model (with respect to the selected variables, but not its size) is potentially being selected at every cross-validation iteration.
Covariance Test in R
We use the R-package covTest [25]. For the linear case, it needs a lars object [10]
as input.
> fit.lars <- lars(x, y)
> fit.covTest <- covTest(fit.lars, x, y)
The p-values are stored in results. The call for the GLM situation is very similar:
> fit.lars.glm <- lars.glm(x, y, family = "binomial")
> fit.covTest.glm <- covTest(fit.lars.glm, x = x, y = y)
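To inspect the sequential p-values directly, one can look at the results component (the exact column layout depends on the covTest version, so treat this as illustrative):

> fit.covTest$results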
17.4 Subsampling, Sample Splitting, and p-Value Aggregation
17.4.1 From Selection Frequencies to Stability Selection
Because sparse estimators have a point mass at zero, it is not straightforward to apply
bootstrap or subsampling techniques to do inference or to construct confidence intervals.
Nevertheless, bootstrap techniques have been widely used in model selection problems to
assess the stability of a selected model. The idea is to focus on those predictors that are still
being selected when the dataset is being reshuffled. The selection frequency of a predictor
(in a total of B samples) can therefore be used as a heuristic measure of its stability.
A theoretical foundation of such an approach can be found in [20]. Assume that we have
a model selection procedure that selects (on average) q predictors (for example, by using
the q predictors that enter the lasso path first when varying the penalty parameter λ). At
every iteration b = 1,...,B (for example, B = 500), we perform the following two steps:
1. Draw a subsample I_b ⊂ {1,...,n} (without replacement) of size ⌊n/2⌋.
2. Apply the model selection procedure to the subsampled data, leading to Ŝ_b = Ŝ(I_b).
This allows us to calculate an empirical selection frequency π̂_j for every predictor x^(j), j = 1,...,p:

\[
\hat{\pi}_j = \frac{1}{B}\, \#\{b;\ j \in \hat{S}_b\}.
\]
For a frequency threshold 1/2 < π_thres < 1, we define our final model as

\[
\hat{S} = \{j;\ \hat{\pi}_j \geq \pi_{\mathrm{thres}}\}.
\]

This means that we use those predictors in our final model that are being selected in at least π_thres × 100% of the subsamples.
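A bare-bones version of this subsampling scheme, selecting at each iteration the q predictors that enter the lasso path first (a rough stand-in for lasso.firstq, written here with glmnet and not the hdi implementation), could look as follows:

library(glmnet)

B <- 500; q <- 10                       # number of subsamples and selected predictors
n <- nrow(x); p <- ncol(x)
freq <- numeric(p)

for (b in 1:B) {
  sub <- sample(n, floor(n / 2))        # subsample I_b, drawn without replacement
  fit <- glmnet(x[sub, ], y[sub])
  ## index of the lambda at which each predictor first becomes active
  active <- as.matrix(fit$beta) != 0
  entry <- apply(active, 1, function(z) if (any(z)) which(z)[1] else Inf)
  sel <- order(entry)[1:q]              # the q predictors entering the path first
  freq[sel] <- freq[sel] + 1
}

pi.hat <- freq / B                      # empirical selection frequencies
S.stable <- which(pi.hat >= 0.75)       # final model with pi_thres = 0.75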
Under suitable assumptions, we have control of the expected number of false positives:

\[
E[V] \leq \frac{1}{2\pi_{\mathrm{thres}} - 1} \cdot \frac{q^2}{p},
\tag{17.5}
\]

where V = |Ŝ ∩ S^c|; see [20] for details. If we use

\[
q = \Big\lfloor \sqrt{\alpha\, p\, (2\pi_{\mathrm{thres}} - 1)} \Big\rfloor,
\]

we can control the familywise error rate at level α ∈ (0, 1), that is, P(V > 0) ≤ α.
In practice, we typically specify a bound on E[V] (the expected number of false positives that we are willing to tolerate) and a threshold π_thres (e.g., π_thres = 0.75). Using Equation 17.5, we can then derive the corresponding value of q that ensures control of E[V].
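For example, solving Equation 17.5 for q gives q ≤ √(E[V] · p · (2π_thres − 1)). A small helper that restates this computation (the numbers p = 1000 and E[V] = 1 are purely illustrative):

## Largest q such that the bound (17.5) on E[V] is respected.
q.from.EV <- function(EV, p, pi.thres = 0.75) {
  floor(sqrt(EV * p * (2 * pi.thres - 1)))
}
q.from.EV(EV = 1, p = 1000)   # with p = 1000 and pi.thres = 0.75: q = 22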
Stability Selection in R
We use the function stability in the R-package hdi. If we want to control E[V] ≤ 1,
we use
> fit.stability <- stability(x, y, EV = 1)
A default value of π_thres = 0.75 is being used. The stable predictors can be found in
select and the selection frequency of every predictor in freq. By construction, the
algorithm is highly parallelizable (use the options parallel and ncores).
By default, the model selection criterion uses the first q predictors in the lasso
path of the linear model (implemented in the function lasso.firstq). Extensions to
other models (beyond the linear model) are straightforward by setting the argument
model.selector appropriately. This means that the function can be applied to any
method that provides an appropriate model selection function.
17.4.2 Sample Splitting and p-Value Aggregation
Other approaches are based on sample splitting [21,31]. The idea is to split the dataset into
two (disjoint) parts: the first part is used for model selection where the high-dimensional
problem is reduced to a reasonable size (e.g., using the lasso). The second part is used for
(classical) low-dimensional statistical inference (using the model selected on the first part). The p-values computed on the second part are honest, as the two parts of the dataset are disjoint. In more detail, the algorithm in [31] works as follows:
1. Partition the sample {1,...,n} = I_1 ∪ I_2 with I_1 ∩ I_2 = ∅, |I_1| = ⌊n/2⌋, and |I_2| = n − ⌊n/2⌋.
2. Using only I_1, select the variables Ŝ ⊆ {1,...,p}. Assume or enforce that |Ŝ| ≤ |I_1| = ⌊n/2⌋ ≤ |I_2|.
3. Using classical least-squares theory, compute p-values P_raw,j for H_0,j, for j ∈ Ŝ, using only I_2. For j ∉ Ŝ, assign P_raw,j = 1.
4. Adjust the p-values for multiple testing using a Bonferroni correction on the selected model Ŝ (with |Ŝ| ≪ p):

   \[
   P_{\mathrm{corr},j} = \min\big(P_{\mathrm{raw},j} \cdot |\hat{S}|,\ 1\big).
   \]
If the selected model Ŝ contains the true model S, the p-values are correct. For any model selection procedure with the screening property, we therefore (asymptotically) get p-values P_corr,j controlling the familywise error rate.
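For concreteness, a minimal sketch of a single split (using cv.glmnet for the selection step and classical t-test p-values from lm for the inference step; it ignores edge cases such as rank-deficient selected designs and is not the hdi implementation) could look like this; repeating it B times yields the p-values that are aggregated below.

library(glmnet)

single.split.pvals <- function(x, y) {
  n <- nrow(x); p <- ncol(x)
  i1 <- sample(n, floor(n / 2))            # I_1: selection half
  i2 <- setdiff(seq_len(n), i1)            # I_2: inference half

  ## Step 2: variable selection on I_1 via cross-validated lasso.
  cv <- cv.glmnet(x[i1, ], y[i1])
  sel <- which(as.numeric(coef(cv, s = "lambda.min"))[-1] != 0)
  sel <- head(sel, floor(n / 2))           # enforce |S| <= |I_1|

  ## Step 3: classical p-values on I_2 for the selected variables only.
  p.raw <- rep(1, p)
  if (length(sel) > 0) {
    fit <- lm(y[i2] ~ x[i2, sel, drop = FALSE])   # assumes full-rank selected design
    p.raw[sel] <- summary(fit)$coefficients[-1, 4]
  }

  ## Step 4: Bonferroni correction on the selected model.
  pmin(p.raw * max(length(sel), 1), 1)
}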
To get reproducible results (that do not depend on a single data split), we can run the sample-splitting algorithm B times, for example, B = 50 or B = 100, yielding B different p-values for every predictor:

\[
P^{[1]}_{\mathrm{corr},j}, \ldots, P^{[B]}_{\mathrm{corr},j}, \qquad j = 1,\ldots,p.
\]
Clearly, the different p-values corresponding to the same predictor are not independent. Nevertheless, we can aggregate them using an (arbitrary) prespecified γ-quantile, 0 < γ < 1, leading to

\[
Q_j(\gamma) = \min\Big( \text{emp. } \gamma\text{-quantile}\big\{ P^{[b]}_{\mathrm{corr},j}/\gamma;\ b = 1,\ldots,B \big\},\ 1 \Big),
\tag{17.6}
\]

the so-called quantile-aggregated p-values; see [21] for details. The price that we have to pay for using a (potentially small) quantile is the factor 1/γ. For example, if we choose the median, we have to multiply all p-values by a factor of 2. This is called the multisample splitting algorithm [21]. It is loosely related to stability selection: for example, for γ = 0.5, we require a predictor to be selected in at least 50% of the sample splits with a small enough p-value. Moreover, quantile aggregation as defined in Equation 17.6 is a general (conservative) p-value aggregation procedure that works under arbitrary dependency structures.
A priori, it is not clear how to select the parameter γ. We can even search for the best γ-quantile in a range (γ_min, 1), for example, γ_min = 0.05, leading to the aggregated p-values

\[
P_j = \min\Big( \big(1 - \log(\gamma_{\min})\big) \inf_{\gamma \in (\gamma_{\min}, 1)} Q_j(\gamma),\ 1 \Big),
\qquad j = 1,\ldots,p.
\]
The price for this additional search is the factor 1 − log(γ_min). Under suitable assumptions, the p-values P_j control the familywise error rate [21]. The smaller we choose γ_min, the more susceptible we are again to a specific realization of the B sample splits. Therefore, we should choose a large value of B in situations where γ_min is small.
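A direct transcription of Equation 17.6 together with the search over γ (my own sketch; the hdi package does this internally) is given below, where pmat is assumed to be a B × p matrix of the per-split corrected p-values:

## Quantile aggregation of per-split p-values (rows = splits, columns = predictors).
aggregate.pvals <- function(pmat, gamma.min = 0.05) {
  Qj <- function(gamma)                       # Q_j(gamma) from Equation 17.6
    pmin(apply(pmat, 2, quantile, probs = gamma) / gamma, 1)
  gammas <- seq(gamma.min, 1, by = 0.01)      # grid approximating the infimum
  q.grid <- sapply(gammas, Qj)                # p x length(gammas) matrix
  pmin((1 - log(gamma.min)) * apply(q.grid, 1, min), 1)
}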
Multisample Splitting in R
We use the function multi.split in the R-package hdi.
> fit.multi <- multi.split(x, y)
We can use any model for which there is a model selection function (defined in the argument model.selector) and a classical p-value function (argument classical.fit). The
default uses lasso (with cross-validation) and a linear model fit. Extensions to GLMs
and many more models are (from a technical point of view) straightforward.
The p-values are stored in pval.corr. Note that by construction the multisample
splitting algorithm (only) provides p-values for familywise error control.
Confidence intervals can also be obtained through the function confint; however,
the confidence level already has to be set in the call to multi.split (argument ci.level).
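For example, to work with 95% confidence intervals (the level 0.95 is just for illustration):

> fit.multi <- multi.split(x, y, ci.level = 0.95)
> confint(fit.multi)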
17.5 Hierarchical Approaches
In most applications we are faced with (strongly) correlated design matrices. Already in
the low-dimensional case, two strongly correlated predictors might have large (individual)
p-values, while the joint null hypothesis can be clearly rejected. Moreover, too strong
correlation in the design matrix might also violate assumptions of the previously discussed