A linear but biased estimator of β_j is therefore given by

\[
\hat{b}_j = \frac{(Z^{(j)})^T Y}{(Z^{(j)})^T X^{(j)}}
\quad \text{with} \quad
E[\hat{b}_j] = \beta_j + \sum_{k \neq j} P_{jk}\,\beta_k,
\]
where
\[
P_{jk} = \frac{(Z^{(j)})^T X^{(k)}}{(Z^{(j)})^T X^{(j)}}.
\]
In the low-dimensional case, we can find a Z^(j) that is orthogonal to all X^(k), k ≠ j, by defining Z^(j) as the residuals when regressing X^(j) versus X^(k), k ≠ j (leading to an unbiased estimator). In fact, this is how the classical OLS estimator can be obtained. This is not possible anymore in the high-dimensional situation. However, we can use a lasso projection by defining Z^(j) as the residuals of a lasso regression of X^(j) versus all other predictor variables X^(k), k ≠ j, leading to the bias-corrected estimator
\[
\hat{b}_j = \frac{(Z^{(j)})^T Y}{(Z^{(j)})^T X^{(j)}} - \sum_{k \neq j} P_{jk}\,\hat{\beta}_{\mathrm{Lasso},k},
\]
where β̂_Lasso is the lasso estimator when regressing Y versus all predictors in X. In contrast to β̂_Lasso, the estimator b̂ is not sparse anymore, that is, the lasso estimator got de-sparsified.
Asymptotic theory is available for the de-sparsified estimator. Under suitable regularity conditions, it holds that for large n (and p),

\[
\frac{\hat{b}_j - \beta_j}{\sigma \sqrt{\Omega_{jj}}} \approx \mathcal{N}(0, 1),
\qquad \text{where} \qquad
\Omega_{jj} = \frac{(Z^{(j)})^T Z^{(j)}}{\big((X^{(j)})^T Z^{(j)}\big)^2},
\]
and similarly for the multivariate counterpart. See [29] for details, including the uniformity of the convergence.
A major drawback of the lasso projection approach is the computation of the p
different lasso regressions for the regularized projection step. However, the problem is highly
parallelizable by construction.
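To make the construction concrete, here is a minimal sketch of the projection step for a single coordinate j, written with the glmnet package rather than hdi, for a design matrix x and response y as used in the examples below; the noise-level estimate sigma.hat is a placeholder that would have to be supplied separately (e.g., from a scaled lasso).

library(glmnet)

## De-sparsified lasso estimate for a single coordinate j (sketch).
j <- 1
n <- nrow(x); p <- ncol(x)

## Z^(j): residuals of a lasso regression of X^(j) on all other predictors.
cv.node <- cv.glmnet(x[, -j], x[, j])
z.j <- x[, j] - as.numeric(predict(cv.node, newx = x[, -j], s = "lambda.min"))

## Lasso of Y on all predictors, used for the bias correction.
cv.y <- cv.glmnet(x, y)
beta.lasso <- as.numeric(coef(cv.y, s = "lambda.min"))[-1]   # drop intercept

## Bias-corrected (de-sparsified) estimate of beta_j.
denom <- sum(z.j * x[, j])
P.jk <- as.numeric(crossprod(z.j, x)) / denom                # P_j1, ..., P_jp
b.j <- sum(z.j * y) / denom - sum(P.jk[-j] * beta.lasso[-j])

## Approximate standard error and two-sided p-value (sigma.hat assumed given).
omega.jj <- sum(z.j^2) / denom^2
se.j <- sigma.hat * sqrt(omega.jj)
p.value <- 2 * pnorm(-abs(b.j / se.j))

Repeating the nodewise regression for all j = 1,...,p gives the p lasso problems mentioned above, which can be run in parallel.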
Lasso Projection in R
We use the function lasso.proj in the R-package hdi. It shares a very similar
interface and output structure with the function ridge.proj. We get a fitted object
using
> fit.lasso <- lasso.proj(x, y)
Again, p-values for the (standard) null hypotheses β_j = 0 are stored in pval and their multiple-testing counterparts in pval.corr. It is also possible to do simultaneous tests, for example, for G = {1, 2, 3},
> fit.lasso$groupTest(group = 1:3)
and to get the confidence intervals using the function confint.
The method can be parallelized by setting the option parallel = TRUE (and using
a suitable value for ncores). In addition, user-defined score vectors can be supplied
through the argument Z.
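For illustration, a parallel fit might be requested as follows (ncores = 4 is an arbitrary choice and should be adapted to the available machine):

> fit.lasso <- lasso.proj(x, y, parallel = TRUE, ncores = 4)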
17.3.3 Extension to Generalized Linear Models
Conceptually, the presented methods can be extended to generalized linear models by using a weighted squared-error approach [5]. The idea is to first apply a (standard) ℓ_1-penalized maximum likelihood (lasso) estimator and use its solution to construct the corresponding weighted least-squares problem (in the spirit of the iteratively re-weighted least-squares algorithm, IRLS). The presented methods can then be applied to the re-weighted problem.
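To illustrate the reweighting step for a binary response, the following sketch (a minimal illustration under the usual logistic-regression setup, not the implementation of [5]) constructs the weighted least-squares problem from an initial ℓ_1-penalized fit obtained with glmnet; the linear-model methods above could then be applied to the rescaled design xw and working response zw.

library(glmnet)

## Initial l1-penalized logistic fit (y is binary, coded 0/1).
cv.bin <- cv.glmnet(x, y, family = "binomial")
eta <- as.numeric(predict(cv.bin, newx = x, s = "lambda.min"))   # linear predictor
mu  <- 1 / (1 + exp(-eta))                                       # fitted probabilities

## IRLS-type weights and working response at the lasso solution.
w <- mu * (1 - mu)
z <- eta + (y - mu) / w

## Weighted least-squares problem: rescale design and response row-wise.
xw <- sqrt(w) * x
zw <- sqrt(w) * z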
Extensions to Generalized Linear Models in R
Both methods have a family argument. If we have a binary outcome Y, we can use
> fit.ridge <- ridge.proj(x, y, family = "binomial")
to fit a logistic regression model with a binary response y using the ridge projection
method.
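Since both functions accept the family argument, the analogous call for the lasso projection method should be:

> fit.lasso <- lasso.proj(x, y, family = "binomial")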
17.3.4 Covariance Test
Another recent method that is also available in R is the so-called covariance test [15] and its
extension, the spacing test [26] (requiring fewer assumptions and with exact finite sample
results). The idea is to do sequential inference along the lasso solution path (similar as
in forward stepwise regression) by performing a (conditional) test whenever a predictor
enters the path. A heuristic to extend the methods to generalized linear models (GLMs) is
available, again through the IRLS approach.
Compared to the previous methods, the covariance test is a conditional test. Although theory, and more importantly software, yields p-values on the level of individual variables, the interpretation is now different: a p-value corresponds to a lasso step rather than to the individual variable entering at that step; for a new dataset, we might see a different variable at the same step. See also the discussion of [15] and the corresponding rejoinder. An analogous example of this different philosophy can be found in the rejoinder of [15]: when trying to determine the test error of the k-step lasso solution by cross-validation, we would run the lasso on different subsets of the dataset and average the observed test errors. A different model (with respect to the selected variables, but not its size) is potentially being selected at every cross-validation iteration.
Covariance Test in R
We use the R-package covTest [25]. For the linear case, it needs a lars object [10]
as input.
> fit.lars <- lars(x, y)
> fit.covTest <- covTest(fit.lars, x, y)
The p-values are stored in results. The call for the GLM situation is very similar:
> fit.lars.glm <- lars.glm(x, y, family = "binomial")
> fit.covTest.glm <- covTest(fit.lars.glm, x = x, y = y)
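To inspect the sequential p-values directly, one can look at the results component (the exact column layout depends on the covTest version, so treat this as illustrative):

> fit.covTest$results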
17.4 Subsampling, Sample Splitting, and p-Value Aggregation
17.4.1 From Selection Frequencies to Stability Selection
Because sparse estimators have a point mass at zero, it is not straightforward to apply
bootstrap or subsampling techniques to do inference or to construct confidence intervals.
Nevertheless, bootstrap techniques have been widely used in model selection problems to
assess the stability of a selected model. The idea is to focus on those predictors that are still
being selected when the dataset is being reshuffled. The selection frequency of a predictor
(in a total of B samples) can therefore be used as a heuristic measure of its stability.
A theoretical foundation of such an approach can be found in [20]. Assume that we have
a model selection procedure that selects (on average) q predictors (for example, by using
the q predictors that enter the lasso path first when varying the penalty parameter λ). At
every iteration b = 1,...,B (for example, B = 500), we perform the following two steps:
1. Draw a subsample I_b ⊂ {1,...,n} (without replacement) of size ⌊n/2⌋.
2. Apply the model selection procedure to the subsampled data, leading to Ŝ_b = Ŝ(I_b).
This allows us to calculate an empirical selection frequency π̂_j for every predictor x^(j), j = 1,...,p:

\[
\hat{\pi}_j = \frac{1}{B}\, \#\{b;\ j \in \hat{S}_b\}.
\]
For a frequency threshold 1/2 < π_thres < 1, we define our final model as

\[
\hat{S} = \{j;\ \hat{\pi}_j \geq \pi_{\mathrm{thres}}\}.
\]

This means that we use those predictors in our final model that are being selected in at least π_thres × 100% of the subsamples.
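A bare-bones version of this subsampling scheme, selecting at each iteration the q predictors that enter the lasso path first (a rough stand-in for lasso.firstq, written here with glmnet and not the hdi implementation), could look as follows:

library(glmnet)

B <- 500; q <- 10                       # number of subsamples and selected predictors
n <- nrow(x); p <- ncol(x)
freq <- numeric(p)

for (b in 1:B) {
  sub <- sample(n, floor(n / 2))        # subsample I_b, drawn without replacement
  fit <- glmnet(x[sub, ], y[sub])
  ## index of the lambda at which each predictor first becomes active
  active <- as.matrix(fit$beta) != 0
  entry <- apply(active, 1, function(z) if (any(z)) which(z)[1] else Inf)
  sel <- order(entry)[1:q]              # the q predictors entering the path first
  freq[sel] <- freq[sel] + 1
}

pi.hat <- freq / B                      # empirical selection frequencies
S.stable <- which(pi.hat >= 0.75)       # final model with pi_thres = 0.75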
Under suitable assumptions, we have control of the expected number of false positives:

\[
E[V] \leq \frac{1}{2\pi_{\mathrm{thres}} - 1} \cdot \frac{q^2}{p},
\tag{17.5}
\]

where V = |Ŝ ∩ S^c|; see [20] for details. If we use

\[
q = \Big\lfloor \sqrt{\alpha\, p\, (2\pi_{\mathrm{thres}} - 1)} \Big\rfloor,
\]

we can control the familywise error rate at level α ∈ (0, 1), that is, P(V > 0) ≤ α.
In practice, we typically specify a bound on E[V] (the expected number of false positives that we are willing to tolerate) and a threshold π_thres (e.g., π_thres = 0.75). Using Equation 17.5, we can then derive the corresponding value of q that ensures control of E[V].
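For example, solving Equation 17.5 for q gives q ≤ √(E[V] · p · (2π_thres − 1)). A small helper that restates this computation (the numbers p = 1000 and E[V] = 1 are purely illustrative):

## Largest q such that the bound (17.5) on E[V] is respected.
q.from.EV <- function(EV, p, pi.thres = 0.75) {
  floor(sqrt(EV * p * (2 * pi.thres - 1)))
}
q.from.EV(EV = 1, p = 1000)   # with p = 1000 and pi.thres = 0.75: q = 22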
Stability Selection in R
We use the function stability in the R-package hdi. If we want to control E[V] ≤ 1,
we use
> fit.stability <- stability(x, y, EV = 1)
A default value of π_thres = 0.75 is being used. The stable predictors can be found in
select and the selection frequency of every predictor in freq. By construction, the
algorithm is highly parallelizable (use the options parallel and ncores).
By default, the model selection criterion uses the first q predictors in the lasso
path of the linear model (implemented in the function lasso.firstq). Extensions to
other models (beyond the linear model) are straightforward by setting the argument
model.selector appropriately. This means that the function can be applied to any
method that provides an appropriate model selection function.
17.4.2 Sample Splitting and p-Value Aggregation
Other approaches are based on sample splitting [21,31]. The idea is to split the dataset into
two (disjoint) parts: the first part is used for model selection where the high-dimensional
problem is reduced to a reasonable size (e.g., using the lasso). The second part is used for
(classical) low-dimensional statistical inference (using the model selected on the first part). The p-values computed on the second part are honest, as the two parts of the dataset are disjoint. In more detail, the algorithm in [31] works as follows:
1. Partition the sample {1,...,n} = I_1 ∪ I_2 with I_1 ∩ I_2 = ∅, |I_1| = ⌊n/2⌋, and |I_2| = n − ⌊n/2⌋.
2. Using only I_1, select the variables Ŝ ⊆ {1,...,p}. Assume or enforce that |Ŝ| ≤ |I_1| = ⌊n/2⌋ ≤ |I_2|.
3. Using classical least-squares theory, compute p-values P_raw,j for H_0,j, for j ∈ Ŝ, using only I_2. For j ∉ Ŝ, assign P_raw,j = 1.
4. Adjust the p-values for multiple testing using a Bonferroni correction on the selected model Ŝ (with |Ŝ| ≪ p):

   \[
   P_{\mathrm{corr},j} = \min\big(P_{\mathrm{raw},j} \cdot |\hat{S}|,\ 1\big).
   \]
If the selected model Ŝ contains the true model S, the p-values are correct. For any model selection procedure with the screening property, we therefore (asymptotically) get p-values P_corr,j controlling the familywise error rate.
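For concreteness, a minimal sketch of a single split (using cv.glmnet for the selection step and classical t-test p-values from lm for the inference step; it ignores edge cases such as rank-deficient selected designs and is not the hdi implementation) could look like this; repeating it B times yields the p-values that are aggregated below.

library(glmnet)

single.split.pvals <- function(x, y) {
  n <- nrow(x); p <- ncol(x)
  i1 <- sample(n, floor(n / 2))            # I_1: selection half
  i2 <- setdiff(seq_len(n), i1)            # I_2: inference half

  ## Step 2: variable selection on I_1 via cross-validated lasso.
  cv <- cv.glmnet(x[i1, ], y[i1])
  sel <- which(as.numeric(coef(cv, s = "lambda.min"))[-1] != 0)
  sel <- head(sel, floor(n / 2))           # enforce |S| <= |I_1|

  ## Step 3: classical p-values on I_2 for the selected variables only.
  p.raw <- rep(1, p)
  if (length(sel) > 0) {
    fit <- lm(y[i2] ~ x[i2, sel, drop = FALSE])   # assumes full-rank selected design
    p.raw[sel] <- summary(fit)$coefficients[-1, 4]
  }

  ## Step 4: Bonferroni correction on the selected model.
  pmin(p.raw * max(length(sel), 1), 1)
}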
To get reproducible results (that do not depend on a single data split), we can run the sample-splitting algorithm B times, for example, B = 50 or B = 100, yielding B different p-values for every predictor:

\[
P^{[1]}_{\mathrm{corr},j}, \ldots, P^{[B]}_{\mathrm{corr},j}, \qquad j = 1,\ldots,p.
\]
Clearly, the different p-values corresponding to the same predictor are not independent. Nevertheless, we can aggregate them using an (arbitrary) prespecified γ-quantile, 0 < γ < 1, leading to

\[
Q_j(\gamma) = \min\Big( \text{emp. } \gamma\text{-quantile}\big\{ P^{[b]}_{\mathrm{corr},j}/\gamma;\ b = 1,\ldots,B \big\},\ 1 \Big),
\tag{17.6}
\]

the so-called quantile-aggregated p-values; see [21] for details. The price that we have to pay for using a (potentially small) quantile is the factor 1/γ. For example, if we choose the median, we have to multiply all p-values by a factor of 2. This is called the multisample splitting algorithm [21]. It is loosely related to stability selection: for example, for γ = 0.5, we require a predictor to be selected in at least 50% of the sample splits with a small enough p-value. Moreover, quantile aggregation as defined in Equation 17.6 is a general (conservative) p-value aggregation procedure that works under arbitrary dependency structures.
A priori, it is not clear how to select the parameter γ. We can even search for the best γ-quantile in a range (γ_min, 1), for example, γ_min = 0.05, leading to the aggregated p-values

\[
P_j = \min\Big( \big(1 - \log(\gamma_{\min})\big) \inf_{\gamma \in (\gamma_{\min}, 1)} Q_j(\gamma),\ 1 \Big),
\qquad j = 1,\ldots,p.
\]
The price for this additional search is the factor 1 − log(γ_min). Under suitable assumptions, the p-values P_j control the familywise error rate [21]. The smaller we choose γ_min, the more susceptible we are again to a specific realization of the B sample splits. Therefore, we should choose a large value of B in situations where γ_min is small.
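A direct transcription of Equation 17.6 together with the search over γ (my own sketch; the hdi package does this internally) is given below, where pmat is assumed to be a B × p matrix of the per-split corrected p-values:

## Quantile aggregation of per-split p-values (rows = splits, columns = predictors).
aggregate.pvals <- function(pmat, gamma.min = 0.05) {
  Qj <- function(gamma)                       # Q_j(gamma) from Equation 17.6
    pmin(apply(pmat, 2, quantile, probs = gamma) / gamma, 1)
  gammas <- seq(gamma.min, 1, by = 0.01)      # grid approximating the infimum
  q.grid <- sapply(gammas, Qj)                # p x length(gammas) matrix
  pmin((1 - log(gamma.min)) * apply(q.grid, 1, min), 1)
}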
Multisample Splitting in R
We use the function multi.split in the R-package hdi.
> fit.multi <- multi.split(x, y)
We can use any model for which there is a model selection function (defined in the argument model.selector) and a classical p-value function (argument classical.fit). The
default uses lasso (with cross-validation) and a linear model fit. Extensions to GLMs
and many more models are (from a technical point of view) straightforward.
The p-values are stored in pval.corr. Note that by construction the multisample
splitting algorithm (only) provides p-values for familywise error control.
Confidence intervals can also be obtained through the function confint; however,
the confidence level already has to be set in the call to multi.split (argument ci.level).
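For example, to work with 95% confidence intervals (the level 0.95 is just for illustration):

> fit.multi <- multi.split(x, y, ci.level = 0.95)
> confint(fit.multi)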
17.5 Hierarchical Approaches
In most applications we are faced with (strongly) correlated design matrices. Already in
the low-dimensional case, two strongly correlated predictors might have large (individual)
p-values, while the joint null hypothesis can be clearly rejected. Moreover, too strong
correlation in the design matrix might also violate assumptions of the previously discussed