Targeted Learning for Variable Importance 421
Inference
Statistical Inference: Standard errors are calculated using the stacked influence curve;
inference for the vector of target parameters is based on the multivariate normal
distribution (or, alternatively, resampling-based methods).
Interpretation: The (sequential) randomization assumption is often violated in variable
importance settings; thus target parameters are interpreted as statistical, not causal,
parameters, since causal assumptions do not hold.
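The influence-curve-based inference summarized above can be sketched in base R. This is a minimal illustration under stated assumptions, not the chapter's implementation: the data are simulated, the helper `ic_estimate` is hypothetical, and it uses a simple augmented inverse-probability-weighted (estimating-equation) estimator with parametric working models in place of a full TMLE with super learning. The point is only to show how the influence curves of several target parameters are stacked into one matrix whose empirical covariance gives standard errors and a joint, multivariate-normal-based test.

```r
# Simulated data (assumption): one covariate W, two binary exposures, continuous outcome.
set.seed(1)
n  <- 2000
W  <- rnorm(n)
A1 <- rbinom(n, 1, plogis(0.4 * W))
A2 <- rbinom(n, 1, plogis(-0.3 * W))
Y  <- 1 + 0.5 * A1 + W + rnorm(n)   # A2 has no effect on Y in this simulation

# Hypothetical helper: AIPW estimate of E[E(Y|A=1,.) - E(Y|A=0,.)] for one exposure,
# adjusting for the columns of `adjust`; returns the estimate and its influence curve.
ic_estimate <- function(Y, A, adjust) {
  dat  <- data.frame(Y = Y, A = A, adjust)
  Qfit <- lm(Y ~ ., data = dat)                                   # outcome regression
  gfit <- glm(A ~ ., family = binomial(),
              data = dat[, names(dat) != "Y"])                    # exposure mechanism
  Q1 <- predict(Qfit, newdata = transform(dat, A = 1))
  Q0 <- predict(Qfit, newdata = transform(dat, A = 0))
  QA <- fitted(Qfit)
  g1 <- fitted(gfit)
  D  <- (A / g1 - (1 - A) / (1 - g1)) * (Y - QA) + Q1 - Q0
  psi <- mean(D)
  list(psi = psi, ic = D - psi)                                   # centered influence curve
}

# One target parameter per exposure, each adjusting for the other variables.
e1 <- ic_estimate(Y, A1, data.frame(W = W, A2 = A2))
e2 <- ic_estimate(Y, A2, data.frame(W = W, A1 = A1))

psi   <- c(A1 = e1$psi, A2 = e2$psi)
IC    <- cbind(A1 = e1$ic, A2 = e2$ic)    # stacked influence curve, n x 2
Sigma <- cov(IC) / n                      # estimated covariance of the estimator vector
se    <- sqrt(diag(Sigma))
ci    <- cbind(lower = psi - qnorm(0.975) * se,
               upper = psi + qnorm(0.975) * se)

# Joint Wald test of psi = 0 based on the multivariate normal approximation.
wald    <- drop(t(psi) %*% solve(Sigma) %*% psi)
p_joint <- pchisq(wald, df = length(psi), lower.tail = FALSE)
```

The off-diagonal entries of `Sigma` capture the dependence between the two estimators, which is what allows simultaneous inference for the whole vector rather than one parameter at a time.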
22.4 Programming
Practical tools for the implementation of targeted learning methods for variable importance
have developed alongside the theoretical and methodological advances. While some work has
been done to develop computational tools for targeted learning in proprietary programming
languages, such as SAS, the majority of the code has been built in R. TMLE and collaborative
TMLE R code specifically tailored to answer quantitative trait loci mapping questions, such
as those discussed throughout this chapter, is available in the supplementary material of
Wang et al. [64]. Each R package discussed in this section is available on the Comprehensive
R Archive Network (www.cran.r-project.org).
Of key importance are the two R packages SuperLearner and tmle [10,24]. The
SuperLearner package, authored by Eric Polley (NCI), is flexible, allowing for the
integration of dozens of prespecified potential algorithms as well as a system of wrappers
that provide the user with the ability to design their own algorithms, or include newer
algorithms not yet added to the package. The package returns multiple useful objects,
including the cross-validated predicted values, final predicted values, vector of weights, and
fitted objects for each of the included algorithms, among others. The tmle package, authored
by Susan Gruber (Reagan-Udall Foundation, Washington, DC), allows for the estimation
of both average treatment effects and parameters defined by a marginal structural model
in cross-sectional data with a binary intervention. This package also includes the ability to
incorporate missingness in the outcome and the intervention, use SuperLearner to estimate
the relevant components of the likelihood, and use data with a mediating variable.
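A short sketch of how the two packages might be used together follows. It assumes that SuperLearner and tmle are installed; the simulated data and the wrapper name `SL.glm.W1` are illustrative assumptions, not part of either package. The custom wrapper simply delegates to the packaged `SL.glm` on a single covariate, to show the wrapper mechanism in its simplest form.

```r
library(SuperLearner)
library(tmle)

# Simulated cross-sectional data (assumption): baseline covariates W,
# binary intervention A, continuous outcome Y with a true effect of 1.
set.seed(2)
n <- 500
W <- data.frame(W1 = rnorm(n), W2 = rbinom(n, 1, 0.5))
A <- rbinom(n, 1, plogis(0.3 * W$W1))
Y <- 1 + A + W$W1 + 0.5 * W$W2 + rnorm(n)

# Illustrative user-defined wrapper: a main-terms GLM restricted to W1.
SL.glm.W1 <- function(Y, X, newX, family, obsWeights, ...) {
  SL.glm(Y = Y,
         X = X[, "W1", drop = FALSE],
         newX = newX[, "W1", drop = FALSE],
         family = family, obsWeights = obsWeights, ...)
}

# Super learner for the outcome regression E(Y | A, W).
sl <- SuperLearner(Y = Y, X = data.frame(A = A, W), family = gaussian(),
                   SL.library = c("SL.mean", "SL.glm", "SL.glm.W1"),
                   cvControl = list(V = 5))
sl$coef              # vector of weights, one per algorithm
head(sl$Z)           # cross-validated predicted values, one column per algorithm
head(sl$SL.predict)  # final super learner predicted values
names(sl$fitLibrary) # fitted objects for each included algorithm

# TMLE of the average treatment effect, with super learning for both
# the outcome regression (Q) and the treatment mechanism (g).
fit <- tmle(Y = Y, A = A, W = W,
            Q.SL.library = c("SL.mean", "SL.glm"),
            g.SL.library = c("SL.mean", "SL.glm"),
            family = "gaussian")
ate <- fit$estimates$ATE
ate$psi      # point estimate
ate$var.psi  # influence-curve-based variance estimate
ate$CI       # 95% confidence interval
```

The libraries here are kept deliberately small so the example runs quickly; in practice a richer set of candidate algorithms would normally be supplied.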
The multiPIM package [26], authored by Stephan Ritter (Omicia, Inc., Oakland, CA),
is designed specifically for variable importance analysis, and estimates an attributable-
risk-type parameter using TMLE. This package also allows the use of SuperLearner to
estimate nuisance parameters and produces additional estimates using estimating-equation-
based estimators and g-computation. The package includes its own internal bootstrapping
function to calculate standard errors if this is preferred over the use of influence curves, or
influence curves are not valid for the chosen estimator.
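The bootstrap alternative to influence-curve standard errors can be illustrated generically in base R. The sketch below does not call multiPIM; the function `gcomp_ar`, the simulated data, and the specific contrast (mean predicted outcome with the exposure set to 0, minus the observed mean outcome) are assumptions chosen to illustrate one attributable-risk-type parameter estimated by g-computation with a parametric working model.

```r
# Simulated data (assumption): covariate W, binary exposure A, binary outcome Y.
set.seed(3)
n <- 1000
W <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * W))
Y <- rbinom(n, 1, plogis(-1 + 0.8 * A + 0.5 * W))
dat <- data.frame(W = W, A = A, Y = Y)

# Hypothetical g-computation estimator of E[E(Y | A = 0, W)] - E(Y).
gcomp_ar <- function(d) {
  Qfit <- glm(Y ~ A + W, family = binomial(), data = d)
  Q0   <- predict(Qfit, newdata = transform(d, A = 0), type = "response")
  mean(Q0) - mean(d$Y)
}

psi_hat <- gcomp_ar(dat)

# Nonparametric bootstrap: resample rows with replacement and re-estimate,
# refitting the working model inside every replicate.
B <- 500
boot_psi <- replicate(B, gcomp_ar(dat[sample.int(n, replace = TRUE), ]))

se_boot <- sd(boot_psi)                          # bootstrap standard error
ci_boot <- quantile(boot_psi, c(0.025, 0.975))   # percentile interval
```

Refitting the working model within each resample matters: reusing the original fit would ignore the variability contributed by the nuisance-parameter estimates.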
Four additional prediction-focused packages are casecontrolSL [17], cvAUC [15],
subsemble [16], and h2oEnsemble [14], all primarily authored by Erin LeDell (Berkeley).
The casecontrolSL package relies on SuperLearner and performs subsampling in a
case-control design with inverse-probability-of-censoring weighting, which may be particularly
useful in settings with rare outcomes. The cvAUC package is a toolkit for evaluating estimators
of the area under the ROC curve when using cross-validation. The subsemble package was
developed based on a new approach [44] to ensembling that fits each algorithm on a subset
of the data and combines these fits using cross-validation. This technique can be used in
datasets of all sizes, but has been demonstrated to be particularly useful in smaller datasets.