Targeted Learning for Variable Importance 421
Inference

Statistical inference: Standard errors are calculated using the stacked influence curve; inference for the vector of target parameters is based on the multivariate normal distribution (or, alternatively, on resampling-based methods).

Interpretation: The (sequential) randomization assumption is often violated in variable importance settings; target parameters are therefore interpreted as statistical, not causal, parameters.
22.4 Programming
Practical tools for the implementation of targeted learning methods for variable importance
have developed alongside the theoretical and methodological advances. While some work has
been done to develop computational tools for targeted learning in proprietary programming
languages, such as SAS, the majority of the code has been built in R. TMLE and collaborative
TMLE R code specifically tailored to answer quantitative trait loci mapping questions, such
as those discussed throughout this chapter, is available in the supplementary material of
Wang et al. [64]. Each R package discussed in this section is available on The Comprehensive
R Archive Network (www.cran.r-project.org).
Of key importance are the two R packages SuperLearner and tmle [10,24]. The
SuperLearner package, authored by Eric Polley (NCI), is flexible, allowing for the
integration of dozens of prespecified potential algorithms as well as a system of wrappers
that provide the user with the ability to design their own algorithms, or include newer
algorithms not yet added to the package. The package returns multiple useful objects,
including the cross-validated predicted values, final predicted values, vector of weights, and
fitted objects for each of the included algorithms, among others. The tmle package, authored
by Susan Gruber (Reagan-Udall Foundation, Washington, DC), allows for the estimation
of both average treatment effects and parameters defined by a marginal structural model
in cross-sectional data with a binary intervention. This package also includes the ability to
incorporate missingness in the outcome and the intervention, use SuperLearner to estimate
the relevant components of the likelihood, and use data with a mediating variable.
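As a rough illustration of how these two packages fit together, the sketch below fits a super learner on simulated data and then passes the same algorithm library to tmle to estimate an average treatment effect. The simulated variables and the small algorithm library are hypothetical placeholders, not a recommended specification.

```r
library(SuperLearner)
library(tmle)

# Hypothetical simulated data: binary outcome Y, binary treatment A, covariates W
set.seed(1)
n <- 500
W <- data.frame(W1 = rnorm(n), W2 = rbinom(n, 1, 0.5))
A <- rbinom(n, 1, plogis(0.3 * W$W1))
Y <- rbinom(n, 1, plogis(0.5 * A + 0.4 * W$W1 - 0.2 * W$W2))

# Super learner with a minimal prespecified library of wrappers
sl.lib <- c("SL.glm", "SL.mean", "SL.step")
fit <- SuperLearner(Y = Y, X = cbind(W, A = A), family = binomial(),
                    SL.library = sl.lib)
fit$SL.predict   # final predicted values
fit$coef         # vector of weights for each algorithm

# TMLE for the average treatment effect, with super learning for the
# outcome regression (Q) and treatment mechanism (g)
result <- tmle(Y = Y, A = A, W = W, family = "binomial",
               Q.SL.library = sl.lib, g.SL.library = sl.lib)
result$estimates$ATE   # point estimate with influence-curve-based inference
```

In practice the library would include many more candidate algorithms; the weights in `fit$coef` show how the cross-validated ensemble combines them.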
The multiPIM package [26], authored by Stephan Ritter (Omicia, Inc., Oakland, CA),
is designed specifically for variable importance analysis and estimates an
attributable-risk-type parameter using TMLE. This package also allows the use of
SuperLearner to estimate nuisance parameters and produces additional estimates using
estimating-equation-based estimators and g-computation. The package includes its own
internal bootstrapping function to calculate standard errors when this is preferred over
the use of influence curves, or when influence curves are not valid for the chosen
estimator.
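A hypothetical call might look like the sketch below, which follows the package's documented interface of passing outcomes and exposures as data frames; the simulated exposures and outcome are placeholders for illustration only.

```r
library(multiPIM)

# Hypothetical simulated data: two binary exposures and one binary outcome
set.seed(1)
n <- 200
A <- data.frame(A1 = rbinom(n, 1, 0.5), A2 = rbinom(n, 1, 0.4))
Y <- data.frame(Y = rbinom(n, 1, plogis(0.4 * A$A1 - 0.3 * A$A2)))

# Attributable-risk-type parameter for each exposure, estimated with TMLE
fit <- multiPIM(Y, A, estimator = "TMLE")
summary(fit)   # parameter estimates with standard errors
```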
Four additional prediction-focused packages are casecontrolSL [17], cvAUC [15],
subsemble [16], and h2oEnsemble [14], all primarily authored by Erin LeDell (Berkeley).
The casecontrolSL package relies on SuperLearner and performs subsampling in a case-
control design with inverse-probability-of-censoring-weighting, which may be particularly
useful in settings with rare outcomes. The cvAUC package is a tool kit to evaluate area
under the ROC curve estimators when using cross-validation. The subsemble package was
developed based on a new approach [44] to ensembling that fits each algorithm on a subset
of the data and combines these fits using cross-validation. This technique can be used in
datasets of all sizes, but has been demonstrated to be particularly useful in smaller datasets.
A new implementation of super learner can be found in the Java-based h2oEnsemble
package, which was designed for big data. The package uses the H2O R interface to run
super learning in R with a selection of prespecified algorithms.
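As a sketch of the cvAUC workflow, assume ten-fold cross-validated predictions are already in hand; the predictions and fold assignments below are simulated placeholders.

```r
library(cvAUC)

# Hypothetical cross-validated predicted probabilities for a binary outcome
set.seed(1)
n <- 500
labels <- rbinom(n, 1, 0.5)
predictions <- runif(n)                      # stand-in for CV predictions
folds <- sample(rep(1:10, length.out = n))   # 10-fold assignment

# Cross-validated AUC with an influence-curve-based confidence interval
out <- ci.cvAUC(predictions, labels, folds = folds, confidence = 0.95)
out$cvAUC   # cross-validated AUC estimate
out$ci      # 95% confidence interval
```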
Another TMLE package is ltmle [46], primarily authored by Joshua Schwab (Berkeley).
This package mainly focuses on parameters in longitudinal data structures, including the
treatment-specific mean outcome and parameters defined by a marginal structural model.
The package returns estimates for TMLE, g-computation, and estimating-equation-based
estimators.
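A minimal single-time-point sketch of ltmle, with hypothetical simulated data, might look like the following; in a genuinely longitudinal analysis the node arguments would list the time-ordered treatment, censoring, covariate, and outcome columns.

```r
library(ltmle)

# Hypothetical simulated point-treatment data, columns in time order: W, A, Y
set.seed(1)
n <- 300
W <- rnorm(n)
A <- rbinom(n, 1, plogis(0.2 * W))
Y <- rbinom(n, 1, plogis(0.3 * A + 0.4 * W))
data <- data.frame(W, A, Y)

# Treatment-specific mean outcome E[Y_1] under the intervention A = 1
fit <- ltmle(data, Anodes = "A", Ynodes = "Y", abar = 1)
summary(fit)   # TMLE estimate with influence-curve-based inference
```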
22.5 Discussion: Variable Importance and Big Data
While the development of targeted learning for variable importance has demonstrated
promise, its potential has yet to be fully realized. The data we are collecting in biology,
social sciences, health care, medicine, business analytics, and ecology, among others,
continue to grow in both dimensions (n and p), and are frequently observational in nature
[31,43]. Statisticians are armed with a unique set of rigorous and practical tools to tackle
these challenges. To face this growth in data going forward, targeted learning provides
a framework for incorporating advances in machine learning and TMLE for problems of
variable importance [33,60].
It is always important to remember that sophisticated statistical methods will never
be able to overcome weak or problematic big data. Misclassification, missingness, and
unmeasured confounding are frequently found in these new streams of data. A thorough
understanding of the data and associated research questions, often only ascertained by
working in interdisciplinary teams, is required before leaping toward analysis. This will not
change as technologies continue to advance.
To call in the statistician after the experiment is done may be no more than
asking him to perform a postmortem examination: he may be able to say what
the experiment died of.
R.A. Fisher, 1938
We will also need to address the increasing computational challenges presented by these
data. For example, online targeted learning [57] is a new proposed method for data that
arrives sequentially, another common feature of big data applications. Advances will not only
be found in statistical theory and methodology development, however. Existing approaches
to merging data systems and statistics rely mainly on database systems to serve, for
example, R requests in the background. Movement toward integrated native big data systems
may be a key component in the adoption of rigorous targeted learning tools for variable
importance in massive datasets.
Targeted learning is one of many new statistical innovations that are poised for further
theoretical and methodological development in this new era of big data, inspired by these
real-world challenges. Advances in dimension-reduction for imaging analyses, for example,
will improve our ability to use features of these images as covariates in variable importance
analyses, and also move us toward the ability to estimate variable importance measures of a
list of images. The future of statistical and scientific discovery with big data is bright,
as we look forward to the creation of automated big data machines that incorporate
investigator knowledge, are statistically sound, and can handle the computational burden of
our data.
Acknowledgments
The author acknowledges funding from the University of Utah fund P0 163947.
References
1. O. Bembom, M.L. Petersen, S.-Y. Rhee, W.J. Fessel, S.E. Sinisi, R.W. Shafer, and
M.J. van der Laan. Biomarker discovery using targeted maximum likelihood estimation:
Application to the treatment of antiretroviral resistant HIV infection. Stat Med, 28:
152–172, 2009.
2. O. Bembom and M.J. van der Laan. A practical illustration of the importance of realistic
individualized treatment rules in causal inference. Electron J Stat, 1:574–596, 2007.
3. A. Chambaz, D. Choudat, C. Huber, J.C. Pairon, and M.J. van der Laan. Analysis of
the effect of occupational exposure to asbestos based on threshold regression modeling
of case–control data. Biostatistics, 15(2):327–340, 2014.
4. A. Chambaz, P. Neuvial, and M.J. van der Laan. Estimation of a non-parametric
variable importance measure of a continuous exposure. Electron J Stat, 6:1059–1099,
2012.
5. S. Datta and H.C. van Houwelingen. Statistics in biological and medical sciences. Stat
Prob Lett, 81(7):715–716, 2011.
6. I. Diaz, A.E. Hubbard, A. Decker, and M. Cohen. Variable importance and prediction
methods for longitudinal problems with missing variables. Technical Report 318,
Division of Biostatistics, University of California, Berkeley, CA, 2013.
7. S. Dudoit and M.J. van der Laan. Resampling Based Multiple Testing with Applications
to Genomics. Springer, Berlin, Germany, 2008.
8. T.R. Golub, D.K. Slonim, P. Tamayo et al. Molecular classification of cancer: Class
discovery and class prediction by gene expression monitoring. Science, 286:531–537,
1999.
9. S. Gruber and M.J. van der Laan. An application of collaborative targeted maximum
likelihood estimation in causal inference and genomics. Int J Biostat, 6(1):Article 18,
2010.
10. S. Gruber and M.J. van der Laan. tmle: An R package for targeted maximum likelihood
estimation. J Stat Softw, 51(13):1–35, 2012.
11. C.S. Haley and S.A. Knott. A simple regression method for mapping quantitative trait
loci in line crosses using flanking markers. Heredity, 69(4):315–324, 1992.
12. R. Kessler, S. Rose, K. Koenen et al. How well can post-traumatic stress disorder
be predicted from pre-trauma risk factors? An exploratory study in the WHO world
mental health surveys. World Psychiatry, 13(3):265–274, 2014.
13. L. Kunz, S. Rose, and S.-L. Normand. An overview of statistical approaches for
comparative effectiveness research. In C. Gatsonis and S.C. Morton, editors, Methods
in Comparative Effectiveness Research. Chapman & Hall, Boca Raton, FL, 2015.
14. E. LeDell. h2oEnsemble: H2O Ensemble. R package version 0.0.1, 2014.
15. E. LeDell, M. Petersen, and M.J. van der Laan. cvAUC: Cross-Validated Area Under
the ROC Curve Confidence Intervals. R package version 1.0-0, 2014.
16. E. LeDell, S. Sapp, and M.J. van der Laan. Subsemble: An Ensemble Method for
Combining Subset-Specific Algorithm Fits. R package version 0.0.9, 2014.
17. E. LeDell, M.J. van der Laan, and M. Petersen. casecontrolSL: Case-Control Subsampling
for SuperLearner. R package version 0.1-5, 2014.
18. K.L. Moore and M.J. van der Laan. Application of time-to-event methods in the
assessment of safety in clinical trials. In Karl E. Peace, editor, Design, Summarization,
Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman &
Hall, Boca Raton, FL, 2009.
19. K.L. Moore and M.J. van der Laan. Covariate adjustment in randomized trials with
binary outcomes: targeted maximum likelihood estimation. Stat Med, 28(1):39–64, 2009.
20. K.L. Moore and M.J. van der Laan. Increasing power in randomized trials with right
censored outcomes through covariate adjustment. J Biopharm Stat, 19(6):1099–1131,
2009.
21. R. Neugebauer and J. Bullard. DSA: Data-Adaptive Estimation with Cross-Validation
and the D/S/A Algorithm. R package version 3.1.3, 2009.
22. R. Neugebauer, J.A. Schmittdiel, and M.J. van der Laan. Targeted learning in real-world
comparative effectiveness research with time-varying interventions. Stat Med, 33(14):
2480–2520, 2014.
23. J. Pearl. On a class of bias-amplifying variables that endanger effect estimates. In
Proceedings of the Uncertainty in Artificial Intelligence, Catalina Island, CA, 2010.
24. E. Polley and M.J. van der Laan. SuperLearner: Super Learner Prediction. R package
version 2.0-10, 2013.
25. E.C. Polley and M.J. van der Laan. Predicting optimal treatment assignment based
on prognostic factors in cancer patients. In K.E. Peace, editor, Design, Summarization,
Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, Chapman &
Hall, Boca Raton, FL, 2009.
26. S.J. Ritter, N.P. Jewell, and A.E. Hubbard. R package multiPIM: A causal inference
approach to variable importance analysis. J Stat Softw, 57(8), 2014.
27. J.M. Robins. Robust estimation in sequentially ignorable missing data and causal
inference models. In Proceedings of the American Statistical Association, Indianapolis,
IN, 2000.
28. J.M. Robins and A. Rotnitzky. Recovery of information and adjustment for dependent
censoring using surrogate markers. In N.P. Jewell, K. Dietz and V.T. Farewell, editors,
AIDS Epidemiology, pp. 297–331. Birkhäuser, Basel, Switzerland, 1992.
29. J.M. Robins and A. Rotnitzky. Comment on the Bickel and Kwon article, “Inference
for semiparametric models: Some questions and an answer.” Stat Sinica, 11(4):920–936,
2001.
30. J.M. Robins, A. Rotnitzky, and M.J. van der Laan. Comment on “On profile likelihood.”
J Am Stat Assoc, 450:431–435, 2000.
31. S. Rose. Big data and the future. Significance, 9(4):47–48, 2012.
32. S. Rose. Mortality risk score prediction in an elderly population using machine learning.
Am J Epidemiol, 177(5):443–452, 2013.
33. S. Rose. Statisticians’ place in big data. Amstat News, 428:28, 2013.
34. S. Rose and M.J. van der Laan. Simple optimal weighting of cases and controls in
case-control studies. Int J Biostat, 4(1):Article 19, 2008.
35. S. Rose and M.J. van der Laan. Why match? Investigating matched case-control study
designs with causal effect estimation. Int J Biostat, 5(1):Article 1, 2009.
36. S. Rose and M.J. van der Laan. A targeted maximum likelihood estimator for two-stage
designs. Int J Biostat, 7(1):Article 17, 2011.
37. S. Rose and M.J. van der Laan. A double robust approach to causal effects in case-
control studies. Am J Epidemiol, 179(6):663–669, 2014.
38. S. Rose and M.J. van der Laan. Rose and van der Laan respond to “Some advantages
of RERI.” Am J Epidemiol, 179(6):672–673, 2014.
39. M. Rosenblum, S.G. Deeks, M.J. van der Laan, and D.R. Bangsberg. The risk of
virologic failure decreases with duration of HIV suppression, at greater than 50% adherence
to antiretroviral therapy. PLoS ONE, 4(9): e7196. doi:10.1371/journal.pone.0007196,
2009.
40. M. Rosenblum and M.J. van der Laan. Using regression models to analyze randomized
trials: Asymptotically valid hypothesis tests despite incorrectly specified models.
Biometrics, 65(3):937–945, 2009.
41. M. Rosenblum and M.J. van der Laan. Targeted maximum likelihood estimation of the
parameter of a marginal structural model. Int J Biostat, 6(2):Article 19, 2010.
42. D.B. Rubin and M.J. van der Laan. Empirical efficiency maximization: Improved locally
efficient covariate adjustment in randomized experiments and survival analysis. Int
J Biostat, 4(1):Article 5, 2008.
43. C. Rudin, D. Dunson, R. Irizarry et al. Discovery with data: Leveraging statistics
and computer science to transform science and society. Technical report, American
Statistical Association, Alexandria, VA, 2014.