Targeted Learning for Variable Importance 421
Inference

Statistical inference: Standard errors are calculated using the stacked influence curve; inference for the vector of target parameters is based on the multivariate normal distribution (or, alternatively, on resampling-based methods).

Interpretation: The (sequential) randomization assumption is often violated in variable importance settings; target parameters are therefore interpreted as statistical, not causal, parameters.
22.4 Programming
Practical tools for the implementation of targeted learning methods for variable importance
have developed alongside the theoretical and methodological advances. While some work has
been done to develop computational tools for targeted learning in proprietary programming
languages, such as SAS, the majority of the code has been built in R. TMLE and collaborative
TMLE R code specifically tailored to answer quantitative trait loci mapping questions, such
as those discussed throughout this chapter, is available in the supplementary material of
Wang et al. [64]. Each R package discussed in this section is available on The Comprehensive
R Archive Network (www.cran.r-project.org).
Of key importance are the two R packages SuperLearner and tmle [10,24]. The
SuperLearner package, authored by Eric Polley (NCI), is flexible, allowing for the
integration of dozens of prespecified potential algorithms as well as a system of wrappers
that provide the user with the ability to design their own algorithms, or include newer
algorithms not yet added to the package. The package returns multiple useful objects,
including the cross-validated predicted values, final predicted values, vector of weights, and
fitted objects for each of the included algorithms, among others. The tmle package, authored
by Susan Gruber (Reagan-Udall Foundation, Washington, DC), allows for the estimation
of both average treatment effects and parameters defined by a marginal structural model
in cross-sectional data with a binary intervention. This package also includes the ability to
incorporate missingness in the outcome and the intervention, use SuperLearner to estimate
the relevant components of the likelihood, and use data with a mediating variable.
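As a rough illustration of how these two packages fit together, the sketch below fits a super learner on simulated data and then passes the same algorithm library to tmle to estimate an average treatment effect. The simulated variables and the small algorithm library are hypothetical placeholders, not a recommended specification.

```r
library(SuperLearner)
library(tmle)

# Hypothetical simulated data: binary outcome Y, binary treatment A, covariates W
set.seed(1)
n <- 500
W <- data.frame(W1 = rnorm(n), W2 = rbinom(n, 1, 0.5))
A <- rbinom(n, 1, plogis(0.3 * W$W1))
Y <- rbinom(n, 1, plogis(0.5 * A + 0.4 * W$W1 - 0.2 * W$W2))

# Super learner with a minimal prespecified library of wrappers
sl.lib <- c("SL.glm", "SL.mean", "SL.step")
fit <- SuperLearner(Y = Y, X = cbind(W, A = A), family = binomial(),
                    SL.library = sl.lib)
fit$SL.predict   # final predicted values
fit$coef         # vector of weights for each algorithm

# TMLE for the average treatment effect, with super learning for the
# outcome regression (Q) and treatment mechanism (g)
result <- tmle(Y = Y, A = A, W = W, family = "binomial",
               Q.SL.library = sl.lib, g.SL.library = sl.lib)
result$estimates$ATE   # point estimate with influence-curve-based inference
```

In practice the library would include many more candidate algorithms; the weights in `fit$coef` show how the cross-validated ensemble combines them.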
The multiPIM package [26], authored by Stephan Ritter (Omicia, Inc., Oakland, CA),
is designed specifically for variable importance analysis and estimates an
attributable-risk-type parameter using TMLE. This package also allows the use of
SuperLearner to estimate nuisance parameters and produces additional estimates using
estimating-equation-based estimators and g-computation. The package includes its own
internal bootstrapping function to calculate standard errors when this is preferred over
the use of influence curves, or when influence curves are not valid for the chosen
estimator.
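A hypothetical call might look like the sketch below, which follows the package's documented interface of passing outcomes and exposures as data frames; the simulated exposures and outcome are placeholders for illustration only.

```r
library(multiPIM)

# Hypothetical simulated data: two binary exposures and one binary outcome
set.seed(1)
n <- 200
A <- data.frame(A1 = rbinom(n, 1, 0.5), A2 = rbinom(n, 1, 0.4))
Y <- data.frame(Y = rbinom(n, 1, plogis(0.4 * A$A1 - 0.3 * A$A2)))

# Attributable-risk-type parameter for each exposure, estimated with TMLE
fit <- multiPIM(Y, A, estimator = "TMLE")
summary(fit)   # parameter estimates with standard errors
```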
Four additional prediction-focused packages are casecontrolSL [17], cvAUC [15],
subsemble [16], and h2oEnsemble [14], all primarily authored by Erin LeDell (Berkeley).
The casecontrolSL package relies on SuperLearner and performs subsampling in a case-
control design with inverse-probability-of-censoring-weighting, which may be particularly
useful in settings with rare outcomes. The cvAUC package is a tool kit to evaluate area
under the ROC curve estimators when using cross-validation. The subsemble package was
developed based on a new approach [44] to ensembling that fits each algorithm on a subset
of the data and combines these fits using cross-validation. This technique can be used in
datasets of all sizes, but has been demonstrated to be particularly useful in smaller datasets.
A new implementation of super learner can be found in the Java-based h2oEnsemble
package, which was designed for big data. The package uses the H2O R interface to run
super learning in R with a selection of prespecified algorithms.
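As a sketch of the cvAUC workflow, assume ten-fold cross-validated predictions are already in hand; the predictions and fold assignments below are simulated placeholders.

```r
library(cvAUC)

# Hypothetical cross-validated predicted probabilities for a binary outcome
set.seed(1)
n <- 500
labels <- rbinom(n, 1, 0.5)
predictions <- runif(n)                      # stand-in for CV predictions
folds <- sample(rep(1:10, length.out = n))   # 10-fold assignment

# Cross-validated AUC with an influence-curve-based confidence interval
out <- ci.cvAUC(predictions, labels, folds = folds, confidence = 0.95)
out$cvAUC   # cross-validated AUC estimate
out$ci      # 95% confidence interval
```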
Another TMLE package is ltmle [46], primarily authored by Joshua Schwab (Berkeley).
This package mainly focuses on parameters in longitudinal data structures, including the
treatment-specific mean outcome and parameters defined by a marginal structural model.
The package returns estimates for TMLE, g-computation, and estimating-equation-based
estimators.
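A minimal single-time-point sketch of ltmle, with hypothetical simulated data, might look like the following; in a genuinely longitudinal analysis the node arguments would list the time-ordered treatment, censoring, covariate, and outcome columns.

```r
library(ltmle)

# Hypothetical simulated point-treatment data, columns in time order: W, A, Y
set.seed(1)
n <- 300
W <- rnorm(n)
A <- rbinom(n, 1, plogis(0.2 * W))
Y <- rbinom(n, 1, plogis(0.3 * A + 0.4 * W))
data <- data.frame(W, A, Y)

# Treatment-specific mean outcome E[Y_1] under the intervention A = 1
fit <- ltmle(data, Anodes = "A", Ynodes = "Y", abar = 1)
summary(fit)   # TMLE estimate with influence-curve-based inference
```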
22.5 Discussion: Variable Importance and Big Data
While the development of targeted learning for variable importance has demonstrated
promise, its potential has yet to be fully realized. The data we are collecting in biology,
social sciences, health care, medicine, business analytics, and ecology, among others,
continue to grow in both dimensions (n and p), and are frequently observational in nature
[31,43]. Statisticians are armed with a unique set of rigorous and practical tools to tackle
these challenges. To face this growth in data going forward, targeted learning provides
a framework for incorporating advances in machine learning and TMLE for problems of
variable importance [33,60].
It is always important to remember that sophisticated statistical methods will never
be able to overcome weak or problematic big data. Misclassification, missingness, and
unmeasured confounding are frequently found in these new streams of data. A thorough
understanding of the data and associated research questions, often only ascertained by
working in interdisciplinary teams, is required before leaping toward analysis. This will not
change as technologies continue to advance.
To call in the statistician after the experiment is done may be no more than
asking him to perform a postmortem examination: he may be able to say what
the experiment died of.
R.A. Fisher, 1938
We will also need to address the increasing computational challenges presented by these
data. For example, online targeted learning [57] is a new proposed method for data that
arrives sequentially, another common feature of big data applications. Advances will not only
be found in statistical theory and methodology development, however. Existing approaches
to merging data systems and statistics rely mainly on database systems to serve, for
example, R requests in the background. Movement toward integrated native big data systems
may be a key component in the adoption of rigorous targeted learning tools for variable
importance in massive datasets.
Targeted learning is one of many new statistical innovations that are poised for further
theoretical and methodological development in this new era of big data, inspired by these
real-world challenges. Advances in dimension-reduction for imaging analyses, for example,
will improve our ability to use features of these images as covariates in variable importance
analyses, and also move us toward the ability to estimate variable importance measures of a
list of images. The future of statistical and scientific discovery with big data is bright,
as we look forward to the creation of automated big data machines that incorporate
investigator knowledge, are statistically sound, and can handle the computational burden of
our data.
Acknowledgments
The author acknowledges funding from the University of Utah fund P0 163947.
References
1. O. Bembom, M.L. Petersen, S.-Y. Rhee, W.J. Fessel, S.E. Sinisi, R.W. Shafer, and
M.J. van der Laan. Biomarker discovery using targeted maximum likelihood estimation:
Application to the treatment of antiretroviral resistant HIV infection. Stat Med, 28:
152–172, 2009.
2. O. Bembom and M.J. van der Laan. A practical illustration of the importance of realistic
individualized treatment rules in causal inference. Electron J Stat, 1:574–596, 2007.
3. A. Chambaz, D. Choudat, C. Huber, J.C. Pairon, and M.J. van der Laan. Analysis of
the effect of occupational exposure to asbestos based on threshold regression modeling
of case–control data. Biostatistics, 15(2):327–340, 2014.
4. A. Chambaz, P. Neuvial, and M.J. van der Laan. Estimation of a non-parametric
variable importance measure of a continuous exposure. Electron J Stat, 6:1059–1099,
2012.
5. S. Datta and H.C. van Houwelingen. Statistics in biological and medical sciences. Stat
Prob Lett, 81(7):715–716, 2011.
6. I. Diaz, A.E. Hubbard, A. Decker, and M. Cohen. Variable importance and prediction
methods for longitudinal problems with missing variables. Technical Report 318,
Division of Biostatistics, University of California, Berkeley, CA, 2013.
7. S. Dudoit and M.J. van der Laan. Resampling Based Multiple Testing with Applications
to Genomics. Springer, Berlin, Germany, 2008.
8. T.R. Golub, D.K. Slonim, P. Tamayo et al. Molecular classification of cancer: Class
discovery and class prediction by gene expression monitoring. Science, 286:531–537,
1999.
9. S. Gruber and M.J. van der Laan. An application of collaborative targeted maximum
likelihood estimation in causal inference and genomics. Int J Biostat, 6(1):Article 18,
2010.
10. S. Gruber and M.J. van der Laan. tmle: An R package for targeted maximum likelihood
estimation. J Stat Softw, 51(13):1–35, 2012.
11. C.S. Haley and S.A. Knott. A simple regression method for mapping quantitative trait
loci in line crosses using flanking markers. Heredity, 69(4):315–324, 1992.
12. R. Kessler, S. Rose, K. Koenen et al. How well can post-traumatic stress disorder
be predicted from pre-trauma risk factors? An exploratory study in the WHO world
mental health surveys. World Psychiatry, 13(3):265–274, 2014.
13. L. Kunz, S. Rose, and S.-L. Normand. An overview of statistical approaches for
comparative effectiveness research. In C. Gatsonis and S.C. Morton, editors, Methods
in Comparative Effectiveness Research. Chapman & Hall, Boca Raton, FL, 2015.
14. E. LeDell. h2oEnsemble: H2O Ensemble. R package version 0.0.1, 2014.
15. E. LeDell, M. Petersen, and M.J. van der Laan. cvAUC: Cross-Validated Area Under
the ROC Curve Confidence Intervals. R package version 1.0-0, 2014.
16. E. LeDell, S. Sapp, and M.J. van der Laan. Subsemble: An Ensemble Method for
Combining Subset-Specific Algorithm Fits. R package version 0.0.9, 2014.
17. E. LeDell, M.J. van der Laan, and M. Petersen. casecontrolSL: Case-Control Subsampling
for SuperLearner. R package version 0.1-5, 2014.
18. K.L. Moore and M.J. van der Laan. Application of time-to-event methods in the
assessment of safety in clinical trials. In Karl E. Peace, editor, Design, Summarization,
Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints. Chapman &
Hall, Boca Raton, FL, 2009.
19. K.L. Moore and M.J. van der Laan. Covariate adjustment in randomized trials with
binary outcomes: targeted maximum likelihood estimation. Stat Med, 28(1):39–64, 2009.
20. K.L. Moore and M.J. van der Laan. Increasing power in randomized trials with right
censored outcomes through covariate adjustment. J Biopharm Stat, 19(6):1099–1131,
2009.
21. R. Neugebauer and J. Bullard. DSA: Data-Adaptive Estimation with Cross-Validation
and the D/S/A Algorithm. R package version 3.1.3, 2009.
22. R. Neugebauer, J.A. Schmittdiel, and M.J. van der Laan. Targeted learning in real-world
comparative effectiveness research with time-varying interventions. Stat Med, 33(14):
2480–2520, 2014.
23. J. Pearl. On a class of bias-amplifying variables that endanger effect estimates. In
Proceedings of the Uncertainty in Artificial Intelligence, Catalina Island, CA, 2010.
24. E. Polley and M.J. van der Laan. SuperLearner: Super Learner Prediction. R package
version 2.0-10, 2013.
25. E.C. Polley and M.J. van der Laan. Predicting optimal treatment assignment based
on prognostic factors in cancer patients. In K.E. Peace, editor, Design, Summarization,
Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, Chapman &
Hall, Boca Raton, FL, 2009.
26. S.J. Ritter, N.P. Jewell, and A.E. Hubbard. R package multiPIM: A causal inference
approach to variable importance analysis. J Stat Softw, 57(8), 2014.
27. J.M. Robins. Robust estimation in sequentially ignorable missing data and causal
inference models. In Proceedings of the American Statistical Association, Indianapolis,
IN, 2000.
28. J.M. Robins and A. Rotnitzky. Recovery of information and adjustment for dependent
censoring using surrogate markers. In N.P. Jewell, K. Dietz and V.T. Farewell, editors,
AIDS Epidemiology, pp. 297–331. Birkhäuser, Basel, Switzerland, 1992.
29. J.M. Robins and A. Rotnitzky. Comment on the Bickel and Kwon article, “Inference
for semiparametric models: Some questions and an answer.” Stat Sinica, 11(4):920–936,
2001.
30. J.M. Robins, A. Rotnitzky, and M.J. van der Laan. Comment on “On profile likelihood.”
J Am Stat Assoc, 450:431–435, 2000.
31. S. Rose. Big data and the future. Significance, 9(4):47–48, 2012.
32. S. Rose. Mortality risk score prediction in an elderly population using machine learning.
Am J Epidemiol, 177(5):443–452, 2013.
33. S. Rose. Statisticians’ place in big data. Amstat News, 428:28, 2013.
34. S. Rose and M.J. van der Laan. Simple optimal weighting of cases and controls in
case-control studies. Int J Biostat, 4(1):Article 19, 2008.
35. S. Rose and M.J. van der Laan. Why match? Investigating matched case-control study
designs with causal effect estimation. Int J Biostat, 5(1):Article 1, 2009.
36. S. Rose and M.J. van der Laan. A targeted maximum likelihood estimator for two-stage
designs. Int J Biostat, 7(1):Article 17, 2011.
37. S. Rose and M.J. van der Laan. A double robust approach to causal effects in case-
control studies. Am J Epidemiol, 179(6):663–669, 2014.
38. S. Rose and M.J. van der Laan. Rose and van der Laan respond to “Some advantages
of RERI.” Am J Epidemiol, 179(6):672–673, 2014.
39. M. Rosenblum, S.G. Deeks, M.J. van der Laan, and D.R. Bangsberg. The risk of
virologic failure decreases with duration of HIV suppression, at greater than 50% adherence
to antiretroviral therapy. PLoS ONE, 4(9): e7196. doi:10.1371/journal.pone.0007196,
2009.
40. M. Rosenblum and M.J. van der Laan. Using regression models to analyze randomized
trials: Asymptotically valid hypothesis tests despite incorrectly specified models.
Biometrics, 65(3):937–945, 2009.
41. M. Rosenblum and M.J. van der Laan. Targeted maximum likelihood estimation of the
parameter of a marginal structural model. Int J Biostat, 6(2):Article 19, 2010.
42. D.B. Rubin and M.J. van der Laan. Empirical efficiency maximization: Improved locally
efficient covariate adjustment in randomized experiments and survival analysis. Int
J Biostat, 4(1):Article 5, 2008.
43. C. Rudin, D. Dunson, R. Irizarry et al. Discovery with data: Leveraging statistics
and computer science to transform science and society. Technical report, American
Statistical Association, Alexandria, VA, 2014.