associations (i.e., conditioning on a collider). The rationale for condition 2 is to block
any remaining spurious sources of association. For the basic structure (Figure 20.3),
the randomization assumption will hold if the following independence assumptions are
true:
U_A ⊥ U_Y and (U_A ⊥ U_W or U_Y ⊥ U_W).
There must not be any unmeasured common causes of the exposure and the outcome, and there must not be any unmeasured common causes of the exposure and the covariates or of the outcome and the covariates. As illustrated in Figure 20.4,
this graphical criterion can aid in the selection of an appropriate adjustment set.
When the randomization assumption holds, we can identify the distribution of counterfactuals within strata of covariates. Specifically, we have that for each P_{U,X} ∈ M^F,

P_{U,X}(Y_a = y | W = w) = P_{U,X}(Y_a = y | A = a, W = w)
                         = P(Y = y | A = a, W = w),
where the distribution P of the observed data is implied by P_{U,X}. This gives us the G-computation identifiability result [27] for the true distributions P_{U,X,0} and P_0:

E_{U,X,0}(Y_a) = Σ_w E_0(Y | A = a, W = w) P_0(W = w),
where the summation generalizes to an integral for continuous covariates. Likewise, we can
identify the difference in the expected counterfactual outcomes (i.e., the average treatment
effect) in terms of the difference in the conditional mean outcomes, averaged with respect
to the covariate distribution:
E_{U,X,0}(Y_1 − Y_0) ≡ Ψ^F(P_{U,X,0})
  = Σ_w [E_0(Y | A = 1, W = w) − E_0(Y | A = 0, W = w)] P_0(W = w) ≡ Ψ(P_0).
Identifiability also relies on having sufficient support in the data. The G-computation
formula requires that the conditional mean E_0(Y | A = a, W = w) is well defined for all
possible values of w and levels of a of interest. In a nonparametric statistical model, each
exposure of interest must occur with some positive probability for each possible covariate
stratum:
min_{a ∈ A} P_0(A = a | W = w) > 0, for all w for which P_0(W = w) > 0.
This condition is known as the positivity assumption and as the experimental treatment
assignment assumption.
Suppose, for example, that the randomization assumption holds conditionally on a single
binary baseline covariate. Then our statistical estimand could be rewritten as
Ψ(P_0) = [E_0(Y | A = 1, W = 1) − E_0(Y | A = 0, W = 1)] P_0(W = 1)
       + [E_0(Y | A = 1, W = 0) − E_0(Y | A = 0, W = 0)] P_0(W = 0).
As an extreme, suppose that in the population, there are zero exposed patients with this covariate: P_0(A = 1 | W = 1) = 0. Then there would be no information about outcomes under the exposure for this subpopulation. To identify the treatment effect, we could consider a different target parameter (e.g., the effect among those with W = 0) or consider additional modeling assumptions (e.g., the effect is the same among those with W = 1 and W = 0).
Both options are somewhat dissatisfying, and other approaches may be taken [37]. The risk of violating the positivity assumption is exacerbated with higher dimensional data (i.e., as the number of covariates or their levels grows).
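In practice, positivity can also be probed empirically by estimating the exposure mechanism and flagging strata with estimated exposure probabilities near 0 or 1. The following R sketch is purely illustrative: the data frame dat and the covariate names W1 and W2 are hypothetical placeholders.

# Hypothetical data frame dat with binary exposure A and covariates W1, W2.
# Estimate the exposure mechanism P(A = 1 | W) with logistic regression.
g_fit <- glm(A ~ W1 + W2, family = binomial(), data = dat)
g1W <- predict(g_fit, type = "response")      # estimated P(A = 1 | W_i)

# Estimated probabilities near 0 or 1 signal practical positivity violations.
summary(g1W)
mean(g1W < 0.025 | g1W > 0.975)               # proportion of near-violations

# With a single binary covariate, a simple cross-tabulation suffices:
# prop.table(table(dat$W1, dat$A), margin = 1)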
In many cases, our initial assumptions, encoded in the structural causal model M^F, are not sufficient to identify the causal effect Ψ^F(P_{U,X}). Indeed, for our running example (Figure 20.2a), the set of baseline covariates is not sufficient to block the back-door paths from the outcome to the exposure. The question then becomes how to proceed. Possible options include giving up, gathering more data, or continuing to estimation while clearly acknowledging the lack of identifiability during the interpretation step. To facilitate the third option, we can use M^{F*} to denote the structural causal model, augmented with additional convenience-based assumptions needed for identifiability. This gives us a way to proceed, while separating our real knowledge M^F from our wished-for identifiability assumptions M^{F*}.
Overall, identifiability assumptions and resulting estimands are specific to the causal parameter Ψ^F(P_{U,X}). We are focusing on a point treatment effect (i.e., distribution of counterfactuals under interventions on a single node or variable). Different identifiability results are needed for interventions on more than one node (e.g., longitudinal treatment effects and direct effects) and interventions responding to patient characteristics (e.g., dynamic regimes). Furthermore, a given causal parameter may have more than one identifiability result (e.g., instrumental variables and the front-door criterion). See, for example, Pearl [14].
Common Pitfall: Stating vs. Evaluating the Identifiability Assumptions
There is a temptation to simply state the identifiability assumptions and proceed to the analysis. The identifiability assumptions require careful consideration. Directed acyclic graphs facilitate the evaluation of assumptions by subject-matter experts without extensive statistical training. When interpreting the analysis, any convenience-based causal assumptions should be transparently stated and explained.

FIGURE 20.4
Considering the back-door criterion. (a) The set of covariates W2 is sufficient to block the back-door path from Y ← W2 → A. Therefore, the randomization assumption will hold conditionally on W2. Further adjustment for W1 is unnecessary and potentially harmful. (b) The randomization assumption holds conditionally on the empty set. Adjusting for W (i.e., conditioning on a collider of U and U*) opens a back-door path and induces a spurious association between A and Y. (c) The randomization assumption holds conditionally on (W, L). The covariates L are needed to block the back-door path from Y ← L ← U → A, even though L occurs temporally after the exposure A.
20.7 Estimation and Inference
In the previous step, we defined the parameter of interest as a mapping from the statistical model to the parameter space: Ψ : M → ℝ. In other words, the statistical parameter is a function whose input is any distribution P compatible with the statistical model and whose output is a real number. The parameter mapping applied to the true observed data distribution P_0 is called the estimand and is denoted Ψ(P_0). Recall we have n independent, identically distributed (i.i.d.) copies of the random variable O = (W, A, Y). The empirical distribution P_n corresponds to putting a weight of 1/n on each copy O_i. An estimator is a function whose input is the observed data (a realization of P_n) and whose output is a value in the parameter space.
In this chapter, we consider substitution estimators based on the G-computation identifiability result [27]:

Ψ(P_0) = E_0[E_0(Y | A = 1, W) − E_0(Y | A = 0, W)].   (20.3)
A simple substitution estimator for Ψ(P_0) can be implemented as follows:
1. Estimate the conditional expectation of the outcome, given the exposure and covariates, denoted Ê(Y | A, W).
2. Use this estimate to generate the predicted outcomes for each unit, setting A = 1 and A = 0.
3. Take the sample average of the difference in these predicted outcomes:

Ψ̂(P_n) = (1/n) Σ_{i=1}^{n} [Ê(Y_i | A_i = 1, W_i) − Ê(Y_i | A_i = 0, W_i)].
The last step corresponds to estimating the marginal covariate distribution P_0(W) with the sample proportion (1/n) Σ_i I(W_i = w).
There are many options available for estimating the conditional expectation E_0(Y | A, W). Often, parametric models are used to relate the conditional mean outcome to the possible predictor variables and the exposure. Suppose, for example, we knew that the conditional expectation of a continuous outcome could be described by the following parametric model:

E_0(Y | A, W) = β_0 + β_1 A + β_2 W_1 + β_3 W_2 + β_4 A·W_1 + β_5 A·W_2,
where W = {W_1, W_2} denotes the set of covariates, needed for identifiability. Then this knowledge should have been encoded in our structural causal model M^F, with implied restrictions on our statistical model M. (In other words, we avoid introducing new assumptions during the analysis.) The coefficients in this regression model could be estimated with maximum likelihood or with ordinary least squares regression. The estimate β̂_1 does not, however, provide an estimate of the G-computation identifiability result. The
exact interpretation of β̂_1 depends on which variables and which interactions are included in the parametric model. To obtain an estimate of Ψ(P_0), we need to average the predicted outcomes with respect to the distribution of covariates:

Ψ̂(P_n) = (1/n) Σ_{i=1}^{n} [Ê(Y_i | A_i = 1, W_i) − Ê(Y_i | A_i = 0, W_i)]
        = (1/n) Σ_{i=1}^{n} [β̂_1 + β̂_4 W_{1,i} + β̂_5 W_{2,i}].
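To make the recipe concrete, the following R sketch implements this simple substitution estimator for a continuous outcome. It is illustrative only: the data frame dat and the covariate names W1 and W2 are hypothetical, and in practice the regression specification would follow from the structural causal model rather than being assumed correct.

# Hypothetical data: outcome Y, binary exposure A, covariates W1 and W2 in dat.
# Step 1: estimate E(Y | A, W) with the assumed linear model (with interactions).
Q_fit <- lm(Y ~ A * (W1 + W2), data = dat)

# Step 2: predict each unit's outcome under exposure (A = 1) and control (A = 0).
pred1 <- predict(Q_fit, newdata = transform(dat, A = 1))
pred0 <- predict(Q_fit, newdata = transform(dat, A = 0))

# Step 3: average the unit-level differences to estimate Psi(P_0).
psi_hat <- mean(pred1 - pred0)
psi_hat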
As a second example, suppose we knew that the conditional risk of a binary outcome could
be described by the following parametric model:
logit[E_0(Y | A, W)] = β_0 + β_1 A + β_2 W_1 + ··· + β_{11} W_{10},
where W = {W_1, ..., W_{10}} denotes the set of covariates, needed for identifiability. Then the estimate β̂_1 would provide an estimate of the logarithm of the conditional odds ratio. An estimate of the G-computation identifiability result is given by averaging the difference in expected outcomes under the exposure A = 1 and control A = 0:
Ψ̂(P_n) = (1/n) Σ_{i=1}^{n} [ 1/(1 + exp{−(β̂_0 + β̂_1 + β̂_2 W_{1,i} + ··· + β̂_{11} W_{10,i})})
                            − 1/(1 + exp{−(β̂_0 + β̂_2 W_{1,i} + ··· + β̂_{11} W_{10,i})}) ].
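For this binary-outcome case, the same three steps apply with a logistic regression fit; the predictions are simply taken on the probability scale. Again, this is a sketch with a hypothetical data frame dat and covariates W1 through W10.

# Hypothetical data: binary outcome Y, binary exposure A, covariates W1, ..., W10.
Q_fit_bin <- glm(Y ~ A + W1 + W2 + W3 + W4 + W5 + W6 + W7 + W8 + W9 + W10,
                 family = binomial(), data = dat)

# Predicted probabilities under exposure and control for every unit.
p1 <- predict(Q_fit_bin, newdata = transform(dat, A = 1), type = "response")
p0 <- predict(Q_fit_bin, newdata = transform(dat, A = 0), type = "response")

# Substitution estimate of the average treatment effect on the risk scale.
psi_hat_bin <- mean(p1 - p0)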
In most cases, our background knowledge is inadequate to describe the conditional expectation E_0(Y | A, W) with such parametric models. Indeed, with high dimensional data, the sheer number of potential covariates will likely make it impossible to correctly specify the functional form. If the assumed parametric model is incorrect, the point estimates will often be biased and inference misleading. In other words, the structural causal model M^F, representing our knowledge of the underlying data generating process, often implies a nonparametric statistical model M. Our estimation approach should respect the statistical model.
To avoid unsubstantiated assumptions about functional form, it is sometimes possible to estimate E_0(Y | A, W) with the empirical mean in each exposure–covariate stratum.
Unfortunately, even when all covariates are discrete valued, nonparametric maximum
likelihood estimators quickly become ill-defined due to the curse of dimensionality; the
number of possible exposure–covariate combinations far exceeds the number of observations.
Again, this problem becomes exacerbated with big data, where, for example, there are
hundreds of potential covariates under consideration.
Various model selection routines can help alleviate these problems. For example, stepwise
regression will add and subtract variables in hopes of minimizing the Akaike information
criterion or the Bayesian information criterion. Other data-adaptive methods, based on
cross-validation, involve splitting the data into training and validation sets. Each possible
algorithm (e.g., various parametric models or semiparametric methods) is then fit on the
training set and its performance assessed on the validation set. The measure of performance
can be defined by a loss function, such as the L2-squared error or the negative log likelihood.
Super learner, for example, uses cross-validation to select the candidate algorithm with the
best performance or to build the optimal (convex) combination of estimates from candidate
algorithms [38,39]. (For further details, see Chapter 19.) A point estimate could then be
obtained by averaging the difference in predicted outcomes for each unit under the exposure
and under the control.
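As an illustration of this idea, one possible workflow uses the SuperLearner R package to estimate the conditional mean with a small library of candidate algorithms and then plugs the predictions into the same substitution formula. This sketch assumes a hypothetical data frame dat containing Y, A, and the covariates; the candidate library shown is arbitrary and would be chosen to reflect the problem at hand.

# Hypothetical data: dat contains Y, A, and covariates W1, W2, ...
library(SuperLearner)

X  <- subset(dat, select = -Y)          # predictors: exposure and covariates
X1 <- transform(X, A = 1)               # counterfactual design with A = 1
X0 <- transform(X, A = 0)               # counterfactual design with A = 0

# Cross-validated combination of a few candidate algorithms (arbitrary choices here).
sl_fit <- SuperLearner(Y = dat$Y, X = X,
                       family = gaussian(),
                       SL.library = c("SL.mean", "SL.glm", "SL.glmnet"))

# Predicted outcomes under exposure and control, then the substitution estimate.
pred1 <- predict(sl_fit, newdata = X1)$pred
pred0 <- predict(sl_fit, newdata = X0)$pred
psi_hat <- mean(pred1 - pred0)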
Although these data-adaptive methods avoid betting on one a priori specified parametric regression model and are amenable to semiparametric algorithms, there is no reliable
way to obtain statistical inference for parameters, such as the G-computation estimand
Ψ(P_0). Treating the final algorithm as if it were prespecified ignores the selection process. Furthermore, the selected algorithm was tailored to maximize/minimize some criterion with regard to the conditional expectation E_0(Y | A, W) and will, in general, not provide the best bias–variance trade-off for estimating the statistical parameter Ψ(P_0). Indeed, estimating the conditional mean outcome Y in every stratum of (A, W) is a much
more ambitious task than estimating one number (the difference in conditional means,
averaged with respect to the covariate distribution). Thus, without an additional step, the
resulting estimator will be overly biased relative to its standard error, preventing accurate
inference.
Targeted maximum likelihood estimation (TMLE) provides a way forward [3,40]. TMLE
is a general algorithm for the construction of double robust, semiparametric, efficient
substitution estimators. TMLE allows for data-adaptive estimation while obtaining valid
statistical inference. The algorithm is detailed in Chapter 22. Although TMLE is a
general algorithm for a wide range of parameters, we focus on its implementation for the
G-computation estimand. Briefly, the TMLE algorithm uses information in the estimated exposure mechanism P̂(A | W) to update the initial estimator of the conditional mean E_0(Y | A, W). The targeted estimates are then substituted into the parameter mapping. The updating step achieves a targeted bias reduction for the parameter of interest Ψ(P_0) and serves to solve the efficient score equation. As a result, TMLE is a double robust estimator; it will be consistent for Ψ(P_0) if either the conditional expectation E_0(Y | A, W) or the exposure mechanism P_0(A | W) is estimated consistently. When both functions are
consistently estimated at a fast enough rate, the TMLE will be efficient in that it achieves the
lowest asymptotic variance among a large class of estimators. These asymptotic properties
typically translate into lower bias and variance in finite samples. The advantages of TMLE
have been repeatedly demonstrated in both simulation studies and applied analyses [37,41–
43]. The procedure is available with standard software such as the tmle and ltmle packages
in R [44–46].
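A minimal call to the tmle package might look like the sketch below; dat, the covariate columns, and the choice of learner libraries are hypothetical, and the package documentation should be consulted for the full set of options.

# Hypothetical data: outcome Y, binary exposure A, covariate data frame W.
library(tmle)

W <- dat[, c("W1", "W2")]
fit <- tmle(Y = dat$Y, A = dat$A, W = W,
            Q.SL.library = c("SL.mean", "SL.glm"),
            g.SL.library = c("SL.mean", "SL.glm"))
fit   # prints the targeted estimate of the average treatment effect with inference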
Thus far, we have discussed obtaining a point estimate from a simple or targeted substi-
tution estimator. To create confidence intervals and test hypotheses, we also need to quantify
uncertainty. A simple substitution estimator based on a correctly specified parametric model
is asymptotically linear, and its variance can be approximated by the variance of its influence
curve, divided by sample size n. It is worth emphasizing that our estimand Ψ(P_0) often does
not correspond to a single coefficient, and therefore we usually cannot read off the reported
standard error from common software. Under reasonable conditions, the TMLE is also
asymptotically linear and inference can be based on an estimate of its influence curve.
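For the average treatment effect, an influence-curve-based standard error can be computed directly from the estimated exposure mechanism, the (targeted) outcome predictions, and the point estimate, using the well-known efficient influence curve for this parameter. The sketch below reuses hypothetical objects from the earlier sketches (g1W, pred1, pred0, psi_hat) and is offered only to illustrate the calculation; packages such as tmle report these quantities automatically.

# Efficient influence curve for the average treatment effect (nonparametric model):
# IC_i = (A_i/g(1|W_i) - (1 - A_i)/(1 - g(1|W_i))) * (Y_i - Qbar(A_i, W_i))
#        + Qbar(1, W_i) - Qbar(0, W_i) - psi_hat
QbarAW <- ifelse(dat$A == 1, pred1, pred0)           # predictions at observed exposure
IC <- (dat$A / g1W - (1 - dat$A) / (1 - g1W)) * (dat$Y - QbarAW) +
      pred1 - pred0 - psi_hat

se <- sqrt(var(IC) / nrow(dat))                      # variance of the IC divided by n
ci <- psi_hat + c(-1.96, 1.96) * se                  # Wald-style 95% confidence interval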
Overall, this chapter focused on substitution estimators (simple and targeted) of the G-computation identifiability result [27]. The simple substitution estimator only requires an estimate of the marginal distribution of baseline covariates P_0(W) and the conditional expectation of the outcome, given the exposure and covariates, E_0(Y | A, W). TMLE also requires an estimate of the exposure mechanism P_0(A | W). There are many other algorithms available for estimation of Ψ(P_0). A popular class of estimators relies only on estimation
of the exposure mechanism [47–49]. Inverse probability of treatment weighting (IPTW)
estimators, for example, control for measured confounders by up-weighting exposure–
covariate groups that are underrepresented and down-weighting exposure–covariate groups
that are overrepresented (relative to what would be seen were the exposure randomized).
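For comparison, a bare-bones IPTW estimate of the same parameter can be computed from the estimated exposure mechanism alone; this sketch reuses the hypothetical g1W from the earlier positivity check.

# IPTW (Horvitz-Thompson style) estimate of the average treatment effect,
# using the estimated exposure mechanism g1W = estimated P(A = 1 | W).
wt <- dat$A / g1W - (1 - dat$A) / (1 - g1W)   # signed inverse probability weights
psi_iptw <- mean(wt * dat$Y)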
Its double robust counterpart, augmented-IPTW, shares many of the same properties as
TMLE [50,51]. A key distinction is that IPTW and augmented-IPTW are solutions to
estimating equations and therefore respond differently in the face of challenges due to strong
confounding and rare outcomes [37,52]. Throughout, we maintain that estimators should