20. Tutorial for Causal Inference (1/6)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Tutorial for Causal Inference

Laura Balzer, Maya Petersen, and Mark van der Laan

CONTENTS

20.1 Why Bother with Causal Inference? ........................................... 361

20.2 The Scientiﬁc Question ........................................................ 363

20.3 The Causal Model .............................................................. 363

20.4 The Target Causal Quantity ................................................... 366

20.5 The Observed Data and Their Link to the Causal Model ..................... 368

20.6 Assessment of Identiﬁability ................................................... 369

20.7 Estimation and Inference ...................................................... 373

20.8 Interpretation of the Results ................................................... 376

20.9 Conclusion ...................................................................... 377

Appendix: Extensions to Multiple Time Point Interventions ......................... 378

Acknowledgments ...................................................................... 381

References ............................................................................. 381

20.1 Why Bother with Causal Inference?

This book has mostly been dedicated to large-scale computing and machine learning

algorithms. These tools help us describe the relationships between variables in vast, complex

datasets. This chapter goes one step further by introducing methods, as well as their

limitations, to learn causal relationships from these data. Consider, for example, the

following questions:

1. What proportion of patients taking drug X suﬀered adverse side eﬀects?

2. Which patients taking drug X are more likely to suﬀer adverse side eﬀects?

3. Would the risk of adverse eﬀects be lower if all patients took drug X instead of

drug Y ?

The ﬁrst question is purely descriptive; the second can be characterized as a prediction

problem, whereas the last is causal. Causal inference is distinct from statistical inference in

that it seeks to make conclusions about the world under changed conditions [1]. In the third

example, our goal is to make inferences about how the distribution of patient outcomes

would diﬀer if all patients had taken drug X versus if the same patients, over the same

time frame and under the same conditions, had taken drug Y . Purely statistical analyses

are sometimes endowed with causal interpretations. Furthermore, many of our noncausal

questions have causal elements. For example, Geng et al. [2] sought to assess whether sex was

361

362 Handbook of Big Data

an independent predictor of mortality among patients initiating drug therapy (i.e., describe

a noncausal association) but in the absence of loss to follow up (i.e., a change to the existing

conditions).

In this chapter, we review a formal framework for causal inference to (1) state the

scientiﬁc question; (2) express our causal knowledge and limits of that knowledge; (3) specify

the causal parameter; (4) specify the observed data and their link to the causal model;

(5) assess identiﬁability of our causal parameter as some function of the observed data

distribution; (6) estimate the corresponding statistical parameter, incorporating methods

discussed in this book; and (7) interpret our results [3–5]. Access to millions of data

points does not obviate the need for this framework. Analyses of big data are not

immune to the problems of small data. Instead, one might argue that analyses of big

data exacerbate many of the problems of small data. As illustrated in Figure 20.1, there

are many sources of association between two variables, including direct eﬀects, indirect

eﬀects, measured confounding, unmeasured confounding, and selection bias [6]. Methods to

delineate causation from correlation are perhaps more pressing now than ever [7,8].

(a)

(b)

(c)

(d)

(e)

(f)

FIGURE 20.1

Some of the sources of dependence between an exposure A and an outcome Y :(a)the

exposure A directly aﬀects the outcome Y ; (b) the exposure A directly aﬀects the outcome

Y as well as indirectly aﬀects it through the mediator Z;(c)theexposureA has no eﬀect

on the outcome Y , but an association is induced by a measured common cause W ;(d)the

exposure A has no eﬀect on the outcome Y , but an association is induced by an unmeasured

common cause U;(e)theexposureA has no eﬀect on the outcome Y , but an association

is induced by only examining data among those not censored C; (f) all these sources of

dependence are present. Please note this is not an exhaustive list.

Tutorial for Causal Inference 363

20.2 The Scientiﬁc Question

The ﬁrst step in the causal “roadmap” is to specify the scientiﬁc objective. As a running

example, we will consider the timing of antiretroviral therapy (ART) initiation and its

impact on outcomes among HIV+ individuals. Early ART initiation has been been shown to

improve patient outcomes as well as reduce transmission between discordant couples [9,10].

Suppose we want to learn the eﬀect of immediate ART initiation (i.e., irrespective of CD4+

T-cell count) on mortality. Large consortiums, such as the International Epidemiologic

Databases to Evaluate AIDS and Sustainable East Africa Research in Community Health,

are providing unprecedented quantities of data to answer this and other questions [12,13].

To sharply frame our scientiﬁc aim, we need to further specify the system, including the

target population (e.g., patients and context), the exposure (e.g., criteria and timing), and

the outcome. As a second try, consider our goal as learning the impact of initiating ART

within 1 month of diagnosis on 5-year all-cause mortality among adults, recently diagnosed

with HIV in Sub-Saharan Africa. This might seem like an insurmountable task, and it may

seem safer to frame our question in terms of an association. Indeed, there seems to be a

tendency to shy away from causal language when stating the scientiﬁc objective. However,

we are not fundamentally interested in the correlation between early ART initiation and

mortality among HIV+ adults. Instead, we want to isolate the eﬀect of interest from the

spurious sources of dependence (e.g., confounding, selection bias, informative censoring) as

shown in Figure 20.1. The framework, discussed in this chapter, provides a pathway from

our scientiﬁc aim to estimation of a statistical parameter that best approximates our causal

eﬀect, while keeping any assumptions transparent.

20.3 The Causal Model

The second step of the roadmap is to specify our causal model. Causal inference is distinct

from statistics in that it requires something more than a sample from the observed data

distribution. In particular, causal inference requires speciﬁcation of background knowledge,

and causal models provide a rigorous language for expressing this knowledge and its limits.

In this chapter, we focus on structural causal models [14] to formally represent which

variables potentially aﬀect one another, the roles of unmeasured factors, and the functional

form of those relationships. Structural causal models unify causal graphs [15], structural

equations [16,17], and counterfactuals. We also brieﬂy introduce the Neyman–Rubin

potential outcomes framework [18–20] and discuss its relation to the structural causal model.

Consider again our running example. Let W denote the set of baseline covariates,

including sociodemographics, clinical measurements, and social constructs. The exposure

A is an indicator, equalling 1 if the patient initiated ART within 1 month of diagnosis and

equalling 0 otherwise (i.e., initiation took longer than 1 month). Finally, the outcome Y is an

indicator that the patient did not survive 5 years of follow-up. These factors have scientiﬁc

meaning to the question and comprise the set of endogenous variables: X = {W, A, Y }.They

can be measurable (e.g., age and sex) or unmeasurable and are aﬀected by other variables

in the model.

Each endogenous variable is associated with a set of background factors U =

) with some joint distribution P

. These represent all the unmeasured factors,

aﬀecting other variables in the model but not included in X. For example, U

could include

unknown clinic-level factors, inﬂuencing whether or not a patient initiates early ART.

364 Handbook of Big Data

Likewise, U

may include a patient’s genetic risk proﬁle. Furthermore, there might be

shared unmeasured causes between the endogenous variables. For example, socioeconomic

status may impact whether a patient initiates early ART as well as his/her 5-year

mortality.

Each endogenous variable is also associated with a structural equation. These functions

help encode our causal knowledge. Suppose, for example, we believe that the set of baseline

covariates possibly impact whether a patient initiates early ART, and that both the

covariates and the exposure may aﬀect subsequent morality. Then we write each endogenous

variable as a deterministic function of its “parents,” variables that may impact its value:

W = f

)

A = f

(W, U

)

Y = f

(W, A, U

). (20.1)

These functions F = {f

} are left unspeciﬁed (nonparametric). For example, the

third equation f

encodes that the covariates W and the exposure A may have inﬂuenced

the value taken by the outcome Y . We have not, however, restricted their relationships: A

and any member of W may interact on an additive (or any other) scale to aﬀect Y and the

impacts of A and W on Y may be nonlinear.

The structural causal model, denoted M

, is deﬁned by all possible distributions of P

and all possible sets of functions F , which are compatible with our assumptions (if any). For

the above example, there is some true joint distribution P

U,0

of health care access, personal

preferences for ART use, socioeconomic factors, etc. Randomly sampling a patient from the

population corresponds to drawing a particular realization u from P

U,0

. Likewise, there are

some true structural equations F

that would deterministically generate the endogenous

variables X = x if given input U = u. For a given distribution P

and set of functions F ,

the structural causal model M

describes the following data generating process for (U, X):

1. Drawing the background factors U from some joint probability distribution P

2. Generating the baseline covariates W as some deterministic function f

of U

3. Generating the exposure A as some deterministic function f

of covariates W

and U

4. Generating the outcome Y as some deterministic function f

of covariates W ,

the exposure A,andU

Thus, the model M

is the collection of all possible probability distributions P

U,X

for the

exogenous and endogenous variables (U, X). The true joint distribution is an element of the

causal model: P

U,X,0

∈M

. The structural causal model is also sometimes also called a

nonparametric structural equation model [14,21].

In other settings, we may have more in-depth knowledge about the data generating

process. This knowledge is generally encoded in two ways. First, excluding a variable from

the parent set of X

encodes that this variable does not directly impact the value X

takes.

These assumptions are known as exclusion restrictions. Second, restricting the set of allowed

distributions for P

encodes that some variables do not have any unmeasured common

causes. These assumptions are known as independence assumptions. Suppose, for example,

that patients were randomized R to early ART initiation, but adherence A was imperfect.

Then the treatment assignment R would only be determined by chance (e.g., a coin ﬂip) and

not inﬂuenced by the baseline covariates W . The unmeasured factors determining treatment

assignment would be independent from all other unmeasured factors:

Tutorial for Causal Inference 365

This is an independence assumption that restricts the allowed distribution of background

factors P

. Furthermore, suppose that randomization R only aﬀects the mortality Y through

its eﬀect on adherence A. The resulting structural equations are then

W = f

)

R = f

)

A = f

(W, R, U

)

Y = f

(W, A, U

). (20.2)

We have made two exclusion restrictions: (1) the baseline covariates W do not inﬂuence

randomization R and (2) randomization R has no direct eﬀect on the outcome Y .The

structural causal model is then deﬁned by all probability distributions for U that are com-

patible with our independence assumptions and all sets of functions F =(f

)

that are compatible with our exclusion restrictions.

A causal graph can be drawn from the structural causal model [14]. Each endogenous

variable (node) is connected to its parents and background error term with a directed arrow.

The potential dependence between the background factors is encoded by the inclusion of

a node representing any unmeasured common cause. Exclusion restrictions are encoded

by absence of a directed arrow. Likewise, independence assumptions are encoded with the

absence of a node representing an unmeasured common cause. The corresponding causal

graphs for the two examples are given in Figure 20.2.

(a) (b)

FIGURE 20.2

Directed acyclic graphs representing the structural causal model for our study (Equa-

tion 20.1) and for the hypothetical randomized trial (Equation 20.2). (a) This graph

only encodes the time ordering between baseline covariates W ,theexposureA,andthe

outcome Y . A single node U represents the unmeasured common causes of the endogenous

variables. (b) This graph encodes the randomization R of some treatment with incomplete

adherence A. There are two exclusion restrictions: The baseline covariates W do not impact

the randomization R, and the randomization R has no direct eﬀect on the outcome Y .There

is also an independence assumption: the unmeasured factors contributing to randomization

are independent of the unmeasured factors, contributing to the other variables.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 20. Tutorial for Causal Inference (1/6)

Create new playlist

Sign In

Sign Up

Table of Contents for
20. Tutorial for Causal Inference (1/6)