Abstract
This paper provides a critical review of the Bayesian perspective of causal inference based on the potential outcomes framework. We review the causal estimands, assignment mechanism, the general structure of Bayesian inference of causal effects and sensitivity analysis. We highlight issues that are unique to Bayesian causal inference, including the role of the propensity score, the definition of identifiability, the choice of priors in both low- and high-dimensional regimes. We point out the central role of covariate overlap and more generally the design stage in Bayesian causal inference. We extend the discussion to two complex assignment mechanisms: instrumental variable and time-varying treatments. We identify the strengths and weaknesses of the Bayesian approach to causal inference. Throughout, we illustrate the key concepts via examples.
This article is part of the theme issue ‘Bayesian inference: challenges, perspectives, and prospects’.
1. Introduction
Causality has long been central to the human philosophical debate and scientific pursuit. There are many relevant questions, e.g. the philosophical meaning of causation or deducing the causes of a given phenomenon. Among these questions, statistics—which concerns measurements—arguably can contribute the most to the question of measuring the effects of causes. Statistics infers associations between variables. Even though the research questions in many statistics-based studies are causal in nature, a first lesson in elementary statistics is that association does not imply causation. Distinguishing between causation and spurious association between various events is a challenging task in science. Broadly speaking, statistical causal inference is about building a framework that (i) defines causal effects under general scenarios, (ii) specifies assumptions under which one can identify causation from association and (iii) assesses the sensitivity to the causal assumptions and finds ways to mitigate violations of them.
A mainstream statistical framework for causal inference is the potential outcomes framework [1–3]. Under this framework, following the dictum ‘no causation without manipulation’ [3], a cause is a pre-specified treatment or intervention that is at least hypothetically manipulable. A typical causal question is ‘would an individual have a better outcome had he taken the treatment rather than the control?’ Causal effects are defined as comparisons of potential outcomes, also known as counterfactuals, under different treatment conditions for the same units. The main hurdle to interpreting the association between the treatment and the outcome as a causal effect is confounding, i.e. the presence of factors that are associated with both the treatment and the outcome. For example, patients with worse health conditions may be more likely to obtain a beneficial medical treatment; then directly comparing the outcomes of the treated and control patients, without adjusting for the difference in their baseline health conditions, would bias the causal comparisons and mistakenly conclude that the treatment is harmful. Randomized experiments, known as A/B testing in industry or randomized controlled trials in medicine, are the gold standard for causal inference, eliminating confounding via randomization. But modern causal inference has increasingly relied on observational data. The potential outcomes framework provides the basis for identifying and estimating causal effects—quantities defined based on counterfactuals—from the factual data in the presence of confounding, using randomized or observational data. This framework is applicable to a wide range of problems in many disciplines and has been increasingly adopted in the area of machine learning. Other frameworks for causal inference, including the causal diagram [4] and invariant prediction [5], are beyond the scope of this review.
There are three primary inferential approaches within the potential outcomes framework [6]: Fisherian randomization test, Neymanian repeated-sampling evaluation and Bayesian inference. The first two approaches belong to the Frequentist paradigm and have been dominant, with many popular tools such as propensity scores, matching and weighting. The Bayesian approach has several established advantages for general statistical analysis, including automatic uncertainty quantification, coherently incorporating prior knowledge, and offering a rich collection of advanced models for complex data. As causal studies increasingly involve real-world big data, there has been a recent surge of research in Bayesian inference of causal effects [7–11], but there is no comprehensive appraisal of the current state of the research. This paper aims to fill this gap. Due to the space limit, we do not intend to provide a catalogue of the existing research on this topic, but rather discuss the big picture of why and how to conduct Bayesian causal inference in general settings. We emphasize the unique questions, challenges and opportunities that the Bayesian approach brings to causal inference. We hope this review can stimulate broader and deeper cross-fertilization between causal inference and Bayesian analysis.
Section 2 introduces the preliminaries of the potential outcomes framework, and briefly discusses several Frequentist methods to causal inference. Section 3 outlines the general structure of Bayesian causal inference, focusing on ignorable treatment assignments at one time point. Section 4 discusses model specification and implications in high-dimensional settings. Section 5 reviews the role and various uses of the propensity score in Bayesian causal inference. Section 6 outlines sensitivity analysis in observational studies. Section 7 describes two complex assignment mechanisms: instrumental variable and time-varying treatments. Section 8 concludes.
2. Estimands, identification and frequentist estimation
To convey the main ideas, we focus on the case with a binary treatment at one time period, which can be readily extended to multiple treatments and multiple time points. Consider a sample of $N$ units drawn from a target population, indexed by $i = 1, \ldots, N$. Each unit can potentially be assigned to one of two treatment levels $z$, with $z = 1$ for the active treatment and $z = 0$ for the control. Let $Z_i$ be the binary variable indicating unit $i$'s observed treatment status. For unit $i$, a vector of covariates $X_i$ is observed before the treatment, and an outcome $Y_i$ is observed after the treatment. A confounder is a pre-treatment variable that is associated with both the treatment and the outcome; it can be observed, as a subset of the covariates $X_i$, or unobserved. Below we use covariates and confounders interchangeably. We use the notation $A \perp B \mid C$ to denote conditional independence between two variables $A$ and $B$ given variable $C$ [12]. We also use the bold font to indicate a vector consisting of the corresponding variables for the $N$ units, e.g. $\mathbf{Z} = (Z_1, \ldots, Z_N)$.
We maintain the standard stable unit treatment value assumption (SUTVA) [13], namely, there is (i) no different version of a treatment, and (ii) no interference in the sense that one unit's potential outcomes are not affected by other units' treatment assignments. Under SUTVA, each unit has two potential outcomes: $Y_i(1)$ and $Y_i(0)$. Causal effects are contrasts of potential outcomes under different treatment conditions for the same set of units. The individual treatment effect (ITE) for unit $i$ is $\tau_i = Y_i(1) - Y_i(0)$. Averaging over a sample we obtain the sample average treatment effect (SATE), $\tau^{\mathrm{S}} = N^{-1} \sum_{i=1}^{N} \{Y_i(1) - Y_i(0)\}$; averaging over the target population gives the population average treatment effect (PATE), $\tau^{\mathrm{P}} = E\{Y(1) - Y(0)\}$. Furthermore, the conditional average treatment effect (CATE) is the average of the individual treatment effects of all units with the covariate value $x$:
$$\tau(x) = E\{Y(1) - Y(0) \mid X = x\}.$$
The fundamental problem of causal inference [14] is that, for each unit only the potential outcome corresponding to the actual treatment, $Y_i(Z_i)$, is observed or factual, and the other potential outcome, $Y_i(1 - Z_i)$, is missing or counterfactual. Therefore, additional assumptions are necessary to identify the causal effects. The key identifying assumptions concern the assignment mechanism, i.e. the process that determines which units get what treatment and hence which potential outcomes are observed or missing [15]. The vast majority of causal studies assume certain versions of an ignorable assignment mechanism, where the treatment assignment is independent of the potential outcomes conditional on some observed variables. Specifically, in the simple setting of a binary treatment at one time, ignorability consists of two sub-assumptions [15,16].
Assumption 2.1.
(Ignorability). (a) Unconfoundedness: $Z \perp \{Y(1), Y(0)\} \mid X$, or equivalently $\Pr\{Z = 1 \mid Y(1), Y(0), X\} = \Pr(Z = 1 \mid X)$. (b) Overlap: $0 < e(x) < 1$ for all $x$, where $e(x) = \Pr(Z = 1 \mid X = x)$ is the propensity score [16].
The unconfoundedness assumption states that there is no unmeasured confounding, and the overlap assumption states that each unit has non-zero probability of being assigned to each treatment condition. These two assumptions together ensure that the conditional distribution of the potential outcomes is identifiable from observed data as
$$\Pr\{Y(z) \mid X\} = \Pr(Y \mid Z = z, X), \quad z = 0, 1.$$
In randomized experiments, the treatment assignment is known and controlled by the experimenters, and ignorability holds by design. In observational studies, the treatment assignment is unknown and uncontrolled, and ignorability at best holds approximately. A key concept in causal inference is overlap and balance, which refers to the similarity in the distributions of the covariates between the comparison groups. In general, as the two groups become more balanced, the causal estimates become less sensitive to the estimation strategy and model specification. In the ideal case of a randomized experiment, all—measured and unmeasured—covariates are balanced in expectation, i.e. they have the same multivariate distribution in the two treatment arms. Consequently, the simple difference in means between the treated and control outcomes is unbiased for the SATE and the PATE; furthermore, even a misspecified linear outcome model leads to a consistent estimate of the PATE [17]. On the contrary, in observational studies, the two groups are often imbalanced in many covariates, e.g. patients receiving the treatment may be generally sicker and older than those receiving the control. In such cases, directly comparing the difference in the outcomes between the two groups would give biased estimates of the causal effects. Moreover, the fit of an outcome model would rely on extrapolation in the regions where the two groups overlap poorly, and consequently outcome-model-based estimators, which impute the unobserved potential outcomes from a fitted regression, are sensitive to the model specification. Therefore, a main effort in causal inference with observational data is to ensure overlap and balance to mimic a randomized experiment as closely as possible. This process does not involve the outcome and is referred to as the design stage, in contrast to the analysis stage, which utilizes the outcome and estimates causal effects given the design stage [18].
A causal analysis of an observational study usually has both design and analysis stages, in parallel with those of a randomized controlled experiment.
The propensity score plays a central role in causal inference with observational data, owing to its two special properties [16]. First, the propensity score is a balancing score in the sense that $Z \perp X \mid e(X)$. This means that balancing the scalar propensity score balances the multivariate distribution of the covariates. Second, if a treatment assignment is unconfounded given $X$, then it is unconfounded given $e(X)$; that is, $Z \perp \{Y(1), Y(0)\} \mid X$ implies $Z \perp \{Y(1), Y(0)\} \mid e(X)$. In observational studies, $e(x)$ is usually unknown and needs to be estimated, e.g. via a logistic regression model of the treatment on the covariates.
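As a concrete illustration, the logistic-regression estimate of the propensity score can be computed directly with a few Newton-Raphson steps; a minimal numpy sketch on simulated data (the data-generating design here is hypothetical):

```python
import numpy as np

def fit_propensity(X, z, n_iter=25):
    """Estimate e(x) = Pr(Z=1 | X=x) by logistic regression of the
    treatment on the covariates, fitted by Newton-Raphson."""
    Xd = np.column_stack([np.ones(len(z)), X])  # add an intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        W = p * (1.0 - p)
        # Newton step: beta += (X'WX)^{-1} X'(z - p)
        beta += np.linalg.solve(Xd.T @ (W[:, None] * Xd), Xd.T @ (z - p))
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

# toy data: treatment probability increases with the single covariate
rng = np.random.default_rng(0)
x = rng.normal(size=500)
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
e_hat = fit_propensity(x.reshape(-1, 1), z)
```

In practice the estimated scores would then be checked for overlap (e.g. histograms by treatment arm) before any outcome analysis.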
The propensity score is usually used with matching, weighting, or stratification to achieve balance and estimate causal effects. Specifically, matching methods use a certain algorithm to find pairs of units in the two groups with similar covariates according to a distance metric, e.g. the propensity score or the Mahalanobis distance, and then calculate the difference in the average observed outcome between the groups in the matched sample [19–21]. Weighting methods assign a weight to each unit, so that the weighted distributions of the covariates in the two groups are balanced [22], and then calculate the weighted difference in the outcomes between the two groups. An important weighting scheme is inverse probability weighting (IPW), based on the identification formula of the PATE: $\tau^{\mathrm{P}} = E\{Z Y / e(X)\} - E[(1 - Z) Y / \{1 - e(X)\}]$. A corresponding IPW estimator [23] is
$$\hat{\tau}^{\mathrm{ipw}} = \frac{1}{N} \sum_{i=1}^{N} \frac{Z_i Y_i}{\hat{e}(X_i)} - \frac{1}{N} \sum_{i=1}^{N} \frac{(1 - Z_i) Y_i}{1 - \hat{e}(X_i)},$$
where $\hat{e}(X_i)$ denotes the estimated propensity score for unit $i$. One can further augment the IPW estimator by an outcome model to obtain a semiparametric efficient estimator [24]:
$$\hat{\tau}^{\mathrm{dr}} = \hat{\tau}^{\mathrm{reg}} + \frac{1}{N} \sum_{i=1}^{N} \frac{Z_i \hat{\epsilon}_i}{\hat{e}(X_i)} - \frac{1}{N} \sum_{i=1}^{N} \frac{(1 - Z_i) \hat{\epsilon}_i}{1 - \hat{e}(X_i)},$$
where $\hat{\epsilon}_i = Y_i - \hat{\mu}_{Z_i}(X_i)$ is the residual from the outcome model, $\hat{\mu}_z(x)$ is an estimate of $E(Y \mid Z = z, X = x)$, and $\hat{\tau}^{\mathrm{reg}} = N^{-1} \sum_{i=1}^{N} \{\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\}$ is the outcome-model estimator. The IPW estimator is consistent for $\tau^{\mathrm{P}}$ if the propensity score model is correct, and the outcome-model estimator $\hat{\tau}^{\mathrm{reg}}$ is consistent if the outcome model is correct. Because the bias of the estimator $\hat{\tau}^{\mathrm{dr}}$ is a product of the residual of the propensity score model and that of the outcome model, $\hat{\tau}^{\mathrm{dr}}$ is doubly robust in the sense that it is consistent if either the propensity score or the outcome model, but not necessarily both, is correctly specified [25]. Despite the seemingly different construction, matching estimators, with proper mathematical formulations, can be viewed as non-parametric versions of $\hat{\tau}^{\mathrm{reg}}$ and $\hat{\tau}^{\mathrm{dr}}$ based on nearest-neighbour regressions [26]. These are the main Frequentist estimation strategies for the PATE under ignorability. When the target estimand is the CATE, the primary estimation strategy is outcome modelling. We will discuss how to specify the outcome model in §4(a).
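The IPW and augmented (doubly robust) estimators translate directly into code; a sketch on simulated data with a known propensity score and a true effect of 2 (the simulation design and all names are illustrative):

```python
import numpy as np

def ipw_estimate(y, z, e_hat):
    """Inverse probability weighting (IPW) estimator of the PATE."""
    return np.mean(z * y / e_hat) - np.mean((1 - z) * y / (1 - e_hat))

def aipw_estimate(y, z, e_hat, mu1_hat, mu0_hat):
    """Augmented IPW estimator: outcome-model predictions corrected by
    inverse-probability-weighted residuals (doubly robust)."""
    t1 = mu1_hat + z * (y - mu1_hat) / e_hat
    t0 = mu0_hat + (1 - z) * (y - mu0_hat) / (1 - e_hat)
    return np.mean(t1 - t0)

# toy confounded data: x affects both treatment and outcome; effect = 2
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))          # true propensity score
z = rng.binomial(1, e)
y = x + 2.0 * z + rng.normal(size=n)

tau_ipw = ipw_estimate(y, z, e)
# outcome model: a linear regression of y on x fitted in each arm
c1 = np.polyfit(x[z == 1], y[z == 1], 1)
c0 = np.polyfit(x[z == 0], y[z == 0], 1)
tau_aipw = aipw_estimate(y, z, e, np.polyval(c1, x), np.polyval(c0, x))
```

Here both estimators recover the effect because the propensity score is correct; the doubly robust version would also tolerate a misspecified outcome model.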
3. General structure of Bayesian causal inference
(a) Basic factorization and versions of causal estimands
Because of the unavoidable missing potential outcomes, causal inference under the potential outcomes framework is inherently a missing data problem [6,15]. The Bayesian paradigm offers a unified framework for statistical inference with missing data and thus for causal inference [27]. Below we review the general structure of Bayesian causal inference that was first outlined in [15].
Four quantities are associated with each unit $i$: $\{Z_i, X_i, Y_i(1), Y_i(0)\}$, where $Z_i$, $X_i$ and $Y_i^{\mathrm{obs}} = Y_i(Z_i)$ are observed but $Y_i^{\mathrm{mis}} = Y_i(1 - Z_i)$ is missing. Bayesian inference views all these quantities as random variables and centres around specifying a model for them. Based on the Bayesian model, we can draw inference on causal estimands—functions of the model parameters, covariates and potential outcomes—from the posterior predictive distributions of the parameters and the unobserved potential outcomes. Specifically, we assume the joint distribution of these random variables of all units is governed by a parameter $\theta = (\theta_x, \theta_y, \theta_z)$, conditional on which the random variables for each unit are i.i.d. Then we can factorize the joint density for each unit as
$$f\{Z_i, X_i, Y_i(1), Y_i(0) \mid \theta\} = f(X_i \mid \theta_x)\, f\{Y_i(1), Y_i(0) \mid X_i, \theta_y\}\, f\{Z_i \mid X_i, Y_i(1), Y_i(0), \theta_z\}. \quad (3.1)$$
Before diving into the technical details, we first clarify the subtle but important difference between the Bayesian estimation of the PATE and SATE estimands. For the PATE, we rewrite the outcome-model-based identification formula in §2 as
$$\tau^{\mathrm{P}} = E_X\{E(Y \mid Z = 1, X, \theta_y) - E(Y \mid Z = 0, X, \theta_y)\},$$
which depends only on the unknown parameters $\theta_y$ and $\theta_x$, the latter through the distribution of $X$. Therefore, Bayesian inference for the PATE requires obtaining posterior distributions of $(\theta_y, \theta_x)$. By contrast, the SATE is a function of the potential outcomes $\{Y_i(1), Y_i(0)\}_{i=1}^{N}$, which involves both observed and missing quantities. Bayesian inference for the SATE requires imputing the missing potential outcomes from their posterior predictive distributions based on the outcome model, and consequently deriving the posterior distribution of $\tau^{\mathrm{S}}$.
However, in practice, we rarely model the possibly multi-dimensional covariates $X$, and instead condition on their observed values. This is equivalent to replacing $f(X \mid \theta_x)$ with the empirical distribution of the covariates. Therefore, most Bayesian causal inference (e.g. [9,28]) in fact focuses on the mixed average treatment effect (MATE) [6]
$$\tau^{\mathrm{M}} = \frac{1}{N} \sum_{i=1}^{N} E\{Y_i(1) - Y_i(0) \mid X_i, \theta_y\} = \frac{1}{N} \sum_{i=1}^{N} \tau(X_i). \quad (3.3)$$
Example 3.1.
[Covariate adjustment in a randomized experiment] Consider a completely randomized experiment with covariates $X_i$. Assume the true model for the potential outcomes is jointly normal given the covariates,
$$\{Y_i(1), Y_i(0)\} \mid X_i \sim \mathcal{N}_2 \left\{ \begin{pmatrix} \alpha_1 + \beta_1^\top X_i \\ \alpha_0 + \beta_0^\top X_i \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_0 \\ \rho \sigma_1 \sigma_0 & \sigma_0^2 \end{pmatrix} \right\},$$
where $\rho$ is the correlation between the two potential outcomes.
(b) Posterior inference of causal effects
Regardless of the version of the target estimand, the following assumption is commonly adopted.
Assumption 3.2.
(Prior independence). The parameters $\theta_z$, $\theta_y$ and $\theta_x$ for the models of the assignment mechanism, outcome and covariates, respectively, are a priori distinct and independent.
Assumption 3.2 imposes independent prior distributions for the parameters $(\theta_z, \theta_y, \theta_x)$. It is unique to the Bayesian paradigm of causal inference. It is imposed primarily for modelling and computational convenience and may appear innocuous. However, as elaborated in §4(b), it may lead to unintended and undesirable implications in high-dimensional problems. Under assumptions 2.1 and 3.2, the joint posterior distribution of $\theta$ and the missing potential outcomes $\mathbf{Y}^{\mathrm{mis}}$ is proportional to
$$\Pr(\theta) \prod_{i=1}^{N} f(X_i \mid \theta_x)\, f\{Y_i(1), Y_i(0) \mid X_i, \theta_y\}\, f(Z_i \mid X_i, \theta_z). \quad (3.4)$$
By definition, the MATE $\tau^{\mathrm{M}}$ does not depend on the association between $Y_i(1)$ and $Y_i(0)$, denoted by the parameter $\rho$. Similarly, the PATE $\tau^{\mathrm{P}}$ does not depend on $\rho$, but the SATE $\tau^{\mathrm{S}}$ does. So in the inference of $\tau^{\mathrm{M}}$ and $\tau^{\mathrm{P}}$, we usually directly specify the marginal models $f\{Y_i(z) \mid X_i, \theta_y\}$, or equivalently $f(Y_i \mid Z_i = z, X_i, \theta_y)$ under ignorability [28]. The observed-data likelihood based on (3.4) then becomes $\prod_{i=1}^{N} f(Y_i^{\mathrm{obs}} \mid Z_i, X_i, \theta_y)$, up to factors that do not involve $\theta_y$. Imposing a prior for $\theta_y$, we can proceed to infer $\theta_y$ and subsequently $\tau(x)$, $\tau^{\mathrm{P}}$, or $\tau^{\mathrm{M}}$ using the usual Bayesian inferential procedures.
Bayesian inference of the SATE $\tau^{\mathrm{S}}$ is more complex, because it depends on both $\theta$ and $\mathbf{Y}^{\mathrm{mis}}$ and thus requires posterior sampling of both $\theta$ and $\mathbf{Y}^{\mathrm{mis}}$. The most common sampling strategy is through data augmentation: iteratively simulate $\theta$ and $\mathbf{Y}^{\mathrm{mis}}$ given each other and the observed data, namely from $\Pr(\theta \mid \mathbf{Y}^{\mathrm{mis}}, \mathbf{Y}^{\mathrm{obs}}, \mathbf{Z}, \mathbf{X})$ and $\Pr(\mathbf{Y}^{\mathrm{mis}} \mid \theta, \mathbf{Y}^{\mathrm{obs}}, \mathbf{Z}, \mathbf{X})$. The former, given the observed data and the imputed $\mathbf{Y}^{\mathrm{mis}}$, can be straightforwardly obtained by a complete-data analysis based on (3.4). The latter requires more elaboration. Specifically, we can show that $\Pr(Y_i^{\mathrm{mis}} \mid \theta, \mathbf{Y}^{\mathrm{obs}}, \mathbf{Z}, \mathbf{X})$ is proportional to $f\{Y_i(1), Y_i(0) \mid X_i, \theta_y\}$. This means that imputing the $Y_i^{\mathrm{mis}}$ depends crucially on the joint distribution of $\{Y_i(1), Y_i(0)\}$. Because $Y_i(1)$ and $Y_i(0)$ are never jointly observed, the data provide no information about their association $\rho$. Unless the specific marginal model places constraints on $\rho$, the posterior distribution of $\rho$ would be the same as its prior. Consequently, the posterior distribution of $\tau^{\mathrm{S}}$ would be sensitive to the prior of $\rho$.
The above discussion prompts us to clarify the notion of identifiability in Bayesian inference. Under the Frequentist paradigm, a parameter is identifiable if any two distinct values of it give two different distributions of the observed data. Under the Bayesian paradigm, there is no consensus. For example, Lindley [29] argued that all parameters are identifiable in Bayesian analysis because with proper prior distributions, posterior distributions are always proper. In this sense, the association $\rho$ between the two potential outcomes is identifiable. However, due to the fundamental problem of causal inference, there is no information in the data on $\rho$ and it is reasonable to label it as non-identifiable. This is distinct from the parameters that the data provide direct information on, e.g. those in the marginal distributions of the outcomes in each arm, which are reasonable to label as identifiable. Lindley’s perspective of identifiability blurs such distinction. A more informative perspective is provided by Gustafson [30], who argued that a parameter is weakly or partially identifiable, if a substantial region of its posterior distribution is flat, or its posterior distribution depends crucially on its prior distribution even with large samples, such as $\rho$. Another example of a partially identifiable parameter is the SATE $\tau^{\mathrm{S}}$, which depends on $\rho$ [6,31]. In this perspective, identifiability in Bayesian inference is no longer all-or-nothing; instead, it is a continuum in between. This issue motivates the strategy of transparent parameterization, where one separates identifiable and non-identifiable parameters, and treats the latter as sensitivity parameters in a sensitivity analysis [32–36]. More discussion will be given in §6(b).
Example 3.1 revisited. We now illustrate the posterior inference of the causal estimands in example 3.1. Here, the parameters $\alpha_z$'s and $\beta_z$'s are identifiable, but $\rho$ is not in the Frequentist sense. We fit a Bayesian linear regression model of $Y_i$ on $X_i$ to each observed arm $z = 0, 1$, with independent priors across the arms. The observed likelihood factorizes into two parts: the data in the treatment group and the control group contribute to the likelihood of $(\alpha_1, \beta_1, \sigma_1^2)$ and $(\alpha_0, \beta_0, \sigma_0^2)$, respectively. For example, imposing the conventional conjugate normal-inverse-gamma priors, we can draw from the posterior distribution of $(\alpha_z, \beta_z, \sigma_z^2)$ for $z = 0, 1$, and thus that of the MATE by plugging the posterior draws into the closed form of $\tau^{\mathrm{M}}$ in (3.3). To obtain the PATE, we would have to specify a multivariate model for $X_i$, derive the posterior distribution of $\theta_x$ and then plug it into the closed form of $\tau^{\mathrm{P}}$. This can also be implemented, e.g. via a Bayesian bootstrap step without a model, as described in the next paragraph. To obtain the SATE, we could specify a prior for $\rho$ or fix it to a value. Given $\rho$ and each draw of $(\alpha_z, \beta_z, \sigma_z^2)$, we can impute $Y_i^{\mathrm{mis}}$ as follows: for treated units, $Y_i^{\mathrm{mis}} = Y_i(0)$, and we draw it from the conditional normal distribution of $Y_i(0)$ given $Y_i(1) = Y_i^{\mathrm{obs}}$ and $X_i$; for control units, $Y_i^{\mathrm{mis}} = Y_i(1)$, and we draw it from the conditional normal distribution of $Y_i(1)$ given $Y_i(0) = Y_i^{\mathrm{obs}}$ and $X_i$. Plugging these posterior predictive draws of $\mathbf{Y}^{\mathrm{mis}}$ and the observed outcomes into the definition of $\tau^{\mathrm{S}}$, we obtain its posterior distribution. We suggest varying the sensitivity parameter $\rho$ from 0 to 1, which corresponds to conditionally independent potential outcomes and perfectly correlated potential outcomes, respectively.
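The imputation step for the SATE can be sketched as follows, assuming for simplicity equal, known error variances in the two arms and fixed parameter values standing in for posterior draws (all numbers are hypothetical; the true average effect here is 2):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameter values standing in for posterior draws of
# (alpha_z, beta_z, sigma_z); here sigma_1 = sigma_0 = 1.
alpha = {0: 1.0, 1: 3.0}
beta = {0: 0.5, 1: 0.5}

n = 1000
x = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n)       # completely randomized assignment
mu1, mu0 = alpha[1] + beta[1] * x, alpha[0] + beta[0] * x
y_obs = np.where(z == 1, mu1, mu0) + rng.normal(size=n)

def sate_draw(rho):
    """One posterior-predictive draw of the SATE: impute each unit's
    missing potential outcome from its conditional normal distribution
    given the observed outcome, with sensitivity parameter rho."""
    mu_o = np.where(z == 1, mu1, mu0)  # mean of the observed outcome
    mu_m = np.where(z == 1, mu0, mu1)  # mean of the missing outcome
    cond_mean = mu_m + rho * (y_obs - mu_o)       # equal sigmas simplify
    y_mis = cond_mean + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    y1 = np.where(z == 1, y_obs, y_mis)
    y0 = np.where(z == 1, y_mis, y_obs)
    return np.mean(y1 - y0)

draws = np.array([sate_draw(rho=0.5) for _ in range(200)])
```

In a full analysis the parameters would themselves be redrawn from their posterior at every iteration, and the whole exercise repeated over a grid of $\rho$.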
An interesting alternative Bayesian strategy is through the Bayesian bootstrap [37], where the units are re-weighted with weights drawn from a Dirichlet distribution. The Bayesian bootstrap is a general strategy to simulate the posterior distribution of a parameter under a non-parametric model, which can be viewed as the limit of the inference under the Dirichlet process prior [38]. This renders the Bayesian bootstrap relevant to causal inference in at least two ways. First, one can generate posterior samples from the distribution of $X$ without specifying a parametric model. This is desirable in inferring population estimands like the PATE and the CATE [39]. However, how to integrate these samples of $X$ into the inference of the target causal estimand is case-dependent and generally adds complexity to the analysis compared to the MATE. Second, the Bayesian bootstrap offers a general recipe for incorporating many standard Frequentist procedures into Bayesian inference. For example, Taddy et al. [40] used it to quantify the uncertainty in linear and tree-based methods for estimating the CATE. Chamberlain & Imbens [41] used it in M-estimation with an application to the setting of instrumental variables (see §7(a)). However, we view the Bayesian bootstrap approach as peripheral to Bayesian causal inference because it does not capitalize on arguably the main strength of Bayesian inference, namely, a unified inferential framework underpinned by Bayes' theorem with versatile choices of priors and outcome models.
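A minimal sketch of the Bayesian bootstrap for the difference in means, re-weighting units with flat Dirichlet weights (the toy data are illustrative):

```python
import numpy as np

def bayesian_bootstrap_ate(y, z, n_draws=1000, seed=2):
    """Posterior draws of the treated-control mean difference under the
    Bayesian bootstrap: units get flat Dirichlet(1, ..., 1) weights."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty(n_draws)
    for b in range(n_draws):
        w = rng.dirichlet(np.ones(n))          # one posterior weight draw
        w1, w0 = w * z, w * (1 - z)
        draws[b] = w1 @ y / w1.sum() - w0 @ y / w0.sum()
    return draws

# toy randomized data with true effect 2.0
rng = np.random.default_rng(3)
z = rng.binomial(1, 0.5, size=500)
y = 2.0 * z + rng.normal(size=500)
draws = bayesian_bootstrap_ate(y, z)
```

The spread of `draws` quantifies posterior uncertainty without any parametric outcome model, which is the sense in which the method is non-parametric.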
4. Model specification
(a) Common specification of the outcome model
Section 3 shows that the central component in Bayesian inference of the CATE, PATE and MATE is to specify the outcome model $f(Y_i \mid Z_i, X_i, \theta_y)$. We can either model the two treatment groups jointly with a single function $\mu(z, x) = E(Y \mid Z = z, X = x)$ or model each group separately with two functions $\mu_1(x)$ and $\mu_0(x)$, known as the S-learner or T-learner, respectively, in the literature [42]. The most basic outcome model is a linear regression, $Y_i = \alpha + \tau Z_i + \beta^\top X_i + \gamma^\top Z_i X_i + \epsilon_i$, with Gaussian error terms, where the treatment-covariate interaction term $Z_i X_i$ captures the treatment effect heterogeneity. This model is equivalent to specifying a linear regression in each group. But the equivalence does not hold for nonlinear models; in fact, S-learners and T-learners with the same type of nonlinear models often lead to markedly different causal estimates.
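The claimed equivalence between the fully interacted S-learner and the per-group T-learner in the linear case can be checked numerically; a sketch under a simulated linear model:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n)
y = 1.0 + 2.0 * z + 0.5 * x + 1.5 * z * x + rng.normal(size=n)

# S-learner: one regression with full treatment-covariate interactions
Xs = np.column_stack([np.ones(n), z, x, z * x])
bs = np.linalg.lstsq(Xs, y, rcond=None)[0]

# T-learner: a separate linear regression in each treatment group
def ols(xg, yg):
    Xg = np.column_stack([np.ones(len(xg)), xg])
    return np.linalg.lstsq(Xg, yg, rcond=None)[0]

b1 = ols(x[z == 1], y[z == 1])     # intercept and slope in the treated arm
b0 = ols(x[z == 0], y[z == 0])     # intercept and slope in the control arm

grid = np.linspace(-2, 2, 5)
cate_s = bs[1] + bs[3] * grid                     # tau(x) from the S-learner
cate_t = (b1[0] - b0[0]) + (b1[1] - b0[1]) * grid  # tau(x) from the T-learner
```

The two CATE curves coincide to numerical precision because the interacted design spans exactly the two group-wise regressions; this is the equivalence that breaks for nonlinear learners.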
Linear regressions are easy to implement and interpret, but they are often too restrictive. In real-world problems, it is crucial to specify the outcome model flexibly enough to approximate the possibly complex underlying true data generating mechanism. This is particularly desirable as the recent focus in causal inference has been moving toward heterogeneous treatment effects. Outcome modelling is the most natural approach in these studies: one can simply specify an outcome model and derive the CATE as a function of the model parameters. There has been a rapidly increasing adoption of non-parametric and machine learning models for the outcome. One of the most widely used such models is based on regression trees. At a high level, regression trees partition the covariate space into non-overlapping regions and the prediction in each region is based solely on the data that fall in that region. The parameters of a regression tree characterize where to split the covariate space and how to predict the outcomes in a terminal node [43]. An ensemble of regression trees—usually referred to as a forest—is often used to improve the prediction.
Within the Bayesian paradigm, the Bayesian Additive Regression Tree (BART) [44] has become very popular for causal inference. BART places certain priors on the parameters of the regression trees to control the depth of each tree and the degree of shrinkage of the mean parameters in terminal nodes. Hill [9] first advocated using BART to specify the outcome model in an S-learner. One can also specify a T-learner with a separate BART model for each treatment group. However, without any additional structure on the marginal models, T-learners often result in large variance of the treatment effects. Hahn et al. [8] proposed the Bayesian Causal Forest method based on an alternative reparametrization, $E(Y_i \mid X_i, Z_i) = m(X_i) + \tau(X_i) Z_i$, where $m(x)$ models the prognostic component (the outcome under control) and $\tau(x)$ represents the heterogeneous treatment effect, with a separate BART prior for $m$ and $\tau$. The BART models have a number of advantages, including fast computation, good performance of default choices of hyperparameters and available software. When a study has adequate covariate overlap, BART has been shown to outperform numerous competing methods, including (the Frequentist) random forests, in many empirical applications, e.g. [45,46]. Other Bayesian non-parametric models, such as the Gaussian process [47] and the Dirichlet process [48–51], have also been considered for causal inference. We refer interested readers to [10] for a more detailed review of these methods.
(b) Challenges in high dimensions
Conducting statistical inference in high dimensions is challenging in general. We differentiate between two high-dimensional settings: (i) an outcome model with an infinite or a large number of parameters, regardless of the number of covariates, such as non-parametric and semiparametric models, and (ii) a large number of covariates. Both settings are increasingly common in causal inference, particularly in studies targeting the CATE. As discussed earlier, outcome modelling is the primary method in these settings, and Bayesian non-parametric models have become a mainstay of the model choice.
A straightforward application of the standard Bayesian non-parametric priors to outcome modelling is sometimes inadequate for causal inference, even with low dimensional covariates. An important consideration is that a desirable prior should accurately reflect uncertainty according to the degree of covariate overlap because intuitively the uncertainty of causal estimates should increase as the degree of overlap decreases. Below, we reproduce a simple example in [52] (first constructed by Surya Tokdar) with a single covariate to illustrate this point.
Example 4.1.
[Choice of priors in estimating the CATE] Consider a study with 250 treated and 250 control units. Each unit has a single covariate $X_i$ that follows a Gamma distribution with mean 60 and 35 in the control ($z = 0$) and treatment ($z = 1$) group, respectively, and with s.d. 8 in both groups. To convey the main message, we consider a true outcome model with a constant treatment effect: $Y_i(z) = f(X_i) + \tau z + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, so that the CATE $\tau(x) = \tau$ for all $x$. The scatterplots in the upper panel of figure 1 show that covariate overlap is good between the groups in the middle of the range of $X$ (around 40 to 50), but deteriorates towards the tails of $X$.
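To reproduce such a covariate design, a Gamma distribution can be parameterized by its mean and s.d. via shape (mean/s.d.)^2 and scale s.d.^2/mean; a sketch (the random seed is arbitrary):

```python
import numpy as np

def gamma_by_moments(mean, sd, size, rng):
    """Draw from a Gamma distribution parameterized by its mean and s.d.:
    shape k = (mean/sd)^2, scale = sd^2/mean."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    return rng.gamma(shape, scale, size=size)

rng = np.random.default_rng(4)
x_control = gamma_by_moments(60.0, 8.0, 250, rng)   # z = 0 group
x_treated = gamma_by_moments(35.0, 8.0, 250, rng)   # z = 1 group
```

Plotting the two samples reproduces the overlap pattern described above: substantial overlap in the middle of the range and little in either tail.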
Figure 1. Example 4.1: estimates of means of the potential outcomes (a) and the CATE (b) and corresponding uncertainty band as a function of the single covariate by linear model, Gaussian Process, BART, respectively. Cross symbols: treated units; circles: control units. (Online version in colour.)
To estimate the CATE, we fit an outcome model separately in each group: $Y_i = f_z(X_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma_z^2)$. We choose three priors for $f_z$: (i) a linear model with a Gaussian prior for the coefficients; (ii) a BART prior similar to [9]; (iii) a Gaussian process prior [53] with the covariance function specified by a Gaussian kernel, $k(x, x') = \lambda \exp\{-\psi (x - x')^2\}$, with signal-to-noise ratio parameter $\lambda$ and inverse-bandwidth parameter $\psi$. Figure 1 shows the posterior means of $f_z$ and the CATE, with the corresponding uncertainty bands as functions of $X$. Here, we focus on the uncertainty quantification. In the region of good overlap, all three models lead to similar point and credible-interval estimates of the CATE, but a marked difference emerges in the region of poor overlap. The linear model appears overconfident in estimating the CATE. The Gaussian process trades potential bias for a wider credible interval as overlap decreases and produces a more adaptive uncertainty quantification. BART produces shorter error bars than the Gaussian process (but wider than the linear model's), but the width of its credible interval remains similar regardless of the degree of overlap, and thus it is overconfident in the presence of poor overlap.
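The adaptivity of the Gaussian process is visible in its closed-form posterior predictive variance, which returns to the prior variance away from the observed covariates; a self-contained numpy sketch (kernel hyperparameters are illustrative, not those of the example):

```python
import numpy as np

def gp_posterior_sd(x_train, x_test, lam=1.0, psi=0.5, noise=0.25):
    """Posterior predictive s.d. of Gaussian-process regression with the
    Gaussian kernel k(x, x') = lam * exp(-psi * (x - x')**2).
    The posterior variance is lam - k*' (K + noise I)^{-1} k*, so the
    prior s.d. sqrt(lam) is recovered far from the training points."""
    def k(a, b):
        return lam * np.exp(-psi * (a[:, None] - b[None, :]) ** 2)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = k(x_test, x_train)
    var = lam - np.einsum('ij,ij->i', Ks @ np.linalg.inv(K), Ks)
    return np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(5)
x_tr = rng.uniform(0.0, 5.0, 100)              # region of good overlap
sd = gp_posterior_sd(x_tr, np.array([2.5, 10.0]))
# sd[0]: inside the data; sd[1]: far outside, where uncertainty is large
```

Note that the posterior variance depends only on the covariate locations, not on the outcomes, which is exactly why it tracks the degree of overlap.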
Example 4.1 illustrates that, even with low dimensional covariates, standard Bayesian priors can have markedly different operating characteristics when the two groups are poorly overlapped, and not all priors can adaptively capture the uncertainty according to the degree of overlap. A primary reason for BART potentially underestimating uncertainty in poor overlap is its lack of smoothness, in contrast to the Gaussian process. Nonetheless, such a problem can be mitigated by soft decision trees as in [54].
When there are a large number of covariates, the Bayesian paradigm usually achieves regularization through sparsity-inducing priors for the outcome model, such as the spike-and-slab prior [55] and the Bayesian LASSO [56], as well as model averaging techniques [57–59]. The use of these methods in causal inference is surveyed in [10,50]. High-dimensional covariates pose additional complications to Bayesian causal inference. Robins & Ritov [60] pointed out that non-parametric estimates often have slow convergence rates in this regime, which translates into poor finite-sample performance. A central challenge in causal inference is that covariate overlap diminishes rapidly as the covariate dimension increases, violating the overlap assumption that underpins standard causal analysis [61]. Lack of overlap exacerbates the usual inferential challenges—such as sparsity and slow convergence—in high-dimensional analysis. Even if we assume linear outcome models, we must carefully specify the priors on the regression coefficients. For example, Hahn et al. [7] showed that standard Bayesian regularization on the nuisance parameters may indirectly regularize important causal parameters and thus induce bias, namely regularization-induced confounding. This issue was rigorously investigated in [62]. Specifically, Linero [62] defines a measure of selection bias and shows that, under the seemingly innocuous prior independence assumption 3.2, many Bayesian regularization priors would a priori force the selection bias to concentrate sharply around zero as the number of covariates $p$ increases, to the extent that no amount of data would overcome such a bias. This implies that assumption 3.2 effectively acts as a strongly informative prior as $p$ increases. Such a phenomenon is referred to as prior dogmaticism and is the Bayesian analogue of the aforementioned problem in Ritov et al. [63].
This line of research highlighted the importance of incorporating the propensity score in Bayesian causal inference [7,62,64,65], which echoes the insights from the Frequentist double machine learning method [66,67]. Specifically, a regularized propensity score model or outcome model alone would not be sufficient for valid causal inference, but combining the two achieves desirable convergence rates and finite-sample performance in high-dimensional causal analysis.
5. The role of the propensity score
A major debate in Bayesian causal inference is the role of the propensity score, which characterizes the assignment mechanism. On the one hand, as shown in §3, under assumptions 2.1 and 3.2, the propensity score drops out from the likelihood and thus its value appears to be irrelevant in Bayesian causal inference, which seemingly only involves the outcome model and thus the analysis stage. On the other hand, §2 shows that the propensity score is ubiquitous in the Frequentist approach to causal inference, e.g. in constructing weighting, matching and doubly-robust estimators. Regardless of the mode of inference, the propensity score is essential in ensuring overlap and balance in the design stage of an observational study, which consequently reduces the sensitivity to the outcome model specification. Such sensitivity reduction is key to robust Bayesian causal inference, which is primarily based on outcome modelling. The literature has recognized the importance of incorporating the propensity scores into Bayesian causal inference, either in the design or the analysis stage, but there is no consensus on how to proceed. Below we review three existing strategies.
(a) Include the propensity score as a covariate in the outcome model
The propensity score was first proposed to be included as the only covariate in a Bayesian outcome model under ignorability, which would reduce the model complexity [68]. However, as later pointed out by Zigler [59], conditioning on the propensity score alone is not equivalent to conditioning on the full covariates. So using the propensity score as the single covariate in the outcome model would not lead to the target outcome distribution given the treatment and covariates, but using it as an additional covariate, i.e. specifying an outcome model given the treatment, the covariates and the propensity score, would. This specification effectively conducts an outcome regression at each value of the propensity score, and thus can be viewed as a smoothed version of combining propensity score stratification and outcome modelling. In a sense, this specification provides a Bayesian analogue of double robustness [52,69]. On the one hand, when the outcome model is correctly specified, the augmented model reduces to the original outcome model because the propensity score is a function of the covariates and thus is redundant regardless of its specification. On the other hand, when the outcome model is misspecified but the propensity score model is correctly specified, the results are robust to the outcome model specification because the treatment and control groups are approximately balanced in covariates within each stratum of the propensity score. Various reparametrizations have been proposed. One example is to specify the outcome mean as the sum of a non-parametric function of the propensity score and a parametric function of the treatment and covariates; Little [70] adopted a penalized spline model for the propensity score component. In the aforementioned Bayesian Causal Forest, Hahn et al. [8] imposed separate BART models for the prognostic and treatment effect functions, and demonstrated that adding the propensity score as an additional predictor in the prognostic function significantly improves the empirical estimation of the CATE.
This strategy is usually implemented in two stages: first estimate the propensity score, and then plug the estimate into the Bayesian outcome model as an additional covariate. Such a two-stage procedure is not dogmatically Bayesian, which generically refers to the procedure of specifying a model with parameters and prior distributions for these parameters and then using Bayes' theorem to obtain the posterior distributions of the parameters. A direct consequence is that this procedure may not properly propagate the uncertainty of estimating the propensity score into the outcome model [69]. A dogmatic Bayesian approach would jointly model the treatment and the outcome and draw posterior inference of the propensity score and outcome parameters simultaneously [71]. However, when the outcome model is misspecified, the joint-modelling approach would introduce a feedback problem, that is, the fit of the outcome model would inform the estimation of the propensity scores. This violates the unconfoundedness assumption, distorts the balancing property of the propensity score, and consequently leads to biased estimates of causal effects. A suggested remedy is to first fit a Bayesian model for the treatment assignment and then plug the posterior predictive draws of the propensity score into the outcome model [11]. Such a two-stage procedure is still not dogmatically Bayesian, but empirically provides posterior inference that is more robust to model misspecification.
However, adding the propensity score into the outcome model is controversial conceptually, because the outcome model reflects the nature of the generating process of the potential outcomes, which arguably should not depend on how the treatment is assigned [72].
(b) Dependent priors
The Bayesian causal inference outlined in §3 rests on the prior independence assumption 3.2, without which the propensity score model cannot be dropped from the likelihood. But this assumption is not always plausible in real applications. Various priors that do not rely on this assumption have been proposed [47,63–65]. Below we describe two simple examples.
The first example is due to [58] and is designed for simultaneous variable selection in the propensity score and outcome models. Specifically, assume a logistic propensity score model and a linear outcome model. Assume each component of the propensity score coefficients follows the spike-and-slab prior [73], governed by a latent indicator of whether the corresponding covariate is included in the model, with the spike being a point mass at 0. A similar spike-and-slab prior, with its own latent inclusion indicator, is assumed for the coefficients of the outcome model. Then assume the two inclusion indicators of each covariate are dependent a priori, with a dependence hyperparameter that controls the prior odds of including a covariate in the outcome model when it is included in the propensity score model. A larger value of the hyperparameter implies stronger prior dependence between the variable selection in the two models.
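The exact form of the linked prior in [58] involves notation elided above; the following toy sketch only illustrates the mechanism, under an assumed baseline inclusion probability `q0` and odds multiplier `omega` (both hypothetical names).

```python
def outcome_inclusion_prob(q0: float, omega: float, in_ps_model: bool) -> float:
    """Prior probability that a covariate enters the outcome model.

    q0    : baseline inclusion probability when the covariate is NOT in the
            propensity score model (an illustrative assumption).
    omega : dependence hyperparameter; multiplies the prior odds of inclusion
            when the covariate IS in the propensity score model.
    """
    odds = q0 / (1 - q0)
    if in_ps_model:
        odds *= omega
    return odds / (1 + odds)

# With omega = 1 the two selections are a priori independent: prob stays 0.2.
print(outcome_inclusion_prob(0.2, 1.0, True))
# With omega = 9, inclusion in the propensity model strongly favours inclusion
# in the outcome model: the odds move from 0.25 to 2.25, i.e. prob ~ 0.69.
print(outcome_inclusion_prob(0.2, 9.0, True))
```

Setting `omega = 1` recovers the prior independence assumption 3.2 for that covariate, which makes the hyperparameter a transparent dial between independent and fully linked selection.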
The second example is due to [74]. Assume simple parametric models for the treatment and the outcome, with flat priors on their parameters. If the propensity scores $e_i$ are known, the posterior mean of the PATE equals the Hájek version of the IPW estimator:
$$\hat\tau = \frac{\sum_{i} Z_i Y_i / e_i}{\sum_{i} Z_i / e_i} - \frac{\sum_{i} (1 - Z_i) Y_i / (1 - e_i)}{\sum_{i} (1 - Z_i) / (1 - e_i)}.$$
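The Hájek (self-normalized) IPW estimator itself is easy to sketch numerically; the synthetic data below, with known propensity scores, are an illustrative assumption.

```python
import numpy as np

def hajek_ipw(Y, Z, e):
    """Hajek (self-normalized) IPW estimate of the average treatment effect."""
    w1 = Z / e                      # weights for treated units
    w0 = (1 - Z) / (1 - e)          # weights for control units
    return np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)

rng = np.random.default_rng(1)
n = 5000
e = rng.uniform(0.2, 0.8, n)        # known propensity scores
Z = rng.binomial(1, e)
Y = 1.0 * Z + rng.normal(size=n)    # true ATE = 1
print(round(hajek_ipw(Y, Z, e), 2))  # close to 1.0
```

Normalizing by the sum of weights, rather than by the sample size as in the Horvitz–Thompson form, is what stabilizes the estimator when some propensity scores are near 0 or 1.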
Carefully designed dependent priors often achieve desirable finite-sample results and are more reasonable in real-world studies. However, the specification of such priors is case-dependent, and there is no general solution.
(c) Posterior predictive inference
A general, albeit not dogmatically Bayesian, strategy is to specify both a propensity score model and an outcome model, obtain posterior draws of the propensity scores and the predicted outcomes from their respective posterior predictive distributions, and then plug these draws into the doubly-robust estimator [66,75]. A variance estimator of the resulting estimator is given in [66]. In the same vein, Ding & Guo [76] incorporated the propensity score into a Bayesian posterior predictive $p$-value. Under Fisher's sharp null hypothesis of no causal effect for any unit whatsoever (i.e. $Y_i(1) = Y_i(0)$ for all $i$), the procedure in [76] is equivalent to the Fisher randomization test averaged over the posterior predictive distribution of the propensity score. Simulations in [76] show the advantages of the Bayesian $p$-value compared with the Frequentist analogue. This perspective offers a straightforward and flexible strategy to integrate Bayesian modelling with common Frequentist procedures for causal inference and enables proper uncertainty quantification.
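A minimal sketch of the randomization-test idea, under a deliberate simplification of the procedure in [76]: assignment is Bernoulli with a common probability for all units (no covariates), with a conjugate Beta(1, 1) prior, and the reference distribution is averaged over posterior draws of that probability.

```python
import numpy as np

rng = np.random.default_rng(2)

def posterior_predictive_frt(Y, Z, n_draws=2000):
    """Fisher randomization test of the sharp null, averaging the reference
    distribution over posterior draws of a common assignment probability.

    Sketch assumptions: Bernoulli(p) assignment shared by all units and a
    Beta(1, 1) prior on p -- a simplification of the covariate-dependent
    propensity model of the procedure described in the text."""
    n = len(Z)
    t_obs = abs(Y[Z == 1].mean() - Y[Z == 0].mean())   # observed test statistic
    exceed = 0
    for _ in range(n_draws):
        p = rng.beta(1 + Z.sum(), 1 + n - Z.sum())     # posterior draw of p
        Z_new = rng.binomial(1, p, size=n)             # hypothetical re-assignment
        if 0 < Z_new.sum() < n:
            exceed += abs(Y[Z_new == 1].mean() - Y[Z_new == 0].mean()) >= t_obs
    return exceed / n_draws

# A large treatment effect yields a p-value near zero.
Z = rng.binomial(1, 0.5, size=200)
Y = rng.normal(size=200) + 5 * Z
print(posterior_predictive_frt(Y, Z))
```

Under the sharp null the outcomes are fixed across re-assignments, which is why only the treatment vector is redrawn inside the loop.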
Besides the above three strategies, another general approach is the aforementioned Bayesian bootstrap, which can be used to simulate the posterior distribution of any parameter that can be formulated via M-estimation or estimating equations [41,77]. As special cases, because the IPW and doubly robust estimators—both involving the propensity scores—are solutions to estimating equations, they can be naturally combined with the Bayesian bootstrap to devise Bayesian versions. However, such an approach may be guilty of 'Bayesian for the sake of being Bayesian', and its methodological and practical value compared with competing methods is unclear.
6. Sensitivity analysis in observational studies
Unconfoundedness is a central assumption for causal inference. It holds by design in randomized experiments. However, its validity is fundamentally untestable in observational studies. Therefore, it is of great importance to assess the sensitivity of the results with respect to unmeasured confounding in any observational study. Such procedures are broadly called sensitivity analysis. Different classes of sensitivity analysis methods are characterized by the specific parametrization of confounding. Below we review the two most popular classes.
(a) Parametrization involving distributions with unmeasured confounders
The first parametrization used for sensitivity analysis is motivated by the intuition that a hidden confounder may completely explain away the association between the treatment and the outcome even after adjusting for observed covariates. In a historic debate, Fisher [78] hypothesized that the strong association between cigarette smoking and lung cancer might be due to a hidden genetic factor as their ‘common cause’ or confounder. Cornfield et al. [79] derived an inequality showing that to explain away the observed association, the association between the unmeasured confounder and cigarette smoking must be larger than or equal to the association between cigarette smoking and lung cancer. Their work helped to initiate the field of sensitivity analysis.
Let $U$ denote an unmeasured confounder and assume that unconfoundedness holds conditional on $X$ and $U$: $Z \perp\!\!\!\perp \{Y(1), Y(0)\} \mid (X, U)$. The joint distribution of all variables factorizes into
$$p\{Y(1), Y(0), Z, U \mid X\} = p\{Y(1), Y(0) \mid X, U\}\, p(Z \mid X, U)\, p(U \mid X).$$
As an extension of [79], Ding & VanderWeele [84] treated the treatment-confounder ($Z$ and $U$) and outcome-confounder ($Y$ and $U$) associations as two sensitivity parameters, and derived analytical thresholds for them in order to explain away the observed treatment-outcome ($Z$ and $Y$) association. Based on that theory, VanderWeele & Ding [85] further simplified the procedure by assuming the two associations to be the same and called the resulting threshold the E-value, as a measure of robustness of the causal conclusions with respect to unmeasured confounding. The E-value framework is model-free because it avoids modelling assumptions on $U$; it also avoids repeating the analysis over a range of sensitivity parameters as in competing methods and thus is simple to calculate.
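The E-value of VanderWeele & Ding has a closed form on the risk-ratio scale, which makes it a one-line computation:

```python
import math

def e_value(rr: float) -> float:
    """E-value for an observed risk ratio RR (VanderWeele & Ding).

    For RR >= 1, the E-value is RR + sqrt(RR * (RR - 1)); for a protective
    RR < 1, the formula is applied to 1/RR."""
    if rr < 1:
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))

# An observed risk ratio of 2 requires both the treatment-confounder and
# outcome-confounder associations to reach about 3.41 (on the risk-ratio
# scale) for unmeasured confounding to fully explain away the effect.
print(round(e_value(2.0), 2))  # 3.41
```

A risk ratio of exactly 1 gives an E-value of 1: an effect that is already null needs no confounding to explain it away.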
(b) Parametrization involving distributions of potential outcomes
The second parametrization is motivated by an alternative mathematical expression of the unconfoundedness assumption: $p\{Y(z) \mid Z = 1, X\} = p\{Y(z) \mid Z = 0, X\}$ for $z = 0, 1$, representing the fact that the units in the two randomized arms are comparable in terms of potential outcomes. This class of sensitivity analysis is based on sensitivity parameters that directly represent the difference between the distributions $p\{Y(z) \mid Z = 1, X\}$ and $p\{Y(z) \mid Z = 0, X\}$, instead of modelling the difference with an unobserved $U$. This is implemented in the context of time-varying treatments (see §7(b)) and Frequentist semiparametric estimation [86]. Franks et al. [34] pointed out the importance of distinguishing between model fit and sensitivity to unconfoundedness: the former involves identifiable parameters (e.g. the parameters in the models of the observed outcome distributions) whereas the latter involves unidentifiable parameters (e.g. the dependence between the treatment and the potential outcomes). The merit of this parametrization is apparent from this perspective because it separates identifiable and unidentifiable parameters. Franks et al. [34] proceeded under the Bayesian paradigm and used a copula—whose parameters are the sensitivity parameters—to connect the two identifiable marginal distributions.
A related branch of sensitivity analysis is Rosenbaum's bounds [87]. His original formulation takes the association between the treatment and the potential outcomes conditional on the observed covariates, quantified by a parameter $\Gamma$, as the sole sensitivity parameter for unmeasured confounding. He has also made connections to the parametrization in §6(a) [88]. Starting with a matched sample to mimic a randomized matched-pairs experiment, one can repeat the Fisher randomization test of the sharp null hypothesis of no treatment effect over a range of $\Gamma$ values, and find the threshold of $\Gamma$ at which the $p$-value of the test changes from significant to insignificant. This approach was later generalized to derive the bounds of a given estimator under different $\Gamma$ values. Grounded in Fisherian randomization inference, Rosenbaum's framework does not have a natural Bayesian analogue.
Besides the above two classes, there are numerous other approaches to sensitivity analysis based on alternative parametrizations of unmeasured confounding. However, a common criticism of sensitivity analysis is that, in order to assess the consequence of the untestable unconfoundedness assumption, one has to make even more untestable assumptions, e.g. specifying models involving the unmeasured confounder $U$. Moreover, sensitivity analysis, after all, is a secondary analysis in causal studies, and thus simple implementation and intuitive interpretation are much desired. These considerations underpin the dominance of the E-value method over other methods in practice, particularly in medicine and public health.
7. Complex assignment mechanisms
So far, we have discussed the simplest causal setting of an ignorable treatment at one time point. The basic formulation can be extended to many more complex assignment mechanisms. There are also many popular quasi-experimental designs that rely on identification strategies alternative to ignorability, e.g. regression discontinuity designs, difference-in-differences, synthetic controls. These designs are widely used in socioeconomic applications. Due to the space limit, below we will only briefly review two important extensions and refer interested readers to [89] for a review of quasi-experimental designs and related econometric methods.
(a) From instrumental variable to principal stratification
Instrumental variable (IV) is one of the most important techniques for causal inference in economics and the social sciences. IVs are used in settings where dependence of the assignment on the potential outcomes cannot plausibly be ruled out, even conditional on observed covariates. An IV is a variable that provides a source of exogenous (or unconfounded) variation that helps identify causal effects. IV methods are based on a set of assumptions alternative to ignorability. Specifically, an IV satisfies three conditions: (i) it occurs before the treatment; (ii) it is independent of the unmeasured treatment-outcome confounders; and (iii) it affects the outcome only through its (non-zero) effect on the treatment. Finding a valid IV is challenging in observational studies, and many clever natural experiments have been identified [89]. Given a valid IV, one can extract the causal effect of the treatment on an outcome by the two-stage least-squares (2SLS) estimator: first, fit a linear regression of the treatment on the IV; second, fit a linear regression of the outcome on the fitted value of the treatment from the first stage, whose coefficient is taken as the causal effect of the treatment on the outcome. Covariates can be added in both stages.
The IV method has been developed within the structural equation model framework (see [89] for a review), and the 2SLS IV estimator may not correspond to a causal effect within the potential outcomes framework except in a few special cases. In a landmark paper, Angrist et al. [90] connect IV to the potential outcomes framework in the setting of randomized experiments with a binary treatment and all-or-nothing compliance, with the initial random assignment playing the role of an IV. But many questions remain on the connection between the IV method and the potential outcomes framework in more general settings. Below we describe the special setting of Angrist et al. [90].
We introduce some new notation. For unit $i$, let $Z_i$ be the randomly assigned treatment (1 for the treatment and 0 for the control), and $D_i$ be the actual treatment received (1 for the treatment and 0 for the control). When $D_i \neq Z_i$, non-compliance arises. Because $D_i$ occurs post-assignment, it has two potential values, $D_i(1)$ and $D_i(0)$, with $D_i = D_i(Z_i)$. As before, the outcome has two potential values, $Y_i(1)$ and $Y_i(0)$. Based on the joint potential status of the actual treatment, $G_i = (D_i(1), D_i(0))$, the units fall into four compliance types: compliers $(1, 0)$, never-takers $(0, 0)$, always-takers $(1, 1)$ and defiers $(0, 1)$ [90]. A key property of $G_i$ is that it is not affected by the treatment assignment, and thus can be regarded as a pre-treatment characteristic. Therefore, comparisons of $Y_i(1)$ and $Y_i(0)$ within a stratum of $G_i$ have standard subgroup causal interpretations: $\tau_g = E\{Y_i(1) - Y_i(0) \mid G_i = g\}$, for $g \in \{c, n, a, d\}$; these are later called principal causal effects. The conventional causal estimand in clinical trials is the intention-to-treat effect that ignores the compliance information, which is the weighted average of the four stratum-specific effects: $\tau^{\mathrm{ITT}} = \sum_g \pi_g \tau_g$, where $\pi_g$ is the proportion of stratum $g$. The intention-to-treat effect measures the effect of the assignment instead of the actual treatment.
Due to the fundamental problem of causal inference, the individual compliance stratum $G_i$ is not observed. So the principal causal effects are not identifiable without additional assumptions. Besides randomization of $Z_i$, Angrist et al. [90] make two additional assumptions: (i) monotonicity, $D_i(1) \geq D_i(0)$ for all $i$, and (ii) exclusion restriction, $Y_i(1) = Y_i(0)$ whenever $D_i(1) = D_i(0)$. Monotonicity rules out defiers, and exclusion restriction imposes that the assignment has zero effect among never-takers and always-takers. Then the complier average causal effect is identified by
$$\tau_c = \frac{E(Y_i \mid Z_i = 1) - E(Y_i \mid Z_i = 0)}{E(D_i \mid Z_i = 1) - E(D_i \mid Z_i = 0)}.$$
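This identification result is the Wald (IV ratio) estimator: the intention-to-treat effect on the outcome divided by the intention-to-treat effect on treatment uptake. A numerical sketch on synthetic data (the data-generating process is an illustrative assumption):

```python
import numpy as np

def wald_cace(Y, Z, D):
    """Complier average causal effect via the Wald / IV ratio:
    ITT effect on the outcome divided by ITT effect on treatment uptake."""
    itt_y = Y[Z == 1].mean() - Y[Z == 0].mean()
    itt_d = D[Z == 1].mean() - D[Z == 0].mean()
    return itt_y / itt_d

rng = np.random.default_rng(3)
n = 10000
Z = rng.binomial(1, 0.5, n)
complier = rng.binomial(1, 0.6, n)   # 60% compliers, the rest never-takers
D = Z * complier                     # one-sided non-compliance
Y = 2.0 * D + rng.normal(size=n)     # effect of 2 among treated compliers
print(round(wald_cace(Y, Z, D), 1))  # close to the true complier effect of 2.0
```

Here the ITT effect on the outcome is only about 1.2 (diluted by the 40% never-takers), and dividing by the 0.6 uptake effect recovers the complier effect.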
We now describe the Bayesian inference of the IV set-up, first outlined in [91]. Without additional assumptions, the observed cells of $(Z_i, D_i)$ consist of mixtures of units from more than one stratum. For example, the units who are assigned to the treatment arm and took the treatment ($Z_i = 1, D_i = 1$) can be either always-takers or compliers. One must disentangle the causal effects for different compliance types from the observed data. Therefore, model-based inference here resembles that of a mixture model. In Bayesian analysis, it is natural to impute the missing stratum labels under some model assumptions. Specifically, six quantities are now associated with each unit: $Z_i$, $D_i(1)$, $D_i(0)$, $Y_i(1)$, $Y_i(0)$ and $X_i$, four of which, $\{Z_i, X_i, D_i = D_i(Z_i), Y_i = Y_i(Z_i)\}$, are observed, and the remaining two, $\{D_i(1 - Z_i), Y_i(1 - Z_i)\}$, are unobserved. Assume the joint distribution of these random variables is governed by a parameter $\theta$, conditional on which the random variables for each unit are iid. We assume unconfoundedness of $Z_i$, and impose a prior distribution $p(\theta)$. Then the joint posterior distribution of $\theta$ and the missing potential outcomes is proportional to the product of the prior and the complete-data likelihood.
Using the same arguments as in §3, to infer population and mixed estimands, we only need to specify the two marginal outcome models under treatment and control instead of a joint model, and do not need to impute the missing potential outcomes $Y_i(1 - Z_i)$. But we do need to impute the latent compliance type $G_i$ or, equivalently, the missing intermediate variable $D_i(1 - Z_i)$. We can simulate the joint posterior distribution by iteratively imputing the missing compliance types given $\theta$ and updating the posterior distribution of $\theta$ given the imputed compliance types.
Below we illustrate the Bayesian procedure via a simple example of the IV approach.
Example 7.1.
[Randomized experiment with one-sided non-compliance] Consider a randomized experiment with a binary outcome, where control units have no access to the treatment, i.e. $D_i(0) = 0$ for all units. Therefore, we only have two strata: compliers with $D_i(1) = 1$ and never-takers with $D_i(1) = 0$, with proportions $\pi_c$ and $\pi_n = 1 - \pi_c$, respectively. Assume $Y_i(z) \mid G_i = c \sim \mathrm{Bernoulli}(p_{cz})$ for $z = 0, 1$, and $Y_i(z) \mid G_i = n \sim \mathrm{Bernoulli}(p_n)$, where the exclusion restriction imposes a common $p_n$ for both values of $z$. For simplicity, we impose conjugate priors on the parameters: $(\pi_c, p_{c1}, p_{c0}, p_n)$ are iid $\mathrm{Beta}(1, 1)$. To sample the posterior distributions, the key is to impute the missing $G_i$'s given the observed data. If $Z_i = 1$, then $D_i = 1$ implies $G_i = c$ and $D_i = 0$ implies $G_i = n$, respectively. If $Z_i = 0$, then $D_i = 0$ and $G_i$ is latent. For units with $Z_i = 0$, we can impute $G_i = c$ with probability
$$\frac{\pi_c\, p_{c0}^{Y_i} (1 - p_{c0})^{1 - Y_i}}{\pi_c\, p_{c0}^{Y_i} (1 - p_{c0})^{1 - Y_i} + \pi_n\, p_n^{Y_i} (1 - p_n)^{1 - Y_i}}.$$
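The resulting data-augmentation Gibbs sampler can be sketched as follows. The simulated experiment and the specific parameter values are illustrative assumptions; the alternation between parameter draws from Beta full conditionals and imputation of the latent compliance status follows the scheme described above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate a one-sided non-compliance experiment with a binary outcome.
n = 5000
Z = rng.binomial(1, 0.5, n)
G = rng.binomial(1, 0.6, n)          # 1 = complier, 0 = never-taker
D = Z * G                            # control units cannot take treatment
p_c1, p_c0, p_n = 0.7, 0.3, 0.5      # outcome probabilities by stratum/arm
Y = rng.binomial(1, np.where(G == 1, np.where(Z == 1, p_c1, p_c0), p_n))

def gibbs(Y, Z, D, n_iter=1000):
    """Data augmentation: impute latent compliance of control units, then
    draw (pi_c, p_c1, p_c0, p_n) from their Beta full conditionals."""
    n = len(Y)
    comp = np.where(Z == 1, D, rng.binomial(1, 0.5, n))  # initialize strata
    draws = []
    for _ in range(n_iter):
        # 1. Parameters given the current stratum labels (Beta(1,1) priors).
        def beta_draw(mask):
            return rng.beta(1 + Y[mask].sum(), 1 + mask.sum() - Y[mask].sum())
        pi_c = rng.beta(1 + comp.sum(), 1 + n - comp.sum())
        q_c1 = beta_draw((comp == 1) & (Z == 1))
        q_c0 = beta_draw((comp == 1) & (Z == 0))
        q_n = beta_draw(comp == 0)       # pooled arms: exclusion restriction
        # 2. Impute compliance status of control units from its posterior odds.
        ctrl = Z == 0
        lc = pi_c * q_c0 ** Y[ctrl] * (1 - q_c0) ** (1 - Y[ctrl])
        ln = (1 - pi_c) * q_n ** Y[ctrl] * (1 - q_n) ** (1 - Y[ctrl])
        comp[ctrl] = rng.binomial(1, lc / (lc + ln))
        draws.append(q_c1 - q_c0)        # complier average causal effect
    return np.array(draws)

cace_draws = gibbs(Y, Z, D)
# Posterior mean (after burn-in) is near the true CACE of 0.7 - 0.3 = 0.4.
print(round(cace_draws[200:].mean(), 2))
```

Step 2 implements exactly the imputation probability displayed above, with the current parameter draws plugged in.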
Frangakis & Rubin [93] generalized the IV approach to principal stratification, a unified framework for causal inference with post-treatment confounding. In the simplest scenario, a confounded post-treatment variable lies in the causal pathway between the treatment and the outcome; it cannot be adjusted for in the same fashion as a pre-treatment covariate. A principal stratification with respect to a post-treatment variable is the classification of units based on the joint potential values of that variable, and the stratum-specific effects are called principal causal effects, of which the complier average causal effect is a special case. The post-treatment variable setting includes a wide range of examples. For instance, in the non-compliance setting, the 'treatment' is the randomized treatment assignment, the 'post-treatment' variable is the actual treatment received, and the compliance types are the principal strata [91,92,94,95]. Zeng et al. [96] connect principal stratification to the local IV method with a continuous IV and a binary treatment [97]. Other examples include censoring due to death [98], surrogate endpoints [99,100], regression discontinuity designs [101], time-varying treatments [102], and many more. The choices of target strata and thus estimands, interpretations and identifying assumptions depend on specific applications, the details of which are omitted here.
(b) Time-varying treatment and confounding
In real-world situations, subjects often receive treatments sequentially at multiple time points, and the treatment assignment at each time is affected by both baseline and time-varying confounders as well as previous treatments [103,104]. Such settings are referred to as time-varying, or sequential, or longitudinal treatments.
Consider a study where treatments are assigned at $T$ time points. Let $Z_{it}$ denote the treatment at time $t$ for unit $i$. At baseline ($t = 0$), each unit has time-invariant covariates $X_i$ measured; then after the treatment assignment at time $t$ and prior to the assignment at time $t + 1$, a set of time-varying confounders $L_{it}$ are measured, which include the intermediate measurements of the final outcome and the covariates that are affected by the previous treatments. For example, in a cancer study, baseline covariates can include sex, age and race, and time-varying confounders can include intermediate cancer progression and other clinical traits such as blood pressure measured prior to the next treatment. Denote the observed and hypothetical treatment sequences of length $t$ by $\bar Z_{it} = (Z_{i1}, \ldots, Z_{it})$ and $\bar z_t$, respectively, and the sequence of time-varying confounders by $\bar L_{it} = (L_{i1}, \ldots, L_{it})$. For each hypothetical treatment sequence $\bar z_T$, there is a potential outcome $Y_i(\bar z_T)$. The final observed outcome $Y_i = Y_i(\bar Z_{iT})$, corresponding to the entire observed treatment sequence, is measured after the treatment assigned at time $T$. A common causal estimand is the marginal effect comparing two pre-specified treatment sequences $\bar z_T$ and $\bar z_T'$: $\tau(\bar z_T, \bar z_T') = E\{Y(\bar z_T)\} - E\{Y(\bar z_T')\}$. For simplicity, below we drop the subscript $i$.
The central question in causal inference with sequential treatments is the role of the time-varying confounders in the assignment mechanism. These variables are affected by the previous treatments and also affect the future treatment assignments and the outcome. Much of the literature assumes a sequentially ignorable assignment mechanism [103], that is, the treatment at each time is unconfounded conditional on the observed history, which consists of past treatments and time-varying confounders, as stated below.
Assumption 7.2.
(Sequential ignorability). $Z_t \perp\!\!\!\perp Y(\bar z_T) \mid \bar Z_{t-1}, \bar L_{t-1}, X$, for $t = 1, \ldots, T$ and all $\bar z_T$.
A full Bayesian approach to time-varying treatments [105] would specify a joint model for the treatment assignments and time-varying confounders at all time points as well as all the potential outcomes, and then derive the posterior predictive distributions of the missing potential outcomes and thus of the estimands. This procedure is a straightforward extension of the structure introduced in §3. However, the joint modelling approach is rarely used because it quickly becomes intractable as the number of time points and time-varying confounders increases.
Instead, most of the Bayesian methods are grounded in g-computation. Under sequential ignorability, the average potential outcome $E\{Y(\bar z_T)\}$ is identified from the observed data via the g-formula [103]:
$$E\{Y(\bar z_T)\} = \sum_{x, \bar l_T} E(Y \mid \bar z_T, \bar l_T, x)\, p(x) \prod_{t=1}^{T} p(l_t \mid \bar z_t, \bar l_{t-1}, x).$$
Example 7.3.
[Bayesian g-computation with two periods] Consider the simplest possible scenario with two time periods, binary covariates and a binary outcome. Let $X$ be a binary baseline covariate, $Z_1$ a binary treatment at time 1, $L$ a binary time-varying covariate measured between times 1 and 2, $Z_2$ a binary treatment at time 2, and $Y$ a binary outcome. To obtain the posterior distribution of $E\{Y(z_1, z_2)\}$, one specifies models for $L$ given $(X, Z_1)$ and for $Y$ given $(X, Z_1, L, Z_2)$, and evaluates the g-formula at each posterior draw of the model parameters.
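The two-period procedure can be sketched as follows, using saturated Bernoulli models with independent Beta(1, 1) priors on each stratum probability; the data-generating process and the specific probabilities are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic two-period data: X -> Z1 -> L -> Z2 -> Y, all binary.
n = 20000
X = rng.binomial(1, 0.5, n)
Z1 = rng.binomial(1, 0.3 + 0.3 * X)
L = rng.binomial(1, 0.2 + 0.4 * Z1)                # affected by Z1
Z2 = rng.binomial(1, 0.2 + 0.3 * L)                # confounded by L
Y = rng.binomial(1, 0.1 + 0.2 * Z1 + 0.3 * Z2 + 0.1 * L)

def g_formula_draw(z1, z2):
    """One posterior draw of E[Y(z1, z2)] via the g-formula."""
    def bern_post(V, mask):
        # Beta(1, 1)-conjugate posterior draw of P(V = 1 | stratum).
        k, m = V[mask].sum(), mask.sum()
        return rng.beta(1 + k, 1 + m - k)
    p_x1 = rng.beta(1 + X.sum(), 1 + n - X.sum())  # posterior of P(X = 1)
    total = 0.0
    for x, px in ((1, p_x1), (0, 1 - p_x1)):
        pl1 = bern_post(L, (X == x) & (Z1 == z1))  # confounder model
        for l, pl in ((1, pl1), (0, 1 - pl1)):
            # Outcome model, averaged over the confounder and X distributions.
            py = bern_post(Y, (X == x) & (Z1 == z1) & (L == l) & (Z2 == z2))
            total += px * pl * py
    return total

draws = [g_formula_draw(1, 1) - g_formula_draw(0, 0) for _ in range(200)]
print(round(float(np.mean(draws)), 2))  # near the true contrast of 0.54
```

Note that the confounder distribution is evaluated under the hypothetical $z_1$, not the observed one; naively conditioning on the observed $L$ would block part of the effect of $Z_1$ that flows through $L$.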
g-computation quickly becomes complex as the number of time periods and time-varying confounders increases, which requires specifying a large number of models. It is then necessary to impose more structural restrictions on the data-generating process. However, Robins & Wasserman [107] showed that unsaturated models might rule out the null hypothesis of zero causal effect a priori, a phenomenon termed the 'g-null paradox'.
A popular alternative strategy is the marginal structural model [104], which generalizes IPW to time-varying treatments. Saarela et al. [108] devised a Bayesian version of the marginal structural model via the Bayesian bootstrap. Because the marginal structural model relies on IPW, a key component in its implementation is to estimate the propensity scores and ensure overlap at each time point. However, overlap between different treatment paths usually becomes limited as the number of time periods increases, rendering the marginal structural model sensitive to extreme weights.
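The inverse-probability weights at the heart of the marginal structural model can be sketched for two time points as follows. The stabilized form, whose numerator uses the marginal treatment probabilities, is the standard device for taming extreme weights; the synthetic data and the empirical-frequency stand-ins for fitted propensity models are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10000
Z1 = rng.binomial(1, 0.5, n)
L = rng.binomial(1, 0.2 + 0.5 * Z1)      # time-varying confounder
Z2 = rng.binomial(1, 0.2 + 0.5 * L)      # assignment at time 2 depends on L

def cond_prob(A, mask):
    """Empirical P(A = 1 | mask) -- a stand-in for a fitted propensity model."""
    return A[mask].mean()

# Denominator: product over time of P(Z_t = observed z_t | observed history).
p2 = np.where(L == 1, cond_prob(Z2, L == 1), cond_prob(Z2, L == 0))
den = (np.where(Z1 == 1, Z1.mean(), 1 - Z1.mean())
       * np.where(Z2 == 1, p2, 1 - p2))

# Stabilized numerator: marginal treatment probabilities, ignoring L.
num = (np.where(Z1 == 1, Z1.mean(), 1 - Z1.mean())
       * np.where(Z2 == 1, Z2.mean(), 1 - Z2.mean()))

w = num / den
print(round(w.mean(), 2))  # stabilized weights average to 1.0
```

Unstabilized weights (numerator equal to 1) have the same reweighting effect but a much heavier right tail, which is exactly the extreme-weight sensitivity noted above as the number of time periods grows.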
The above discussion focuses on static treatment sequences. Another important class of time-varying treatment is the dynamic treatment regime, which consists of a sequence of decision rules, one per time point of intervention, that determines how to individualize treatments to units based on evolving treatment and covariate history. Inferring optimal dynamic treatment regimes requires combining causal inference and decision theory techniques and is closely related to reinforcement learning. See [109] for a review. Due to the space limit, we omit the discussion of the closely related topics of Bayesian multi-armed bandit [110] and Bayesian reinforcement learning [111].
8. Discussion
This paper reviews the Bayesian approach to causal inference under the potential outcomes framework. We discussed the causal estimands, identification strategies, the general structure of Bayesian inference of causal effects, and sensitivity analysis. We highlighted issues that are unique to Bayesian causal inference, including the role of the propensity score, the definition of identifiability, and the choice of priors in both low- and high-dimensional regimes. In particular, under ignorability and prior independence, the propensity score is seemingly irrelevant for the posterior distributions of the causal parameters. However, we pointed out that even in this setting, the propensity score and more generally the design stage play a central role in obtaining robust Bayesian causal inference. Regardless of the mode of inference, a critical step in causal inference with observational data is to ensure adequate covariate overlap and balance in the design or analysis stage. In high dimensions, this task is particularly challenging, and the optimal practice remains an open question.
The Bayesian approach offers several advantages for causal inference. First and most importantly, by enabling imputation of all missing potential outcomes, the Bayesian paradigm provides a unified inferential framework for any causal estimand. This is particularly appealing for inferring complex estimands such as the conditional average treatment effects or individual treatment effects, as well as partially identifiable causal estimands such as the principal strata causal effects. In contrast, the Frequentist approach to these problems needs to be customized for each scenario, and the inference usually relies on bounds or asymptotic arguments, which are often either non-informative or questionable in cases like individual treatment effects. Second, the automatic uncertainty quantification of any estimand makes it straightforward to combine causal inference and decision theory for dynamic decision-making, e.g. in personalized medicine. Third, the Bayesian approach naturally incorporates prior knowledge into a causal analysis, e.g. in evaluating spatially correlated treatments and/or outcomes. Fourth, there is a rich collection of Bayesian models for complex data with limited Frequentist counterparts. A few examples are (i) spatial or temporal data, (ii) functional data, and (iii) interference, i.e. when the SUTVA assumption is violated. In these settings, special care must be taken with issues key to causal inference, such as defining relevant estimands and ensuring overlap. Moreover, it is important to ensure that the Bayesian models are coherent with the model-free identification assumptions such as ignorability. For example, adding spatial random effects into an outcome model may inadvertently bias the coefficient of the treatment variable as the estimate of a causal effect [112]. Research on Bayesian analysis of these topics has been rapidly increasing [10,112–115] and is expected to continue to grow.
Despite the above advantages, the theory and practice of causal inference have long been dominated by non-Bayesian methods. One reason is that many popular Frequentist techniques, such as matching and weighting, as well as the Fisherian randomization test, do not require specifying outcome models or prior distributions of parameters, and thus offer a perception of being 'model-free' or 'objective'. This is appealing to many applied researchers. Another reason is that the Bayesian approach requires more advanced computing and programming, which may not be readily available to many practitioners. The Stan programming language [116] mitigates some of these issues, but Bayesian computation remains inaccessible to most domain scientists. To popularize Bayesian causal inference in practice, it is crucial to provide (i) more examples of successful Bayesian applications with clear advantages over other inferential modes, e.g. [45], (ii) accessible tutorials, ideally with generalizable computer code and illustrations of important scientific problems, and (iii) user-friendly, general-purpose software packages.
We have occasionally commented on whether a method is dogmatically Bayesian. However, we do not regard the conceptual purity of being dogmatically Bayesian, per se, as advantageous, nor should it be the motivating goal in real applications. When a quasi-Bayesian method outperforms its dogmatically Bayesian counterpart (if available) on methodological and empirical grounds, as in the example of adding the estimated propensity score to an outcome model in §5(a), we would endorse the former over the latter. We also doubt the value of devising a Bayesian version of an established Frequentist method without clear theoretical or practical advantages. As a general view, we believe that whether to choose a Bayesian approach should be dictated by its practical utility in a specific context rather than by an unconditional commitment to the Bayesian doctrine. For causal inference, and perhaps everything in statistics, being Bayesian should be a tool, not a goal.
Data accessibility
This article has no additional data.
Authors' contributions
F.L.: conceptualization, formal analysis, investigation, methodology, writing—original draft, writing—review and editing; P.D.: formal analysis, investigation, methodology, writing—original draft, writing—review and editing; F.M.: methodology, writing—review and editing.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration
We declare we have no competing interests.
Funding
P.D.’s research is partially funded by the US National Science Foundation grant no. 1945136. F.M. thanks the Department of Excellence 2018–2022 funding provided by the Italian Ministry of University and Research (MUR).
Acknowledgements
The authors are grateful to Joey Antonelli, Yuansi Chen, Sid Chib, Ruobin Gong, Guido Imbens, Zhichao Jiang, Antonio Linero, Georgia Papadogeorgou, Donald Rubin, Surya Tokdar, Mike West, Jason Xu, Cory Zigler, Anqi Zhao and two anonymous reviewers for discussions and suggestions.