# Learning Causal Effects via Weighted Empirical Risk Minimization

NeurIPS 2020.

Abstract:

Learning causal effects from data is a fundamental problem across the sciences. Determining the identifiability of a target effect from a combination of the observational distribution and the causal graph underlying a phenomenon is well understood in theory. However, in practice, it remains a challenge to apply the identification theory to...

Introduction

- Inferring causal effects from data is a fundamental challenge that cuts across the empirical sciences [35, 47, 36].
- One common task in the field is known as the problem of causal effect identification.
- Consider the task of identifying the effect of X on Y, P(y|do(x)), from the causal graph G in Fig. 1a and an observational distribution P(v), where V = {Z, X, Y} is the set of observed variables.

Highlights

- Inferring causal effects from data is a fundamental challenge that cuts across the empirical sciences [35, 47, 36]
- The goal of this paper is to develop a learning framework that could work for any identifiable causal functional without the BD/ignorability assumption, by marrying two families of methods, benefiting from the generality of the causal identification methods based on graphs (i.e., ID) and the effectiveness of the estimators produced based on the principle of weighted empirical risk minimization (WERM)
- We evaluate the proposed WERM learning framework against the plug-in estimators in Examples (1,2,3)
- This paper aims to fill the gap from causal identification to causal estimation
- We developed a learning framework that brings together the causal identification theory and powerful empirical risk minimization (ERM) methods
- We proposed a learning objective based on the WERM theory and provided a practical learning algorithm for estimating causal effects from finite samples
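
As a rough illustration of the weighted ERM idea, the sketch below works through the simplest special case only: a hypothetical binary back-door model, which is narrower than the general identifiable functionals the framework targets. The simulator, its parameters, and the helper names `simulate`, `propensity`, and `werm_fit` are assumptions made for this example and do not come from the paper. The sketch also uses the simulator's true inverse-propensity weights rather than learned ones. Under those assumptions, minimizing a weighted squared loss over functions of x gives an estimate of E[Y|do(x)].

```python
import random

def simulate(m, seed=0):
    """Draw m samples (z, x, y) from a hypothetical back-door SCM."""
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        z = int(rng.random() < 0.4)
        x = int(rng.random() < (0.7 if z else 0.2))
        y = int(rng.random() < 0.1 + 0.3 * x + 0.4 * z)
        data.append((z, x, y))
    return data

def propensity(x, z):
    """True P(x | z) of the simulator (assumed known here; normally estimated)."""
    p1 = 0.7 if z else 0.2
    return p1 if x == 1 else 1.0 - p1

def werm_fit(data):
    """Minimize sum_i w_i * (h(x_i) - y_i)^2 over tables h: {0, 1} -> R.

    For squared loss the minimizer is the weighted mean of y within each x.
    """
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for z, x, y in data:
        w = 1.0 / propensity(x, z)  # inverse-propensity weight
        num[x] += w * y
        den[x] += w
    return {x: num[x] / den[x] for x in (0, 1)}
```

For this simulator the interventional means are 0.56 for x = 1 and 0.26 for x = 0, so `werm_fit(simulate(50_000))` should land near those values, whereas the unweighted mean of y among samples with x = 1 would sit near 0.68.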

Methods

- The authors consider the following two practical examples shown in Fig. 2, in addition to Example 1.
- In the causal graph in Fig. 2a, X represents sign-up for the job-training program, Z actual participation, and Y the post-program earnings [17].
- The authors denote by WERM-ID-R the estimator given in Algo. 2.
- H and H_W are set as the gradient boosting regression classes.
- The authors compare the proposed methods with the Plug-in estimator, the only natural method applicable to any causal functional, which computes each conditional probability, such as P(x|r, w), by plugging in a gradient boosting regression.
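
To make the plug-in idea concrete, here is a minimal sketch under two loudly stated assumptions: the target is the back-door functional sum_z P(y|x, z) P(z), chosen only because it is the simplest case, and each conditional probability is estimated by empirical frequencies instead of the gradient boosting regression used in the paper. The counts and the helper names `count` and `plug_in_effect` are hypothetical.

```python
# Hypothetical counts n[(z, x, y)] from an observational dataset (illustrative only).
counts = {
    (0, 0, 0): 40, (0, 0, 1): 10,
    (0, 1, 0): 6,  (0, 1, 1): 4,
    (1, 0, 0): 5,  (1, 0, 1): 5,
    (1, 1, 0): 6,  (1, 1, 1): 24,
}

def count(z=None, x=None, y=None):
    """Number of samples matching the given values (None means 'any')."""
    return sum(
        n for (zi, xi, yi), n in counts.items()
        if (z is None or zi == z) and (x is None or xi == x) and (y is None or yi == y)
    )

def plug_in_effect(x):
    """Plug-in estimate of E[Y | do(x)] = sum_z P(Y=1 | x, z) P(z)."""
    total = count()
    estimate = 0.0
    for z in (0, 1):
        p_z = count(z=z) / total                               # estimated P(z)
        p_y_given_xz = count(z=z, x=x, y=1) / count(z=z, x=x)  # estimated P(Y=1 | x, z)
        estimate += p_y_given_xz * p_z
    return estimate
```

Each factor of the functional is estimated separately and then multiplied, which is what distinguishes this approach from fitting a single reweighted risk.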

Results

- The authors evaluate the proposed WERM learning framework against the plug-in estimators in Examples (1,2,3).
- All variables are binary except that W is set to be a vector of D binary variables to represent high-dimensional covariates.
- Example 1 (Fig. 1b).
- The authors test on estimating E[Y|do(x)] with D = 15, where the causal effect P(y|do(x)) is given by Eq. (1).
- The MAAE plots are given in Fig. 3a.
- The authors observe that the WERM-based methods (WERM-ID/WERM-ID-R) significantly outperform Plug-in.

Conclusion

- This paper aims to fill the gap from causal identification to causal estimation.
- To this end, the authors developed a learning framework that brings together the causal identification theory and powerful ERM methods.
- The authors proposed a learning objective based on the WERM theory and provided a practical learning algorithm for estimating causal effects from finite samples.
- The authors hope that the conceptual framework and practical methods introduced in this work can inspire future investigation in the ML and CI communities towards the development of robust and efficient methods for learning causal effects in applied settings

Funding

- Elias Bareinboim and Yonghan Jung were partially supported by grants from NSF IIS-1704352 and IIS-1750807 (CAREER)
- Jin Tian was partially supported by NSF grant IIS-1704352 and ONR grant N000141712140

Study subjects and analysis

samples: 10^7

**Experiments Setup**

We specify an SCM $M$ for each causal graph and generate datasets $D$ from $M$. To estimate the ground truth $\mu(x) \equiv E[Y \mid do(x)]$, we generate $m_{int} = 10^7$ samples $D_{int}$ from $M_x$, the model induced by the intervention $do(X = x)$, and compute the mean of $Y$ in $D_{int}$. We denote by WERM-ID-R the estimator given in Algo. 2.
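
The ground-truth step can be sketched as follows. The SCM here is a hypothetical three-variable binary model, not one of the paper's, and the default number of interventional samples is kept far below the paper's so the example runs quickly. The names `sample_scm` and `ground_truth` are assumptions for this illustration.

```python
import random

def sample_scm(rng, do_x=None):
    """One draw from a hypothetical SCM M; passing do_x samples from M_x instead."""
    z = int(rng.random() < 0.4)                       # Z := f_Z(U_Z)
    if do_x is None:
        x = int(rng.random() < (0.7 if z else 0.2))   # X := f_X(Z, U_X)
    else:
        x = do_x                                      # intervention replaces f_X
    y = int(rng.random() < 0.1 + 0.3 * x + 0.4 * z)   # Y := f_Y(X, Z, U_Y)
    return z, x, y

def ground_truth(x, m_int=200_000, seed=0):
    """Monte Carlo estimate of mu(x) = E[Y | do(x)]: the mean of Y in D_int."""
    rng = random.Random(seed)
    total = 0
    for _ in range(m_int):
        _, _, y = sample_scm(rng, do_x=x)
        total += y
    return total / m_int
```

Because the intervention overrides the mechanism for X while leaving the others untouched, the sample mean of Y approximates the interventional mean directly, with no adjustment formula needed.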

datasets: 100

For each $\hat{\mu} \in \{\hat{\mu}_{IDR}, \hat{\mu}_{ID}, \hat{\mu}_{plug}\}$, we compute the average absolute error (AAE) as $|\mu(x) - \hat{\mu}(x)|$ averaged over $x$. We generate 100 datasets for each sample size $m$. We call the median of the 100 AAEs the median average absolute error (MAAE), and its plot against the sample size $m$ the MAAE plot.
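
The two metrics described above can be written down in a few lines. This is a minimal sketch; the function names and the dictionary representation (one value per intervention x) are assumptions made for the example.

```python
from statistics import median

def average_absolute_error(mu_true, mu_hat):
    """AAE: |mu(x) - mu_hat(x)| averaged over the values x."""
    return sum(abs(mu_true[x] - mu_hat[x]) for x in mu_true) / len(mu_true)

def maae(mu_true, estimates):
    """MAAE: median of the AAEs, with one estimate (and one AAE) per dataset."""
    return median(average_absolute_error(mu_true, mu_hat) for mu_hat in estimates)

# Made-up numbers: a ground truth and estimates from three datasets.
mu_true = {0: 0.26, 1: 0.56}
estimates = [
    {0: 0.30, 1: 0.50},  # AAE = 0.05
    {0: 0.26, 1: 0.58},  # AAE = 0.01
    {0: 0.20, 1: 0.56},  # AAE = 0.03
]
result = maae(mu_true, estimates)  # median of the three AAEs
```

Using the median rather than the mean keeps a few badly behaved datasets from dominating the curve at a given sample size.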

References

- H. Bang and J. M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
- E. Bareinboim and J. Pearl. Causal inference by surrogate experiments: z-identifiability. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pages 113–120. AUAI Press, 2012.
- E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.
- S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
- S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, volume 19, page 137. MIT Press, 2007.
- R. Bhattacharya, R. Nabi, and I. Shpitser. Semiparametric inference for causal effects in graphical models with hidden variables. arXiv preprint arXiv:2003.12659, 2020.
- J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In Advances in neural information processing systems, pages 129–136, 2008.
- L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.
- J. Byrd and Z. Lipton. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning, pages 872–881, 2019.
- G. Casella and R. L. Berger. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.
- T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
- C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. Advances in Neural Information Processing Systems, 23:442–450, 2010.
- C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
- C. Cortes, M. Mohri, and D. Storcheus. Regularized gradient boosting. In Advances in Neural Information Processing Systems, pages 5450–5459, 2019.
- R. M. Daniel, S. Cousens, B. De Stavola, M. G. Kenward, and J. Sterne. Methods for dealing with time-dependent confounding. Statistics in medicine, 32(9):1584–1618, 2013.
- I. R. Fulcher, I. Shpitser, S. Marealle, and E. J. Tchetgen Tchetgen. Robust inference on population indirect causal effects: the generalized front door criterion. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1):199–214, 2020.
- A. N. Glynn and K. Kashin. Front-door versus back-door adjustment with unmeasured confounding: Bias formulas for front-door and hybrid adjustments with application to a job training program. Journal of the American Statistical Association, 113(523):1040–1049, 2018.
- A. Gretton, A. J. Smola, J. Huang, M. Schmittfull, K. M. Borgwardt, and B. Schöllkopf. Covariate shift by kernel mean matching. In Dataset shift in machine learning, pages 131–160. MIT Press, 2009.
- I. Guyon, A. Saffari, G. Dror, and G. Cawley. Model selection: Beyond the Bayesian/frequentist divide. Journal of Machine Learning Research, 11(Jan):61–87, 2010.
- N. Hassanpour and R. Greiner. Counterfactual regression with importance sampling weights. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 5880–5887. AAAI Press, 2019.
- J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
- Y. Huang and M. Valtorta. Pearl’s calculus of intervention is complete. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pages 217–224. AUAI Press, 2006.
- M. D. Hughes, M. J. Daniels, M. A. Fischl, S. Kim, and R. T. Schooley. CD4 cell count as a surrogate endpoint in HIV clinical trials: a meta-analysis of studies of the AIDS Clinical Trials Group. AIDS, 12(14):1823–1832, 1998.
- A. Jaber, J. Zhang, and E. Bareinboim. Causal identification under Markov equivalence. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, 2018.
- F. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. In International conference on machine learning, pages 3020–3029, 2016.
- F. D. Johansson, N. Kallus, U. Shalit, and D. Sontag. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598, 2018.
- Y. Jung, J. Tian, and E. Bareinboim. Estimating causal effects using weighting-based estimators. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020.
- Y. Jung, J. Tian, and E. Bareinboim. Learning causal effects via weighted empirical risk minimization. Technical report, 2020.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- S. Lee, J. Correa, and E. Bareinboim. Generalized transportability: Synthesis of experiments from heterogeneous domains. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020.
- S. Lee, J. D. Correa, and E. Bareinboim. General identifiability with arbitrary surrogate experiments. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2019.
- Y. Liu, O. Gottesman, A. Raghu, M. Komorowski, A. A. Faisal, F. Doshi-Velez, and E. Brunskill. Representation balancing MDPs for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2644–2653, 2018.
- B. London and T. Sandler. Bayesian counterfactual risk minimization. In International Conference on Machine Learning, pages 4125–4133, 2019.
- J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–710, 1995.
- J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000. 2nd edition, 2009.
- J. Pearl and D. Mackenzie. The book of why: the new science of cause and effect. Basic Books, 2018.
- J. Pearl and J. Robins. Probabilistic evaluation of sequential plans from causal models with hidden variables. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pages 444–453. Morgan Kaufmann Publishers Inc., 1995.
- D. Pollard. Convergence of Stochastic Processes. David Pollard, 1984.
- J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.
- J. Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical modelling, 7(9-12):1393–1512, 1986.
- J. M. Robins, M. A. Hernan, and B. Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5), 2000.
- P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
- D. B. Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of statistics, pages 34–58, 1978.
- U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3076–3085. JMLR. org, 2017.
- H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
- I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the 21st AAAI Conference on Artificial Intelligence, page 1219. AAAI Press, 2006.
- P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2001.
- A. Swaminathan and T. Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32nd International Conference on Machine Learning, pages 814–823, 2015.
- J. Tian and J. Pearl. A general identification condition for causal effects. In Proceedings of the 18th National Conference on Artificial Intelligence, pages 567–573, 2002.
- J. Tian and J. Pearl. On the testable implications of causal models with hidden variables. In Proceedings of the 18th conference on Uncertainty in artificial intelligence, pages 519–527. Morgan Kaufmann Publishers Inc., 2002.
- J. Tian and J. Pearl. On the identification of causal effects. Technical Report R-290-L, 2003.
- M. J. Van Der Laan and D. Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1), 2006.
- V. Vapnik. Principles of risk minimization for learning theory. In Advances in neural information processing systems, pages 831–838, 1992.
- R. Vogel, M. Achab, S. Clémençon, and C. Tillier. Weighted empirical risk minimization: Sample selection bias correction based on importance sampling. arXiv preprint arXiv:2002.05145, 2020.
- H. Zhao and R. Tachet. On learning invariant representations for domain adaptation. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Y. Zhao, D. Zeng, A. J. Rush, and M. R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.
