Learning Causal Effects via Weighted Empirical Risk Minimization

Yonghan Jung, Jin Tian, Elias Bareinboim

NeurIPS 2020.

TL;DR: We propose a learning objective based on weighted empirical risk minimization (WERM) theory and provide a practical learning algorithm for estimating causal effects from finite samples.

Abstract:

Learning causal effects from data is a fundamental problem across the sciences. Determining the identifiability of a target effect from a combination of the observational distribution and the causal graph underlying a phenomenon is well-understood in theory. However, in practice, it remains a challenge to apply the identification theory to…

Introduction
  • Inferring causal effects from data is a fundamental challenge that cuts across the empirical sciences [35, 47, 36].
  • One common task in the field is known as the problem of causal effect identification.
  • Consider the task of identifying the effect of X on Y, P(y | do(x)), from the causal graph G in Fig. 1a and an observational distribution P(v), where V = {Z, X, Y} is the set of observed variables.
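To make the identification task concrete, here is a minimal sketch. It assumes a standard backdoor structure (Z → X, Z → Y, X → Y) with toy probabilities, since the paper's actual Fig. 1a graph and distribution are not reproduced here; under that assumption, P(y | do(x)) is obtained by adjusting for Z.

```python
# Illustrative only: a toy backdoor graph Z -> X, Z -> Y, X -> Y with made-up
# probabilities (the paper's Fig. 1a model is not reproduced here).
# With Z observed, P(y | do(x)) = sum_z P(y | x, z) P(z).

P_z = {0: 0.6, 1: 0.4}                      # P(Z = z)
P_y1_given_xz = {(0, 0): 0.2, (0, 1): 0.5,  # P(Y = 1 | X = x, Z = z)
                 (1, 0): 0.7, (1, 1): 0.9}

def p_y1_do_x(x):
    """Backdoor adjustment: sum_z P(Y=1 | x, z) P(z)."""
    return sum(P_y1_given_xz[(x, z)] * P_z[z] for z in P_z)

print(p_y1_do_x(0))  # 0.2*0.6 + 0.5*0.4 = 0.32
print(p_y1_do_x(1))  # 0.7*0.6 + 0.9*0.4 = 0.78
```

Note how the interventional quantity differs from the observational conditional P(y | x), which would also absorb the confounding path through Z.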
Highlights
  • Inferring causal effects from data is a fundamental challenge that cuts across the empirical sciences [35, 47, 36]
  • The goal of this paper is to develop a learning framework that works for any identifiable causal functional, without the backdoor (BD)/ignorability assumption, by marrying two families of methods: the generality of graph-based causal identification (i.e., the ID algorithm) and the effectiveness of estimators produced under the principle of weighted empirical risk minimization (WERM)
  • We evaluate the proposed WERM learning framework against the plug-in estimators in Examples (1,2,3)
  • This paper aims to fill the gap from causal identification to causal estimation
  • We developed a learning framework that brings together the causal identification theory and powerful empirical risk minimization (ERM) methods
  • We proposed a learning objective based on the WERM theory and provided a practical learning algorithm for estimating causal effects from finite samples
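The core WERM idea can be sketched in a few lines. This is not the paper's exact objective; it is a minimal illustration with squared loss and a constant hypothesis class, for which the weighted-risk minimizer reduces to a weighted mean of the outcomes.

```python
# A minimal sketch of weighted empirical risk minimization (WERM), not the
# paper's exact objective. With squared loss and a constant hypothesis class,
# the weighted-risk minimizer is simply the weighted mean of the outcomes.

def werm_constant(y, w):
    """argmin_c sum_i w_i * (y_i - c)^2  ==  weighted mean of y."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

y = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0, 1.0, 1.0]    # uniform weights -> ordinary ERM (plain mean)
print(werm_constant(y, w))  # 2.5

# Reweighting retargets the objective toward a different distribution,
# e.g. hypothetical importance weights that up-weight the last two samples.
w = [0.5, 0.5, 2.0, 2.0]
print(werm_constant(y, w))  # 15.5 / 5.0 = 3.1
```

The same principle carries over to richer hypothesis classes: one fits a regression under per-sample weights chosen so that minimizing the weighted empirical risk targets the causal functional rather than the observational one.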
Methods
  • The authors consider the following two practical examples shown in Fig. 2, in addition to Example 1.
  • In the causal graph in Fig. 2a, X represents sign-up for the job-training program, Z actual participation, and Y the post-program earnings [17].
  • The authors denote by WERM-ID-R the estimator given in Algo. 2. H and H_W are set as gradient boosting regression classes.
  • The authors compare the proposed methods with the Plug-in estimator, the only natural method applicable to any causal functional, which computes each conditional probability, such as P(x | r, w), by plugging in gradient boosting regression estimates
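The plug-in baseline can be sketched as follows. The paper fits each conditional probability with gradient boosting; for brevity this sketch uses plain empirical frequencies instead, and the backdoor formula sum_z P(z) P(y | x, z) stands in for the actual identification formula, so both substitutions are assumptions of the illustration.

```python
import random

# Sketch of a plug-in estimator: estimate each conditional probability from
# data and substitute it into the identification formula. The paper uses
# gradient boosting for these estimates; empirical frequencies are used here
# for brevity, and the backdoor formula sum_z P(z) P(Y=1 | x, z) stands in
# for the paper's actual estimand.

random.seed(0)
data = []
for _ in range(20000):
    z = int(random.random() < 0.4)
    x = int(random.random() < (0.7 if z else 0.3))
    y = int(random.random() < (0.2 + 0.5 * x + 0.2 * z))
    data.append((z, x, y))

def p_hat(pred, cond=lambda r: True):
    """Empirical mean of pred(row) over rows satisfying cond."""
    rows = [r for r in data if cond(r)]
    return sum(pred(r) for r in rows) / len(rows)

def plug_in_do(x):
    # sum_z P_hat(Z = z) * P_hat(Y = 1 | X = x, Z = z)
    return sum(
        p_hat(lambda r: r[0] == z) *
        p_hat(lambda r: r[2], cond=lambda r: r[0] == z and r[1] == x)
        for z in (0, 1)
    )

print(plug_in_do(1))  # close to the true 0.2 + 0.5 + 0.2*0.4 = 0.78
```

A key drawback, which motivates WERM, is that every estimated factor enters the formula, so errors in the individual conditional models can compound.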
Results
  • The authors evaluate the proposed WERM learning framework against the plug-in estimators in Examples (1,2,3).
  • All variables are binary except that W is set to be a vector of D binary variables to represent high-dimensional covariates.
  • Example 1 (Fig. 1b): the authors test estimation of E[Y | do(x)] with D = 15, where the causal effect P(y | do(x)) is given by Eq. (1).
  • The MAAE plots are given in Fig. 3a.
  • The authors observe that the WERM-based methods (WERM-ID/WERM-ID-R) significantly outperform the Plug-in estimator
Conclusion
  • This paper aims to fill the gap from causal identification to causal estimation.
  • To this end, the authors developed a learning framework that brings together the causal identification theory and powerful ERM methods.
  • The authors proposed a learning objective based on the WERM theory and provided a practical learning algorithm for estimating causal effects from finite samples.
  • The authors hope that the conceptual framework and practical methods introduced in this work can inspire future investigation in the ML and CI communities towards the development of robust and efficient methods for learning causal effects in applied settings
Funding
  • Elias Bareinboim and Yonghan Jung were partially supported by grants from NSF IIS-1704352 and IIS-1750807 (CAREER)
  • Jin Tian was partially supported by NSF grant IIS-1704352 and ONR grant N000141712140
Study subjects and analysis
samples: 10^7
Experimental Setup

We specify an SCM M for each causal graph and generate datasets D from M. In order to estimate the ground truth μ(x) ≡ E[Y | do(x)], we generate m_int = 10^7 samples D_int from M_x, the model induced by the intervention do(X = x), and compute the mean of Y in D_int. We denote by WERM-ID-R the estimator given in Algo. 2.
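The ground-truth procedure above can be sketched directly: sample from M_x, the model with do(X = x) applied, and average Y. The SCM below is a stand-in (the paper's actual model is not reproduced), and 10^5 samples are drawn here instead of 10^7 to keep the sketch fast.

```python
import random

# Sketch of the ground-truth computation: sample from M_x, the model induced
# by do(X = x), and average Y. The structural equations below are a made-up
# stand-in for the paper's SCM, and m_int is reduced from 10^7 to 10^5.

def sample_y_under_do(x, rng):
    z = int(rng.random() < 0.4)  # Z keeps its natural mechanism
    # X is *set* to x: its structural equation is cut, so Z no longer affects it
    y = int(rng.random() < (0.2 + 0.5 * x + 0.2 * z))
    return y

rng = random.Random(0)
m_int = 10**5
mu_hat = sum(sample_y_under_do(1, rng) for _ in range(m_int)) / m_int
print(mu_hat)  # close to the true E[Y | do(1)] = 0.78 for this toy SCM
```

Cutting X's structural equation while keeping the others is exactly what distinguishes the interventional model M_x from the observational model M.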

datasets: 100
For each estimator μ̂ ∈ {μ̂_IDR, μ̂_ID, μ̂_plug}, we compute the average absolute error (AAE) as |μ̂(x) − μ(x)| averaged over x. We generate 100 datasets for each sample size m. We call the median of the 100 AAEs the median average absolute error (MAAE), and its plot vs. the sample size m the MAAE plot
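The error metric just described can be sketched in a few lines: the AAE of one dataset averages |μ̂(x) − μ(x)| over the treatment values x, and the MAAE is the median AAE across replicated datasets. The estimate values below are hypothetical toy numbers, not the paper's results.

```python
import statistics

# Sketch of the AAE/MAAE metric: AAE averages the absolute estimation error
# over treatment values x; MAAE is the median AAE across replicated datasets.
# All numbers below are hypothetical, not results from the paper.

def aae(mu_hat, mu):
    """Average absolute error over treatment values x."""
    return sum(abs(mu_hat[x] - mu[x]) for x in mu) / len(mu)

mu_truth = {0: 0.32, 1: 0.78}
# Hypothetical estimates from three replicated datasets (the paper uses 100):
estimates = [{0: 0.30, 1: 0.80}, {0: 0.35, 1: 0.74}, {0: 0.33, 1: 0.77}]

aaes = [aae(m, mu_truth) for m in estimates]
maae = statistics.median(aaes)
print([round(a, 3) for a in aaes])  # [0.02, 0.035, 0.01]
print(round(maae, 3))               # 0.02
```

Using the median rather than the mean of the AAEs makes the summary robust to the occasional dataset on which an estimator fails badly.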

References
  • H. Bang and J. M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
  • E. Bareinboim and J. Pearl. Causal inference by surrogate experiments: z-identifiability. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pages 113–120. AUAI Press, 2012.
  • E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.
  • S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, volume 19, page 137. MIT Press, 2007.
  • R. Bhattacharya, R. Nabi, and I. Shpitser. Semiparametric inference for causal effects in graphical models with hidden variables. arXiv preprint arXiv:2003.12659, 2020.
  • J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In Advances in neural information processing systems, pages 129–136, 2008.
  • L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.
  • J. Byrd and Z. Lipton. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning, pages 872–881, 2019.
  • G. Casella and R. L. Berger. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.
  • T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
  • C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. Advances in Neural Information Processing Systems, 23:442–450, 2010.
  • C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
  • C. Cortes, M. Mohri, and D. Storcheus. Regularized gradient boosting. In Advances in Neural Information Processing Systems, pages 5450–5459, 2019.
  • R. M. Daniel, S. Cousens, B. De Stavola, M. G. Kenward, and J. Sterne. Methods for dealing with time-dependent confounding. Statistics in medicine, 32(9):1584–1618, 2013.
  • I. R. Fulcher, I. Shpitser, S. Marealle, and E. J. Tchetgen Tchetgen. Robust inference on population indirect causal effects: the generalized front door criterion. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1):199–214, 2020.
  • A. N. Glynn and K. Kashin. Front-door versus back-door adjustment with unmeasured confounding: Bias formulas for front-door and hybrid adjustments with application to a job training program. Journal of the American Statistical Association, 113(523):1040–1049, 2018.
  • A. Gretton, A. J. Smola, J. Huang, M. Schmittfull, K. M. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching. In Dataset Shift in Machine Learning, pages 131–160. MIT Press, 2009.
  • I. Guyon, A. Saffari, G. Dror, and G. Cawley. Model selection: Beyond the bayesian/frequentist divide. Journal of Machine Learning Research, 11(Jan):61–87, 2010.
  • N. Hassanpour and R. Greiner. Counterfactual regression with importance sampling weights. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 5880–5887. AAAI Press, 2019.
  • J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
  • Y. Huang and M. Valtorta. Pearl’s calculus of intervention is complete. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pages 217–224. AUAI Press, 2006.
  • M. D. Hughes, M. J. Daniels, M. A. Fischl, S. Kim, and R. T. Schooley. Cd4 cell count as a surrogate endpoint in hiv clinical trials: a meta-analysis of studies of the aids clinical trials group. Aids, 12(14):1823–1832, 1998.
  • A. Jaber, J. Zhang, and E. Bareinboim. Causal identification under markov equivalence. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, 2018.
  • F. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. In International conference on machine learning, pages 3020–3029, 2016.
  • F. D. Johansson, N. Kallus, U. Shalit, and D. Sontag. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598, 2018.
  • Y. Jung, J. Tian, and E. Bareinboim. Estimating causal effects using weighting-based estimators. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020.
  • Y. Jung, J. Tian, and E. Bareinboim. Learning causal effects via weighted empirical risk minimization. Technical report, 2020.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • S. Lee, J. Correa, and E. Bareinboim. Generalized transportability: Synthesis of experiments from heterogeneous domains. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020.
  • S. Lee, J. D. Correa, and E. Bareinboim. General identifiability with arbitrary surrogate experiments. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2019.
  • Y. Liu, O. Gottesman, A. Raghu, M. Komorowski, A. A. Faisal, F. Doshi-Velez, and E. Brunskill. Representation balancing mdps for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2644–2653, 2018.
  • B. London and T. Sandler. Bayesian counterfactual risk minimization. In International Conference on Machine Learning, pages 4125–4133, 2019.
  • J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–710, 1995.
  • J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000. 2nd edition, 2009.
  • J. Pearl and D. Mackenzie. The book of why: the new science of cause and effect. Basic Books, 2018.
  • J. Pearl and J. Robins. Probabilistic evaluation of sequential plans from causal models with hidden variables. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pages 444–453. Morgan Kaufmann Publishers Inc., 1995.
  • D. Pollard. Convergence of Stochastic Processes. David Pollard, 1984.
  • J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.
  • J. Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical modelling, 7(9-12):1393–1512, 1986.
  • J. M. Robins, M. A. Hernan, and B. Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5), 2000.
  • P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • D. B. Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of statistics, pages 34–58, 1978.
  • U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3076–3085. JMLR. org, 2017.
  • H. Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
  • I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the 21st AAAI Conference on Artificial Intelligence, page 1219. AAAI Press, 2006.
  • P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2001.
  • A. Swaminathan and T. Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32th International Conference on Machine Learning, pages 814–823, 2015.
  • J. Tian and J. Pearl. A general identification condition for causal effects. In Proceedings of the 18th National Conference on Artificial Intelligence, pages 567–573, 2002.
  • J. Tian and J. Pearl. On the testable implications of causal models with hidden variables. In Proceedings of the 18th conference on Uncertainty in artificial intelligence, pages 519–527. Morgan Kaufmann Publishers Inc., 2002.
  • J. Tian and J. Pearl. On the identification of causal effects. Technical Report R-290-L, 2003.
  • M. J. Van Der Laan and D. Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1), 2006.
  • V. Vapnik. Principles of risk minimization for learning theory. In Advances in neural information processing systems, pages 831–838, 1992.
  • R. Vogel, M. Achab, S. Clémençon, and C. Tillier. Weighted empirical risk minimization: Sample selection bias correction based on importance sampling. arXiv preprint arXiv:2002.05145, 2020.
  • H. Zhao and R. Tachet. On learning invariant representations for domain adaptation. In Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Y. Zhao, D. Zeng, A. J. Rush, and M. R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.