# Recovering from Selection Bias in Causal and Statistical Inference

AAAI, pp. 2410-2416, 2014.


Abstract:

Selection bias is caused by preferential exclusion of units from the samples and represents a major obstacle to valid causal and statistical inferences; it cannot be removed by randomized experiments and can rarely be detected in either experimental or observational studies. In this paper, we provide complete graphical and algorithmic conditions for recoverability from selection bias.


Introduction

- Selection bias is induced by preferential selection of units for data analysis, usually governed by unknown factors including treatment, outcome, and their consequences, and represents a major obstacle to valid causal and statistical inferences
- It cannot be removed by randomized experiments and can rarely be detected in either experimental or observational studies
- When both treatment and outcome affect entry into the data pool, the resulting distribution is not recoverable – i.e., there is no method capable of unbiasedly estimating the population-level distribution using data gathered under this selection process
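To make this concrete, here is a minimal numerical sketch (with made-up probabilities, not taken from the paper) of how conditioning on inclusion S = 1 distorts P(y|x) when selection depends on both treatment and outcome:

```python
# Hypothetical population over binary treatment X and outcome Y.
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Selection probability P(S=1 | x, y) depends on BOTH X and Y.
p_s1 = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 0.8, (1, 1): 0.2}

def cond_y_given_x(x, selected=False):
    """P(Y=1 | X=x) in the population, or P(Y=1 | X=x, S=1) in the sample."""
    num = den = 0.0
    for y in (0, 1):
        w = p_xy[(x, y)] * (p_s1[(x, y)] if selected else 1.0)
        den += w
        if y == 1:
            num += w
    return num / den

print(cond_y_given_x(1))                 # population P(Y=1 | X=1) = 0.8
print(cond_y_given_x(1, selected=True))  # sample value = 0.5, badly biased
```

No reweighting of the biased sample alone can undo this gap, which is what non-recoverability means here.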

Highlights

- Selection bias is induced by preferential selection of units for data analysis, usually governed by unknown factors including treatment, outcome, and their consequences, and represents a major obstacle to valid causal and statistical inferences
- We provide conditions for recoverability from selection bias in statistical and causal inferences applicable for arbitrary structures in non-parametric settings
- Theorem 1 provides a complete characterization of recoverability when no external information is available
- Theorem 5 further gives a graphical condition for recovering causal effects, which generalizes the backdoor adjustment
- Since selection bias is a common problem across many disciplines, the methods developed in this paper should help to understand, formalize, and alleviate this problem in a broad range of data-intensive applications
- This paper complements another aspect of the generalization problem in which causal effects are transported among differing environments (Bareinboim and Pearl 2013a; 2013b)
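Theorem 5's adjustment takes the form P(y | do(x)) = Σ_z P(y | x, z, S=1) P(z), combining selection-biased conditionals with an unbiased external measurement of P(z). The sketch below (a hypothetical binary model, not from the paper) checks this identity in a structure where the confounder Z drives treatment, outcome, and selection:

```python
# Hypothetical binary model: Z -> X, Z -> Y, X -> Y, and Z -> S (selection).
pz1 = 0.4
px1_z = {0: 0.3, 1: 0.7}
py1_xz = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}
ps1_z = {0: 0.9, 1: 0.2}          # selection driven by the confounder Z

def joint_s1(z, x, y):
    """Full joint P(z, x, y, S=1), i.e. the biased data-generating process."""
    pz = pz1 if z else 1 - pz1
    px = px1_z[z] if x else 1 - px1_z[z]
    py = py1_xz[(x, z)] if y else 1 - py1_xz[(x, z)]
    return pz * px * py * ps1_z[z]

def p_y1_given_xz_s1(x, z):
    """P(Y=1 | x, z, S=1): all that the biased data can deliver."""
    return joint_s1(z, x, 1) / (joint_s1(z, x, 0) + joint_s1(z, x, 1))

def effect_recovered(x):
    """Adjustment combining biased conditionals with population-level P(z),
    assumed available from an external, unbiased source."""
    return sum(p_y1_given_xz_s1(x, z) * (pz1 if z else 1 - pz1)
               for z in (0, 1))

def effect_true(x):
    """Ground truth from the (ordinarily unobservable) unbiased model."""
    return sum(py1_xz[(x, z)] * (pz1 if z else 1 - pz1) for z in (0, 1))

print(effect_recovered(1), effect_true(1))  # the two values coincide
```

Here Z blocks the backdoor path and d-separates S from Y, so the biased conditionals equal their population counterparts and the adjustment recovers the causal effect exactly.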

Conclusion

- The authors provide conditions for recoverability from selection bias in statistical and causal inferences applicable for arbitrary structures in non-parametric settings.
- Since selection bias is a common problem across many disciplines, the methods developed in this paper should help to understand, formalize, and alleviate this problem in a broad range of data-intensive applications.
- This paper complements another aspect of the generalization problem in which causal effects are transported among differing environments (Bareinboim and Pearl 2013a; 2013b)

Related work

**Related work and Our contributions**

There are three sets of assumptions that are enlightening to acknowledge if we want to understand the procedures available in the literature for treating selection bias – qualitative assumptions about the selection mechanism, parametric assumptions regarding the data-generating model, and quantitative assumptions about the selection process.

In the data-generating model in Fig. 1(c), the selection of units into the sample is treatment-dependent, meaning that it is caused by X but not Y. This case has been studied in the literature, and Q = P(y|x) is known to be non-parametrically recoverable from selection (Greenland and Pearl 2011). Alternatively, in the data-generating model in Fig. 1(d), the selection is caused by Y (outcome-dependent); here Q is not recoverable from selection (formally shown later on), but the odds ratio is (Cornfield 1951; Whittemore 1978; Geng 1992; Didelez, Kreiner, and Keiding 2010). As mentioned earlier, Q is also not recoverable in the graph in Fig. 1(a). By and large, the literature is concerned with treatment-dependent or outcome-dependent selection, but selection might be caused by multiple factors and embedded in more intricate realities. For instance, a driver of the treatment Z (e.g., age, sex, socio-economic status) may also be causing selection; see Fig. 1(e,f). As it turns out, Q is recoverable in Fig. 1(e) but not in Fig. 1(f), so different qualitative assumptions need to be modeled explicitly, since each topology entails a different answer for recoverability.
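These two cases can be checked numerically. In the sketch below (hypothetical probabilities, not from the paper), treatment-dependent selection leaves P(y|x) intact because S is independent of Y given X, while outcome-dependent selection distorts P(y|x) but leaves the odds ratio unchanged:

```python
def selected_joint(p_xy, p_s1_given):
    """Selected-sample joint P(X, Y | S=1) for a given selection rule."""
    w = {(x, y): p_xy[(x, y)] * p_s1_given(x, y)
         for x in (0, 1) for y in (0, 1)}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

def p_y1_given_x(p, x):
    return p[(x, 1)] / (p[(x, 0)] + p[(x, 1)])

def odds_ratio(p):
    return (p[(1, 1)] * p[(0, 0)]) / (p[(1, 0)] * p[(0, 1)])

# Hypothetical population joint over binary X, Y.
pop = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Fig. 1(c)-style: selection caused by X only -> P(y|x) is recoverable.
treat_dep = selected_joint(pop, lambda x, y: 0.9 if x == 0 else 0.3)
assert abs(p_y1_given_x(treat_dep, 1) - p_y1_given_x(pop, 1)) < 1e-12

# Fig. 1(d)-style: selection caused by Y only -> P(y|x) is biased ...
out_dep = selected_joint(pop, lambda x, y: 0.9 if y == 0 else 0.3)
assert abs(p_y1_given_x(out_dep, 1) - p_y1_given_x(pop, 1)) > 0.1
# ... but the odds ratio survives selection.
assert abs(odds_ratio(out_dep) - odds_ratio(pop)) < 1e-9
```

The odds-ratio invariance holds because the Y-only selection factor cancels in the cross-product ratio, which is exactly the classical observation credited to Cornfield (1951).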

Funding

- This research was supported in parts by grants from NSF #IIS-1249822 and #IIS-1302448, and ONR #N00014-13-1-0153 and #N00014-10-1-0933

References

- Acid, S., and de Campos, L. 1996. An algorithm for finding minimum d-separating sets in belief networks. In Proceedings of the 12th Annual Conference on Uncertainty in Artificial Intelligence, 3–10. San Francisco, CA: Morgan Kaufmann.
- Angrist, J. D. 1997. Conditional independence in sample selection models. Economics Letters 54(2):103–112.
- Bareinboim, E., and Pearl, J. 2012. Controlling selection bias in causal inference. In Girolami, M., and Lawrence, N., eds., Proceedings of The Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012), 100–108. JMLR (22).
- Bareinboim, E., and Pearl, J. 2013a. Meta-transportability of causal effects: A formal approach. In Proceedings of The Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2013), 135–143. JMLR (31).
- Bareinboim, E., and Pearl, J. 2013b. Causal transportability with limited experiments. In desJardins, M., and Littman, M. L., eds., Proceedings of The Twenty-Seventh Conference on Artificial Intelligence (AAAI 2013), 95–101.
- Bareinboim, E.; Tian, J.; and Pearl, J. 2014. Recovering from selection bias in causal and statistical inference. Technical Report R-425, Cognitive Systems Laboratory, Department of Computer Science, UCLA.
- Cooper, G. 1995. Causal discovery from data in the presence of selection bias. Artificial Intelligence and Statistics 140–150.
- Cornfield, J. 1951. A method of estimating comparative rates from clinical data; applications to cancer of the lung, breast, and cervix. Journal of the National Cancer Institute 11:1269–1275.
- Cortes, C.; Mohri, M.; Riley, M.; and Rostamizadeh, A. 2008. Sample selection bias correction theory. In Proceedings of the 19th International Conference on Algorithmic Learning Theory, ALT ’08, 38–53.
- Didelez, V.; Kreiner, S.; and Keiding, N. 2010. Graphical models for inference under outcome-dependent sampling. Statistical Science 25(3):368–387.
- Elkan, C. 2001. The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’01, 973–978. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
- Geng, Z. 1992. Collapsibility of relative risk in contingency tables with a response variable. Journal Royal Statistical Society 54(2):585–593.
- Glymour, M., and Greenland, S. 2008. Causal diagrams. In Rothman, K.; Greenland, S.; and Lash, T., eds., Modern Epidemiology. Philadelphia, PA: Lippincott Williams & Wilkins, 3rd edition. 183– 209.
- Greenland, S., and Pearl, J. 2011. Adjustments and their consequences – collapsibility analysis using graphical models. International Statistical Review 79(3):401–426.
- Heckman, J. 1979. Sample selection bias as a specification error. Econometrica 47:153–161.
- Hein, M. 2009. Binary classification under sample selection bias. In Candela, J.; Sugiyama, M.; Schwaighofer, A.; and Lawrence, N., eds., Dataset Shift in Machine Learning. Cambridge, MA: MIT Press. 41–64.
- Jewell, N. P. 1991. Some surprising results about covariate adjustment in logistic regression models. International Statistical Review 59(2):227–240.
- Koller, D., and Friedman, N. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
- Kuroki, M., and Cai, Z. 2006. On recovering a population covariance matrix in the presence of selection bias. Biometrika 93(3):601–611.
- Little, R. J. A., and Rubin, D. B. 1986. Statistical Analysis with Missing Data. New York, NY, USA: John Wiley & Sons, Inc.
- Mefford, J., and Witte, J. S. 2012. The covariate’s dilemma. PLoS Genet 8(11):e1003096.
- Pearl, J., and Paz, A. 2010. Confounding equivalence in causal inference. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI 2010), 433–441.
- Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann.
- Pearl, J. 1993. Aspects of graphical models connected with causality. In Proceedings of the 49th Session of the International Statistical Institute, 391–401.
- Pearl, J. 1995. Causal diagrams for empirical research. Biometrika 82(4):669–710.
- Pearl, J. 2000. Causality: Models, Reasoning, and Inference. New York: Cambridge University Press. Second ed., 2009.
- Pearl, J. 2013. Linear models: A useful “microscope” for causal analysis. Journal of Causal Inference 1:155–170.
- Pirinen, M.; Donnelly, P.; and Spencer, C. 2012. Including known covariates can reduce power to detect genetic effects in case-control studies. Nature Genetics 44:848–851.
- Robins, J. 2001.
- Smith, A. T., and Elkan, C. 2007. Making generative classifiers robust to selection bias. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’07, 657–666. New York, NY, USA: ACM.
- Spirtes, P.; Glymour, C.; and Scheines, R. 2000. Causation, Prediction, and Search. Cambridge, MA: MIT Press, 2nd edition.
- Storkey, A. 2009. When training and test sets are different: characterising learning transfer. In Candela, J.; Sugiyama, M.; Schwaighofer, A.; and Lawrence, N., eds., Dataset Shift in Machine Learning. Cambridge, MA: MIT Press. 3–28.
- Textor, J., and Liskiewicz, M. 2011. Adjustment criteria in causal diagrams: An algorithmic perspective. In Pfeffer, A., and Cozman, F., eds., Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI 2011), 681–688. AUAI Press.
- Tian, J.; Paz, A.; and Pearl, J. 1998. Finding minimal separating sets. Technical Report R-254, University of California, Los Angeles, CA.
- Whittemore, A. 1978. Collapsibility of multidimensional contingency tables. Journal of the Royal Statistical Society, Series B 40(3):328–340.
- Zadrozny, B. 2004. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, 114. New York, NY, USA: ACM.
- Zhang, J. 2008. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artif. Intell. 172:1873–1896.

Best Paper of AAAI, 2014
