# Causal Estimation with Functional Confounders

NIPS 2020, pp. 5115-5125

Abstract

Causal inference relies on two fundamental assumptions: ignorability and positivity. We study causal inference when the true confounder value can be expressed as a function of the observed data; we call this setting estimation with functional confounders (EFC). In this setting, ignorability is satisfied; however, positivity is violated, ...

Introduction

- Determining the effect of interventions on outcomes using observational data lies at the core of many fields like medicine, economic policy, and genomics.
- Variables that affect both the intervention and the outcome, called confounders, may be unobserved.
- A necessary condition for the causal effect to be identified is that all confounders are observed; this condition is called ignorability.
- A sufficient condition for causal effect estimation is adequate variation in the intervention after conditioning on the confounders; this condition is called positivity.
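
The positivity condition above can be made concrete with a toy enumeration. The sketch below is an illustration only: it assumes a binary intervention of three coordinates and a hypothetical confounder function `confounder(t) = sum(t)`, neither of which comes from the paper. It shows that once the confounder is a function of the intervention, conditioning on a confounder value leaves only the interventions in that level set, so positivity cannot hold.

```python
from itertools import product

def confounder(t):
    # Hypothetical functional confounder: the sum of the intervention's coordinates.
    return sum(t)

# All binary interventions of length 3 (a toy assumption, not the paper's setup).
interventions = list(product([0, 1], repeat=3))

# Group interventions by confounder value; each group is a level set of the function.
level_sets = {}
for t in interventions:
    level_sets.setdefault(confounder(t), []).append(t)

for c, ts in sorted(level_sets.items()):
    print(f"c={c}: {len(ts)} of {len(interventions)} interventions are possible")

# Positivity would require every intervention to be possible for every confounder value.
positivity_holds = all(len(ts) == len(interventions) for ts in level_sets.values())
print("positivity holds:", positivity_holds)
```

Within each confounder value, the intervention can only vary along the level set, which is the variation that any estimator in this setting has to work with.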

Highlights

- We develop a new general setting of observational causal effect estimation, called estimation with functional confounders (EFC), in which the confounder can be expressed as a function of the data, meaning positivity is violated.
- When positivity is violated in traditional observational causal inference (OBS-CI), not all effects are estimable without further assumptions.
- We develop a sufficient condition, called functional positivity (F-POSITIVITY), to estimate the effects of functional interventions.
- Given a confounder value, C-REDUNDANCY allows us to compute a surrogate intervention such that the conditional effect of the surrogate is equal to that of the original intervention. We show that such surrogate interventions exist only under a condition that we call Effect Connectivity, which is necessary for nonparametric effect estimation in EFC.

Methods

- The authors first evaluate Level-set Orthogonal Descent Estimation (LODE) on simulated data and show that it can correct for confounding.
- The authors investigate different properties of LODE on simulated data, where ground truth is available.
- In the simulations, the intervention t has dimension T = 20 and the outcome noise is η ∼ N(0, 0.1).
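
A minimal sketch of what a simulated dataset with a functional confounder might look like is given below. Only T = 20 and the noise distribution come from the text above. The Gaussian intervention, the confounder function, the outcome function, and the reading of 0.1 as a variance (rather than a standard deviation) are all assumptions made for illustration and are not the authors' data-generating process.

```python
import math
import random

T = 20            # dimension of the intervention t, as stated above
NOISE_VAR = 0.1   # assumption: the second argument of N(0, 0.1) is a variance

def simulate(n, seed=0):
    """Draw n samples (t, c, y) from a hypothetical functional-confounder model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = [rng.gauss(0.0, 1.0) for _ in range(T)]   # hypothetical intervention distribution
        c = sum(t) / math.sqrt(T)                     # hypothetical confounder, a function of t
        eta = rng.gauss(0.0, math.sqrt(NOISE_VAR))    # outcome noise
        y = t[0] + c + eta                            # hypothetical outcome function
        rows.append((t, c, y))
    return rows

data = simulate(1000)
print(len(data), len(data[0][0]))
```

Because the confounder is computed from t itself, it is always observed, yet no two samples with the same t can have different confounder values.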

Results

- The authors select relevant SNPs by thresholding estimated effects at a magnitude > 0.1.
- Of 1050 SNPs (1000 not previously reported), LODE returned 31, of which 13 were previously reported as being associated with celiac disease [8, 25, 14, 1].
- In Appendix B.2, the authors plot the true positive and false negative rates of identifying previously reported SNPs as a function of the effect threshold.
- In Table 1, the authors list a few SNPs that were both deemed relevant by LODE and reported in existing literature.
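
The selection rule and the threshold sweep described above can be sketched as follows. The SNP names and effect values are toy placeholders, not estimates from the paper, and the rate definitions (fractions of the previously reported SNPs that are or are not selected) are an assumed reading of the appendix plot.

```python
def select_snps(effects, threshold=0.1):
    """Return the SNP ids whose estimated effect exceeds the threshold in magnitude."""
    return {snp for snp, effect in effects.items() if abs(effect) > threshold}

def rates(effects, reported, threshold):
    """True positive and false negative rates for recovering previously reported SNPs."""
    selected = select_snps(effects, threshold)
    true_pos = len(selected & reported)
    false_neg = len(reported - selected)
    return true_pos / len(reported), false_neg / len(reported)

# Toy placeholder values, not estimates from the paper.
effects = {"snp_a": 0.25, "snp_b": -0.15, "snp_c": 0.05, "snp_d": -0.02}
reported = {"snp_a", "snp_c"}

for threshold in (0.01, 0.1, 0.2):
    tpr, fnr = rates(effects, reported, threshold)
    print(f"threshold={threshold}: TPR={tpr}, FNR={fnr}")
```

Under these definitions the two rates always sum to one, so raising the threshold trades recovered reported SNPs against a shorter list of selections.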

Conclusion

- When positivity is violated in traditional OBS-CI, not all effects are estimable without further assumptions.
- In such cases, practitioners have to turn to parametric models to estimate causal effects.
- The authors develop a sufficient condition called functional positivity (F-POSITIVITY) to estimate effects of functional interventions.
- Such effects could be of independent interest, for example the effect of the cumulative dosage of a drug rather than the joint effects of multiple dosages at different times.
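
The cumulative-dosage example can be read as a function applied to the full intervention. In the sketch below, the function name and the dose values are hypothetical. It only illustrates that distinct dosage schedules can map to the same functional intervention, so an effect defined on the cumulative dose does not distinguish between them.

```python
def cumulative_dosage(doses):
    """Hypothetical functional intervention: total dose across all time points."""
    return sum(doses)

# Two different schedules over three time points (illustrative values).
schedule_a = [10, 20, 30]
schedule_b = [30, 20, 10]

same_functional_intervention = cumulative_dosage(schedule_a) == cumulative_dosage(schedule_b)
print(same_functional_intervention)
```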


- Table 1: A few SNPs previously reported as relevant and recovered by LODE, with estimated effects and Lasso coefficients. LODE produces effect estimates that do not rely purely on the coefficients.

Related work

- The goal of genome-wide association studies (GWAS) is to estimate the effect of genetic variations (also called single nucleotide polymorphisms (SNPs)) on the phenotype [29]. The ancestry of the subjects acts as a confounder in GWAS. In GWAS practice, principal component analysis (PCA) and linear mixed models (LMMs) are used to compute this confounding structure [19, 31]. Lippert et al. [15] suggest estimating the confounders and effects on separate subsets of the SNPs. This separation disregards the confounding that is captured in the interaction of the two subsets of SNPs. GWAS is a special case of effects from multiple treatments (MTE), where the confounder value is specified via optimization as a function of the pre-outcome variables [20, 30]. In all these settings, positivity is violated and not all effects are estimable. We provide an avenue for nonparametric effect estimation of the full intervention under a new sufficient condition.
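
To illustrate how PCA can expose ancestry as a function of the SNPs themselves, the sketch below computes scores on the leading principal component by power iteration, using only the Python standard library. It is not the authors' pipeline. The two-group toy genotypes, the allele frequencies, and the function names are assumptions made for this example.

```python
import random

def leading_pc_scores(rows, iters=100, seed=0):
    """Scores of each row on the leading principal component, via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(iters):
        # One power-iteration step on X^T X: first X v, then X^T (X v), then normalise.
        scores = [sum(x[j] * v[j] for j in range(p)) for x in centred]
        v = [sum(centred[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = sum(a * a for a in v) ** 0.5
        v = [a / norm for a in v]
    return [sum(x[j] * v[j] for j in range(p)) for x in centred]

def toy_genotypes(n_per_group=20, n_snps=30, seed=1):
    """Genotypes in {0, 1, 2} for two hypothetical ancestry groups with different allele frequencies."""
    rng = random.Random(seed)
    rows = []
    for freq in (0.1, 0.9):
        for _ in range(n_per_group):
            rows.append([sum(rng.random() < freq for _ in range(2)) for _ in range(n_snps)])
    return rows

rows = toy_genotypes()
scores = leading_pc_scores(rows)
group_0, group_1 = scores[:20], scores[20:]
print("group 0 score range:", min(group_0), max(group_0))
print("group 1 score range:", min(group_1), max(group_1))
```

The scores are a deterministic function of the genotypes, which is the sense in which a PCA-derived confounder is a functional confounder: conditioning on it restricts which genotype vectors are possible.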

Funding

- The authors were partly supported by NIH/NHLBI Award R01HL148248, and by NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science

Study subjects and analysis

cases: 3796

In this experiment, we explore the associations between genetic factors and celiac disease. We use data from the Wellcome Trust Celiac disease GWAS dataset [8, 6], consisting of individuals with celiac disease, called cases (n = 3796), and controls (n = 8154). We construct our dataset by filtering from the ∼550,000 SNPs

people: 11950

The only preprocessing in our experiments is linkage disequilibrium pruning of adjacent SNPs (at R² = 0.5) and PLINK [5] quality control. After this, 337,642 SNPs remain for 11,950 people. We imputed missing SNPs for each person by sampling from the marginal distribution of that SNP.
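
The imputation step described above can be sketched as resampling each missing entry from the observed values of the same SNP. The list-of-lists layout, the use of `None` for missing entries, and the function name are assumptions made for this example, not the authors' code.

```python
import random

def impute_marginal(genotypes, seed=0):
    """Fill missing entries (None) in each SNP column by sampling from that column's observed values.

    Assumes every SNP column has at least one observed value.
    """
    rng = random.Random(seed)
    n_snps = len(genotypes[0])
    imputed = [list(row) for row in genotypes]   # copy so the input is left unchanged
    for j in range(n_snps):
        observed = [row[j] for row in genotypes if row[j] is not None]
        for row in imputed:
            if row[j] is None:
                # Sampling uniformly from observed values draws from the empirical marginal.
                row[j] = rng.choice(observed)
    return imputed

# Toy genotype matrix: rows are people, columns are SNPs, None marks a missing call.
data = [[0, None, 2], [1, 1, None], [None, 1, 2], [2, 0, 2]]
filled = impute_marginal(data)
print(filled)
```

Sampling from the marginal keeps each SNP's allele distribution roughly intact but ignores correlations between SNPs.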

Reference

- Svetlana Adamovic, SS Amundsen, BA Lie, AH Gudjonsdottir, H Ascher, J Ek, DA Van Heel, S Nilsson, LM Sollid, and A Torinsson Naluai. Association study of il2/il21 and fcgriia: significant association with the il2/il21 region in scandinavian coeliac disease families. Genes and immunity, 9(4):364, 2008.
- Carl A Anderson, Gabrielle Boucher, Charlie W Lees, Andre Franke, Mauro D’Amato, Kent D Taylor, James C Lee, Philippe Goyette, Marcin Imielinski, Anna Latiano, et al. Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47. Nature genetics, 43(3):246, 2011.
- Uri M Ascher and Linda R Petzold. Computer methods for ordinary differential equations and differential-algebraic equations, volume 61.
- William Astle, David J Balding, et al. Population structure and cryptic relatedness in genetic association studies. Statistical Science, 24(4):451–471, 2009.
- Christopher C Chang, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee. Second-generation plink: rising to the challenge of larger and richer datasets. Gigascience, 4(1):s13742–015, 2015.
- Wellcome Trust Case Control Consortium et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661, 2007.
- J. Correa and E. Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.
- Patrick CA Dubois, Gosia Trynka, Lude Franke, Karen A Hunt, Jihane Romanos, Alessandra Curtotti, Alexandra Zhernakova, Graham AR Heap, Roza Adany, Arpo Aromaa, et al. Multiple common variants for celiac disease influencing immune gene expression. Nature genetics, 42 (4):295, 2010.
- Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of Science, 74(5):981–995, 2007.
- Miguel A Hernan and James M Robins. Causal inference: what if. Boca Raton: Chapman & Hall/CRC, 2020.
- Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011. doi: 10.1198/jcgs.2010.08162. URL https://doi.org/10.1198/jcgs.2010.08162.
- Lucia A Hindorff, Praveen Sethupathy, Heather A Junkins, Erin M Ramos, Jayashri P Mehta, Francis S Collins, and Teri A Manolio. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences, 106(23):9362–9367, 2009.
- Morris W Hirsch, Robert L Devaney, and Stephen Smale. Differential equations, dynamical systems, and linear algebra, volume 60. Academic press, 1974.
- Karen A Hunt, Alexandra Zhernakova, Graham Turner, Graham AR Heap, Lude Franke, Marcel Bruinenberg, Jihane Romanos, Lotte C Dinesen, Anthony W Ryan, Davinder Panesar, et al. Novel celiac disease genetic determinants related to the immune response. Nature genetics, 40 (4):395, 2008.
- Christoph Lippert, Jennifer Listgarten, Ying Liu, Carl M Kadie, Robert I Davidson, and David Heckerman. Fast linear mixed models for genome-wide association studies. Nature methods, 8 (10):833, 2011.
- Virginia Pascual, Romina Dieli-Crimi, Natalia Lopez-Palacios, Andres Bodas, Luz María Medrano, and Concepcion Nunez. Inflammatory bowel disease and celiac disease: overlaps and differences. World journal of gastroenterology: WJG, 20(17):4846, 2014.
- Judea Pearl et al. Causal inference in statistics: An overview. Statistics surveys, 3:96–146, 2009.
- Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
- Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics, 38(8):904, 2006.
- Rajesh Ranganath and Adler Perotte. Multiple causal inference with latent confounding. arXiv preprint arXiv:1805.08273, 2018.
- Marc Ratkovic. Balancing within the margin: Causal effect estimation with support vector machines. Department of Politics, Princeton University, Princeton, NJ, 2014.
- James M Robins. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, volume 1999, pages 6–10. Indianapolis, IN, 2000.
- Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
- Donald B Rubin. Randomization analysis of experimental data: The fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593, 1980.
- Ludvig M Sollid. Coeliac disease: dissecting a complex inflammatory disorder. Nature Reviews Immunology, 2(9):647, 2002.
- Michael Spivak. Calculus on manifolds: a modern approach to classical theorems of advanced calculus. CRC press, 2018.
- Gerald Teschl. Ordinary differential equations and dynamical systems, volume 140. American Mathematical Soc., 2012.
- Timothy Thornton and Michael Wu. Summer institute in statistical genetics 2015.
- Peter M Visscher, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew A Brown, and Jian Yang. 10 years of gwas discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017.
- Yixin Wang and David M Blei. The blessings of multiple causes. Journal of the American Statistical Association, (just-accepted):1–71, 2019.
- Jianming Yu, Gael Pressoir, William H Briggs, Irie Vroh Bi, Masanori Yamasaki, John F Doebley, Michael D McMullen, Brandon S Gaut, Dahlia M Nielsen, James B Holland, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature genetics, 38(2):203, 2006.
