# Differentiable Causal Discovery from Interventional Data

NeurIPS 2020.

Abstract:

Discovering causal relationships in data is a challenging task that involves solving a combinatorial problem for which the solution is not always identifiable. A new line of work reformulates the combinatorial problem as a continuous constrained optimization one, enabling the use of different powerful optimization techniques. However, …

Introduction

- The inference of causal relationships is a problem of fundamental interest in science.
- The authors aim to learn a causal graphical model (CGM) [28], which consists of a joint distribution coupled with a directed acyclic graph (DAG), where edges indicate direct causal relationships
- Achieving this based on observational data alone is challenging since, under the faithfulness assumption, the true DAG is only identifiable up to a Markov equivalence class [38].
- The kth interventional joint density is $p^{(k)}(x_1, \dots, x_d) := \prod_{j \notin I_k} p_j(x_j \mid x_{\pi_j}) \prod_{j \in I_k} \tilde{p}_j(x_j \mid x_{\pi_j})$, where $I_k$ is the set of variables targeted by the kth intervention, $\pi_j$ denotes the parents of $x_j$ in the DAG, and $\tilde{p}_j$ is the modified conditional of an intervened variable.
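The interventional factorization above can be illustrated with a toy sketch (a hypothetical illustration, not the paper's code): under an intervention, the conditionals of the targeted variables are replaced by modified densities, while the rest of the factorization is unchanged. Here a perfect intervention is mimicked by a standard normal that cuts incoming edges.

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a univariate Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Toy 3-node chain x1 -> x2 -> x3 with Gaussian conditionals p_j(x_j | parents).
conditionals = {
    1: lambda x: normal_pdf(x[1]),                   # p(x1)
    2: lambda x: normal_pdf(x[2], mean=0.8 * x[1]),  # p(x2 | x1)
    3: lambda x: normal_pdf(x[3], mean=-0.5 * x[2]), # p(x3 | x2)
}

def interventional_density(x, targets):
    """p^(k)(x) = prod_{j not in I_k} p_j * prod_{j in I_k} p~_j."""
    density = 1.0
    for j, p_j in conditionals.items():
        if j in targets:
            density *= normal_pdf(x[j])  # p~_j: perfect intervention on x_j
        else:
            density *= p_j(x)
    return density

x = {1: 0.3, 2: -0.1, 3: 0.5}
obs = interventional_density(x, targets=set())  # observational (I_k empty)
intv = interventional_density(x, targets={2})   # intervene on x2
```

With an empty target set the expression reduces to the observational joint density, so the observational distribution is just the special case k with $I_k = \emptyset$.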

Highlights

- The inference of causal relationships is a problem of fundamental interest in science
- We aim to learn a causal graphical model (CGM) [28], which consists of a joint distribution coupled with a directed acyclic graph (DAG), where edges indicate direct causal relationships
- We propose the approach Differentiable Causal Discovery with Interventions (DCDI): a general differentiable causal structure learning method that can leverage perfect, imperfect and unknown interventions (Section 3)
- We proposed a general continuous-constrained method for causal discovery which can leverage various types of interventional data as well as expressive neural architectures, such as normalizing flows
- One direction is to extend DCDI to time-series data, where non-stationarities can be modeled as unknown interventions [29]
- Another exciting direction is to learn representations of variables across multiple systems that could serve as prior knowledge for causal discovery in low data settings

Methods

- In Table 3 the authors report SHD and SID for all methods, along with the number of true positive, false negative, false positive, and reversed edges, and the F1 score.
- IGSP has a low SHD but a high SID, which can be explained by its relatively high number of false negatives.
- DCDI-G and DCDI-DSF have SHDs comparable to GIES and CAM, but higher than IGSP.
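As a rough sketch of how such edge-level metrics can be computed from binary adjacency matrices (the conventions below are common ones, assumed here rather than taken from the paper; in particular, a reversed edge is counted once in SHD and as a false negative for F1):

```python
import numpy as np

def edge_metrics(true_adj, est_adj):
    """Edge-level comparison of two binary adjacency matrices,
    where entry [i, j] = 1 means an edge i -> j. Returns tp, fp, fn,
    reversed-edge count, SHD, and F1 under assumed conventions:
    SHD = fp + fn + rev."""
    d = true_adj.shape[0]
    tp = fp = fn = rev = 0
    for i in range(d):
        for j in range(i + 1, d):
            t = (true_adj[i, j], true_adj[j, i])  # true state of the pair
            e = (est_adj[i, j], est_adj[j, i])    # estimated state
            if t == e:
                tp += int(t != (0, 0))            # correctly oriented edge
            elif t == (0, 0):
                fp += 1                           # spurious edge
            elif e == (0, 0):
                fn += 1                           # missed edge
            else:
                rev += 1                          # edge present but flipped
    shd = fp + fn + rev
    denom = 2 * tp + fp + fn + rev
    f1 = 2 * tp / denom if denom else 1.0
    return {"tp": tp, "fp": fp, "fn": fn, "rev": rev, "shd": shd, "f1": f1}

true_g = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])  # x1 -> x2 -> x3
est_g = np.array([[0, 1, 1], [0, 0, 0], [0, 1, 0]])   # extra x1 -> x3; x2 -> x3 reversed
m = edge_metrics(true_g, est_g)
```

SID is deliberately omitted: unlike SHD it is an interventional (not purely edge-wise) distance and requires reasoning over adjustment sets, which is why the two metrics can disagree as in the IGSP case above.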

Results

**Results for different intervention types.** The authors compare the methods to GIES [12], a modified version of CAM [2] that supports interventions, and IGSP [39].
- Boxplots for SHD and SID over 10 graphs are shown in Figure 2.
- DCDI-G and DCDI-DSF show competitive results in terms of SHD and SID.
- For graphs with a higher average number of edges, DCDI-G and DCDI-DSF outperform all other methods.
- GIES often shows the best performance on the linear data set, which is not surprising given that it makes the right assumptions, i.e., linear functions with Gaussian noise.

Conclusion

- The authors proposed a general continuous-constrained method for causal discovery which can leverage various types of interventional data as well as expressive neural architectures, such as normalizing flows.
- This approach is rooted in a sound theoretical framework and is competitive with other state-of-the-art algorithms on real and simulated data sets, both in terms of graph recovery and scalability.
- Another exciting direction is to learn representations of variables across multiple systems that could serve as prior knowledge for causal discovery in low-data settings.
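The "expressive neural architectures" mentioned here refer to normalizing flows such as the deep sigmoidal flows used by DCDI-DSF. What a flow computes can be shown with a minimal 1-D sketch (a hypothetical affine flow, far simpler than DSF): an invertible map of a base variable, with the density obtained via the change-of-variables formula.

```python
import math

# Minimal 1-D normalizing-flow illustration. A flow models p_X(x)
# through an invertible map x = f(z) of a base variable z ~ N(0, 1):
#   log p_X(x) = log p_Z(f^{-1}(x)) + log |d f^{-1}/dx|.

def base_logpdf(z):
    """Log-density of the standard normal base distribution."""
    return -0.5 * (z * z + math.log(2 * math.pi))

def affine_flow_logpdf(x, shift, log_scale):
    """Affine flow x = shift + exp(log_scale) * z, so
    f^{-1}(x) = (x - shift) / scale and log |df^{-1}/dx| = -log_scale."""
    z = (x - shift) * math.exp(-log_scale)
    return base_logpdf(z) - log_scale

# With shift = 0 and log_scale = 0 the flow is the identity map,
# so the modeled density is exactly N(0, 1).
lp = affine_flow_logpdf(0.0, shift=0.0, log_scale=0.0)
```

Richer flows stack many such invertible maps with learned, nonlinear components, which is what lets DCDI-DSF model non-additive noise where a fixed Gaussian conditional cannot.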

Summary

## Objectives:

The authors' goal is to design an algorithm that can automatically discover causal relationships from data.
- They aim to learn a causal graphical model (CGM) [28], which consists of a joint distribution coupled with a directed acyclic graph (DAG), where edges indicate direct causal relationships.

- Table1: Hyperparameter search spaces for each algorithm
- Table2: Default Hyperparameter for DCDI-G and DCDI-DSF
- Table3: Results for the flow cytometry data sets
- Table4: Results for the linear data set with perfect intervention
- Table5: Results for the additive noise model data set with perfect intervention
- Table6: Results for the nonlinear with non-additive noise data set with perfect intervention
- Table7: Results for the linear data set with imperfect intervention
- Table8: Results for the additive noise model data set with imperfect intervention
- Table9: Results for the nonlinear with non-additive noise data set with imperfect intervention
- Table10: Results for the linear data set with perfect intervention with unknown targets
- Table11: Results for the additive noise model data set with perfect intervention with unknown targets
- Table12: Results for the nonlinear with non-additive noise data set with perfect intervention with unknown targets
- Table19: Results for linear data set with perfect intervention

Funding

- This research was partially supported by the Canada CIFAR AI Chair Program, by an IVADO excellence PhD scholarship and by a Google Focused Research award

Study subjects and analysis

So far the experiments focused on moderate-size data sets, both in terms of number of variables (10 or 20) and number of examples (≈ 10^4). In Appendix C.3, we compare the running times of DCDI to those of other methods on graphs of up to 100 nodes and on data sets of up to 1 million examples. The augmented Lagrangian procedure on which DCDI relies requires the computation of a matrix exponential at each gradient step, which costs O(d^3).
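The matrix exponential enters through the NOTEARS-style acyclicity constraint h(A) = tr(exp(A ∘ A)) − d, which is zero exactly when the graph encoded by A is acyclic; this is a minimal sketch of that constraint using SciPy's `expm`, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import expm

def acyclicity_constraint(adj):
    """NOTEARS-style constraint h(A) = tr(exp(A * A)) - d, where * is
    elementwise. h(A) = 0 iff the weighted graph encoded by A is acyclic;
    evaluating expm costs O(d^3), the per-gradient-step cost noted above."""
    d = adj.shape[0]
    # Elementwise square keeps entries nonnegative, so tr(expm(.)) >= d,
    # with equality exactly when there is no directed cycle.
    return np.trace(expm(adj * adj)) - d

dag = np.array([[0.0, 1.0], [0.0, 0.0]])     # x1 -> x2 (acyclic)
cyclic = np.array([[0.0, 1.0], [1.0, 0.0]])  # x1 <-> x2 (2-cycle)
h_dag = acyclicity_constraint(dag)
h_cyc = acyclicity_constraint(cyclic)
```

The augmented Lagrangian drives h toward zero during training, trading the combinatorial acyclicity check for a smooth penalty that can be differentiated through.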

References

- L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010. 2010.
- P. Bühlmann, J. Peters, and J. Ernest. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 2014.
- D. M. Chickering. Optimal structure identification with greedy search. In Journal of Machine Learning Research, 2003.
- A. Dixit, O. Parnas, B. Li, J. Chen, C. P. Fulco, L. Jerby-Arnon, N. D. Marjanovic, D. Dionne, T. Burks, R. Raychowdhury, B. Adamson, T. M. Norman, E. S. Lander, J. S. Weissman, N. Friedman, and A. Regev. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell, 2016.
- D. Eaton and K. Murphy. Exact Bayesian structure learning from uncertain interventions. In Artificial Intelligence and Statistics, 2007.
- F. Eberhardt. Causation and intervention. Unpublished doctoral dissertation, Carnegie Mellon University, 2007.
- F. Eberhardt. Almost Optimal Intervention Sets for Causal Discovery. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, 2008.
- F. Eberhardt and R. Scheines. Interventions and causal inference. Philosophy of Science, 2007.
- F. Eberhardt, C. Glymour, and R. Scheines. On the Number of Experiments Sufficient and in the Worst Case Necessary to Identify all Causal Relations among N Variables. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, 2005.
- K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in neural information processing systems, 2008.
- X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010.
- A. Hauser and P. Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 2012.
- C. Heinze-Deml, M. H. Maathuis, and N. Meinshausen. Causal structure learning. Annual Review of Statistics and Its Application, 2018.
- C. Heinze-Deml, J. Peters, and N. Meinshausen. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 2018.
- C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- A. Hyttinen, F. Eberhardt, and M. Järvisalo. Constraint-based causal discovery: Conflict resolution with answer set programming. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, 2014.
- E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. Proceedings of the 34th International Conference on Machine Learning, 2017.
- D. Kalainathan, O. Goudet, I. Guyon, D. Lopez-Paz, and M. Sebag. Sam: Structural agnostic model, causal discovery and penalized adversarial learning. arXiv preprint arXiv:1803.04929, 2018.
- N. R. Ke, O. Bilaniuk, A. Goyal, S. Bauer, H. Larochelle, C. Pal, and Y. Bengio. Learning neural causal models from unknown interventions. arXiv preprint arXiv:1910.01075, 2019.
- K. B. Korb, L. R. Hope, A. E. Nicholson, and K. Axnick. Varieties of causal intervention. In Pacific Rim International Conference on Artificial Intelligence, 2004.
- S. Lachapelle, P. Brouillard, T. Deleu, and S. Lacoste-Julien. Gradient-based neural DAG learning. In Proceedings of the 8th International Conference on Learning Representations, 2020.
- C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. Proceedings of the 34th International Conference on Machine Learning, 2017.
- J. M. Mooij, S. Magliacane, and T. Claassen. Joint causal inference from multiple contexts. arXiv preprint arXiv:1611.10351, 2016.
- I. Ng, Z. Fang, S. Zhu, Z. Chen, and J. Wang. Masked gradient-based causal structure learning. arXiv preprint arXiv:1910.08527, 2019.
- J. Pearl. Causality. Cambridge university press, 2009.
- J. Peters and P. Bühlmann. Structural intervention distance (SID) for evaluating causal graphs. Neural Computation, 2015.
- J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
- J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference - Foundations and Learning Algorithms. MIT Press, 2017.
- N. Pfister, P. Bühlmann, and J. Peters. Invariant causal prediction for sequential data. Journal of the American Statistical Association, 2019.
- D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. Proceedings of the 32nd International Conference on Machine Learning, 2015.
- D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, 2014.
- K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 2005.
- P. Spirtes, C. N. Glymour, R. Scheines, and D. Heckerman. Causation, Prediction, and Search. MIT Press, 2000.
- C. Squires, Y. Wang, and C. Uhler. Permutation-based causal structure learning with unknown intervention targets. Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence, 2020.
- E. V. Strobl, K. Zhang, and S. Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 2019.
- T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 2012.
- S. Triantafillou and I. Tsamardinos. Constraint-based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research, 2015.
- T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, 1990.
- Y. Wang, L. Solus, K. Yang, and C. Uhler. Permutation-based causal inference algorithms with interventions. In Advances in Neural Information Processing Systems, 2017.
- K. D. Yang, A. Katcoff, and C. Uhler. Characterizing and learning equivalence classes of causal DAGs under interventions. Proceedings of the 35th International Conference on Machine Learning, 2018.
- Y. Yu, J. Chen, T. Gao, and M. Yu. DAG-GNN: DAG structure learning with graph neural networks. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 2011.
- Q. Zhang, S. Filippi, A. Gretton, and D. Sejdinovic. Large-scale kernel methods for independence testing. Statistics and Computing, 2018.
- X. Zheng, B. Aragam, P.K. Ravikumar, and E.P. Xing. Dags with no tears: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31, 2018.
- X. Zheng, C. Dan, B. Aragam, P. Ravikumar, and E. Xing. Learning sparse nonparametric dags. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 2020.
- S. Zhu and Z. Chen. Causal discovery with reinforcement learning. Proceedings of the 8th International Conference on Learning Representations, 2020.
- A. M. Zimmer, Y. K. Pan, T. Chandrapalan, R. WM Kwong, and S. F. Perry. Loss-of-function approaches in comparative physiology: is there a future for knockdown experiments in the era of genome editing? Journal of Experimental Biology, 2019.
