# PRover: Proof Generation for Interpretable Reasoning over Rules

EMNLP 2020.

Abstract:

Recent work by Clark et al. (2020) shows that transformers can act as 'soft theorem provers' by answering questions over explicitly provided knowledge in natural language. In our work, we take a step closer to emulating formal theorem provers, by proposing PROVER, an interpretable transformer-based model that jointly answers binary ques…

Introduction

- Developing systems that can understand and reason over explicitly provided knowledge has been a fundamental goal of AI (Newell and Simon, 1956).
- Building on decades of work on rule-based expert systems (Musen and Van Der Lei, 1988), and backed by the recent successes of transformers (Vaswani et al., 2017) in NLP, Clark et al. (2020) propose a new version of the problem by replacing the formal representations of rule-bases with natural language (English)
- Their task requires predicting the truth value of a statement by reasoning over a set of facts and rules, all expressed in natural language.

Highlights

- Developing systems that can understand and reason over explicitly provided knowledge has been a fundamental goal of AI (Newell and Simon, 1956)
- We present PROVER, an interpretable joint model that learns to reason over natural language rulebases and generate corresponding proofs
- The 5-6% drop in edge and proof accuracy for the "Unconstrained Train + No Integer Linear Program (ILP)" model, compared to PROVER, shows that removing constraints results in a harder learning problem, and the model fails to learn all the constraints automatically
- We introduce PROVER, an interpretable joint model that answers binary questions over natural language rule-bases and generates corresponding proofs
- Our model improves state-of-the-art QA accuracy in the zero-shot scenario by 6% and generates proofs accurately
- PROVER’s modeling is relatively generic, and similar proof generation methods can be explored in traditional multi-hop QA tasks

Methods

- Each input to PROVER is a context C and a question Q about the context. PROVER predicts the answer A ∈ {True, False} and generates a proof P.

3.1 Proof Representation

A proof, P = (N, E), is a directed graph with nodes n ∈ N and edges e ∈ E.
- Edges in the proof are directed either from a fact to a rule or from a rule to another rule
- These indicate that a fact is consumed by a rule, or that the output of a rule is consumed by another rule, respectively
- The authors use these constraints during both PROVER's training and inference, as described later in the paper
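The edge-direction constraint above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the node names, `FACTS`/`RULES` sets, and helper functions are all hypothetical, and it shows only the structural check that edges go fact → rule or rule → rule between selected proof nodes.

```python
# Illustrative sketch of the proof representation: a proof is a directed
# graph whose edges must go fact -> rule or rule -> rule. All names here
# are hypothetical, for illustration only.

FACTS = {"fact1", "fact2"}
RULES = {"rule1", "rule2"}

def is_valid_edge(src, dst):
    """An edge either consumes a fact into a rule, or chains one
    rule's output into another rule."""
    return dst in RULES and (src in FACTS or src in RULES)

def is_valid_proof(nodes, edges):
    """Every edge must respect the direction constraint, and both of
    its endpoints must be selected proof nodes."""
    return all(
        src in nodes and dst in nodes and is_valid_edge(src, dst)
        for (src, dst) in edges
    )

# A depth-2 proof: fact1 is consumed by rule1, whose output feeds rule2.
proof_nodes = {"fact1", "rule1", "rule2"}
proof_edges = [("fact1", "rule1"), ("rule1", "rule2")]
```

An edge pointing from a rule back to a fact, or an edge touching a node not selected for the proof, would fail these checks.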

Results

**Evaluation on Complex Language**

The authors test PROVER's ability to generate proofs for more human-like natural language theories. They train and test the following models on DU5:
- QA+Node: a model consisting of only the QA and Node modules
- No NAF: a model trained using random NAF embeddings
- Unconstrained Train (UT) + No ILP: constraints removed during both training and inference
- Unconstrained Train (UT) + ILP: constraints removed only during training
- No Connectivity: only the connectivity constraint removed during inference
- More details about these models are given in the appendix
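The connectivity constraint targeted by the "No Connectivity" ablation can be illustrated with a simple graph check. This sketch assumes a weak-connectivity interpretation (edge direction ignored) and is not the authors' ILP formulation; it only shows why dropping the check can admit invalid proofs with stray, disconnected nodes.

```python
from collections import defaultdict, deque

def is_connected(nodes, edges):
    """Return True if the proof graph forms a single (weakly)
    connected component, via BFS over an undirected view of it."""
    if not nodes:
        return True
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)
        adj[dst].add(src)  # ignore edge direction for connectivity
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == set(nodes)
```

For example, a proof selecting a rule node that no edge ever touches fails this check, even if every individual edge is well-formed.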

Conclusion

**Discussion and Future Work**

- Graph-based Explanations: While the authors have presented PROVER as a model that can emulate formal reasoning, it has further potential as an explanation generation system
- QA and Proof Consistency: Currently, PROVER predicts the answer and generates the proof by jointly optimizing the QA, node, and edge modules using a shared RoBERTa model
- PROVER's constraints, such as ensuring connectivity, are necessary for generating valid proofs in any graph-based explanation generation system
- Other tasks may require imposing additional constraints to ensure valid explanations; PROVER's inference mechanism can be extended to incorporate these
- The authors introduce PROVER, an interpretable joint model that answers binary questions over natural language rule-bases and generates corresponding proofs
- PROVER can be a helpful aid to formal reasoners in scenarios where rules are fuzzy and creating rule-bases in a formal language is tedious or infeasible

Tables

- Table1: QA comparison between RT and PR for varying depths along with node, edge, proof and full accuracy for PROVER on DU5. Cnt = Sample Count
- Table2: Zero-shot performance comparison on the Birds-Electricity dataset after training on DU5
- Table3: Comparison of models trained on DU3 and ParaRules training sets and tested on ParaRules test set
- Table4: Comparison of PROVER models trained with varying amount of training data on DU5. Count = Number of training examples
- Table5: Ablation studies of PROVER showing the importance of each component and constraints
- Table6: Performance of PROVER trained on the training splits of DU0, DU1, DU2 and DU3 and tested on the validation and test splits of DU5
- Table7: Performance of PROVER trained on the training split of DU5 and tested on the validation split of DU5
- Table8: PROVER results on the ParaRules validation set after training on DU3+ParaRules training splits
- Table9: Comparison of critical sentence identification on the No Negation subset of DU5 test set

Related work

- Our work is related to multiple bodies of previous work in NLP and formal reasoning.

QA and NLI: The rule reasoning task is related to reasoning tasks that have been proposed recently. These include tasks in the bAbI dataset (Weston et al., 2015), synthetically generated probe tasks (Richardson et al., 2020), and reading comprehension tasks in datasets such as QuaRTz (Tafjord et al., 2019) and ROPES (Lin et al., 2019). Unlike our task, most of these require reasoning over implicit rules, the focus being on language understanding and one step of rule application. Multi-hop QA datasets like HotpotQA (Yang et al., 2018) require multiple reasoning steps, but the inference rules needed are again implicitly inferred rather than explicitly provided. Our task also bears similarity to Natural Language Inference (MacCartney and Manning, 2014), but NLI also allows unsupported inferences by filling gaps in explicitly stated knowledge (Dagan et al., 2013).

Funding

- This work was supported by DARPA MCS Grant N66001-19-2-4031, NSF-CAREER Award 1846185, DARPA YFA17-D17AP00022, ONR Grant N00014-18-1-2871, Microsoft Investigator Fellowship, and Munroe & Rebecca Cobey Fellowship
- The views in this article are those of the authors and not of the funding agencies

Study subjects and analysis

sets of datasets: 3

4.1 Datasets and Evaluation Metrics. We conduct experiments on all three sets of datasets introduced in Clark et al. (2020), each consisting of gold answers and proofs. Further details of the datasets can be found in the appendix.

datasets: 5

DU0-DU5: The first set consists of five datasets, each containing 100k questions with theories in synthetic language and requiring reasoning paths up to depth D (D = 0, 1, 2, 3, 5). We refer to these datasets as DU0, DU1, DU2, DU3 and DU5, where DU stands for "Depth Upto".

datasets: 6

Zero-shot evaluation of PROVER on the Birds-Electricity dataset (Table 2): the DU5-trained model is tested on six datasets, two from the birds domain (B1, B2) and another four from the electricity domain (E1, E2, E3, E4). Overall, our model achieves a 6% QA improvement over RuleTakers.

Reference

- Ralph Abboud, Ismail Ilkan Ceylan, and Thomas Lukasiewicz. 2020. Learning to reason: Leveraging neural networks for approximate DNF counting. In Proceedings of the 34th AAAI Conference on Artificial Intelligence.
- Ibrahim Abdelaziz, Veronika Thost, Maxwell Crouse, and Achille Fokoue. 2020. An experimental study of formula embeddings for automated theorem proving in first-order logic. arXiv preprint arXiv:2002.00423.
- Mohit Bansal, Claire Cardie, and Lillian Lee. 2008. The power of negative thinking: Exploiting label disagreement in the min-cut classification framework. In COLING 2008: Companion Volume: Posters, pages 15–18.
- Regina Barzilay and Mirella Lapata. 2005. Collective content selection for concept-to-text generation. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 331–338. Association for Computational Linguistics.
- Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1533–1544.
- Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425.
- Oana-Maria Camburu, Tim Rocktaschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pages 9539–9549.
- Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over language. In International Joint Conference on Artificial Intelligence.
- Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–220.
- Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
- Shimon Even and R Endre Tarjan. 1975. Network flow and testing graph connectivity. SIAM journal on computing, 4(4):507–518.
- Peter Hase and Mohit Bansal. 2020. Evaluating explainable AI: Which algorithmic explanations help users predict model behavior? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
- Han He and Jinho Choi. 2020. Establishing strong baselines for the new decade: Sequence tagging, syntactic and semantic parsing with bert. In The Thirty-Third International Flairs Conference.
- Shalmali Joshi, Oluwasanmi Koyejo, Been Kim, and Joydeep Ghosh. 2018. xGEMs: Generating examplars to explain black-box models. arXiv preprint arXiv:1806.08867.
- Aishwarya Kamath and Rajarshi Das. 2019. A survey on semantic parsing. In Automated Knowledge Base Construction (AKBC).
- Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794.
- Guillaume Lample and Francois Charton. 2020. Deep learning for symbolic mathematics. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net.
- Tom Leighton and Satish Rao. 1999. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. Journal of the ACM (JACM), 46(6):787–832.
- Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 58–62, Hong Kong, China. Association for Computational Linguistics.
- Hui Liu, Qingyu Yin, and William Yang Wang. 2019a. Towards explainable NLP: A generative explanation framework for text classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5570–5581.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Bill MacCartney and Christopher D Manning. 2014. Natural logic and natural language inference. In Computing meaning, pages 129–147. Springer.
- Mark A Musen and Johan Van Der Lei. 1988. Of brittleness and bottlenecks: Challenges in the creation of pattern-recognition and expert-system models. In Machine Intelligence and Pattern Recognition, volume 7, pages 335–352. Elsevier.
- Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2016. Neural programmer: Inducing latent programs with gradient descent. In 4th International Conference on Learning Representations, ICLR 2016, Conference Track Proceedings.
- Allen Newell and Herbert Simon. 1956. The logic theory machine–a complex information processing system. IRE Transactions on information theory, 2(3):61–79.
- Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, page 271. Association for Computational Linguistics.
- Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942.
- Scott E. Reed and Nando de Freitas. 2016. Neural programmer-interpreters. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
- Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. ”Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144.
- Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-precision modelagnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Kyle Richardson, Hai Hu, Lawrence S Moss, and Ashish Sabharwal. 2020. Probing natural language inference models through semantic fragments. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
- Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215.
- Pouya Samangouei, Ardavan Saeedi, Liam Nakagawa, and Nathan Silberman. 2018. ExplainGAN: Model explanation via decision boundary crossing transformations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 666–681.
- David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations.
- Daniel Selsam, Matthew Lamm, Benedikt Bunz, Percy Liang, Leonardo de Moura, and David L. Dill. 2019. Learning a SAT solver from single-bit supervision. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
- Oyvind Tafjord, Matt Gardner, Kevin Lin, and Peter Clark. 2019. QuaRTz: An open-domain dataset of qualitative relationship questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5941–5946, Hong Kong, China. Association for Computational Linguistics.
- Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2019. oLMpics–on what language model pre-training captures. arXiv preprint arXiv:1912.13283.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1810–1822, Florence, Italy. Association for Computational Linguistics.
- Leon Weber, Pasquale Minervini, Jannes Munchmeyer, Ulf Leser, and Tim Rocktaschel. 2019. NLProlog: Reasoning with weak unification for question answering in natural language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6151–6161, Florence, Italy. Association for Computational Linguistics.
- Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merrienboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.
- Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
- Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, UAI’05, page 658–666. AUAI Press.
- Hongming Zhang, Xinran Zhao, and Yangqiu Song. 2020. WinoWhy: A deep diagnosis of essential commonsense knowledge for answering Winograd schema challenge. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL).
- Below we briefly describe the three sets of datasets we conduct experiments on. Each dataset has a train, validation, and test split, except for the zero-shot test-only one. Further details about these can be found in Clark et al. (2020).
