Transformers as Soft Reasoners over Language

IJCAI 2020, pp. 3882-3890, 2020.

DOI: https://doi.org/10.24963/ijcai.2020/537

Abstract:

AI has long pursued the goal of having systems reason over *explicitly provided* knowledge, but building suitable representations has proved challenging. Here we explore whether transformers can similarly learn to reason (or emulate reasoning), but using rules expressed in language, thus bypassing a formal representation. We provide the…

Introduction
Highlights
  • AI has long pursued the goal of giving a system explicit knowledge, and having it reason over that knowledge to reach conclusions, dating back to the earliest years of the field, e.g., McCarthy’s Advice Taker (1959), and Newell and Simon’s Logic Theorist (1956)
  • The results are in Table 4, tested using the earlier trained models. Note that these new problems and vocabularies were unseen during training
  • We might describe a world in which plastic is a type of metal, and see how the conductivity of objects changes (see the example sketch after this list)
  • Just as McCarthy advocated 60 years ago for machines reasoning (“taking advice”) in logic, we have shown that machines can be trained to reason over language
  • The ability to reason over rules expressed in language has potentially far-reaching implications
  • An interactive demo and all our datasets are available at https://rule-reasoning.apps.allenai.org/ and https://allenai.org/data/ruletaker
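To make the counterfactual highlight above concrete, the snippet below shows what such a query could look like as input to a RuleTaker-style true/false model. It is only an illustrative sketch: the sentences are invented here rather than taken from the paper's datasets, and no model is actually called.

```python
# A hypothetical counterfactual rulebase, phrased the way RuleTaker contexts
# are written. Illustration only: the sentences are invented for this sketch
# and no model is invoked.
counterfactual_example = {
    "context": (
        "Plastic is a type of metal. "    # the counterfactual rule
        "Metals conduct electricity. "    # an ordinary rule
        "The spoon is made of plastic."   # a fact about an instance
    ),
    "question": "The spoon conducts electricity.",
    # A model that genuinely reasons over the stated rules should answer True,
    # even though the first rule is false in the real world.
    "expected_answer": True,
}
print(counterfactual_example["question"], "->", counterfactual_example["expected_answer"])
```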
Methods
  • The authors conduct all the experiments using RoBERTa-large, fine-tuned on the RACE dataset [Lai et al., 2017].
  • This additional fine-tuning step has previously been shown to reduce sensitivity to hyperparameters [Phang et al., 2018] and to improve question answering [Sun et al., 2018].
  • The authors train RoBERTa to predict true/false for each question statement.
  • The authors measure accuracy. (The test data has a balanced split of TRUE/FALSE answers, so the baseline of random guessing is 50%.) A minimal fine-tuning sketch follows this list.
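The bullets above give only a high-level description of the training setup. The sketch below shows one plausible way to set up the same kind of true/false fine-tuning with the HuggingFace transformers library; it is not the authors' code, it starts from plain roberta-large rather than the RACE-fine-tuned checkpoint the paper uses, and the two training instances are made up for illustration.

```python
# Minimal sketch of true/false fine-tuning for rule reasoning.
# Assumes the HuggingFace `transformers` library; not the authors' code,
# and it starts from plain roberta-large rather than a RACE-fine-tuned checkpoint.
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

# Each instance is (context = facts + rules, question = candidate statement, label).
examples = [
    ("Bob is big. If someone is big then they are strong.", "Bob is strong.", 1),
    ("Bob is big. If someone is big then they are strong.", "Bob is quiet.", 0),
]
contexts, questions, labels = zip(*examples)

# Encode context and question as a sentence pair, as in standard NLI-style setups.
batch = tokenizer(list(contexts), list(questions),
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor(labels)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for _ in range(3):  # a few toy epochs on the toy data
    optimizer.zero_grad()
    out = model(**batch, labels=labels)  # cross-entropy over the two labels
    out.loss.backward()
    optimizer.step()

# Accuracy is then just the fraction of statements whose argmax label is correct.
model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)
print("train accuracy:", (preds == labels).float().mean().item())
```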
Results
  • The results are in Table 4, tested using the earlier trained models.
  • Note that these new problems and vocabularies were unseen during training.
  • The MMax model solves all but one of these datasets with 90%+ scores.
  • The authors ran the earlier trained models on the ParaRules test partition.
  • A model additionally fine-tuned on the ParaRules training data reaches 98.8% accuracy on the ParaRules test partition, showing near-perfect performance is learnable.
  • This suggests that the findings may extend to rulebases expressed in more natural language
Conclusion
  • Discussion and Future Work: While the demonstrations have been in a limited setting, the implications of being able to predictably reason with language are significant.
  • While the authors have assumed a particular semantics of inference, the methodology the authors have used is general: characterize the desired behavior in a formal way, synthesize examples, generate linguistic equivalents, and train a model (a data-generation sketch follows after this list).
  • Rules might be authored by a person, sidestepping some of the intricacies of a formal language; or they could be retrieved from natural sources.
  • An interactive demo and all the datasets are available at https://rule-reasoning.apps.allenai.org/ and https://allenai.org/data/ruletaker
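The general recipe described above (characterize the behavior formally, synthesize examples, generate linguistic equivalents, train a model) can be sketched in a few lines. The rule format, English templates, and closed-world labeling below are simplified stand-ins chosen for this illustration, not the authors' actual dataset generator.

```python
# Sketch of the "formalize -> synthesize -> verbalize" recipe.
# Simplified stand-in for the paper's generator: rules are
# (condition_attribute, conclusion_attribute) pairs over people,
# facts are (person, attribute) pairs, and truth is decided by
# forward chaining under a closed-world assumption.

facts = {("Bob", "big")}
rules = [("big", "strong"), ("strong", "kind")]  # if X is big then X is strong; ...

def forward_chain(facts, rules):
    """Derive all facts reachable from the rules (naive fixpoint)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for person, attr in list(derived):
            for cond, concl in rules:
                if attr == cond and (person, concl) not in derived:
                    derived.add((person, concl))
                    changed = True
    return derived

def verbalize(facts, rules):
    """Render the formal theory as simple English context sentences."""
    sents = [f"{p} is {a}." for p, a in sorted(facts)]
    sents += [f"If someone is {c} then they are {k}." for c, k in rules]
    return " ".join(sents)

closure = forward_chain(facts, rules)
context = verbalize(facts, rules)

# Questions: candidate (person, attribute) statements, labeled TRUE if derivable,
# FALSE otherwise (closed-world assumption).
for attr in ["big", "strong", "kind", "quiet"]:
    label = ("Bob", attr) in closure
    print(context, "|", f"Bob is {attr}.", "->", label)
```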
Tables
  • Table1: Accuracy of models (Mod0,...) trained and tested on the five datasets (“Test (own)” row), and tested on all, and different slices, of the DMax test set. The boxed area indicates test problems at depths unseen during training
  • Table2: Accuracy on the DMax (no negation) subset, and all its (113k) perturbed (one context sentence removed) variants. The overall accuracy (Remove Any, last column) is largely unchanged, but with a drop for the subset where a critical sentence was removed
  • Table3: On the true questions that were originally answered correctly (column 1), the predicted T answer should flip to predicted F when a critical sentence is removed. In practice, we observe this happens 81% of the time (16654/(16654+3895)). In a few (197) cases, the predicted answer was incorrect to start with (column 2). When an irrelevant sentence is removed, the predicted answer stays correct (T) over 99% of the time (not shown); a sketch of this perturbation check appears after the table list
  • Table4: Accuracy of the earlier models tested on hand-crafted rulebases (zero shot, no fine-tuning). Note that the models were only trained on the earlier datasets (e.g., Figures 1 and 3), and thus the new rulebases’ entities, attributes, and predicates (bar is()) are completely unseen until test time
  • Table5: Accuracy with rules paraphrased into more natural language (ParaRules), without fine-tuning (zero shot) and with (last column only). The strongest zero-shot model (MMax) partially solves (66.6%) this problem zero-shot, with strongest performance for depth 0 and 1 inferences
  • Table6: Transformers (RoBERTa,BERT) are sufficient but not strictly necessary for this task, although other architectures (ESIM) do not score as well. DECOMP was run as a sanity check that the datasets are not trivially solvable - its low score (random baseline is 50%) suggests they are not
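The perturbation analysis behind Tables 2 and 3 amounts to a leave-one-sentence-out consistency check. In the sketch below, predict is a hypothetical stand-in for the trained true/false model and critical marks sentences assumed to lie on the gold proof path; neither is provided by this snippet, so the toy usage at the end substitutes a dummy model.

```python
# Sketch of the leave-one-sentence-out consistency check (Tables 2-3).
# `predict(context, question)` is a hypothetical stand-in for the trained
# true/false model, and `critical` marks sentences on the gold proof path;
# neither is implemented here.

def flip_consistency(predict, context_sentences, question, critical):
    """For each removed sentence, record whether the prediction flips."""
    full_context = " ".join(context_sentences)
    base = predict(full_context, question)
    results = []
    for i, sent in enumerate(context_sentences):
        reduced = " ".join(s for j, s in enumerate(context_sentences) if j != i)
        pred = predict(reduced, question)
        results.append({
            "removed": sent,
            "critical": i in critical,
            # Removing a critical sentence should flip a correct TRUE to FALSE;
            # removing an irrelevant one should leave the answer unchanged.
            "flipped": pred != base,
        })
    return results

# Toy usage with a dummy "model" that answers True only if "big" appears.
dummy = lambda ctx, q: "big" in ctx
print(flip_consistency(dummy,
                       ["Bob is big.", "Bob is red."],
                       "Bob is big.",
                       critical={0}))
```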
Related work
  • While our work is, to the best of our knowledge, the first systematic study of transformers reasoning over explicitly stated rule sets, there are several datasets that make a first step towards this by testing whether neural systems can apply explicit, general knowledge in a particular situation. Two synthetic datasets that test whether a single rule can be applied correctly are as follows:

    1. Task 15 in the bAbI dataset [Weston et al., 2016] applies rules of the form “Xs are afraid of Ys” to an instance, e.g., “Sheep are afraid of wolves. Gertrude is a sheep. What is Gertrude afraid of? A: wolves”

    2. The synthetic, conditional probes in [Richardson et al., 2020] test single rule application, e.g., “If Joe has visited Potsdam then Anne has visited Tampa. Joe has visited Potsdam. Has Anne visited Tampa? A: yes”
References
  • [Abboud et al., 2020] Ralph Abboud, Ismail Ilkan Ceylan, and Thomas Lukasiewicz. Learning to reason: Leveraging neural networks for approximate DNF counting. In AAAI, 2020.
  • [Abdelaziz et al., 2020] Ibrahim Abdelaziz, Veronika Thost, Maxwell Crouse, and Achille Fokoue. An experimental study of formula embeddings for automated theorem proving in first-order logic. arXiv:2002.00423, 2020.
  • [Apt et al., 1988] Krzysztof R. Apt, Howard A. Blair, and Adrian Walker. Towards a theory of declarative knowledge. In Foundations of Deductive Databases and Logic Programming, 1988.
  • [Bidoit and Froidevaux, 1991] Nicole Bidoit and Christine Froidevaux. General logical databases and programs: Default logic semantics and stratification. Inf. Comput., 91:15–54, 1991.
  • [Chen et al., 2017] Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In ACL, 2017.
  • [Crouse et al., 2019] Maxwell Crouse, Ibrahim Abdelaziz, Cristina Cornelio, Veronika Thost, Lingfei Wu, Kenneth D. Forbus, and Achille Fokoue. Improving graph neural network representations of logical formulae with subgraph pooling. arXiv:1911.06904, 2019.
  • [Dagan et al., 2013] Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. Recognizing Textual Entailment: Models and Applications. Morgan and Claypool, 2013.
  • [Goodfellow, 2016] Ian J. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv:1701.00160, 2016.
  • [He and Choi, 2019] Han He and Jinho D. Choi. Establishing strong baselines for the new decade: Sequence tagging, syntactic and semantic parsing with BERT. arXiv:1908.04943, 2019.
  • [Kalyan et al., 2019] Ashwin Kalyan, Oleksandr Polozov, and Adam Kalai. Adaptive generation of programming puzzles. Technical report, Georgia Tech, 2019. https://openreview.net/forum?id=HJeRveHKDH
  • [Kamath and Das, 2019] Aishwarya Kamath and Rajarshi Das. A survey on semantic parsing. In AKBC, 2019.
  • [Lai et al., 2017] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. In EMNLP, 2017.
  • [Lample and Charton, 2019] Guillaume Lample and François Charton. Deep learning for symbolic mathematics. In ICLR, 2019.
  • [Lin et al., 2019] Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. Reasoning over paragraph effects in situations. In Proc. MRQA Workshop (EMNLP 2019), 2019. Also arXiv:1908.05852.
  • [Liu et al., 2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692, 2019.
  • [MacCartney and Manning, 2014] Bill MacCartney and Christopher D. Manning. Natural logic and natural language inference. Computing Meaning, 47:129–147, 2014.
  • [Manning and MacCartney, 2009] Christopher D. Manning and Bill MacCartney. Natural language inference. Stanford University, 2009.
  • [McCarthy, 1959] John W. McCarthy. Programs with common sense. In Proc. Teddington Conf. on the Mechanization of Thought Processes, pages 75–91, 1959.
  • [McCarthy, 1984] John McCarthy. Applications of circumscription to formalizing common sense knowledge. In NMR, 1984.
  • [Metaxiotis et al., 2002] Kostas S. Metaxiotis, Dimitris Askounis, and John Psarras. Expert systems in production planning and scheduling: A state-of-the-art survey. Journal of Intelligent Manufacturing, 13(4):253–260, 2002.
  • [Minervini et al., 2018] Pasquale Minervini, Matko Bošnjak, Tim Rocktäschel, and Sebastian Riedel. Towards neural theorem proving at scale. arXiv:1807.08204, 2018.
  • [Minervini et al., 2019] Pasquale Minervini, Matko Bošnjak, Tim Rocktäschel, Sebastian Riedel, and Edward Grefenstette. Differentiable reasoning on large knowledge bases and natural language. arXiv:1912.10824, 2019.
  • [Moss, 2010] Lawrence S. Moss. Natural logic and semantics. In Logic, Language and Meaning, pages 84–93, 2010.
  • [Musen and Van der Lei, 1988] Mark A. Musen and Johan Van der Lei. Of brittleness and bottlenecks: Challenges in the creation of pattern-recognition and expert-system models. In Machine Intelligence and Pattern Recognition, volume 7, pages 335–352, 1988.
  • [Newell and Simon, 1956] Allen Newell and Herbert A. Simon. The logic theory machine: A complex information processing system. IRE Trans. Information Theory, 2:61–79, 1956.
  • [Niemelä and Simons, 1997] Ilkka Niemelä and Patrik Simons. Smodels: An implementation of the stable model and well-founded semantics for normal logic programs. In LPNMR, 1997.
  • [Parikh et al., 2016] Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In EMNLP, 2016.
  • [Phang et al., 2018] Jason Phang, Thibault Févry, and Samuel R. Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv:1811.01088, 2018.
  • [Rajpurkar et al., 2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
  • [Richardson et al., 2020] Kyle Richardson, Hai Hu, Lawrence S. Moss, and Ashish Sabharwal. Probing natural language inference models through semantic fragments. In AAAI, 2020.
  • [Saxton et al., 2019] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In ICLR, 2019.
  • [Selsam et al., 2019] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L. Dill. Learning a SAT solver from single-bit supervision. In ICLR, 2019.
  • [Sinha et al., 2019] Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. CLUTRR: A diagnostic benchmark for inductive reasoning from text. In EMNLP/IJCNLP, 2019.
  • [Sun et al., 2018] Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. Improving machine reading comprehension with general reading strategies. In NAACL-HLT, 2018.
  • [Tafjord et al., 2019] Oyvind Tafjord, Matt Gardner, Kevin Lin, and Peter Clark. QuaRTz: An open-domain dataset of qualitative relationship questions. In EMNLP/IJCNLP, 2019.
  • [Talmor et al., 2019] Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. oLMpics: On what language model pre-training captures. arXiv:1912.13283, 2019.
  • [Wang et al., 2019] Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. Learning deep transformer models for machine translation. In ACL, 2019.
  • [Weber et al., 2019] Leon Weber, Pasquale Minervini, Jannes Münchmeyer, Ulf Leser, and Tim Rocktäschel. NLProlog: Reasoning with weak unification for question answering in natural language. In ACL, 2019.
  • [Weston et al., 2016] J. Weston, A. Bordes, S. Chopra, and T. Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. In ICLR, 2016.
  • [Yang et al., 2018] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv:1809.09600, 2018.