Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning

EMNLP 2020, pp. 8846–8863.


Abstract:

Has there been real progress in multi-hop question answering? Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts. This limits our ability to measure true progress and defeats the purpose of building multi-hop QA datasets. We make three contributions towards addressing this. […]

Introduction
  • Multi-hop question answering requires connecting and synthesizing information from multiple facts in the input text, a process the authors refer to as multifact reasoning.
  • It has been shown that bad reasoning models, ones that by design do not connect information from multiple facts, can achieve high scores because they can exploit specific types of biases and artifacts in existing datasets (Min et al., 2019), e.g., on inputs such as "Which country got independence when the cold war started?"
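As an illustration of how such a bad reasoning model can still look competent, consider a hypothetical model that scores each answer candidate against one fact at a time and takes the best single-fact score. On the example question it can land on the right answer via lexical overlap with a single sentence, never connecting the two facts. Everything below is an illustrative sketch, not the authors' model; `score_fn` and the toy scorer are assumptions:

```python
# Hypothetical sketch of disconnected reasoning: each answer candidate is
# scored against each supporting fact in isolation, and the best single-fact
# score wins, so information is never combined across facts.
def disconnected_answer(question, facts, candidates, score_fn):
    best_answer, best_score = None, float("-inf")
    for cand in candidates:
        # Max over individual facts: no cross-fact interaction at all.
        score = max(score_fn(question, fact, cand) for fact in facts)
        if score > best_score:
            best_answer, best_score = cand, score
    return best_answer

# Toy artifact-based scorer: a candidate only gets credit from a fact that
# contains it, scored by word overlap between that fact and the question.
def overlap_score(question, fact, cand):
    if cand not in fact:
        return float("-inf")
    q_words = set(question.lower().split())
    return sum(w in q_words for w in fact.lower().split())

facts = [
    "Lithuania declared independence from the Soviet Union.",
    "The Cold War started in 1947.",
]
question = "Which country got independence when the cold war started?"
answer = disconnected_answer(question, facts, ["Lithuania", "France"], overlap_score)
```

Here "Lithuania" wins purely because it co-occurs with "independence" in one sentence; nothing connects that independence to the start of the Cold War.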
Highlights
  • Multi-hop question answering requires connecting and synthesizing information from multiple facts in the input text, a process we refer to as multifact reasoning
  • While it may appear that the newer models are more capable of multifact reasoning, we show most of these gains are from better exploitation of disconnected reasoning
  • Progress in multi-hop QA relies on understanding and quantifying the types of bad reasoning that can happen in current models
  • This work introduced a formalization of disconnected reasoning, a form of bad reasoning prevalent in multi-hop models
  • It showed that a large portion of current progress in multifact reasoning can be attributed to disconnected reasoning
  • Our results suggest that carefully devising tests that probe for desirable aspects of multifact reasoning is an effective way forward
Methods
  • To obtain a more realistic picture of the progress in multifact reasoning, the authors compare the performance of the original GloVe-based baseline model (Yang et al., 2018) and a state-of-the-art transformer-based LM, XLNet (Yang et al., 2019), on the multi-hop QA dataset HotpotQA (Yang et al., 2018).
  • The authors' proposed transformation reduces disconnected reasoning exploitable by these models and gives a more accurate picture of the state of multifact reasoning.
  • As described in Sec. 4.2, the authors use these supporting paragraph annotations as Fs to create a transformed dataset T(D).
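A minimal sketch, in the spirit of the paper's transformation for the |Fs| = 2 case: split each question's context into two groups so that each group contains exactly one of the two supporting paragraphs, and label each group as insufficient on its own. The field names (`question`, `context`, `supporting`) are assumed for illustration and are not the paper's actual schema:

```python
import random

def transform(example, seed=0):
    """Split one example's context into two groups, each holding exactly
    one of the two supporting paragraphs plus some distractors."""
    rng = random.Random(seed)
    support = [p for p in example["context"] if p in example["supporting"]]
    distractors = [p for p in example["context"] if p not in example["supporting"]]
    assert len(support) == 2, "this sketch assumes exactly two supporting paragraphs"
    rng.shuffle(distractors)
    half = len(distractors) // 2
    groups = [[support[0]] + distractors[:half],
              [support[1]] + distractors[half:]]
    # A model that claims either group alone suffices to answer must be
    # relying on disconnected reasoning, since each group holds only one hop.
    return [{"question": example["question"], "context": g, "sufficient": False}
            for g in groups]
```

Since neither group contains both hops, credit for answering or for claiming sufficiency from a single group can only come from disconnected reasoning.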
Results
  • The authors show F1 scores, but trends are similar for Exact Match scores.
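The F1 and Exact Match metrics referenced here are the standard answer-span metrics; a minimal sketch follows, omitting the official normalization steps (lowercasing is kept, but article and punctuation stripping are left out for brevity):

```python
from collections import Counter

def exact_match(prediction, gold):
    # 1 if the strings match after trimming and lowercasing, else 0.
    return int(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction, gold):
    # Harmonic mean of token-level precision and recall, using
    # multiset (bag-of-tokens) overlap between prediction and gold.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit for token overlap (predicting "the cold war" against gold "cold war" scores F1 = 0.8 but EM = 0), which is why the two metrics can differ in magnitude even when their trends agree.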
Conclusion
  • Progress in multi-hop QA relies on understanding and quantifying the types of bad reasoning that can happen in current models.
  • This work introduced a formalization of disconnected reasoning, a form of bad reasoning prevalent in multi-hop models.
  • It showed that a large portion of current progress in multifact reasoning can be attributed to disconnected reasoning.
  • It proposed a transformation that creates a more difficult and less cheatable dataset, resulting in reduced disconnected reasoning.
  • The authors' results suggest that carefully devising tests that probe for desirable aspects of multifact reasoning is an effective way forward.
Tables
  • Table 1: Performance of XLNet-Base compared to other transformer models (of similar size) on HotpotQA. Our model scores higher than BERT-Base models QFE (Nishida et al., 2019) and DFGN (Xiao et al., 2019), and performs comparably to recent models using RoBERTa and Longformer (Beltagy et al., 2020).
Related work
  • Multi-hop Reasoning: Many multifact reasoning approaches have been proposed for HotpotQA and other multi-hop reasoning datasets (Mihaylov et al., 2018; Khot et al., 2020). These models use iterative fact selection (Nishida et al., 2019; Tu et al., 2020), graph neural networks (Xiao et al., 2019; Fang et al., 2019; Tu et al., 2020), or just cross-document self-attention (Yang et al., 2019; Beltagy et al., 2020) in an attempt to capture the interactions between the paragraphs. While these approaches have pushed the state of the art, it is unclear whether the underlying models are making any progress on the problem of multifact reasoning.

    Identifying Dataset Artifacts: Several works have identified dataset artifacts for tasks such as NLI (Gururangan et al., 2018), reading comprehension (Feng et al., 2018; Sugawara et al., 2019), and even multi-hop reasoning (Min et al., 2019; Chen and Durrett, 2019). These artifacts allow models to solve the dataset without actually solving the underlying task. On HotpotQA, prior work has shown the existence of models that identify the support (Groeneveld et al., 2020) and the answer (Min et al., 2019; Chen and Durrett, 2019) by operating on each paragraph independently. We, on the other hand, estimate the amount of disconnected reasoning in any model and quantify the cheatability of both answer and support identification.
Funding
  • This work was supported in part by the National Science Foundation under Grant IIS-1815358 and Grant CCF-1918225
  • Computations on beaker.org were supported in part by credits from Google Cloud
Study subjects and analysis
  • Datasets: 4 (original D, adversarial Tadv(D), transformed T(D), and transformed adversarial T(Tadv(D)))
Figures
  • F1 scores of two models on D and T(D) under two common metrics. The transformed dataset is harder for both models since they rely on disconnected reasoning; the weaker baseline model drops more as it relies more heavily on disconnected reasoning.
  • F1 scores on various metrics for the four datasets: original D, adversarial Tadv(D), transformed T(D), and transformed adversarial T(Tadv(D)). Transformation is more effective than, and complementary to, adversarial augmentation for reducing DiRe scores.
  • Proposed dataset transformation and probes for the case of |Fs| = 2 supporting facts.
  • EM scores of XLNet-Base on the DiRe probes for D and T(D). The transformation reduces disconnected-reasoning bias, demonstrated by DiRe scores being substantially lower on T(D) than on D.
  • EM scores on various metrics for the same four datasets; again, transformation is more effective than, and complementary to, adversarial augmentation for reducing DiRe scores.
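A DiRe-style probe score could be computed, in sketch form, as the fraction of examples that a model solves while seeing only one group of the context at a time. This is illustrative only: `predict` and the field names are assumptions, and the paper's actual probes also cover support identification, not just answers:

```python
# Hedged sketch of a DiRe-style probe score: run the model on each context
# group separately and count the example as "cheatable" if some group alone
# already yields the correct answer, i.e., no cross-group reasoning needed.
def dire_score(examples, predict, is_correct):
    cheated = 0
    for ex in examples:
        group_preds = [predict(ex["question"], group) for group in ex["groups"]]
        if any(is_correct(pred, ex["answer"]) for pred in group_preds):
            cheated += 1
    return cheated / len(examples)
```

A lower DiRe score on T(D) than on D then indicates that the transformation removed much of the signal that single-group, disconnected reasoning could exploit.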

References
  • Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.
  • Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In NAACL-HLT.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2019. Hierarchical graph network for multi-hop question answering. arXiv:1911.03631.
  • Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In EMNLP.
  • Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Quan Zhang, and Ben Zhou. 2020. Evaluating NLP models via contrast sets. arXiv:2004.02709.
  • Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. arXiv:1803.07640.
  • Dirk Groeneveld, Tushar Khot, Mausam, and Ashish Sabharwal. 2020. A simple yet strong pipeline for HotpotQA. arXiv:2004.06753.
  • Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL-HLT.
  • Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.
  • Yichen Jiang and Mohit Bansal. 2019. Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop QA. In ACL.
  • Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. In AAAI.
  • Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. In MRQA@EMNLP.
  • Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019. Inoculation by fine-tuning: A method for analyzing challenge datasets. In NAACL-HLT.
  • Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP.
  • Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In ACL.
  • Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Atsushi Otsuka, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction. In ACL.
  • Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In ACL.
  • Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2019. Assessing the benchmarking capacity of machine reading comprehension datasets. arXiv:1911.09241.
  • Ming Tu, Kevin Huang, Guangtao Wang, Jui-Ting Huang, Xiaodong He, and Bowen Zhou. 2020. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In AAAI.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv:1910.03771.
  • Yunxuan Xiao, Yanru Qu, Lin Qiu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In ACL.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.