A Simple Yet Strong Pipeline for HotpotQA

Dirk Groeneveld

EMNLP 2020.

Keywords:
Reading Comprehension, multi-hop dataset, question decomposition, Graph Neural Network, simple pipeline

Abstract:

State-of-the-art models for multi-hop question answering typically augment large-scale language models like BERT with additional, intuitively useful capabilities such as named entity recognition, graph-based reasoning, and question decomposition. However, does their strong performance on popular multi-hop datasets really justify this added design complexity?

Introduction
  • Textual Multi-hop Question Answering (QA) is the task of answering questions by combining information from multiple sentences or documents.
  • This is a challenging reasoning task that requires QA systems to identify relevant pieces of information in the given text and learn to compose them to answer a question.
  • For example, the relevance of “Obama was born in Hawaii.” to the question “Where was the 44th President of the USA born?” depends on another relevant sentence: “Obama was the 44th President of the US.” As a result, many approaches designed for this task focus on jointly identifying the relevant sentences via mechanisms such as cross-document attention, graph networks, and entity linking.
Highlights
  • Textual Multi-hop Question Answering (QA) is the task of answering questions by combining information from multiple sentences or documents
  • We show that while relevant sentences can be selected independently, operating jointly over these sentences chosen from multiple paragraphs can lead to state-of-the-art question-answering results, outperforming independent answer selection by several points (a minimal sketch of the independent sentence-scoring step appears after this list)
  • Note that r_a(s) is a logit score and can be negative, so adding a sentence may not always improve this score. We evaluate on both the distractor and fullwiki settings of HotpotQA with the following goal: can a simple pipeline model outperform previous, more complex approaches? We present the EM (Exact Match) and F1 scores on the evaluation metrics proposed for HotpotQA: (1) answer selection, (2) support selection, and (3) joint score
  • Our work shows that on the HotpotQA tasks, a simple pipeline model can do as well as or better than more complex solutions
  • By operating jointly over these sentences chosen from multiple paragraphs, we arrive at answers and supporting sentences on par with state-of-the-art approaches. This result shows that retrieval in HotpotQA is not itself a multi-hop problem, and suggests focusing on other multi-hop datasets to demonstrate the value of more complex techniques
Methods
  • The authors evaluate on both the distractor and fullwiki settings of HotpotQA with the following goal: can a simple pipeline model outperform previous, more complex approaches? They present the EM (Exact Match) and F1 scores on the evaluation metrics proposed for HotpotQA: (1) answer selection, (2) support selection, and (3) joint score; a hedged sketch of how these three scores combine appears after this list.

    Table 1 shows that on the distractor setting, QUARK outperforms all previous models based on BERT, including HGN, which, like QUARK, uses whole-word-masking BERT for contextual embeddings.
  • QUARK outperforms the recent single-paragraph approach to the QA subtask (Min et al., 2019a) by 14 F1 points
  • While most of this gain comes from using a larger language model, QUARK scores 2 points higher even with a language model of the same size (BERT-Base).
  • While the authors rely on retrieval from SR-MRS (Nie et al., 2019) for the initial paragraphs, they outperform that original work
  • The authors attribute this improvement to two factors: (1) sentence selection that capitalizes on each sentence’s paragraph context, leading to better support selection, and (2) a better span selection model
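For reference, a hedged sketch of how the answer, support, and joint scores relate, following the metric definitions used by the official HotpotQA evaluation; answer-string normalization is simplified relative to the official script:

```python
# Answer EM/F1 compare predicted and gold answer strings; support EM/F1
# compare predicted and gold supporting-sentence sets; the joint metric
# multiplies the two precisions and the two recalls before computing F1.

def f1(prec: float, rec: float) -> float:
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

def answer_prf(pred: str, gold: str):
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = sum(min(pred_toks.count(t), gold_toks.count(t)) for t in set(gold_toks))
    if not pred_toks or not gold_toks or common == 0:
        return 0.0, 0.0
    return common / len(pred_toks), common / len(gold_toks)

def support_prf(pred_ids: set, gold_ids: set):
    true_pos = len(pred_ids & gold_ids)
    prec = true_pos / len(pred_ids) if pred_ids else 0.0
    rec = true_pos / len(gold_ids) if gold_ids else 0.0
    return prec, rec

def joint_scores(pred_ans, gold_ans, pred_sp, gold_sp):
    ans_p, ans_r = answer_prf(pred_ans, gold_ans)
    sp_p, sp_r = support_prf(pred_sp, gold_sp)
    joint_p, joint_r = ans_p * sp_p, ans_r * sp_r
    return {
        "answer_f1": f1(ans_p, ans_r),
        "support_f1": f1(sp_p, sp_r),
        "joint_f1": f1(joint_p, joint_r),
        # Joint EM requires both the answer and the full support set to match.
        "joint_em": float(pred_ans.lower() == gold_ans.lower() and pred_sp == gold_sp),
    }
```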
Conclusion
  • The authors' work shows that on the HotpotQA tasks, a simple pipeline model can do as well as or better than more complex solutions.
  • Powerful pre-trained models allow them to score sentences one at a time, without looking at other paragraphs.
  • By operating jointly over these sentences chosen from multiple paragraphs, the authors arrive at answers and supporting sentences on par with state-of-the-art approaches (a rough sketch of this joint answering step follows this list).
  • This result shows that retrieval in HotpotQA is not itself a multi-hop problem, and suggests focusing on other multi-hop datasets to demonstrate the value of more complex techniques
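A rough illustration of the joint answering step, assuming the selected sentences are concatenated into a single context for a standard extractive QA reader; the checkpoint name and greedy span decoding are assumptions, not the authors' exact setup, and yes/no questions are not handled here:

```python
# Sketch: feed the independently selected sentences, concatenated across
# paragraphs, into a standard span-prediction model and decode an answer.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

checkpoint = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
model.eval()

def answer_from_sentences(question: str, selected_sentences) -> str:
    # Sentences drawn from multiple paragraphs are merged into one context.
    context = " ".join(selected_sentences)
    inputs = tokenizer(question, context, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    start = int(out.start_logits.argmax())
    end = int(out.end_logits[0, start:].argmax()) + start  # enforce end >= start
    span_ids = inputs["input_ids"][0, start:end + 1]
    return tokenizer.decode(span_ids, skip_special_tokens=True)
```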
Tables
  • Table 1: HotpotQA’s distractor setting, Dev set. The bottom two models use larger language models than QUARK
  • Table 2: HotpotQA’s fullwiki setting, Test set. The bottom-most model uses a larger language model than QUARK
  • Table 3: Ablation study on sentence selection in the distractor setting. top-n indicates the number of sentences required to cover the annotated support sentences in 90% of the questions (a small sketch of this statistic follows this list)
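One way the top-n statistic could be computed; the function below and its percentile convention are assumptions about how the table was produced, not the authors' code:

```python
import math

def top_n_for_coverage(questions, coverage=0.90):
    """questions: list of (ranked_sentence_ids, gold_support_ids) pairs, where
    ranked_sentence_ids is sorted by the selector's relevance score and is
    assumed to contain every gold support id."""
    cutoffs = []
    for ranked_ids, gold_ids in questions:
        # Deepest (1-based) rank at which any gold support sentence appears.
        cutoffs.append(max(ranked_ids.index(g) + 1 for g in gold_ids))
    cutoffs.sort()
    # Smallest n such that at least `coverage` of questions are fully covered.
    return cutoffs[max(0, math.ceil(coverage * len(cutoffs)) - 1)]
```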
Related work
  • Most approaches for HotpotQA attempt to capture the interactions between the paragraphs by either relying on cross-attention between documents or sequentially selecting paragraphs based on the previously selected paragraphs.

  • While Nishida et al. (2019) also use a standard Reading Comprehension (RC) model, they combine it with a special Query Focused Extractor (QFE) module to select relevant sentences for QA and explanation. The QFE module sequentially identifies relevant sentences by updating an RNN state representation at each step, allowing the model to capture dependencies between sentences across time-steps.
  • Xiao et al. (2019) propose the Dynamically Fused Graph Network (DFGN) model, which first extracts entities from paragraphs to create an entity graph, then dynamically extracts subgraphs and fuses them with the paragraph representation.
  • The Select, Answer, Explain (SAE) model (Tu et al., 2019) is similar to our approach in that it also first selects relevant documents and uses them to produce answers and explanations. However, it relies on self-attention over all document representations to capture potential interactions, and additionally on a Graph Neural Network (GNN) to answer the questions.
  • The Hierarchical Graph Network (HGN) model (Fang et al., 2019) builds a hierarchical graph with three levels (entities, sentences, and paragraphs) to allow for joint reasoning.
  • DecompRC (Min et al., 2019b) takes a completely different approach: it learns to decompose the question (using additional annotations) and then answers the decomposed questions using a standard single-hop RC system.
Reference
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2019. Hierarchical graph network for multi-hop question answering. ArXiv, abs/1911.03631.
  • Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2020. QASC: A dataset for question answering via sentence composition. In AAAI.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke S. Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.
  • Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke S. Zettlemoyer. 2019a. Compositional questions do not necessitate multi-hop reasoning. In ACL.
  • Sewon Min, Victor Zhong, Luke S. Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Multi-hop reading comprehension through question decomposition and rescoring. In ACL.
  • Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. Revealing the importance of semantic retrieval for machine reading at scale. In EMNLP-IJCNLP.
  • Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Atsushi Otsuka, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction. In ACL.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
  • Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In NAACL-HLT.
  • Ming Tu, Kevin Huang, Guangtao Wang, Jui-Ting Huang, Xiaodong He, and Bufang Zhou. 2019. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents.
  • Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL, 6:287–302.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Yunxuan Xiao, Yanru Qu, Lin Qiu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In ACL.
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.
  • During training, we follow the fine-tuning advice from Devlin et al. (2019), with two exceptions. We ramp the learning rate up from 0 to 10^-5 over the first 10% of the batches, and then linearly decrease it back to 0 (sketched below).
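A minimal sketch of that learning-rate schedule, using the warmup-then-linear-decay helper from the transformers library; the optimizer choice and step counts are illustrative assumptions:

```python
# Ramp the learning rate from 0 to 1e-5 over the first 10% of batches,
# then decay it linearly back to 0 over the remaining batches.
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_schedule(model, total_steps, peak_lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.10 * total_steps),  # first 10% of the batches
        num_training_steps=total_steps,
    )
    return optimizer, scheduler  # call scheduler.step() after each batch
```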