Evidence Aggregation for Answer Re-Ranking in Open-Domain Question Answering

International Conference on Learning Representations (ICLR), 2018.


Abstract:

Very recently, it has become popular to answer open-domain questions by first retrieving question-related passages and then applying reading comprehension models to extract answers. Existing work usually extracts answers from single passages independently, and thus does not fully make use of the multiple retrieved passages, especially f...

Introduction
  • Open-domain question answering (QA) aims to answer questions from a broad range of domains by effectively marshalling evidence from large open-domain knowledge sources.
  • Recent work on open-domain QA has focused on using unstructured text retrieved from the web to build machine comprehension models (Chen et al., 2017; Dhingra et al., 2017b; Wang et al., 2017).
  • These studies adopt a two-step process: an information retrieval (IR) model coarsely selects passages relevant to a question, and a reading comprehension (RC) model (Wang & Jiang, 2017; Seo et al., 2017; Chen et al., 2017) then infers an answer from those passages (a minimal sketch of this pipeline follows this list).
  • In some cases, the answer can only be determined by combining evidence from multiple passages.
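
As referenced above, here is a minimal sketch of the two-step search-then-read pipeline. It is purely illustrative: retrieve_passages and extract_answer_span are hypothetical stand-ins for an IR system and an RC model such as R3, not the authors' actual interfaces.

    # Minimal sketch of the two-step search-then-read pipeline described above.
    # retrieve_passages and extract_answer_span are hypothetical stand-ins for an
    # IR system and a reading-comprehension model; they are not the authors'
    # actual interfaces.
    from typing import List, Tuple

    def retrieve_passages(question: str, k: int = 50) -> List[str]:
        """Step 1 (IR): coarsely select k passages relevant to the question."""
        raise NotImplementedError  # e.g. a search engine or BM25 index over web text

    def extract_answer_span(question: str, passage: str) -> Tuple[str, float]:
        """Step 2 (RC): return an (answer span, confidence) pair for one passage."""
        raise NotImplementedError  # e.g. an R3-style reader

    def answer_open_domain(question: str) -> str:
        candidates = []
        for passage in retrieve_passages(question):
            span, score = extract_answer_span(question, passage)
            candidates.append((span, score))
        # Baseline behaviour the paper improves on: keep the single highest-scoring
        # span, ignoring evidence repeated or spread across the other passages.
        return max(candidates, key=lambda c: c[1])[0]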
Highlights
  • Open-domain question answering (QA) aims to answer questions from a broad range of domains by effectively marshalling evidence from large open-domain knowledge sources
  • To demonstrate that R3 serves as a strong baseline on the TriviaQA data, we generate R3 results following the leaderboard setting
  • The results showed that R3 achieved F1 56.0, Exact Match (EM) 50.9 on the Wiki domain and F1 68.5, EM 63.0 on the Web domain, which is competitive with the state of the art
  • We use F1 score and Exact Match (EM) as our evaluation metrics (a short sketch of both metrics follows this list)
  • Our model performs much better than humans on the SearchQA dataset
  • We see that our coverage-based re-ranker achieves consistently good performance on all three datasets, even though it is marginally behind the strength-based re-ranker on SearchQA
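
As referenced in the metrics bullet above, here is a minimal sketch of Exact Match and token-level F1. The paper does not spell out its answer normalisation, so the lower-casing and whitespace tokenisation below are assumptions in the usual SQuAD style.

    # Hedged sketch of Exact Match and token-level F1. Lower-casing and whitespace
    # tokenisation are assumptions (common SQuAD-style practice); the paper does
    # not detail its normalisation.
    from collections import Counter

    def exact_match(prediction: str, ground_truth: str) -> float:
        return float(prediction.strip().lower() == ground_truth.strip().lower())

    def f1_score(prediction: str, ground_truth: str) -> float:
        pred_tokens = prediction.lower().split()
        gold_tokens = ground_truth.lower().split()
        common = Counter(pred_tokens) & Counter(gold_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    print(exact_match("sesame street", "Sesame Street"))   # 1.0
    print(f1_score("the sesame street", "Sesame Street"))  # 0.8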
Methods
  • The authors aim to find the correct answer a_g to a question q using information retrieved from the web.
  • A reading comprehension (RC) model is used to extract the answer from these passages.
  • When developing a reading comprehension system, the authors can use the specific positions of the answer sequence in the given passage for training.
  • The goal of the re-ranker is to rank this list of candidates so that the top-ranked candidates are more likely to be the correct answer a_g
  • With access to these additional features, the re-ranking step has the potential to prioritize answers not discoverable by the base system alone.
  • An overview of the method is shown in Figure 2; a toy sketch of the strength-based (counting) re-ranker follows this list.
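
The sketch below illustrates the strength-based (counting) re-ranking idea referenced above: candidate answers from the base reader are grouped after light normalisation and ranked by how many retrieved passages mention them. The normalisation and tie-breaking rules are illustrative assumptions, not the authors' exact procedure.

    # Toy sketch of strength-based re-ranking by counting: rank each candidate
    # answer by how many passages mention it (evidence strength), breaking ties
    # with the base reader's confidence. Normalisation and tie-breaking here are
    # illustrative assumptions.
    from collections import defaultdict
    from typing import List, Tuple

    def strength_rerank(candidates: List[Tuple[str, float]],
                        passages: List[str]) -> List[Tuple[str, int, float]]:
        """candidates: (answer span, RC confidence) pairs from the base reader."""
        best_rc = defaultdict(float)
        for span, score in candidates:
            key = span.strip().lower()
            best_rc[key] = max(best_rc[key], score)
        ranked = []
        for key, rc_score in best_rc.items():
            support = sum(key in p.lower() for p in passages)
            ranked.append((key, support, rc_score))
        # More supporting passages first; RC confidence as a tie-breaker.
        return sorted(ranked, key=lambda x: (x[1], x[2]), reverse=True)

    passages = ["Big Bird lives on Sesame Street.",
                "Sesame Street first aired in 1969.",
                "The Muppet Show aired on ITV."]
    print(strength_rerank([("Sesame Street", 0.4), ("The Muppet Show", 0.5)],
                          passages)[0][0])  # 'sesame street' wins on evidence count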
Results
  • The authors present results and analysis of the different re-ranking methods on three public datasets.

    To demonstrate that R3 serves as a strong baseline on the TriviaQA data, the authors generate R3 results following the leaderboard setting.
  • The results showed that R3 achieved F1 56.0, EM 50.9 on the Wiki domain and F1 68.5, EM 63.0 on the Web domain, which is competitive with the state of the art
  • This confirms that R3 is a competitive baseline when extending the TriviaQA questions to the open-domain setting.
  • From the results in Table 2, the authors see that the BM25-based re-ranker improves the F1 scores over the R3 model, but it still trails the neural coverage-based re-ranker (a simplified coverage-scoring sketch follows this list).
  • The BM25-based re-ranker can improve the F1 score while the EM score sometimes becomes worse
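
To make the coverage intuition behind both the BM25-based baseline and the neural coverage-based re-ranker more tangible, the sketch below scores a candidate by how much of the question is covered by the union of passages that contain it. It is a simplified stand-in with a toy stop-word list, not the paper's BM25 feature set or its neural model.

    # Simplified stand-in for the coverage idea: concatenate the passages that
    # contain a candidate answer and measure what fraction of the (non-stop-word)
    # question terms that union covers. Not the paper's BM25 features or its
    # neural coverage-based re-ranker; the stop-word list is a toy assumption.
    from typing import Dict, List

    STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "what", "who", "which"}

    def coverage_score(question: str, candidate: str, passages: List[str]) -> float:
        q_terms = {t for t in question.lower().split() if t not in STOPWORDS}
        union = " ".join(p.lower() for p in passages if candidate.lower() in p.lower())
        if not q_terms:
            return 0.0
        return sum(t in union for t in q_terms) / len(q_terms)

    def coverage_rerank(question: str, candidates: List[str],
                        passages: List[str]) -> Dict[str, float]:
        return {c: coverage_score(question, c, passages) for c in candidates}

Candidates whose supporting passages jointly cover more of the question are promoted, mirroring the complementary-evidence case the conclusion refers to.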
Conclusion
  • The authors have observed that open-domain QA can be improved by explicitly combining evidence from multiple retrieved passages.
  • The authors experimented with two types of re-rankers, one for the case where evidence is consistent and another when evidence is complementary.
  • Both re-rankers helped to significantly improve the results individually, and even more together.
  • Although the authors' proposed methods achieved some success in modeling the union or co-occurrence of multiple passages, there are still much harder problems in open-domain QA that require reasoning and commonsense-inference abilities.
  • The authors will explore these directions, and they believe the proposed approach could potentially generalize to these more difficult multi-passage reasoning scenarios
Tables
  • Table1: Statistics of the datasets. #q is the number of questions in the training (not counting questions whose passages contain no ground-truth answer), development, and test sets. #p is the number of passages per question. For TriviaQA, we split the raw documents into sentence-level passages and select the top 100 passages based on their overlap with the corresponding question. #p(golden) is the average number of passages that contain the ground-truth answer. #p(aggregated) is the average number of passages we aggregate for the top 10 candidate answers provided by the RC model
  • Table2: Experiment results on three open-domain QA test datasets: Quasar-T, SearchQA and TriviaQA (open-domain setting). EM: Exact Match. Full Re-ranker is the combination of three different re-rankers
  • Table3: The upper bound (recall) of the top-K answer candidates generated by the baseline R3 system (on the dev set), which indicates the potential of the coverage-based re-ranker (a short recall@K sketch follows these captions)
  • Table4: Results of running coverage-based re-ranker on different number of the top-K answer candidates on Quasar-T (dev set)
  • Table5: Results of running strength-based re-ranker (counting) on different number of top-K answer candidates on Quasar-T (dev set)
  • Table6: An example from the Quasar-T dataset. The ground-truth answer is "Sesame Street". Q: question, A: answer, P: passages containing the corresponding answer
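
As noted in the Table 3 caption, the upper bound can be computed as a simple recall@K over the base reader's candidate lists; the sketch below shows one way to do so, assuming exact string matching after lower-casing.

    # Sketch of the Table 3-style upper bound: recall@K is the fraction of
    # questions whose top-K candidates from the base reader contain a
    # ground-truth answer. Exact matching after lower-casing is an assumption.
    from typing import List

    def recall_at_k(top_candidates: List[List[str]],
                    gold_answers: List[List[str]], k: int) -> float:
        hits = 0
        for cands, golds in zip(top_candidates, gold_answers):
            normalized = {c.lower().strip() for c in cands[:k]}
            if any(g.lower().strip() in normalized for g in golds):
                hits += 1
        return hits / len(gold_answers) if gold_answers else 0.0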
Related work
  • Open Domain Question Answering The task of open-domain question answering dates back at least to Green Jr et al. (1961) and was popularized by TREC-8 (Voorhees, 1999). The task is to produce the answer to a question by exploiting resources such as documents (Voorhees, 1999), webpages (Kwok et al., 2001) or structured knowledge bases (Berant et al., 2013; Bordes et al., 2015; Yu et al., 2017).

    Recent efforts (Chen et al., 2017; Dunn et al., 2017; Dhingra et al., 2017b; Wang et al., 2017) benefit from advances in machine reading comprehension (RC) and follow the search-and-read QA direction. These deep-learning-based methods usually rely on a document retrieval module to retrieve a list of passages from which RC models extract answers. As there is no passage-level annotation of which passages entail the answer, the model has to find ways to handle the noise introduced in the IR step. Chen et al. (2017) use a bi-gram passage index to improve the retrieval step; Dunn et al. (2017) and Dhingra et al. (2017b) propose to reduce the length of the retrieved passages. Wang et al. (2017) focus more on noise reduction in the passage-ranking step, in which a ranker module is jointly trained with the RC model using reinforcement learning.
Funding
  • This work was partially supported by DSO grant DSOCL15223
Study subjects and analysis
public open-domain QA datasets: 3
We propose two methods, namely strength-based re-ranking and coverage-based re-ranking, to make use of the aggregated evidence from different passages to better determine the answer. Our models have achieved state-of-the-art results on three public open-domain QA datasets: Quasar-T, SearchQA and the open-domain version of TriviaQA, with about 8 percentage points of improvement on the former two datasets. Given a question q, we are trying to find the correct answer a_g to q using information retrieved from the web

datasets: 3
The statistics of the three datasets are shown in Table 1. Quasar-T (Dhingra et al., 2017b) is based on a trivia question set

public datasets: 3
We first use a pre-trained R3 model (Wang et al., 2017), which achieves state-of-the-art performance on the three public datasets we consider, to generate the top 50 candidate spans for the training, development and test sets, and we use them for further ranking. During training, if the ground-truth answer does not appear among the answer candidates, we manually add it to the candidate list (sketched below)
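
A hedged sketch of the candidate preparation just described follows. pretrained_reader_topk is a hypothetical placeholder for the pre-trained R3 reader, not its actual API, and the de-duplication rule is an assumption.

    # Hedged sketch of preparing re-ranker candidates as described above: take the
    # top-N spans from a pre-trained reader and, at training time only, force the
    # ground-truth answer into the list if it is missing.
    # pretrained_reader_topk is a hypothetical placeholder, not the R3 API.
    from typing import List, Optional

    def pretrained_reader_topk(question: str, passages: List[str],
                               n: int = 50) -> List[str]:
        raise NotImplementedError  # stands in for the pre-trained R3 reader

    def build_candidates(question: str, passages: List[str],
                         gold_answer: Optional[str] = None, n: int = 50) -> List[str]:
        candidates = pretrained_reader_topk(question, passages, n)
        if gold_answer is not None:  # training time only
            normalized = {c.lower().strip() for c in candidates}
            if gold_answer.lower().strip() not in normalized:
                candidates.append(gold_answer)  # manually add the gold answer
        return candidates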


Reference
  • Hannah Bast and Elmar Haussmann. More accurate question answering on freebase. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1431–1440. ACM, 2015.
  • Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013.
  • Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question answering with memory networks. Proceedings of the International Conference on Learning Representations, 2015.
  • Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, and Wei Wang. Ask the right questions: Active question reformulation with reinforcement learning. arXiv preprint arXiv:1705.07830, 2017.
  • Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer opendomain questions. In Proceedings of the Conference on Association for Computational Linguistics, 2017.
  • Michael Collins and Terry Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–70, 2005.
  • Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-overattention neural networks for reading comprehension. Proceedings of the Conference on Association for Computational Linguistics, 2017.
  • Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Gatedattention readers for text comprehension. Proceedings of the Conference on Association for Computational Linguistics, 2017a.
  • Bhuwan Dhingra, Kathryn Mazaitis, and William W Cohen. QUASAR: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904, 2017b.
  • Matthew Dunn, Levent Sagun, Mike Higgins, Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A new q&a dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179, 2017.
  • Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A Smith. Recurrent neural network grammars. In Proceedings of the Conference on the North American Chapter of the Association for Computational Linguistics, 2016.
  • David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.
  • Bert F Green Jr, Alice K Wolf, Carol Chomsky, and Kenneth Laughery. Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference, pp. 219–224. ACM, 1961.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693–1701, 2015.
  • Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997.
  • Liang Huang. Forest reranking: Discriminative parsing with non-local features. In Proceedings of the Conference on Association for Computational Linguistics, 2008.
  • Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2017.
  • Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum reader network. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2016.
  • Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.
  • Cody Kwok, Oren Etzioni, and Daniel S Weld. Scaling question answering to the web. ACM Transactions on Information Systems (TOIS), 19(3):242–262, 2001.
  • Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. International Conference on Machine Learning, 2017.
  • Ankur P Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016.
  • Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016.
  • Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
  • Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In Proceedings of the International Conference on Learning Representations, 2017.
  • Libin Shen, Anoop Sarkar, and Franz Josef Och. Discriminative reranking for machine translation. In Proceedings of the Conference on the North American Chapter of the Association for Computational Linguistics, 2004.
  • Chuanqi Tan, Furu Wei, Nan Yang, Weifeng Lv, and Ming Zhou. S-net: From answer extraction to answer generation for machine reading comprehension. arXiv preprint arXiv:1706.04815, 2017.
  • Adam Trischler, Zheng Ye, Xingdi Yuan, and Kaheer Suleman. Natural language comprehension with the epireader. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016.
  • Ellen M. Voorhees. The TREC-8 question answering track report. In TREC, volume 99, pp. 77–82, 1999.
  • Shuohang Wang and Jing Jiang. Learning natural language inference with LSTM. In Proceedings of the Conference on the North American Chapter of the Association for Computational Linguistics, 2016.
  • Shuohang Wang and Jing Jiang. Machine comprehension using match-LSTM and answer pointer. In Proceedings of the International Conference on Learning Representations, 2017.
  • Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced reader-ranker for open-domain question answering. arXiv preprint arXiv:1709.00023, 2017.
  • Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. Question answering on freebase via relation extraction and textual evidence. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2326–2336, Berlin, Germany, August 2016. Association for Computational Linguistics.
  • Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1321–1331, Beijing, China, July 2015. Association for Computational Linguistics.
  • Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. Improved neural relation detection for knowledge base question answering. Proceedings of the Conference on Association for Computational Linguistics, 2017.