R^3: Reinforced Ranker-Reader for Open-Domain Question Answering

Shuohang Wang
Xiaoxiao Guo
Zhiguo Wang
Wei Zhang

National Conference on Artificial Intelligence (AAAI), 2018.


Abstract:

In recent years researchers have achieved considerable success applying neural network methods to question answering (QA). These approaches have achieved state of the art results in simplified closed-domain settings such as the SQuAD (Rajpurkar et al., 2016) dataset, which provides a pre-selected passage, from which the answer to a given question may be extracted.

Introduction
  • Open-domain question answering (QA) is a key challenge in natural language processing.
  • A successful open-domain QA system must be able to effectively retrieve and comprehend one or more knowledge sources to infer a correct answer.
  • Example (cf. Table 1): Q: What is the largest island in the Philippines? A: Luzon
  • P1: Mindanao is the second largest and easternmost island in the Philippines.
  • P2: As an island, Luzon is the Philippines' largest at ...
  • P3: Manila, located on east central Luzon Island, is the national capital and largest city.
Highlights
  • Open-domain question answering (QA) is a key challenge in natural language processing
  • We use F1 score and Exact Match (EM) evaluation metrics (a sketch of both metrics follows this list)
  • We have proposed and evaluated R3, a new open-domain QA framework which combines information retrieval (IR) with a deep learning based Ranker and Reader
  • First, the IR model retrieves the top-N passages conditioned on the question
  • The Ranker and Reader are trained jointly using reinforcement learning to directly optimize the expectation of extracting the ground-truth answer from the retrieved passages
  • Our framework achieves the best performance on several QA datasets
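
The evaluation metrics above are the standard SQuAD-style ones. The following is a minimal sketch of how Exact Match and token-level F1 are typically computed; the normalization (lowercasing, stripping punctuation and articles) follows the common convention, and the helper names are illustrative rather than taken from the authors' code.

    # Sketch of SQuAD-style Exact Match (EM) and token-level F1.
    import re
    import string
    from collections import Counter

    def normalize(text):
        """Lowercase, drop punctuation and articles, collapse whitespace."""
        text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(prediction, ground_truth):
        return float(normalize(prediction) == normalize(ground_truth))

    def f1_score(prediction, ground_truth):
        pred_tokens = normalize(prediction).split()
        gt_tokens = normalize(ground_truth).split()
        common = Counter(pred_tokens) & Counter(gt_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(gt_tokens)
        return 2 * precision * recall / (precision + recall)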
Results
  • In the Results and Analysis section, the authors show the performance of different models on five QA datasets and offer further analysis.
  • The authors observe that the Reinforced Ranker-Reader (R3) achieves the best performance on the Quasar-T, WikiMovies, and CuratedTREC datasets, and that it performs significantly better than the internal baseline model Simple Ranker-Reader (SR2) on all datasets except CuratedTREC
  • These results demonstrate the effectiveness of using RL to jointly train the Ranker and Reader, both against competing approaches and against the non-RL Ranker-Reader baseline
Conclusion
  • The authors have proposed and evaluated R3, a new open-domain QA framework which combines IR with a deep learning based Ranker and Reader.
  • First, the IR model retrieves the top-N passages conditioned on the question.
  • The Ranker and Reader are trained jointly using reinforcement learning to directly optimize the expectation of extracting the ground-truth answer from the retrieved passages (a minimal sketch of this training step follows this list).
  • The authors' framework achieves the best performance on several QA datasets
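
To make the joint training concrete, here is a minimal sketch of one REINFORCE-style (Williams 1992) update as described above: the Ranker samples a passage from the top-N retrieved ones, the Reader computes its supervised span loss on that passage, and the quality of the extracted span is the reward that weights the Ranker's log-probability. This is a sketch under assumptions, not the authors' implementation: the ranker/reader interfaces, the use of token-level F1 as the reward (the paper's reward is more structured), and the single-passage sampling are all simplifications.

    # Minimal sketch of one joint Ranker-Reader RL update (REINFORCE).
    # Assumed interfaces (not from the paper's code):
    #   ranker(question, passages)        -> one logit per passage, shape (N,)
    #   reader(question, passage, answer) -> (span_loss, predicted_answer_string)
    # f1_score is the token-level F1 from the metric sketch above.
    import torch

    def joint_training_step(ranker, reader, optimizer, question, passages, answer):
        # Ranker: distribution over the top-N retrieved passages; sample one.
        rank_logits = ranker(question, passages)
        dist = torch.distributions.Categorical(logits=rank_logits)
        idx = dist.sample()

        # Reader: supervised span loss and decoded answer on the sampled passage.
        reader_loss, predicted_answer = reader(question, passages[idx.item()], answer)

        # Reward: overlap of the extracted span with the ground-truth answer
        # (token-level F1 here; the paper uses a more structured reward).
        reward = f1_score(predicted_answer, answer)

        # REINFORCE term for the Ranker plus supervised loss for the Reader.
        loss = -reward * dist.log_prob(idx) + reader_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item(), reward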
Tables
  • Table1: An open-domain QA training example. Q: question, A: answer, P: passages retrieved by an IR model and ordered by IR score
  • Table2: Statistics of the datasets. #q represents the number of questions. For the training dataset, we ignore questions for which no retrieved passage contains an answer. In the special case that there is only one answer for a question, we combine the question with the answer as the retrieval query during training to improve IR recall; otherwise we use only the question. #p represents the number of passages, and "14.8 / 100" means that on average 14.8 of the 100 retrieved passages contain the answer. We use the top-50 passages retrieved by the IR model for testing
  • Table3: Open-domain question answering results. The results show the average of 5 runs, with standard error in the superscript. The CuratedTREC and WebQuestions models are initialized by training on SQuAD-OPEN first. At the bottom, YodaQA and DrQA-MTL use additional resources (a KB for the former, multiple training datasets for the latter), so they are not a true apples-to-apples comparison to the other methods. EM: Exact Match
  • Table4: Effects of rankers from SR2 and R3 (on the Quasar-T test dataset). Here we use the same single reader model (SR) as the reader, combined with two different rankers. The performance of the two runs of SR2 and R3 (that provide the rankers) is listed at the bottom. In this setting, the Reader is the same, while the Rankers are trained differently
  • Table5: Potential improvement on QA performance by improving the ranker, based on the Quasar-T test dataset. The TOP-3/5 performance is used to evaluate the potential for further improvement from better rankers (see the "Potential Improvement" section)
  • Table6: An example of the answers extracted by the R3 and SR2 methods, given the question. The words in bold are the extracted answers. The passages are ranked by the highest score (Ranker+Reader) of the answer span in each passage
  • Table7: The performance of Rankers (recall of the top-k ranked passages) on the Quasar-T test dataset. This evaluation is simply based on whether the ground-truth answer appears in the top-k passages; IR directly uses the ranking score from the raw dataset (a sketch of this recall computation follows)
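
As a concrete illustration of the Table 7 evaluation, the following is a minimal sketch of top-k ranker recall: a question counts as a hit if any of its ground-truth answer strings appears in at least one of its top-k ranked passages. The data layout and the simple lowercase substring match are assumptions made for this sketch.

    # Sketch of the top-k recall used to evaluate rankers (cf. Table 7).
    def top_k_recall(ranked_passages_per_question, answers_per_question, k):
        """ranked_passages_per_question: list of lists of passage strings,
        already ordered by the ranker; answers_per_question: list of lists of
        acceptable answer strings. Both layouts are assumed for illustration."""
        hits = 0
        for passages, answers in zip(ranked_passages_per_question, answers_per_question):
            top_k = [p.lower() for p in passages[:k]]
            if any(a.lower() in p for a in answers for p in top_k):
                hits += 1
        return hits / len(ranked_passages_per_question)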
Related work
  • Open-domain question answering dates back at least to Green Jr et al. (1961) and was popularized with TREC-8 (Voorhees 1999). The task is to answer a question by exploiting resources such as documents (Voorhees 1999), webpages (Kwok, Etzioni, and Weld 2001; Chen and Van Durme 2017), or structured knowledge bases (Berant et al. 2013; Bordes et al. 2015; Yu et al. 2017). Since TREC-8, an early consensus has formed around an approach with three major components: question analysis, document retrieval and ranking, and answer extraction. Although question analysis is relatively mature, answer extraction and document ranking still represent significant challenges.

    Very recently, IR plus machine reading comprehension (SR-QA) showed promise for open-domain QA, especially after the release of datasets created specifically for the multiple-passage RC setting (Nguyen et al. 2016; Chen et al. 2017a; Joshi et al. 2017; Dunn et al. 2017; Dhingra, Mazaitis, and Cohen 2017). These datasets deal with the end-to-end open-domain QA setting.
References
  • [2015] Baudis, P., and Sedivy, J. 2015. Modeling of the question answering task in the YodaQA system. In Intl. Conf. of the Cross-Language Evaluation Forum for European Languages.
  • [2013] Berant, J.; Chou, A.; Frostig, R.; and Liang, P. 2013. Semantic parsing on Freebase from question-answer pairs. In Proc. of Conf. on EMNLP.
  • [2015] Bordes, A.; Usunier, N.; Chopra, S.; and Weston, J. 2015. Large-scale simple question answering with memory networks. In Proc. of ICLR.
  • [2017] Chen, T., and Van Durme, B. 2017. Discriminative information retrieval for question answering sentence selection. In Proc. of Conf. on EACL.
  • [2017a] Chen, D.; Fisch, A.; Weston, J.; and Bordes, A. 2017a. Reading Wikipedia to answer open-domain questions. In Proc. of ACL.
  • [2017b] Chen, Q.; Zhu, X.; Ling, Z.; Wei, S.; Jiang, H.; and Inkpen, D. 2017b. Enhanced LSTM for natural language inference. In Proc. of ACL.
  • [2016] Cheng, J., and Lapata, M. 2016. Neural summarization by extracting sentences and words. In Proc. of ACL.
  • [2017] Choi, E.; Hewlett, D.; Uszkoreit, J.; Polosukhin, I.; Lacoste, A.; and Berant, J. 2017. Coarse-to-fine question answering for long documents. In Proc. of ACL.
  • [2017] Dhingra, B.; Liu, H.; Yang, Z.; Cohen, W. W.; and Salakhutdinov, R. 2017. Gated-attention readers for text comprehension. In Proc. of ACL.
  • [2017] Dhingra, B.; Mazaitis, K.; and Cohen, W. W. 2017. QUASAR: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904.
  • [2017] Dunn, M.; Sagun, L.; Higgins, M.; Guney, U.; Cirik, V.; and Cho, K. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
  • [2010] Ferrucci, D.; Brown, E.; Chu-Carroll, J.; Fan, J.; Gondek, D.; Kalyanpur, A. A.; et al. 2010. Building Watson: An overview of the DeepQA project. AI Magazine 31(3):59–79.
  • [1961] Green Jr, B. F.; Wolf, A. K.; Chomsky, C.; and Laughery, K. 1961. Baseball: An automatic question-answerer. In Papers presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conf., 219–224. ACM.
  • [2017] Joshi, M.; Choi, E.; Weld, D. S.; and Zettlemoyer, L. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proc. of ACL.
  • [2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.
  • [2001] Kwok, C.; Etzioni, O.; and Weld, D. S. 2001. Scaling question answering to the web. ACM Transactions on Information Systems (TOIS) 19(3):242–262.
  • [2016] Lei, T.; Barzilay, R.; and Jaakkola, T. 2016. Rationalizing neural predictions. In Proc. of Conf. on EMNLP.
  • [2014] Mnih, V.; Heess, N.; Graves, A.; et al. 2014. Recurrent models of visual attention. In Advances in NIPS, 2204–2212.
  • [2016] Narasimhan, K.; Yala, A.; and Barzilay, R. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proc. of EMNLP.
  • [2016] Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; Tiwary, S.; Majumder, R.; and Deng, L. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  • [2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Proc. of Conf. on EMNLP.
  • [2016] Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of Conf. on EMNLP.
  • [2017] Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2017. Bidirectional attention flow for machine comprehension. In Proc. of ICLR.
  • [2000] Voorhees, E. M., and Tice, D. M. 2000. Building a question answering test collection. In Proc. of the 23rd Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 200–207. ACM.
  • [1999] Voorhees, E. M. 1999. The TREC-8 question answering track report. In TREC, volume 99, 77–82.
  • [2016] Wang, S., and Jiang, J. 2016. Learning natural language inference with LSTM. In Proc. of Conf. on NAACL.
  • [2017a] Wang, S., and Jiang, J. 2017a. A compare-aggregate model for matching text sequences. In Proc. of ICLR.
  • [2017b] Wang, S., and Jiang, J. 2017b. Machine comprehension using Match-LSTM and answer pointer. In Proc. of ICLR.
  • [2016] Wang, Z.; Mi, H.; Hamza, W.; and Florian, R. 2016. Multi-perspective context matching for machine comprehension. arXiv preprint arXiv:1612.04211.
  • [2017] Wang, W.; Yang, N.; Wei, F.; Chang, B.; and Zhou, M. 2017. Gated self-matching networks for reading comprehension and question answering. In Proc. of ACL.
  • [2007] Wang, M.; Smith, N. A.; and Mitamura, T. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proc. of Conf. on EMNLP.
  • [1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.
  • [2017] Xiong, C.; Zhong, V.; and Socher, R. 2017. Dynamic coattention networks for question answering. In Proc. of ICLR.
  • [2015] Yang, Y.; Yih, W.-t.; and Meek, C. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proc. of Conf. on EMNLP.
  • [2017] Yu, M.; Yin, W.; Hasan, K. S.; Santos, C. d.; Xiang, B.; and Zhou, B. 2017. Improved neural relation detection for knowledge base question answering. In Proc. of ACL.