Ask the Right Questions: Active Question Reformulation with Reinforcement Learning

ICLR, 2018. arXiv:1705.07830.


Abstract:

We frame Question Answering as a Reinforcement Learning task, an approach that we call Active Question Answering. We propose an agent that sits between the user and a black box question-answering system and which learns to reformulate questions to elicit the best possible answers. The agent probes the system with, potentially many, natural language reformulations of an initial question and aggregates the returned evidence to yield the best answer.

Introduction
  • Web and social media have become primary sources of information. Users’ expectations and information seeking activities co-evolve with the increasing sophistication of these resources.
  • Moving beyond document retrieval and simple factual question answering, users seek direct answers to complex and compositional questions.
  • Such search sessions may require multiple iterations, critical assessment, and synthesis (Marchionini, 2006).
  • SearchQA tests the ability of AQA to reformulate questions such that the QA system has the best chance of returning the correct answer.
Highlights
  • Web and social media have become primary sources of information
  • SearchQA tests the ability of Active Question Answering (AQA) to reformulate questions such that the Question Answering (QA) system has the best chance of returning the correct answer
  • For each query, N reformulations qi, for i = 1, ..., N, are generated by the AQA reformulator trained as described in Section 4
  • In AQA TopHyp we use the top hypothesis generated by the sequence model, q1
  • We investigated a first system of this kind that has three components: a question reformulator, a black box QA system, and a candidate answer aggregator
  • The reformulator and aggregator form a trainable agent that seeks to elicit the best answers from the QA system
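The control flow of this three-component agent can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code; `active_qa`, `reformulate`, and `qa_system` are hypothetical stand-ins for the trained sequence model and the black-box BiDAF system:

```python
from typing import Callable, List, Tuple

def active_qa(question: str,
              reformulate: Callable[[str, int], List[str]],
              qa_system: Callable[[str], Tuple[str, float]],
              n: int = 20) -> str:
    """Probe the black-box QA system with n reformulations of the
    original question and return the highest-scoring answer."""
    candidates = reformulate(question, n)        # n paraphrases of the question
    scored = [qa_system(q) for q in candidates]  # (answer, score) per paraphrase
    best_answer, _ = max(scored, key=lambda pair: pair[1])
    return best_answer
```

The key design point is that the QA system is treated purely as an environment: the agent only observes the (answer, score) pairs it returns and never inspects its internals.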
Methods
  • 5.1 QUESTION ANSWERING DATA AND BIDAF TRAINING

    SearchQA (Dunn et al, 2017) is a dataset built starting from a set of Jeopardy! clues.
  • Each clue is associated with the correct answer, e.g. George Washington, and a list of snippets from Google’s top search results.
  • SearchQA contains over 140k question/answer pairs and 6.9M snippets.
  • The training, validation and test sets contain 99,820, 13,393 and 27,248 examples, respectively
Results
  • The authors evaluate several variants of AQA.
  • For each query q in the evaluation, the authors generate a list of reformulations qi, for i = 1, ..., N, from the AQA reformulator trained as described in Section 4.
  • The authors set N = 20 in these experiments; the same value is used for the benchmarks.
  • In AQA TopHyp the authors use the top hypothesis generated by the sequence model, q1.
  • In AQA Voting the authors use BiDAF scores in a weighted voting scheme over the candidate answers
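The score-weighted voting used in AQA Voting can be sketched like this. This is an illustrative sketch rather than the authors' implementation; here duplicate answers are case-normalized before their BiDAF scores are summed, and the exact normalization used in the paper may differ:

```python
from collections import defaultdict
from typing import List, Tuple

def vote(candidates: List[Tuple[str, float]]) -> str:
    """Aggregate candidate answers by summing BiDAF confidence scores
    over identical (case-normalized) answer strings; return the answer
    with the highest total score."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer.lower().strip()] += score
    return max(totals, key=totals.get)
```

For example, `vote([("George Washington", 0.4), ("Lincoln", 0.5), ("george washington", 0.3)])` returns "george washington", because the two spellings pool their scores (0.7) and outvote the single higher-scored alternative (0.5).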
Conclusion
  • Lewis et al (2017) trained chatbots that negotiate via language utterances in order to complete a task.
  • As a consequence, deep QA systems might implement sophisticated ranking systems trained to sort snippets of text from the context
  • As such, they resemble document retrieval systems, which incentivizes the discovery of IR techniques, such as tf-idf re-weighting and stemming, that have been successful for decades (Baeza-Yates & Ribeiro-Neto, 1999). The authors propose a new framework to improve question answering.
  • The authors will continue developing active question answering, investigating the sequential, iterative aspects of information seeking tasks, framed as end-to-end RL problems, closing the loop between the reformulator and the selector
Summary
  • Objectives:

    In contrast with this line of work, the goal is to generate full question reformulations while directly optimizing the end-to-end target performance metrics.
  • This is not the desired task: the aim is for the agent to learn to communicate in natural language with an environment over which it has no control
Tables
  • Table1: Results table for the experiments on SearchQA. Two-sample t-tests between the AQA results and either the Base-NMT or the MI-SubQuery results show that differences in F1 and Exact Match scores are statistically significant, p < 10⁻⁴, for both Top Hypothesis and CNN predictions. The difference between Base-NMT and MI-SubQuery is also significant for Top Hypothesis predictions
  • Table2: Results of the qualitative analysis on SearchQA. For the original Jeopardy! questions we give the reference answer, otherwise the answer given by BiDAF
  • Table3: Examples of queries where none of the methods produce the right answer, but the Oracle model can
  • Table4: Paraphrasing examples on captions from MSCOCO (Lin et al, 2014)
Related work
  • Lin & Pantel (2001) learned patterns of question variants by comparing dependency parsing trees. Duboue & Chu-Carroll (2006) showed that MT-based paraphrases can be useful in principle by providing significant headroom in oracle-based estimations of QA performance. Recently, Berant & Liang (2014) used paraphrasing to augment the training of a semantic parser by expanding through the paraphrases as a latent representation. Bilingual corpora and MT have been used to generate paraphrases by pivoting through a second language. Recent work uses neural translation models and multiple pivots (Mallinson et al, 2017). In contrast, our approach does not use pivoting and is, to our knowledge, the first direct neural paraphrasing system. Riezler et al (2007) propose phrase-based paraphrasing for query expansion. In contrast with this line of work, our goal is to generate full question reformulations while optimizing directly the end-to-end target performance metrics.
Funding
  • In training, we compute the F1 score of the answer for every instance
  • We also verified the increased fluency by using a large language model and found that the Base-NMT rewrites are 50% more likely than the original
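The per-instance F1 reward can be computed over answer tokens in the standard reading-comprehension style. A minimal sketch follows; the authors' exact tokenization and string normalization may differ:

```python
def f1_reward(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and the reference,
    usable as a per-instance training reward."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    # Count overlapping tokens, respecting multiplicity in the reference.
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For instance, `f1_reward("president george washington", "george washington")` is 0.8 (precision 2/3, recall 1), giving partial credit where Exact Match would give 0.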
Study subjects and analysis
language pairs: 30
This dataset contains 11.4M sentences which are fully aligned across six UN languages: Arabic, English, Spanish, French, Russian, and Chinese. From all bilingual pairs, we produce a multilingual training corpus of 30 language pairs. This yields 340M training examples which we use to train the zero-shot neural MT system (Johnson et al, 2016)
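The count of 30 language pairs follows from taking every ordered (source, target) combination of the six fully aligned languages:

```python
from itertools import permutations

# The six fully aligned UN languages.
languages = ["ar", "en", "es", "fr", "ru", "zh"]

# Every ordered (source, target) pair of distinct languages: 6 * 5 = 30.
pairs = list(permutations(languages, 2))
```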

Reference
  • Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., 1999.
  • Jonathan Berant and Percy Liang. Semantic parsing via paraphrasing. In Proceedings of ACL, pp. 1415–1425, 2014.
  • Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. Massive exploration of neural machine translation architectures. In Proceedings of EMNLP, pp. 1442–1451, 2017.
  • Noam Chomsky. Aspects of the Theory of Syntax. The MIT Press, 1965.
  • Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of ACL, 2011.
  • Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. Predicting query performance. In Proceedings of ACM SIGIR, pp. 299–306, 2002.
  • Pablo Ariel Duboue and Jennifer Chu-Carroll. Answering the question you wish they had asked: The impact of paraphrasing for question answering. In Proceedings of HLT-NAACL, pp. 33–36, 2006.
  • Matthew Dunn, Levent Sagun, Mike Higgins, Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A new Q&A dataset augmented with context from a search engine. https://arxiv.org/abs/1704.05179, 2017.
  • Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Paraphrase-driven learning for open question answering. In Proceedings of ACL, 2013.
  • Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
  • Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Proceedings of EMNLP, pp. 2011–2021, 2017.
  • Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. https://arxiv.org/abs/1611.04558, 2016.
  • Giridhar Kumaran and Vitor R. Carvalho. Reducing long queries using query quality predictors. In Proceedings of ACM SIGIR, pp. 564–571, 2009.
  • Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? End-to-end learning of negotiation dialogues. In Proceedings of EMNLP, pp. 2433–2443, 2017.
  • Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. In Proceedings of EMNLP, 2016.
  • Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In Proceedings of ACL, 2017.
  • Dekang Lin and Patrick Pantel. Discovery of inference rules for question-answering. Natural Language Engineering, 7(4):343–360, 2001.
  • Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of ECCV, pp. 740–755, 2014.
  • Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. Paraphrasing revisited with neural machine translation. In Proceedings of EACL, 2017.
  • Gary Marchionini. Exploratory search: From finding to understanding. Communications of the ACM, 49(4), 2006.
  • Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. Language understanding for text-based games using deep reinforcement learning. https://arxiv.org/abs/1506.08941, 2015.
  • Rodrigo Nogueira and Kyunghyun Cho. End-to-end goal-driven web navigation. In Proceedings of NIPS, 2016.
  • Rodrigo Nogueira and Kyunghyun Cho. Task-oriented query reformulation with reinforcement learning. In Proceedings of EMNLP, 2017.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pp. 1532–1543, 2014.
  • Aaditya Prakash, Sadid A. Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. Neural paraphrase generation with stacked residual LSTM networks. In Proceedings of COLING, pp. 2923–2934, 2016.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, pp. 2383–2392, 2016.
  • Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. https://arxiv.org/abs/1511.06732, 2015.
  • Stefan Riezler, Alexander Vasserman, Ioannis Tsochantaridis, Vibhu Mittal, and Yi Liu. Statistical machine translation for query expansion in answer retrieval. In Proceedings of ACL, 2007.
  • Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR, 2017a.
  • Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. Query-reduction networks for question answering. In Proceedings of ICLR, 2017b.
  • Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
  • Ronald J. Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
  • You Wu, Pankaj K. Agarwal, Chengkai Li, Jun Yang, and Cong Yu. Computational fact checking through query perturbations. ACM Transactions on Database Systems, 42(1):4:1–4:41, 2017.
  • Yingce Xia, Di He, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. In Proceedings of NIPS, 2016.
  • Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. The United Nations parallel corpus v1.0. In Proceedings of LREC, 2016.