Efficient and Robust Question Answering from Minimal Context over Documents

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Vol. 1 (2018): 1725-1735

Abstract

Neural models for question answering (QA) over documents have achieved significant performance improvements. Although effective, these models do not scale to large corpora due to their complex modeling of interactions between the document and the question. Moreover, recent work has shown that such models are sensitive to adversarial inputs. [...]

Introduction
  • The task of textual question answering (QA), in which a machine reads a document and answers a question, is an important and challenging problem in natural language processing.
  • Recent progress in performance of QA models has been largely due to the variety of available QA datasets (Richardson et al, 2013; Hermann et al, 2015; Rajpurkar et al, 2016; Trischler et al, 2016; Joshi et al, 2017; Kocisky et al, 2017).
  • When the model is given a long document, or multiple documents, learning the full context is intractably slow and difficult to scale to large corpora.
  • Jia and Liang (2017) show that, given adversarial inputs, such models tend to focus on wrong parts of the context and produce incorrect answers
Highlights
  • The task of textual question answering (QA), in which a machine reads a document and answers a question, is an important and challenging problem in natural language processing
  • We aim to develop a QA system that is scalable to large documents as well as robust to adversarial inputs
  • We introduce 3 techniques to train the model. (i) As the encoder module of our model is identical to that of S-Reader, we transfer the weights to the encoder module from the QA model trained on the single oracle sentence (ORACLE). (ii) We modify the training data by treating a sentence as a wrong sentence if the QA model gets 0 F1, even if the sentence is the oracle sentence. (iii) After we obtain a score for each sentence, we normalize the scores across the candidate sentences (score normalization; a sketch of this step follows this list).
  • We proposed an efficient and robust QA system that is scalable to large documents and robust to adversarial inputs
  • We studied the minimal context required to answer the question in existing datasets and found that most questions can be answered using a small set of sentences
  • We outperform the published state-of-the-art on both datasets
  • Inspired by this observation, we proposed a sentence selector which selects a minimal set of sentences to answer the question to give to the QA model
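To make the score-normalization technique (iii) above concrete, here is a minimal sketch. It assumes normalization is a softmax taken over groups of candidate sentences (grouped here by paragraph); the grouping, the function name and the exact form are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def normalize_selection_scores(scores, paragraph_ids):
    """Softmax-normalize raw sentence-selection scores within each paragraph.

    scores: raw selector scores, one per candidate sentence.
    paragraph_ids: parallel list giving the paragraph each sentence belongs to.
    Returns an array in which each paragraph's scores sum to 1.
    """
    scores = np.asarray(scores, dtype=float)
    normalized = np.empty_like(scores)
    for pid in set(paragraph_ids):
        idx = np.array([i for i, p in enumerate(paragraph_ids) if p == pid])
        s = scores[idx]
        e = np.exp(s - s.max())  # subtract the max for numerical stability
        normalized[idx] = e / e.sum()
    return normalized
```

For example, `normalize_selection_scores([2.0, 0.5, 1.0], ["p1", "p1", "p2"])` returns roughly `[0.82, 0.18, 1.0]`: scores become comparable across paragraphs before sentences are selected.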
Methods
  • The authors' overall architecture (Figure 2) consists of a sentence selector and a QA model. The sentence selector computes a selection score for each sentence in parallel.
  • The authors give the QA model a reduced set of sentences with high selection scores to answer the question (see the selection sketch after this list).
  • DCN+ (Xiong et al, 2018) is a state-of-the-art QA model, achieving 83.1 F1 on the SQuAD development set.
  • S-Reader is another competitive QA model that is simpler and faster than DCN+, with 79.9 F1 on the SQuAD development set.
  • It is a simplified version of the reader in DrQA (Chen et al, 2017), which obtains 78.8 F1 on the SQuAD development set.
  • Model details and training procedures are shown in Appendix A
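Below is a minimal sketch of one plausible dynamic ("Dyn") selection rule: rank sentences by normalized score and keep them until the cumulative score passes a threshold. The specific rule, threshold value, sentence cap, and function name are illustrative assumptions rather than the paper's exact procedure (the actual details are in Appendix A of the paper).

```python
def select_sentences_dyn(norm_scores, threshold=0.9, max_sents=20):
    """Dynamic selection sketch: greedily keep the highest-scoring sentences
    until their cumulative normalized score exceeds `threshold`.

    norm_scores: normalized selection scores, one per sentence.
    Returns indices of the selected sentences in their original document
    order, so the reduced context can be passed to the QA model.
    """
    order = sorted(range(len(norm_scores)),
                   key=lambda i: norm_scores[i], reverse=True)
    selected, total = [], 0.0
    for i in order:
        selected.append(i)
        total += norm_scores[i]
        if total >= threshold or len(selected) >= max_sents:
            break
    return sorted(selected)
```

A fixed Top-k selector corresponds to truncating the ranking after k entries instead of using a cumulative threshold; Dyn instead adapts the number of selected sentences to each question.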
Results
  • Table 4 shows results in the task of sentence selection on SQuAD and NewsQA.
  • The model with the Dyn sentence selector achieves higher F1 and EM than the model with the TF-IDF selector (F1 and EM are computed as in the sketch after this list).
  • On the development-full set, with 5 sentences per question on average, the model with Dyn achieves 59.5 F1 while the model with TF-IDF method achieves 51.9 F1.
  • Table 9 shows that MINIMAL outperforms FULL, achieving a new state-of-the-art by a large margin (+11.1 and +11.5 F1 on AddSent and AddOneSent, respectively).
  • These experimental results and analyses show that the approach is effective at filtering adversarial sentences and preventing the wrong predictions they would otherwise cause
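For reference, the F1 and EM numbers quoted above follow the standard SQuAD-style evaluation: exact string match plus token-overlap F1 between the predicted span and the ground-truth span. The sketch below is a simplified version (whitespace tokenization only); the official evaluation script additionally lowercases and strips punctuation and articles before comparing.

```python
from collections import Counter

def squad_f1_em(prediction, ground_truth):
    """Simplified SQuAD-style scoring: (token-overlap F1, exact match)."""
    em = float(prediction == ground_truth)
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gt_tokens)  # per-token min counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0, em
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, em
```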
Conclusion
  • The authors proposed an efficient and robust QA system that is scalable to large documents and robust to adversarial inputs.
  • The authors studied the minimal context required to answer the question in existing datasets and found that most questions can be answered using a small set of sentences
  • Inspired by this observation, the authors proposed a sentence selector which selects a minimal set of sentences to answer the question to give to the QA model.
  • The authors showed that the approach is more robust to adversarial inputs
Tables
  • Table 1: Human analysis of the context required to answer questions on SQuAD and TriviaQA. 50 examples from each dataset are sampled randomly. ‘N sent’ indicates the number of sentences required to answer the question, and ‘N/A’ indicates the question is not answerable even given all sentences in the document. ‘Document’ and ‘Question’ are from the representative example from each category on SQuAD. Examples on TriviaQA are shown in Appendix B. The groundtruth answer span is in red text, and the oracle sentence (the sentence containing the groundtruth answer span) is in bold text
  • Table 2: Error cases (on exact match (EM)) of DCN+ given the oracle sentence on SQuAD. 50 examples are sampled randomly. The groundtruth span is in underlined text, and the model’s prediction is in bold text
  • Table 3: Dataset used for experiments. ‘N word’, ‘N sent’ and ‘N doc’ refer to the average number of words, sentences and documents, respectively. All statistics are calculated on the development set. For SQuAD-Open, since the task is open-domain, we calculated the statistics based on the top 10 documents from the Document Retriever in DrQA (Chen et al, 2017)
  • Table 4: Results of sentence selection on the dev set of SQuAD and NewsQA. (Top) We compare different models and training methods. We report Top 1 accuracy (Top 1) and Mean Average Precision (MAP). Our selector outperforms the previous state-of-the-art (Tan et al, 2018). (Bottom) We compare different selection methods. We report the number of selected sentences (N sent) and the accuracy of sentence selection (Acc). ‘T’, ‘M’ and ‘N’ are training techniques described in Section 3.2 (weight transfer, data modification and score normalization, respectively)
  • Table 5: Results on the dev set of SQuAD (first two) and NewsQA (last). For Top k, we use k = 1 and k = 3 for SQuAD and NewsQA, respectively. We compare with GNR (Raiman and Miller, 2017), FusionNet (Huang et al, 2018) and FastQA (Weissenborn et al, 2017), which are the model leveraging sentence selection for question answering and the published state-of-the-art models on SQuAD and NewsQA, respectively
  • Table 6: Examples on SQuAD. Groundtruth span (underlined text), the prediction from FULL (blue text) and MINIMAL (red text). Sentences selected by our selector are marked. In the first two examples, MINIMAL correctly answers the question by selecting the oracle sentence. In the last example, MINIMAL fails to answer the question, since inference over the first and second sentences is required to answer the question
  • Table 7: An example on SQuAD, where the sentences are ordered by the score from our selector. Groundtruth span (underlined text), the predictions from Top 1 (blue text), Top 2 (green text) and Dyn (red text). Sentences selected by Top 1, Top 2 and Dyn are marked accordingly
  • Table 8: Results on the dev-full set of TriviaQA (Wikipedia) and the dev set of SQuAD-Open. Full results (including the dev-verified set on TriviaQA) are in Appendix C. For training FULL and MINIMAL on TriviaQA, we use 10 paragraphs and 20 sentences, respectively. For training FULL and MINIMAL on SQuAD-Open, we use 20 paragraphs and 20 sentences, respectively. For evaluating FULL and MINIMAL, we use 40 paragraphs and 5-20 sentences, respectively. ‘n sent’ indicates the number of sentences used during inference. ‘Acc’ indicates the accuracy of whether the answer text is contained in the selected context. ‘Sp’ indicates inference speed. We compare with the results from the sentences selected by the TF-IDF method and our selector (Dyn). We also compare with published Rank 1-3 models. For TriviaQA (Wikipedia), they are Neural Cascades (Swayamdipta et al, 2018), Reading Twice for Natural Language Understanding (Weissenborn, 2017) and Mnemonic Reader (Hu et al, 2017). For SQuAD-Open, they are DrQA (Chen et al, 2017) (Multitask), R3 (Wang et al, 2018) and DrQA (Plain)
  • Table 9: Results on the dev set of SQuAD-Adversarial. We compare with RaSOR (Lee et al, 2016), ReasoNet (Shen et al, 2017) and Mnemonic Reader (Hu et al, 2017), the previous state-of-the-art on SQuAD-Adversarial, where the numbers are from Jia and Liang (2017)
  • Table 10: Examples on SQuAD-Adversarial. Groundtruth span is in underlined text, and predictions from FULL and MINIMAL are in blue text and red text, respectively
Related Work
  • Question Answering over Documents: There has been rapid progress in the task of question answering (QA) over documents along with various datasets and competitive approaches. Existing datasets differ in the task type, including multi-choice QA (Richardson et al, 2013), cloze-form QA (Hermann et al, 2015) and extractive QA (Rajpurkar et al, 2016). In addition, they cover different domains, including Wikipedia (Rajpurkar et al, 2016; Joshi et al, 2017), news (Hermann et al, 2015; Trischler et al, 2016), fictional stories (Richardson et al, 2013; Kocisky et al, 2017), and textbooks (Lai et al, 2017; Xie et al, 2017).

    [Figure residue: adversarial example passages with distracting sentences, e.g. about the mayor of San Francisco during Super Bowl 50 and the city Tesla moved to in 1880.]
Performance Highlights
  • On the development set of SQuAD-Adversarial (Jia and Liang, 2017), MINIMAL outperforms the previous state-of-the-art model by up to 13%
  • Our selector outperforms the TF-IDF method and the previous state-of-the-art by a large margin (up to 2.9% MAP)
  • Our three training techniques – weight transfer, data modification and score normalization – improve performance by up to 5.6% MAP
  • We outperform the published state-of-the-art on both datasets
Study Subjects and Analysis
documents: 5
[Table 8 residue: a row of results with footnotes — (a) approximated based on an average of 475.2 sentences per document and 5 documents per question; (b) numbers on the test set. Table 8 reports results on TriviaQA (Wikipedia) and SQuAD-Open.]

References
  • Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP.
  • Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In ACL.
  • Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In ACL.
  • Christopher Clark and Matt Gardner. 2017. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723.
  • Bhuwan Dhingra, Kathryn Mazaitis, and William W. Cohen. 2017. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904.
  • Matthew Dunn, Levent Sagun, Mike Higgins, Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
  • Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In EMNLP.
  • Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In EMNLP.
  • Kenton Lee, Shimi Salant, Tom Kwiatkowski, Ankur Parikh, Dipanjan Das, and Jonathan Berant. 2016. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436.
  • Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL.
  • Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.
  • Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In EMNLP.
  • Sewon Min, Minjoon Seo, and Hannaneh Hajishirzi. 2017. Question answering through transfer learning from large fine-grained supervision data. In ACL.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NIPS.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.
  • Boyuan Pan, Hao Li, Zhou Zhao, Bin Cao, Deng Cai, and Xiaofei He. 2017. MEMEN: Multi-layer embedding with memory networks for machine comprehension. arXiv preprint arXiv:1707.09098.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
  • Minghao Hu, Yuxing Peng, and Xipeng Qiu. 2017. Mnemonic reader for machine comprehension. arXiv preprint arXiv:1705.02798.
  • Hsin-Yuan Huang, Chenguang Zhu, Yelong Shen, and Weizhu Chen. 2018. FusionNet: Fusing via fully-aware attention with application to machine comprehension. In ICLR.
  • Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.
  • Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Jonathan Raiman and John Miller. 2017. Globally normalized reader. In EMNLP.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
  • Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP.
  • Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.
  • Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2017. ReasoNet: Learning to stop reading in machine comprehension. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. 2017. The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040.
  • Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
  • Swabha Swayamdipta, Ankur P. Parikh, and Tom Kwiatkowski. 2018. Multi-mention learning for reading comprehension with neural cascades. In ICLR.
  • Chuanqi Tan, Furu Wei, Qingyu Zhou, Nan Yang, Bowen Du, Weifeng Lv, and Ming Zhou. 2018. Context-aware answer sentence selection with hierarchical gated recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
  • Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.
  • Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: Reinforced reader-ranker for open-domain question answering. In AAAI.
  • Dirk Weissenborn. 2017. Reading twice for natural language understanding. arXiv preprint arXiv:1706.02596.
  • Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In CoNLL.
  • Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2017. Large-scale cloze test dataset designed by teachers. arXiv preprint arXiv:1711.03225.
  • Caiming Xiong, Victor Zhong, and Richard Socher. 2018. DCN+: Mixed objective and deep residual coattention for question answering. In ICLR.
  • Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In EMNLP.
  • Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. TACL.
  • Adams Wei Yu, David Dohan, Quoc Le, Thang Luong, Rui Zhao, and Kai Chen. 2018. Fast and accurate reading comprehension by combining self-attention and convolution. In ICLR.
Authors
Richard Socher