Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

NeurIPS 2020.

Keywords: Natural Questions, language generation, open domain question answering, Exact Match, retrieval-augmented generation

Abstract:

Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures.
Introduction
  • Pre-trained neural language models have been shown to learn a substantial amount of in-depth knowledge from data [41].
  • They can do so without any access to an external memory, as a parameterized implicit knowledge base [45, 46].
Highlights
  • Pre-trained neural language models have been shown to learn a substantial amount of in-depth knowledge from data [41]
  • Retrieval-augmented generation (RAG) demonstrates that neither a re-ranker nor an extractive reader is necessary for state-of-the-art machine reading performance
  • Documents that contain clues to the correct answer, but not the answer verbatim, can still contribute probability mass towards a correct answer being generated; this is impossible with standard extractive approaches and leads to more effective marginalization across documents (see the formulas after this list)
  • We presented hybrid generation models with access to parametric and non-parametric retrieval-based external memory, in the form of Wikipedia
  • We showed that our RAG models obtain state-of-the-art performance on open domain question answering
  • We found that people prefer RAG’s generation over purely parametric BART and find RAG more factual, and we conducted a detailed investigation of the learned retrieval component, validating its effectiveness
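For concreteness, the marginalization described above can be written out. With retriever p_η(z|x) over text documents z and generator p_θ(y_i | x, z, y_{1:i-1}), the paper's two model variants combine the top-k retrieved documents as follows:

```latex
% RAG-Sequence: one retrieved document conditions the whole sequence;
% the mixture is over complete output sequences.
\[
p_{\text{RAG-Seq}}(y \mid x) \;\approx\;
  \sum_{z \in \operatorname{top-}k\left(p_\eta(\cdot \mid x)\right)}
  p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta\!\left(y_i \mid x, z, y_{1:i-1}\right)
\]

% RAG-Token: a different document can be marginalized over for each token,
% which is how clue-bearing documents contribute probability mass even when
% no single document contains the answer verbatim.
\[
p_{\text{RAG-Tok}}(y \mid x) \;\approx\;
  \prod_{i=1}^{N} \sum_{z \in \operatorname{top-}k\left(p_\eta(\cdot \mid x)\right)}
  p_\eta(z \mid x)\, p_\theta\!\left(y_i \mid x, z, y_{1:i-1}\right)
\]
```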
Methods
  • The authors experiment with RAG in a wide range of knowledge-intensive tasks.
  • The authors use a single Wikipedia dump for the non-parametric knowledge source.
  • Each Wikipedia article is split into disjoint 100-word chunks, to make a total of 21,015,324 documents.
  • The authors use the DPR document encoder to compute an embedding for each document, and build a single MIPS index with FAISS [19], using the Hierarchical Navigable Small World (HNSW) approximation for efficient retrieval [33]; this index is used for all experiments (a sketch of the indexing step follows this list).
  • In the remainder of this section, the authors will discuss the experimental details for each of these task settings
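As an illustration of the indexing step, here is a minimal FAISS sketch, not the authors' exact pipeline; random vectors stand in for the 21M real DPR chunk embeddings, and HNSW metric support varies by FAISS version:

```python
import numpy as np
import faiss

d = 768                       # DPR document-encoder output dimension
rng = np.random.default_rng(0)
doc_embeddings = rng.standard_normal((10_000, d)).astype(np.float32)

# HNSW graph index; with an inner-product metric, approximate nearest-neighbor
# search becomes approximate Maximum Inner Product Search (MIPS).
# (Recent FAISS builds accept METRIC_INNER_PRODUCT here; older ones require
# the L2-transformation trick used in the DPR codebase.)
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.add(doc_embeddings)

query = rng.standard_normal((1, d)).astype(np.float32)
scores, doc_ids = index.search(query, 5)   # top-5 chunks by inner product
print(doc_ids[0])
```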
Results
  • 4.1 Open-domain Question Answering

    Table 1 shows results for RAG along with recent state-of-the-art models.
  • RAG combines the generation flexibility of the "closed-book" approaches and the performance of "open-book" retrieval-based approaches.
  • RAG compares favourably to the DPR QA system on open-domain QA, which uses a BERT-based cross-encoder to re-rank documents along with an extractive reader.
  • RAG is able to generate correct answers even when the correct answer is not present in any of the retrieved documents, achieving 11.8% accuracy in such cases on Natural Questions, whereas an extractive model would score 0% (a sketch of this analysis follows)
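A rough sketch of how that analysis can be reproduced, assuming per-example predictions, gold answers, and retrieved passages are available (all names below are hypothetical):

```python
def exact_match(pred: str, gold: str) -> bool:
    # Simplified normalization; the standard evaluation also strips
    # articles and punctuation.
    norm = lambda s: " ".join(s.lower().split())
    return norm(pred) == norm(gold)

def accuracy_when_answer_absent(examples) -> float:
    """examples: iterable of (prediction, gold_answer, retrieved_passages).
    Scores only cases where the gold answer string appears in none of the
    retrieved passages; an extractive reader scores 0 on these by design."""
    hits, total = 0, 0
    for pred, gold, passages in examples:
        if any(gold.lower() in p.lower() for p in passages):
            continue          # answer present verbatim: not the case studied
        total += 1
        hits += exact_match(pred, gold)
    return hits / total if total else 0.0
```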
Conclusion
  • The authors presented hybrid generation models with access to parametric and non-parametric retrieval-based external memory, in the form of Wikipedia.
  • The authors found that people prefer RAG’s generation over purely parametric BART and find RAG more factual, and the authors conducted a detailed investigation of the learned retrieval component, validating its effectiveness.
  • The authors' work opens new research directions on how parametric and non-parametric memories interact and how to combine the different components most effectively, and shows promise for application to a wide variety of NLP tasks
Tables
  • Table1: Open-Domain QA Test Scores. For TQA, the left column uses the test split commonly used in Open-Domain QA. The right column uses the hidden TQA Wiki test split. See Appendix B for further information
  • Table2: Generation and classification task Test Scores. SotA for MS-MARCO is [4], FEVER-3 is [61] and FEVER-2 is [51]. * Uses gold context/evidence; best-performing model without gold access underlined. As FEVER is a classification dataset, RAG-Token and RAG-Sequence are equivalent
  • Table3: Human assessments for the Jeopardy Question Generation Task
  • Table4: Example Generations for MS-MARCO and Jeopardy Question generation. RAG models generate more specific and factually accurate responses, whereas BART generates more factually incorrect (marked by '?'), partially correct (marked by *), and more generic responses
  • Table5: Ablations on the development set. As FEVER is a classification dataset, RAG-Token and RAG-Sequence are equivalent
  • Table6: Ratio of distinct tri-grams to total tri-grams in the development set generations for MSMARCO and Jeopardy Question Generation
Related work
  • Single-Task Retrieval Prior work has shown that retrieval improves performance across a variety of NLP tasks when considered in isolation. Such tasks include open-domain question answering [5, 25], fact checking [50], fact completion [42], long-form question answering [12], Wikipedia article generation [32], dialogue [36, 59, 9, 13], translation [16], and language modeling [17, 23]. Our work unifies previous successes in incorporating retrieval into individual tasks, showing that a single retrieval-based architecture is capable of achieving strong performance across several tasks.

    General-Purpose Architectures for NLP Prior work on general-purpose architectures for NLP tasks has shown great success without the use of retrieval. A single, pre-trained language model has been shown to achieve strong performance on various classification tasks in the GLUE benchmarks [54, 55] after fine-tuning [43, 8]. GPT-2 [44] later showed that a single, left-to-right, pre-trained language model could achieve strong performance across both discriminative and generative tasks. For further improvement, BART [28] and T5 [45, 46] propose a single, pre-trained encoder-decoder model that leverages bi-directional attention to achieve stronger performance on discriminative and generative tasks. Our work aims to expand the space of possible tasks with a single, unified architecture, by learning a retrieval module to augment pre-trained, generative language models.
Study subjects and analysis
documents: 21015324
Following Lee et al. [27] and Karpukhin et al. [22], we use the December 2018 dump. Each Wikipedia article is split into disjoint 100-word chunks, to make a total of 21,015,324 documents. We use the DPR document encoder to compute document embeddings for each document, and we build a single MIPS index with FAISS [19], using the Hierarchical Navigable Small World approximation for efficient retrieval [33], which is then used for all experiments
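A minimal sketch of the chunking step (whitespace tokenization is an assumption; the exact tokenization behind the 21,015,324-document count may differ):

```python
def chunk_article(text: str, words_per_chunk: int = 100) -> list[str]:
    """Split one Wikipedia article into disjoint 100-word chunks."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

# Example: a 250-word article yields chunks of 100, 100, and 50 words.
article = " ".join(f"w{i}" for i in range(250))
print([len(c.split()) for c in chunk_article(article)])  # [100, 100, 50]
```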

popular open-domain QA datasets: 4
In addition, we also compare to "Closed-Book QA" approaches [46], which, like RAG, generate answers, but do not exploit latent retrieval, relying instead purely on parametric knowledge. We consider four popular open-domain QA datasets: Natural Questions (NQ) [25], TriviaQA (TQA) [20], WebQuestions (WQ) [3], and CuratedTrec (CT) [2]

documents: 1000
CuratedTrec answers are given in the form of regular expressions, which is problematic for answer-generation models [18]. To overcome this, we use a pre-processing step: we first retrieve the top 1000 documents for each query, and use the answer that most frequently matches the regex pattern as the supervision target. If no matches are found, we resort to a simple heuristic: generate all possible permutations for each regex, replacing non-deterministic symbols in the regex's nested tree structure with whitespace
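A sketch of the first part of this preprocessing step, choosing the most frequent regex match over the retrieved documents as the supervision target (names hypothetical; the permutation fallback is omitted):

```python
import re
from collections import Counter

def pick_supervision_target(answer_regex: str, retrieved_docs: list[str]):
    """Return the surface form that most frequently matches the answer
    regex across the top retrieved documents, or None if nothing matches
    (in which case the permutation heuristic described above applies)."""
    pattern = re.compile(answer_regex, flags=re.IGNORECASE)
    counts = Counter(
        m.group(0) for doc in retrieved_docs for m in pattern.finditer(doc)
    )
    return counts.most_common(1)[0][0] if counts else None

docs = [
    "Mount Everest is the highest mountain.",
    "Everest was first climbed in 1953.",
]
print(pick_supervision_target(r"(mount )?everest", docs))
```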

retrieved documents: 15 / 50 / 10
3.5 Implementation Details. For Open-domain QA we report test numbers using 15 retrieved documents for RAG-Token models. For RAG-Sequence models, we report test results using 50 retrieved documents, and we use the Thorough Decoding approach since answers are generally short. We use greedy decoding for QA, as we did not find that beam search improved results. For MS-MARCO and Jeopardy question generation, we report test numbers using ten retrieved documents for both RAG-Token and RAG-Sequence, and we also train a BART-large model as a baseline. We use a beam size of four, and use the Fast Decoding approach for RAG-Sequence models, as Thorough Decoding did not improve performance (a sketch of the two decoding modes follows)
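The Thorough and Fast decoding modes for RAG-Sequence can be sketched as follows: beam search is run per retrieved document, the candidate outputs are pooled, and each candidate's marginal likelihood is computed over all documents. Thorough Decoding runs extra forward passes to score a candidate under documents whose beams did not produce it; Fast Decoding approximates those probabilities as zero. The beam_search and seq_logprob callables below are hypothetical stand-ins for the underlying seq2seq generator:

```python
import math

def _logsumexp(vals):
    if not vals:
        return -math.inf
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def rag_sequence_decode(x, docs, doc_logpriors, beam_search, seq_logprob,
                        thorough=True):
    """docs: top-k retrieved passages; doc_logpriors[i] = log p(z_i | x).
    beam_search(x, z) -> [(candidate, log p(candidate | x, z)), ...]
    seq_logprob(y, x, z) -> log p(y | x, z) via a forward pass."""
    pooled = {}                                  # y -> {i: log p(y | x, z_i)}
    for i, z in enumerate(docs):
        for y, logp in beam_search(x, z):        # per-document beam search
            pooled.setdefault(y, {})[i] = logp

    def marginal_logprob(y):
        terms = []
        for i, z in enumerate(docs):
            if i in pooled[y]:
                terms.append(doc_logpriors[i] + pooled[y][i])
            elif thorough:                       # extra forward pass
                terms.append(doc_logpriors[i] + seq_logprob(y, x, z))
            # Fast Decoding: otherwise treat p(y | x, z_i) as ~0 and skip.
        return _logsumexp(terms)

    return max(pooled, key=marginal_logprob)
```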

pairs: 452
Table 3 shows the results from the human evaluation. The human evaluation was carried out with 452 pairs of generations from BART and RAG-Token. The annotators indicated that BART was more factual than RAG in only 7.1% of cases, while RAG was more factual in 42.7% of cases and both RAG and BART were factual in a further 17% of cases, clearly demonstrating the comparative effectiveness of RAG on the task over a state-of-the-art conditional generation model

retrieved articles: 10
We analyze the overlap in Wikipedia articles between the top-k documents retrieved by RAG and the gold, annotated evidence documents. We find that the top article retrieved by RAG is a gold document for the claim in 71% of cases, and a gold article is present in the top 10 retrieved articles in 90% of cases
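These overlap numbers amount to a top-k recall computation; a minimal sketch, assuming one ranked title list and one gold-title set per claim (names hypothetical):

```python
def gold_article_recall(ranked_titles_per_claim, gold_titles_per_claim, k):
    """Fraction of claims whose top-k retrieved articles include a gold
    evidence article (k=1 gives the 71% figure, k=10 the 90% figure)."""
    hits = sum(
        any(title in gold for title in ranked[:k])
        for ranked, gold in zip(ranked_titles_per_claim, gold_titles_per_claim)
    )
    return hits / len(gold_titles_per_claim)
```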

retrieved latent documents: 5 or 10
Models are trained with either 5 or 10 retrieved latent documents, and we do not observe significant differences in performance between them. We also have the flexibility to adjust the number of retrieved documents at test time, which does affect performance

retrieved documents: 10
Figure 3 (left) shows that retrieving more documents at test time monotonically improves Open-domain QA results for RAG-Sequence, but performance peaks for RAG-Token at 10 retrieved documents. Figure 3 (right) shows that retrieving more documents leads to higher Rouge-L for RAG-Token at the expense of Bleu-1, but the effect is less pronounced for RAG-Sequence
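This test-time knob is exposed as n_docs in the later Hugging Face port of RAG; a usage sketch based on that library's public release (behavior assumed from its documentation, not from this paper):

```python
from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained(
    "facebook/rag-token-nq", retriever=retriever
)

inputs = tokenizer("who wrote the sun also rises", return_tensors="pt")
# n_docs sets how many retrieved documents are marginalized over at test time.
outputs = model.generate(input_ids=inputs["input_ids"], n_docs=10)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```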

world leaders: 2016 and 2018
We prepared a list of 82 heads of state who had changed between these dates (the December 2016 and December 2018 index dates) and used a template “Who is {position}?” (e.g., “Who is the prime minister of the UK?”) to query our Natural Questions-finetuned RAG model with each index. RAG achieved an accuracy of 70% using the 2016 index for 2016 world leaders and an accuracy of 68% using the 2018 index for 2018 world leaders. Only 21% of the model's predictions were the same using the two indices, and accuracy with mismatched indices is very low (12% using the 2018 index for 2016 leaders and 4% using the 2016 index for 2018 leaders). This result shows that we can effectively update RAG's behavior with new world knowledge simply by replacing its non-parametric memory
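A toy sketch of the index-swapping idea: only the non-parametric memory (document embeddings plus search index) is replaced, while the parametric generator is untouched. Random vectors stand in for embeddings of the two Wikipedia dumps:

```python
import numpy as np
import faiss

def build_index(doc_embeddings: np.ndarray) -> faiss.Index:
    """Exact inner-product index (the paper uses HNSW; flat for brevity)."""
    index = faiss.IndexFlatIP(doc_embeddings.shape[1])
    index.add(doc_embeddings.astype(np.float32))
    return index

rng = np.random.default_rng(0)
emb_2016 = rng.standard_normal((1000, 768)).astype(np.float32)  # 2016 dump
emb_2018 = rng.standard_normal((1000, 768)).astype(np.float32)  # 2018 dump
indices = {"2016": build_index(emb_2016), "2018": build_index(emb_2018)}

# Swapping world knowledge is then a lookup at query time:
query = rng.standard_normal((1, 768)).astype(np.float32)
scores, doc_ids = indices["2018"].search(query, 5)
```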

retrieved documents: 5 and 10
Figure 1: RAG-Token document posterior p(zi | x, yi, y−i) for each generated token for the input “Hemingway” in Jeopardy question generation with 5 retrieved documents. The posterior for document 1 is high when generating “A Farewell to Arms”, and for document 2 when generating “The Sun Also Rises”.
Figure 3: Left: NQ performance as more documents are retrieved. Center: fraction of NQ answers that occur somewhere in the top-K retrieved documents. Right: MS-MARCO Bleu-1 and Rouge-L as more documents are retrieved. The left panel shows that retrieving more documents at test time monotonically improves Open-domain QA results for RAG-Sequence, but performance peaks for RAG-Token at 10 retrieved documents. The right panel shows that retrieving more documents leads to higher Rouge-L for RAG-Token at the expense of Bleu-1, but the effect is less pronounced for RAG-Sequence. The center panel shows that the learned retriever has higher recall for gold documents than the fixed retriever. The improvements on TriviaQA and Natural Questions are notable, as we initialize the retriever from DPR, which is trained with strong, document-level supervision to perform well on these tasks. We also compare RAG's dense embedding-based retrieval mechanism to a word-overlap-based BM25 retriever [47]: we replace RAG's differentiable retriever with a fixed BM25 system and use the BM25 retrieval scores as logits when calculating pη(zi | x) (a sketch of this ablation follows).
(Appendix figure: annotation interface for human evaluation of factuality; a pop-out with detailed instructions and a worked example appears when clicking “view tool guide”.)
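The BM25 ablation can be sketched as follows, with the rank_bm25 package standing in for whatever BM25 implementation was actually used: the fixed BM25 scores replace the learned retriever's dot products as the logits behind p(z|x):

```python
import numpy as np
from rank_bm25 import BM25Okapi   # assumed stand-in BM25 implementation

corpus = [
    "ernest hemingway wrote a farewell to arms",
    "the sun also rises is a novel by ernest hemingway",
    "paris is the capital of france",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "which novels did hemingway write".split()
logits = np.asarray(bm25.get_scores(query))  # BM25 scores as retrieval logits

# p(z | x): softmax over the (fixed, non-differentiable) retrieval logits.
p_z = np.exp(logits - logits.max())
p_z /= p_z.sum()
print(p_z)   # highest mass on the two Hemingway passages
```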

Reference
  • Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268 [cs], November 2016. URL http://arxiv.org/abs/1611.09268.
  • Petr Baudiš and Jan Šedivý. Modeling of the question answering task in the YodaQA system. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 222–228. Springer, 2015. URL https://link.springer.com/chapter/10.1007%2F978-3-319-24027-5_20.
  • Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D13-1160.
  • Bin Bi, Chenliang Li, Chen Wu, Ming Yan, and Wei Wang. PALM: Pre-training an autoencoding & autoregressive language model for context-conditioned generation. ArXiv, abs/2004.07159, 2020. URL https://arxiv.org/abs/2004.07159.
  • Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1171. URL https://www.aclweb.org/anthology/P17-1171.
  • Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 209–220, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1020. URL https://www.aclweb.org/anthology/P17-1020.
  • Christopher Clark and Matt Gardner. Simple and Effective Multi-Paragraph Reading Comprehension. arXiv:1710.10723 [cs], October 2017. URL http://arxiv.org/abs/1710.10723.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
  • Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1l73iRqKm.
  • Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv:1704.05179 [cs], April 2017. URL http://arxiv.org/abs/1704.05179.
  • Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://www.aclweb.org/anthology/P18-1082.
  • Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1346. URL https://www.aclweb.org/anthology/P19-1346.
  • Angela Fan, Claire Gardent, Chloe Braud, and Antoine Bordes. Augmenting transformers with KNN-based composite memory, 2020. URL https://openreview.net/forum?id=H1gx1CNKPH.
  • Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. Entities as experts: Sparse memory access with entity supervision. ArXiv, abs/2004.07202, 2020. URL https://arxiv.org/abs/2004.07202.
  • Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. A knowledge-grounded neural conversation model. In AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16710.
  • Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. Search engine guided neural machine translation. In AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17282.
  • Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450, 2018. doi: 10.1162/tacl_a_00030. URL https://www.aclweb.org/anthology/Q18-1031.
  • Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. ArXiv, abs/2002.08909, 2020. URL https://arxiv.org/abs/2002.08909.
  • Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017. URL https://arxiv.org/abs/1702.08734.
  • Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://www.aclweb.org/anthology/P17-1147.
  • Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 190–198, Cambridge, MA, USA, 2015. MIT Press. URL https://papers.nips.cc/paper/5857-inferring-algorithmic-patterns-with-stack-augmented-recurrent-nets.
  • Vladimir Karpukhin, Barlas Oguz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020. URL https://arxiv.org/abs/2004.04906.
  • Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HklBjCEKvH.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  • Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: a Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 2019. URL https://tomkwiat.users.x20web.corp.google.com/papers/natural-questions/main-1455-kwiatkowski.pdf.
  • Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large memory layers with product keys. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8548–8559. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/9061-large-memory-layers-with-product-keys.pdf.
  • Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1612. URL https://www.aclweb.org/anthology/P19-1612.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019. URL https://arxiv.org/abs/1910.13461.
  • Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1014. URL https://www.aclweb.org/anthology/N16-1014.
  • Margaret Li, Jason Weston, and Stephen Roller. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. ArXiv, abs/1909.03087, 2019. URL https://arxiv.org/abs/1909.03087.
  • Hairong Liu, Mingbo Ma, Liang Huang, Hao Xiong, and Zhongjun He. Robust neural machine translation with joint textual and phonetic embedding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3044–3049, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1291. URL https://www.aclweb.org/anthology/P19-1291.
  • Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hyg0vbWC-.
  • Yury A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42:824–836, 2016. URL https://arxiv.org/abs/1603.09320.
  • Gary Marcus. The next decade in AI: four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177, 2020. URL https://arxiv.org/abs/2002.06177.
  • Luca Massarelli, Fabio Petroni, Aleksandra Piktus, Myle Ott, Tim Rocktäschel, Vassilis Plachouras, Fabrizio Silvestri, and Sebastian Riedel. How decoding strategies affect the verifiability of generated text. arXiv preprint arXiv:1911.03587, 2019. URL https://arxiv.org/abs/1911.03587.
  • Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. Towards exploiting background knowledge for building conversation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2322–2332, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1255. URL https://www.aclweb.org/anthology/D18-1255.
  • Preksha Nema and Mitesh M. Khapra. Towards a better metric for evaluating question generation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3950–3959, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1429. URL https://www.aclweb.org/anthology/D18-1429.
  • Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. In Tarek Richard Besold, Antoine Bordes, Artur S. d'Avila Garcez, and Greg Wayne, editors, Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches 2016, co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, volume 1773 of CEUR Workshop Proceedings. CEUR-WS.org, 2016. URL http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
  • Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019. URL https://arxiv.org/abs/1901.04085.
  • Ethan Perez, Siddharth Karamcheti, Rob Fergus, Jason Weston, Douwe Kiela, and Kyunghyun Cho. Finding generalizable evidence by learning to convince Q&A models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2402–2411, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1244. URL https://www.aclweb.org/anthology/D19-1244.
  • Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1250. URL https://www.aclweb.org/anthology/D19-1250.
  • Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. How context affects language models' factual predictions. In Automated Knowledge Base Construction, 2020. URL https://openreview.net/forum?id=025X0zPfn.
  • Alec Radford. Improving Language Understanding by Generative Pre-Training, 2018. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
  • Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019. URL https://arxiv.org/abs/1910.10683.
  • Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv e-prints, 2020. URL https://arxiv.org/abs/2002.08910.
  • Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, April 2009. ISSN 1554-0669. doi: 10.1561/1500000019. URL https://doi.org/10.1561/1500000019.
  • Sainbayar Sukhbaatar, arthur szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5846-end-to-end-memory-networks.pdf.
  • James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1074. URL https://www.aclweb.org/anthology/N18-1074.
  • James H. Thorne and Andreas Vlachos. Avoiding catastrophic forgetting in mitigating model biases in sentence-pair classification with elastic weight consolidation. ArXiv, abs/2004.14366, 2020. URL https://arxiv.org/abs/2004.14366.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
  • Ashwin Vijayakumar, Michael Cogswell, Ramprasaath Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search for improved description of complex scenes. In AAAI Conference on Artificial Intelligence, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17329.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL https://www.aclweb.org/anthology/W18-5446.
  • Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3261–3275. Curran Associates, Inc., 2019. URL https://arxiv.org/abs/1905.00537.
  • Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced ranker-reader for open-domain question answering. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5981–5988. AAAI Press, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16712.
  • Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. Evidence aggregation for answer re-ranking in open-domain question answering. In ICLR, 2018. URL https://openreview.net/forum?id=rJl3yM-Ab.
  • Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1410.3916.
  • Jason Weston, Emily Dinan, and Alexander Miller. Retrieve and refine: Improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 87–92, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5713. URL https://www.aclweb.org/anthology/W18-5713.
  • Shiyue Zhang and Mohit Bansal. Addressing semantic drift in question generation for semi-supervised question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2495–2509, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1253. URL https://www.aclweb.org/anthology/D19-1253.
  • Wanjun Zhong, Jingjing Xu, Duyu Tang, Zenan Xu, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. Reasoning over semantic-level graph for fact checking. ArXiv, abs/1909.03745, 2019. URL https://arxiv.org/abs/1909.03745.