Generalization through Memorization: Nearest Neighbor Language Models

Urvashi Khandelwal
Omer Levy
Dan Jurafsky
Luke Zettlemoyer
Mike Lewis

ICLR, 2020.


Abstract:

We introduce $k$NN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a $k$-nearest neighbors ($k$NN) model. The nearest neighbors are computed according to distance in the pre-trained LM embedding space, and can be drawn from any text collection, including the original LM training data. Applying this augmentation to a strong WIKITEXT-103 LM, with neighbors drawn from the original training set, improves the base model by almost 3 perplexity points with no additional training.
Introduction
  • Neural language models (LMs) typically solve two subproblems: (1) mapping sentence prefixes to fixed-sized representations, and (2) using these representations to predict the next word in the text (Bengio et al, 2003; Mikolov et al, 2010).
  • The nearest neighbors are computed according to distance in the pre-trained embedding space and can be drawn from any text collection, including the original LM training data; the interpolation is written out after this list.
  • This approach allows rare patterns to be memorized explicitly, rather than implicitly in model parameters.
  • It improves performance when the same training data is used for learning the prefix representations and the kNN model, strongly suggesting that the prediction problem is more challenging than previously appreciated.
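
For reference, the interpolation described above can be written out explicitly. Here $f(x)$ is the fixed-sized prefix representation, $d(\cdot,\cdot)$ is the (squared L2) distance used for retrieval, $\mathcal{N}$ is the set of retrieved (key, value) pairs, and $\lambda$ is the interpolation weight; the notation is a reconstruction for this summary rather than a quotation from the paper.

    $p_{\mathrm{kNN}}(y \mid x) \propto \sum_{(k_i, v_i) \in \mathcal{N}} \mathbb{1}[y = v_i] \exp(-d(k_i, f(x)))$
    $p(y \mid x) = \lambda \, p_{\mathrm{kNN}}(y \mid x) + (1 - \lambda) \, p_{\mathrm{LM}}(y \mid x)$
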
Highlights
  • Neural language models (LMs) typically solve two subproblems: (1) mapping sentence prefixes to fixed-sized representations, and (2) using these representations to predict the next word in the text (Bengio et al, 2003; Mikolov et al, 2010)
  • We present a new language modeling approach that is based on the hypothesis that the representation learning problem may be easier than the prediction problem
  • Language models are trained to minimize the negative log-likelihood of the training corpus, and are evaluated by perplexity on held-out data.
  • kNN-LM: the keys used for the kNN-LM are the 1024-dimensional representations fed to the feedforward network in the final layer of the Transformer LM.
  • We have introduced the kNN-LM, which can significantly outperform standard language models by directly querying training examples at test time; this querying and interpolation is sketched after this list.
  • The approach can be applied to any neural language model
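
The querying step referenced above can be made concrete with a short sketch. It assumes the k nearest neighbors have already been retrieved for a single query; the function and argument names (knn_lm_probs, p_lm, distances, values, vocab_size, lam) are illustrative rather than taken from the authors' code, and the default λ is only a placeholder to be tuned on held-out data.

    import numpy as np

    def knn_lm_probs(p_lm, distances, values, vocab_size, lam=0.25):
        """Interpolate a base LM distribution with a kNN distribution.

        p_lm      : (vocab_size,) base LM probabilities for the next token
        distances : (k,) squared L2 distances from the query to each retrieved key
        values    : (k,) vocabulary ids stored with the retrieved keys
        lam       : interpolation weight, tuned on held-out data
        """
        # Softmax over negative distances: closer neighbors get more weight.
        weights = np.exp(-(distances - distances.min()))
        weights /= weights.sum()

        # Aggregate neighbor weights onto the vocabulary items they point to,
        # since several neighbors may share the same target token.
        p_knn = np.zeros(vocab_size)
        np.add.at(p_knn, values, weights)

        # Linear interpolation of the two distributions.
        return lam * p_knn + (1.0 - lam) * p_lm

The softmax over negative distances concentrates mass on the closest retrieved contexts, and aggregating weights by target token lets repeated neighbors reinforce the same prediction.
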
Methods
  • Using the training data as the datastore: we first experiment with creating a datastore from the same data used to train the LM; a sketch of building such a datastore follows this list.
  • We provide reported perplexities from two other recent models that build upon Baevski and Auli’s, suggesting that further improvements may be possible by augmenting the kNN-LM with these techniques.
  • We experiment with a continuous cache model, a related but orthogonal technique from Grave et al (2017c), in which the model saves and retrieves neighbors from earlier in the test document.
  • We compare with models trained only on the standard training set, but recent work has shown performance can be improved by training on additional data, from either the test set (Krause et al, 2019) or large amounts of web text (Shoeybi et al, 2019).
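
As referenced above, the datastore is simply the set of (key, value) pairs saved during a forward pass over the chosen text collection: each key is a context representation and each value is the token that followed that context. A minimal sketch of building and querying such a store with FAISS (Johnson et al, 2017), the similarity-search library the paper relies on for large-scale retrieval, is shown below; the file names, the exact index type, and k are illustrative assumptions rather than the paper's configuration.

    import faiss
    import numpy as np

    d = 1024  # dimensionality of the keys (final-layer LM representations)

    # Keys and values saved during the forward pass over the training set:
    # keys is an (N, d) float32 array, values is an (N,) int array of next tokens.
    keys = np.load("datastore_keys.npy").astype("float32")
    values = np.load("datastore_values.npy")

    # Exact L2 index for clarity; a datastore with billions of entries would
    # instead use an approximate, quantized FAISS index to fit in memory.
    index = faiss.IndexFlatL2(d)
    index.add(keys)

    # Retrieve the nearest training contexts for a batch of test-time queries.
    queries = np.load("test_queries.npy").astype("float32")  # (B, d)
    distances, ids = index.search(queries, 1024)  # squared L2 distances, row ids
    neighbor_values = values[ids]  # tokens observed after the retrieved contexts

The distances and neighbor values returned here can be plugged directly into the interpolation sketch given earlier.
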
Results
  • LMs are trained to minimize the negative log-likelihood of the training corpus, and evaluated by perplexity on held out data.
  • We perform a single forward pass over the training set with the trained model, in order to save the keys and values.
  • During this forward pass, each target token is provided a minimum of 1,536 tokens of prior context for WIKITEXT-103 and a minimum of 512 for BOOKS.
  • We tune the interpolation parameter λ on the validation set; a sketch of this tuning follows this list.
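
Because λ is the only hyperparameter the interpolation introduces, tuning it amounts to a sweep over candidate values that minimizes validation perplexity (the exponentiated average negative log-likelihood). The sketch below assumes hypothetical helpers (validation_batches, lm_gold_probs, knn_gold_probs) that return, for each batch, the probability each model assigns to the gold next token.

    import numpy as np

    def perplexity(log_probs):
        # Perplexity is exp of the average negative log-likelihood.
        return float(np.exp(-np.mean(log_probs)))

    def tune_lambda(validation_batches, lm_gold_probs, knn_gold_probs,
                    grid=np.linspace(0.0, 1.0, 21)):
        """Pick the interpolation weight that minimizes validation perplexity."""
        best_lam, best_ppl = None, float("inf")
        for lam in grid:
            log_probs = []
            for batch in validation_batches:
                p_lm = lm_gold_probs(batch)    # (batch_size,) gold-token probs
                p_knn = knn_gold_probs(batch)  # (batch_size,) gold-token probs
                p_mix = lam * p_knn + (1.0 - lam) * p_lm
                log_probs.append(np.log(p_mix + 1e-12))
            ppl = perplexity(np.concatenate(log_probs))
            if ppl < best_ppl:
                best_lam, best_ppl = lam, ppl
        return best_lam, best_ppl

Note that validation_batches should be a reusable sequence (e.g., a list of precomputed batches), since it is iterated once per grid point.
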
Conclusion
  • We have introduced kNN-LMs, which can significantly outperform standard language models by directly querying training examples at test time.
  • The approach can be applied to any neural language model.
  • The success of this method suggests that learning similarity functions between contexts may be an easier problem than predicting the next word from some given context.
  • Future work should explore explicitly training similarity functions, and reducing the size of the datastore.
Tables
  • Table1: Performance on WIKITEXT-103. The kNN-LM substantially outperforms existing work. Gains are additive with the related but orthogonal continuous cache, allowing us to improve the base model by almost 3 perplexity points with no additional training. We report the median of three random seeds
  • Table2: Performance on BOOKS, showing that kNN-LM works well in multiple domains
  • Table3: Experimental results on WIKI-3B. The model trained on 100M tokens is augmented with a datastore that contains about 3B training examples, outperforming the vanilla LM trained on the entire WIKI-3B training set
  • Table4: Domain adaptation experiments, with results on BOOKS. Adding an in-domain datastore to a Wikipedia-trained model improves results by 23 points, approaching in-domain training
  • Table5: WIKITEXT-103 validation results using different states from the final layer of the LM as the representation function f (·) for keys and queries. We retrieve k=1024 neighbors and λ is tuned for each
  • Table6: Another example where the kNN model places much higher probability mass on the correct target, compared to the LM. The nearest neighbors search has retrieved a training set context that is extremely similar to the test context, while very rare and in the long-tail of patterns
  • Table7: In this example, the desired date pattern appears in many examples. Yet, the nearest neighbors search is able to identify the only training set context which is relevant to the test context and assigns it the highest probability mass
  • Table8: In this case, the model is able to memorize the fact that Georges Bizet wrote Carmen
  • Table9: This is an example where the pkNN distribution is relatively flat, as several words are plausible continuations. However, the nearest neighbors search assigns the highest probability to the correct target and a corresponding context that is particularly relevant. In contrast, the LM probability on the correct target is lower
Related work
  • We discuss related uses of caches for language modeling in Section 2. Similar kNN models to ours have been proposed for computer vision tasks (Papernot & McDaniel, 2018; Orhan, 2018; Zhao & Cho, 2018), primarily motivated by improving interpretability and robustness to adversarial attacks. We hypothesize that our method may be particularly effective for language modeling, because plentiful unlabeled data allows datastores of billions of tokens, and language modeling often requires world knowledge to be learnt from few examples.
  • Nearest neighbor models have been applied to a number of NLP problems in the past, such as part-of-speech tagging (Daelemans et al, 1996) and morphological analysis (Bosch et al, 2007), but the use of learned representations makes the similarity function much more effective in the case of neural models. More recently, Kaiser et al (2017) have used a similarly differentiable memory that is learned and updated during training, and is applied to one-shot learning tasks.
  • Several models have also improved language generation by using training examples directly at test time. Guu et al (2018) propose a model that samples training sentences at random and edits them with a sequence-to-sequence model, but does not use a retrieval mechanism such as kNN. Gu et al (2018) introduce a translation model that attends over retrieved training set examples. Weston et al (2018) improve a dialogue response generation model by refining similar instances from the training set. kNN-LM differs from these approaches by working at the level of individual tokens instead of whole training sentences, and by not incorporating the retrieval mechanism into the training pipeline.
  • A general trend in machine learning, and in language modeling in particular, is that adding more data consistently improves performance (Devlin et al, 2019; Radford et al, 2019; Yang et al, 2019; Liu et al, 2019; Zellers et al, 2019; Shoeybi et al, 2019). Our work offers an alternative method for scaling language models, in which relatively small models learn context representations, and a nearest neighbor search acts as a highly expressive classifier.
Funding
  • Introduces kNN-LMs, which extend a pre-trained neural language model by linearly interpolating it with a k-nearest neighbors model
  • Shows that this approach has implications for efficiently scaling up to larger training sets and allows for effective domain adaptation, by varying the nearest neighbor datastore, again without further training
  • Presents a new language modeling approach that is based on the hypothesis that the representation learning problem may be easier than the prediction problem
  • Provides strong evidence that existing language models are much better at the first problem, by using their prefix embeddings in a simple nearest neighbor scheme that significantly improves overall performance
  • Introduces kNN-LM, an approach that extends a pre-trained LM by linearly interpolating its word distribution with a k-nearest neighbors model
Reference
  • Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In ICLR, 2019.
  • Anton Bakhtin, Arthur Szlam, Marc'Aurelio Ranzato, and Edouard Grave. Lightweight adaptive mixture of neural and n-gram language models. arXiv preprint arXiv:1804.07705, 2018.
  • Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
  • Antal van den Bosch, Bertjan Busser, Sander Canisius, and Walter Daelemans. An efficient memory-based morphosyntactic tagger and parser for Dutch. LOT Occasional Series, 7:191–206, 2007.
  • Walter Daelemans, Jakub Zavrel, Peter Berck, and Steven Gillis. MBT: A memory-based part of speech tagger-generator. In WVLC, 1996.
  • Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • Edouard Grave, Moustapha M Cisse, and Armand Joulin. Unbounded cache model for online language modeling with open vocabulary. In NIPS, pp. 6042–6052, 2017a.
  • Edouard Grave, Armand Joulin, Moustapha Cisse, Herve Jegou, et al. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning, pp. 1302–1310, 2017b.
  • Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. In ICLR, 2017c.
  • Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor OK Li. Search engine guided neural machine translation. In AAAI, 2018.
  • Kelvin Guu, Tatsunori B Hashimoto, Yonatan Oren, and Percy Liang. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450, 2018.
  • Jeff Johnson, Matthijs Douze, and Herve Jegou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.
  • Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. Learning to remember rare events. In ICLR, 2017.
  • Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of transformer language models. arXiv preprint arXiv:1904.08378, 2019.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Hongyin Luo, Lan Jiang, Yonatan Belinkov, and James Glass. Improving neural language models by segmenting, attending, and predicting the future. In ACL, 2019.
  • Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In ICLR, 2017.
  • Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010.
  • A. Emin Orhan. A simple cache model for image recognition. In NeurIPS, 2018.
  • Nicolas Papernot and Patrick McDaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.
  • Ofir Press and Lior Wolf. Using the output embedding to improve language models. In ICLR, 2017.
  • Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. URL https://d4mucfpksywv.cloudfront.net/betterlanguage-models/language-models.pdf, 2019.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  • Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053, 2019.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
  • Jason Weston, Emily Dinan, and Alexander H Miller. Retrieve and refine: Improved sequence generation models for dialogue. arXiv preprint arXiv:1808.04776, 2018.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
  • Rowan Zellers, Ari Holtzman, Hannah Rashkin, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. In NeurIPS, 2019.
  • Jake Zhao and Kyunghyun Cho. Retrieval-augmented convolutional neural networks for improved robustness against adversarial examples. arXiv preprint arXiv:1802.09502, 2018.
  • Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27, 2015.