Neural Text Generation With Unlikelihood Training

Sean Welleck
Ilia Kulikov

ICLR, 2020.


Abstract:

Neural text generation is a key tool in natural language applications, but it is well known there are major problems at its core. In particular, standard likelihood training and decoding lead to dull and repetitive outputs. While some post-hoc fixes have been proposed, in particular top-k and nucleus sampling, they do not address the fact that the token-level probabilities predicted by the model are poor. We propose a new objective, unlikelihood training, which forces unlikely generations to be assigned lower probability by the model, and show that both token- and sequence-level unlikelihood training give less repetitive, less dull text while maintaining perplexity.
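For reference, the "post-hoc fixes" named in the abstract are truncated sampling schemes. Below is a minimal, generic sketch of nucleus (top-p) sampling in the sense of Holtzman et al. (2019); it illustrates the decoding method the paper compares against, it is not code from this work, and the p = 0.9 default simply mirrors the setting reported later on this page.

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Nucleus (top-p) sampling: sample the next token from the smallest set
    of tokens whose cumulative probability exceeds p, after renormalising."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the minimal prefix whose cumulative mass reaches p (always >= 1 token).
    cutoff = int((cumulative < p).sum().item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_ids[choice])
```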
Introduction
  • Neural text generation is a vital tool in a wide range of natural language applications.
  • Standard likelihood-trained models, however, repeat themselves at the token, phrase, and sentence levels, and statistics comparing a set of human-generated utterances with model-generated responses indicate a discrepancy between the human and model word distributions.
  • This does not appear to be rectified by training on more data (Radford et al., 2019).
Highlights
  • Neural text generation is a vital tool in a wide range of natural language applications.
  • While the above may be contributing factors, a primary factor is the use of the likelihood objective itself: we demonstrate that degeneration is alleviated if the likelihood objective is replaced with our proposed unlikelihood objective.
  • We propose unlikelihood training, an approach to training neural language models that forces unlikely generations (such as repeats and overly frequent tokens) to be assigned lower probability.
  • Our results show that the likelihood objective is not constrained enough, in the sense that two models with the same perplexity can exhibit wildly different generation performance.
  • We empirically showed that unlikelihood training, at both the token and sequence levels, substantially reduced degeneration according to automatic metrics, and according to human evaluation outperformed likelihood-trained models under a variety of decoding methods, proving superior to the current state-of-the-art approaches (a minimal sketch of the token-level objective follows this list).
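To make the proposed objective concrete, here is a minimal PyTorch-style sketch of the token-level unlikelihood loss for a single sequence. It follows the idea described in the paper: alongside the usual likelihood (MLE) term, the model is penalised for assigning probability to "negative candidate" tokens, taken here to be tokens that already appeared in the preceding context (excluding the current target). The function name, the averaging over penalty terms, and the clamping constant are illustrative choices, not the authors' fairseq implementation.

```python
import torch
import torch.nn.functional as F

def token_unlikelihood_loss(logits: torch.Tensor,   # (T, vocab) next-token logits
                            targets: torch.Tensor,  # (T,) gold next-token ids
                            alpha: float = 1.0) -> torch.Tensor:
    """MLE loss plus a token-level unlikelihood term that pushes down the
    probability of previously seen context tokens (the negative candidates)."""
    log_probs = F.log_softmax(logits, dim=-1)
    mle = F.nll_loss(log_probs, targets)

    # 1 - p(token) for every token, clamped to avoid log(0).
    one_minus_p = torch.clamp(1.0 - log_probs.exp(), min=1e-5)

    unlikelihood_terms = []
    for t in range(1, targets.size(0)):
        # Negative candidates at step t: tokens from the previous context,
        # excluding the gold next token itself.
        candidates = set(targets[:t].tolist()) - {int(targets[t])}
        for c in candidates:
            unlikelihood_terms.append(-torch.log(one_minus_p[t, c]))

    if unlikelihood_terms:
        ul = torch.stack(unlikelihood_terms).mean()
    else:
        ul = logits.new_zeros(())
    return mle + alpha * ul
```

In practice the candidate construction is vectorised and batched; the explicit loop above is only meant to show which tokens are penalised at each step.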
Methods
  • Model Architecture Recent large-scale language models are based on the Transformer architecture, a multi-layer feed-forward network with self-attention (Vaswani et al., 2017).
  • We use a 16-layer Transformer with 8 attention heads, embedding dimension 1024, and fully-connected dimension 4096; the architecture is based on Baevski and Auli (2019) but with standard embedding and softmax layers.
  • Our proposed method is architecture agnostic; we choose this one as a representative of recent large-scale language models such as GPT-2 (Radford et al., 2019). A configuration sketch follows this list.
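As a point of reference, the following is a minimal PyTorch sketch of a decoder-only Transformer language model with the hyper-parameters listed above (16 layers, 8 attention heads, model dimension 1024, feed-forward dimension 4096, standard embedding and softmax layers). It is a simplified stand-in for the architecture based on Baevski and Auli (2019), not the authors' fairseq implementation; the maximum sequence length, learned positional embeddings, and omission of dropout and weight tying are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    """Decoder-only Transformer language model with the hyper-parameters
    described above; details such as dropout and weight tying are omitted."""
    def __init__(self, vocab_size: int, max_len: int = 1024,
                 d_model: int = 1024, n_heads: int = 8,
                 n_layers: int = 16, d_ff: int = 4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)  # standard softmax/output layer

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) token ids -> (batch, seq_len, vocab) logits.
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        # Additive causal mask: -inf above the diagonal blocks future positions.
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float('-inf'), device=tokens.device),
            diagonal=1)
        hidden = self.layers(self.embed(tokens) + self.pos(positions), mask=causal_mask)
        return self.out(hidden)
```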
Results
  • Baseline The baseline model trained with maximum likelihood (L_MLE) achieved 25.64 test perplexity, comparable to a current state-of-the-art system (Baevski and Auli, 2019) (24.92).
  • The greedy baseline’s seq-level repeats and single-token repeats far exceed the corresponding rates in human text (.006 and .487, respectively); the repetition metrics used throughout are sketched after this list.
  • The baseline continuations have far fewer unique tokens than human text, with a high rate of frequent tokens (Figure 1).
  • Token-Level Objective The proposed token-level unlikelihood objective (L_UL-token) reduced next-token wrong repetition while increasing the number of unique next-tokens compared to the baseline (L_MLE).
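For concreteness, here is a minimal sketch of how the repetition statistics referenced above can be computed: seq-rep-n as the portion of duplicate n-grams in a continuation, uniq-seq as the number of distinct tokens used across all continuations, and wrep as the fraction of next-token predictions that repeat a token from the preceding context while differing from the ground-truth token. The function names and the context-window size are illustrative assumptions, not taken from the authors' code.

```python
from typing import Iterable, Sequence

def seq_rep_n(continuation: Sequence, n: int = 4) -> float:
    """seq-rep-n: portion of duplicate n-grams in one continuation,
    i.e. 1 - (#unique n-grams / #n-grams). 0.0 means no repetition."""
    ngrams = [tuple(continuation[i:i + n]) for i in range(len(continuation) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def uniq_seq(continuations: Iterable[Sequence]) -> int:
    """uniq-seq: number of distinct tokens used across all continuations."""
    return len({tok for seq in continuations for tok in seq})

def wrep(predictions: Sequence, targets: Sequence, context_window: int = 128) -> float:
    """wrep: fraction of next-token predictions that repeat a token from the
    preceding gold context while being different from the ground-truth token."""
    wrong_repeats = 0
    for t, (pred, gold) in enumerate(zip(predictions, targets)):
        context = targets[max(0, t - context_window):t]
        if pred in context and pred != gold:
            wrong_repeats += 1
    return wrong_repeats / max(1, len(predictions))
```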
Conclusion
  • We introduced unlikelihood training, an approach to training neural language models. We observed that state-of-the-art models trained to maximize likelihood exhibit neural text degeneration, which we characterized and quantified in terms of repetition and token distribution mismatch.
  • Our results show that the likelihood objective is not constrained enough, in the sense that two models with the same perplexity can exhibit wildly different generation performance.
  • We empirically showed that unlikelihood training, at both the token and sequence levels, substantially reduced degeneration according to automatic metrics, and according to human evaluation outperformed likelihood-trained models under a variety of decoding methods, proving superior to the current state-of-the-art approaches.
Tables
  • Table1: Example greedy completions showing representative examples of the MLE model’s degenerate single-token repetition (top), phrase-level repetition (middle), and ‘structural’ repetition (bottom), as well as the proposed method’s ability to fix these degenerate behaviors
  • Table2: Results for token-level objectives (upper) and sequence-level fine-tuning (lower) according to sequence-level (left) and token-level (right) metrics using the test subset of Wikitext-103
  • Table3: Human eval results. * denotes statistical significance (2-sided binomial test, p < .05)
  • Table4: Top: Degenerate repetition in completions from a state-of-the-art large-scale language model (Radford et al., 2019). The examples contain single-word repetitions, phrase-level repetitions, and structural repetitions where some tokens within a repeating phrase vary. Recently proposed stochastic samplers (top-k, nucleus) exhibit degeneration based on hyper-parameter settings
  • Table5: Results for token-level objectives (upper) and sequence-level fine-tuning (lower) according to sequence-level (left) and token-level (right) metrics using the validation subset of Wikitext-103
  • Table6: Stochastic decoding results according to sequence-level (left) and token-level (right) metrics using the test subset of Wikitext-103
  • Table7: GPT-2 results according to sequence-level and token-level metrics using the validation subset of Wikitext-103. seq-rep-4 is computed on the word level; ppl, acc, rep, wrep are computed on the BPE level
  • Table8: Results for sequence-level fine-tuning using random-seq candidates according to sequence-level (left) and token-level (right) metrics using the validation subset of Wikitext-103
  • Table9: Full human evaluation results. Includes additional comparisons omitted for brevity, and the raw number of wins and losses for each comparison (the significance test applied to such win/loss counts is sketched below)
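Statistical significance of the pairwise human-evaluation outcomes is assessed with a two-sided binomial test against the null hypothesis that each crowdworker choice is a fair coin flip (p = 0.5). Below is a minimal SciPy sketch of that test; the win/loss counts in the usage example are placeholders, not values from the paper's tables.

```python
from scipy.stats import binomtest

def pairwise_significance(wins: int, losses: int, alpha: float = 0.05) -> bool:
    """Two-sided binomial test on pairwise human-evaluation outcomes.
    Returns True if the win rate differs significantly from chance (0.5)."""
    result = binomtest(wins, n=wins + losses, p=0.5, alternative='two-sided')
    return result.pvalue < alpha

# Example with placeholder counts (not taken from the paper):
print(pairwise_significance(41, 9))   # True: a 41-9 split is far from chance
print(pairwise_significance(27, 23))  # False: close to a 50/50 split
```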
Related work
  • Neural Text Degeneration Recently, several papers have observed various forms of neural text degeneration, especially in open-ended generation tasks. In dialogue, it has been shown that there is a mismatch between model and human word distributions: generative models are more likely to output frequent words, but less likely to produce rare words, than humans. For example, this was observed across all generative models submitted to the ConvAI2 NeurIPS 2018 competition (Dinan et al., 2019). In language modeling, the work of Holtzman et al. (2019) highlighted problems with the word frequency distribution and level of repetition in model generations compared to human text. These issues are not remedied by simply increasing the amount of training data; e.g. large-scale GPT-2 language models (Radford et al., 2019) display the same issues.

    Improved Decoding Algorithms Several methods have been proposed to rectify these issues. The primary ones involve changing the decoding method, either to a sophisticated beam search variant or to stochastic decoding, e.g. sampling. Different variants of beam search have been explored (Li et al., 2016; Vijayakumar et al., 2018; Kulikov et al., 2018; Holtzman et al., 2018) which can decrease a model’s level of repetition by selecting candidates that are unlike previously chosen ones. Separately, hard or soft beam blocking has been investigated (Paulus et al., 2017; Klein et al., 2017), whereby previously generated n-grams are blocked from subsequent generation. This approach is often used in dialogue generation; it fixes some token- or phrase-level repetitions but also removes repetitions that would naturally occur in human text (a generic sketch of hard n-gram blocking appears below).
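As an illustration of hard n-gram blocking, here is a generic sketch of a single greedy decoding step in which any token that would complete an n-gram already present in the generated prefix is forbidden. It is a schematic example of the technique, not the OpenNMT or fairseq implementation; the function name and signature are invented for this sketch.

```python
import torch

def blocked_greedy_step(logits: torch.Tensor, prev_tokens: list, n: int = 4) -> int:
    """One greedy decoding step with hard n-gram blocking: ban every token
    that would complete an n-gram already present in prev_tokens."""
    logits = logits.clone()  # avoid mutating the caller's tensor
    if len(prev_tokens) >= n:
        prefix = tuple(prev_tokens[-(n - 1):])
        seen = {tuple(prev_tokens[i:i + n]) for i in range(len(prev_tokens) - n + 1)}
        banned = {ngram[-1] for ngram in seen if ngram[:-1] == prefix}
        for token in banned:
            logits[token] = float('-inf')  # banned tokens can no longer be chosen
    return int(torch.argmax(logits))
```

Blocking removes repeats unconditionally, which is exactly the drawback noted above: it also removes repetitions that legitimately occur in human text.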
Funding
  • With greedy search, token-level unlikelihood training improved the 4-gram repetition in continuations by 36% (seq-rep-4 .283 vs. .442) while generating roughly 22% more unique tokens than the baseline (uniq-seq 13.2k vs. 10.8k), and a more favorable rate of infrequent tokens (Figure 1).
  • Sequence-Level Objective The sequence-level fine-tuning (L_UL-token+seq) yielded further improvements, with a 97% reduction in 4-gram repetitions (seq-rep-4 .013 vs. .442) from the baseline level (greedy L_MLE), and 77% more unique tokens (uniq-seq 19.1k vs. 10.8k) with beam search (a sketch of the sequence-level penalty follows this list).
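To illustrate the sequence-level fine-tuning referenced above, here is a hedged sketch of the sequence-level unlikelihood penalty: the model first decodes a continuation, and every decoded token that lies inside an n-gram which already occurred earlier in that continuation is treated as a negative candidate. The decoding step and batching are omitted, and the helper names, clamping constant, and normalisation are illustrative rather than taken from the authors' code.

```python
import torch

def repeated_ngram_mask(tokens: list, n: int = 4) -> torch.Tensor:
    """Boolean mask over a decoded continuation marking every position that
    falls inside an n-gram which already occurred earlier in the sequence."""
    mask = [False] * len(tokens)
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            for j in range(i, i + n):
                mask[j] = True
        seen.add(ngram)
    return torch.tensor(mask)

def sequence_unlikelihood_loss(log_probs: torch.Tensor,  # (T, vocab) log-probs from decoding
                               decoded: list,            # T decoded token ids
                               n: int = 4) -> torch.Tensor:
    """Unlikelihood penalty applied to tokens inside repeated n-grams of a
    model-generated continuation."""
    idx = torch.arange(len(decoded))
    token_log_p = log_probs[idx, torch.tensor(decoded)]      # log p(x_t | x_<t)
    one_minus_p = torch.clamp(1.0 - token_log_p.exp(), min=1e-5)
    mask = repeated_ngram_mask(decoded, n).float()
    # Average the -log(1 - p) penalty over the flagged (repeating) positions.
    return -(mask * torch.log(one_minus_p)).sum() / mask.sum().clamp(min=1.0)
```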
Figures
  • Sequence-level token distribution using the test subset of Wikitext-103. Nucleus sampling (p = 0.9) and beam blocking (n = 4) are used with the maximum likelihood baseline (L_MLE).
  • Screenshot of the user interface used in the human evaluation.

Reference
  • Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In International Conference on Learning Representations.
  • Yejin Choi. 2018. The missing representation in neural (language) models. 3rd Workshop on Representation Learning for NLP (RepL4NLP).
  • Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). Association for Computational Linguistics.
  • Hal Daume, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325.
  • Adji B Dieng, Kyunghyun Cho, David M Blei, and Yann LeCun. 2018. Learning with reflective likelihoods.
  • Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2019. The second conversational intelligence challenge (ConvAI2). arXiv preprint arXiv:1902.00098.
  • Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. 2017. Classical structured prediction losses for sequence to sequence learning. arXiv preprint arXiv:1711.04956.
  • Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.
  • Tianxing He and James Glass. 2019. Negative training for neural dialogue response generation. arXiv preprint arXiv:1903.02134.
  • Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638–1649. Association for Computational Linguistics.
  • Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  • Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.
  • Ilya Kulikov, Alexander H Miller, Kyunghyun Cho, and Jason Weston. 2018. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.
  • Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. 2006. A tutorial on energy-based learning. Predicting Structured Data.
  • Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562.
  • Margaret Li, Jason Weston, and Stephen Roller. 2019. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087.
  • Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
  • Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
  • Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732.
  • Stephane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635.
  • Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1702–1723, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Jesse Vig. 2018. Deconstructing BERT: Distilling 6 patterns from 100 million parameters. Medium.
  • Ashwin K Vijayakumar, Michael Cogswell, Ramprasaath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Jason Weston, Emily Dinan, and Alexander H Miller. 2018. Retrieve and refine: Improved sequence generation models for dialogue. arXiv preprint arXiv:1808.04776.
  • Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2016. SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473.
  • Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.