Don't Say That! Making Inconsistent Dialogue Unlikely with Unlikelihood Training

ACL, pp. 4715-4728, 2020.


Abstract:

Generative dialogue models currently suffer from a number of problems which standard maximum likelihood training does not address. They tend to produce generations that (i) rely too much on copying from the context, (ii) contain repetitions within utterances, (iii) overuse frequent words, and (iv) at a deeper level, contain logical flaws. In this work we show how all of these problems can be addressed by extending the recently introduced unlikelihood loss (Welleck et al, 2019a) to these cases. We show that appropriate loss functions which regularize generated outputs to match human distributions are effective for the first three issues. For the last important general issue, we show that applying unlikelihood to collected data of what a model should not do is effective for improving logical consistency, potentially paving the way to generative models with greater reasoning ability. We demonstrate the efficacy of our approach across several dialogue tasks.
Introduction
  • Open-ended tasks such as dialogue reveal a number of issues with current neural text generation methods.
  • Critical failings are exposed in less constrained generation: reliance on repetitive copying and overuse of frequent words, and an inability to maintain logical coherence.
  • The former shows the learning objective is faulty in that it cannot match simple statistics of the training data, while the latter touches more to the heart of artificial intelligence.
  • Unlikelihood can be seen as a much more general framework, as the following sections show.
Highlights
  • Open-ended tasks such as dialogue reveal a number of issues with current neural text generation methods
  • Our work provides new applications of unlikelihood training (Welleck et al, 2019a), showing that unlikelihood offers a general framework for improving generative models, and in particular dialogue models
  • We see that training unlikelihood using only-contexts or only-labels reduces their corresponding metrics dramatically compared to the maximum likelihood estimation baseline
  • Figure 5 shows how the vocabulary distribution obtained after unlikelihood training is affected by the choice of mixing hyperparameter α (Eq 1; a hedged reconstruction follows this list): it can smoothly transition between the human training distribution and the maximum likelihood estimation trained distribution (‘Baseline’), which is far from the human one
  • The vocabulary unlikelihood fine-tuning shifts probability mass from the over-represented frequent words towards under-represented medium and rare words, with the effect strengthening as α increases
  • We studied several aspects that contribute to that goal, defined metrics to measure them, and proposed algorithms that improve them, mitigating some of the failings of maximum likelihood training, the current dominant approach
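Eq 1 is referenced above but not reproduced on this page. The following is a hedged reconstruction of the mixed objective, assuming the token-level formulation of Welleck et al (2019a), where C_t is the set of negative candidate tokens at decoding step t and α is the mixing weight:

    % Hedged reconstruction of Eq 1 (not copied from the paper):
    % the total loss mixes likelihood and unlikelihood terms via alpha.
    \mathcal{L}(\theta) = \mathcal{L}_{\mathrm{MLE}}(\theta) + \alpha\,\mathcal{L}_{\mathrm{UL}}(\theta),
    \qquad
    \mathcal{L}_{\mathrm{UL}}(\theta) = -\sum_{t}\sum_{c \in \mathcal{C}_t} \log\bigl(1 - p_\theta(c \mid x, y_{<t})\bigr)

As α approaches 0 the objective reduces to standard MLE; larger α pushes more probability mass away from the negative candidates, which is the transition Figure 5 traces.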
Methods
  • In all of the experiments the authors employ a large pre-trained seq2seq Transformer (Vaswani et al, 2017) as the base model, which they fine-tune for particular tasks with the objectives outlined in Section 2 and specified in each experiment below (a minimal sketch of the unlikelihood term follows this list).
  • The model was pre-trained with a batch size of 3072 sequences for approximately 3M updates, using a learning rate of 5e-4 and an inverse square root scheduler.
  • This pre-training took approximately two weeks using 64 NVIDIA V100s.
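To make the fine-tuning objectives above concrete, here is a minimal PyTorch sketch of the token-level unlikelihood term, assuming per-step boolean masks of negative candidate tokens; it is an illustrative reconstruction under those assumptions, not the authors' released implementation:

    import torch
    import torch.nn.functional as F

    def unlikelihood_loss(logits, neg_mask, eps=1e-8):
        """Token-level unlikelihood term in the spirit of Welleck et al (2019a).

        logits:   (T, V) decoder logits for one sequence.
        neg_mask: (T, V) boolean mask, True where a token is a negative
                  candidate at that step (which tokens qualify depends on
                  the objective, e.g. context-repetition vs label-repetition).
        """
        probs = F.softmax(logits, dim=-1)
        # -log(1 - p) grows as p -> 1, so minimizing it pushes the
        # probability of each negative candidate toward zero.
        penalty = -torch.log((1.0 - probs).clamp(min=eps))
        return (penalty * neg_mask.float()).sum()

    # Mixed objective as in Eq 1 (alpha is the mixing weight):
    # loss = mle_loss + alpha * unlikelihood_loss(logits, neg_mask)

In practice the term is averaged over a batch and added to the cross-entropy loss during fine-tuning.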
Results
  • Results for ConvAI2 are shown in Table 1.
  • The authors see that training unlikelihood using only-contexts or only-labels reduces their corresponding metrics dramatically compared to the MLE baseline (a sketch of how these candidate sets can be built follows this list).
  • Results are given in Figure 4, showing a statistically significant improvement over the baseline according to humans.
  • As seen in Table 6 for the two-utterance task, the perplexity of contradicting utterances (12.5) is on average lower than that of neutral (36.7) or triple-entailing (17.5) utterances, although it is higher than that of entailing utterances.
  • The authors compute significance with a two-tailed binomial test (p < .01).
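The only-contexts and only-labels variants differ only in which tokens count as negative candidates. A hedged, single-token sketch of that construction follows; the helper name is hypothetical, and the paper's actual objectives penalize repeated n-grams rather than single tokens:

    import torch

    def repetition_candidates(context_ids, label_ids, vocab_size,
                              penalize_context=True, penalize_label=True):
        """Hypothetical helper: build a (T, V) negative-candidate mask.

        penalize_context=True flags tokens that copy the context
        (the only-contexts variant); penalize_label=True flags tokens
        already emitted earlier in the label (the only-labels variant).
        """
        T = len(label_ids)
        mask = torch.zeros(T, vocab_size, dtype=torch.bool)
        context_set = set(context_ids)
        for t in range(T):
            negatives = set()
            if penalize_label:
                negatives |= set(label_ids[:t])
            if penalize_context:
                negatives |= context_set
            negatives.discard(label_ids[t])  # never penalize the gold token
            if negatives:
                mask[t, list(negatives)] = True
        return mask

Feeding this mask to the unlikelihood term sketched in the Methods section penalizes one repetition type or the other, matching the per-metric reductions reported in Table 1.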
Conclusion
  • Generating consistent and coherent human-like dialogue is a core goal of natural language research.
  • The authors' method defines objective functions under the umbrella of unlikelihood: during training, the authors wish to make inconsistent dialogue unlikely by lowering the probability of such events occurring.
  • This makes generative models repeat themselves less, copy the context less, and use more rare words from the vocabulary – closer to matching human statistics.
  • Future work could apply this same technique with other supervised data, e.g. correcting causal or commonsense reasoning errors (Zellers et al, 2019; Qin et al, 2019)
Tables
  • Table1: Evaluation on the ConvAI2 task valid set (test set is hidden), comparing standard likelihood (MLE) with context and label repetition unlikelihood loss training. The repetition types can be decreased depending on which type of unlikelihood loss is used, with minimal changes in perplexity and F1
  • Table2: Evaluation on the Wizard of Wikipedia test set, comparing standard likelihood (MLE) with context and label repetition unlikelihood loss training. The repetition types can be decreased depending on the type of unlikelihood loss used, while minimally impacting F1
  • Table3: Evaluation on the ELI5 task test set, comparing standard likelihood (MLE) with context and label repetition unlikelihood loss training. The repetition types can be decreased depending on which type of unlikelihood loss is used, while improving F1
  • Table4: Unlikelihood loss applied to vocabulary distributions. Stronger α terms greatly shift probability mass from the most Frequent words to Medium and Rare words, at a small cost to PPL and F1. Frequent, medium, rare and rarest token classes are defined as the sets of tokens whose cumulative masses account for the top 40%, the next 30%, the next 20% and final 10% of tokens empirically generated by humans, respectively
  • Table5: Dialogue NLI two utterance generation task dataset statistics
  • Table6: Test evaluation on the Dialogue NLI two utterance generation task, comparing standard likelihood (MLE) models trained on pushshift.io Reddit and ConvAI2 with unlikelihood loss NLI training. Results are broken down according to whether the premise and positive candidate are entailing, triple-entailing, or neutral (Entail, Tr.-E, Neutral). Selection Accuracy measures how often the model assigns lower perplexity to the positive candidate than to the negative candidate in the pair. Top two rows: for standard maximum likelihood models, the perplexity of contradicting utterances is lower compared to neutral or triple-entailing utterances (albeit higher compared to entailing utterances), showing partial failure at the coherence task. Bottom row: NLI Unlikelihood training yields large improvements on all coherence metrics, while minimally increasing overall perplexity
  • Table7: Test evaluation on the Full Dialogue NLI generation task. NLI unlikelihood training improves coherence metrics compared to likelihood (MLE) training. For UL, the triple-entailing or neutral candidates are assigned relatively lower perplexity compared to contradicting candidates, with higher selection accuracy for coherent labels
  • Table8: Evaluation on the Wizard of Wikipedia task test set, comparing standard likelihood (MLE) with repetition unlikelihood loss training, where both methods use beam search (beam size of 5)
  • Table9: Unlikelihood loss applied to vocabulary distributions. Stronger α terms greatly shift probability mass from the most Frequent words to Medium and Rare words, at a small cost to PPL and F1. Frequent, medium, rare and rarest token classes are defined as the sets of tokens whose cumulative masses account for the top 40%, the next 30%, the next 20% and final 10% of tokens empirically generated by humans, respectively. Nucleus sampling can also produce a distribution close to human with parameter p close to 1, but with larger losses in F1. (A short sketch of the token-class bucketing follows this list.)
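The frequent/medium/rare/rarest classes used in Tables 4 and 9 can be computed from the empirical human token distribution. A minimal sketch, assuming an array of per-token counts from human-generated text (the function name is hypothetical):

    import numpy as np

    def token_classes(human_counts, boundaries=(0.4, 0.7, 0.9)):
        """Assign each token id a class by cumulative human probability mass:
        0 = frequent (top 40% of mass), 1 = medium (next 30%),
        2 = rare (next 20%), 3 = rarest (final 10%).
        """
        probs = human_counts / human_counts.sum()
        order = np.argsort(-probs)        # most frequent tokens first
        cum = np.cumsum(probs[order])     # cumulative mass in that order
        classes = np.searchsorted(boundaries, cum, side="left")
        out = np.empty(len(probs), dtype=int)
        out[order] = classes
        return out

The paper's vocabulary objective then penalizes tokens the model overuses relative to this human distribution, producing the mass shift the tables report as α grows.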
Related work
  • Our work provides new applications of unlikelihood training (Welleck et al, 2019a), showing that unlikelihood offers a general framework for improving generative models, and in particular dialogue models. Outside of that work, the use of negative training in dialogue retrieval, rather than generation, has previously been studied extensively; see e.g. Humeau et al (2019) and Nugmanova et al (2019). In the area of generative dialogue, a number of works have focused on improving the standard likelihood training approach. Closer to our work is that of He and Glass (2019), which developed the approach of negative training to prevent generic and malicious responses in dialogue models. In terms of improving repetition and specificity, a recent alternative approach is that of control (Fan et al, 2018; Ficler and Goldberg, 2017; Ghazvininejad et al, 2017; See et al, 2019). Nucleus sampling (Holtzman et al, 2019) can help to remove generic or repetitive utterances at the expense of accuracy, but was shown to be inferior to beam blocking, which in turn was shown to be inferior to unlikelihood in Welleck et al (2019a).
Contributions
  • Shows how all of these problems can be addressed by extending the recently introduced unlikelihood loss to these cases
  • Shows that appropriate loss functions which regularize generated outputs to match human distributions are effective for the first three issues
  • Shows applying unlikelihood to collected data of what a model should not do is effective for improving logical consistency, potentially paving the way to generative models with greater reasoning ability
  • Demonstrates the efficacy of our approach across several dialogue tasks
  • Shows how the recently introduced unlikelihood objective can be generalized to remedy these problems
References
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  • Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2020. The second conversational intelligence challenge (ConvAI2). In The NeurIPS '18 Competition, pages 187–208, Cham. Springer International Publishing.
  • Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International Conference on Learning Representations.
  • Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54. Association for Computational Linguistics.
  • Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy. Association for Computational Linguistics.
  • Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104, Copenhagen, Denmark. Association for Computational Linguistics.
  • Saadia Gabriel, Antoine Bosselut, Ari Holtzman, Kyle Lo, Asli Celikyilmaz, and Yejin Choi. 2019. Cooperative generator-discriminator networks for abstractive summarization with narrative flow. arXiv preprint arXiv:1907.01272.
  • Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. 2017. Hafez: an interactive poetry generation system. In Proceedings of ACL 2017, System Demonstrations, pages 43–48, Vancouver, Canada. Association for Computational Linguistics.
  • Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, and Graham Neubig. 2019. Latent relation language models. arXiv preprint arXiv:1908.07690.
  • Tianxing He and James Glass. 2019. Negative training for neural dialogue response generation. arXiv preprint arXiv:1903.02134.
  • Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  • Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Margaret Li, Jason Weston, and Stephen Roller. 2019. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. In Proceedings of the NeurIPS Workshop on Conversational AI.
  • Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84, Copenhagen, Denmark. Association for Computational Linguistics.
  • Aigul Nugmanova, Andrei Smirnov, Galina Lavrentyeva, and Irina Chernykh. 2019. Strategy of the negative sampling for training retrieval-based dialogue systems. In 2019 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), pages 844–848. IEEE.
  • Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
  • Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, and Yejin Choi. 2019. Counterfactual story reasoning and generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5042–5052, Hong Kong, China. Association for Computational Linguistics.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
  • Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1702–1723, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019a. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.
  • Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. 2019b. Dialogue natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3731–3741, Florence, Italy. Association for Computational Linguistics.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
  • Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164–174, Melbourne, Australia. Association for Computational Linguistics.
  • Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
  • Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.