Learning an Unreferenced Metric for Online Dialogue Evaluation

ACL, pp. 2430-2441, 2020.


Abstract:

Evaluating the quality of a dialogue interaction between two agents is a difficult task, especially in open-domain chit-chat style dialogue. There have been recent efforts to develop automatic dialogue evaluation metrics, but most of them do not generalize to unseen datasets and/or need a human-generated reference response during inference.
Introduction
  • Recent approaches in deep neural language generation have opened new possibilities in dialogue generation (Serban et al., 2017; Weston et al., 2018).
  • The response of a generative model is typically evaluated by comparing it with the ground-truth response using automatic word-overlap metrics, such as BLEU or METEOR (a minimal example is sketched after this list).
  • These metrics, along with ADEM and RUBER, are essentially single-step evaluation metrics, where a score is calculated for each context-response pair.
  • A key benefit of this approach is that the metric can be used for online evaluation and for better training and optimization, as it provides partial credit during response generation
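As a concrete illustration of reference-based, word-overlap scoring, the minimal sketch below computes a sentence-level BLEU score between a hypothetical ground-truth response and a generated response using NLTK; the example strings are our own and not taken from the paper.

```python
# Minimal illustration (not the paper's code) of single-step, reference-based
# scoring with a word-overlap metric such as BLEU.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i have two dogs and a cat".split()   # hypothetical ground-truth response
candidate = "i have a dog".split()                # hypothetical generated response

# Smoothing avoids zero scores when higher-order n-grams do not match.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```

An unreferenced metric such as MAUDE drops the dependence on `reference` altogether and instead scores the response against the dialogue context, which is what makes online evaluation possible.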
Highlights
  • Recent approaches in deep neural language generation have opened new possibilities in dialogue generation (Serban et al., 2017; Weston et al., 2018)
  • Note that these baselines can be viewed as ablations of the MAUDE framework using simplified text encoders, since we use the same noise contrastive estimation training loss to provide a fair comparison
  • We evaluate a response r in three settings: Semantic Positive, responses that are semantically equivalent to the ground-truth response; Semantic Negative, responses that are semantically opposite to the ground-truth response; and Syntactic Negative, responses whose syntax is corrupted relative to the ground-truth response
  • We explore the feasibility of learning an automatic dialogue evaluation metric by leveraging pre-trained language models and the temporal structure of dialogue
  • We propose MAUDE, which is an unreferenced dialogue evaluation metric that leverages sentence representations from large pretrained language models, and is trained via Noise Contrastive Estimation
  • MAUDE learns a recurrent neural network to model the transitions between utterances in a dialogue, allowing it to correlate better with human annotations (a minimal sketch of such a scorer follows this list)
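The sketch below is our own minimal approximation of this setup, not the authors' released MAUDE code: each utterance is embedded with a pretrained DistilBERT, a GRU summarizes the dialogue context, and a learned bilinear product maps the (context, response) pair to a score in (0, 1). The class name, hidden size, and mean-pooling choice are illustrative assumptions.

```python
# A minimal, MAUDE-style unreferenced scorer (our own simplification).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class UnreferencedScorer(nn.Module):
    def __init__(self, model_name="distilbert-base-uncased", hidden=256):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        dim = self.encoder.config.hidden_size
        self.context_rnn = nn.GRU(dim, hidden, batch_first=True)
        self.bilinear = nn.Bilinear(hidden, dim, 1)  # score(context, response)

    def embed(self, utterances):
        # Mean-pool the token embeddings of each utterance.
        batch = self.tokenizer(utterances, padding=True, truncation=True,
                               return_tensors="pt")
        out = self.encoder(**batch).last_hidden_state          # (n, T, dim)
        mask = batch["attention_mask"].unsqueeze(-1)            # (n, T, 1)
        return (out * mask).sum(1) / mask.sum(1)                # (n, dim)

    def forward(self, context_utterances, response):
        ctx = self.embed(context_utterances).unsqueeze(0)       # (1, n, dim)
        _, h = self.context_rnn(ctx)                            # (1, 1, hidden)
        resp = self.embed([response])                           # (1, dim)
        return torch.sigmoid(self.bilinear(h[-1], resp))        # score in (0, 1)

scorer = UnreferencedScorer()
print(scorer(["hi , how are you ?", "i am great , just walked my dog ."],
             "nice ! what breed is your dog ?").item())
```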
Methods
  • To empirically evaluate the proposed unreferenced dialogue evaluation metric, the authors are interested in answering the following key research questions:

    Q1: How robust is the proposed metric on different types of responses?

    Q2: How well does the self-supervised metric correlate with human judgements?

    Datasets: the authors train on the PersonaChat dataset and evaluate generalization on DailyDialog, Frames, and MultiWOZ.
  • The authors compare against BERT-NLI, which is the same as the InferSent model but with the LSTM encoder replaced with a pre-trained BERT encoder.
  • Note that these baselines can be viewed as ablations of the MAUDE framework using simplified text encoders, since the authors use the same NCE training loss to provide a fair comparison (a toy sketch of this NCE-style training follows this list).
  • Note that in practice, the authors use DistilBERT (Sanh et al., 2019) instead of BERT in both MAUDE and the BERT-NLI baseline.
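Below is a hedged sketch of how such NCE-style training could look, reusing the illustrative UnreferencedScorer class from the earlier sketch: ground-truth context-response pairs are pushed toward a score of 1 and sampled negative responses toward 0. The toy data, negative-sampling choice, and hyperparameters are our own.

```python
# Toy NCE-style training loop (our own simplification, not the released code).
# `UnreferencedScorer` is the illustrative model from the earlier sketch.
import random
import torch
import torch.nn as nn

dialogues = [
    (["hi , how are you ?"], "i am great , just walked my dog ."),
    (["do you like music ?"], "yes , i play the guitar every day ."),
]
all_responses = [r for _, r in dialogues]

scorer = UnreferencedScorer()
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-4)
bce = nn.BCELoss()

for epoch in range(2):
    for context, response in dialogues:
        # Negative response: a random utterance from a different dialogue.
        negative = random.choice([r for r in all_responses if r != response])
        pos_score = scorer(context, response)
        neg_score = scorer(context, negative)
        loss = bce(pos_score.view(-1), torch.ones(1)) + \
               bce(neg_score.view(-1), torch.zeros(1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In the paper, negatives are drawn from both semantic and syntactic corruption policies (cf. the P(r) = Semantics / Syntax settings in the tables below); the single random-utterance negative here is only the simplest case.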
Conclusion
  • The authors explore the feasibility of learning an automatic dialogue evaluation metric by leveraging pre-trained language models and the temporal structure of dialogue.
  • MAUDE learns a recurrent neural network to model the transition between the utterances in a dialogue, allowing it to correlate better with human annotations.
  • This is a good indication that MAUDE can be used to evaluate online dialogue conversations.
  • Since it provides immediate, continuous rewards at the single-step level, MAUDE can be used to optimize and train better dialogue generation models, which the authors intend to pursue as future work (a speculative sketch follows this list)
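As a purely speculative illustration of that future direction (not something implemented in the paper), the sketch below shows how a frozen single-step scorer could supply an immediate reward for a REINFORCE-style update of a response generator. `generator.sample` is a hypothetical API standing in for a real seq2seq model, and `UnreferencedScorer` is the illustrative scorer sketched earlier.

```python
# Speculative sketch: using a single-step scorer as a per-utterance reward.
import torch

def reinforce_step(generator, scorer, context, optimizer):
    # Sample a response and keep the log-probabilities of the sampled tokens.
    response_text, log_probs = generator.sample(context)   # hypothetical API
    with torch.no_grad():
        reward = scorer(context, response_text)             # immediate reward
    loss = -(reward * log_probs.sum())                      # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```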
Tables
  • Table1: Metric score evaluation (∆ = score(c, r_ground-truth) − score(c, r)) between RUBER (R), InferSent (IS), DistilBERT-NLI (DNI) and MAUDE (M) on the PersonaChat dataset’s public validation set. For Semantic Positive tests, lower ∆ is better; for all Negative tests, higher ∆ is better (a toy computation of ∆ is sketched after this list of tables)
  • Table2: Correlation with calibrated scores between RUBER (R), InferSent (IS), DistilBERT-NLI (DNI) and MAUDE (M) when trained on the PersonaChat dataset. Note that human annotators rate the entire dialogue and not a context-response pair, whereas our setup is essentially a single-step evaluation method; to align our scores with the multi-turn evaluation, we average the individual turn scores to get an aggregate score for a given dialogue
  • Table3: Zero-shot generalization results on DailyDialog, Frames and MultiWOZ dataset for the baselines and MAUDE. + denotes semantic positive responses, and − denotes semantic negative responses
  • Table4: Metric score evaluation between InferSent, DistilBERT-NLI and MAUDE on PersonaChat dataset, trained on P(r) = Semantics. Bold scores represent the best individual scores, and bold with blue represents the best difference with the true response
  • Table5: Metric score evaluation between InferSent, DistilBERT-NLI and MAUDE on PersonaChat dataset, trained on P(r) = Syntax. Bold scores represent the best individual scores, and bold with blue represents the best difference with the true response
  • Table6: Metric score evaluation between InferSent, DistilBERT-NLI and MAUDE on PersonaChat dataset, trained on P(r) = Syntax + Semantics. Bold scores represent the best individual scores, and bold with blue represents the best difference with the true response
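Below is a toy computation of the ∆ reported in these tables, reusing the illustrative UnreferencedScorer sketched earlier; the context and candidate responses are made up, and an untrained scorer will not produce meaningful values.

```python
# Toy version of delta = score(c, r_ground_truth) - score(c, r).
context = ["hi , how are you ?"]
ground_truth = "i am great , just walked my dog ."
candidates = {
    "semantic_positive": "i am doing well , i took my dog for a walk .",
    "semantic_negative": "the capital of france is paris .",
}

scorer = UnreferencedScorer()  # illustrative scorer from the earlier sketch
gt_score = scorer(context, ground_truth).item()
for kind, response in candidates.items():
    delta = gt_score - scorer(context, response).item()
    # Lower delta is better for positives, higher delta is better for negatives.
    print(f"{kind}: delta = {delta:.3f}")
```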
Funding
  • This research, with respect to Quebec Artificial Intelligence Institute (Mila) and McGill University, was supported by the Canada CIFAR Chairs in AI program
Reference
  • Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv.
  • Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of EMNLP. The MultiWOZ corpus is licensed under CC-BY 4.0.
  • Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data.
  • Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv.
  • Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of EMNLP.
  • W.A. Falcon. 2019. PyTorch Lightning. https://github.com/williamFalcon/
  • Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv.
  • Ilya Kulikov, Alexander H. Miller, Kyunghyun Cho, and Jason Weston. 2018. Importance of a search strategy in neural dialogue modelling. arXiv.
  • Margaret Li, Jason Weston, and Stephen Roller. 2019. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv.
  • Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of IJCNLP.
  • Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv.
  • Ryan Lowe. 2019. A retrospective for “Towards an Automatic Turing Test - Learning to Evaluate Dialogue Responses”. ML Retrospectives.
  • Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. arXiv.
  • Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT).
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL.
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv.
  • Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. arXiv.
  • Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of AAAI.
  • Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2017. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. arXiv.
  • Jason Weston, Emily Dinan, and Alexander H. Miller. 2018. Retrieve and refine: Improved sequence generation models for dialogue. arXiv.
  • John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond BLEU: Training neural machine translation with semantic similarity. In Proceedings of ACL, Florence, Italy.
  • Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv.
  • Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv.
  • To define a good encoding function, we turn to pre-trained language models. These models are typically trained on large corpora and achieve state-of-the-art results on a range of language understanding tasks (Ott et al., 2018). To validate our hypothesis, we use a pre-trained (and fine-tuned) BERT (Devlin et al., 2018) as f_e. We compute h_{u_i} = f_e(u_i) for all u_i ∈ D, and learn a linear classifier to predict the approximate position of u_i in the dialogue D_i. This task is easier for goal-oriented dialogues, where the vocabulary typically differs across different parts of the conversation, whereas for chit-chat dialogues the distinction is less pronounced. For the experiment, we choose PersonaChat (Zhang et al., 2018) and DailyDialog (Li et al., 2017) as representative of chit-chat style data, and Frames (Asri et al., 2017) and MultiWOZ (Budzianowski et al., 2018) for goal-oriented data. A minimal sketch of this probing setup follows.
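The sketch below is our own minimal version of this probing experiment, under simplifying assumptions (a frozen DistilBERT encoder rather than a fine-tuned BERT, mean-pooled token embeddings, and positions bucketed into quartiles); it is not the authors' code, and the toy dialogue is illustrative.

```python
# Probing utterance position with a frozen pretrained encoder + linear classifier.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
num_positions = 4  # e.g., quartiles of the dialogue
probe = nn.Linear(encoder.config.hidden_size, num_positions)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

dialogue = ["hi !", "hello , how are you ?", "good , and you ?", "great , thanks ."]
labels = torch.tensor([i * num_positions // len(dialogue) for i in range(len(dialogue))])

with torch.no_grad():  # the encoder stays frozen; only the linear probe is trained
    batch = tokenizer(dialogue, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state.mean(dim=1)  # (n_utterances, dim)

optimizer.zero_grad()
logits = probe(hidden)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```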
  • In order for a dialogue evaluation metric to be useful, one has to evaluate how it generalizes to unseen data. We trained our models on the PersonaChat dataset and then evaluated them zero-shot on two goal-oriented datasets, Frames (Asri et al., 2017) and MultiWOZ (Budzianowski et al., 2018), and one chit-chat style dataset, DailyDialog (Li et al., 2017) (Table 3). We find that BERT-based models generalize significantly better than InferSent or RUBER, with MAUDE marginally better than the DistilBERT-NLI baseline. MAUDE has the biggest impact on generalization to DailyDialog, which suggests that it captures the commonalities of chit-chat style dialogue from PersonaChat. Surprisingly, the generalization of BERT-based models also improves significantly on the goal-oriented datasets. This suggests that, irrespective of the nature of the dialogue, pre-training helps because it captures information common to English-language lexical items.