Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses.
Annual Meeting of the Association for Computational Linguistics, (2018): 1116–1126
Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality (Liu et al., 2016). Yet having an accurate automatic evaluation procedure is crucial for dialogue r...
- Building systems that can naturally and meaningfully converse with humans has been a central goal of artificial intelligence since the formulation of the Turing test (Turing, 1950).
- There has been a surge of interest towards building large-scale non-task-oriented dialogue systems using neural networks (Sordoni et al., 2015b; Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016a; Li et al., 2015)
- These models are trained in an end-to-end manner to optimize a single objective, usually the likelihood of generating the responses from a fixed corpus.
- Such models have already had a substantial impact in industry, including Google’s Smart Reply system (Kannan et al., 2016), and Microsoft’s Xiaoice chatbot (Markoff and Mozur, 2015), which has over 20 million users
- We show that the scores of ADEM, an automatic dialogue evaluation model, correlate significantly with human judgements at both the utterance level and the system level
- We show that ADEM can often generalize to evaluating new models, whose responses were unseen during training, making ADEM a strong first step towards effective automatic dialogue response evaluation
- In order to reduce the effective vocabulary size, we use byte pair encoding (BPE) (Gage, 1994; Sennrich et al., 2015), which splits each word into sub-words or characters
- Instead of using the VHRED (hierarchical recurrent encoder-decoder) pre-training method presented in Section 4, we use off-the-shelf embeddings for c, r, and r̂, and fine-tune M and N on our dataset. These tweet2vec embeddings are computed at the character level with a bidirectional GRU on a Twitter dataset for hashtag prediction (Dhingra et al., 2016). We find that they obtain reasonable but inferior performance compared to using VHRED embeddings
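The matrices M and N mentioned here define ADEM's scoring function, score(c, r̂, r) = (cᵀM r̂ + rᵀN r̂ − α)/β, where c, r, and r̂ embed the context, reference response, and model response. A minimal sketch with toy dimensions and identity projections (the real model uses learned M and N over VHRED embeddings):

```python
import numpy as np

def adem_score(c, r_hat, r, M, N, alpha=0.0, beta=1.0):
    """ADEM-style score: (c^T M r_hat + r^T N r_hat - alpha) / beta.
    alpha and beta shift and scale the raw score into the human rating range."""
    return (c @ (M @ r_hat) + r @ (N @ r_hat) - alpha) / beta

# Illustrative call with identity projections and toy 4-d embeddings.
d = 4
c = np.ones(d)
r_hat = np.ones(d)
r = np.ones(d)
score = adem_score(c, r_hat, r, np.eye(d), np.eye(d))  # 4 + 4 = 8.0
```

With identity M and N the score reduces to the sum of two dot products, which is exactly the untrained "VHRED dot product" baseline reported in the correlation table.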
- 5.1 Experimental Procedure
In order to reduce the effective vocabulary size, the authors use byte pair encoding (BPE) (Gage, 1994; Sennrich et al., 2015), which splits each word into sub-words or characters.
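The BPE procedure can be sketched as a frequency-based merge loop in the style of Sennrich et al. (2015): start from characters, then repeatedly merge the most frequent adjacent symbol pair. The toy corpus below is illustrative, not from the paper, and the naive string replace is a simplification of the boundary-aware merge used in the real implementation:

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the pair into a single symbol (naive replace)."""
    bigram, merged = " ".join(pair), "".join(pair)
    return {w.replace(bigram, merged): f for w, f in vocab.items()}

def learn_bpe(words, num_merges):
    # Start from character-level symbols; '</w>' marks the end of a word.
    vocab = Counter(" ".join(w) + " </w>" for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges

merges = learn_bpe(["low", "low", "lower", "newest", "newest"], 4)
```

Applying the learned merge list to unseen words is what keeps the effective vocabulary small: any word can still be segmented into known sub-words or characters.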
- The authors over-sample from bins with the same score to ensure that ADEM does not use response length to predict the score.
- This is because humans have a tendency to give a higher rating to shorter responses than to longer responses (Serban et al., 2016b), as shorter responses are often more generic and are more likely to be suitable to the context.
- The test set Pearson correlation between response length and human score is 0.27
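The 0.27 figure is a Pearson correlation between response length and human score; a minimal sketch of that computation (the lengths and scores below are fabricated, not the paper's data):

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Fabricated token lengths vs. human scores, just to illustrate the check;
# a nonzero value signals that length alone carries information about score.
lengths = [3, 8, 5, 12, 7, 4]
scores = [4.0, 2.5, 3.5, 2.0, 3.0, 4.5]
r = pearson(lengths, scores)
```

A correlation of 0.27 on the real test set is exactly the kind of signal the over-sampling above is designed to keep ADEM from exploiting.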
- Utterance-level correlations: the authors first present new utterance-level correlation results for existing metrics (BLEU-2, BLEU-4, ROUGE, METEOR, T2V, VHRED) and for C-ADEM, R-ADEM, ADEM (T2V), and the full ADEM model, reporting Spearman and Pearson correlations on the full dataset.
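Spearman correlation, reported alongside Pearson, is a rank correlation; when there are no tied values it reduces to the textbook formula ρ = 1 − 6Σd²/(n(n²−1)). A minimal sketch under that no-ties assumption:

```python
def spearman_no_ties(x, y):
    """Spearman rho via 1 - 6*sum(d^2)/(n*(n^2-1)); assumes no tied values."""
    n = len(x)
    rank_x = {v: i + 1 for i, v in enumerate(sorted(x))}
    rank_y = {v: i + 1 for i, v in enumerate(sorted(y))}
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Perfectly monotone relationships give rho = +1 or -1.
rho_up = spearman_no_ties([1.2, 3.4, 2.2, 5.0], [10, 30, 20, 50])
rho_down = spearman_no_ties([1.2, 3.4, 2.2, 5.0], [50, 20, 30, 10])
```

Because it only uses ranks, Spearman is less sensitive than Pearson to the exact scale of the metric being compared against human scores.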
- The authors use the Twitter Corpus to train the models as it contains a broad range of non-task-oriented conversations and it has been used to train many state-of-the-art models.
- The authors' model could be extended to other general-purpose datasets, such as Reddit, once similar pre-trained models become publicly available.
- An important direction for future work is modifying ADEM such that it is not subject to this bias.
- This could be done, for example, by censoring ADEM’s representations (Edwards and Storkey, 2016) such that they do not contain any information about length.
- One can combine this with an adversarial evaluation model (Kannan and Vinyals, 2017; Li et al., 2017) that assigns a score based on how easy it is to distinguish the dialogue model’s responses from human responses.
- Table 1: Statistics of the dialogue response evaluation dataset. Each example is in the form (context, model response, reference response, human score)
- Table 2: Correlation between metrics and human judgements, with p-values shown in brackets. ‘ADEM (T2V)’ indicates ADEM with tweet2vec embeddings (Dhingra et al., 2016), and ‘VHRED’ indicates the dot product of VHRED embeddings (i.e. ADEM at initialization). C- and R-ADEM represent the ADEM model trained to only compare the model response to the context or reference response, respectively. We compute the baseline metric scores (top) on the full dataset to provide a more accurate estimate of their scores (as they are not trained on a training set)
- Table 3: System-level correlation, with the p-value in brackets
- Table 4: Correlation for ADEM when various model responses are removed from the training set. The left two columns show performance on the entire test set, and the right two columns show performance on responses only from the dialogue model not seen during training. The last row (25% at random) corresponds to the ADEM model trained on all model responses, but with the same amount of training data as the model above (i.e. 25% less data than the full training set)
- Table 5: Examples of scores given by the ADEM model
- Related to our approach is the literature on novel methods for the evaluation of machine translation systems, especially through the WMT evaluation task (Callison-Burch et al., 2011; Machacek and Bojar, 2014; Stanojevic et al., 2015). In particular, Albrecht and Hwa (2007) and Gupta et al. (2015) have proposed to evaluate machine translation systems using regression and Tree-LSTMs, respectively. Their approach differs from ours as, in the dialogue domain, we must additionally condition our score on the context of the conversation, which is not necessary in translation.
There has also been related work on estimating the quality of responses in chat-oriented dialogue systems. DeVault et al. (2011) train an automatic dialogue policy evaluation metric from 19 structured role-playing sessions, enriched with paraphrases and external referee annotations. Gandhe and Traum (2016) propose a semi-automatic evaluation metric for dialogue coherence, similar to BLEU and ROUGE, based on ‘wizard of Oz’ type data. Xiang et al. (2014) propose a framework to predict utterance-level problematic situations in a dataset of Chinese dialogues using intent and sentiment factors. Finally, Higashinaka et al. (2014) train a classifier to distinguish user utterances from system-generated utterances using various dialogue features, such as dialogue acts, question types, and predicate-argument structures.
Several recent approaches use hand-crafted reward features to train dialogue models using reinforcement learning (RL). For example, Li et al. (2016b) use features related to ease of answering and information flow, and Yu et al. (2016) use metrics related to turn-level appropriateness and conversational depth. These metrics are based on hand-crafted features, which only capture a small set of relevant aspects; this inevitably leads to suboptimal performance, and it is unclear whether such objectives are preferable over retrieval-based cross-entropy or word-level maximum log-likelihood objectives. Furthermore, many of these metrics are computed at the conversation level, and are not available for evaluating single dialogue responses.
- The model achieves comparable correlations to the ADEM model that was trained on 25% less data selected at random
These models are trained in an end-to-end manner to optimize a single objective, usually the likelihood of generating the responses from a fixed corpus. Such models have already had a substantial impact in industry, including Google’s Smart Reply system (Kannan et al., 2016), and Microsoft’s Xiaoice chatbot (Markoff and Mozur, 2015), which has over 20 million users. One of the challenges when developing such systems is to have a good way of measuring progress, in this case the performance of the chatbot.
- Joshua Albrecht and Rebecca Hwa. 2007. Regression for sentence-level mt evaluation with pseudo references. In ACL.
- Ron Artstein, Sudeep Gandhe, Jillian Gerten, Anton Leuski, and David Traum. 2009. Semi-formal evaluation of conversational characters. In Languages: From Formal to Natural, Springer, pages 22–35.
- Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
- Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5(2):157–166.
- Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. CoNLL.
- Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, pages 22–64.
- Tim Cooijmans, Nicolas Ballas, Cesar Laurent, and Aaron Courville. 2016. Recurrent batch normalization. arXiv preprint arXiv:1603.09025.
- David DeVault, Anton Leuski, and Kenji Sagae. 2011. Toward learning and evaluation of dialogue policies with text examples. In Proceedings of the SIGDIAL 2011 Conference. Association for Computational Linguistics, pages 39–48.
- Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W Cohen. 2016. Tweet2vec: Character-based distributed representations for social media. arXiv preprint arXiv:1605.03481.
- Harrison Edwards and Amos Storkey. 2016. Censoring representations with an adversary. ICLR.
- Salah El Hihi and Yoshua Bengio. 1995. Hierarchical recurrent neural networks for long-term dependencies. In NIPS, volume 400, page 409.
- Philip Gage. 1994. A new algorithm for data compression. The C Users Journal 12(2):23–38.
- Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. deltableu: A discriminative metric for generation tasks with intrinsically diverse targets. arXiv preprint arXiv:1506.06863.
- Sudeep Gandhe and David Traum. 2016. A semiautomated evaluation metric for dialogue model coherence. In Situated Dialog in Speech-Based Human-Computer Interaction, Springer, pages 217–225.
- Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. ReVal: A simple and effective machine translation evaluation metric based on recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Ryuichiro Higashinaka, Toyomi Meguro, Kenji Imamura, Hiroaki Sugiyama, Toshiro Makino, and Yoshihiro Matsuo. 2014. Evaluating coherence in open domain conversational systems. In INTERSPEECH. pages 130–134.
- Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen netzen. Diploma, Technische Universitat Munchen page 91.
- Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, et al. 2016. Smart reply: Automated response suggestion for email. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). volume 36, pages 495–503.
- Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198.
- Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
- Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.
- Jiwei Li, Will Monroe, and Dan Jurafsky. 2017. Learning to decode for future success. arXiv preprint arXiv:1701.06549.
- Jiwei Li, Will Monroe, Alan Ritter, and Dan Jurafsky. 2016b. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.
- Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
- Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.
- Matous Machacek and Ondrej Bojar. 2014. Results of the wmt14 metrics shared task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 293–301.
- J. Markoff and P. Mozur. 2015. For sympathetic ear, more chinese turn to smartphone program. NY Times.
- Sebastian Moller, Roman Englert, Klaus-Peter Engelbrecht, Verena Vanessa Hafner, Anthony Jameson, Antti Oulasvirta, Alexander Raake, and Norbert Reithinger. 2006. Memo: towards automatic usability evaluation of spoken dialogue services by user error simulations. In INTERSPEECH.
- Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, pages 311–318.
- Karl Pearson. 1901. Principal components analysis. The London, Edinburgh and Dublin Philosophical Magazine and Journal 6(2):566.
- Alan Ritter, Colin Cherry, and William B Dolan. 2011. Data-driven response generation in social media. In Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pages 583–593.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016a. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI. pages 3776–3784.
- Zhou Yu, Ziyu Xu, Alan W Black, and Alex I Rudnicky. 2016. Strategy and policy learning for non-task-oriented conversational systems. In 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. page 404.
- Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016b. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.
- Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.
- Lifeng Shang, Tetsuya Sakai, Zhengdong Lu, Hang Li, Ryuichiro Higashinaka, and Yusuke Miyao. 2016. Overview of the ntcir-12 short text conversation task. Proceedings of NTCIR-12 pages 473–484.
- Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and JianYun Nie. 2015a. A hierarchical recurrent encoderdecoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, pages 553–562.
- Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015b. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.
- Milos Stanojevic, Amir Kamran, Philipp Koehn, and Ondrej Bojar. 2015. Results of the wmt15 metrics shared task. In Proceedings of the Tenth Workshop on Statistical Machine Translation. pages 256–273.
- Alan M Turing. 1950. Computing machinery and intelligence. Mind 59(236):433–460.
- Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
- Marilyn A Walker, Diane J Litman, Candace A Kamm, and Alicia Abella. 1997. Paradise: A framework for evaluating spoken dialogue agents. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 271–280.
- J. Weizenbaum. 1966. ELIZA: a computer program for the study of natural language communication between man and machine. Communications of the ACM 9(1):36–45.
- Yang Xiang, Yaoyun Zhang, Xiaoqiang Zhou, Xiaolong Wang, and Yang Qin. 2014. Problematic situation analysis and automatic recognition for chinese online conversational system. Proc. CLP pages 43–51.