Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses.

Meeting of the Association for Computational Linguistics (2018): 1116-1126

Cited by 263 | Views 319
EI

Abstract

Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality (Liu et al., 2016). Yet having an accurate automatic evaluation procedure is crucial for dialogue r…

Introduction
  • Building systems that can naturally and meaningfully converse with humans has been a central goal of artificial intelligence since the formulation of the Turing test (Turing, 1950).
  • There has been a surge of interest towards building large-scale non-task-oriented dialogue systems using neural networks (Sordoni et al., 2015b; Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016a; Li et al., 2015)
  • These models are trained in an end-to-end manner to optimize a single objective, usually the likelihood of generating the responses from a fixed corpus.
  • Such models have already had a substantial impact in industry, including Google’s Smart Reply system (Kannan et al., 2016), and Microsoft’s Xiaoice chatbot (Markoff and Mozur, 2015), which has over 20 million users.
Highlights
  • We show that ADEM (automatic dialogue evaluation model) scores correlate significantly with human judgement at both the utterance level and the system level.
  • We show that ADEM can often generalize to evaluating new models, whose responses were unseen during training, making ADEM a strong first step towards effective automatic dialogue response evaluation.
  • In order to reduce the effective vocabulary size, we use byte pair encoding (BPE) (Gage, 1994; Sennrich et al., 2015), which splits each word into sub-words or characters
  • Instead of using the VHRED (variable hierarchical recurrent encoder-decoder) pre-training method presented in Section 4, we use off-the-shelf embeddings for c, r, and r̂, and fine-tune M and N on our dataset. These tweet2vec embeddings are computed at the character level with a bidirectional GRU on a Twitter dataset for hashtag prediction (Dhingra et al., 2016). We find that they obtain reasonable but inferior performance compared to using VHRED embeddings (a sketch of how c, r, r̂, M, and N combine into a single score follows below).
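As a rough illustration only: the summary above mentions embeddings for the context c, the reference response r, and the model response r̂, plus two learned matrices M (comparing r̂ to the context) and N (comparing r̂ to the reference), with dot products of embeddings at initialization (see the Table 2 caption below). The sketch assumes the score combines a context term and a reference term, shifted by constants alpha and beta into the human rating range; the embedding dimension, the constants, and the random vectors are placeholders, not values from the paper.

    import numpy as np

    def adem_score(c, r, r_hat, M, N, alpha=0.0, beta=1.0):
        # Bilinear ADEM-style score: (c^T M r_hat + r^T N r_hat - alpha) / beta.
        # c, r, r_hat are fixed-size embeddings of the context, reference
        # response, and model response; M and N are the learned matrices.
        return (c @ M @ r_hat + r @ N @ r_hat - alpha) / beta

    # Toy usage with random placeholder embeddings.
    dim = 8
    rng = np.random.default_rng(0)
    c, r, r_hat = rng.normal(size=(3, dim))
    M = np.eye(dim)  # At identity initialization the score reduces to dot
    N = np.eye(dim)  # products of embeddings, i.e. the "VHRED" baseline row of Table 2.
    print(adem_score(c, r, r_hat, M, N))

Training then fine-tunes M and N (and optionally the embeddings) against the collected human scores, which is what separates ADEM from the untrained dot-product baseline.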
Methods
  • 5.1 Experimental Procedure

    In order to reduce the effective vocabulary size, the authors use byte pair encoding (BPE) (Gage, 1994; Sennrich et al., 2015), which splits each word into sub-words or characters (a minimal sketch of the BPE merge loop follows after this list).
  • The authors over-sample from bins across the same score to ensure that ADEM does not use response length to predict the score.
  • This is because humans have a tendency to give a higher rating to shorter responses than to longer responses (Serban et al., 2016b), as shorter responses are often more generic and are more likely to be suitable to the context.
  • The test set Pearson correlation between response length and human score is 0.27
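As a rough illustration of the BPE step mentioned above, the sketch below implements the standard merge loop in the style of Sennrich et al. (2015): repeatedly count the most frequent pair of adjacent symbols and merge it into a new symbol. The toy vocabulary and the number of merges are invented for the example and are not from the paper.

    import re
    from collections import Counter

    def pair_stats(vocab):
        # Count frequencies of adjacent symbol pairs over a {word: count} vocabulary.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        # Replace every occurrence of the pair (as whole symbols) with the merged symbol.
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
        return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

    # Toy vocabulary: words pre-split into characters, with an end-of-word marker.
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

    for _ in range(10):              # the number of merges is a free parameter
        stats = pair_stats(vocab)
        if not stats:
            break
        best = max(stats, key=stats.get)
        vocab = merge_pair(best, vocab)
        print(best, '->', ''.join(best))

Each merge adds one sub-word unit to the symbol vocabulary, so the effective vocabulary size is controlled by the number of merges rather than by the number of distinct words.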
Results
  • Utterance-level correlations: The authors first present new utterance-level correlation results for existing metrics (BLEU-2, BLEU-4, ROUGE, METEOR), embedding baselines (T2V, VHRED), and the ADEM variants (C-ADEM, R-ADEM, ADEM (T2V), and the full ADEM), reporting Spearman and Pearson correlations on the full dataset (Table 2); a minimal way to compute such correlations is sketched below.
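The reported numbers are ordinary Spearman and Pearson coefficients between a metric's scores and the human scores over the test examples. A minimal way to compute them, assuming the scores are simply held in parallel lists (the numbers below are placeholders, not the paper's data):

    from scipy.stats import pearsonr, spearmanr

    # Placeholder parallel lists: one entry per (context, response) test example.
    human_scores  = [4, 2, 5, 1, 3, 4, 2, 5]                  # human ratings (1-5)
    metric_scores = [3.8, 2.5, 4.6, 1.9, 2.7, 4.1, 2.2, 4.4]  # e.g. ADEM or BLEU-2
    lengths       = [7, 21, 5, 30, 12, 9, 25, 6]              # response lengths (tokens)

    pr, p_pr = pearsonr(metric_scores, human_scores)
    sr, p_sr = spearmanr(metric_scores, human_scores)
    print(f"metric vs. human: Pearson {pr:.2f} (p={p_pr:.3f}), Spearman {sr:.2f} (p={p_sr:.3f})")

    # The same routine covers the length-bias check from the Methods section
    # (the summary reports a test-set Pearson correlation of 0.27 with length).
    lr, _ = pearsonr(lengths, human_scores)
    print(f"length vs. human: Pearson {lr:.2f}")

System-level correlations (Table 3) are computed analogously, but over per-model averages rather than per-utterance scores.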
Conclusion
  • The authors use the Twitter Corpus to train the models as it contains a broad range of non-task-oriented conversations and it has been used to train many state-of-the-art models.
  • The authors' model could be extended to other general-purpose datasets, such as Reddit, once similar pre-trained models become publicly available.
  • An important direction for future work is modifying ADEM such that it is not subject to this bias
  • This could be done, for example, by censoring ADEM’s representations (Edwards and Storkey, 2016) such that they do not contain any information about length.
  • One can combine this with an adversarial evaluation model (Kannan and Vinyals, 2017; Li et al., 2017) that assigns a score based on how easy it is to distinguish the dialogue model responses from human responses (a schematic of the representation-censoring idea follows below).
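The censoring direction can be pictured as training the evaluator jointly with an adversary that tries to recover response length from ADEM's internal representation, while a gradient-reversal layer pushes the representation to discard length cues (in the spirit of Edwards and Storkey, 2016). The sketch below is only a schematic of that idea, not the authors' method; the layer sizes, losses, and tensors are invented for illustration.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        # Identity in the forward pass; flips the gradient sign in the backward
        # pass, so helping the adversary hurts the representation's length cues.
        @staticmethod
        def forward(ctx, x):
            return x
        @staticmethod
        def backward(ctx, grad_output):
            return -grad_output

    hidden = 64
    scorer = nn.Linear(hidden, 1)            # predicts the dialogue-quality score
    length_adversary = nn.Linear(hidden, 1)  # tries to recover response length

    def joint_loss(representation, human_score, response_length):
        score_loss = (scorer(representation) - human_score).pow(2).mean()
        adv_in = GradReverse.apply(representation)
        length_loss = (length_adversary(adv_in) - response_length).pow(2).mean()
        return score_loss + length_loss

    # Toy batch of 32 invented representations and targets.
    rep = torch.randn(32, hidden, requires_grad=True)
    loss = joint_loss(rep, torch.randn(32, 1), torch.randn(32, 1))
    loss.backward()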
Tables
  • Table 1: Statistics of the dialogue response evaluation dataset. Each example is in the form (context, model response, reference response, human score)
  • Table 2: Correlation between metrics and human judgements, with p-values shown in brackets. ‘ADEM (T2V)’ indicates ADEM with tweet2vec embeddings (Dhingra et al., 2016), and ‘VHRED’ indicates the dot product of VHRED embeddings (i.e. ADEM at initialization). C- and R-ADEM represent the ADEM model trained to only compare the model response to the context or reference response, respectively. We compute the baseline metric scores (top) on the full dataset to provide a more accurate estimate of their scores (as they are not trained on a training set)
  • Table 3: System-level correlation, with the p-value in brackets
  • Table 4: Correlation for ADEM when various model responses are removed from the training set. The left two columns show performance on the entire test set, and the right two columns show performance on responses only from the dialogue model not seen during training. The last row (25% at random) corresponds to the ADEM model trained on all model responses, but with the same amount of training data as the model above (i.e. 25% less data than the full training set)
  • Table 5: Examples of scores given by the ADEM model
Related Work
  • Related to our approach is the literature on novel methods for the evaluation of machine translation systems, especially through the WMT evaluation task (Callison-Burch et al., 2011; Machacek and Bojar, 2014; Stanojevic et al., 2015). In particular, (Albrecht and Hwa, 2007; Gupta et al., 2015) have proposed to evaluate machine translation systems using regression and Tree-LSTMs, respectively. Their approach differs from ours as, in the dialogue domain, we must additionally condition our score on the context of the conversation, which is not necessary in translation.

    There has also been related work on estimating the quality of responses in chat-oriented dialogue systems. (DeVault et al., 2011) train an automatic dialogue policy evaluation metric from 19 structured role-playing sessions, enriched with paraphrases and external referee annotations. (Gandhe and Traum, 2016) propose a semi-automatic evaluation metric for dialogue coherence, similar to BLEU and ROUGE, based on ‘wizard of Oz’ type data. (Xiang et al., 2014) propose a framework to predict utterance-level problematic situations in a dataset of Chinese dialogues using intent and sentiment factors. Finally, (Higashinaka et al., 2014) train a classifier to distinguish user utterances from system-generated utterances using various dialogue features, such as dialogue acts, question types, and predicate-argument structures.

    Several recent approaches use hand-crafted reward features to train dialogue models using reinforcement learning (RL). For example, (Li et al., 2016b) use features related to ease of answering and information flow, and (Yu et al., 2016) use metrics related to turn-level appropriateness and conversational depth. These metrics are based on hand-crafted features, which only capture a small set of relevant aspects; this inevitably leads to suboptimal performance, and it is unclear whether such objectives are preferable over retrieval-based cross-entropy or word-level maximum log-likelihood objectives. Furthermore, many of these metrics are computed at the conversation level, and are not available for evaluating single dialogue responses.
Funding
  • The model achieves comparable correlations to the ADEM model that was trained on 25% less data selected at random
Study Subjects and Analysis
Users: 20,000,000
One of the challenges when developing such dialogue systems is to have a good way of measuring progress, in this case the performance of the chatbot.

References
  • Joshua Albrecht and Rebecca Hwa. 2007. Regression for sentence-level mt evaluation with pseudo references. In ACL.
  • Ron Artstein, Sudeep Gandhe, Jillian Gerten, Anton Leuski, and David Traum. 2009. Semi-formal evaluation of conversational characters. In Languages: From Formal to Natural, Springer, pages 22–35.
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
  • Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166.
  • Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. COLING.
  • Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, pages 22–64.
  • Tim Cooijmans, Nicolas Ballas, Cesar Laurent, and Aaron Courville. 2016. Recurrent batch normalization. arXiv preprint arXiv:1603.09025.
  • David DeVault, Anton Leuski, and Kenji Sagae. 2011. Toward learning and evaluation of dialogue policies with text examples. In Proceedings of the SIGDIAL 2011 Conference. Association for Computational Linguistics, pages 39–48.
  • Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W Cohen. 2016. Tweet2vec: Character-based distributed representations for social media. arXiv preprint arXiv:1605.03481.
  • Harrison Edwards and Amos Storkey. 2016. Censoring representations with an adversary. ICLR.
  • Salah El Hihi and Yoshua Bengio. 1995. Hierarchical recurrent neural networks for long-term dependencies. In NIPS. Citeseer, volume 400, page 409.
  • Philip Gage. 1994. A new algorithm for data compression. The C Users Journal 12(2):23–38.
  • Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. deltableu: A discriminative metric for generation tasks with intrinsically diverse targets. arXiv preprint arXiv:1506.06863.
  • Sudeep Gandhe and David Traum. 2016. A semiautomated evaluation metric for dialogue model coherence. In Situated Dialog in Speech-Based Human-Computer Interaction, Springer, pages 217– 225.
  • Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. Reval: A simple and effective machine translation evaluation metric based on recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Ryuichiro Higashinaka, Toyomi Meguro, Kenji Imamura, Hiroaki Sugiyama, Toshiro Makino, and Yoshihiro Matsuo. 2014. Evaluating coherence in open domain conversational systems. In INTERSPEECH. pages 130–134.
  • Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen netzen. Diploma thesis, Technische Universität München, page 91.
  • Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735– 1780.
  • Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  • Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, et al. 2016. Smart reply: Automated response suggestion for email. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). volume 36, pages 495– 503.
  • Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198.
  • Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
  • Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.
  • Jiwei Li, Will Monroe, and Dan Jurafsky. 2017. Learning to decode for future success. arXiv preprint arXiv:1701.06549.
  • Jiwei Li, Will Monroe, Alan Ritter, and Dan Jurafsky. 2016b. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.
  • Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
  • Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.
  • Matous Machacek and Ondrej Bojar. 2014. Results of the wmt14 metrics shared task. In Proceedings of the Ninth Workshop on Statistical Machine Translation. Citeseer, pages 293–301.
  • J. Markoff and P. Mozur. 2015. For sympathetic ear, more chinese turn to smartphone program. NY Times.
  • Sebastian Moller, Roman Englert, Klaus-Peter Engelbrecht, Verena Vanessa Hafner, Anthony Jameson, Antti Oulasvirta, Alexander Raake, and Norbert Reithinger. 2006. Memo: towards automatic usability evaluation of spoken dialogue services by user error simulations. In INTERSPEECH.
  • Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, pages 311–318.
  • Karl Pearson. 1901. Principal components analysis. The London, Edinburgh and Dublin Philosophical Magazine and Journal 6(2):566.
  • Alan Ritter, Colin Cherry, and William B Dolan. 2011. Data-driven response generation in social media. In Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pages 583–593.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  • Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016a. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI. pages 3776–3784.
  • Zhou Yu, Ziyu Xu, Alan W Black, and Alex I Rudnicky. 2016. Strategy and policy learning for non-task-oriented conversational systems. In 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. page 404.
  • Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016b. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.
  • Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.
  • Lifeng Shang, Tetsuya Sakai, Zhengdong Lu, Hang Li, Ryuichiro Higashinaka, and Yusuke Miyao. 2016. Overview of the ntcir-12 short text conversation task. Proceedings of NTCIR-12 pages 473–484.
  • Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and JianYun Nie. 2015a. A hierarchical recurrent encoderdecoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, pages 553–562.
  • Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015b. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.
  • Milos Stanojevic, Amir Kamran, Philipp Koehn, and Ondrej Bojar. 2015. Results of the wmt15 metrics shared task. In Proceedings of the Tenth Workshop on Statistical Machine Translation. pages 256–273.
  • Alan M Turing. 1950. Computing machinery and intelligence. Mind 59(236):433–460.
  • Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
  • Marilyn A Walker, Diane J Litman, Candace A Kamm, and Alicia Abella. 1997. Paradise: A framework for evaluating spoken dialogue agents. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 271–280.
  • J. Weizenbaum. 1966. ELIZA: a computer program for the study of natural language communication between man and machine. Communications of the ACM 9(1):36–45.
  • Yang Xiang, Yaoyun Zhang, Xiaoqiang Zhou, Xiaolong Wang, and Yang Qin. 2014. Problematic situation analysis and automatic recognition for Chinese online conversational system. Proc. CLP pages 43–51.