How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation.

EMNLP (2016): 2122-2132

Abstract

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.

Introduction
  • An important aspect of dialogue response generation systems, which are trained to produce a reasonable utterance given a conversational context, is how to evaluate the quality of the generated response.
  • This paper focuses on unsupervised dialogue response generation models, such as chatbots
  • These models, which are trained end-to-end with neural networks, are receiving increased attention (Serban et al., 2016; Sordoni et al., 2015; Vinyals and Le, 2015).
  • This avoids the need to collect supervised labels on a large scale, which can be prohibitively expensive.
  • Automatic evaluation metrics would help accelerate the deployment of unsupervised response generation systems
Highlights
  • An important aspect of dialogue response generation systems, which are trained to produce a reasonable utterance given a conversational context, is how to evaluate the quality of the generated response
  • This paper focuses on unsupervised dialogue response generation models, such as chatbots
  • Automatic evaluation metrics would help accelerate the deployment of unsupervised response generation systems
  • When evaluation metrics are not explicitly correlated to human judgement, it is possible to draw misleading conclusions by examining how the metrics rate different models
  • We compare the performance of selected models according to the embedding metrics on two different domains: the Ubuntu Dialogue Corpus (Lowe et al., 2015), which contains technical vocabulary and where conversations are often oriented towards solving a particular problem, and a non-technical Twitter corpus collected following the procedure of Ritter et al. (2010); a sketch of one such embedding metric follows this list
  • We have shown that many metrics commonly used in the literature for evaluating unsupervised dialogue systems do not correlate strongly with human judgement
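The embedding metrics referred to above are word-embedding-based similarity scores; the paper evaluates Embedding Average, Greedy Matching, and Vector Extrema variants. Below is a minimal sketch of the Embedding Average idea only, assuming a pre-loaded word2vec-style lookup table (the word_vectors dict is hypothetical); it illustrates the general approach and is not the authors' implementation.

    # Sketch of an Embedding Average metric: cosine similarity between the mean
    # word vectors of a generated response and a ground-truth response.
    # word_vectors is a hypothetical {token: np.ndarray} lookup, e.g. word2vec.
    import numpy as np

    def sentence_embedding(tokens, word_vectors, dim=300):
        vecs = [word_vectors[t] for t in tokens if t in word_vectors]
        if not vecs:                      # no in-vocabulary tokens: zero vector
            return np.zeros(dim)
        return np.mean(vecs, axis=0)

    def embedding_average(response, ground_truth, word_vectors, dim=300):
        r = sentence_embedding(response.lower().split(), word_vectors, dim)
        g = sentence_embedding(ground_truth.lower().split(), word_vectors, dim)
        denom = np.linalg.norm(r) * np.linalg.norm(g)
        return float(np.dot(r, g) / denom) if denom > 0 else 0.0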
Results
  • The authors present correlation results between the human judgements and each metric in Table 3.
  • The authors found that the BLEU-3 and BLEU-4 scores were near zero for a majority of response pairs; for BLEU-4, only four examples had a score greater than 10⁻⁹.
  • Despite this, BLEU-3 and BLEU-4 still correlate with human judgements on the Twitter corpus at a rate similar to BLEU-2.
  • BLEU-3 and BLEU-4 thus behave as scaled, noisy versions of BLEU-2; if one is to evaluate dialogue responses with BLEU, BLEU-2 appears to be the more informative choice (see the sketch below).
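This behaviour is easy to reproduce with off-the-shelf tooling. The following sketch (a toy illustration with invented sentences, not the paper's evaluation code) scores one short, single-reference response with smoothed sentence-level BLEU-2/3/4 via NLTK; the higher-order n-gram precisions collapse because a short candidate rarely shares trigrams or 4-grams with a single reference.

    # Toy sketch: sentence-level BLEU against a single reference response.
    # The sentences are invented; smoothing avoids log(0) when an n-gram order has no matches.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "i am having the same problem with my wireless card".split()
    candidate = "i have the same issue on my laptop".split()
    smooth = SmoothingFunction().method1

    for n in (2, 3, 4):
        weights = tuple([1.0 / n] * n)   # uniform weights over 1..n-gram precisions
        score = sentence_bleu([reference], candidate, weights=weights,
                              smoothing_function=smooth)
        print("BLEU-%d: %.6f" % (n, score))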
Conclusion
  • The authors have shown that many metrics commonly used in the literature for evaluating unsupervised dialogue systems do not correlate strongly with human judgement.
  • When evaluation metrics are not explicitly correlated to human judgement, it is possible to draw misleading conclusions by examining how the metrics rate different models
  • To illustrate this point, the authors compare the performance of selected models according to the embedding metrics on two different domains: the Ubuntu Dialogue Corpus (Lowe et al., 2015), which contains technical vocabulary and where conversations are often oriented towards solving a particular problem, and a non-technical Twitter corpus collected following the procedure of Ritter et al. (2010).
  • The authors consider these two datasets since they cover contrasting dialogue domains, i.e. technical help vs. casual chit-chat, and because they are amongst the largest publicly available corpora, making them good candidates for building data-driven dialogue systems
Tables
  • Table 1: Example showing the intrinsic diversity of valid responses in a dialogue. The (reasonable) model response would receive a BLEU score of 0
  • Table 2: Models evaluated using the vector-based evaluation metrics, with 95% confidence intervals
  • Table 3: Correlation between each metric and human judgements for each response. Correlations shown in the human row result from randomly dividing the human judges into two groups (a sketch of this computation follows the list)
  • Table 4: Correlation between the BLEU metric and human judgements after removing stopwords and punctuation, for the Twitter dataset
  • Table 5: Effect of differences in response length for the Twitter dataset; ∆w = absolute difference in the number of words between the ground-truth response and the proposed response
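The correlations reported in Table 3 can be reproduced with standard statistical tooling. The sketch below uses invented toy arrays (it is not the paper's analysis code): it correlates an automatic metric's per-response scores with the mean human rating, and estimates a human-human ceiling by randomly splitting the judges into two groups, as described in the Table 3 caption.

    # Toy sketch: correlating automatic metric scores with human ratings.
    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    rng = np.random.default_rng(0)
    n_responses, n_judges = 100, 4
    human_ratings = rng.integers(1, 6, size=(n_responses, n_judges)).astype(float)  # 1-5 scale
    metric_scores = rng.random(n_responses)        # e.g. BLEU-2 per response (toy values)

    # Metric vs. mean human rating
    mean_human = human_ratings.mean(axis=1)
    print("Pearson :", pearsonr(metric_scores, mean_human))
    print("Spearman:", spearmanr(metric_scores, mean_human))

    # Human-human ceiling: correlate mean ratings of two random halves of the judges
    perm = rng.permutation(n_judges)
    g1, g2 = perm[: n_judges // 2], perm[n_judges // 2:]
    print("Human-human Pearson:", pearsonr(human_ratings[:, g1].mean(axis=1),
                                           human_ratings[:, g2].mean(axis=1)))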
Related Work
  • We focus on metrics that are model-independent, i.e. where the model generating the response does not also evaluate its quality; thus, we do not consider word perplexity, although it has been used to evaluate unsupervised dialogue models (Serban et al., 2015). This is because it is not computed on a per-response basis and cannot be computed for retrieval models. Further, we only consider metrics that can be used to evaluate proposed responses against ground-truth responses, so we do not consider retrieval-based metrics such as recall, which has been used to evaluate dialogue models (Schatzmann et al., 2005; Lowe et al., 2015). We also do not consider evaluation methods designed for supervised dialogue systems.

    Several recent works on unsupervised dialogue systems adopt the BLEU score for evaluation. Ritter et al. (2011) formulate the unsupervised learning problem as one of translating a context into a candidate response. They use a statistical machine translation (SMT) model to generate responses to various contexts using Twitter data, and show that it outperforms information retrieval baselines according to both BLEU and human evaluations. Sordoni et al. (2015) extend this idea using a recurrent language model to generate responses in a context-sensitive manner. They also evaluate using BLEU; however, they produce multiple ground-truth responses by retrieving 15 responses from elsewhere in the corpus using a simple bag-of-words model. Li et al. (2015) evaluate their proposed diversity-promoting objective function for neural network models using the BLEU score with only a single ground-truth response. A modified version of BLEU, deltaBLEU (Galley et al., 2015b), which takes into account several human-evaluated ground-truth responses, is shown to have a weak to moderate correlation with human judgements on Twitter dialogues. However, such human annotation is often infeasible to obtain in practice. Galley et al. (2015b) also show that, even with several ground-truth responses available, the standard BLEU metric does not correlate strongly with human judgements.
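Several of the works above address the single-reference problem by scoring against multiple ground-truth responses. The sketch below uses invented strings and plain multi-reference sentence-level BLEU via NLTK; it is not an implementation of deltaBLEU, which additionally weights each reference by a human quality rating. It shows how adding references typically raises the score, since a candidate n-gram counts as matched if it appears in any of the references.

    # Toy sketch: single- vs. multi-reference sentence-level BLEU-2.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    candidate = "try restarting the network manager".split()
    references = [
        "have you tried restarting the network manager".split(),
        "restart network-manager and check the logs".split(),
        "you could try rebooting the machine".split(),
    ]
    smooth = SmoothingFunction().method1

    single = sentence_bleu(references[:1], candidate, weights=(0.5, 0.5),
                           smoothing_function=smooth)
    multi = sentence_bleu(references, candidate, weights=(0.5, 0.5),
                          smoothing_function=smooth)
    print("BLEU-2, one reference:    %.4f" % single)
    print("BLEU-2, three references: %.4f" % multi)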
Contributions
  • Investigates evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available
  • Shows that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain
  • Provides quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems
  • Focuses on unsupervised dialogue response generation models, such as chatbots
  • Investigates the correlation between the scores from several automatic evaluation metrics and human judgements of dialogue response quality, for a variety of response generation models
References
  • R. Artstein, S. Gandhe, J. Gerten, A. Leuski, and D. Traum. 2009. Semi-formal evaluation of conversational characters. In Languages: From Formal to Natural, pages 22–35. Springer.
  • S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
  • O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, et al. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58. Association for Computational Linguistics, Baltimore, MD, USA.
  • A. Cahill. 2009. Correlating human and automatic evaluation of a German surface realiser. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 97–100. Association for Computational Linguistics.
  • C. Callison-Burch, M. Osborne, and P. Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In EACL, volume 6, pages 249–256.
  • C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, and O. F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53. Association for Computational Linguistics.
  • C. Callison-Burch, P. Koehn, C. Monz, and O. F. Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64. Association for Computational Linguistics.
  • B. Chen and C. Cherry. 2014. A systematic comparison of smoothing techniques for sentence-level BLEU. ACL 2014, page 362.
  • J. Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213.
  • D. Espinosa, R. Rajkumar, M. White, and S. Berleant. 2010. Further meta-evaluation of broad-coverage surface realization. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 564–574. Association for Computational Linguistics.
  • P. W. Foltz, W. Kintsch, and T. K. Landauer. 1998. The measurement of textual coherence with latent semantic analysis. Discourse Processes, 25(2-3):285–307.
  • G. Forgues, J. Pineau, J.-M. Larcheveque, and R. Tremblay. 2014. Bootstrapping dialog systems with word embeddings.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. 2016. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.
  • C.-Y. Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8.
  • R. Lowe, N. Pow, I. V. Serban, and J. Pineau. 2015. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGDIAL.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
  • J. Mitchell and M. Lapata. 2008. Vector-based models of semantic composition. In ACL, pages 236–244.
  • S. Moller, R. Englert, K. Engelbrecht, V. Hafner, A. Jameson, A. Oulasvirta, A. Raake, and N. Reithinger. 2006. MeMo: Towards automatic usability evaluation of spoken dialogue services by user error simulations. In INTERSPEECH.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002a. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).
  • K. Papineni, S. Roukos, T. Ward, J. Henderson, and F. Reeder. 2002b. Corpus-based comprehensive and diagnostic MT evaluation: Initial Arabic, Chinese, French, and Spanish results. In Proceedings of the Second International Conference on Human Language Technology Research, pages 132–137.
  • E. Reiter and A. Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
  • A. Ritter, C. Cherry, and B. Dolan. 2010. Unsupervised modeling of Twitter conversations. In North American Chapter of the Association for Computational Linguistics (NAACL).
  • A. Ritter, C. Cherry, and W. B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 583–593. Association for Computational Linguistics.
  • V. Rus and M. Lintean. 2012. A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 157–162, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • J. Schatzmann, K. Georgila, and S. Young. 2005. Quantitative evaluation of user simulation techniques for spoken dialogue systems. In 6th Special Interest Group on Discourse and Dialogue (SIGDIAL).
  • I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. 2015. Building end-to-end dialogue systems using generative hierarchical neural networks. In AAAI Conference on Artificial Intelligence.
  • I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio. 2016. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.
  • A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2015).
  • A. Stent, M. Marge, and M. Singhai. 2005. Evaluating evaluation methods for generation in the presence of variation. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 341–351. Springer.
  • O. Vinyals and Q. Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
  • M. Walker, D. Litman, C. Kamm, and A. Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 271–280. ACL.
  • T.-H. Wen, M. Gasic, N. Mrksic, P.-H. Su, D. Vandyke, and S. Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.
  • J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. 2015. Towards universal paraphrastic sentence embeddings. CoRR, abs/1511.08198.
Authors
Chia-Wei Liu
Iulian Vlad Serban
Michael Noseworthy