GRADE: Automatic Graph Enhanced Coherence Metric for Evaluating Open Domain Dialogue Systems

EMNLP 2020, pp. 9230-9240, 2020.

DOI: https://doi.org/10.18653/v1/2020.emnlp-main.742
We proposed Graph-enhanced Representations for Automatic Dialogue Evaluation (GRADE), a novel metric for dialogue coherence evaluation of open-domain dialogue systems.

Abstract:

Automatically evaluating dialogue coherence is a challenging but high-demand ability for developing high-quality open-domain dialogue systems. However, current evaluation metrics consider only surface features or utterance-level semantics, without explicitly considering the fine-grained topic transition dynamics of dialogue flows. Here, we propose Graph-enhanced Representations for Automatic Dialogue Evaluation (GRADE), a new coherence metric that explicitly models topic transition dynamics by reasoning over dialogue graphs and incorporates them into utterance-level contextualized representations.
Introduction
  • Coherence, what makes dialogue utterances unified rather than a random group of sentences, is an essential property to pursue for an open-domain dialogue system

    (Figure: an example dialogue, with utterances such as "Why not use the treadmill? Or maybe get a dog?" and "Sometimes my husband goes with me.", shown alongside its dialogue graph grounded on a commonsense graph)
  • Statistic-based metrics consider only surface features; due to their ignorance of the underlying semantics of a response, they are biased and correlate poorly with human judgements in terms of response coherence (Liu et al, 2016)
  • To overcome this issue, learning-based metrics were proposed that train a coherence scoring model on utterance-level semantics, such as ADEM (Lowe et al, 2017), RUBER (Tao et al, 2018), and BERT-RUBER (Ghazarian et al, 2019).
  • Although the above metrics have demonstrated higher correlations with human judgements than statistic-based metrics, they only model dialogue coherence at the utterance level without explicitly considering the fine-grained topic transition dynamics of dialogue flows
Highlights
  • Coherence, what makes dialogue utterances unified rather than a random group of sentences, is an essential property to pursue for an open-domain dialogue system
  • To address the above problems, we propose a new automatic metric for open-domain dialogue systems, named Graph-enhanced Representations for Automatic Dialogue Evaluation (GRADE), which explicitly models topic transition dynamics by reasoning over dialogue graphs and incorporates them into utterance-level contextualized representations (a minimal sketch of this idea follows this list)
  • Our GRADE obtains the highest correlations with human judgements on average
  • Although the Spearman correlation of GRADE on the Transformer-Ranker is lower than that of BLEURT, which is trained on a very large-scale dataset, the averaged correlation result of GRADE is 1% higher than BLEURT's
  • To verify the transferability of our GRADE, we further compare the human correlations of GRADE with other baselines on two unseen chit-chat datasets, ConvAI2 and EmpatheticDialogues
  • Experimental results show that our GRADE significantly outperforms other state-of-the-art metrics on measuring diverse dialogue models in terms of the Pearson and Spearman correlations with human judgements
  • We proposed GRADE (Graph-enhanced Representations for Automatic Dialogue Evaluation), a novel metric for dialogue coherence evaluation of open-domain dialogue systems
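As a rough illustration of the two ingredients named above (utterance-level contextualized representations combined with reasoning over a keyword graph), here is a minimal PyTorch sketch. It is not the authors' released implementation: the graph encoder, layer sizes, keyword choices, and toy inputs are illustrative assumptions that only mirror the overall two-branch shape.

```python
# Minimal sketch (not the authors' code): a two-branch coherence scorer in the
# spirit of GRADE -- one branch encodes the (context, response) pair with BERT,
# the other reasons over a small keyword graph -- and an MLP maps the combined
# representation to a coherence score.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class GraphBranch(nn.Module):
    """One round of attention-weighted message passing over a keyword graph
    (a simplified stand-in for the graph-attention reasoning used by GRADE)."""

    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.node_emb = nn.Embedding(vocab_size, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, node_ids: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_ids: (num_nodes,), adj: (num_nodes, num_nodes) with 1 for an edge
        h = self.node_emb(node_ids)                              # (N, dim)
        pair = torch.cat(
            [h.unsqueeze(1).expand(-1, h.size(0), -1),
             h.unsqueeze(0).expand(h.size(0), -1, -1)], dim=-1)  # (N, N, 2*dim)
        scores = self.attn(pair).squeeze(-1)                     # (N, N)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        weights = torch.nan_to_num(weights)                      # isolated nodes
        h = weights @ h                                          # aggregate neighbours
        return h.mean(dim=0)                                     # graph-level vector


class CoherenceScorer(nn.Module):
    def __init__(self, graph_vocab_size: int = 30522):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.graph = GraphBranch(graph_vocab_size)
        self.mlp = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size + 128, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid(),                                        # score in (0, 1)
        )

    def forward(self, enc, node_ids, adj):
        utt = self.bert(**enc).last_hidden_state[:, 0]           # [CLS] vector
        graph = self.graph(node_ids, adj).unsqueeze(0)
        return self.mlp(torch.cat([utt, graph], dim=-1)).squeeze(-1)


if __name__ == "__main__":
    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    enc = tok("I want to lose weight.", "Why not use the treadmill?",
              return_tensors="pt")
    # Toy keyword graph: two context keywords and two response keywords,
    # with edges only between the context side and the response side.
    node_ids = torch.tensor(tok.convert_tokens_to_ids(
        ["weight", "lose", "treadmill", "exercise"]))
    adj = torch.tensor([[0, 0, 1, 1], [0, 0, 1, 1],
                        [1, 1, 0, 0], [1, 1, 0, 0]], dtype=torch.float)
    with torch.no_grad():
        print(CoherenceScorer()(enc, node_ids, adj))  # an (uncalibrated) coherence score
```

In the actual metric the graph branch operates on keyword nodes extracted from the context and response and is grounded on a commonsense graph; the sketch above only reflects how the graph-level and utterance-level representations can be combined before scoring.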
Methods
  • The authors use the DailyDialog (Li et al, 2017) dataset, which contains high-quality open-domain conversations about daily life covering diverse topics, to train GRADE.
  • Another two chit-chat datasets, ConvAI2 (Dinan et al, 2019) and EmpatheticDialogues (Rashkin et al, 2019), are treated as unseen datasets to verify the transferability of the metrics.
Results
  • DailyDialog Dataset.
  • The test set results of the DailyDialog dataset are presented in Table 1.
  • Although the Spearman correlation of GRADE on the Transformer-Ranker is lower than that of BLEURT, which is trained on a very large-scale dataset, the averaged correlation result of GRADE is 1% higher than BLEURT's.
  • Other Unseen Datasets.
  • To verify the transferability of GRADE, the authors further compare the human correlations of GRADE with other baselines on two unseen chit-chat datasets, ConvAI2 and EmpatheticDialogues.
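The Pearson and Spearman correlations referred to throughout these results are computed between per-response metric scores and the human ratings; a minimal sketch with scipy is shown below (the scores are made-up placeholders, not numbers from the paper).

```python
# Minimal sketch of how metric-human correlations such as those in Table 1 are
# typically computed; the scores below are made-up placeholders.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.82, 0.34, 0.57, 0.91, 0.12, 0.66]  # e.g. metric scores per response
human_scores = [4.2, 2.1, 3.0, 4.8, 1.5, 3.4]          # mean human ratings (1-5 scale)

pearson_r, pearson_p = pearsonr(metric_scores, human_scores)
spearman_r, spearman_p = spearmanr(metric_scores, human_scores)

# Results with p-value > 0.05 would be marked as not statistically significant,
# as done for the starred entries in Table 1.
print(f"Pearson  r={pearson_r:.3f} (p={pearson_p:.3g})")
print(f"Spearman r={spearman_r:.3f} (p={spearman_p:.3g})")
```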
Conclusion
  • The authors proposed GRADE (Graph-enhanced Representations for Automatic Dialogue Evaluation), a novel metric for dialogue coherence evaluation of open-domain dialogue systems.
  • A limitation of GRADE is the inconsistency between the training objective and the expected behavior.
  • As a result, GRADE may deviate from the human scoring criterion and fail to quantify dialogue responses accurately, and its human correlation results may fluctuate over different runs.
  • To develop a dialogue metric that scores in a more human-like manner, it is critical to reduce the gap between the training objective and the model behavior the authors truly care about
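To make the gap between training objective and expected behavior concrete, the sketch below assumes a margin-ranking objective over coherent versus incoherent responses, which is a common setup for learned dialogue metrics and is an assumption here, not a restatement of the paper's exact loss. Such a loss only constrains relative ordering, so a model can minimize it while producing scores that are neither spread out nor aligned with the 1-5 human scale.

```python
# Illustration of the objective/behaviour gap discussed above, assuming a
# margin-ranking objective (an assumption for illustration only).
import torch
import torch.nn as nn

loss_fn = nn.MarginRankingLoss(margin=0.1)

# Scores produced by a hypothetical metric for coherent vs. incoherent responses.
coherent = torch.tensor([0.30, 0.31, 0.32])
incoherent = torch.tensor([0.10, 0.11, 0.12])
target = torch.ones(3)  # +1 means "first argument should be ranked higher"

# The loss is already zero: the ranking constraint is satisfied even though the
# scores are neither spread out nor calibrated to a 1-5 human rating scale.
print(loss_fn(coherent, incoherent, target))  # tensor(0.)
```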
Summary
  • Objectives:

    Since the goal is to predict a coherence score of a response based on a context, the authors only consider the edges between the context nodes Vc and the response nodes Vr
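A minimal sketch of this edge restriction follows; the keyword extractor is a naive placeholder and the edge set is unweighted, whereas the actual graph construction in the paper is more involved (e.g., grounded on a commonsense graph).

```python
# Minimal sketch of the edge restriction described above: edges are added only
# between context keyword nodes (Vc) and response keyword nodes (Vr), never
# within the same side. Keyword extraction here is a naive placeholder.
from itertools import product

STOPWORDS = {"i", "the", "a", "to", "do", "you", "my", "me", "not", "use", "or",
             "maybe", "get", "but", "why", "want"}

def keywords(utterance: str) -> list[str]:
    return [w.strip("?.,!").lower() for w in utterance.split()
            if w.strip("?.,!").lower() not in STOPWORDS]

context = "I want to lose weight but I hate exercise."
response = "Why not use the treadmill? Or maybe get a dog?"

Vc = keywords(context)    # e.g. ['lose', 'weight', 'hate', 'exercise']
Vr = keywords(response)   # e.g. ['treadmill', 'dog']

# Only context-response edges are kept, as stated in the Objectives above.
edges = list(product(Vc, Vr))
print(edges)
```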
Tables
  • Table1: Correlations between automatic evaluation metrics and human judgements on three different datasets (DailyDialog, ConvAI2 and EmpatheticDialogues) and two dialogue models (Transformer-Ranker and TransformerGenerator). The star * indicates results with p-value > 0.05, which are not statistically significant
  • Table2: Correlations between auto-metrics and human judgements on the ConvAI2 dataset and two dialogue models, Bert-Ranker and DialoGPT, respectively
  • Table3: Ablation results on the DailyDialog dataset, averaged across five random seeds, with standard deviations shown in gray. N1 and N2 refer to the numbers of 1st- and 2nd-hop neighboring nodes in ConceptNet, respectively. A marker indicates that three or more of the five correlation results are not statistically significant, namely, p-value > 0.05
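For the N1/N2 setting in Table 3, the sketch below shows one way to collect 1st- and 2nd-hop neighboring nodes of a keyword by breadth-first search; the toy adjacency dictionary merely stands in for ConceptNet and is not real ConceptNet data.

```python
# Minimal sketch of gathering 1st- and 2nd-hop neighbours (the N1/N2 setting in
# Table 3). The toy adjacency dict stands in for ConceptNet; it is not real data.
from collections import deque

toy_conceptnet = {
    "dog": ["pet", "walk", "exercise"],
    "exercise": ["treadmill", "gym", "health"],
    "walk": ["outdoors", "hiking"],
    "pet": ["animal"],
}

def k_hop_neighbours(graph: dict, start: str, max_hops: int) -> dict:
    """Return {hop: set of nodes} for hops 1..max_hops via breadth-first search."""
    seen = {start}
    frontier = deque([(start, 0)])
    hops = {h: set() for h in range(1, max_hops + 1)}
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                hops[depth + 1].add(nbr)
                frontier.append((nbr, depth + 1))
    return hops

print(k_hop_neighbours(toy_conceptnet, "dog", max_hops=2))
# {1: {'pet', 'walk', 'exercise'}, 2: {'animal', 'outdoors', 'hiking', 'treadmill', 'gym', 'health'}}
```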
Related work
  • Automatic evaluation for open-domain dialogue systems is difficult since there are many appropriate responses for a dialogue context under the open-domain setting, known as the one-to-many problem (Zhao et al, 2017).

    Initially, statistic-based metrics from language generation tasks were adopted for dialogue evaluation, such as BLEU (Papineni et al, 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2004). These metrics use statistical rules to measure the surface similarity between generated responses and reference responses. For example, BLEU computes the geometric average of the n-gram precisions. However, they cannot cope with the one-to-many problem and have weak correlations with human judgements (Liu et al, 2016).
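To illustrate the BLEU statement above, here is a minimal sentence-level sketch of n-gram precision and its geometric average; it omits the brevity penalty and smoothing of the full BLEU definition, and the example sentences are made up.

```python
# Minimal sentence-level sketch of BLEU-style n-gram precision and its geometric
# average; it omits the brevity penalty and smoothing used by full BLEU.
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:          # geometric mean collapses to zero
        return 0.0
    return exp(sum(log(p) for p in precisions) / max_n)

generated = "why not use the treadmill".split()
reference = "you could use the treadmill at home".split()
print(bleu(generated, reference))     # surface n-gram overlap only
```

Because the score depends only on surface n-gram overlap, a perfectly coherent response phrased differently from the reference can receive a near-zero score, which is the weakness with respect to the one-to-many problem noted above.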

    In recent years, learning-based metrics have increasingly attracted interest from researchers. ADEM, proposed by Lowe et al (2017) and trained with human-annotated data in a supervised manner, achieves higher correlations with human judgements than the statistic-based metrics. However, it is time-consuming and expensive to obtain large amounts of annotated data. To reduce this cost, Tao et al (2018) trained their metric RUBER with auto-constructed negative samples in an unsupervised manner.
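A minimal sketch of the auto-constructed negative samples mentioned above: for each context, the ground-truth response serves as a positive example and a response drawn from a different dialogue serves as a negative one (the dialogues and pairing below are illustrative).

```python
# Minimal sketch of auto-constructing negative samples for RUBER-style
# unsupervised training, as described above: the ground-truth response is the
# positive example, a response from a different dialogue is the negative one.
import random

dialogues = [
    ("I want to lose weight.", "Why not use the treadmill?"),
    ("What are you cooking tonight?", "Pasta with tomato sauce."),
    ("Did you watch the game?", "Yes, it went to overtime."),
]

random.seed(0)
training_pairs = []
for i, (context, gold_response) in enumerate(dialogues):
    # Positive sample: the ground-truth response for this context.
    training_pairs.append((context, gold_response, 1))
    # Negative sample: a response taken from a different, randomly chosen dialogue.
    j = random.choice([k for k in range(len(dialogues)) if k != i])
    training_pairs.append((context, dialogues[j][1], 0))

for pair in training_pairs:
    print(pair)
```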
Funding
  • This work was supported in part by the National Key R&D Program of China under Grant No. 2018AAA0100300, the National Natural Science Foundation of China (NSFC) under Grants No. U19A2073 and No. 61976233, Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No. 2019B1515120039, Nature Science Foundation of Shenzhen under Grant No. 2019191361, and Zhijiang Lab's Open Fund (No. 2020AA3AB14)
Study subjects and analysis
individual workers: 10
For each coherence question, workers were provided with a context-response pair and asked to assess the coherence between the context and the response on a scale of 1-5 (not coherent at all to very coherent). Each pair was assessed by 8 to 10 individual workers.

unique workers: 217
In total, there are 1,200 different pairs and 11,910 human annotations from 217 unique workers, which serve as the final human judgements. As shown in Figure 3, the distribution of human judgements is balanced from score 1 to 5.
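A minimal sketch of turning the raw worker annotations into one human judgement per context-response pair is given below; averaging the 8-10 ratings per pair is an assumption for illustration, as the exact aggregation is not restated in this summary.

```python
# Minimal sketch of aggregating worker ratings per context-response pair into
# one human judgement; averaging is an assumption, and the data is made up.
from collections import defaultdict
from statistics import mean

# (pair_id, rating on the 1-5 coherence scale) -- made-up annotations.
annotations = [(0, 4), (0, 5), (0, 4), (1, 2), (1, 1), (1, 2), (1, 3)]

ratings_by_pair = defaultdict(list)
for pair_id, rating in annotations:
    ratings_by_pair[pair_id].append(rating)

human_judgements = {pid: mean(r) for pid, r in ratings_by_pair.items()}
print(human_judgements)   # {0: 4.33..., 1: 2}
```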

Reference
  • Daniel De Freitas Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a human-like open-domain chatbot. ArXiv, abs/2001.09977.
  • Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
  • Djork-Arne Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2016. Fast and accurate deep network learning by exponential linear units (elus). In International Conference on Learning Representations.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
  • Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander H. Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander I. Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2019. The second conversational intelligence challenge (convai2). ArXiv, abs/1902.00098.
  • Sarik Ghazarian, Johnny Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 82–89, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034.
  • Zhiting Hu, Haoran Shi, Bowen Tan, Wentao Wang, Zichao Yang, Tiancheng Zhao, Junxian He, Lianhui Qin, Di Wang, et al. 2019. Texar: A modularized, versatile, and extensible toolkit for text generation. In ACL 2019, System Demonstrations.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.
  • Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1116–1126, Vancouver, Canada. Association for Computational Linguistics.
  • A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston. 2017. Parlai: A dialog research software platform. arXiv preprint arXiv:1705.06476.
  • Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic opendomain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.
  • Stephen Roller, Emily Dinan, Naman Goyal, Da Young Ju, Mary F. Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric Michael Smith, Y.Lan Boureau, and Jason Weston. 2020. Recipes for building an open-domain chatbot. ArXiv, abs/2004.13637.
  • Yu Rong, Wen bing Huang, Tingyang Xu, and Junzhou Huang. 2020. Dropedge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations.
  • Shiki Sato, Reina Akama, Hiroki Ouchi, Jun Suzuki, and Kentaro Inui. 2020. Evaluating dialogue generation systems via response selection. ArXiv, abs/2004.14302.
  • Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? how controllable attributes affect human judgments. In NAACL-HLT.
  • Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. 2020. Bleurt: Learning robust metrics for text generation. ArXiv, abs/2004.04696.
  • Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Jianheng Tang, Tiancheng Zhao, Chenyan Xiong, Xiaodan Liang, Eric Xing, and Zhiting Hu. 2019. Targetguided open-domain conversation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5624–5634, Florence, Italy. Association for Computational Linguistics.
  • Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. Ruber: An unsupervised method for automatic evaluation of open-domain dialog systems. In AAAI.
  • Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktaschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 673–683, Hong Kong, China. Association for Computational Linguistics.
  • Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.
  • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
  • Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. Dialogpt: Large-scale generative pre-training for conversational response generation.
  • Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664, Vancouver, Canada. Association for Computational Linguistics.
  • Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020. The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics, pages 1–62.