Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems

Jan Deriu
Don Tuggener
Pius von Däniken
Jon Ander Campos
Alvaro Rodrigo
Thiziri Belkacem
Aitor Soroa
Eneko Agirre
Mark Cieliebak

EMNLP 2020.

Other Links: arxiv.org
Keywords: bot conversation, human evaluation, multi turn, inter-annotator agreement, single turn
TL;DR:
We introduced Spot The Bot, a robust and time-efficient approach for evaluating conversational dialogue systems

Abstract:

The lack of time-efficient and reliable evaluation methods hampers the development of conversational dialogue systems (chatbots). Evaluations requiring humans to converse with chatbots are time- and cost-intensive, put high cognitive demands on the human judges, and yield low-quality results. In this work, we introduce Spot The Bot [...]

Code: https://github.com/jderiu/spot-the-bot-code

Data:
Introduction
  • Evaluation is a long-standing issue in developing conversational dialogue systems. The underlying difficulty in evaluation lies in the problem’s open-ended nature, as chatbots do not solve a clearly-defined task whose success can be measured in relation to an a priori defined ground truth.
  • The authors present the Spot The Bot framework, a cost-efficient evaluation methodology that can be used to rank several bots with regard to their ability to disguise themselves as humans.
  • It works as a multi-turn evaluation with human judges.
  • The authors show that the framework produces reliable, repeatable results while being quicker and more cost-effective to run than related approaches, as it does not rely on human-bot conversations and generally requires fewer annotations.
  • The authors release the framework as a ready-to-use tool for evaluating dialogue systems into which different systems can be plugged and compared (https://github.com/jderiu/spot-the-bot-code).
Highlights
  • Evaluation is a long-standing issue in developing conversational dialogue systems
  • Single-turn analysis is usually performed by a human judge that rates a single response of the bot to a given context, whereas multi-turn analysis is often performed by a user that interacts with the bot and rates the interaction
  • Spot The Bot works by generating conversations between bots, mixing these bot-bot conversations with human-human conversations, and letting human judges decide for each entity in the conversations whether it is a human or a bot (a code sketch of this procedure follows this list)
  • Automatic evaluation metrics for chatbots are known to correlate poorly with human ratings (Liu et al, 2016; Lowe et al, 2017; Mehri and Eskenazi, 2020), so we focus on human-based approaches, which can be classified in two dimensions: 1) single-turn vs. multi-turn approaches, and 2) approaches where the dialogue systems are judged by the user directly or where judgments are made by objective experts, who do not participate in the dialogue
  • We introduced Spot The Bot, a robust and time-efficient approach for evaluating conversational dialogue systems
  • We show that Spot The Bot yields robust and significant results while reducing the evaluation time compared to other evaluation frameworks
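To make the evaluation flow concrete, here is a minimal sketch of the procedure described in the highlights above: bot-bot conversations are mixed with human-human conversations, judges label each speaker as human or bot, and per-entity label rates are aggregated. All names (Entity, Conversation, build_evaluation_pool, spotted_rates) are hypothetical and do not correspond to the API of the released tool.

```python
# Hypothetical sketch of the Spot The Bot evaluation flow (not the API of the
# released tool at github.com/jderiu/spot-the-bot-code).
import random
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class Entity:
    name: str      # e.g. "GPT", "BR", or "human"
    is_bot: bool

@dataclass
class Conversation:
    entities: tuple   # the two speakers (Entity, Entity)
    turns: list       # list of (speaker_index, utterance) pairs

def build_evaluation_pool(bot_bot_convs, human_human_convs, seed=0):
    """Mix bot-bot conversations with human-human conversations and shuffle,
    so judges cannot infer the label from the position in the task."""
    pool = list(bot_bot_convs) + list(human_human_convs)
    random.Random(seed).shuffle(pool)
    return pool

def spotted_rates(pool, judgments):
    """judgments[(conv_id, speaker_index)] is True if a judge labelled that
    speaker as a bot; returns, per entity, the fraction of 'bot' labels."""
    labelled_bot, total = defaultdict(int), defaultdict(int)
    for conv_id, conv in enumerate(pool):
        for idx, entity in enumerate(conv.entities):
            total[entity.name] += 1
            if judgments.get((conv_id, idx), False):
                labelled_bot[entity.name] += 1
    return {name: labelled_bot[name] / total[name] for name in total}
```

From these per-entity judgments, pairwise win rates between bots can then be derived, as sketched after the Results section below.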
Methods
  • For each domain, the authors prepared a pool of bots to be ranked and analyzed.
  • The authors seed the conversations by using the first exchange of a conversation in the test set, which is sampled at random.
  • Only 2% of all sampled conversations contain an exchange that also appears in the training material.
  • To avoid biasing the results towards the performance of a few crowdworkers, the authors designed each Human Intelligence Task (HIT) as a batch of 20 conversations, and each worker was only allowed to work on three batches (see the batching sketch after this list).
  • The authors designed the batches so that two segments of the same conversation never appear in the same batch, and each batch contains different segments of different conversations
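The batching constraints above can be sketched as follows; build_batches and assign_workers are hypothetical helper names, not part of the released codebase, and the actual task design may differ.

```python
# Hypothetical sketch of the batching constraints: each HIT is a batch of 20
# conversation segments, no two segments of the same source conversation may
# share a batch, and each worker handles at most three batches.
def build_batches(segments, batch_size=20):
    """segments: list of (conversation_id, segment) pairs."""
    batches = []
    for conv_id, segment in segments:
        placed = False
        for batch in batches:
            if len(batch) < batch_size and all(cid != conv_id for cid, _ in batch):
                batch.append((conv_id, segment))
                placed = True
                break
        if not placed:
            batches.append([(conv_id, segment)])
    return batches

def assign_workers(batches, workers, max_batches_per_worker=3):
    """Least-loaded assignment so that no single worker annotates more than
    three batches, limiting the influence of individual crowdworkers."""
    load = {w: 0 for w in workers}
    assignment = {}
    for i, _ in enumerate(batches):
        worker = min(workers, key=lambda w: load[w])
        if load[worker] >= max_batches_per_worker:
            raise ValueError("Not enough workers for the given batch count.")
        assignment[i] = worker
        load[worker] += 1
    return assignment
```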
Results
  • Table 1 gives an overview of the win rates for each pair of bots and their ranking ranges.
  • Significance is computed with a chi-square test; most pairwise win rates are significant (a computational sketch follows this list).
  • DR performs worst in all three domains. For example, the pairwise win rates in the Dailydialog domain are:

            GPT   BR    S2    DR
    GPT      -    0.67  0.77  0.93
    BR      0.33   -    0.79  0.83
    S2      0.23  0.21   -    0.74
    DR      0.07  0.17  0.26   -

  • The resulting overall win rates (WR) and ranking ranges for Dailydialog are GPT 0.79 (1,1), BR 0.65 (1,2), S2 0.39 (3,3), and DR 0.16 (4,4).
  • For PersonaChat, the corresponding win rates and ranking ranges are 0.87 (1,1), 0.59 (2,3), 0.50 (2,3), 0.33 (4,4), and 0.19 (5,5).
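As a rough illustration of how numbers like those in Table 1 can be produced, the sketch below computes pairwise win rates, tests them against a 50/50 null with a chi-square test, and derives ranking ranges via bootstrap resampling. The paper's exact procedure may differ in detail, and the function names are hypothetical.

```python
# Illustrative computation of pairwise win rates, chi-square significance, and
# bootstrap ranking ranges (a sketch, not the paper's exact procedure).
import random
from collections import defaultdict
from scipy.stats import chisquare

def win_rate_and_significance(outcomes):
    """outcomes[(a, b)] is a list of 1/0 decisions: 1 means bot a won the
    pairwise comparison against bot b, 0 means it lost."""
    stats = {}
    for (a, b), results in outcomes.items():
        wins, n = sum(results), len(results)
        # Chi-square test against the null hypothesis of a 50/50 win split.
        _, p_value = chisquare([wins, n - wins])
        stats[(a, b)] = (wins / n, p_value)
    return stats

def ranking_ranges(outcomes, n_boot=1000, seed=0):
    """Bootstrap-resample the pairwise decisions and record each bot's best
    and worst rank across resamples, giving a ranking range per bot."""
    rng = random.Random(seed)
    ranks = defaultdict(list)
    bots = sorted({bot for pair in outcomes for bot in pair})
    for _ in range(n_boot):
        win_rates = {b: [] for b in bots}
        for (a, b), results in outcomes.items():
            sample = [rng.choice(results) for _ in results]
            rate = sum(sample) / len(sample)
            win_rates[a].append(rate)
            win_rates[b].append(1.0 - rate)
        order = sorted(bots, key=lambda b: -sum(win_rates[b]) / len(win_rates[b]))
        for rank, bot in enumerate(order, start=1):
            ranks[bot].append(rank)
    return {bot: (min(r), max(r)) for bot, r in ranks.items()}
```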
Conclusion
  • 5.1 On Inter-Annotator Agreement

    The robustness of the evaluation of chatbots is often hampered by low inter-annotator agreement (IAA) (Gandhe and Traum, 2016); a sketch of a simple agreement computation follows this list.
  • The results, averaged over all investigated domains and segment lengths per bot, are shown in Table 3.
  • In this work, the authors introduced Spot The Bot, a robust and time-efficient approach for evaluating conversational dialogue systems.
  • It is based on conversations between bots, rated by humans with respect to the bots’ ability to mimic human behavior.
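As a simple illustration of the agreement computation referenced above, the sketch below computes pairwise percent agreement over human/bot labels. This is a common baseline measure and not necessarily the exact statistic reported in Table 3.

```python
# Pairwise percent agreement over human/bot labels, one common way to quantify
# inter-annotator agreement (not necessarily the statistic used in Table 3).
from itertools import combinations

def pairwise_agreement(annotations):
    """annotations[item_id] is the list of labels ("human" or "bot") assigned
    by the different annotators of that item."""
    agree, total = 0, 0
    for labels in annotations.values():
        for a, b in combinations(labels, 2):
            total += 1
            agree += int(a == b)
    return agree / total if total else float("nan")

# Example: two entities, three annotators each.
print(pairwise_agreement({
    "conv1_speakerA": ["bot", "bot", "human"],
    "conv1_speakerB": ["human", "human", "human"],
}))  # -> (1 + 3) / (3 + 3) = 0.67
```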
Tables
  • Table1: Win rates (WR) for each pair of systems for each of the three domains. The bold entries denote significance (p < 0.05) computed with Chi-square test. The ranking ranges are computed using bootstrap sampling
  • Table2: Per feature win-rate of the different systems over all domains. Bold numbers indicate that the feature has a significant influence on system survival according to a Cox model
  • Table3: Annotator agreement on labels
  • Table4: Overview of time efficiency in seconds: Spot The Bot annotation versus creating human-bot conversations
  • Table5: Win rates for each pair of systems for each of the three domains. The bold entries denote significance (p < 0.05) computed with Chi-square test
  • Table6: Table 6
  • Table7: Overview of the domains
  • Table8: Segment analysis for the Dailydialog domain. For each segment length (2, 3, and 5), the win rate (WR) and the percentage of classifications as human (HP) are shown. The last row shows the percentage of ties
  • Table9: Overview of the annotator performance: the number of annotations (#Ann), the average correctness score (AVG. CORR), the average correctness score for the human-human conversations (AVG. HUM. CORR.), and the percentage of annotators with a correctness score below 50% (< 50%)
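Table 2 above refers to a Cox model that relates conversational features to "system survival", i.e., how long a bot avoids being spotted. The sketch below shows one possible formulation with the lifelines package on a toy, right-censored dataset; the paper additionally cites interval-censored methods (Turnbull, 1974; Anderson-Bergman, 2017), so this is a simplification, and the feature names and data are hypothetical.

```python
# Hedged sketch: Cox proportional hazards regression of "survival" (how many
# exchanges a bot lasts before being spotted) on conversational features.
# lifelines' CoxPHFitter assumes right-censored data; the paper references
# interval-censored methods, so treat this as an approximation.
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical per-segment data: duration = segment length at which the bot
# was spotted, spotted = 1 if it was spotted (0 = censored), plus two
# made-up features.
df = pd.DataFrame({
    "duration":   [2, 3, 5, 5, 3, 2, 5, 3],
    "spotted":    [1, 1, 0, 1, 0, 1, 1, 0],
    "fluency":    [0.4, 0.85, 0.9, 0.5, 0.8, 0.3, 0.6, 0.9],
    "repetition": [1, 0, 0, 1, 1, 1, 0, 0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="spotted")
cph.print_summary()  # hazard ratios and p-values per feature
```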
Related work
  • There exist various methods to evaluate dialogue systems, both automatic and human-based, but no single evaluation metric is widely agreed upon in the scientific community (Deriu et al, 2020). Automatic evaluation metrics for chatbots are known to correlate poorly with human ratings (Liu et al, 2016; Lowe et al, 2017; Mehri and Eskenazi, 2020), so we focus on human-based approaches, which can be classified along two dimensions: 1) single-turn vs. multi-turn approaches, and 2) approaches where the dialogue systems are judged by the user directly (interactive) or where judgments are made by objective experts, who do not participate in the dialogue (static).

    Single-turn Static Evaluations. Evaluations based on a static context and a single response from the dialogue system are widely adopted. Usually, the rating is performed by expert raters who read the response of one or more dialogue systems to a static context and rate the responses (Galley et al, 2018). Alternatively, the responses of two bots can be compared directly to choose a preferred answer (Li et al, 2016). While being relatively time- and cost-efficient, single-turn evaluation fails to capture the quality of the conversation as a whole. A system that tends to produce repeated answers can obtain a high single-turn score, albeit a low multi-turn one (See et al, 2019). Some authors also report poor inter-annotator agreement (Ghandeharioun et al, 2019).
Funding
  • This work has been partially funded by the LIHLITH project supported by the EU ERA-Net CHIST-ERA; the Swiss National Science Foundation [20CH21 174237]; the Agencia Estatal de Investigacion (AEI, Spain) projects PCIN-2017-118 and PCIN-2017-085; and the Basque Government IT134319. Jon Ander Campos holds a doctoral grant from the Spanish MECD.
Reference
  • Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.
  • Jacopo Amidei, Paul Piwek, and Alistair Willis. 2019a. Agreement is overrated: A plea for correlation to assess human evaluation reliability. In Proceedings of the 12th International Conference on Natural Language Generation, pages 344–354.
  • Jacopo Amidei, Paul Piwek, and Alistair Willis. 2019b. The use of rating and Likert scales in natural language generation human evaluation tasks: A review and some recommendations. In Proceedings of the 12th International Conference on Natural Language Generation, pages 397–402, Tokyo, Japan. Association for Computational Linguistics.
  • Clifford Anderson-Bergman. 2017. icenReg: Regression Models for Interval Censored Data in R. Journal of Statistical Software, Articles, 81(12):1–23.
  • Emily Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Phillip A Bishop and Robert L Herron. 2015. Use and misuse of the Likert item responses and other ordinal measures. International journal of exercise science, 8(3):297.
  • Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Ondrej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.
  • Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.
  • D. R. Cox. 1972. Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B (Methodological), 34(2):187–220.
  • Jan Deriu and Mark Cieliebak. 2019. Towards a metric for automated conversational dialogue system evaluation and improvement. In Proceedings of the 12th International Conference on Natural Language Generation, pages 432–437, Tokyo, Japan. Association for Computational Linguistics.
  • Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2020. Survey on evaluation methods for dialogue systems. Artificial Intelligence Review.
  • Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2020a. The second conversational intelligence challenge (convai2). In The NeurIPS ’18 Competition, pages 187–208, Cham. Springer International Publishing.
  • Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2020b. The second conversational intelligence challenge (convai2). In The NeurIPS ’18 Competition, pages 187–208, Cham. Springer International Publishing.
  • Ondrej Dusek, Jekaterina Novikova, and Verena Rieser. 2018. Findings of the E2E NLG challenge. In Proceedings of the 11th International Conference on Natural Language Generation, pages 322–328, Tilburg University, The Netherlands. Association for Computational Linguistics.
  • A. Eyal, L. Rokach, M. Kalech, O. Amir, R. Chougule, R. Vaidyanathan, and K. Pattada. 2014. Survival analysis of automobile components using mutually exclusive forests. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 44(2):246–253.
  • Michel Galley, Chris Brockett, Xiang Gao, Bill Dolan, and Jianfeng Gao. 2018. End-to-end conversation modeling: Moving beyond chitchat. DSTC7 task 2 description (v1.0).
  • Sudeep Gandhe and David Traum. 2016. A semi-automated evaluation metric for dialogue model coherence. In Situated Dialog in SpeechBased Human-Computer Interaction, pages 217– 225. Springer.
  • Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. In Advances in Neural Information Processing Systems, pages 13658–13669.
  • Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkill™: A Bayesian Skill Rating System. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06, pages 569–576, Cambridge, MA, USA. MIT Press.
  • Jialiang Li and Shuangge Ma. 2013. Survival analysis in medicine and genetics. CRC Press.
  • Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany. Association for Computational Linguistics.
  • Margaret Li, Jason Weston, and Stephen Roller. 2019. Acute-eval: Improved dialogue evaluation with optimized questions and multi-turn comparisons.
  • Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.
  • Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1116–1126, Vancouver, Canada. Association for Computational Linguistics.
  • Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
  • Shikib Mehri and Maxine Eskenazi. 2020. Usr: An unsupervised and reference free evaluation metric for dialog generation.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language Models are Unsupervised Multitask Learners.
  • Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan. Association for Computational Linguistics.
  • Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic opendomain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.
  • Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. 2020. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637.
  • Jost Schatzmann, Kark Weilhammer, Matt Stuttle, and Steve Young. 2006. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The Knowledge Engineering Review, 21(2):97–126.
  • Alexander Schmitt and Stefan Ultes. 2015. Interaction quality: Assessing the quality of ongoing spoken dialog interaction by experts—and how it relates to user satisfaction. Speech Communication, 74:12 – 36.
  • Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? how controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1702–1723, Minneapolis, Minnesota. Association for Computational Linguistics.
  • A. M. Turing. 1950. Computing Machinery and Intelligence. Mind, LIX(236):433–460.
  • Bruce W. Turnbull. 1974. Nonparametric Estimation of a Survivorship Function with Doubly Censored Data. Journal of the American Statistical Association, 69(345):169–173.
  • Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, Rahul Goel, Shaohua Yang, and Anirudh Raju. 2018. On evaluating and comparing open domain dialog systems.
  • Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204– 2213, Melbourne, Australia. Association for Computational Linguistics.
  • Qiang Zhao and Jianguo Sun. 2004. Generalized logrank test for mixed interval-censored failure time data. Statistics in Medicine, 23(10):1621–1629.