Can You Put it All Together: Evaluating Conversational Agents' Ability to Blend Skills

Eric Michael Smith
Mary Williamson
Kurt Shuster
Jason Weston
Y-Lan Boureau

ACL, pp. 2021–2030, 2020.


Abstract:

Being engaging, knowledgeable, and empathetic are all desirable general qualities in a conversational agent. Previous work has introduced tasks and datasets that aim to help agents to learn those qualities in isolation and gauge how well they can express them. But rather than being specialized in one single quality, a good open-domain conversational agent should be able to seamlessly blend them all into one cohesive conversational flow. In this work, we investigate several ways to combine models trained towards isolated capabilities, ranging from simple model aggregation schemes that require minimal additional training, to various forms of multi-task training that encompass several skills at all training stages. We further propose a new dataset, BlendedSkillTalk, to analyze how these capabilities would mesh together in a natural conversation, and compare the performance of different architectures and training schemes. Our experiments show that multi-tasking over several tasks that focus on particular capabilities results in better blended conversation performance compared to models trained on a single skill, and that both unified and two-stage approaches perform well if they are constructed to avoid unwanted bias in skill selection or are fine-tuned on our new task.

Introduction
  • A good open-domain conversational agent should have a well-rounded set of skills1 and qualities that allow it to seamlessly blend listening with empathy, providing knowledgeable responses, and talking about various topics from everyday life to their favorite hobbies or latest challenges.

    1 "Skills" in the conversational AI literature is sometimes taken to mean a narrowly defined, specific set of abilities, such as telling the weather (e.g., Zhou et al. (2020)).
  • Recent research has made solid strides towards gauging and improving the performance of open-domain conversational agents along specific axes, such as how knowledgeable they are (Dinan et al., 2019b; Moghe et al., 2018; Qin et al., 2019), how well they can display empathy (Rashkin et al., 2019; Lin et al., 2019), or how well they can talk about their personal background (Zhang et al., 2018; Li et al., 2017).
  • It remains unclear whether models optimized for performance along one of these axes can retain the learned skill while blending it with other desirable skills, or how to best conduct simultaneous training of multiple skills.
  • In order to evaluate those methods, the authors propose a new English-language dataset, BlendedSkillTalk, that blends several skills into a single conversation, and use it to evaluate methods with both automated metrics and human crowdsourced ratings across different axes.
Highlights
  • A good open-domain conversational agent should have a well-rounded set of skills1 and qualities that allow it to seamlessly blend listening with empathy, providing knowledgeable responses, and talking about various topics from everyday life to their favorite hobbies or latest challenges.

    1 "Skills" in the conversational AI literature is sometimes taken to mean a narrowly defined, specific set of abilities, such as telling the weather (e.g., Zhou et al. (2020)).
  • We examine how to combine three such traits that each have a corresponding task and dataset: demonstrating an ability to talk about oneself and get to know your partner, as captured by the ConvAI2 dataset, an extension of the PersonaChat dataset (Zhang et al., 2018; Dinan et al., 2020); being knowledgeable and discussing a topic in depth, as measured through the Wizard of Wikipedia task (Dinan et al., 2019b); and demonstrating empathy and being able to talk about emotional personal situations, as measured by the EmpatheticDialogues benchmark proposed in Rashkin et al. (2019).
  • The Multi-Task Single-Skills model performs best among the blended models, and nearly matches the performance of all single-skill models on all benchmarks.
  • We have shown several ways to leverage previous work focusing on individual conversational skills: by combining trained single-skill models in a two-stage way, by re-using the datasets for simultaneous multi-task training, and by fine-tuning on the overall blended task (the first two strategies are sketched after this list).
  • We showed that multiple multi-task approaches can be effective on this task; careful construction of the training scheme is important to mitigate biases when blending and selecting skills, while fine-tuning on the overall blended task improves models further.
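As a rough illustration of the two blending strategies just mentioned, the sketch below shows a two-stage dispatcher (a skill classifier routes each context to one single-skill model) and a multi-task loop that interleaves minibatches from the three single-skill datasets. Every name here is a hypothetical stand-in written for this summary, not the authors' ParlAI implementation.

```python
# Hedged sketch of the two blending strategies. All names are stand-ins.
from typing import Callable, Dict, Iterator

TASKS = ("convai2", "wizard_of_wikipedia", "empathetic_dialogues")

def two_stage_respond(context: str,
                      classify_skill: Callable[[str], str],
                      skill_models: Dict[str, Callable[[str], str]]) -> str:
    """Two-stage blending: a top-level classifier picks the skill the context
    calls for, then the matching single-skill model produces the response."""
    skill = classify_skill(context)  # e.g., "wizard_of_wikipedia"
    return skill_models[skill](context)

def multi_task_updates(batches_by_task: Dict[str, Iterator[dict]],
                       model_step: Callable[[dict], None],
                       num_updates: int) -> None:
    """Multi-task single-skills: one model is updated on minibatches
    interleaved round-robin across the three single-skill datasets."""
    tasks = list(batches_by_task)
    for step in range(num_updates):
        model_step(next(batches_by_task[tasks[step % len(tasks)]]))

if __name__ == "__main__":
    # Toy usage with stand-in components.
    models = {t: (lambda ctx, t=t: f"[{t}] reply to {ctx!r}") for t in TASKS}
    print(two_stage_respond("Tell me about jazz.",
                            classify_skill=lambda ctx: "wizard_of_wikipedia",
                            skill_models=models))

    def fake_batches(task: str) -> Iterator[dict]:
        while True:  # a real loader would yield (context, response) pairs
            yield {"task": task}

    multi_task_updates({t: fake_batches(t) for t in TASKS},
                       model_step=lambda b: print("update on", b["task"]),
                       num_updates=6)
```

Round-robin scheduling is only one option; sampling tasks in proportion to dataset size is another common choice for multi-task training.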
Methods
  • Example conversation from the BlendedSkillTalk dataset (U: unguided speaker, G: guided speaker; utterance annotations: PB = personal background, K = knowledge, S = personal situation). A structured-data rendering of this episode is sketched below.

    Persona for G: "I design video games for a living." / "I wear glasses that are cateye."

    Wizard of Wikipedia topic: Video game design

    Previous utterances:
    U: What video games do you like to play?
    G: all kinds, action, adventure, shooter, platformer, rpg, etc. but video game design requires both artistic and technical competence AND writing skills. that is one part many people forget

    Actual utterances:
    U: Exactly! I think many people fail to notice how beautiful the art of video games can be. (PB)
    (G selected the WoW suggestion: "Some games games are purposely designed to be a work of a persons creative expression, many though have been challenged as works of art by some critics.")
    G: Some games games are purposely designed to be a work of a persons creative expression, many though have been challenged as works of art by some critics. (K)
    U: Video games are undervalued by many and too blamed for problems like obesity or violence in kids (K)
    G: Just last week my son was playing some Tine 2 and it was keeping him so calm. Games are therapeutic to some. (S)
    U: I use games to relax after a stressful day, the small escape is relaxing. (PB)
    (G selected the ED suggestion: "I enjoy doing that after a hard day at work as well. I hope it relaxes you!")
    G: I enjoy a good gaming session after a hard day at work as well. (PB)
    G: I wish I could play football, But I wear this cateye glasses and they would break if I tried. I have to show off my beautiful green eyes somehow. (S)
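For concreteness, one way to represent an episode like the one above as structured data is sketched below; the field names are illustrative guesses made for this summary, not the released dataset's actual schema.

```python
# One BlendedSkillTalk episode as structured data. Field names are illustrative
# guesses for exposition; consult the released dataset for the real schema.
episode = {
    "guided_persona": ["I design video games for a living.",
                       "I wear glasses that are cateye."],
    "wow_topic": "Video game design",
    "previous_utterances": [
        ("U", "What video games do you like to play?"),
        ("G", "all kinds, action, adventure, shooter, platformer, rpg, etc. ..."),
    ],
    # Each turn: (speaker, text, annotated mode, suggestion source chosen by G).
    # Modes: PB = personal background, K = knowledge, S = personal situation.
    "dialogue": [
        ("U", "Exactly! I think many people fail to notice how beautiful "
              "the art of video games can be.", "PB", None),
        ("G", "Some games games are purposely designed to be a work of a "
              "persons creative expression ...", "K", "wow"),
        ("U", "Video games are undervalued by many and too blamed for "
              "problems like obesity or violence in kids", "K", None),
    ],
}

for speaker, text, mode, suggestion in episode["dialogue"]:
    print(f"{speaker} ({mode}): {text[:60]}...")
```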
Results
  • Results on Single-Skill Benchmarks

    Automated metrics results on the original benchmarks used to gauge competency at a single skill (ConvAI2, WoW, ED), as reported in the literature, are shown in Table 5.
  • The model fine-tuned on BST shows balanced performance but fails to match the performance of the single-skill models on their original benchmarks.
  • Results for both settings are shown in Table 6.
  • All single-skill models show improved performance once fine-tuned on the BST train set.
  • Performance in the zero-shot setting is already good, which is promising in terms of generalization to unseen data. (The retrieval metric behind these automated numbers is sketched below.)
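Since the models compared here are retrieval models, the automated numbers above are ranking metrics; a minimal sketch of hits@1, assuming the standard candidate-ranking setup, follows. `score_fn` is a stand-in for a trained retrieval model (e.g., a poly-encoder), not the authors' code.

```python
# Minimal sketch of the hits@1 retrieval metric: the model scores a set of
# candidate responses, one of which is the gold response, and gets credit
# when the gold candidate ranks first. score_fn is a hypothetical stand-in.

def hits_at_1(examples, score_fn):
    correct = 0
    for context, candidates, gold_index in examples:
        scores = [score_fn(context, cand) for cand in candidates]
        best = max(range(len(scores)), key=scores.__getitem__)
        correct += int(best == gold_index)
    return correct / len(examples)

# Toy usage: a "model" that simply prefers longer candidates.
toy_examples = [("hi there", ["hello! how are you?", "no"], 0)]
print(hits_at_1(toy_examples, lambda ctx, cand: len(cand)))  # -> 1.0
```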
Conclusion
  • Discussion and Conclusion

    This paper focuses on the goal of creating an open-domain conversational agent that can display many skills, and blend them in a seamless and engaging way.
  • One natural extension would be to generalize these findings to skills beyond the three addressed here, such as humor/wit, eloquence, and image commenting.
  • This would in principle be straightforward as long as these additional skills have a corresponding “single-skill” dataset to train on and are sufficiently distinguishable from each other.
Tables
  • Table 1: Guided workers' choice of suggestions in the train set of BlendedSkillTalk, broken down by the provenance of the given initial context utterances. Guided workers often choose not to use the suggestions, but have a slight preference for ConvAI2 when the initial context is from that dataset, and similarly for ED.
  • Table 2: Percentages of utterances of unguided workers classified by the dataset classifier as coming from ConvAI2, WoW, or ED, broken down by the provenance of the provided seed context. For each dataset, the fraction of utterances classified as coming from that dataset is highest when the seed context is from that same dataset.
  • Table 3: Breakdown of conversations by number of modes, showing that most BST dataset conversations exhibit multiple modes. Workers were asked to indicate whether each utterance of a conversation demonstrated knowledge, empathy, personal situations, or personal background. Over 70% of the annotated conversations demonstrated at least 3 of the 4 modes.
  • Table 4: Mitigating skill selection bias. Adding personas and topics during multi-task training (debias) results in the multi-task retrieval models selecting utterances more evenly when tested on BlendedSkillTalk, compared to training on the original datasets (orig).
  • Table 5: Results on single-skill benchmarks. Top: reported values published in the papers accompanying the benchmarks, and the Poly-encoder paper. ConvAI2, WoW, ED: models trained on the corresponding benchmark. These models perform very well on the benchmark they were trained on, but not as well on the other benchmarks. BST: the model fine-tuned on BST shows more balanced performance (i.e., none of the single-skill models does better at all three skills), but it is noticeably lower than each specialized model. Random-Skill: the performance of choosing a random single-skill model per response is comparable to the BST model, but slightly worse on ConvAI2. MT Two-Stage: guiding the generation by an actual task classifier, as opposed to random selection, increases performance on all skills. MT Single-Skills: this model performs best among the blended-skill architectures, and nearly matches the single-skill model performance (and surpasses it in the WoW case). Added-context benchmarks: when the benchmark contexts are augmented with a persona and topic as described in Section 3.2, the evaluation results barely change. Mixed-candidates evaluation: when the set of benchmark candidates is tripled by adding candidates from the other two benchmarks in equal proportion (sketched after this list), the performance of the best respective single-task models suffers, while the MT Single-Skills model proves more resilient. Note that single-task averages in italics do not correspond to a single model, but to an average over 3 models.
  • Table 6: Test results on BlendedSkillTalk. BST, zero-shot: the models are tested directly on the test set of BST without having been fine-tuned on the BST train set. +BST, FT: models are fine-tuned on the BST train set, then tested on the BST test set. Multi-Task Single-Skills + BlendedSkillTalk performs best. The Multi-Task Two-Stage model outperforms two of the single-skill models, but the latter work well when combined with BlendedSkillTalk fine-tuning. We hypothesize that ConvAI2 alone performs well because it has been trained to use persona contexts, which are used throughout the BST dialogues.
  • Table 7: Human evaluation results on individual axes of knowledge, empathy, and being personal, as well as overall quality. All results here have a 95% confidence interval of ±0.2 or 0.3, omitted to avoid cluttering the table. Results that are within the confidence interval of the best model performance are bolded. ConvAI2, WoW, ED: models pre-trained on pushshift.io Reddit and fine-tuned on the respective datasets. For empathy and personal topics, the individual models tend to do better when trained on a dataset tailored for that skill; however, they all perform similarly on the knowledge dimension. BST: model pre-trained on pushshift.io Reddit and fine-tuned on BST. This model shows better overall performance than the single-skill models (i.e., none of the three single-skill datasets does better than BST in every dimension). MT Single-Skills with fine-tuning on BST and MT Two-Stage perform very well on all dimensions. MT Single-Skills with fine-tuning on BST has fewer than a third of the parameters of the MT Two-Stage model, yet manages to perform as well, if not slightly better.
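The mixed-candidates setting described in the Table 5 caption can be sketched as follows; `mix_candidates` is a hypothetical helper written for this summary, not the authors' evaluation code.

```python
# Sketch of the mixed-candidates evaluation from Table 5: each benchmark's
# candidate list is tripled by adding an equal number of candidates drawn
# from each of the other two benchmarks, so skill confusion becomes possible.
import random

def mix_candidates(own_candidates, other_pools, seed=0):
    """Return own candidates plus len(own_candidates) sampled from each pool.

    Note: after shuffling, the gold candidate's index must be recomputed."""
    rng = random.Random(seed)
    mixed = list(own_candidates)
    for pool in other_pools:
        mixed.extend(rng.sample(pool, k=len(own_candidates)))
    rng.shuffle(mixed)
    return mixed

# Toy usage with made-up candidate pools.
convai2_cands = ["i have a pet hamster .", "do you have pets too ?"]
wow_pool = ["the first video game dates back to 1958 .",
            "some games are considered works of art ."]
ed_pool = ["that sounds really stressful , i am sorry .",
           "congratulations , you must be so proud !"]
print(mix_candidates(convai2_cands, [wow_pool, ed_pool]))  # 6 candidates total
```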
Related work
  • While most commercial dialogue systems rely on hand-coded narrow skills (e.g., see Zhou et al. (2020); Ram et al. (2018)), typically focusing on separate task-oriented features such as alarm setting, calendar entries, etc., we are interested in models that display various qualities in open-domain dialogue. Further, we focus on skills that can be learned end-to-end, as end-to-end learning affords the promise of better generalization to unseen domains.

    Recent promising conversational models have leveraged very large conversation-like datasets, such as those extracted from Reddit and made available by a third party on pushshift.io (Mazaré et al., 2018; Humeau et al., 2019; Keskar et al., 2019; Rashkin et al., 2019). These large-scale datasets are very useful in providing vast amounts of conversational material that allows for reproducible research and comparison with prior work; however, the qualities of the resulting conversational agents depend on the qualities present in the source conversations. Given how online conversations can turn toxic and lack empathy, indiscriminate pretraining on such corpora is unlikely to spontaneously endow a conversational agent with desirable qualities such as avoiding toxic responses (Dinan et al., 2019a) or demonstrating empathy (Rashkin et al., 2019) or knowledge (Dinan et al., 2019b).

    This has led the community to propose tasks and datasets focusing specifically on some trait or skill. In this work, we examine how to combine three such traits that each have a corresponding task and dataset: demonstrating an ability to talk about oneself and get to know your partner, as captured by the ConvAI2 dataset, an extension of the PersonaChat dataset (Zhang et al., 2018; Dinan et al., 2020); being knowledgeable and discussing a topic in depth, as measured through the Wizard of Wikipedia task (Dinan et al., 2019b); and demonstrating empathy and being able to talk about emotional personal situations, as measured by the EmpatheticDialogues benchmark proposed in Rashkin et al. (2019). The ConvAI2 dataset comprises more than 140k utterances of crowdsourced conversations between paired workers getting to know each other. Each worker was assigned a persona consisting of a few sentences, such as “I have a pet hamster,” which had separately been crowdsourced. The Wizard of Wikipedia (WoW) task aims to explore conversation informed by expert knowledge from Wikipedia, and provides about 194k utterances of conversations on about 1,250 topics. The EmpatheticDialogues (ED) dataset consists of about 50k utterances between a Speaker who is talking about an emotional situation and a Listener who is tasked to respond in an empathetic manner, acknowledging the other person’s feelings. In addition to being associated with easy-to-use datasets, these three skills benefit from being clearly defined and separate in scope. Focusing on blending only three skills keeps data collection, ablations, and analyses manageable while already presenting a challenge for models, and it helps narrow down the most promising approaches for blending a greater number of skills. (These three datasets are restated as structured data in the sketch below.)
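For quick reference, here are the key facts about the three single-skill datasets quoted above, restated as structured data. The dictionary layout is ours; the figures are the approximate counts from the text.

```python
# The three single-skill datasets described above, restated as structured data.
# Figures are the approximate counts quoted in the text; the layout is ours.
SINGLE_SKILL_TASKS = {
    "ConvAI2": {
        "skill": "talking about oneself / getting to know your partner",
        "approx_utterances": 140_000,
        "notes": "extension of PersonaChat; each worker assigned a short persona",
    },
    "Wizard of Wikipedia (WoW)": {
        "skill": "knowledgeable, in-depth discussion of a topic",
        "approx_utterances": 194_000,
        "approx_topics": 1_250,
    },
    "EmpatheticDialogues (ED)": {
        "skill": "empathy; responding to emotional personal situations",
        "approx_utterances": 50_000,
    },
}

for name, info in SINGLE_SKILL_TASKS.items():
    print(f"{name}: ~{info['approx_utterances']:,} utterances ({info['skill']})")
```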
Funding
  • Investigates several ways to combine models trained towards isolated capabilities, ranging from simple model aggregation schemes that require minimal additional training to various forms of multi-task training that encompass several skills at all training stages.
  • Proposes a new dataset, BlendedSkillTalk, to analyze how these capabilities would mesh together in a natural conversation, and compares the performance of different architectures and training schemes.
  • Proposes a new English-language dataset, BlendedSkillTalk, that blends several skills into a single conversation, and uses it to evaluate methods with both automated metrics and human crowdsourced ratings across different axes.
  • Proposes methods that compare those competing approaches, and provides a detailed analysis of their successes and failures.
  • Examines how to combine three such traits that each have a corresponding task and dataset: demonstrating an ability to talk about oneself and get to know your partner, as captured by the ConvAI2 dataset, an extension of the PersonaChat dataset; being knowledgeable and discussing a topic in depth, as measured through the Wizard of Wikipedia task; and demonstrating empathy and being able to talk about emotional personal situations, as measured by the EmpatheticDialogues benchmark proposed in Rashkin et al. (2019).
Reference
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019a. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4536–4545, Hong Kong, China. Association for Computational Linguistics.
  • Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2020. The second conversational intelligence challenge (ConvAI2). In The NeurIPS ’18 Competition, pages 187–208, Cham. Springer International Publishing.
  • Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019b. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International Conference on Learning Representations.
  • Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, Hong Kong, China. Association for Computational Linguistics.
  • Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Real-time inference in multi-sentence tasks with deep pretrained transformers. arXiv preprint arXiv:1905.01969.
  • Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
  • Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • Zhaojiang Lin, Andrea Madotto, Jamin Shin, Peng Xu, and Pascale Fung. 2019. MoEL: Mixture of empathetic listeners. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 121–132, Hong Kong, China. Association for Computational Linguistics.
  • Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779, Brussels, Belgium. Association for Computational Linguistics.
  • Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84, Copenhagen, Denmark. Association for Computational Linguistics.
  • Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. 2018. Towards exploiting background knowledge for building conversation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2322–2332, Brussels, Belgium. Association for Computational Linguistics.
  • Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, and Jianfeng Gao. 2019. Conversing by reading: Contentful neural conversation with on-demand machine reading. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5427–5436, Florence, Italy. Association for Computational Linguistics.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, et al. 2018. Conversational AI: The science behind the Alexa Prize. arXiv preprint arXiv:1801.03604.
  • Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.
  • Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.
  • Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020. The design and implementation of XiaoIce, an empathetic social chatbot. Computational Linguistics, 46(1):53–93.