The Design and Implementation of XiaoIce, an Empathetic Social Chatbot.
arXiv: Human-Computer Interaction, (2019)
This paper describes the development of the Microsoft XiaoIce system, the most popular social chatbot in the world. XiaoIce is uniquely designed as an AI companion with an emotional connection to satisfy the human need for communication, affection, and social belonging. We take into account both intelligent quotient (IQ) and emotional quo...
- The development of social chatbots, or intelligent dialogue systems that are able to engage in empathetic conversations with humans, has been one of the longest running goals in artificial intelligence (AI).
- Conversational systems, such as Eliza (Weizenbaum 1966), Parry (Colby, Weber, and Hilf 1971), and Alice (Wallace 2009), were.
- Recent surveys include Gao, Galley, and Li (2019) and Shum, He, and Li (2018)
- General Chat is responsible for engaging in open-domain conversations that cover a wide range of topics
- Because General Chat and Domain Chats are implemented using the same engine with access to different databases, we only describe General Chat here
- The components of Image Commenting, including the text-to-image generator and boosted tree ranker, are trained on a data set consisting of 28 million images, each paired with six text comments rated on the three-level quality scale as shown in Figure 13
- Unlike early chatbots designed for chitchat, XiaoIce is designed as a social chatbot intended to serve users’ needs for communication, affection, and social belonging, and is endowed with empathy, personality, and skills, integrating both emotional quotient and intelligence quotient to optimize for long-term user engagement, measured in expected Conversation-turns Per Session
- Design Principle
Social chatbots require a sufficiently high intelligence quotient (IQ) to acquire a range of skills to keep up with the users and help them complete specific tasks.
- Importantly, social chatbots require a sufficient emotional quotient (EQ) to meet users’ emotional needs, such as emotional affection and social belonging, which are among the fundamental needs for human beings (Maslow 1943).
- Integration of both IQ and EQ is core to XiaoIce’s system design.
- The most important and sophisticated skill is Core Chat, which can engage in long and open-domain conversations with users
- Both the topic switching classifier and the topic ranker are trained using 50K dialogue sessions whose topics are manually labeled.
- It can be observed that the XiaoIce-produced comments are emotional, subjective, imaginative, and are very likely to inspire meaningful human–machine interactions, while the comments generated by the other image captioning models are reasonable in content but boring in the context of social chats, and less likely to improve user engagement
- Most of these skills are designed for very specific user scenarios or tasks, implemented using hand-crafted dialogue policies and template-based response generators unless otherwise stated.
- A skill can be retired or reenter the market based on the market study result
- 7.1 Evaluation Metrics
Evaluating the quality of open-domain social chatbots is challenging because social chats are inherently open-ended (Ram et al 2018; Gao, Galley, and Li 2019; Huang, Zhu, and Gao 2019) and the long-term success of a social chatbot needs to be measured by its user engagement.
- There is no doubt that the most reliable evaluation is to deploy the chatbot to users and monitor the user feedback and engagement, measured by user ratings, NAU, CPS, and so on, over a long period of time
- The authors take this approach to evaluate XiaoIce. Some recent dialogue challenges (Dinan et al 2018; Ram et al 2018) take a similar, manual evaluation approach, using paid workers and unpaid volunteers.
- The authors will continue to make XiaoIce more useful and empathetic to help build a more connected and happier society for all
- Table 1: Perplexity and BLEU for the seq2seq and persona models on the TV series data set. Adapted from Li et al. (2016b)
- Table 2: Responses to “Do you love me?” from the persona model on the TV series data set using different addressees and speakers. Adapted from Li et al. (2016b)
- Table 3: Ratings of three response generation systems on a 5K dialogue data set
- Table 4: Image commenting results of XiaoIce and four state-of-the-art image captioning systems, in percent. Adapted from Huang et al. (2019)
- Table 5: The record of the longest conversations of XiaoIce. We have verified carefully with these users that these long conversations are generated by XiaoIce and human users, not another bot
- XiaoIce is designed as a modular system based on a hybrid AI engine that combines rule-based and data-driven approaches, as presented in Figure 4 and Section 4. By contrast, in the research community, there is a growing interest in developing fully data-driven, end-to-end (E2E) systems for social chatbot (chitchat) scenarios, as reviewed in Chapter 5 of Gao, Galley, and Li (2019).
The difference is mainly due to different design goals of social chatbots. Traditionally, social chatbots are designed for chitchat scenarios where the bots are expected to mimic human user conversations but not to interact with the user’s environment. For such scenarios, E2E approaches often lead to a very simple system architecture, such as RNN-based systems (Shang, Lu, and Li 2015; Vinyals et al 2015; Li et al 2016b), where the neural network–based response generation models can be easily trained on large-scale free-form, open-domain data sets (e.g., collected from social networks) to allow the bots to chat with users on any topics.
XiaoIce, on the other hand, is designed as an AI companion that integrates both EQ and IQ skills that are needed to help users complete specific tasks. Thus, XiaoIce has to interact with the user’s environment and access real-world knowledge (e.g., via API calls). Therefore, XiaoIce uses a modular architecture similar to task-oriented dialogue systems, with different modules dealing with different tasks. Depending on the availability of training data and knowledge bases for each individual task, either a rule-based method or a data-driven method, or a hybrid of both, is adopted for the task. For example, when asked “what is the weather tomorrow?,” E2E systems are likely to give a plausible but random response, such as “sunny” or “rainy,” due to the lack of grounding in real-world knowledge. XiaoIce, however, generates a factual response based on the user’s geographical location and the corresponding database, as shown in Figure 19(a).
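The modular idea described above can be illustrated with a minimal Python sketch. It is not XiaoIce's implementation: the skill functions, the `WEATHER_DB` dictionary, and the `respond` dispatcher are hypothetical names invented for illustration. The sketch only shows the general pattern of trying a grounded, rule-based skill first and falling back to an open-domain responder otherwise.

```python
from typing import Optional

# Hypothetical stand-in for a real-world knowledge source keyed by city.
WEATHER_DB = {"Beijing": "sunny", "Seattle": "rainy"}


def weather_skill(query: str, user_city: str) -> Optional[str]:
    """Rule-based skill: replies only when the query is about the weather."""
    if "weather" not in query.lower():
        return None  # not this skill's task; let another module handle it
    forecast = WEATHER_DB.get(user_city)
    if forecast is None:
        return f"Sorry, I have no forecast for {user_city}."
    # The reply is grounded in the user's location and the database entry.
    return f"Tomorrow in {user_city} it will be {forecast}."


def chitchat_fallback(query: str, user_city: str) -> str:
    """Placeholder for a data-driven open-domain responder (assumption)."""
    return "Tell me more!"


# Task-specific modules are tried in order before the open-domain fallback.
SKILLS = [weather_skill]


def respond(query: str, user_city: str) -> str:
    for skill in SKILLS:
        reply = skill(query, user_city)
        if reply is not None:
            return reply
    return chitchat_fallback(query, user_city)


print(respond("What is the weather tomorrow?", "Seattle"))
print(respond("I feel lonely today", "Seattle"))
```

In this sketch the weather answer depends on the stored forecast for the user's city rather than on whatever a generative model finds plausible, which is the grounding contrast the paragraph draws.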
- We find that incorporating the neural-based generator into the baseline improves the coverage by 20%, and incorporating the retrieval-based generator using unpaired database into the baseline improves the coverage by 10%
Study subjects and analysis
active users: 660000000
We show how XiaoIce dynamically recognizes human feelings and states, understands user intent, and responds to user needs throughout long conversations. Since the release in 2014, XiaoIce has communicated with over 660 million active users and succeeded in establishing long-term relationships with many of them. Analysis of large-scale online logs shows that XiaoIce has achieved an average CPS of 23, which is significantly higher than that of other chatbots and even human conversations.
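As a rough illustration of the CPS (Conversation-turns Per Session) figure quoted above, the following Python sketch averages the number of turns over a set of logged sessions. The log layout is an assumption made for this example: each session is a list of turns, and each turn is one (user message, bot reply) pair. The function name and sample data are hypothetical.

```python
from statistics import mean


def conversation_turns_per_session(sessions):
    """Return the average number of conversation turns per session (CPS)."""
    if not sessions:
        raise ValueError("need at least one session")
    return mean(len(turns) for turns in sessions)


# Hypothetical log: each inner list holds the (user, bot) turns of one session.
sessions = [
    [("hi", "hello!"), ("how are you?", "great, and you?")],
    [("tell me a joke", "..."), ("another", "..."),
     ("ok bye", "bye!"), ("wait", "yes?")],
]

cps = conversation_turns_per_session(sessions)
print(cps)
```

How sessions are segmented in real logs (for example, by an idle timeout) is not specified in this excerpt, so the sketch takes already-segmented sessions as input.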
active users: 660000000
In this article we present the design and implementation of Microsoft XiaoIce (‘Little Ice’ literally in Chinese), the most popular social chatbot in the world. Since her launch in China in May 2014, XiaoIce has attracted over 660 million active users (i.e., subscribed users). XiaoIce has already been shipped in five countries (China, Japan, US, India, and Indonesia) under different names (e.g., Rinna in Japan) on more than 40 platforms, including WeChat, QQ, Weibo, and Meipai in China; Facebook Messenger in the United States and India; and LINE in Japan and Indonesia
conversation pairs: 30000000000
First is the human conversational data from the Internet—social networks, public forums, bulletin boards, news comments, and so on. After the launch of XiaoIce in May 2014, we also started collecting human– machine conversations generated by XiaoIce and her users, which amounted to more than 30 billion conversation pairs as of May 2018. Nowadays, 70% of XiaoIce’s responses are retrieved from her own past conversations
pilot studies: 2
Evaluation. We present two pilot studies that validate the effectiveness of the persona-based neural response generator and the hybrid approach that combines the generation-based and retrieval-based methods, respectively, and then the A/B test of General Chat. In the first pilot study reported in Li et al (2016b), we compare the persona model against two baseline models, using a TV series data set for model training and evaluation
These skills allow XiaoIce to collaborate with human users in their creative activities, including text-based Poetry Generation, voice-based Song and Audio Book Generation, XiaoIce FM for Somebody, XiaoIce Kids Story Factory, and so on. The XiaoIce Poetry Generation skill has helped over four million users to generate poems. On 15 May 2018, XiaoIce published the first AI-created Chinese poem album in history.
conversations with humans: 10000000000
In two months, XiaoIce successfully became a cross-platform social chatbot. Through August 2015, XiaoIce has had more than 10 billion conversations with humans. By that point, users have proactively posted more than 6 million conversation sessions to the public
active users: 660000000
XiaoIce has made these characters “alive” by bringing various capabilities including chatting, providing services, sharing knowledge, and creating contents. As of July 2018, XiaoIce has been deployed on more than 40 platforms, and has attracted 660 million active users. XiaoIce-generated TV and radio programs have covered
- Albrecht, Joshua and Rebecca Hwa. 2007. A re-examination of machine learning approaches for sentence-level MT evaluation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 880–887, Prague.
- Anderson, Peter, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398.
- Anderson, Peter, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.
- Asimov, I. 198“The Bicentennial Man” in I. Asimov, The Bicentennial Man and Other Stories.
- Banerjee, Satanjeev and Alon Lavie. 2005. Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
- Brahnam, Sheryl. 200Strategies for handling customer abuse of ECAS. Abuse: The Darker Side of Human Computer Interaction, pages 62–67.
- Cai, Yang. 200Empathic computing. In: Ambient Intelligence in Everyday Life. Springer, pages 67–85.
- Cheng, Wen-Feng, Chao-Chung Wu, Ruihua Song, Jianlong Fu, Xing Xie, and Jian-Yun Nie. 2018. Image inspired poetry generation in XiaoIce. arXiv preprint arXiv:1808.03090.
- Cho, Kyunghyun, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha.
- Colby, Kenneth Mark, Sylvia Weber, and Franklin Dennis Hilf. 1971. Artificial paranoia. Artificial Intelligence, 2(1):1–25.
- Cuayáhuitl, Heriberto, Seonghan Ryu, Donghyeon Lee, and Jihie Kim. 201A study on dialogue reward prediction for open-ended conversational agents. NeurIPS Workshop on Conversational AI.
- Curry, Amanda Cercas and Verena Rieser. 2018. #MeToo Alexa: How conversational systems respond to sexual harassment. In Proceedings of the Second ACL Workshop on Ethics in Natural Language Processing, pages 7–14.
- Dinan, Emily, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. CoRR, abs/1811.01241.
- Fang, Hao, Hao Cheng, Elizabeth Clark, Ariel Holtzman, Maarten Sap, Mari Ostendorf, Yejin Choi, and Noah A. Smith. 2017. Sounding board–University of Washington’s Alexa Prize submission. Alexa Prize Proceedings.
- Fang, Hao, Hao Cheng, Maarten Sap, Elizabeth Clark, Ari Holtzman, Yejin Choi, Noah A. Smith, and Mari Ostendorf. 2018. Sounding board: A user-centric and content-driven social chatbot. NAACL HLT 2018, page 96.
- Fang, Hao, Saurabh Gupta, Forrest Iandola, Rupesh K. Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, et al. 2015. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482.
- Fedorenko, Denis, Nikita Smetanin, and Artem Rodichev. 2018. Avoiding echo-responses in a retrieval-based conversation system. In Conference on Artificial Intelligence and Natural Language, pages 91–97.
- Fung, Pascale, Dario Bertero, Yan Wan, Anik Dey, Ricky Ho Yin Chan, Farhad Bin Siddique, Yang Yang, Chien-Sheng Wu, and Ruixi Lin. 2016. Towards empathetic human-robot interactions. CoRR, abs/1605.04072.
- Galley, Michel, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. In ACL-IJCNLP, pages 445–450.
- Gan, Chuang, Zhe Gan, Xiaodong He, Jianfeng Gao, and Li Deng. 2017. StyleNet: Generating attractive visual captions with styles. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 3137–3146.
- Gao, Jianfeng, Michel Galley, and Lihong Li. 2019. Neural approaches to conversational AI. Foundations and Trends in Information Retrieval, 13(2–3):127–298.
- Gao, Jianfeng, Mu Li, Chang Ning Huang, and Andi Wu. 2005. Chinese word segmentation and named entity recognition: A pragmatic approach. Computational Linguistics, 31(4):531–574.
- Gao, Jianfeng, Patrick Pantel, Michael Gamon, Xiaodong He, and Li Deng. 2014. Modeling interestingness with deep neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2–13.
- Ghazvininejad, Marjan, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Proceedings of AAAI, pages 5111–5117.
- Huang, Minlie, Xiaoyan Zhu, and Jianfeng Gao. 2019. Challenges in building intelligent open-domain dialog systems. arXiv preprint arXiv:1905.05709.
- Huang, Po-Sen, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for Web search using clickthrough data. In CIKM, pages 2333–2338, ACM.
- Huang, Qiuyuan, Pei Liu, Lei Zhang, Dapeng Wu, and Jianfeng Gao. 2019. Interweaved hierarchical neural networks for image commenting. Unpublished report.
- Khatri, Chandra, Behnam Hedayatnia, Anu Venkatesh, Jeff Nunn, Yi Pan, Qing Liu, Han Song, Anna Gottardi, Sanjeev Kwatra, Sanju Pancholi, et al. 2018. Advancing the state of the art in open domain dialog systems through the Alexa Prize. arXiv preprint arXiv:1812.10757.
- Li, Jiwei, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.
- Li, Jiwei, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003.
- Li, Jiwei, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016c. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.
- Lin, Chin-Yew. 2004. Rouge: A package for automatic evaluation of summaries. In Proceedings of the ACL workshop, pages 74–81.
- Liu, Chia-Wei, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of EMNLP 2016, pages 2122–2132.
- Lowe, Ryan, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of ACL 2017, Volume 1: Long Papers, pages 1116–1126, Vancouver.
- Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
- Maslow, Abraham Harold. 1943. A theory of human motivation. Psychological Review, 50(4):370.
- Mathews, Alexander Patrick, Lexing Xie, and Xuming He. 2016. Senticap: Generating image descriptions with sentiments. In AAAI, pages 3574–3580.
- Misu, Teruhisa, Kallirroi Georgila, Anton Leuski, and David Traum. 2012. Reinforcement learning of question-answering dialogue policies for virtual museum guides. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 84–93.
- Morris, Meredith Ringel, Annuska Zolyomi, Catherine Yao, Sina Bahram, Jeffrey P. Bigham, and Shaun K. Kane. 2016. With most of it being pictures now, I rarely use it: Understanding Twitter’s evolving accessibility to blind users. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 5506–5516.
- Mostafazadeh, Nasrin, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. 2017. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462–472.
- Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
- Peng, Baolin, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In EMNLP, pages 2231–2240.
- Picard, Rosalind W. 2000. Affective Computing. MIT Press.
- Ram, Ashwin, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, et al. 2018. Conversational AI: The science behind the Alexa Prize. arXiv preprint arXiv:1801.03604.
- Rennie, Steven J., Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1179–1195.
- Sai, Ananya, Mithun Das Gupta, Mitesh M. Khapra, and Mukundhan Srinivasan. 2019. Response generation by context-aware prototype editing. In Proceedings of AAAI 2019, volume 33, pages 7281–7288, Honolulu, HI.
- Schmidt, Anna and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10.
- Serban, Iulian Vlad, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, pages 3776–3784.
- Serban, Iulian Vlad, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301.
- Shang, Lifeng, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In ACL-IJCNLP, pages 1577–1586.
- Shawar, Bayan Abu and Eric Atwell. 2007. Different measurements metrics to evaluate a chatbot system. In Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies, pages 89–96.
- Shen, Yelong, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 101–110.
- Shum, Heung-Yeung, Xiaodong He, and Di Li. 2018. From Eliza to XiaoIce: Challenges and opportunities with social chatbots. CoRR, abs/1801.01957.
- Sordoni, Alessandro, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In NAACL-HLT, pages 196–205.
- Sutskever, Ilya, Oriol Vinyals, and Quoc Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.
- Sutton, Richard S., Doina Precup, and Satinder P. Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211. [An earlier version appeared as Technical Report 98-74, Department of Computer Science, University of Massachusetts, Amherst, MA 01003.]
- Vedantam, Ramakrishna, C. Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
- Vinyals, Oriol and Quoc Le. 2015. A neural conversational model. In ICML Deep Learning Workshop.
- Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.
- Wallace, Richard S. 2009. The anatomy of Alice. In Parsing the Turing Test. Springer, pages 181–210.
- Weizenbaum, Joseph. 1966. Eliza—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.
- Wu, Bowen, Baoxun Wang, and Hui Xue. 2016. Ranking responses oriented to conversational relevance in chat-bots. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 652–662.
- Wu, Qiang, Christopher J. C. Burges, Krysta M. Svore, and Jianfeng Gao. 2010. Adapting boosting for information retrieval measures. Information Retrieval, 13(3):254–270.
- Xing, Chen, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic aware neural response generation. In AAAI, volume 17, pages 3351–3357.
- Zhang, Kai, Wei Wu, Fang Wang, Ming Zhou, and Zhoujun Li. 2016. Learning distributed representations of data in community question answering for question retrieval. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 533–542.