Deploying Lifelong Open-Domain Dialogue Learning


Abstract:

Much of NLP research has focused on crowdsourced static datasets and the supervised learning paradigm of training once and then evaluating test performance. As argued in de Vries et al. (2020), crowdsourced data has the issues of lack of naturalness and relevance to real-world use cases, while the static dataset paradigm does not allow …

Introduction
  • Humans learn to use language over the course of their lives from the interactions they have with the world and other people.
  • The dominant paradigm in natural language processing (NLP) research is to build a fixed dataset, train a model on it, and then freeze it, giving the model no ability to interact with humans using language at training time.
  • While the authors need such interaction in order to study human-machine communication to its full extent, constraints usually inhibit such research.
  • As crowdworkers are motivated by pay, not by interest in the actual tasks themselves, the data distribution may not match the desired one (de Vries et al., 2020).
Highlights
  • Humans learn to use language over the course of their lives from the interactions they have with the world and other people
  • The dominant paradigm in natural language processing (NLP) research is to build a fixed dataset, train a model on it, and then freeze it, giving the model no ability to interact with humans using language at training time.
  • Detailed experiments showed that one can collect high-quality data that improves both automatic offline metrics and user engagement metrics when used for training models.
  • We find this exciting because this approach shows it is possible to build continually improving models that learn from interacting with humans in the wild, which represents a paradigm shift away from the limited static-dataset setup prevalent in much of the community's work.
Methods
  • 5.1 Rounds of Learning

    The authors performed three rounds of the lifelong learning setup.

    Round 1 consists of models trained on LIGHT MTurk data only.
  • Round 2 consists of models trained on LIGHT MTurk data + 50,982 WILD examples collected from the deployment of the Round 1 models; these models were again deployed within the game (the loop is sketched in code after this list).
  • Round 3 consists of models trained on LIGHT MTurk data + 50,982 examples from Round 1 deployment + an additional 180,010 examples collected from Round 2 deployment.
  • Although this is a lifelong learning setup and the models are still deployed and collecting data, collection was frozen at a fixed point for this paper in order to provide a data release and experimental results.
  • Validation and test sets were extracted from a portion of the data from Round 2
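    The round structure above amounts to a simple train–deploy–collect loop. The minimal Python sketch below illustrates it under stated assumptions: all function names and data containers (train_model, deploy_and_collect, light_mturk, wild_data) are hypothetical stand-ins, not the paper's actual pipeline (which is built on ParlAI and the deployed LIGHT game), and the stubs only mark where real training and deployment would go.

```python
from typing import Dict, List

Example = Dict[str, str]  # e.g. {"context": ..., "label": ...}

def train_model(train_data: List[Example]) -> object:
    """Stub: fine-tune a retrieval or generative dialogue model."""
    return {"trained_on": len(train_data)}  # placeholder model object

def deploy_and_collect(model: object) -> List[Example]:
    """Stub: serve the model inside the game, harvest human-model turns."""
    return []  # in the real system, this data comes from live game play

light_mturk: List[Example] = []  # crowdsourced seed data (8.5k episodes)
wild_data: List[Example] = []    # grows with every deployment round

for round_id in (1, 2, 3):       # three rounds reported in the paper
    model = train_model(light_mturk + wild_data)
    # Round 1 deployment yielded 50,982 examples; Round 2 another 180,010.
    wild_data.extend(deploy_and_collect(model))
```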
Results
  • Experiments show that this is not the case: even the lowest-quality data provides a useful signal. For example, performance drops only slightly, from 87.06% to 86.69%, on the WILD validation set if the bins rated lower than 6 are removed (training on all remaining data), and to 85.38% if the bins rated lower than 9 are removed. A sketch of this filtering follows.
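    A minimal sketch of the quality-bin ablation described above, assuming each collected example carries a quality bin; the field name quality_bin is a hypothetical stand-in for however the deployed system scores examples.

```python
from typing import Dict, List

def filter_by_bin(examples: List[Dict], min_bin: int) -> List[Dict]:
    """Keep only examples whose quality bin is at least `min_bin`."""
    return [ex for ex in examples if ex["quality_bin"] >= min_bin]

# Reported WILD validation accuracy under each filter:
#   keep all bins      -> 87.06%
#   drop bins below 6  -> 86.69%
#   drop bins below 9  -> 85.38%
```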
Conclusion
  • Conclusion and Future Work

    The authors have presented a fully realized system for improving an open-domain dialogue task by using a deployed game with a purpose for lifelong learning.
  • Detailed experiments showed that one can collect high-quality data that improves both automatic offline metrics and user engagement metrics when used for training models.
  • The authors find this exciting because this approach shows it is possible to build continually improving models that learn from interacting with humans in the wild, which represents a paradigm shift away from the limited static-dataset setup prevalent in much of the community's work.
Tables
  • Table1: Data statistics of our lifelong learning deployment at the point where we froze collection for the experiments reported in the paper and the subsequent data release
  • Table2: Comparison of statistics of the open-domain dialogue data collected during our lifelong learning deployment (bottom row) compared to several existing crowdsourced datasets. Our data is around twice as large in terms of human utterances than these datasets, and 4x as large in terms of dialogue utterances (as our data consists of human-model conversations), while the cost to collect our data was only 1/5th of the price per utterance of LIGHT MTurk, see Sec. 5.3.3
  • Table3: Three rounds of training in our lifelong open-domain dialogue learning setup. Both retrieval and generative models trained on the data from the three rounds improve across both metrics on all three test sets
  • Table4: Deployment-based Evaluation, comparing several metrics on data collected during Round 2 of collection
  • Table5: Deployment-based Evaluation: changes in continue rates for various model variants
  • Table6: Percentage of utterances flagged with an issue alongside overall satisfaction, by model
Related work
  • Open-Domain Dialogue Dialogue in the open-domain setting, wherein the conversation involves chat about any topic rather than a specific goal-directed topic, is commonly studied in the train/valid/test static dataset paradigm using supervised learning. A number of crowdsourced or scraped datasets have been developed to that end, including DailyDialog (Li et al., 2017), PersonaChat (Zhang et al., 2018), Empathetic Dialogues (Rashkin et al., 2019) and Wizard of Wikipedia (Dinan et al., 2019c).

    LIGHT In this work we specifically focus on the open-domain dialogue setting of LIGHT (Urbanek et al., 2019). LIGHT features situated characters playing roles that can talk about any topic, within the context of a medieval fantasy world. This setting is known to be engaging for human role-players, and it also alleviates some safety concerns, since role-playing means players should not divulge personally identifying information. Urbanek et al. (2019) crowdsourced a dialogue dataset consisting of 8.5k episodes and 111k utterances, which they publicly released; we refer to this as LIGHT MTurk data, or LIGHT data for short, in the rest of this paper. In this work we use that data to build a deployed system in which players converse with models, and we study lifelong learning with these models using the information in the new conversations.
Study subjects and analysis
crowdworkers: 810
The number of unique locations and roles that can be played by speakers (characters) is large (587 and 630, respectively). The number of players of the game at the time of freezing was over 13,000, which also makes the diversity far larger than in typical crowdsourced datasets; e.g., LIGHT MTurk involved 1,052 crowdworkers and Empathetic Dialogues involved 810. Finally, the number of unique tokens is larger in LIGHT WILD, indicating the diversity of language used.

user controls: 3
While a full study of the elements of game design is outside the scope of this paper, we note that the adjustments we made to the game after initial deployment produced large changes in user behavior. For example, after adding the three user controls for how to continue the game loop once an episode is finished (as described in Sec. 3), compared to only a single choice, we saw the continue rate increase by 3.3 ± 1.6% using the same model. Model quality also affects the cost and quality of the data collected; a sketch of the continue-rate computation follows.
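    A minimal sketch of how a continue-rate change like the one above could be measured, assuming each episode ends in a binary continue/stop outcome treated as a Bernoulli trial. The episode counts below are invented for illustration only; they are not the paper's raw numbers.

```python
import math

def continue_rate(n_continue: int, n_total: int) -> tuple:
    """Continue rate and its binomial standard error."""
    p = n_continue / n_total
    se = math.sqrt(p * (1 - p) / n_total)
    return p, se

# Hypothetical episode counts for two game-loop variants:
p1, se1 = continue_rate(6200, 10_000)  # single-choice loop
p2, se2 = continue_rate(6530, 10_000)  # three user controls
diff = p2 - p1
se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
print(f"change in continue rate: {diff:+.1%} ± {se_diff:.1%}")
```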

Reference
  • Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka, and Tom M Mitchell. 2010. Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence.
  • Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. 2018. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233–248.
  • Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2019a. Queens are powerful too: Mitigating gender bias in dialogue generation. arXiv preprint arXiv:1911.03842.
  • Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019b. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4537–4546, Hong Kong, China. Association for Computational Linguistics.
  • Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019c. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International Conference on Learning Representations.
  • Torben Grodal et al. 2000. Video games and the pleasures of control. Media Entertainment: The Psychology of Its Appeal, pages 197–213.
  • Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3667–3684, Florence, Italy. Association for Computational Linguistics.
  • Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations.
  • Matthew Horsfall and Andreas Oikonomou. 2011. A study of how different game play aspects can affect the popularity of role-playing video games. In 2011 16th International Conference on Computer Games (CGAMES), pages 63–69. IEEE.
  • Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In Proceedings of the International Conference on Learning Representations.
  • Esther Levin, Roberto Pieraccini, and Wieland Eckert. 2000. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23.
  • Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. 2016a. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.
  • Jiwei Li, Alexander H Miller, Sumit Chopra, Marc'Aurelio Ranzato, and Jason Weston. 2016b. Dialogue learning with human-in-the-loop. arXiv preprint arXiv:1611.09823.
  • Jiwei Li, Alexander H Miller, Sumit Chopra, Marc'Aurelio Ranzato, and Jason Weston. 2016c. Learning through dialogue interactions by asking questions. arXiv preprint arXiv:1612.04936.
  • Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017).
  • Bing Liu and Ian Lane. 2017. An end-to-end trainable neural network model with belief tracking for task-oriented dialog. arXiv preprint arXiv:1708.05956.
  • Sahisnu Mazumder, Bing Liu, Shuai Wang, and Nianzu Ma. 2019. Lifelong and interactive learning of factual knowledge in dialogues. arXiv preprint arXiv:1907.13295.
  • Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84. ACL.
  • Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bo Yang, Justin Betteridge, Andrew Carlson, B Dalvi, Matt Gardner, Bryan Kisiel, et al. 2018. Never-ending learning. Communications of the ACM, 61(5):103–115.
  • Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.
  • Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, et al. 2018. Conversational AI: The science behind the Alexa Prize. arXiv preprint arXiv:1801.03604.
  • Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.
  • Verena Rieser and Oliver Lemon. 2011. Reinforcement learning for adaptive dialogue systems: a data-driven methodology for dialogue management and natural language generation. Springer Science & Business Media.
  • Mark Bishop Ring. 1994. Continual learning in reinforcement environments. Ph.D. thesis, University of Texas at Austin, Austin, Texas 78712.
  • Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. 2020. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637.
  • Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. 2006. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowledge Engineering Review, 21(2):97–126.
  • Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1702–1723. ACL.
  • Iulian V Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, et al. 2017. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349.
  • Daniel L Silver, Qiang Yang, and Lianghao Li. 2013. Lifelong machine learning systems: Beyond learning algorithms. In 2013 AAAI Spring Symposium Series.
  • Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktäschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 673–683, Hong Kong, China. Association for Computational Linguistics.
  • Luis Von Ahn. 2006. Games with a purpose. Computer, 39(6):92–94.
  • Harm de Vries, Dzmitry Bahdanau, and Christopher Manning. 2020. Towards ecologically valid research on language user interfaces. arXiv preprint arXiv:2007.14435.
  • Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural text generation with unlikelihood training. In International Conference on Learning Representations.
  • Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will Feng, Alexander H Miller, Arthur Szlam, Douwe Kiela, and Jason Weston. 2017. Mastering the dungeon: Grounded language learning by mechanical turker descent. In Proceedings of the International Conference on Learning Representations.
  • Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213. ACL.