Human-centric dialog training via offline reinforcement learning

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3985–4003, 2020.


Abstract:

How can we train a dialog model to produce better conversations by learning from human feedback, without the risk of humans teaching it harmful chat behaviors? We start by hosting models online and gather human feedback from real-time, open-ended conversations, which we then use to train and improve the models using offline reinforcement learning (RL).

Introduction
  • Training open-domain dialog models is inherently difficult, since for each utterance there are many acceptable responses, yet no perfect response.
  • To learn from real conversations with humans, the authors created an interactive, online platform which hosted a diverse set of neural network dialog models that users could chat with in real time.
  • The authors need to train and test models offline to ensure safe model outputs.
  • To safely learn to optimize human feedback, the authors pursued an offline reinforcement learning approach to training dialog models.
Highlights
  • Training open-domain dialog models is inherently difficult, since for each utterance there are many acceptable responses, yet no perfect response
  • The approach we propose is based on KL-control, a branch of stochastic optimal control (SOC) (Stengel, 1986) in which the Kullback-Leibler (KL) divergence from some distribution is used to regularize a reinforcement learning (RL) policy (Abdolmaleki et al., 2018; Kappen et al., 2012; Rawlik et al., 2012; Todorov, 2007); the general objective is sketched just after this list
  • In the appendix, we provide a study comparing Way Off-Policy (WOP) to prior work in traditional, non-dialog RL tasks, and find that it outperforms all relevant baselines including DBCQ
  • We present novel techniques that enable successful offline reinforcement learning on any base language model from real human conversations
  • RL currently remains the only option for maximizing user feedback over the course of a conversation
  • We have shown that the Way Off-Policy algorithm provides a more effective way to teach a language model specific behaviors from offline data than previously proposed RL or regularization techniques
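
For reference, the general KL-control objective mentioned in the highlights can be written as below. This is a minimal sketch, writing p_0 for the pre-trained prior language model and τ for the weight on the KL penalty; the paper's exact per-token formulation and weighting may differ.

```latex
% Maximize reward while penalizing divergence from the prior language model p_0.
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t} r(s_t, a_t)\right]
\;-\; \tau \, \mathbb{E}_{\pi}\!\left[\sum_{t} \mathrm{KL}\!\big[\pi(\cdot \mid s_t)\,\big\|\,p_0(\cdot \mid s_t)\big]\right]
```
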
Results
  • 6.1 Controlling bot conversation behavior

    The authors first examine whether the algorithms can successfully maximize the proposed bot rewards as intended.
  • Figure 3 (panels: (a) Sentiment Rewards, (b) User Rewards, (c) Bot Repetition Rewards) compares the KL-control models against a baseline VHRED model and a Sentiment- and Infersent-regularized VHRED model (as proposed by Ghandeharioun et al. (2019)).
  • Figure 3a shows that the KL-control model, trained to maximize bot sentiment, achieves higher bot sentiment in experiments than both the VHRED baseline and the VHRED-EI model (with sentiment and topic regularization (Ghandeharioun et al, 2019))
  • This illustrates that for controlling bot sentiment, a reward-based approach better optimizes bot behavior than training with sentiment-based regularization.
  • Controlling bot sentiment leads to eliciting higher user sentiment in the open-domain experiments
Conclusion
  • The authors present novel techniques that enable successful offline reinforcement learning on any base language model from real human conversations
  • This allows the dialog systems practitioner to train models that learn language structure from vast, readily available corpora, and then fine-tune them for specific desirable behaviors post hoc through RL rewards.
  • The authors observe that the new offline RL method successfully optimizes both generated bot rewards and elicited human responses
  • The authors show that it is a better option than regularization for training specific bot behaviors.
  • Compared to prior work in offline RL, the novel WOP offline RL algorithm achieves higher performance in traditional RL tasks, elicits more positive feedback in conversations with novel humans at test time, and earns overall higher human ratings
Summary
  • Introduction:

    Training open-domain dialog models is inherently difficult, since for each utterance there are many acceptable responses, yet no perfect response.
  • To learn from real conversations with humans, the authors created an interactive, online platform which hosted a diverse set of neural network dialog models that users could chat with in real time.
  • The authors need to train and test models offline to ensure safe model outputs.
  • To safely learn to optimize human feedback, the authors pursued an offline reinforcement learning approach to training dialog models.
  • Objectives:

    The authors' goal is to improve a dialog model’s ability to engage in natural conversation with a human by learning from the implicit signals in the human’s response.
  • Results:

    6.1 Controlling bot conversation behavior

    The authors first examine whether the algorithms can successfully maximize the proposed bot rewards as intended.
  • Figure 3 (panels: (a) Sentiment Rewards, (b) User Rewards, (c) Bot Repetition Rewards) compares the KL-control models against a baseline VHRED model and a Sentiment- and Infersent-regularized VHRED model (as proposed by Ghandeharioun et al. (2019)).
  • Figure 3a shows that the KL-control model, trained to maximize bot sentiment, achieves higher bot sentiment in experiments than both the VHRED baseline and the VHRED-EI model (with sentiment and topic regularization (Ghandeharioun et al, 2019))
  • This illustrates that for controlling bot sentiment, a reward-based approach better optimizes bot behavior than training with sentiment-based regularization.
  • Controlling bot sentiment leads to eliciting higher user sentiment in the open-domain experiments
  • Conclusion:

    The authors present novel techniques that enable successful offline reinforcement learning on any base language model from real human conversations
  • This allows the dialog systems practitioner to train models that learn language structure from vast, readily available corpora, and then fine-tune them for specific desirable behaviors post hoc through RL rewards.
  • The authors observe that the new offline RL method successfully optimizes both generated bot rewards and elicited human responses
  • The authors show that it is a better option than regularization for training specific bot behaviors.
  • Compared to prior work in offline RL, the novel WOP offline RL algorithm achieves higher performance in traditional RL tasks, elicits more positive feedback in conversations with novel humans at test time, and earns overall higher human ratings
Tables
  • Table1: Purely reward-maximizing methods like Batch Q trivially exploit a reward for asking questions by only asking questions, and using the maximum number of tokens in every sentence. In contrast, KL-control methods output plausible language by staying close to the language prior, while eliciting positive feedback from humans
  • Table2: Interactive human evaluation of offline RL techniques (best RL model bolded). KL-control strongly outperforms other offline RL techniques. Ratings are Likert scale with 95% confidence intervals (n = 40). Votes and human reward are z-scores
  • Table3: Interactive human evaluation of WOP trained with different reward functions. Manual votes are outperformed by implicit signals. Ratings are Likert scale with 95% confidence intervals (n = 40), votes and human reward are z-scores
  • Table4: Interactive human evaluation of offline RL techniques on the VHRED-EI Model. Ratings are Likert scale with 95% confidence interval (n = 45), votes and human reward are z-scores
  • Table5: Interactive human evaluation of WOP trained with different reward functions on VHRED-EI model. Ratings are Likert scale with 95% confidence interval (n = 45), votes and human reward are z-scores
  • Table6: Reward weights used for RL model training
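
Table 6 lists the weights used to combine the individual reward signals into the single scalar reward optimized during RL training. The sketch below only illustrates the weighted-sum combination; the component names and weight values are placeholders, not the actual Table 6 entries.

```python
# Illustrative only: combine per-utterance reward components into one scalar.
# Component names and weights are placeholders standing in for Table 6.
REWARD_WEIGHTS = {
    "user_sentiment": 1.0,      # implicit signal from the human's next reply
    "bot_sentiment": 0.5,       # sentiment of the bot's own utterance
    "question_elicited": 0.25,  # bonus when the exchange involves questions
    "bot_repetition": -0.5,     # penalty for repetitive bot responses
}

def combined_reward(components: dict) -> float:
    """Weighted sum of reward components for a single (input, response) pair."""
    return sum(REWARD_WEIGHTS[name] * value for name, value in components.items())

# Example: a turn with positive user sentiment and no repetition.
print(combined_reward({"user_sentiment": 0.8, "bot_sentiment": 0.2,
                       "question_elicited": 1.0, "bot_repetition": 0.0}))
```
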
Related work
  • 2.1 Dialog

    Improving dialog systems with RL has largely been restricted to task-oriented dialog systems, which have a limited number of task-specific actions (Fatemi et al., 2016; Gasic et al., 2011; Liu and Lane, 2017; Liu et al., 2018; Su et al., 2017). Some of these approaches incorporate human input through explicit, manual feedback (Shah et al., 2018) or implicit signals (e.g. the user interrupting the system or starting over) (Shi and Yu, 2018).

    RL in the open-domain dialog setting is less explored (Li et al., 2016, 2017b, 2018). Authors may choose to use a highly restricted action space; for example, using RL to choose which dialog model to invoke when producing a response (e.g. Serban et al., 2017a).

    [Figure: training pipeline — a base dialog model is trained with supervised learning on standard dialog corpora (e.g. Cornell Movies); human conversations and ratings are then collected, and the model is further trained with human feedback via offline RL using implicit conversational signals (our work).]
Funding
  • This work has been partially supported by the RTI2018-095232-B-C22 grant from the Spanish Ministry of Science.
Study subjects and analysis
users: 80
These algorithms use KL-control to penalize divergence from a pre-trained prior language model, and use a new strategy to make the algorithm pessimistic, instead of optimistic, in the face of uncertainty. We test the resulting dialog model with ratings from 80 users in an open-domain setting and find it achieves significant improvements over existing deep offline RL approaches. The novel offline RL method is viable for improving any existing generative dialog model using a static dataset of human feedback
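
A minimal sketch of one way such pessimism can be realized, assuming Monte Carlo dropout samples of a target Q-network and taking their minimum as a lower bound on the Bellman target; the paper's exact uncertainty estimate may differ.

```python
import torch

def pessimistic_target(q_target_net, next_state, reward, gamma=0.99, n_samples=10):
    """Lower-bound Bellman target: take the minimum over Monte Carlo dropout
    samples of the target Q-network, so uncertainty lowers the value estimate
    instead of inflating it (pessimism rather than optimism)."""
    q_target_net.train()  # keep dropout active at evaluation time (MC dropout)
    with torch.no_grad():
        samples = torch.stack(
            [q_target_net(next_state).max(dim=-1).values for _ in range(n_samples)]
        )                                    # shape: [n_samples, batch]
    q_lower = samples.min(dim=0).values      # pessimistic estimate of max_a' Q(s', a')
    return reward + gamma * q_lower
```
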

pairs: 46061
Using the server, we collected a batch of human interaction data containing 46,061 pairs of user input and agent response (code: neural_chat/tree/master/BatchRL).

response pairs: 45179
Because humans may use inappropriate language with bots online (see Horton, 2016), we filtered this data to remove 1-character responses, profanities, and invalid inputs, for a remaining total of 45,179 response pairs. This filtering step is important to ensure undesirable human behavior is not learned by the model.
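
A minimal sketch of this filtering pass. Only the three stated criteria (1-character responses, profanities, invalid inputs) come from the text; the profanity list and the validity check below are hypothetical placeholders.

```python
# Hypothetical filtering of (user_input, bot_response) pairs before RL training.
PROFANITIES = {"badword1", "badword2"}  # placeholder list, not the one used in the paper

def keep_pair(user_input: str, bot_response: str) -> bool:
    text = user_input.strip()
    if len(text) <= 1:                                           # drop 1-character responses
        return False
    if any(tok in PROFANITIES for tok in text.lower().split()):  # drop profanity
        return False
    if not text.isprintable():                                   # drop invalid/garbled inputs (placeholder check)
        return False
    return True

raw_pairs = [("hey, how was your day?", "Pretty good, thanks!"), ("k", "ok")]
clean_pairs = [(u, r) for (u, r) in raw_pairs if keep_pair(u, r)]
print(len(clean_pairs))  # -> 1
```
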

Mechanical Turk workers: 80
4.3 Evaluating offline RL models. We recruited 80 Mechanical Turk workers to provide a total of 600 7-point Likert scale ratings of the trained bots, after interacting with each for at least 6 turns. We note that using this platform to test our models “in the wild” with novel humans represents a more meaningful test of generalization than testing an RL model in the same limited (game) environment in which it was trained, since humans are not restricted in the text they can type as input to the model

Reference
  • Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. 2018. Maximum a posteriori policy optimisation. International Conference on Learning Representations.
  • Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.
  • Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. 2019. Striving for simplicity in off-policy deep reinforcement learning. arXiv preprint arXiv:1907.04543.
  • Aditya Bhatt, Max Argus, Artemij Amiranashvili, and Thomas Brox. 2019. Crossnorm: Normalization for off-policy td reinforcement learning. arXiv preprint arXiv:1902.05605.
  • Graham D Bodie, Kellie St. Cyr, Michelle Pence, Michael Rold, and James Honeycutt. 2012. Listening competence in initial interactions i: Distinguishing between what listening is and what listeners do. International Journal of Listening, 26(1):1–28.
  • Graham D Bodie, Andrea J Vickery, Kaitlin Cannava, and Susanne M Jones. 2015. The role of “active listening” in informal helping conversations: Impact on perceptions of listener helpfulness, sensitivity, and supportiveness and discloser emotional improvement. Western Journal of Communication, 79(2):151–173.
  • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. Openai gym.
  • Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.
  • Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680.
  • Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pages 76–87. Association for Computational Linguistics.
  • Thomas Degris, Martha White, and Richard S Sutton. 2012. Off-policy actor-critic. In Proceedings of the 29th International Conference on Machine Learning (ICML).
  • Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. 2018. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1446–1455.
  • Mehdi Fatemi, Layla El Asri, Hannes Schulz, Jing He, and Kaheer Suleman. 2016. Policy networks with two-stage training for dialogue systems. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 101– 110.
  • Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In 2017 Conference on Empirical Methods in Natural Language Processing Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  • Scott Fujimoto, David Meger, and Doina Precup. 2018. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900.
  • Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059.
  • Milica Gasic, Filip Jurcicek, Blaise Thomson, Kai Yu, and Steve Young. 2011. On-line policy optimisation of spoken dialogue systems via live interaction with human subjects. In 2011 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE.
  • Carles Gelada and Marc G Bellemare. 2019. Off-policy deep reinforcement learning by bootstrapping the covariate shift. arXiv preprint arXiv:1901.09455.
  • Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. In Advances in Neural Information Processing Systems, pages 13658–13669.
  • Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. 2020. EMaQ: Expected-max Q-learning operator for simple yet effective offline and online RL. arXiv preprint arXiv:2007.11091.
  • Herbert P Grice. 1975. Logic and conversation. In Speech acts, pages 41–58. Brill.
  • Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1352–1361. JMLR. org.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1856–1865.
  • Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415.
  • Jennifer Hay. 2000. Functions of humor in the conversations of men and women. Journal of pragmatics, 32(6):709–742.
  • James Henderson, Oliver Lemon, and Kallirroi Georgila. 2008. Hybrid reinforcement/supervised learning of dialogue policies from fixed data sets. Computational Linguistics, 34(4):487–511.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Helena Horton. 2016. Microsoft deletes 'teen girl' AI after it became a Hitler-loving sex robot within 24 hours. The Telegraph.
  • Molly E Ireland, Richard B Slatcher, Paul W Eastwick, Lauren E Scissors, Eli J Finkel, and James W Pennebaker. 2011. Language style matching predicts relationship initiation and stability. Psychological science, 22(1):39–44.
  • Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, Jose Miguel Hernandez-Lobato, Richard E Turner, and Douglas Eck. 2017. Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1645–1654. JMLR. org.
  • Nan Jiang and Lihong Li. 2016. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661.
  • Sham M Kakade. 2002. A natural policy gradient. In Advances in neural information processing systems (NIPS), volume 14, pages 1531–1538.
  • Hilbert J Kappen, Vicenc Gomez, and Manfred Opper. 2012. Optimal control as a graphical model inference problem. Machine learning, 87(2):159–182.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. 2019. Stabilizing off-policy q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949.
  • Jiwei Li, Alexander H Miller, Sumit Chopra, Marc’Aurelio Ranzato, and Jason Weston. 2017a. Dialogue learning with human-in-the-loop. International Conference on Learning Representations.
  • Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192– 1202.
  • Jiwei Li, Will Monroe, Tianlin Shi, Sebastien Jean, Alan Ritter, and Dan Jurafsky. 2017b. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169.
  • Ziming Li, Julia Kiseleva, and Maarten de Rijke. 2018. Dialogue generation: From imitation learning to inverse reinforcement learning. arXiv preprint arXiv:1812.03509.
  • Bing Liu and Ian Lane. 2017. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 482–489. IEEE.
  • Bing Liu, Gokhan Tur, Dilek Hakkani-Tur, Pararth Shah, and Larry Heck. 2018. Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2060–2069.
  • Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. 2019. Off-policy policy gradient with state distribution correction. ICML 2019 Workshop RL4RealLife.
  • Shikib Mehri and Maxine Eskenazi. 2020. Unsupervised evaluation of interactive dialog with dialogpt. Proceedings of the SIGdial 2020 Conference.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. NIPS Deep Learning Workshop.
  • Remi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062.
  • Yookoon Park, Jaemin Cho, and Gunhee Kim. 2018. A hierarchical latent structure for variational conversation modeling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1792–1801.
  • Jan Peters, Katharina Mulling, and Yasemin Altun. 2010. Relative entropy policy search. In AAAI, pages 1607–1612. Atlanta.
  • Doina Precup. 2000. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80.
  • Robert R Provine. 1996. Laughter. American scientist, 84(1):38–48.
  • Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. 2012. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: science and systems.
  • Martin Riedmiller. 2005. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer.
  • Abdelrhman Saleh, Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, and Rosalind Picard. 2019. Hierarchical reinforcement learning for open-domain dialog. The Thirty-Fourth AAAI Conference on Artificial Intelligence.
  • John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pages 1889–1897.
  • Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? how controllable attributes affect human judgments. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019).
  • Iulian V Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, et al. 2017a. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349.
  • Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017b. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Pararth Shah, Dilek Hakkani-Tur, Bing Liu, and Gokhan Tur. 2018. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 41–51.
  • Weiyan Shi and Zhou Yu. 2018. Sentiment adaptive end-to-end dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1509–1519.
  • Jamin Shin, Peng Xu, Andrea Madotto, and Pascale Fung. 2019. Happybot: Generating empathetic dialogue responses by improving user experience lookahead. arXiv preprint arXiv:1906.08487.
  • Candace L Sidner, Cory D Kidd, Christopher Lee, and Neal Lesh. 2004. Where to look: a study of human-robot engagement. In Proceedings of the 9th international conference on Intelligent user interfaces, pages 78–84. ACM.
  • Robert F Stengel. 1986. Stochastic optimal control. John Wiley and Sons New York, New York.
  • Pei-Hao Su, Paweł Budzianowski, Stefan Ultes, Milica Gasic, and Steve Young. 2017. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 147–157.
  • Philip Thomas and Emma Brunskill. 2016. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148.
  • Emanuel Todorov. 2007. Linearly-solvable markov decision problems. In Advances in neural information processing systems (NIPS), pages 1369–1376.
  • Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Harry Weger Jr, Gina R Castle, and Melissa C Emmett. 2010. Active listening in peer interviews: The influence of message paraphrasing on perceptions of listening skill. The Intl. Journal of Listening, 24(1):34– 49.
  • Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2018. The design and implementation of xiaoice, an empathetic social chatbot. arXiv preprint arXiv:1812.08989.
  • Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
  • The underlying architecture of the baseline language models employed for this work is a Variational Hierarchical Recurrent Encoder Decoder (VHRED) (Serban et al., 2017b). We also conduct a second set of experiments on an enhanced version of this model with additional knowledge distillation to improve the model's ability to track the sentiment and semantics of the conversation, as proposed by Ghandeharioun et al. (2019). The language models were originally trained on two datasets: movie dialogs (Danescu-Niculescu-Mizil and Lee, 2011) and a dataset scraped from reddit.com/r/casual_conversation (Ghandeharioun et al., 2019).
  • We also added layers to the Context RNN and regularized it to be able to predict the semantic content of the input utterance using a form of knowledge distillation (Hinton et al., 2015) from a state-of-the-art sentence-embedding model (Conneau et al., 2017). There were 2 additional feedforward semantic prediction layers of size 128, which used ReLU activation. The VHRED model with sentiment and Infersent regularization has 95.4 million parameters.
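
A minimal PyTorch sketch of such a semantic-prediction head (two 128-unit ReLU feedforward layers on the Context RNN state). The final projection to the teacher embedding size and the mean-squared-error distillation loss are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticPredictionHead(nn.Module):
    """Two 128-unit ReLU layers on top of the Context RNN state, trained to
    predict a teacher sentence embedding (knowledge-distillation regularizer).
    The final projection to embed_dim is an added assumption."""
    def __init__(self, context_dim: int, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, context_state: torch.Tensor) -> torch.Tensor:
        return self.net(context_state)

def distillation_loss(pred_embed, teacher_embed):
    # Assumed regularizer: match the sentence-embedding teacher (e.g. InferSent).
    return F.mse_loss(pred_embed, teacher_embed)
```
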
  • The RL models, the main focus of our work, were trained using human conversation data collected via the online interactive platform (described in Section F) and batch size was fixed at 32. Each model was trained for 2000 epochs. The RL models were initialized with the weights of the best model trained on the Reddit dataset. Early stopping was used to determine the number of training iterations of the best checkpoint. For each bot, 3 different stopping epochs were tested and the best was selected. The checkpoint was selected using manual tuning based on interactive chat with the chatbots. For the best performing bots, KL-Control Q and KL-Control Ψ, the 1600 and 1800 epoch checkpoints were selected respectively.
  • Each RL model was trained on an NVIDIA GeForce GTX 1080 GPU. Training each RL model for 2000 epochs took approximately 30 minutes, whereas training a VHRED baseline model takes around 6 hours. The speed of RL training illustrates how scalable it is for improving dialog models with respect to specific behaviors.
  • We also conducted experiments using each offline RL algorithm with a Sentiment and Infersent regularized VHRED Model. As described in Section A.1, by adding about 20 million extra parameters to the VHRED model in order to better achieve semantic coherence and sentiment contingency, the VHRED-EI (Emotion and Infersent regularized) model is a better performing baseline in terms of human ratings (Ghandeharioun et al., 2019).
  • To demonstrate the effectiveness of these techniques, we tested them on traditional RL tasks using the OpenAI gym (Brockman et al., 2016), focusing on the CartPole-v0 and Acrobot-v1 experiments. We first train an online Q-learning behavior policy, and store all (s, a, r, s′) experience samples into a replay buffer. We use this buffer to train a prior model of p(a|s) using a Variational Auto-encoder. The VAE was trained to reconstruct the next state given the current state, p(s′|s), using a mean-squared error loss. The next action was predicted from the latent embedding z, meaning the model learned three functions: z = fe(s), s′ = fd(z), and a = fa(z). For CartPole, both the encoder and decoder were made up of two linear layers with 750 neurons each. The latent dimension of the VAE was size 256. For Acrobot, the encoder and decoder had only one layer of size 256 each, and the latent dimension was 64.
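
A minimal PyTorch sketch of this VAE prior, sized for CartPole (state_dim = 4, n_actions = 2, two 750-unit layers, latent size 256). The KL weight, the single-layer action head, and folding the action-prediction term into the same loss are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEPrior(nn.Module):
    """Learns z = f_e(s), s' = f_d(z), a = f_a(z) as described above.
    Layer sizes follow the CartPole description; other details are assumptions."""
    def __init__(self, state_dim=4, n_actions=2, hidden=750, latent=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.decoder = nn.Sequential(                    # f_d: reconstruct next state
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
        self.action_head = nn.Linear(latent, n_actions)  # f_a: predict logged action

    def forward(self, state):
        h = self.encoder(state)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), self.action_head(z), mu, logvar

def vae_prior_loss(model, state, next_state, action, kl_weight=1e-3):
    next_pred, action_logits, mu, logvar = model(state)
    recon = F.mse_loss(next_pred, next_state)            # p(s'|s) via MSE, as described
    act = F.cross_entropy(action_logits, action)         # action prediction from z
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + act + kl_weight * kl
```
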