Deep Reinforcement Learning from Human Preferences.

Neural Information Processing Systems, 2017: 4302–4310

Cited by 467 | Viewed 212 | EI

Abstract

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. Our approach separates learning the goal from learning the behavior…

Introduction
  • Recent success in scaling reinforcement learning (RL) to large problems has been driven in domains that have a well-specified reward function (Mnih et al, 2015, 2016; Silver et al, 2016).
  • The authors could try to design a simple reward function that approximately captures the intended behavior, but this will often result in behavior that optimizes the reward function without actually satisfying the intended preferences.
  • This difficulty underlies recent concerns about misalignment between human values and the objectives of RL systems (Bostrom, 2014; Russell, 2016; Amodei et al, 2016).
  • If the authors could successfully communicate the actual objectives to the agents, it would be a significant step towards addressing these concerns.
Highlights
  • Recent success in scaling reinforcement learning (RL) to large problems has been driven in domains that have a well-specified reward function (Mnih et al, 2015, 2016; Silver et al, 2016)
  • The agent learns about the goal of the task only by asking a human which of two trajectory segments is better
  • Feedback is provided by contractors who are given a 1-2 sentence description of each task before being asked to compare several hundred to several thousand pairs of trajectory segments for that task
  • There is a large literature on preference elicitation and reinforcement learning from unknown reward functions; we provide the first evidence that these techniques can be economically scaled up to state-of-the-art reinforcement learning systems
  • Future work may be able to improve the efficiency of learning from human preferences, and expand the range of tasks to which it can be applied
  • In the long run it would be desirable to make learning a task from human preferences no more difficult than learning it from a programmatic reward signal, ensuring that powerful reinforcement learning systems can be applied in the service of complex human values rather than low-complexity goals
Methods
  • At each point in time the method maintains a policy π : O → A and a reward function estimate r̂ : O × A → R, each parametrized by deep neural networks.
  • These networks are updated by three processes: (1) the policy π interacts with the environment to produce trajectories, and its parameters are updated by a traditional RL algorithm to maximize the sum of predicted rewards; (2) pairs of trajectory segments are selected from these trajectories and sent to a human for comparison; (3) the parameters of the mapping r̂ are optimized via supervised learning to fit the comparisons collected from the human so far (the objective is sketched below).
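
    Concretely, the supervised step (3) uses the Bradley–Terry model (Bradley and Terry, 1952, listed in the references): the summed predicted rewards of the two segments act as preference logits, and r̂ is fit by cross-entropy against the human judgements. Writing σ¹ and σ² for the compared segments and μ for the distribution of the human's preference over them (e.g. (1, 0), (0, 1), or (0.5, 0.5) for "equally good"), the objective can be sketched in LaTeX as

      \hat{P}\big[\sigma^1 \succ \sigma^2\big] = \frac{\exp \sum_t \hat{r}(o^1_t, a^1_t)}{\exp \sum_t \hat{r}(o^1_t, a^1_t) + \exp \sum_t \hat{r}(o^2_t, a^2_t)}

      \mathrm{loss}(\hat{r}) = - \sum_{(\sigma^1, \sigma^2, \mu)} \Big[ \mu(1)\, \log \hat{P}\big[\sigma^1 \succ \sigma^2\big] + \mu(2)\, \log \hat{P}\big[\sigma^2 \succ \sigma^1\big] \Big]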
Results
  • The authors implemented the algorithm in TensorFlow (Abadi et al, 2016). The authors interface with MuJoCo (Todorov et al, 2012) and the Arcade Learning Environment (Bellemare et al, 2013) through the OpenAI Gym (Brockman et al, 2016).

    3.1 Reinforcement Learning Tasks with Unobserved Rewards

    In the first set of experiments, the authors attempt to solve a range of benchmark tasks for deep RL without observing the true reward.
  • The agent learns about the goal of the task only by asking a human which of two trajectory segments is better.
  • Feedback is provided by contractors who are given a 1-2 sentence description of each task before being asked to compare several hundred to several thousand pairs of trajectory segments for that task.
  • Contractors responded to the average query in 3-5 seconds, and so the experiments involving real human feedback required between 30 minutes and 5 hours of human time
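
    As a rough arithmetic check on those figures (illustrative numbers, not additional data from the paper): a few hundred queries at ~3.5 s each is about half an hour (500 × 3.5 s ≈ 29 min), while several thousand at ~5 s each approaches the upper end (3,600 × 5 s = 5 h), matching the reported range of 30 minutes to 5 hours of human time.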
Conclusions
  • Discussion and Conclusions

    Agent-environment interactions are often radically cheaper than human interaction.
  • The authors show that by learning a separate reward model using supervised learning, it is possible to reduce the interaction complexity by roughly 3 orders of magnitude
  • Not only does this show that the authors can meaningfully train deep RL agents from human preferences, but also that they are already hitting diminishing returns
  • There is a large literature on preference elicitation and reinforcement learning from unknown reward functions; the authors provide the first evidence that these techniques can be economically scaled up to state-of-the-art reinforcement learning systems
  • This represents a step towards practical applications of deep RL to complex real-world tasks.
  • In the long run it would be desirable to make learning a task from human preferences no more difficult than learning it from a programmatic reward signal, ensuring that powerful RL systems can be applied in the service of complex human values rather than low-complexity goals
Related Work
  • A long line of work studies reinforcement learning from human ratings or rankings, including Akrour et al (2011), Pilarski et al (2011), Akrour et al (2012), Wilson et al (2012), Sugiyama et al (2012), Wirth and Fürnkranz (2013), Daniel et al (2015), El Asri et al (2016), Wang et al (2016), and Wirth et al (2016). Other lines of research consider the general problem of reinforcement learning from preferences rather than absolute reward values (Fürnkranz et al, 2012; Akrour et al, 2014), and optimizing using human preferences in settings other than reinforcement learning (Machwe and Parmee, 2006; Secretan et al, 2008; Brochu et al, 2010; Sørensen et al, 2016).

    Our algorithm follows the same basic approach as Akrour et al (2012) and Akrour et al (2014). They consider continuous domains with four degrees of freedom and small discrete domains, where they can assume that the reward is linear in the expectations of hand-coded features. We instead consider physics tasks with dozens of degrees of freedom and Atari tasks with no hand-engineered features; the complexity of our environments forces us to use different RL algorithms and reward models, and to cope with different algorithmic tradeoffs. One notable difference is that Akrour et al (2012) and Akrour et al (2014) elicit preferences over whole trajectories rather than short clips. So although we gather about two orders of magnitude more comparisons, our experiments require less than one order of magnitude more human time. Other differences focus on changing our training procedure to cope with the nonlinear reward models and modern deep RL, for example using asynchronous training and ensembling.
Funding
  • Finally, we thank OpenAI and DeepMind for providing a supportive research environment and for supporting and encouraging this collaboration
References
  • Martin Abadi et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • Riad Akrour, Marc Schoenauer, and Michele Sebag. Preference-based policy learning. Machine learning and knowledge discovery in databases, pages 12–27, 2011.
  • Riad Akrour, Marc Schoenauer, and Michèle Sebag. April: Active preference learning-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 116–131, 2012.
  • Riad Akrour, Marc Schoenauer, Michèle Sebag, and Jean-Christophe Souplet. Programming by feedback. In International Conference on Machine Learning, pages 1503–1511, 2014.
  • Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
  • Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
  • Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • Eric Brochu, Tyson Brochu, and Nando de Freitas. A bayesian interactive optimization approach to procedural animation design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 103–112. Eurographics Association, 2010.
  • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Christian Daniel, Malte Viering, Jan Metz, Oliver Kroemer, and Jan Peters. Active reward learning. In Robotics: Science and Systems, 2014.
  • Christian Daniel, Oliver Kroemer, Malte Viering, Jan Metz, and Jan Peters. Active reward learning with a novel acquisition function. Autonomous Robots, 39(3):389–405, 2015.
  • Layla El Asri, Bilal Piot, Matthieu Geist, Romain Laroche, and Olivier Pietquin. Score-based inverse reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 457–465, 2016.
  • Arpad Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.
  • Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, volume 48, 2016.
  • Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. In International Conference on Learning Representations, 2017.
  • Johannes Fürnkranz, Eyke Hüllermeier, Weiwei Cheng, and Sang-Hyeun Park. Preference-based reinforcement learning: A formal framework and a policy iteration algorithm. Machine learning, 89(1-2):123–156, 2012.
  • Dylan Hadfield-Menell, Stuart Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.
  • Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z Leibo, and Audrunas Gruslys. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.
  • Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
  • W Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In International Conference on Knowledge Capture, pages 9–16, 2009.
  • W. Bradley Knox and Peter Stone. Learning non-myopically from human-generated reward. In Jihie Kim, Jeffrey Nichols, and Pedro A. Szekely, editors, IUI, pages 191–202. ACM, 2013. ISBN 978-1-4503-1965-2. URL http://doi.acm.org/10.1145/2449396.
  • William Bradley Knox. Learning from human-generated reward. PhD thesis, University of Texas at Austin, 2012.
  • David Krueger, Jan Leike, Owain Evans, and John Salvatier. Active reinforcement learning: Observing rewards at a cost. In Future of Interactive Learning Machines, NIPS Workshop, 2016.
  • R Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2005.
  • James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, David Roberts, Matthew E Taylor, and Michael L Littman. Interactive learning from policy-dependent human feedback. arXiv preprint arXiv:1701.06049, 2017.
  • AT Machwe and IC Parmee. Introducing machine learning within an interactive evolutionary design environment. In DS 36: Proceedings DESIGN 2006, the 9th International Design Conference, Dubrovnik, Croatia, 2006.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • Andrew Y Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine learning, pages 663–670, 2000.
  • Patrick M Pilarski, Michael R Dawson, Thomas Degris, Farbod Fahimi, Jason P Carey, and Richard Sutton. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning. In International Conference on Rehabilitation Robotics, pages 1–7, 2011.
  • Stuart Russell. Should we fear supersmart robots? Scientific American, 314(6):58, 2016.
  • John Schulman, Sergey Levine, Pieter Abbeel, Michael I Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • Jimmy Secretan, Nicholas Beato, David B D Ambrosio, Adelein Rodriguez, Adam Campbell, and Kenneth O Stanley. Picbreeder: Evolving pictures collaboratively online. In Conference on Human Factors in Computing Systems, pages 1759–1768, 2008.
  • Roger N Shepard. Stimulus and response generalization: A stochastic model relating generalization to distance in psychological space. Psychometrika, 22(4):325–345, 1957.
  • David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Patrikk D Sørensen, Jeppeh M Olsen, and Sebastian Risi. Breeding a diversity of super mario behaviors through interactive evolution. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on, pages 1–7. IEEE, 2016.
  • Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. In International Conference on Learning Representations, 2017.
  • Hiroaki Sugiyama, Toyomi Meguro, and Yasuhiro Minami. Preference-learning based inverse reinforcement learning for dialog control. In INTERSPEECH, pages 222–225, 2012.
  • Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
  • Sida I Wang, Percy Liang, and Christopher D Manning. Learning language games through interaction. arXiv preprint arXiv:1606.02447, 2016.
  • Aaron Wilson, Alan Fern, and Prasad Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In Advances in Neural Information Processing Systems, pages 1133–1141, 2012.
  • Christian Wirth and Johannes Fürnkranz. Preference-based reinforcement learning: A preliminary survey. In ECML/PKDD Workshop on Reinforcement Learning from Generalized Feedback: Beyond Numeric Rewards, 2013.
  • Christian Wirth, J Fürnkranz, Gerhard Neumann, et al. Model-free preference-based reinforcement learning. In AAAI, pages 2222–2228, 2016.
  • For the simulated robotics tasks, we optimize policies using trust region policy optimization (TRPO, Schulman et al., 2015) with discount rate γ = 0.995 and λ = 0.97. The reward predictor is a two-layer neural network with 64 hidden units each, using leaky ReLUs (α = 0.01) as nonlinearities. We compare trajectory segments that last 1.5 seconds, which varies from 15 to 60 timesteps depending on the task. (An illustrative code sketch of this predictor appears at the end of this section.)
  • Our Atari agents are trained using the standard set of environment wrappers used by Mnih et al. (2015): 0 to 30 no-ops in the beginning of an episode, max-pooling over adjacent frames, stacking of 4 frames, a frameskip of 4, life loss ending an episode (but not resetting the environment), and rewards clipped to [−1, 1].
  • For the Atari tasks we optimize policies using the A3C algorithm (Mnih et al., 2016) in synchronous form (A2C), with policy architecture as described in Mnih et al. (2015). We use standard settings for the hyperparameters: an entropy bonus of β = 0.01, learning rate of 0.0007 decayed linearly to reach zero after 80 million timesteps (although runs were actually trained for only 50 million timesteps), n = 5 steps per update, N = 16 parallel workers, discount rate γ = 0.99, and policy gradient using Adam with α = 0.99 and ε = 10⁻⁵.
  • e.g. http://www.free80sarcade.com/2600_Beamrider.php
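
    The hyperparameter notes above describe the reward predictor used for the simulated robotics tasks. The paper's implementation is in TensorFlow (Abadi et al, 2016); purely as an illustration (not the authors' code, with class and function names of my own), the sketch below shows a two-hidden-layer, 64-unit, leaky-ReLU (α = 0.01) predictor over (observation, action) pairs, trained with the Bradley–Terry cross-entropy given in the Methods section. Details such as ensembling, regularization, and query selection are omitted.

```python
# Illustrative PyTorch sketch of a preference-based reward predictor.
# Architecture follows the appendix note above: two hidden layers of
# 64 units with leaky ReLU (alpha = 0.01); the loss is the Bradley-Terry
# cross-entropy over summed predicted rewards of two compared segments.
import torch
import torch.nn as nn


class RewardPredictor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.LeakyReLU(0.01),
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(0.01),
            nn.Linear(hidden, 1),
        )

    def segment_return(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        """Sum of predicted rewards over one segment; obs is (T, obs_dim), act is (T, act_dim)."""
        return self.net(torch.cat([obs, act], dim=-1)).sum()


def preference_loss(model: RewardPredictor, seg1, seg2, mu: float) -> torch.Tensor:
    """Bradley-Terry cross-entropy for one labelled comparison.

    seg1 and seg2 are (obs, act) tensor pairs for the two segments; mu is the
    probability the human preferred seg1 (1.0, 0.0, or 0.5 for "equally good").
    """
    logits = torch.stack([model.segment_return(*seg1), model.segment_return(*seg2)])
    logp = torch.log_softmax(logits, dim=0)  # [log P(seg1 > seg2), log P(seg2 > seg1)]
    return -(mu * logp[0] + (1.0 - mu) * logp[1])
```

    Minimizing this loss over the collected comparisons (e.g. with a standard optimizer) plays the role of process (3) in the Methods section, while the policy is trained against the learned predictor's rewards by TRPO (robotics) or A2C (Atari).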
Authors
Miljan Martic
Dario Amodei