Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning

Robotics: Science and Systems (RSS), 2020.

DOI: https://doi.org/10.15607/RSS.2020.XVI.053

Abstract:

Reinforcement learning provides a general framework for learning robotic skills while minimizing engineering effort. However, most reinforcement learning algorithms assume that a well-designed reward function is provided, and learn a single behavior for that single reward function. Such reward functions can be difficult to design in practice. […]

Introduction
  • Reinforcement learning (RL) has the potential of enabling autonomous agents to exhibit intricate behaviors and solve complex tasks from high-dimensional sensory input without hand-engineered policies or features [54, 48, 37, 35, 16].
  • One of the reasons for this is that the assumptions required in a standard RL formulation are not fully compatible with the requirements of real-world robotic systems.
  • One of these assumptions is the existence of a ground truth reward signal, provided as part of the task.
  • Off-policy algorithms can reuse trajectories sampled from arbitrary policies, which is what makes them sample efficient (see the replay-buffer sketch after this list).
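To make the sample-efficiency point above concrete, here is a minimal replay-buffer sketch in Python; the class and method names are assumptions for illustration, not the implementation used in the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal sketch of experience reuse in off-policy RL: transitions
    collected by arbitrary (past) policies are stored and resampled for
    many gradient updates instead of being discarded after one use."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size=256):
        # Uniform sampling over all stored transitions, regardless of
        # which behavior policy generated them.
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))
```

Reusing every stored transition for many updates is what makes off-policy methods attractive on real hardware, where data collection time dominates the cost of training.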
Highlights
  • Reinforcement learning (RL) has the potential of enabling autonomous agents to exhibit intricate behaviors and solve complex tasks from high-dimensional sensory input without hand-engineered policies or features [54, 48, 37, 35, 16]
  • We work in a Markov decision process (MDP) M = (S, A, p, r), where S denotes the state space of the agent, A denotes the action space, p : S × S × A → [0, ∞) denotes the underlying dynamics of the agent-environment interaction, which can be sampled starting from the initial state distribution p0 : S → [0, ∞), and r : S × A → [0, ∞) denotes the reward function.
  • We evaluate against the off-DADS variant in which the skill-dynamics model is trained on on-policy samples from the current policy.
  • We demonstrate that off-DADS can be deployed for real-world, reward-free reinforcement learning; a sketch of the skill-dynamics intrinsic reward it optimizes follows this list.
  • The improved sample efficiency from off-policy learning enabled the algorithm to be applied on real hardware, a quadruped with 12 degrees of freedom, to learn various locomotion gaits in under 20 hours without human-designed reward functions or hard-coded primitives.
  • We detail the successes and challenges encountered in our experiments, and hope our work offers an important foundation toward the goal of unsupervised, continual reinforcement learning of robots in the real world for many days with zero human intervention.
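As a rough illustration of the reward-free objective underlying DADS and off-DADS, the sketch below computes a skill-dynamics-based intrinsic reward of the form described by Sharma et al. [52]: a transition receives high reward when it is predictable under the skill that produced it but hard to explain under other skills drawn from the prior. The `skill_dynamics` callable and its `log_prob` interface are assumptions for illustration, not the authors' API.

```python
import numpy as np

def dads_intrinsic_reward(skill_dynamics, s, z, s_next, prior_skills):
    """Sketch of a DADS-style intrinsic reward (after Sharma et al. [52]).

    `skill_dynamics(s, z)` is assumed to return a distribution whose
    `log_prob(s_next)` evaluates log q_phi(s' | s, z); `prior_skills` are
    L skills drawn from the prior p(z), used to approximate the marginal
    over skills in the denominator.
    """
    L = len(prior_skills)
    # Numerator: how predictable the transition is under the skill that produced it.
    log_q = skill_dynamics(s, z).log_prob(s_next)
    # Denominator: the same transition scored under L skills sampled from the prior.
    log_q_other = np.array([skill_dynamics(s, z_i).log_prob(s_next)
                            for z_i in prior_skills])
    log_marginal = np.logaddexp.reduce(log_q_other) - np.log(L)
    # r_z(s, a, s') ~ log q(s'|s, z) - log (1/L) sum_i q(s'|s, z_i):
    # high when the skill makes the transition both predictable and distinguishable.
    return log_q - log_marginal
```

In off-DADS, rewards of this form are computed for transitions drawn from a replay buffer rather than only from fresh on-policy rollouts, which is where the sample-efficiency gain over the original DADS comes from.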
Methods
  • The authors experimentally evaluate the robotic learning method, off-DADS, for unsupervised skill discovery.
  • The authors evaluate the off-DADS algorithm itself in isolation, on a set of standard benchmark tasks, to understand the gains in sample efficiency when compared to DADS proposed in [52], while ablating the role of hyperparameters and variants of off-DADS.
  • The authors evaluate the robotic learning method on D’Kitty from ROBEL [3], a real-world robotic benchmark suite.
Conclusion
  • The authors derived off-DADS, a novel off-policy variant of the mutual-information-based, reward-free reinforcement learning framework.
  • Given the dynamics-based formulation from [52], the authors further demonstrate that the acquired skills are directly useful for solving downstream tasks such as navigation, using online planning with no further learning (see the planning sketch after this list).
  • The authors detail the successes and challenges encountered in the experiments, and hope the work offers an important foundation toward the goal of unsupervised, continual reinforcement learning of robots in the real world for many days with zero human intervention.
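To make the zero-shot use of skills concrete, the following is a minimal sketch of planning over learned skills for a downstream task such as goal navigation, in the spirit of the online planning described in [52]. The `skill_dynamics_mean` and `reward_fn` callables and the simple random-shooting scheme are illustrative assumptions standing in for the authors' planner.

```python
import numpy as np

def plan_next_skill(skill_dynamics_mean, reward_fn, s0, skill_dim,
                    horizon=4, num_candidates=64, seed=0):
    """Sketch of zero-shot planning in skill space: sample candidate skill
    sequences, roll them out with the learned skill-dynamics model, score
    the predicted states with the downstream reward, and return the first
    skill of the best sequence (to be executed on the robot, then replanned)."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, skill_dim))
    returns = np.zeros(num_candidates)
    for i, skill_seq in enumerate(candidates):
        s = s0
        for z in skill_seq:
            s = skill_dynamics_mean(s, z)   # predicted state after running skill z
            returns[i] += reward_fn(s)      # e.g. negative distance to the goal
    return candidates[int(np.argmax(returns))][0]
```

Because the skill-dynamics model predicts the outcome of executing whole skills rather than low-level actions, this kind of planning needs no additional learning for the downstream task.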
Related work
  • Our work builds on a number of recent works [17, 32, 26, 36, 19, 41] that study end-to-end reinforcement learning of neural policies on real-world robot hardware, which poses significant challenges such as sample efficiency, reward engineering and measurements, resets, and safety [17, 11, 57]. Gu et al. [17], Kalashnikov et al. [26], Haarnoja et al. [19], and Nagabandi et al. [41] demonstrate that existing off-policy and model-based algorithms are sample efficient enough for real-world training of simple manipulation and locomotion skills given reasonable task rewards. Eysenbach et al. [12] and Zhu et al. [57] propose reset-free continual learning algorithms and demonstrate initial successes in simulated and real environments. To enable efficient reward-free discovery of skills, our work aims to address sample efficiency and reward-free learning jointly through a novel off-policy learning framework.

    Reward engineering has been a major bottleneck not only in robotics, but also in general RL domains. There are two kinds of approaches to alleviate this problem. The first kind involves recovering a task-specific reward function with alternative forms of specification, such as inverse RL [42, 1, 58, 22] or preference feedback [8]; however, these approaches still require non-trivial human effort. The second kind proposes an intrinsic motivation reward that can be applied to different MDPs to discover useful policies, such as curiosity for novelty [50, 43, 51, 6, 44, 9], entropy maximization [23, 46, 33, 15], and mutual information [27, 25, 10, 14, 13, 38, 52]. Our work extends the dynamics-based mutual-information objective of Sharma et al. [52] to sample-efficient off-policy learning.
References
  • Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1, 2004.
  • Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299, 2018.
  • Michael Ahn, Henry Zhu, Kristian Hartikainen, Hugo Ponte, Abhishek Gupta, Sergey Levine, and Vikash Kumar. ROBEL: Robotics benchmarks for learning with low-cost robots. arXiv preprint arXiv:1909.11639, 2019.
  • Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Adrien Baranes and Pierre-Yves Oudeyer. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61(1):49–73, 2013.
  • Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
  • Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
  • Cédric Colas, Pierre Fournier, Olivier Sigaud, Mohamed Chetouani, and Pierre-Yves Oudeyer. Curious: Intrinsically motivated modular multi-goal reinforcement learning. arXiv preprint arXiv:1810.06284, 2018.
  • Christian Daniel, Gerhard Neumann, and Jan Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pages 273–281, 2012.
  • Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.
  • Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. International Conference on Learning Representations (ICLR), 2018.
  • Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
  • Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.
  • Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. Conference on Robot Learning (CoRL), 2019.
  • Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
  • Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3389–3396. IEEE, 2017.
  • Shixiang Shane Gu, Timothy Lillicrap, Richard E Turner, Zoubin Ghahramani, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3846–3855, 2017.
  • Tuomas Haarnoja, Sehoon Ha, Aurick Zhou, Jie Tan, George Tucker, and Sergey Levine. Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103, 2018.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
  • Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in neural information processing systems, pages 6765–6774, 2017.
  • Elad Hazan, Sham M Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2018.
  • Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.
  • Tobias Jung, Daniel Polani, and Peter Stone. Empowerment for continuous agent-environment systems. Adaptive Behavior, 19(1):16–39, 2011.
  • Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
  • Alexander S Klyubin, Daniel Polani, and Chrystopher L Nehaniv. All else being equal be empowered. In European Conference on Artificial Life, pages 744–753.
  • Jens Kober and Jan R Peters. Policy search for motor primitives in robotics. In Advances in neural information processing systems, pages 849–856, 2009.
  • Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
  • Nate Kohl and Peter Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA’04. 2004, volume 3, pages 2619–2624. IEEE, 2004.
  • George Konidaris, Scott Kuindersma, Roderic Grupen, and Andrew Barto. Autonomous skill acquisition on a mobile manipulator. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
  • Vikash Kumar, Emanuel Todorov, and Sergey Levine. Optimal control with learned local models: Application to dexterous manipulation. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 378–383. IEEE, 2016.
  • Lisa Lee, Benjamin Eysenbach, Emilio Parisotto, Eric Xing, Sergey Levine, and Ruslan Salakhutdinov. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274, 2019.
  • Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. Learning multi-level hierarchies with hindsight. arXiv preprint arXiv:1712.00948, 2017.
  • Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • A Rupam Mahmood, Dmytro Korenkevych, Gautham Vasan, William Ma, and James Bergstra. Benchmarking reinforcement learning algorithms on real-world robots. arXiv preprint arXiv:1809.07731, 2018.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 2125–2133, 2015.
  • Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
  • Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3303–3313, 2018.
  • Anusha Nagabandi, Kurt Konolige, Sergey Levine, and Vikash Kumar. Deep dynamics models for learning dexterous manipulation. arXiv preprint arXiv:1909.11652, 2019.
  • Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, pages 663–670, 2000.
  • Pierre-Yves Oudeyer and Frédéric Kaplan. What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1:6, 2009.
  • Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
  • Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
  • Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
  • Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
  • Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328.
  • Martin Riedmiller, Thomas Gabel, Roland Hafner, and Sascha Lange. Reinforcement learning for robot soccer. Autonomous Robots, 27(1):55–73, 2009.
  • Jürgen Schmidhuber. Curious model-building control systems. In Proc. International Joint Conference on Neural Networks, pages 1458–1463, 1991.
  • Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
  • Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. International Conference on Learning Representations (ICLR), 2020.
  • Martin Stolle and Doina Precup. Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation, pages 212–223.
  • Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
  • Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
  • Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
  • Henry Zhu, Justin Yu, Abhishek Gupta, Dhruv Shah, Kristian Hartikainen, Avi Singh, Vikash Kumar, and Sergey Levine. The ingredients of real world robotic reinforcement learning. International Conference on Learning Representations (ICLR), 2020.
  • Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.