Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates

ICRA, pp. 3389-3396, 2017.


Abstract:

Reinforcement learning holds the promise of enabling autonomous robots to learn large repertoires of behavioral skills with minimal human intervention. However, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favor of achieving training times that are practical for real physical systems...

Introduction
  • Reinforcement learning methods have been applied to a range of robotic control tasks, from locomotion [1], [2] to manipulation [3], [4], [5], [6] and autonomous vehicle control [7].
  • The authors present a novel asynchronous variant of NAF, evaluate the speedup obtained with varying numbers of learners in simulation, and demonstrate real-world results with parallelism across multiple robots.
  • An illustration of these robots learning a door opening task is shown in Figure 1.
Highlights
  • Reinforcement learning methods have been applied to a range of robotic control tasks, from locomotion [1], [2] to manipulation [3], [4], [5], [6] and autonomous vehicle control [7].
  • We show that recently proposed deep reinforcement learning algorithms based on off-policy training of deep Q-functions [10], [11] can be extended to learn complex manipulation policies from scratch, without user-provided demonstrations, and using only general-purpose neural network representations that do not require task-specific domain knowledge.
  • We presented an asynchronous deep reinforcement learning approach that can be used to learn complex robotic manipulation skills from scratch on real physical robotic manipulators.
  • We demonstrate that our approach can learn a complex door opening task with only a few hours of training, and our simulated results show that training times decrease with more learners.
  • Our technical contribution consists of a novel asynchronous version of the normalized advantage functions (NAF) deep reinforcement learning algorithm, together with a number of practical extensions that enable safe and efficient deep reinforcement learning on physical systems (the NAF parametrization is sketched after this list). Our experiments confirm the benefits of nonlinear deep neural network policies over simpler shallow representations for complex robotic manipulation tasks.
  • While we’ve shown that deep off-policy reinforcement learning algorithms are capable of learning complex manipulation skills from scratch and without purpose-built representations, our method has a number of limitations.
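
    For reference, the NAF parametrization mentioned above restricts the Q-function so that its maximizing action is available in closed form, which is what makes off-policy Q-learning tractable with continuous actions. The following is a minimal sketch following the formulation in [11]; the parameter symbols below come from that paper, not from this summary:

      Q(x, u \mid \theta^{Q}) = V(x \mid \theta^{V}) + A(x, u \mid \theta^{A})
      A(x, u \mid \theta^{A}) = -\tfrac{1}{2}\,(u - \mu(x \mid \theta^{\mu}))^{\top}\, P(x \mid \theta^{P})\,(u - \mu(x \mid \theta^{\mu}))

    Because P(x \mid \theta^{P}) is constrained to be positive-definite, the greedy action is always u = \mu(x \mid \theta^{\mu}), so the Q-learning target can be computed without a separate actor network or an inner optimization over actions.
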
Conclusion
  • DISCUSSION AND FUTURE WORK

    The authors presented an asynchronous deep reinforcement learning approach that can be used to learn complex robotic manipulation skills from scratch on real physical robotic manipulators.
  • The authors' technical contribution consists of a novel asynchronous version of the normalized advantage functions (NAF) deep reinforcement learning algorithm, together with a number of practical extensions that enable safe and efficient deep reinforcement learning on physical systems (a structural sketch of the asynchronous setup follows this list). The experiments confirm the benefits of nonlinear deep neural network policies over simpler shallow representations for complex robotic manipulation tasks.
  • If the reward consists only of a binary success signal, both tasks become substantially more difficult and require considerably more exploration.
  • However, such simple binary rewards may be substantially easier to engineer in many practical robotic learning applications.
  • Improving exploration and learning speed in future work, so as to enable the use of such sparse rewards, would further improve the practical applicability of the class of methods explored here.
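
    The asynchronous setup described above can be pictured as several collector workers (one per robot or simulator instance) writing experience into a shared replay pool while a single trainer thread samples off-policy minibatches, performs the Q-function updates, and refreshes the shared policy parameters that the collectors read. The Python sketch below shows only that structure under simplified assumptions; the names (collector, trainer, toy_env_step, mu_gain) and the toy dynamics are illustrative stand-ins rather than the authors' implementation, and the actual NAF gradient step is left as a stub.

      import random
      import threading
      from collections import deque

      # Shared replay pool: collectors append transitions, the trainer samples minibatches.
      buffer_lock = threading.Lock()
      replay_pool = deque(maxlen=100000)

      def toy_env_step(state, action):
          # Placeholder dynamics and reward standing in for a real robot or simulator.
          next_state = 0.9 * state + 0.1 * action
          return next_state, -abs(next_state)

      def collector(worker_id, params, env_step, n_episodes=10, horizon=100):
          # One worker rolls out the current policy with exploration noise and
          # pushes (state, action, reward, next_state) tuples into the shared pool.
          rng = random.Random(worker_id)
          for _ in range(n_episodes):
              state = 1.0
              for _ in range(horizon):
                  action = params["mu_gain"] * state + rng.gauss(0.0, 0.1)
                  next_state, reward = env_step(state, action)
                  with buffer_lock:
                      replay_pool.append((state, action, reward, next_state))
                  state = next_state

      def trainer(params, n_updates=1000, batch_size=64):
          # Central trainer: samples off-policy minibatches; the NAF Q-learning
          # update would be applied here (stubbed out in this sketch).
          for _ in range(n_updates):
              with buffer_lock:
                  if len(replay_pool) < batch_size:
                      continue
                  batch = random.sample(list(replay_pool), batch_size)
              del batch  # naf_update(params, batch) would go here (hypothetical helper).

      if __name__ == "__main__":
          shared_params = {"mu_gain": -0.5}
          threads = [threading.Thread(target=collector, args=(i, shared_params, toy_env_step))
                     for i in range(4)]
          threads.append(threading.Thread(target=trainer, args=(shared_params,)))
          for t in threads:
              t.start()
          for t in threads:
              t.join()
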
Related work
  • Applications of reinforcement learning (RL) in robotics have included locomotion [1], [2], manipulation [3], [4], [5], [6], and autonomous vehicle control [7].

    Many of the RL methods demonstrated on physical robotic systems have used relatively low-dimensional policy representations, typically with under one hundred parameters, due to the difficulty of efficiently optimizing high-dimensional policy parameter vectors [12]. Although there has been considerable research on reinforcement learning with general-purpose neural networks for some time [13], [14], [15], [16], [17], such methods have only recently been developed to the point where they could be applied to continuous control of high-dimensional systems, such as 7 degree-of-freedom (DoF) arms, and with large and deep neural networks [18], [19], [10], [11]. This has made it possible to learn complex skills with minimal manual engineering, though it has remained unclear whether such approaches could be adapted to real systems given their sample complexity.

    In robotic learning scenarios, prior work has explored both model-based and model-free learning algorithms. Model-based algorithms have explored a variety of dynamics estimation schemes, including Gaussian processes [20], mixture models [21], and local linear system estimation [22], with a more detailed overview in a recent survey [23]. Deep neural network policies have been combined with model-based learning in the context of guided policy search algorithms [19], which use a model-based teacher to train deep network policies. Such methods have been successful on a range of real-world tasks, but rely on the ability of the model-based teacher to discover good trajectories for the goal task. As shown in recent work, this can be difficult in domains with severe discontinuities in the dynamics and reward function [24].
References
  • N. Kohl and P. Stone, “Policy gradient reinforcement learning for fast quadrupedal locomotion,” in International Conference on Robotics and Automation (ICRA), 2004.
  • G. Endo, J. Morimoto, T. Matsubara, J. Nakanishi, and G. Cheng, “Learning CPG-based biped locomotion with a policy gradient method: Application to a humanoid robot,” International Journal of Robotics Research, vol. 27, no. 2, pp. 213–228, 2008.
  • J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
  • E. Theodorou, J. Buchli, and S. Schaal, “Reinforcement learning of motor skills in high dimensions,” in International Conference on Robotics and Automation (ICRA), 2010.
  • J. Peters, K. Mulling, and Y. Altun, “Relative entropy policy search,” in AAAI Conference on Artificial Intelligence, 2010.
  • M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal, “Learning force control policies for compliant manipulation,” in International Conference on Intelligent Robots and Systems (IROS), 2011.
  • P. Abbeel, A. Coates, M. Quigley, and A. Ng, “An application of reinforcement learning to aerobatic helicopter flight,” in Advances in Neural Information Processing Systems (NIPS), 2006.
  • J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
  • P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal, “Learning and generalization of motor skills by learning from demonstration,” in International Conference on Robotics and Automation (ICRA), 2009.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), 2016.
  • S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep Q-learning with model-based acceleration,” in International Conference on Machine Learning (ICML), 2016.
  • M. Deisenroth, G. Neumann, and J. Peters, “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, no. 1-2, pp. 1–142, 2013.
  • K. J. Hunt, D. Sbarbaro, R. Zbikowski, and P. J. Gawthrop, “Neural networks for control systems: A survey,” Automatica, vol. 28, no. 6, pp. 1083–1112, Nov. 1992.
  • M. Riedmiller, “Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method,” in European Conference on Machine Learning. Springer, 2005, pp. 317–328.
  • R. Hafner and M. Riedmiller, “Neural reinforcement learning controllers for a real robot application,” in International Conference on Robotics and Automation (ICRA), 2007.
  • M. Riedmiller, S. Lange, and A. Voigtlaender, “Autonomous reinforcement learning on raw visual input data in a real world application,” in International Joint Conference on Neural Networks, 2012.
  • J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, “Evolving large-scale neural networks for vision-based reinforcement learning,” in Conference on Genetic and Evolutionary Computation, ser. GECCO ’13, 2013.
  • J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, “Trust region policy optimization,” in International Conference on Machine Learning (ICML), 2015.
  • S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research (JMLR), vol. 17, 2016.
  • M. Deisenroth and C. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search,” in International Conference on Machine Learning (ICML), 2011.
  • T. Moldovan, S. Levine, M. Jordan, and P. Abbeel, “Optimism-driven exploration for nonlinear systems,” in International Conference on Robotics and Automation (ICRA), 2015.
  • R. Lioutikov, A. Paraschos, G. Neumann, and J. Peters, “Sample-based information-theoretic stochastic optimal control,” in International Conference on Robotics and Automation (ICRA), 2014.
  • M. P. Deisenroth, G. Neumann, J. Peters et al., “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, no. 1-2, pp. 1–142, 2013.
  • Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine, “Path integral guided policy search,” arXiv preprint arXiv:1610.00529, 2016.
  • R. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, pp. 229–256, May 1992.
  • C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • R. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems (NIPS), 1999.
  • J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, “Evolving large-scale neural networks for vision-based reinforcement learning,” in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation. ACM, 2013, pp. 1061–1068.
  • V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning (ICML), 2016, pp. 1928–1937.
  • R. Hafner and M. Riedmiller, “Reinforcement learning in feedback control,” Machine Learning, vol. 84, no. 1-2, pp. 137–169, 2011.
  • M. Inaba, S. Kagami, F. Kanehiro, and Y. Hoshino, “A platform for robotics research based on the remote-brained robot approach,” International Journal of Robotics Research, vol. 19, no. 10, 2000.
  • J. Kuffner, “Cloud-enabled humanoid robots,” in IEEE-RAS International Conference on Humanoid Robotics, 2010.
  • B. Kehoe, A. Matsukawa, S. Candido, J. Kuffner, and K. Goldberg, “Cloud-based robot grasping with the Google object recognition engine,” in IEEE International Conference on Robotics and Automation (ICRA), 2013.
  • B. Kehoe, S. Patil, P. Abbeel, and K. Goldberg, “A survey of research on cloud robotics and automation,” IEEE Transactions on Automation Science and Engineering, vol. 12, no. 2, April 2015.
  • A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, and S. Levine, “Collective robot reinforcement learning with distributed asynchronous guided policy search,” arXiv preprint arXiv:1610.00673, 2016.
  • E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 5026–5033.
  • J. Ba and D. Kingma, “Adam: A method for stochastic optimization,” 2015.
  • J. Kober and J. Peters, “Learning motor primitives for robotics,” in International Conference on Robotics and Automation (ICRA), 2009.
  • R. Tedrake, T. W. Zhang, and H. S. Seung, “Learning to walk in 20 minutes.”
  • S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  • A. Rusu, S. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell, “Policy distillation,” in International Conference on Learning Representations (ICLR), 2016.