GTI: Learning to Generalize across Long-Horizon Tasks from Human Demonstrations

RSS 2020.

TL;DR
We present a novel imitation learning framework to enable robots to 1) learn complex real world manipulation tasks efficiently from a small number of human demonstrations, and 2) synthesize new behaviors not contained in the collected demonstrations.

Abstract

Imitation learning is an effective and safe technique to train robot policies in the real world because it does not depend on an expensive random exploration process. However, due to the lack of exploration, learning policies that generalize beyond the demonstrated behaviors is still an open challenge. We present a novel imitation learning framework to enable robots to 1) learn complex real world manipulation tasks efficiently from a small number of human demonstrations, and 2) synthesize new behaviors not contained in the collected demonstrations.

Introduction
  • Imitation Learning (IL) is a promising paradigm to train physical robots on complex manipulation skills by replicating behavior from expert demonstrations [32, 24].
  • The performance of IL depends on providing expert demonstrations that cover a wide variety of situations [34].
  • Expecting this kind of coverage from a fixed set of expert demonstrations is often unrealistic, especially for long-horizon multi-stage manipulation tasks, due to the combinatorial nature of possible task instances and valid solutions.
  • Due to the often complex correlation between behavior and task specification, methods that condition policies on explicit task specifications require large amounts of annotated demonstrations [25] and do not scale well to physical robots in the real world.
Highlights
  • Imitation Learning (IL) is a promising paradigm to train physical robots on complex manipulation skills by replicating behavior from expert demonstrations [32, 24].
  • We measure four metrics, averaged over all start locations and rollouts: (1) the Goal Reach Rate, the percentage of rollouts that reach the lower left or lower right corners; (2) the Seen Behavior metric, the percentage of goal-reaching rollouts that start and end on opposite sides of the y-axis; (3) the Unseen Behavior metric, the percentage of goal-reaching rollouts that start and end on the same side of the y-axis; and (4) the Occupancy metric, which is 100% for a start location whose rollouts reach goals on both the lower left and lower right, 50% if only a single goal is reached, and 0% if no goals are reached.
  • Goal-Conditioned Behavioral Cloning excels at reproducing start and goal combinations from the training data, but completely fails on start and goal combinations that are unseen, resulting in a 50% success rate, no unseen behavior, and the paths depicted in the top Goal-Conditioned Behavioral Cloning column of Fig. 4.
  • We presented Generalization Through Imitation (GTI), a novel algorithm that achieves compositional task generalization from a set of task demonstrations by leveraging trajectory crossings to generalize to unseen combinations of task initializations and desired goals.
  • We demonstrated that Generalization Through Imitation is able both to reproduce behaviors from demonstrations and, more importantly, to generalize to novel start and goal configurations, both in simulated domains and in a challenging real world kitchen domain.
  • Leveraging Generalization Through Imitation in the context of unstructured “play” data [26] or large scale crowdsourced robotic manipulation datasets [28] is a promising direction, as there are likely to be many trajectories that intersect at several locations in the state space.
Methods
  • The authors propose a two-stage approach to achieve compositional generalization by extracting information from intersecting demonstrations (Fig. 2).
  • In Stage 1, a stochastic policy is trained to reproduce the diverse behaviors in the demonstrations through multimodal imitation learning.
  • In Stage 2, the authors collect sample trajectories from the multimodal imitation agent and distill them into a single goal-directed policy.
  • The goal-directed policy is the final result of the method and allows the authors to control the robot to demonstrate new behaviors, solving pairs (s₀, g) of initial state and goal never demonstrated by a human.
  • In the following, the authors explain each stage in detail; a minimal sketch of the two-stage pipeline follows this list.
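
To make the two stages concrete, the sketch below shows one plausible instantiation in PyTorch. The conditional-VAE formulation of the Stage 1 stochastic policy, the hindsight-style relabeling of rollout end states as goals in Stage 2, and all module and function names (Encoder, LatentPolicy, GoalPolicy, stage2_distill) are illustrative assumptions; the paper's exact architectures and losses may differ.

```python
# Illustrative two-stage GTI training sketch (PyTorch). All architectures,
# names, and hyperparameters here are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Stage 1 cVAE encoder: maps a (state, action) pair to a latent
    'intention' distribution that captures which behavior mode is shown."""
    def __init__(self, s_dim, a_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, s, a):
        h = self.net(torch.cat([s, a], dim=-1))
        return self.mu(h), self.logvar(h)


class LatentPolicy(nn.Module):
    """Stage 1 decoder: predicts an action from the state and a latent code.
    Sampling z ~ N(0, I) at rollout time yields diverse, multimodal behavior."""
    def __init__(self, s_dim, a_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, a_dim))

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))


def stage1_loss(enc, pol, s, a, beta=1e-3):
    """cVAE-style multimodal behavioral cloning: reconstruction + weighted KL."""
    mu, logvar = enc(s, a)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
    recon = F.mse_loss(pol(s, z), a)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl


class GoalPolicy(nn.Module):
    """Stage 2: a single policy conditioned on a goal observation."""
    def __init__(self, s_dim, g_dim, a_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + g_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, a_dim))

    def forward(self, s, g):
        return self.net(torch.cat([s, g], dim=-1))


def stage2_distill(goal_pol, rollouts, epochs=10, lr=1e-3):
    """Distill self-generated Stage 1 rollouts into a goal-directed policy.
    Each rollout is a list of (state, action) tensors; its final state is
    relabeled as the goal for every step (hindsight-style relabeling)."""
    opt = torch.optim.Adam(goal_pol.parameters(), lr=lr)
    for _ in range(epochs):
        for traj in rollouts:
            g = traj[-1][0]                          # final state as the goal
            s = torch.stack([step[0] for step in traj])
            a = torch.stack([step[1] for step in traj])
            loss = F.mse_loss(goal_pol(s, g.expand_as(s)), a)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

At test time, sampling z ~ N(0, I) lets the Stage 1 policy generate diverse trajectories from each start state, including compositions that cross between demonstrations; the distilled Stage 2 policy can then be queried with (s₀, g) pairs that were never jointly demonstrated.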
Results
  • Evaluation of Simulation Experiments

    In Table I, the authors report results on both the PointCross and PointCrossStay tasks across BC, GCBC, and GTI.
  • On PointCross, BC consistently reaches goal locations 100% of the time, but it collapses to exactly one goal location per start location and is unable to reach both goal locations from a single start location, resulting in 50% occupancy.
  • GCBC excels at reproducing start and goal combinations from the training data, but completely fails on start and goal combinations that are unseen, resulting in a 50% success rate, no unseen behavior, and the paths depicted in the top GCBC column of Fig. 4.
  • GTI, in contrast, produces trajectories that reach both the lower left and lower right goals from both the top left and top right starting states, as shown in the second-from-right top panel of Fig. 4; a sketch of how these metrics are computed follows this list.
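
As a concrete reference for these numbers, here is a small sketch of how the four metrics defined in the Highlights could be computed for PointCross. The rollout record format (start position, end position, goal-reached flag) and the convention that the y-axis separates left from right are assumptions made for illustration.

```python
# Illustrative computation of the four PointCross metrics. Assumes each
# rollout is recorded as (start_xy, end_xy, reached_goal), with the two goals
# at the lower left / lower right and the y-axis separating left from right.
from collections import defaultdict


def evaluate(rollouts):
    by_start = defaultdict(list)
    for start, end, reached in rollouts:
        by_start[tuple(start)].append((start, end, reached))

    goal_rates, occupancies = [], []
    seen = unseen = total_reached = 0
    for runs in by_start.values():
        reached_runs = [r for r in runs if r[2]]
        goal_rates.append(len(reached_runs) / len(runs))
        goal_sides = set()
        for start, end, _ in reached_runs:
            if (start[0] < 0) == (end[0] < 0):
                unseen += 1   # same side of the y-axis: never demonstrated
            else:
                seen += 1     # opposite sides: behavior present in the demos
            goal_sides.add(end[0] < 0)
        total_reached += len(reached_runs)
        # 100% if both goals reached from this start, 50% for one, 0% for none
        occupancies.append({0: 0.0, 1: 0.5, 2: 1.0}[len(goal_sides)])

    n = max(total_reached, 1)
    return {
        "goal_reach_rate": sum(goal_rates) / len(goal_rates),
        "seen_behavior": seen / n,
        "unseen_behavior": unseen / n,
        "occupancy": sum(occupancies) / len(occupancies),
    }
```

Under this accounting, a BC policy that always reaches one fixed goal per start location scores a 100% goal reach rate but only 50% occupancy, matching the numbers above; GTI's ability to reach both goals from each start is what pushes occupancy toward 100%.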
Conclusion
  • Common robotic manipulation tasks and domains possess intersectional structure, where trajectories intersect at different states.
  • Leveraging GTI in the context of unstructured “play” data [26] or large scale crowdsourced robotic manipulation datasets [28] is a promising direction, as there are likely to be many trajectories that intersect at several locations in the state space.
Tables
  • Table I: Quantitative Stage 1 evaluation on simulated domains: performance of BC, GCBC, and GTI in demonstrating a combination of seen and unseen behavior in order to reach both goal locations from both start locations.
  • Table II: Quantitative evaluation on the Panda kitchen task: success rate (SR) of BC, GCBC, and GTI in reproducing demonstrated and novel behavior in the undirected setting (Stage 1) and the goal-directed setting (Stage 2).
Related Work
  • Imitation learning has been applied to multiple domains such as playing table tennis [22], Go [37], and video games [33], and driving autonomously [4, 31]. In robotics, the idea of robots learning from human demonstrations has been extensively explored [36, 17, 23, 11, 14, 3, 5]. Imitation learning can be used to obtain a policy for a task either by learning a mapping from observations to actions offline (e.g., behavioral cloning, BC [2]) or by inferring an underlying reward function to solve with RL (inverse reinforcement learning, IRL [35]). However, both BC and IRL require multiple demonstrations of the same short-horizon manipulation task and suffer when trying to imitate long-horizon activities, even when these are composed of demonstrated short-horizon skills. In this work we present a method to leverage multi-task demonstrations of long-horizon manipulation via composition of demonstrated skills.

    Researchers have mainly approached long-horizon imitation learning in two ways: a) one-shot imitation learning (OSIL), and b) hierarchical planning with imitation. In OSIL the goal is to train an imitator policy that learns to perform tasks based on a single demonstration. The task can be represented as a full video [41], a sequence of image keyframes [16], or a trajectory in state space [10], while the training objective can be maximizing the likelihood of actions given an expert trajectory as in [10, 41, 16, 30], matching the distribution of demonstrations (GAIL [40]), or following a trajectory based on learned dynamics models (AVID [38]). In this work, we assume access to neither task demonstrations nor instructions at test time; we exploit variability and compositionality within the small set of given demonstrations to generate new strategies.
Funding
  • Ajay Mandlekar acknowledges the support of the Department of Defense (DoD) through the NDSEG program.
  • We acknowledge the support of the Toyota Research Institute (“TRI”); this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
References
  • Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Michael Bain and Claude Sammut. A framework for behavioural cloning. In Machine Intelligence 15, pages 103–129, 1995.
  • Aude Billard, Sylvain Calinon, Rüdiger Dillmann, and Stefan Schaal. Robot programming by demonstration. In Springer Handbook of Robotics, 2008.
  • Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
  • Sylvain Calinon, Florent D’halluin, Eric L. Sauser, Darwin G. Caldwell, and Aude Billard. Learning and reproduction of gestures by imitation. IEEE Robotics and Automation Magazine, 17:44–54, 2010.
  • Jonathan Chang, Nishanth Kumar, Sean Hastings, Aaron Gokaslan, Diego Romeres, Devesh Jha, Daniel Nikovski, George Konidaris, and Stefanie Tellex. Learning deep parameterized skills from demonstration for re-targetable visuomotor control. arXiv preprint arXiv:1910.10628, 2019.
  • John Co-Reyes, YuXuan Liu, Abhishek Gupta, Benjamin Eysenbach, Pieter Abbeel, and Sergey Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1009–1018, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
  • Felipe Codevilla, Matthias Müller, Antonio Lopez, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–9. IEEE, 2018.
  • Christian Daniel, Gerhard Neumann, and Jan Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pages 273–281, 2012.
  • Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems, pages 1087–1098, 2017.
  • Peter Englert and Marc Toussaint. Learning manipulation skills from a single demonstration. The International Journal of Robotics Research, 37(1):137–154, 2018.
  • Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning diverse skills without a reward function. In International Conference on Learning Representations, 2019.
  • Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 512–519. IEEE, 2016.
  • Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, 2017.
  • Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. In International Conference on Learning Representations, 2017.
  • De-An Huang, Suraj Nair, Danfei Xu, Yuke Zhu, Animesh Garg, Li Fei-Fei, Silvio Savarese, and Juan Carlos Niebles. Neural task graphs: Generalizing to unseen tasks from a single video demonstration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8565–8574, 2019.
  • Auke Jan Ijspeert, Jun Nakanishi, and Stefan Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings of the 2002 IEEE International Conference on Robotics and Automation, volume 2, pages 1398–1403, 2002.
  • Dinesh Jayaraman, Frederik Ebert, Alexei A. Efros, and Sergey Levine. Time-agnostic prediction: Predicting predictable video frames. In International Conference on Learning Representations, 2019.
  • Oussama Khatib. A unified approach for motion and force control of robot manipulators: The operational space formulation. IEEE Journal on Robotics and Automation, 3(1):43–53, 1987.
  • Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Thomas Kipf, Yujia Li, Hanjun Dai, Vinicius Zambaldi, Alvaro Sanchez-Gonzalez, Edward Grefenstette, Pushmeet Kohli, and Peter Battaglia. CompILE: Compositional imitation learning and execution. In International Conference on Machine Learning (ICML), 2019.
  • Jens Kober and Jan Peters. Learning motor primitives for robotics. In 2009 IEEE International Conference on Robotics and Automation, pages 2112–2118. IEEE, 2009.
  • Jens Kober and Jan Peters. Imitation and reinforcement learning. IEEE Robotics and Automation Magazine, 17:55–62, 2010.
  • Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196, 2018.
  • Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. arXiv preprint arXiv:1903.01973, 2019.
  • Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. In Conference on Robot Learning, 2019.
  • Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. arXiv preprint arXiv:1811.02790, 2018.
  • Ajay Mandlekar, Jonathan Booher, Max Spero, Albert Tung, Anchit Gupta, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Scaling robot supervision to hundreds of hours with RoboTurk: Robotic manipulation dataset through human reasoning and dexterity. arXiv preprint arXiv:1911.04052, 2019.
  • Ajay Mandlekar, Fabio Ramos, Byron Boots, Li Fei-Fei, Animesh Garg, and Dieter Fox. IRIS: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. In IEEE International Conference on Robotics and Automation (ICRA), 2019.
  • Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A. Efros, and Trevor Darrell. Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2050–2053, 2018.
  • Ashwini Pokle, Roberto Martín-Martín, Patrick Goebel, Vincent Chow, Hans M. Ewald, Junwei Yang, Zhenkai Wang, Amir Sadeghian, Dorsa Sadigh, Silvio Savarese, and Marynel Vázquez. Deep local trajectory replanning and control for robot navigation. In International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, May 20–24, 2019, pages 5815–5822, 2019.
  • Dean A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.
  • Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2010.
  • Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
  • Stuart Russell. Learning agents for uncertain environments. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 101–103, 1998.
  • Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3:233–242, 1999.
  • David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • Laura Smith, Nikita Dhawan, Marvin Zhang, Pieter Abbeel, and Sergey Levine. AVID: Learning multi-stage tasks via pixel-level translation of human videos. arXiv preprint arXiv:1912.04443, 2019.
  • Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999.
  • Ziyu Wang, Josh S. Merel, Scott E. Reed, Nando de Freitas, Gregory Wayne, and Nicolas Heess. Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, pages 5320–5329, 2017.
  • Danfei Xu, Suraj Nair, Yuke Zhu, Julian Gao, Animesh Garg, Li Fei-Fei, and Silvio Savarese. Neural task programming: Learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.