Learning Latent Plans from Play

Corey Lynch
Mohi Khansari
Ted Xiao
Vikash Kumar

CoRL, pp. 1113-1132, 2019.

TL;DR: We propose self-supervising control on top of human teleoperated play data as a way to scale up skill learning.

Abstract:

We propose learning from teleoperated play data (LfP) as a way to scale up multi-task robotic skill learning. Learning from play (LfP) offers three main advantages: 1) It is cheap. Large amounts of play data can be collected quickly as it does not require scene staging, task segmenting, or resetting to an initial state. 2) It is general…

Introduction
  • There has been significant recent progress showing that robots can be trained to be competent specialists, learning complex individual skills like grasping ([1]), locomotion, and dexterous manipulation ([2]).
  • Using reinforcement learning in complex settings such as robotics requires overcoming significant exploration challenges, typically addressed by introducing manually scripted primitives into an otherwise unsupervised collection process ([4]) that increase the likelihood of behavior with non-zero reward.
  • In general, for both paradigms, each new skill a robot is required to perform demands a corresponding, sizeable, and non-transferable human effort.
Highlights
  • There has been significant recent progress showing that robots can be trained to be competent specialists, learning complex individual skills like grasping ([1]), locomotion, and dexterous manipulation ([2]).
  • Obtaining multiple skills involves defining a discrete set of tasks we care about, collecting a large number of labeled and segmented expert demonstrations per task, and training one specialist policy per task in a learning from demonstration (LfD) [3] scenario.
  • We aim to answer the following questions: 1) Can a single play-supervised policy generalize to a wide variety of user-specified visual manipulation tasks, despite not being trained on task-specific data? 2) Are play-supervised models trained on cheap-to-collect play data (LfP) competitive with specialist models trained on expensive expert demonstrations for each task (LfD)? 3) Does decoupling latent plan inference and plan decoding into independent problems, as is done in Play-LMP, improve performance over goal-conditioned Behavioral Cloning (Play-GCBC)? (A goal-conditioned BC sketch follows this list.)
  • A single task-agnostic Play-LMP policy, trained on unlabeled play data, generalizes with 85.5% success to the 18 test-time tasks with no finetuning, outperforming a collection of 18 expert-trained BC policies that reach 70.3% average success. This holds true even when Play-LMP is artificially restricted to only 30 minutes of play data (71.8%), despite play being easier and cheaper to collect than expert demonstrations.
  • Robustness: In Fig. 6b, we find that models trained on play data (Play-LMP and Play-GCBC) are significantly more robust to perturbations than the model trained on expert demonstrations only (BC), a phenomenon we attribute to the inherent coverage properties of play data over demonstration data.
  • We introduce a self-supervised plan representation learning algorithm able to discover task semantics despite never seeing any task labels.
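For concreteness, below is a minimal PyTorch-style sketch of the goal-conditioned behavioral cloning idea behind the Play-GCBC baseline: a single policy π(a | s, s_g) regressed onto play actions, with the last state of each sampled window relabeled as the goal. The network sizes, the L2 action loss, and all names here are illustrative assumptions, not the authors' released implementation.

```python
# Sketch only (assumed details): a goal-conditioned BC policy in the
# spirit of Play-GCBC. The final state of each play window is relabeled
# as the goal that the window's actions implicitly "demonstrate".
import torch
import torch.nn as nn


class GoalConditionedPolicy(nn.Module):
    """pi(a | s, s_g): maps current state and goal state to an action."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))


def gcbc_loss(policy, states, actions):
    """states: (B, T, state_dim), actions: (B, T, action_dim) play windows.
    Hindsight relabeling: the last state of each window is the goal."""
    goal = states[:, -1:, :].expand_as(states)   # broadcast goal over time
    pred = policy(states, goal)
    return ((pred - actions) ** 2).mean()        # L2 surrogate for -log pi
```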
Methods
  • The authors aim to answer the following questions: 1) Can a single play-supervised policy generalize to a wide variety of user-specified visual manipulation tasks, despite not being trained on task-specific data? 2) Are play-supervised models trained on cheap-to-collect play data (LfP) competitive with specialist models trained on expensive expert demonstrations for each task (LfD)? 3) Does decoupling latent plan inference and plan decoding into independent problems, as is done in Play-LMP, improve performance over goal-conditioned Behavioral Cloning (Play-GCBC)? (See the Play-LMP sketch after this list.)

    Tasks and Dataset: The authors define 18 visual manipulation tasks in the same environment that play was collected in (Fig. 3 and A.3.1).
  • To compare the play-supervised models to a conventional scenario, the authors collect a training set of 100 expert demonstrations per task in the environment, and train one behavioral cloning policy (BC, details in A.1.2) on the corresponding expert dataset.
  • This results in 1800 demonstrations total or ∼1.5 hours of expert data.
  • The motivation of the state-based experiments is to understand how all methods compare on the control problem independent of visual representation learning, which could potentially be improved separately via other self-supervised methods, e.g., Sermanet et al. [37].
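As referenced in the first bullet above, the following is a minimal PyTorch-style sketch of the decoupling Play-LMP is described as using: a plan recognition network infers a latent plan from a full play window, a plan proposal network predicts a distribution over plans from only the current state and goal, and a policy decodes actions conditioned on state, goal, and the sampled plan. The network sizes, Gaussian latent, L2 action loss, and β weight are assumptions for illustration rather than the authors' architecture.

```python
# Sketch only (assumed sizes/losses): the Play-LMP decoupling of latent
# plan inference from plan decoding, trained like a conditional VAE
# over short windows of play.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class PlanRecognition(nn.Module):
    """Posterior q(z | full window of states and actions)."""

    def __init__(self, state_dim, action_dim, latent_dim, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(state_dim + action_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2 * latent_dim)

    def forward(self, states, actions):           # (B, T, ...)
        _, h = self.rnn(torch.cat([states, actions], dim=-1))
        mu, log_std = self.head(h[-1]).chunk(2, dim=-1)
        return Normal(mu, log_std.exp())


class PlanProposal(nn.Module):
    """Prior p(z | current state, goal state), usable at test time."""

    def __init__(self, state_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))

    def forward(self, s0, goal):
        mu, log_std = self.net(torch.cat([s0, goal], dim=-1)).chunk(2, dim=-1)
        return Normal(mu, log_std.exp())


class PlanConditionedPolicy(nn.Module):
    """pi(a | s, s_g, z): decodes actions from state, goal, and latent plan."""

    def __init__(self, state_dim, action_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def forward(self, state, goal, z):
        return self.net(torch.cat([state, goal, z], dim=-1))


def lmp_loss(recog, proposal, policy, states, actions, beta=0.01):
    """One training step on a play window: reconstruct the window's actions
    while keeping the posterior plan close to the goal-conditioned prior."""
    s0, goal = states[:, 0], states[:, -1]        # last state = hindsight goal
    q = recog(states, actions)
    p = proposal(s0, goal)
    z = q.rsample()                               # reparameterized sample
    z_seq = z.unsqueeze(1).expand(-1, states.size(1), -1)
    goal_seq = goal.unsqueeze(1).expand_as(states)
    pred = policy(states, goal_seq, z_seq)
    recon = ((pred - actions) ** 2).mean()        # L2 surrogate for -log pi
    return recon + beta * kl_divergence(q, p).mean()
```

At test time one would encode the current observation and a user-provided goal, sample a plan z from the proposal, and roll out the plan-conditioned policy; the paper's actual networks (vision encoders, sequence decoder, discretized-logistic action distribution) are richer than this sketch.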
Results
  • A single task-agnostic Play-LMP policy, trained on unlabeled play data, generalizes with 85.5% success to the 18 test-time tasks with no finetuning, outperforming a collection of 18 expert-trained BC policies that reach 70.3% average success.
  • This holds true even when Play-LMP is artificially restricted to only 30 minutes of play data (71.8%), despite play being easier and cheaper to collect than expert demonstrations.
Conclusion
  • The authors advocate for learning the full continuum of tasks using unlabeled play data, rather than discrete tasks using expert demonstrations.
  • The authors showed that play brings scalability to data collection, as well as robustness to the models trained with it.
  • The authors explore the setting where play data and test-time tasks are defined over the same playroom environment.
  • Future work includes exploring whether generalization is possible to novel objects or novel environments, as well as exploring the effects of imbalance in play data distributions as discussed in A.5.
Related work
  • Robotic learning methods generally require some form of supervision to acquire behavioral skills. Conventionally, this supervision either consists of a cost or reward signal, as in reinforcement learning [8, 9, 10], or demonstrations, as in imitation learning (Pastor et al. [3]). However, both of these sources of supervision require considerable human effort to obtain: reward functions must be engineered by hand, which can be highly non-trivial in environments with natural observations, and demonstrations must be provided manually for each task. When using high-capacity models, hundreds or even thousands of demonstrations may be required for each task (Zhang et al. [11], Rahmatizadeh et al. [12], Rajeswaran et al. [13], Duan et al. [14]). In this paper, we instead aim to learn general-purpose policies that can flexibly accomplish a wide range of user-specified tasks, using data that is not task-specific and is easy to collect. Our model can in principle use any past experience for training, but the particular data collection approach we used is based on human-provided play data.
Study subjects and analysis
BC expert demonstration datasets: 18
This results in 1800 demonstrations total, or ∼1.5 hours of expert data. We additionally train a single multi-task behavioral cloning baseline conditioned on state and task id, Multitask BC (Rahmatizadeh et al. [26]), trained on all 18 BC expert demonstration datasets. We collect play datasets (example in A.3.2) of various sizes as training data for Play-LMP and Play-GCBC, up to ∼7 hours total (see the window-sampling sketch below).
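A minimal sketch, under assumed window lengths, of how unsegmented play logs can be turned into goal-conditioned training examples by hindsight relabeling: sample a short window at random and treat its final state as the goal. The function name and its bounds are hypothetical, not the released data pipeline.

```python
# Sketch only: turning one continuous play log into (window, goal) training
# examples. Window-length bounds are illustrative assumptions.
import numpy as np


def sample_play_window(states, actions, rng, min_len=16, max_len=32):
    """states: (T, state_dim), actions: (T, action_dim) from one play log.
    Returns a random window and its hindsight goal (the window's last state)."""
    T = len(states)
    k = int(rng.integers(min_len, max_len + 1))   # window length in steps
    start = int(rng.integers(0, T - k + 1))       # window start index
    window_s = states[start:start + k]
    window_a = actions[start:start + k]
    goal = window_s[-1]                           # relabeled goal state
    return window_s, window_a, goal


# Example: draw one training example from a dummy 1000-step play log.
rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 8))
actions = rng.normal(size=(1000, 2))
window_s, window_a, goal = sample_play_window(states, actions, rng)
```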

Reference
  • [1] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
  • [2] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
  • [3] P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal. Learning and generalization of motor skills by learning from demonstration. In IEEE International Conference on Robotics and Automation (ICRA), pages 763–768. IEEE, 2009.
  • [4] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.
  • [5] D. Warde-Farley, T. V. de Wiele, T. Kulkarni, C. Ionescu, S. Hansen, and V. Mnih. Unsupervised control through non-parametric discriminative rewards. CoRR, abs/1811.11359, 2018. URL http://arxiv.org/abs/1811.11359.
  • [7] S. Ross, G. J. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011. URL https://arxiv.org/pdf/1011.0686.pdf.
  • [8] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
  • [9] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
  • [10] M. P. Deisenroth, G. Neumann, J. Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
  • [11] T. Zhang, Z. McCarthy, O. Jow, D. Lee, K. Goldberg, and P. Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. CoRR, abs/1710.04615, 2017. URL http://arxiv.org/abs/1710.04615.
  • [12] R. Rahmatizadeh, P. Abolghasemi, L. Boloni, and S. Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. CoRR, abs/1707.02920, 2017. URL http://arxiv.org/abs/1707.02920.
  • [13] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
  • [14] Y. Duan, M. Andrychowicz, B. C. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning. CoRR, abs/1703.07326, 2017. URL http://arxiv.org/abs/1703.07326.
  • [15] A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pages 9209–9220, 2018.
  • [16] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
  • [17] A. Levy, R. P. Jr., and K. Saenko. Hierarchical actor-critic. CoRR, abs/1712.00948, 2017. URL http://arxiv.org/abs/1712.00948.
  • [18] P. Rauber, F. Mutz, and J. Schmidhuber. Hindsight policy gradients. CoRR, abs/1711.06006, 2017. URL http://arxiv.org/abs/1711.06006.
  • [19] S. Cabi, S. G. Colmenarejo, M. W. Hoffman, M. Denil, Z. Wang, and N. de Freitas. The intentional unintentional agent: Learning to solve many continuous control tasks simultaneously. CoRR, abs/1707.03300, 2017. URL http://arxiv.org/abs/1707.03300.
  • [20] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. CoRR, abs/1606.07419, 2016. URL http://arxiv.org/abs/1606.07419.
  • [21] A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. CoRR, abs/1703.02018, 2017. URL http://arxiv.org/abs/1703.02018.
  • [22] P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba. Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518, 2016.
  • [23] F. Torabi, G. Warnell, and P. Stone. Behavioral cloning from observation. CoRR, abs/1805.01954, 2018. URL http://arxiv.org/abs/1805.01954.
  • [24] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50K tries and 700 robot hours. In IEEE International Conference on Robotics and Automation (ICRA), 2016.
  • [25] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. International Journal of Robotics Research, 2017.
  • [26] R. Rahmatizadeh, P. Abolghasemi, L. Boloni, and S. Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In IEEE International Conference on Robotics and Automation (ICRA), pages 3758–3765. IEEE, 2018.
  • [27] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell. Zero-shot visual imitation. In ICLR, 2018.
  • [28] K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rk07ZXZRb.
  • [29] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.
  • [30] Z. Wang, J. S. Merel, S. E. Reed, N. de Freitas, G. Wayne, and N. Heess. Robust imitation of diverse behaviors. In Advances in Neural Information Processing Systems, pages 5320–5329, 2017.
  • [31] T. L. Paine, S. G. Colmenarejo, Z. Wang, S. E. Reed, Y. Aytar, T. Pfaff, M. W. Hoffman, G. Barth-Maron, S. Cabi, D. Budden, and N. de Freitas. One-shot high-fidelity imitation: Training large-scale deep nets with RL. CoRR, abs/1810.05017, 2018. URL http://arxiv.org/abs/1810.05017.
  • [32] O. Nachum, S. Gu, H. Lee, and S. Levine. Near-optimal representation learning for hierarchical reinforcement learning. CoRR, abs/1810.01257, 2018. URL http://arxiv.org/abs/1810.01257.
  • [33] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [34] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28, pages 3483–3491. Curran Associates, Inc., 2015.
  • [35] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space, 2016.
  • [36] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR), 2017.
  • [37] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine. Time-contrastive networks: Self-supervised learning from video. In IEEE International Conference on Robotics and Automation (ICRA), 2018. URL http://arxiv.org/abs/1704.06888.
  • [38] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. CoRR, abs/1701.05517, 2017. URL http://arxiv.org/abs/1701.05517.
  • [39] V. Kumar and E. Todorov. MuJoCo HAPTIX: A virtual reality system for hand manipulation. In IEEE-RAS International Conference on Humanoid Robots (Humanoids), pages 657–663. IEEE, 2015.
  • [40] A. van den Oord, O. Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315, 2017.