Unsupervised Perceptual Rewards for Imitation Learning

Robotics: Science and Systems, 2017.


Abstract:

Reward function design and exploration time are arguably the biggest obstacles to the deployment of reinforcement learning (RL) agents in the real world. In many real-world tasks, designing a suitable reward function takes considerable manual engineering and often requires additional and potentially visible sensors to be installed just to...

Introduction
  • Social learning, such as imitation, plays a critical role in allowing humans and animals to quickly acquire complex skills in the real world
  • Humans can use this weak form of supervision to acquire behaviors from very small numbers of demonstrations, in sharp contrast to deep reinforcement learning (RL) methods, which typically require extensive training data.
  • Once the reward function has been extracted, the agent can use its own experience at the task to determine the physical structure of the behavior, even when the reward is provided by an agent with a substantially different embodiment
Highlights
  • Social learning, such as imitation, plays a critical role in allowing humans and animals to quickly acquire complex skills in the real world
  • We propose a reward learning method for understanding the intent of a user demonstration through the use of pre-trained visual features, which provide the “prior knowledge” for efficient imitation
  • The approach we propose in this work, which can be interpreted as a simple and efficient approximation to inverse reinforcement learning (IRL), can use demonstrations that consist of videos of a human performing the task using their own body, and can acquire reward functions with intermediate sub-goals using just a few examples (a schematic sketch of this recipe follows this list)
  • By leveraging the general features learned from pre-trained deep models, we propose a method for rapidly learning an incremental reward function from human demonstrations which we successfully demonstrate on a real robotic learning task
  • We show there exists a small subset of pre-trained features that are highly discriminative even for previously unseen scenes and that can be used to reduce the search space for future work in unsupervised steps discovery
  • Another compelling direction for future work is to explore how reward learning algorithms can be combined with robotic lifelong learning
  • One of the biggest barriers for lifelong learning in the real world is the ability of an agent to obtain reward supervision, without which no learning is possible
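The recipe behind these highlights can be sketched concretely: encode demonstration frames with a pre-trained network, segment the demonstrations into intermediate steps, and fit a simple classifier per step whose predicted probability serves as the reward. The snippet below is a minimal sketch under those assumptions; the class name `PerStepReward`, the use of logistic regression, and the "step reached or later" labeling are illustrative stand-ins, not the paper's exact formulation.

```python
# Minimal sketch: per-step perceptual rewards from pre-extracted CNN features.
# Assumes each demonstration frame is already encoded (e.g. with activations
# from an ImageNet-pretrained network) and segmented into K ordered steps.
import numpy as np
from sklearn.linear_model import LogisticRegression

class PerStepReward:
    """One binary classifier per intermediate step; the reward for a frame is
    the predicted probability that the step has been reached."""

    def __init__(self):
        self.classifiers = []

    def fit(self, features_by_step):
        # features_by_step: list of (n_frames_k, n_features) arrays, one per
        # step, in temporal order (step 0 = initial state).
        feats = np.concatenate(features_by_step, axis=0)
        labels = np.concatenate(
            [np.full(len(f), k) for k, f in enumerate(features_by_step)])
        for k in range(1, len(features_by_step)):
            clf = LogisticRegression(max_iter=1000)
            clf.fit(feats, (labels >= k).astype(int))  # "step k reached or later"
            self.classifiers.append(clf)

    def reward(self, frame_features):
        # One scalar reward per step (k = 1 .. K-1) for a single encoded frame.
        x = np.asarray(frame_features).reshape(1, -1)
        return np.array([clf.predict_proba(x)[0, 1] for clf in self.classifiers])
```

In the paper's setting the frame encodings come from a pre-trained deep model and the per-step scores are combined into the reward used for policy learning; those details are omitted from this sketch.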
Methods
  • The authors discuss the empirical evaluation, starting with an analysis of the learned reward functions in terms of both qualitative reward structure and quantitative segmentation accuracy.
  • The authors present results for a real-world validation of the method on robotic door opening.
  • In the perceptual rewards evaluation (Sec. 3.1), the authors report results on two demonstrated tasks: door opening and liquid pouring.
  • The intermediate reward function for the door opening task, which corresponds to a human hand manipulating the door handle, appears rather noisy or wrong in Figs. 10b, 10c, and 10e ("action1" on the y-axis of the plots). The reward function in Fig. 11f remains flat while liquid is being poured into the glass.
  • Because the liquid is somewhat transparent, the authors suspect it looks too similar to the transparent glass for the reward function to fire.
Results
  • The authors show in Fig. 4 that the feature selection approach works well when the number of selected features n is in the range [32, 64], but accuracy collapses to 0% when n > 8192.
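A plausible way to arrive at such a small subset is to score each pre-trained feature by how well it individually separates the discovered steps and keep only the top n. The sketch below uses a Fisher-style ratio of between-step to within-step variance; this criterion and the function name are assumptions for illustration, not necessarily the paper's exact selection rule.

```python
# Minimal sketch: pick the n most step-discriminative features by a
# Fisher-style score (between-step variance over within-step variance).
import numpy as np

def select_discriminative_features(features, step_labels, n=64):
    # features: (n_frames, n_features); step_labels: (n_frames,) integer step ids.
    steps = np.unique(step_labels)
    means = np.stack([features[step_labels == s].mean(axis=0) for s in steps])
    variances = np.stack([features[step_labels == s].var(axis=0) for s in steps])
    between = means.var(axis=0)              # spread of the per-step means
    within = variances.mean(axis=0) + 1e-8   # average within-step variance
    score = between / within
    return np.argsort(score)[::-1][:n]       # indices of the top-n features
```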
Conclusion
  • The authors present a method for automatically identifying important intermediate goals given a few visual demonstrations of a task.
  • The authors show there exists a small subset of pre-trained features that are highly discriminative even for previously unseen scenes and that can be used to reduce the search space for future work in unsupervised steps discovery.
  • Another compelling direction for future work is to explore how reward learning algorithms can be combined with robotic lifelong learning.
  • Continuous learning using unsupervised rewards promises to substantially increase the variety and diversity of experience that is available for robotic reinforcement learning, resulting in more powerful, robust, and general robotic skills
Objectives
  • The authors aim to answer the question of whether the previously visualized reward function can be used to learn a real-world robotic motion skill.
  • Since the aim is mainly to validate that the learned reward functions capture the goals of the task well enough for learning, the authors employ a relatively simple linear-Gaussian parameterization of the policy, which corresponds to a sequence of open-loop torque commands with fixed linear feedback to correct for perturbations, as in the work of Chebotar et al (2016)
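For concreteness, a time-varying linear-Gaussian policy of this kind computes, at each time step, a nominal open-loop torque plus a fixed linear feedback correction and Gaussian exploration noise. The sketch below shows only that action computation; the names (`u_nom`, `x_nom`, `K`, `sigma`) and the isotropic noise are illustrative assumptions, and the actual training procedure (path integral guided policy search, Chebotar et al., 2016) is not reproduced here.

```python
# Minimal sketch of one action from a time-varying linear-Gaussian policy:
# u_t = u_nom[t] + K[t] @ (x_t - x_nom[t]) + eps,  with eps ~ N(0, sigma^2 I).
import numpy as np

def linear_gaussian_action(t, x_t, u_nom, x_nom, K, sigma, rng):
    feedback = K[t] @ (x_t - x_nom[t])                     # fixed linear feedback
    noise = rng.normal(scale=sigma, size=u_nom[t].shape)   # exploration noise
    return u_nom[t] + feedback + noise                     # open-loop torque plus correction
```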
Tables
  • Table 1: Unsupervised steps discovery accuracy (Jaccard overlap on training sets) versus the ordered random steps baseline
  • Table 2: Reward function accuracy by step (Jaccard overlap on test sets)
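The Jaccard overlap in these tables is the intersection-over-union between a predicted temporal segment and its ground-truth counterpart. A minimal sketch of that metric on sets of frame indices is shown below; how segments are matched and how scores are averaged over steps and videos are details not spelled out here.

```python
# Minimal sketch: Jaccard overlap (intersection over union) between a
# predicted and a ground-truth step segment, each given as frame indices.
def jaccard_overlap(pred_frames, gt_frames):
    pred, gt = set(pred_frames), set(gt_frames)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0

# Example: jaccard_overlap(range(10, 20), range(15, 25)) == 5 / 15 ≈ 0.33
```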
Related work
  • Deep reinforcement learning and deep robotic learning work has previously examined learning reward functions based on images. One of the most common approaches is to directly specify a "target image" by showing the learner the raw pixels of a successful task completion state, and then using the distance to that image (or to its latent representation) as a reward function (Lange et al., 2012; Finn et al., 2015; Watter et al., 2015); a minimal sketch of this baseline appears after this list. This approach has several severe shortcomings. First, the use of a target image presupposes that the system can achieve a substantially similar visual state, which precludes generalization to semantically similar but visually distinct situations. Second, a target image does not tell the learner which facets of the image are more or less important for task success, which can lead it to excessively emphasize irrelevant factors of variation (such as the color of a door due to light and shadow) at the expense of relevant ones (such as whether the door is open or closed).
  • Analyzing a collection of demonstrations to learn a parsimonious reward function that explains the demonstrated behavior is known as inverse reinforcement learning (IRL) (Ng et al., 2000). A few recently proposed IRL algorithms have sought to combine IRL with vision and deep network representations (Finn et al., 2016b; Wulfmeier et al., 2016). However, scaling IRL to high-dimensional systems and open-ended reward representations is very challenging. The prior work closest to ours used images together with robot state information (joint angles and end-effector pose), with tens of demonstrations provided through kinesthetic teaching (Finn et al., 2016b).
  • The approach proposed in this work, which can be interpreted as a simple and efficient approximation to IRL, can use demonstrations that consist of videos of a human performing the task with their own body, and can acquire reward functions with intermediate sub-goals from just a few examples. This kind of efficient vision-based reward learning from videos of humans has not been demonstrated in prior IRL work. The idea of perceptual reward functions computed from raw pixels was also explored by Edwards et al. (2016), which, while sharing the same spirit as this work, was limited to simple synthetic tasks and used single images as perceptual goals rather than multiple demonstration videos.
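As a point of comparison, the "target image" reward discussed in the first bullet above reduces to a negative distance between embeddings of the current observation and of a single goal image. The sketch below is a minimal illustration of that baseline; `encode` stands in for an arbitrary fixed feature extractor and is an assumption, not an API from the cited works.

```python
# Minimal sketch of a target-image reward: negative distance between the
# embedding of the current observation and that of a single goal image.
import numpy as np

def target_image_reward(obs_image, goal_image, encode):
    z_obs, z_goal = encode(obs_image), encode(goal_image)  # any fixed encoder
    return -float(np.linalg.norm(np.asarray(z_obs) - np.asarray(z_goal)))
```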
References
  • Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004.
  • Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. 2011.
  • Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, and Sergey Levine. Path integral guided policy search. arXiv preprint arXiv:1610.00529, 2016.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • Ashley Edwards, Charles Isbell, and Atsuo Takanishi. Perceptual reward functions. arXiv preprint arXiv:1608.03824, 2016.
  • Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. arXiv preprint arXiv:1509.06113, 2015.
  • Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016a.
  • Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. arXiv preprint arXiv:1603.00448, 2016b.
  • A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In European Conference on Computer Vision (ECCV), 2014.
  • Mrinal Kalakrishnan, Evangelos Theodorou, and Stefan Schaal. Inverse reinforcement learning with PI². 2010.
  • Sascha Lange, Martin Riedmiller, and Arne Voigtlander. Autonomous reinforcement learning on raw visual input data in a real world application. In The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2012.
  • Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pp. 663–670, 2000.
  • Jan Peters, Katharina Mulling, and Yasemin Altun. Relative entropy policy search. In AAAI Conference on Artificial Intelligence (AAAI 2010), 2010.
  • Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. 2007.
  • Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 729–736. ACM, 2006.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. CoRR, abs/1512.00567, 2015. URL http://arxiv.org/abs/1512.00567.
  • Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11(Nov):3137–3181, 2010.
  • Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746–2754, 2015.
  • Markus Wulfmeier, Dominic Zeng Wang, and Ingmar Posner. Watch This: Scalable cost-function learning for path planning in urban environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016. arXiv preprint: http://arxiv.org/abs/1607.02329.
  • Jinhui Yuan, Huiyi Wang, Lan Xiao, Wujie Zheng, Jianmin Li, Fuzong Lin, and Bo Zhang. A formal study of shot boundary detection. IEEE Transactions on Circuits and Systems for Video Technology, 17(2):168–186, 2007.
  • Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, pp. 1433–1438, 2008.