Language-Conditioned Imitation Learning for Robot Manipulation Tasks

NeurIPS 2020


Abstract

Imitation learning is a popular approach for teaching motor skills to robots. However, most approaches focus on extracting policy parameters from execution traces alone (i.e., motion trajectories and perceptual data). No adequate communication channel exists between the human expert and the robot to describe critical aspects of the task...
Introduction
  • Learning robot control policies by imitation [31] is an appealing approach to skill acquisition and has been successfully applied to several tasks, including locomotion, grasping, and even table tennis [8, 2, 25].
  • Neural approaches scale imitation learning [27, 4, 20, 1, 9] to high-dimensional spaces by enabling agents to learn task-specific feature representations
  • Both foundational references [27] and more recent literature [10] have noted that these methods lack “a communication channel” that would allow the user to provide further information about the intended task at nearly no additional cost [11].
  • The authors' model can extract a variety of information directly from natural language
Highlights
  • Learning robot control policies by imitation [31] is an appealing approach to skill acquisition and has been successfully applied to several tasks, including locomotion, grasping, and even table tennis [8, 2, 25]
  • These demonstrations are used to derive a control policy that generalizes the observed behavior to a larger set of scenarios, allowing the robot to respond to perceptual stimuli with appropriate actions
  • We show that our approach leveraged perception, language, and motion modalities to generalize the demonstrated behavior to new user commands or experimental setups
  • We present an approach for end-to-end imitation learning of robot manipulation policies that combines language, vision, and control
  • While we use FRCNN for perception and GloVe for language embeddings, our approach is independent of these choices, and more recent models for vision and language, such as BERT [13], can be used as a replacement (see the interface sketch after this list)
  • We showed that our approach significantly outperformed alternative methods, while generalizing across a variety of experimental setups and achieving credible results on free-form, unconstrained natural-language instructions from previously unseen users
  • While each of these three topics has a large, dedicated community working on domain-relevant benchmarks and methodologies, there are only a few works that have addressed the challenge of integration
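
As a concrete illustration of the swappable language backbone mentioned above, the following minimal Python sketch defines an encoder interface. The class names (LanguageEncoder, AveragedWordVectors) and the mean-pooling scheme are hypothetical assumptions for illustration; only the general idea of exchanging GloVe-style vectors for a transformer sentence encoder comes from the paper.

```python
# A minimal sketch, assuming a simple "encode" interface for the language backbone.
# Names such as LanguageEncoder and AveragedWordVectors are hypothetical.
from typing import Protocol
import numpy as np

class LanguageEncoder(Protocol):
    def encode(self, instruction: str) -> np.ndarray: ...

class AveragedWordVectors:
    """Mean-pool pretrained word vectors (GloVe-style) into a sentence embedding."""
    def __init__(self, vectors: dict, dim: int = 50):
        self.vectors, self.dim = vectors, dim

    def encode(self, instruction: str) -> np.ndarray:
        tokens = instruction.lower().split()
        found = [self.vectors[t] for t in tokens if t in self.vectors]
        return np.mean(found, axis=0) if found else np.zeros(self.dim)

# Swapping in BERT (or any other sentence encoder) only changes which object is
# handed to the policy, e.g.:
#   policy = LanguageConditionedPolicy(encoder=AveragedWordVectors(glove_vectors))
#   policy = LanguageConditionedPolicy(encoder=SomeBertSentenceEncoder())  # hypothetical
```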
Results
  • The authors evaluated the approach in a simulated robot task with a table-top setup.
  • In this task, a seven-DOF robot manipulator had to be taught by an expert how to perform a combination of picking and pouring behaviors.
  • The expert provided both a kinesthetic demonstration of the task and a verbal description (e.g., “pour a little into the red bowl”); a sketch of such a paired sample follows this list.
  • The authors show that the approach leveraged perception, language, and motion modalities to generalize the demonstrated behavior to new user commands or experimental setups
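
The paired data format described above might look like the following minimal Python sketch; the field names and tensor shapes are assumptions for illustration, not the authors' data schema.

```python
# A minimal sketch of one paired training sample, assuming demonstrations are
# stored as (image, instruction, joint trajectory) triples.
from dataclasses import dataclass
import numpy as np

@dataclass
class Demonstration:
    image: np.ndarray        # RGB frame of the table-top scene, shape (H, W, 3)
    instruction: str         # free-form command, e.g. "pour a little into the red bowl"
    joints: np.ndarray       # (T, 7) joint positions of the seven-DOF arm over T steps

demo = Demonstration(
    image=np.zeros((224, 224, 3), dtype=np.uint8),   # placeholder frame
    instruction="pour a little into the red bowl",
    joints=np.zeros((150, 7)),                       # placeholder trajectory
)
```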
Conclusion
  • The authors present an approach for end-to-end imitation learning of robot manipulation policies that combines language, vision, and control (a minimal policy sketch follows this list).
  • The extracted language-conditioned policies provided human users with a simple and intuitive interface for issuing unstructured commands
  • This represents a significant departure from existing work on imitation learning and enables a tight coupling of semantic knowledge extraction and control signal generation.
  • The authors' work describes a machine-learning approach that fundamentally combined language, vision, and motion to produce changes in a physical environment
  • While each of these three topics has a large, dedicated community working on domain-relevant benchmarks and methodologies, there are only a few works that have addressed the challenge of integration.
  • “users reported being satisfied with Alexa even when it did not produce sought information” [21]
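
To make the coupling of language, vision, and control concrete, here is a minimal PyTorch sketch of a language-conditioned policy. The fusion scheme, layer sizes, and the tiny CNN stand-in for FRCNN features are assumptions and do not reproduce the authors' architecture.

```python
# Minimal PyTorch sketch: fuse vision, language, and proprioception to predict
# the next motor command. Sizes and structure are illustrative assumptions.
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    def __init__(self, lang_dim=50, joint_dim=7):
        super().__init__()
        self.vision = nn.Sequential(            # tiny CNN stand-in for FRCNN features
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fuse = nn.Sequential(
            nn.Linear(32 + lang_dim + joint_dim, 128), nn.ReLU(),
            nn.Linear(128, joint_dim),          # next joint command
        )

    def forward(self, image, lang_emb, joints):
        v = self.vision(image)                  # (B, 32) visual features
        return self.fuse(torch.cat([v, lang_emb, joints], dim=-1))

# Usage with dummy tensors:
policy = LanguageConditionedPolicy()
cmd = policy(torch.zeros(1, 3, 128, 128), torch.zeros(1, 50), torch.zeros(1, 7))
```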
Tables
  • Table1: Model ablations concerning auxiliary losses, model structure, and dataset size
  • Table2: Generalization to new sentences and changes in illumination
Funding
  • This work was supported by a grant from the Interplanetary Initiative at Arizona State University
Study subjects and analysis
training samples: 45000
Given the set of synonyms and templates, our language generator could create 99,864 unique task descriptions, of which we randomly used 45,000 to generate our data set. The final data set contained 22,500 complete task demonstrations composed of the two subtasks (grasping and pouring), resulting in 45,000 training samples. Of these samples, we used 4,000 for validation and 1,000 for testing, leaving 40,000 for training
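
A minimal sketch of the split described above (45,000 samples into 40,000 train / 4,000 validation / 1,000 test), assuming a uniformly random permutation; the fixed seed is an illustrative choice, not stated in the paper.

```python
# Hedged sketch of the dataset split: 45,000 samples -> 40,000 / 4,000 / 1,000.
import numpy as np

rng = np.random.default_rng(0)       # seed chosen here only for reproducibility
indices = rng.permutation(45_000)

train_idx = indices[:40_000]
val_idx   = indices[40_000:44_000]
test_idx  = indices[44_000:]

assert len(train_idx) == 40_000 and len(val_idx) == 4_000 and len(test_idx) == 1_000
```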

human users: 4
The remaining feature combinations reflected the general success rate of 85% for the pouring action. Generalization to new users and perturbations: Subsequently, we evaluated our model’s performance when interacting with a new set of four human users, from which we collected 160 new sentences. The corresponding results can be seen in Table 2, row 2

training samples: 30000
Significant performance increases could be seen when gradually increasing the sample size from 2,500 to 30,000 training samples. However, the step from 30,000 to 40,000 samples (our main model) only yielded a 4% performance increase, which was negligible compared to the previous increases of ≥ 20% between each step

samples: 40000
The step from 30,000 to 40,000 samples (our main model) only yielded a 4% performance increase, which was negligible compared to the previous increases of ≥ 20% between each step

Reference
  • Pooya Abolghasemi, Amir Mazaheri, Mubarak Shah, and Ladislau Boloni. Pay attention!robustifying a deep visuomotor policy through task-focused visual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4254–4262, 2019.
  • Heni Ben Amor, Oliver Kroemer, Ulrich Hillenbrand, Gerhard Neumann, and Jan Peters. Generalization of human grasping for multi-fingered robot hands. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 2043–2050. IEEE, 2012.
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Technical report, 2017. URL http://www.panderson.me/up-down-attention.
  • Peter Anderson, Ayush Shrivastava, Devi Parikh, Dhruv Batra, and Stefan Lee. Chasing ghosts: Instruction following as bayesian state tracking. In Advances in Neural Information Processing Systems 32, pages 369–379. Curran Associates, Inc., 2019.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
  • Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
  • Joseph Campbell, Simon Stepputtis, and Heni Ben Amor. Probabilistic multimodal modeling for human-robot interaction tasks, 2019.
  • Rawichote Chalodhorn, David B Grimes, Keith Grochow, and Rajesh PN Rao. Learning to walk through imitation. In IJCAI, volume 7, pages 2084–2090, 2007.
  • Jonathan Chang, Nishanth Kumar, Sean Hastings, Aaron Gokaslan, Diego Romeres, Devesh Jha, Daniel Nikovski, George Konidaris, and Stefanie Tellex. Learning Deep Parameterized Skills from Demonstration for Re-targetable Visuomotor Control. Technical report, 2019. URL http://arxiv.org/abs/1910.10628.
  • Felipe Codevilla, Matthias Müller, Alexey Dosovitskiy, Antonio López, and Vladlen Koltun. End-to-end driving via conditional imitation learning. 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–9, 2018.
  • Yuchen Cui, Qiping Zhang, Alessandro Allievi, Peter Stone, Scott Niekum, and W. Bradley Knox. The empathic framework for task learning from implicit human feedback, 2020.
  • Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks, 2016.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
  • Yiming Ding, Carlos Florensa, Mariano Phielipp, and Pieter Abbeel. Goal-conditioned imitation learning. Advances in Neural Information Processing Systems, 2019. URL http://arxiv.org/abs/1906.05838.
  • Nakul Gopalan, Dilip Arumugam, Lawson Wong, and Stefanie Tellex. Sequence-to-sequence language grounding of non-markovian task specifications. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, June 2018. doi: 10.15607/RSS.2018.XIV.067.
  • Auke Jan Ijspeert, Jun Nakanishi, Heiko Hoffmann, Peter Pastor, and Stefan Schaal. Dynamical movement primitives: learning attractor models for motor behaviors. Neural computation, 25 (2):328–373, 2013.
  • Stephen James, Marc Freese, and Andrew J. Davison. Pyrep: Bringing v-rep to deep robot learning. arXiv preprint arXiv:1906.11176, 2019.
  • Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments. Technical report, 2020.
  • Hadas Kress-Gazit, Georgios E Fainekos, and George J Pappas. Temporal-logic-based reactive mission and motion planning. IEEE transactions on robotics, 25(6):1370–1381, 2009.
  • Yen-Ling Kuo, Boris Katz, and Andrei Barbu. Deep compositional robotic planners that follow natural language commands. Technical report, 2020.
  • Irene Lopatovska, Katrina Rink, Ian Knight, Kieran Raines, Kevin Cosenza, Harriet Williams, Perachya Sorsche, David Hirsch, Qi Li, and Adrianna Martinez. Talk to me: Exploring user interactions with the amazon alexa. Journal of Librarianship and Information Science, page 096100061875941, 03 2018. doi: 10.1177/0961000618759414.
  • Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems 32, pages 13–23. Curran Associates, Inc., 2019.
  • Guilherme Maeda, Marco Ewerton, Rudolf Lioutikov, Heni Ben Amor, Jan Peters, and Gerhard Neumann. Learning interaction for collaborative tasks with probabilistic movement primitives. In Humanoid Robots (Humanoids), 2014 14th IEEE-RAS International Conference on, pages 527–534. IEEE, 2014.
  • Cynthia Matuszek. Grounded Language Learning: Where Robotics and NLP Meet *. Technical report, 2017.
  • Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.
  • Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.
  • Vasumathi Raman, Cameron Finucane, and Hadas Kress-Gazit. Temporal logic robot mission planning for slow and fast actions. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 251–256. IEEE, 2012.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks, 2015.
  • E. Rohmer, S. P. N. Singh, and M. Freese. Coppeliasim (formerly v-rep): a versatile and scalable robot simulation framework. In Proc. of The International Conference on Intelligent Robots and Systems (IROS), 2013. www.coppeliarobotics.com.
  • Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.
  • Yuuya Sugita and Jun Tani. Learning Semantic Combinatoriality from the Interaction between Linguistic and Behavioral Processes. Technical report, 2005.
  • Tom Williams and Matthias Scheutz. The state-of-the-art in autonomous wheelchairs controlled through natural language: A survey. Robotics and Autonomous Systems, 96:171–183, 2017.
  • Fengda Zhu, Yi Zhu, Xiaojun Chang, and Xiaodan Liang. Vision-language navigation with self-supervised auxiliary reasoning tasks, 2020.
Author
Simon Stepputtis
Joseph Campbell
Mariano Phielipp
Stefan Lee