Language-Conditioned Imitation Learning for Robot Manipulation Tasks
NeurIPS 2020
Imitation learning is a popular approach for teaching motor skills to robots. However, most approaches focus on extracting policy parameters from execution traces alone (i.e., motion trajectories and perceptual data). No adequate communication channel exists between the human expert and the robot to describe critical aspects of the task...
- Learning robot control policies by imitation is an appealing approach to skill acquisition and has been successfully applied to several tasks, including locomotion, grasping, and even table tennis [8, 2, 25].
- Neural approaches scale imitation learning [27, 4, 20, 1, 9] to high-dimensional spaces by enabling agents to learn task-specific feature representations
- Both foundational references, as well as more recent literature, have noted that these methods lack “a communication channel,” which would allow the user to provide further information about the intended task, at nearly no additional cost.
- The authors' model can extract a variety of information directly from natural language
- In imitation learning, expert demonstrations are used to derive a control policy that generalizes the observed behavior to a larger set of scenarios, allowing the robot to respond to perceptual stimuli with appropriate actions.
- We show that our approach leverages perception, language, and motion modalities to generalize the demonstrated behavior to new user commands or experimental setups
- We present an approach for end-to-end imitation learning of robot manipulation policies that combines language, vision, and control
- While we use FRCNN for perception and GloVe for language embeddings, our approach is independent of these choices and more recent models for vision and language, such as BERT, can be used as a replacement
- We showed that our approach significantly outperformed alternative methods, while generalizing across a variety of experimental setups and achieving credible results on free-form, unconstrained natural-language instructions from previously unseen users
- The authors evaluated the approach in a simulated robot task with a table-top setup.
- In this task, a seven-DOF robot manipulator had to be taught by an expert how to perform a combination of picking and pouring behaviors.
- The expert provided both a kinesthetic demonstration of the task and a verbal description (e.g.,“pour a little into the red bowl”).
- The authors show that the approach leverages perception, language, and motion modalities to generalize the demonstrated behavior to new user commands or experimental setups
- The authors present an approach for end-to-end imitation learning of robot manipulation policies that combines language, vision, and control.
- The extracted language-conditioned policies provided a simple and intuitive interface to a human user for providing unstructured commands
- This represents a significant departure from existing work on imitation learning and enables a tight coupling of semantic knowledge extraction and control signal generation.
- The authors' work describes a machine-learning approach that fundamentally combines language, vision, and motion to produce changes in a physical environment
- While each of these three topics has a large, dedicated community working on domain-relevant benchmarks and methodologies, there are only a few works that have addressed the challenge of integration.
- “users reported being satisfied with Alexa even when it did not produce sought information”
- Table 1: Model ablations concerning auxiliary losses, model structure, and dataset size
- Table 2: Generalization to new sentences and changes in illumination. Reported metrics include the percentage poured inside the designated bowl (PIn) and the percentage of correctly dispersed quantities (QDif), underlining our model’s ability to adjust motions based on semantic cues.
- Acknowledgments and Disclosure of Funding This work was supported by a grant from the Interplanetary Initiative at Arizona State University
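The language-conditioned, end-to-end policy summarized above can be sketched in a few lines. This is a minimal illustration only, assuming pre-extracted visual features (FRCNN-style) and word embeddings (GloVe-style); all module names, dimensions, and random stand-in weights are hypothetical and not the authors' exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 50-d word embeddings, 5 detected objects
# with 64-d visual features each, 7-d command for a seven-DOF arm.
LANG_DIM, N_OBJ, VIS_DIM, ACT_DIM = 50, 5, 64, 7

def encode_language(token_embeddings):
    """Mean-pool word embeddings into one sentence vector."""
    return token_embeddings.mean(axis=0)

def attend(sentence_vec, object_feats, W):
    """Language-conditioned soft attention over detected objects."""
    scores = object_feats @ (W @ sentence_vec)   # (N_OBJ,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax
    return weights @ object_feats                # (VIS_DIM,)

def policy(sentence_vec, attended_vis, W_out):
    """Fuse the two modalities and map to a bounded motor command."""
    fused = np.concatenate([sentence_vec, attended_vis])
    return np.tanh(W_out @ fused)                # (ACT_DIM,)

# Random stand-ins for learned weights and inputs.
tokens = rng.normal(size=(6, LANG_DIM))          # e.g. "pour ... red bowl"
objects = rng.normal(size=(N_OBJ, VIS_DIM))
W_att = rng.normal(size=(VIS_DIM, LANG_DIM)) * 0.1
W_out = rng.normal(size=(ACT_DIM, LANG_DIM + VIS_DIM)) * 0.1

s = encode_language(tokens)
v = attend(s, objects, W_att)
u = policy(s, v, W_out)
print(u.shape)  # (7,)
```

The point of the sketch is the tight coupling the summary describes: the language vector selects which visual features matter before any control signal is generated.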
Study subjects and analysis
training samples: 45,000
Given the set of synonyms and templates, our language generator could create 99,864 unique task descriptions, of which we randomly used 45,000 to generate our data set. The final data set contained 22,500 complete task demonstrations composed of the two subtasks (grasping and pouring), resulting in 45,000 training samples. Of these samples, we used 4,000 for validation and 1,000 for testing, leaving 40,000 for training
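The split described above is simple arithmetic and can be checked directly:

```python
# Sanity-check of the dataset figures quoted in the text.
demonstrations = 22_500
subtasks = 2                       # grasping + pouring
total = demonstrations * subtasks  # training samples generated
validation, test = 4_000, 1_000
train = total - validation - test
print(total, train)  # 45000 40000
```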
human users: 4
The remaining feature combinations reflected the general success rate of 85% for the pouring action. Generalization to new users and perturbations: Subsequently, we evaluated our model’s performance when interacting with a new set of four human users, from whom we collected 160 new sentences. The corresponding results can be seen in Table 2, row 2
training samples: 30,000
Significant performance increases could be seen when gradually increasing the sample size from 2,500 to 30,000 training samples. However, the step from 30,000 to 40,000 samples (our main model) only yielded a 4% performance increase, which was negligible compared to the previous increases of ≥ 20% between each step
- Pooya Abolghasemi, Amir Mazaheri, Mubarak Shah, and Ladislau Boloni. Pay attention! Robustifying a deep visuomotor policy through task-focused visual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4254–4262, 2019.
- Heni Ben Amor, Oliver Kroemer, Ulrich Hillenbrand, Gerhard Neumann, and Jan Peters. Generalization of human grasping for multi-fingered robot hands. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 2043–2050. IEEE, 2012.
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Technical report, 2017. URL http://www.panderson.me/up-down-attention.
- Peter Anderson, Ayush Shrivastava, Devi Parikh, Dhruv Batra, and Stefan Lee. Chasing ghosts: Instruction following as bayesian state tracking. In Advances in Neural Information Processing Systems 32, pages 369–379. Curran Associates, Inc., 2019.
- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
- Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
- Joseph Campbell, Simon Stepputtis, and Heni Ben Amor. Probabilistic multimodal modeling for human-robot interaction tasks, 2019.
- Rawichote Chalodhorn, David B Grimes, Keith Grochow, and Rajesh PN Rao. Learning to walk through imitation. In IJCAI, volume 7, pages 2084–2090, 2007.
- Jonathan Chang, Nishanth Kumar, Sean Hastings, Aaron Gokaslan, Diego Romeres, Devesh Jha, Daniel Nikovski, George Konidaris, and Stefanie Tellex. Learning Deep Parameterized Skills from Demonstration for Re-targetable Visuomotor Control. Technical report, 2019. URL http://arxiv.org/abs/1910.10628.
- Felipe Codevilla, Matthias Müller, Alexey Dosovitskiy, Antonio López, and Vladlen Koltun. End-to-end driving via conditional imitation learning. 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–9, 2018.
- Yuchen Cui, Qiping Zhang, Alessandro Allievi, Peter Stone, Scott Niekum, and W. Bradley Knox. The empathic framework for task learning from implicit human feedback, 2020.
- Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks, 2016.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
- Yiming Ding, Carlos Florensa, Mariano Phielipp, and Pieter Abbeel. Goal-conditioned imitation learning. Advances in Neural Information Processing Systems, 2019. URL http://arxiv.org/abs/1906.05838.
- Nakul Gopalan, Dilip Arumugam, Lawson Wong, and Stefanie Tellex. Sequence-to-sequence language grounding of non-markovian task specifications. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, June 2018. doi: 10.15607/RSS.2018.XIV.067.
- Auke Jan Ijspeert, Jun Nakanishi, Heiko Hoffmann, Peter Pastor, and Stefan Schaal. Dynamical movement primitives: learning attractor models for motor behaviors. Neural computation, 25 (2):328–373, 2013.
- Stephen James, Marc Freese, and Andrew J. Davison. Pyrep: Bringing v-rep to deep robot learning. arXiv preprint arXiv:1906.11176, 2019.
- Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments. Technical report, 2020.
- Hadas Kress-Gazit, Georgios E Fainekos, and George J Pappas. Temporal-logic-based reactive mission and motion planning. IEEE transactions on robotics, 25(6):1370–1381, 2009.
- Yen-Ling Kuo, Boris Katz, and Andrei Barbu. Deep compositional robotic planners that follow natural language commands. Technical report, 2020.
- Irene Lopatovska, Katrina Rink, Ian Knight, Kieran Raines, Kevin Cosenza, Harriet Williams, Perachya Sorsche, David Hirsch, Qi Li, and Adrianna Martinez. Talk to me: Exploring user interactions with the amazon alexa. Journal of Librarianship and Information Science, page 096100061875941, 03 2018. doi: 10.1177/0961000618759414.
- Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems 32, pages 13–23. Curran Associates, Inc., 2019.
- Guilherme Maeda, Marco Ewerton, Rudolf Lioutikov, Heni Ben Amor, Jan Peters, and Gerhard Neumann. Learning interaction for collaborative tasks with probabilistic movement primitives. In Humanoid Robots (Humanoids), 2014 14th IEEE-RAS International Conference on, pages 527–534. IEEE, 2014.
- Cynthia Matuszek. Grounded Language Learning: Where Robotics and NLP Meet *. Technical report, 2017.
- Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.
- Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.
- Vasumathi Raman, Cameron Finucane, and Hadas Kress-Gazit. Temporal logic robot mission planning for slow and fast actions. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 251–256. IEEE, 2012.
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks, 2015.
- E. Rohmer, S. P. N. Singh, and M. Freese. Coppeliasim (formerly v-rep): a versatile and scalable robot simulation framework. In Proc. of The International Conference on Intelligent Robots and Systems (IROS), 2013. www.coppeliarobotics.com.
- Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.
- Yuuya Sugita and Jun Tani. Learning Semantic Combinatoriality from the Interaction between Linguistic and Behavioral Processes. Technical report, 2005.
- Tom Williams and Matthias Scheutz. The state-of-the-art in autonomous wheelchairs controlled through natural language: A survey. Robotics and Autonomous Systems, 96:171–183, 2017.
- Fengda Zhu, Yi Zhu, Xiaojun Chang, and Xiaodan Liang. Vision-language navigation with self-supervised auxiliary reasoning tasks, 2020.