Grasp2Vec: Learning Object Representations from Self-Supervised Grasping.

CoRL (2018): 99–112

Abstract

Well-structured visual representations can make robot learning faster and can improve generalization. In this paper, we study how we can acquire effective object-centric representations for robotic manipulation tasks without human labeling by using autonomous robot interaction with the environment. Such representation learning methods can...

Introduction
  • Robotic learning algorithms based on reinforcement, self-supervision, and imitation can acquire end-to-end controllers from images for diverse tasks such as robotic mobility [1, 2] and object manipulation [3, 4].
  • These end-to-end controllers acquire perception systems that are tailored to the task, picking up on the cues that are most useful for the control problem at hand.
  • When a robot picks up an object, it can look at its gripper and see that object from a new viewpoint.
Highlights
  • Robotic learning algorithms based on reinforcement, self-supervision, and imitation can acquire end-to-end controllers from images for diverse tasks such as robotic mobility [1, 2] and object manipulation [3, 4].
  • These end-to-end controllers acquire perception systems that are tailored to the task, picking up on the cues that are most useful for the control problem at hand.
  • By interacting with the real world, an agent can learn about the interplay of perception and action.
  • We presented grasp2vec, a representation learning approach that learns to represent scenes as sets of objects, admitting basic manipulations such as removing and adding objects as arithmetic operations in the learned embedding space.
  • Our method is supervised entirely with data that can be collected autonomously by a robot, and we demonstrate that the learned representation can be used to localize objects, recognize instances, and supervise a goal-conditioned grasping method that can learn via goal relabeling to pick up user-commanded objects (a minimal sketch of this embedding arithmetic and relabeled reward follows this list).
  • Our work suggests a number of promising directions for future research: incorporating semantic information into the representation, leveraging the learned representations for spatial, object-centric relational reasoning tasks (e.g., [27]), and further exploring the compositionality in the representation to enable planning compound skills in the embedding space.
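
The embedding arithmetic and the relabeled reward referenced above can be illustrated with a short, hedged sketch. This is not the authors' implementation: phi_s and phi_o are hypothetical stand-ins for the paper's scene and outcome encoders, assumed to return same-dimensional NumPy vectors, and the similarity threshold is an arbitrary illustrative value.

```python
import numpy as np

# Hypothetical stand-ins (not the released grasp2vec code):
#   phi_s(scene_image)   -> pooled scene embedding, shape (D,)
#   phi_o(outcome_image) -> embedding of the grasped object, shape (D,)

def object_removed_embedding(phi_s, s_pre, s_post):
    """Grasp2vec's arithmetic property: removing an object from the scene should
    subtract roughly that object's embedding, phi_s(s_pre) - phi_s(s_post) ~ phi_o(o)."""
    return phi_s(s_pre) - phi_s(s_post)

def relabeled_reward(phi_s, phi_o, s_pre, s_post, goal_image, threshold=0.5):
    """Sketch of a goal-conditioned grasping reward: reward the grasp if the object
    removed from the scene matches the commanded goal image in embedding space.
    The threshold is an illustrative value, not the paper's setting."""
    removed = object_removed_embedding(phi_s, s_pre, s_post)
    goal = phi_o(goal_image)
    cosine = removed @ goal / (np.linalg.norm(removed) * np.linalg.norm(goal) + 1e-8)
    return 1.0 if cosine > threshold else 0.0
```

Under these assumptions, goal relabeling amounts to treating whatever object was actually grasped, embedded with phi_o, as the goal for that episode, so every successful grasp yields a positive training example without human labels.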
Methods
  • The authors evaluate the representation both in terms of its ability to localize and detect object instances, and in terms of its ability to enable goal-conditioned grasping by supplying an effective reward function and goal representation.
  • Illustrations and videos are available at https://sites.google.com/site/grasp2vec/.
Results
  • As shown in Table 1, grasp2vec embeddings perform localization at almost 80% accuracy on objects that were never seen during training, and without receiving any position labels.
  • The authors expect that such a method could be used to provide goals for pick-and-place or pushing tasks where a particular object position is desired.
  • For this localization evaluation, the authors compare grasp2vec embeddings against the same ResNet50-based architecture used in the embeddings, but trained on ImageNet [26]; this network is only able to localize the objects at 15% accuracy, because the features of an object in the gripper are not necessarily similar to the features of that same object in the bin (a minimal sketch of the localization procedure follows this list).
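
As a concrete illustration of the localization procedure behind these numbers, the following sketch scores a single example. It is not the paper's code: phi_o and phi_s_spatial are hypothetical stand-ins for the outcome encoder and the pre-pooling spatial feature map of the scene encoder, and object_mask is a hypothetical ground-truth mask used only for scoring.

```python
import numpy as np

def localization_correct(phi_o, phi_s_spatial, outcome_image, scene_image, object_mask):
    """Dot the object embedding against every spatial feature of the scene to get a
    heatmap; the localization counts as correct if the argmax lies on the object."""
    obj_vec = phi_o(outcome_image)            # (D,) embedding of the grasped object
    feature_map = phi_s_spatial(scene_image)  # (H, W, D) spatial features before pooling
    heatmap = feature_map @ obj_vec           # (H, W) similarity at each location
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(object_mask[y, x])            # True if the peak falls on the queried object
```

A generic ImageNet feature extractor could be dropped into the same scoring loop in place of phi_s_spatial and phi_o, which is how the 15% baseline above would be obtained under these assumptions.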
Conclusion
  • The authors presented grasp2vec, a representation learning approach that learns to represent scenes as sets of objects, admitting basic manipulations such as removing and adding objects as arithmetic operations in the learned embedding space.
  • The authors' method is supervised entirely with data that can be collected autonomously by a robot, and the authors demonstrate that the learned representation can be used to localize objects, recognize instances, and supervise a goal-conditioned grasping method that can learn via goal relabeling to pick up user-commanded objects.
  • The authors' work suggests a number of promising directions for future research: incorporating semantic information into the representation, leveraging the learned representations for spatial, object-centric relational reasoning tasks (e.g., [27]), and further exploring the compositionality in the representation to enable planning compound skills in the embedding space.
Tables
  • Table 1: Quantitative study of Grasp2Vec embeddings. Object retrieval for a grasp is calculated by finding the outcome image whose embedding is closest to φs(spre) − φs(spost) in the dataset. The object retrieval is counted as correct if the nearest-neighbor image contains the same object as the one grasped from spre. As we cannot expect weights trained on ImageNet to exhibit this property, we evaluate the nearest neighbors between outcome images; the accuracy measures how often the top nearest neighbor of an outcome image contains the same object. Object localization is calculated by multiplying φo(o) with each feature vector in φs,spatial(spre) to obtain a heatmap. An object is localized if the maximum point of the heatmap lies on the outcome object in the scene. See Figure 4b for example heatmaps and Appendix A for example retrievals.
  • Table 2: Evaluation and ablation studies on a simulated instance grasping task, averaged over 700 trials. In simulation, the scene graph is accessed to evaluate ground-truth performance, but it is withheld from our learning algorithms. Performance is reported as the percentage of grasps that picked up the user-specified object.
  • Table 3: Instance grasp success rate on composed goals: we average two grasp2vec vectors at test time and evaluate whether the policy can grasp either of the requested objects (a minimal sketch of the retrieval metric and this goal composition follows this list).
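
For illustration only, here is a hedged sketch of the Table 1 retrieval metric and the Table 3 goal composition, reconstructed from the captions rather than taken from the authors' evaluation code. It assumes a hypothetical scene encoder phi_s, a precomputed (N, D) NumPy matrix outcome_embeddings of φo over the dataset's outcome images, and nearest neighbors taken by dot-product similarity, which is itself an assumption.

```python
import numpy as np

def retrieve_grasped_object(phi_s, s_pre, s_post, outcome_embeddings):
    """Table 1 retrieval: return the index of the outcome image whose embedding is
    closest to phi_s(s_pre) - phi_s(s_post); the retrieval is counted as correct if
    that image shows the object actually grasped from s_pre."""
    query = phi_s(s_pre) - phi_s(s_post)   # (D,) embedding of the removed object
    scores = outcome_embeddings @ query    # (N,) similarity to every outcome image
    return int(np.argmax(scores))

def composed_goal(goal_vec_a, goal_vec_b):
    """Table 3 goal composition: average two grasp2vec goal vectors; the policy is
    judged successful if it grasps either of the two requested objects."""
    return 0.5 * (goal_vec_a + goal_vec_b)
```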
Related work
  • Unsupervised representation learning. Past works on interactive learning have used egomotion of a mobile agent or poking motions [6, 7, 8, 9, 10] to provide data-efficient learning of perception and control. Our approach learns representations that abstract away position and appearance, while preserving object identity and the combinatorial structure in the world (i.e., which objects are present) via a single feature vector. Past work has also found that deep representations can exhibit intuitive linear relationships, such as in word embeddings [11] and in face attributes [12]. Wang et al. represent actions as the transformation from precondition to effect in the action recognition domain [13]. While our work shares the idea of arithmetic coherence over the course of an action, we optimize a different criterion and apply the model to learning policies rather than action recognition.
References
  • [1] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert. Learning monocular reactive UAV control in cluttered natural environments. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 1765–1772. IEEE, 2013.
  • [2] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
  • [3] A. Ghadirzadeh, A. Maki, D. Kragic, and M. Bjorkman. Deep predictive policy training using reinforcement learning. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 2351–2358. IEEE, 2017.
  • [4] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. In International Symposium on Experimental Robotics (ISER 2016), 2016.
  • [5] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
  • [6] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pages 5074–5082, 2016.
  • [7] F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268, 2017.
  • [8] L. Pinto, D. Gandhi, Y. Han, Y.-L. Park, and A. Gupta. The curious robot: Learning visual representations via physical interactions. In European Conference on Computer Vision, pages 3–18, 2016.
  • [9] D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In Proceedings of the IEEE International Conference on Computer Vision, pages 1413–1421, 2015.
  • [10] R. Jonschkowski and O. Brock. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, 2015.
  • [11] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
  • [12] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [13] X. Wang, A. Farhadi, and A. Gupta. Actions ~ Transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2658–2667, 2016.
  • [14] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 3406–3413. IEEE, 2016.
  • [15] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
  • [16] A. Zeng, K.-T. Yu, S. Song, D. Suo, E. Walker, A. Rodriguez, and J. Xiao. Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 1386–1383. IEEE, 2017.
  • [17] A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt. Grasp pose detection in point clouds. The International Journal of Robotics Research, page 0278364917735594, 2017.
  • [18] E. Jang, S. Vijaynarasimhan, P. Pastor, J. Ibarz, and S. Levine. End-to-end learning of semantic grasping. arXiv preprint arXiv:1707.01932, 2017.
  • [19] K. Fang, Y. Bai, S. Hinterstoisser, and M. Kalakrishnan. Multi-task domain adaptation for deep learning of instance grasping from simulation. arXiv preprint arXiv:1710.06422, 2017.
  • [20] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
  • [21] S. Cabi, S. G. Colmenarejo, M. W. Hoffman, M. Denil, Z. Wang, and N. de Freitas. The intentional unintentional agent: Learning to solve many continuous control tasks simultaneously. arXiv preprint arXiv:1707.03300, 2017.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [23] K. Sohn. Improved deep metric learning with multi-class N-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016.
  • [24] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
  • [25] E. Coumans and Y. Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2018.
  • [26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [27] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.