Embodied Question Answering in Photorealistic Environments with Point Cloud Perception.

CVPR, 2019

Abstract

To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception, we instantiate a large-scale navigation task – Embodied Question Answering [1] in photo-realistic environments (Matterport 3D). We thoroughly study navigation policies that utilize 3D point clouds, RGB images, or their combination...

Introduction
  • Imagine asking a home robot ‘Hey - can you go check if my laptop is on my desk? And if so, bring it to me.’ In order to be successful, such an agent would need a range of artificial intelligence (AI) skills – visual perception, language understanding, and navigation of potentially novel environments.
  • Much of the recent success in these areas is due to large neural networks trained on massive human-annotated datasets collected from the web.
  • This static paradigm of ‘internet vision’ is poorly suited for training embodied agents.
  • While these tasks are set in semantically realistic environments, most are based in synthetic environments that are perceptually quite different from what agents embodied in the real world might experience.
  • These environments lack visual realism both in terms of the fidelity of textures, lighting, and object geometries and with respect to the rich in-class variation of objects.
  • These problems are typically approached with 2D perception (RGB frames) despite the widespread use of depth-sensing cameras (RGB-D) on actual robotic platforms [13,14,15].
Highlights
  • Imagine asking a home robot ‘Hey - can you go check if my laptop is on my desk? And if so, bring it to me.’ In order to be successful, such an agent would need a range of artificial intelligence (AI) skills – visual perception, language understanding, and navigation of potentially novel environments.
  • We address these points of disconnect by instantiating a large-scale, language-based navigation task in photorealistic environments and by developing end-to-end trainable models with point cloud perception – from raw 3D point clouds to goal-driven navigation policies.
  • We introduce the Matterport3D Embodied Question Answering (MP3D-EQA) dataset, consisting of 1136 questions and answers grounded in 83 environments.
  • We present an extension of the task of EmbodiedQA to photorealistic environments utilizing the Matterport 3D dataset and propose the MP3D-EQA v1 dataset.
  • We present a thorough study of 2 navigation baselines and 2 different navigation architectures with 8 different input variations.
  • We develop an end-to-end trainable navigation model capable of learning goal-driven navigation policies directly from 3D point clouds (a minimal sketch of such a pipeline follows this list).
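To make the point-cloud-to-policy pipeline concrete, below is a minimal sketch (in PyTorch) of a PointNet-style encoder feeding a recurrent navigation policy. The class names, layer sizes, and four-action space are illustrative assumptions for exposition, not the authors' exact architecture; the paper's point cloud perception builds on PointNet++ [32].

```python
# Minimal sketch (assumption: PyTorch) of a PointNet-style point cloud encoder
# feeding a recurrent navigation policy. Names and sizes are illustrative.
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Encode an (N, 6) point cloud (xyz + rgb) into a fixed-size feature."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # Shared per-point MLP followed by a symmetric (max) pooling,
        # which makes the encoding invariant to point ordering.
        self.mlp = nn.Sequential(
            nn.Linear(6, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, points):              # points: (B, N, 6)
        per_point = self.mlp(points)        # (B, N, feat_dim)
        return per_point.max(dim=1).values  # (B, feat_dim)

class NavPolicy(nn.Module):
    """GRU policy over encoded observations and a question embedding."""
    def __init__(self, feat_dim=512, q_dim=128, hidden=256, n_actions=4):
        super().__init__()
        self.encoder = PointCloudEncoder(feat_dim)
        self.gru = nn.GRUCell(feat_dim + q_dim, hidden)
        self.actor = nn.Linear(hidden, n_actions)  # e.g. forward, left, right, stop

    def forward(self, points, question_emb, h):
        obs = torch.cat([self.encoder(points), question_emb], dim=-1)
        h = self.gru(obs, h)
        return self.actor(h), h

# Usage: one decision step for a batch of 2 agents observing 1024 points each.
policy = NavPolicy()
pts = torch.randn(2, 1024, 6)
q = torch.randn(2, 128)
h = torch.zeros(2, 256)
logits, h = policy(pts, q, h)
```

The design point mirrored here is the order-invariant pooling over per-point features, so the policy consumes raw point sets directly rather than rasterized depth maps.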
Methods
  • Experiments and Analysis

    The authors closely follow the experimental protocol of Das et al. [1].
  • Agents are evaluated when spawned 10, 30, or 50 primitive actions away from the question target, corresponding to distances of 0.35, 1.89, and 3.54 meters, respectively.
  • The authors perform an exhaustive evaluation of design decisions, training a total of 16 navigation models (2 architectures, 2 language variations, and 4 perception variations), 3 visual question answering models, and 2 perception models; the 16-way navigation grid is enumerated in the sketch below.
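As a quick illustration of the 16-way navigation grid (2 architectures x 2 language variations x 4 perception variations), the snippet below enumerates the combinations. The specific variant labels are assumptions based on the notation described in the Results section, not the authors' exact configuration list.

```python
# Sketch: enumerate the 16 navigation configurations described above.
# Labels (R/M, +Q, +RGB, +PC) follow the notation in the Results section;
# the exact set of perception variants is an illustrative assumption.
from itertools import product

architectures = ["R", "M"]                    # reactive / memory
language = ["", "+Q"]                         # without / with the question
perception = ["none", "+RGB", "+PC", "+PC+RGB"]

configs = []
for arch, lang, percep in product(architectures, language, perception):
    name = arch + ("" if percep == "none" else percep) + lang
    configs.append(name)

print(len(configs))  # 16
print(configs)
```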
Results
  • The top-1 accuracy for different answering modules on the validation set using the ground-truth navigator is shown below (a minimal sketch of the top-1 metric follows this list).
  • In order to compare QA performance between navigators, the authors report all QA results with the best-performing module – spatial+RGB+Q – regardless of the navigator.
  • The authors use the following notation to specify the models: For the base architecture, R denotes reactive models and M denotes memory models.
  • A memory model that utilizes point clouds is denoted as M+PC.
  • The authors denote the two baseline navigators, forward-only and random, as Fwd and Random, respectively.
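For reference, the top-1 answering accuracy reported above is the fraction of questions for which the highest-scoring answer matches the ground truth. The snippet below is a minimal, self-contained sketch; the tensor shapes and the three-answer vocabulary are illustrative.

```python
# Sketch of the top-1 accuracy computation (PyTorch; illustrative shapes).
import torch

def top1_accuracy(logits, targets):
    """logits: (B, num_answers); targets: (B,) ground-truth answer indices."""
    preds = logits.argmax(dim=-1)
    return (preds == targets).float().mean().item()

# Example: 4 questions, 3 candidate answers.
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.0, 0.2, 0.9],
                       [1.0, 0.4, 0.2]])
targets = torch.tensor([0, 1, 2, 1])
print(top1_accuracy(logits, targets))  # 0.75
```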
Conclusion
  • The authors present an extension of the task of EmbodiedQA to photorealistic environments utilizing the Matterport 3D dataset and propose the MP3D-EQA v1 dataset.
  • The authors present a thorough study of 2 navigation baselines and 2 different navigation architectures with 8 different input variations.
  • The authors develop an end-to-end trainable navigation model capable of learning goal-driven navigation policies directly from 3D point clouds.
  • The authors provide analysis and insight into the factors that affect navigation performance and propose a novel weighting scheme – Inflection Weighting – that increases the effectiveness of behavior cloning (sketched after this list).
  • The authors demonstrate that the two navigation baselines, random and forward-only, are quite strong under the evaluation settings presented by [1].
  • The authors' work serves as a step towards bridging the gap between internet vision-style problems and the goal of vision for embodied perception.
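Inflection Weighting addresses the fact that expert navigation trajectories are dominated by repeated 'forward' actions, so a plain behavior-cloning loss under-emphasizes the rare steps where the action changes. The sketch below up-weights those inflection steps (where a_t differs from a_{t-1}) in the cross-entropy loss; the particular weight value and normalization are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of the Inflection Weighting idea: up-weight time steps where
# the expert's action changes (a_t != a_{t-1}) during behavior cloning.
# The weight value and normalization here are illustrative assumptions.
import torch
import torch.nn.functional as F

def inflection_weighted_loss(logits, actions, inflection_weight=5.0):
    """
    logits:  (T, num_actions) policy outputs along an expert trajectory
    actions: (T,) expert actions
    """
    per_step = F.cross_entropy(logits, actions, reduction="none")  # (T,)
    weights = torch.ones_like(per_step)
    # An inflection is any step whose action differs from the previous one.
    inflections = actions[1:] != actions[:-1]
    weights[1:][inflections] = inflection_weight
    return (weights * per_step).sum() / weights.sum()

# Example: a short trajectory that mostly repeats 'forward' (action 0).
logits = torch.randn(6, 4)
actions = torch.tensor([0, 0, 0, 1, 0, 3])
loss = inflection_weighted_loss(logits, actions)
```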
Tables
  • Table 1: Statistics of splits for EQA in Matterport3D
Related work
  • Embodied Agents and Environments. End-to-end learning methods – to predict actions directly from raw pixels [17] – have recently demonstrated strong performance. Gupta et al. [2] learn to navigate via mapping and planning. Sadeghi and Levine [18] teach an agent to fly using simulated data and deploy it in the real world. Gandhi et al. [19] collect a dataset of drone crashes and train self-supervised agents to avoid obstacles. A number of new challenging tasks have been proposed, including instruction-based navigation [6, 7], target-driven navigation [2, 4], embodied/interactive question answering [1, 9], and task planning [5].
Funding
  • This work was supported in part by NSF (Grant # 1427300), AFRL, DARPA, Siemens, Samsung, Google, Amazon, ONR YIPs and ONR Grants N00014-16-1-{2713,2793}
References
  • [1] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied Question Answering. In CVPR, 2018.
  • [2] Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In CVPR, 2017.
  • [3] Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3D environment. In ICLR Workshop, 2018.
  • [4] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017.
  • [5] Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Gupta, Roozbeh Mottaghi, and Ali Farhadi. Visual semantic planning using deep successor representations. In ICCV, 2017.
  • [6] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Ruslan Salakhutdinov. Gated-attention architectures for task-oriented language grounding. arXiv preprint arXiv:1706.07230, 2017.
  • [7] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
  • [8] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojtek Czarnecki, Max Jaderberg, Denis Teplyashin, et al. Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551, 2017.
  • [9] Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. IQA: Visual question answering in interactive environments. In CVPR, 2018.
  • [10] Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Neural modular control for Embodied Question Answering. In CoRL, 2018.
  • [11] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
  • [12] Arthur Juliani, Vincent-Pierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange. Unity: A general platform for intelligent agents. arXiv preprint, 2018.
  • [13] Albert S. Huang, Abraham Bachrach, Peter Henry, Michael Krainin, Daniel Maturana, Dieter Fox, and Nicholas Roy. Visual odometry and mapping for autonomous flight using an RGB-D camera. In Robotics Research, 2017.
  • [14] Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois Robert Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, Nima Fazeli, Ferran Alet, Nikhil Chavan Dafle, Rachel Holladay, Isabella Morona, Prem Qu Nair, Druck Green, Ian Taylor, Weber Liu, Thomas Funkhouser, and Alberto Rodriguez. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In ICRA, 2018.
  • [15] Shiqi Zhang, Yuqian Jiang, Guni Sharon, and Peter Stone. Multirobot symbolic planning under temporal uncertainty. In AAMAS, 2017.
  • [16] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV, 2017.
  • [17] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(1):1334–1373, 2016.
  • [18] Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real single-image flight without a single real image. In RSS, 2017.
  • [19] Dhiraj Gandhi, Lerrel Pinto, and Abhinav Gupta. Learning to fly by crashing. In IROS, 2017.
  • [20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [21] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. DeepMind Lab. arXiv preprint, 2016.
  • [22] Michal Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaskowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. arXiv preprint arXiv:1605.02097, 2016.
  • [23] Manolis Savva, Angel X. Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931, 2017.
  • [24] Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, and Aaron C. Courville. HoME: A household multimodal environment. arXiv preprint arXiv:1711.11017, 2017.
  • [25] Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint, 2017.
  • [26] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
  • [27] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
  • [28] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015.
  • [29] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IROS, 2015.
  • [30] Charles R. Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In CVPR, 2016.
  • [31] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
  • [32] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
  • [33] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional ShapeContextNet for point cloud recognition. In CVPR, 2018.
  • [34] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In CVPR, 2018.
  • [35] Roman Klokov and Victor Lempitsky. Escape from cells: Deep Kd-networks for the recognition of 3D point cloud models. In ICCV, 2017.
  • [36] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
  • [37] Kenneth L. Kelly. Twenty-two colors of maximum contrast. Color Engineering, 3(26):26–27, 1965.
  • [38] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3D point clouds. arXiv preprint arXiv:1707.02392, 2017.
  • [39] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [40] Ankesh Anand, Eugene Belilovsky, Kyle Kastner, Hugo Larochelle, and Aaron Courville. Blindfold baselines for Embodied QA. arXiv preprint arXiv:1811.05013, 2018.
  • [41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • [42] Jesse Thomason, Daniel Gordon, and Yonatan Bisk. Shifting the baseline: Single modality performance on visual navigation & QA. arXiv preprint arXiv:1811.00613, 2018.
  • [43] Yuxin Wu and Kaiming He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.
  • [44] Alex Nash, Sven Koenig, and Craig Tovey. Lazy Theta*: Any-angle path planning and path length analysis in 3D. In Third Annual Symposium on Combinatorial Search (SoCS), 2010.
  • [45] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.