MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation

NeurIPS 2020

Abstract

Navigation tasks in photorealistic 3D environments are challenging because they require perception and effective planning under partial observability. Recent work shows that map-like memory is useful for long-horizon navigation tasks. However, a focused investigation of the impact of maps on navigation tasks of varying complexity has not yet been performed. We propose the multiON task, which requires navigation to an episode-specific sequence of objects in a realistic environment and enables systematic analysis of agents with and without semantic map memory.

Introduction
  • Recent work on embodied AI agents has made tremendous progress on tasks such as visual navigation [25, 39, 52], embodied question answering [20, 44], and natural language instruction following [3].
  • Though map-like memory structures need not be optimal for learning-based agents, they bring the advantage of a widely used spatial abstraction and human interpretability.
  • They impose an inductive bias tied to the structure of interiors and have been shown to outperform implicit memory architectures in a variety of navigation tasks [7, 13, 14, 15] (a minimal sketch of such a map memory follows this list).
  • This is in contrast with prior work, which focuses on how to aggregate information under partial observability.
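To make the map-memory idea concrete, below is a minimal Python sketch of a top-down semantic grid map of the kind such agents maintain. This is a sketch under assumptions: the grid size, resolution, category encoding, and all names are illustrative, not the paper's implementation.

```python
import numpy as np

class SemanticGridMap:
    """Top-down grid storing one object-category label per cell.
    Illustrative only; not the paper's implementation."""

    def __init__(self, size_m=40.0, resolution_m=0.2):
        self.res = resolution_m
        n = int(size_m / resolution_m)
        # 0 = unobserved; positive values = semantic category labels
        self.grid = np.zeros((n, n), dtype=np.int8)
        self.origin = n // 2  # the world origin maps to the grid center

    def world_to_cell(self, x, z):
        # Convert world-frame (x, z) meters to integer grid indices.
        return (self.origin + int(round(x / self.res)),
                self.origin + int(round(z / self.res)))

    def write(self, x, z, category):
        # Record an observed category at a world position, if in bounds.
        i, j = self.world_to_cell(x, z)
        if 0 <= i < self.grid.shape[0] and 0 <= j < self.grid.shape[1]:
            self.grid[i, j] = category

    def read_patch(self, x, z, half=10):
        # Crop a local window around the agent, e.g. as a policy input.
        i, j = self.world_to_cell(x, z)
        return self.grid[max(i - half, 0):i + half,
                         max(j - half, 0):j + half]
```

Writing observed object categories into an allocentric grid like this is what gives map memories their interpretability: the grid can be rendered and inspected directly, unlike an implicit recurrent state.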
Highlights
  • Recent work on embodied AI agents has made tremendous progress on tasks such as visual navigation [25, 39, 52], embodied question answering [20, 44], and natural language instruction following [3]. This progress has been enabled by the availability of realistic 3D environments and software platforms that simulate navigation tasks within such data [32, 38, 39, 47].
  • Large-scale training has led to near-perfect agent performance on basic visual navigation tasks under certain assumptions [45].
  • Spatial maps built with Simultaneous Localization and Mapping (SLAM) have been used for tasks such as exploration [51] and playing FPS games [8].
  • We introduced multiON, a task framework allowing for systematic analysis of embodied AI navigation agents utilizing semantic map representations.
  • We hope that the multiON framework provides a flexible benchmark for the systematic study of spatial memory and mapping mechanisms in embodied navigation agents.
Methods
  • The authors describe a series of experiments with the various agent models on the multiON task.
  • The authors generate datasets for the 1-ON, 2-ON, and 3-ON tasks on the Matterport3D [11] scenes using the standard train/val/test split.
  • The geodesic distance from the agent starting position to the first goal and between successive goals is constrained to be between 2m and 20m.
  • This ensures that ‘trivial’ episodes are not generated (a hedged sketch of this filter follows).
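The distance constraint above amounts to a simple per-episode filter. Here is a hedged sketch: `geodesic_distance` stands in for a simulator query (e.g. a navmesh shortest-path call) and is an assumption, not the paper's code.

```python
# Keep an episode only if the geodesic distance from the start to the
# first goal, and between each pair of successive goals, is in [2 m, 20 m].
MIN_DIST_M, MAX_DIST_M = 2.0, 20.0

def episode_is_valid(start, goals, geodesic_distance):
    waypoints = [start] + list(goals)
    for a, b in zip(waypoints, waypoints[1:]):
        d = geodesic_distance(a, b)
        if not (MIN_DIST_M <= d <= MAX_DIST_M):
            return False  # leg is trivially short or overly long
    return True
```

The lower bound rules out episodes where a goal is already within a few steps of the agent; the upper bound keeps episodes tractable within the step budget.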
Results
  • Topological maps, such as those proposed by Chaplot et al. [15] and Savinov et al. [37], do not align visual information to the environment’s top-down layout.
  • Instead, they store landmarks as nodes and their connectivity as edges (see the sketch after this list).
  • Prior knowledge of scene layout is used as a knowledge graph in various works [49, 46].
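For contrast with the grid map sketched earlier, here is a minimal sketch of such a topological memory: landmarks as nodes, traversability as edges. The class, method names, and feature payload are illustrative assumptions, not the cited implementations.

```python
from dataclasses import dataclass, field

@dataclass
class TopologicalMap:
    """Landmark graph memory: nodes hold landmark features (e.g. view
    embeddings); undirected edges record traversability."""
    nodes: dict = field(default_factory=dict)  # node_id -> landmark feature
    edges: set = field(default_factory=set)    # frozensets of node-id pairs

    def add_landmark(self, node_id, feature):
        self.nodes[node_id] = feature

    def connect(self, a, b):
        # The agent has traversed between landmarks a and b.
        self.edges.add(frozenset((a, b)))

    def neighbors(self, node_id):
        # Landmarks reachable in one hop from node_id.
        return [next(iter(e - {node_id}))
                for e in self.edges if node_id in e]
```

Unlike the grid map, nothing here is registered to metric coordinates, which is exactly the distinction drawn above.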
Conclusion
  • The authors introduced multiON, a task framework allowing for systematic analysis of embodied AI navigation agents utilizing semantic map representations.
  • The authors' experiments with several agent models show that semantic maps are highly useful for navigation, with even a relatively naïve integration of semantic information into map memory yielding large gains over more complex learned map representations.
  • The authors hope that the multiON framework provides a flexible benchmark for the systematic study of spatial memory and mapping mechanisms in embodied navigation agents.
Tables
  • Table 1: Comparison of multiON to navigation tasks with multiple object goals from prior work. Note that prior work does not adopt a FOUND action, so incidental navigation to a goal is treated as success. Moreover, the set of objects is held fixed, with no episode-specificity for the goal objects.
  • Table 2: Agent performance on the 1-ON, 2-ON, and 3-ON test sets (maximum 2,500 steps). The multiON task is challenging, with Rand+OracleFound achieving 26% success (SPL 8%) for 1-ON, and Rand failing completely. Performance decreases for all agents as more objects are added. Overall, maps help considerably, with the ability to represent goal objects in the map being particularly valuable (compare OracleMap (Obj) with OracleMap (Occ), and ObjRecogMap with ProjNeuralMap).
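For reference, the SPL values in Table 2 follow the Success weighted by Path Length metric of Anderson et al. [2]: over N episodes, with S_i the binary success indicator, l_i the shortest-path (geodesic) distance for episode i, and p_i the length of the path the agent actually took:

```latex
% Success weighted by Path Length (SPL), as defined in [2]
\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{l_i}{\max(p_i,\, l_i)}
```

SPL is at most the success rate, and it penalizes successful episodes in proportion to how much longer the agent's path was than the shortest path.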
Related work
  • Embodied AI agents. There has been much interest in studying AI agents in simulated 3D environments [1, 9, 11, 32, 47, 4, 39, 48, 43]. Learning how to tackle a variety of tasks from egocentric perception is a common theme in this area. Embodied navigation is a family of closely related tasks where the goal is to navigate to specific points, objects, or areas (PointGoal, ObjectGoal, and AreaGoal, respectively) [2]. In PointGoal navigation [39, 14, 23], the agent has access to a displacement vector to the goal at each time step (see the sketch below), largely obviating the need for long-horizon planning and map-like memory. PointGoal has been extensively studied and was recently ‘solved’ [45]. In contrast, ObjectGoal has not been well studied despite being introduced in early work by Zhu et al. [53] and explored in natural language grounding settings [12, 28].
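As a concrete illustration of that displacement-vector observation, here is a hedged sketch. The (distance, angle) encoding and the frame conventions are assumptions for illustration; exact conventions vary across simulation platforms.

```python
import math

def pointgoal_observation(agent_xy, agent_heading, goal_xy):
    """Goal displacement expressed in the agent's frame, encoded as
    (distance, heading-relative angle). Conventions are illustrative."""
    dx = goal_xy[0] - agent_xy[0]
    dy = goal_xy[1] - agent_xy[1]
    distance = math.hypot(dx, dy)
    # Angle to the goal relative to the agent's current heading,
    # wrapped into [-pi, pi).
    angle = math.atan2(dy, dx) - agent_heading
    angle = (angle + math.pi) % (2 * math.pi) - math.pi
    return distance, angle
```

Because this observation directly encodes where the goal is, a PointGoal agent can succeed by greedy goal-directed control; multiON removes this shortcut, which is why memory matters there.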
Funding
  • Angel X. Chang is supported by the Canada CIFAR AI Chair program.
  • Manolis Savva is supported by an NSERC Discovery Grant.
References
  • P. Ammirato, P. Poirson, E. Park, J. Kosecka, and A. Berg. A dataset for developing and benchmarking active vision. In ICRA, 2016.
  • P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, and A. R. Zamir. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
  • P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
  • I. Armeni, S. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
  • A. Banino, C. Barry, B. Uria, C. Blundell, T. Lillicrap, P. Mirowski, A. Pritzel, M. J. Chadwick, T. Degris, J. Modayil, G. Wayne, H. Soyer, F. Viola, B. Zhang, R. Goroshin, N. Rabinowitz, R. Pascanu, C. Beattie, S. Petersen, A. Sadik, S. Gaffney, H. King, K. Kavukcuoglu, D. Hassabis, R. Hadsell, and D. Kumaran. Vector-based navigation using grid-like representations in artificial agents. Nature, 557(7705):429–433, 2018.
  • E. Beeching, C. Wolf, J. Dibangoye, and O. Simonin. Deep reinforcement learning on a budget: 3D control and reasoning without a supercomputer. arXiv preprint arXiv:1904.01806, 2019.
  • E. Beeching, C. Wolf, J. Dibangoye, and O. Simonin. EgoMap: Projective mapping and structured egocentric memory for deep RL. arXiv preprint arXiv:2002.02286, 2020.
  • S. Bhatti, A. Desmaison, O. Miksik, N. Nardelli, N. Siddharth, and P. H. Torr. Playing Doom with SLAM-augmented deep reinforcement learning. arXiv preprint arXiv:1612.00380, 2016.
  • S. Brodeur, E. Perez, A. Anand, F. Golemo, L. Celotti, F. Strub, J. Rouat, H. Larochelle, and A. Courville. Home: A household multimodal environment. arXiv preprint arXiv:1711.11017, 2017.
  • V. Cartillier, Z. Ren, N. Jain, S. Lee, I. Essa, and D. Batra. Semantic MapNet: Building allocentric semantic maps and representations from egocentric views. arXiv preprint arXiv:2010.01191, 2020.
  • A. Chang, A. Dai, T. Funkhouser, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV, 2017.
  • D. S. Chaplot, K. M. Sathyendra, R. K. Pasumarthi, D. Rajagopal, and R. R. Salakhutdinov. Gated-attention architectures for task-oriented language grounding. In AAAI, 2018.
  • D. S. Chaplot, D. Gandhi, A. Gupta, and R. Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. In Conference on Neural Information Processing Systems, 2020.
  • D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov. Learning to explore using active neural SLAM. In International Conference on Learning Representations (ICLR), 2020.
  • D. S. Chaplot, R. Salakhutdinov, A. Gupta, and S. Gupta. Neural topological SLAM for visual navigation. In CVPR, 2020.
  • C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman. SoundSpaces: Audio-visual navigation in 3D environments. arXiv preprint arXiv:1912.11474, 2019. First two authors contributed equally.
  • T. Chen, S. Gupta, and A. Gupta. Learning exploration policies for navigation. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SyMWn05F7.
  • K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
  • A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied question answering. In CVPR, 2018.
  • A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • K. Fang, A. Toshev, L. Fei-Fei, and S. Savarese. Scene memory transformer for embodied agents in long-horizon tasks. In CVPR, 2019.
  • D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. IQA: Visual question answering in interactive environments. In CVPR, 2018.
  • D. Gordon, A. Kadian, D. Parikh, J. Hoffman, and D. Batra. SplitNet: Sim2Sim and Task2Task transfer for embodied visual navigation. ICCV, 2019.
  • C. Guo and F. Berkhahn. Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737, 2016.
  • S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. In CVPR, 2017.
  • S. Gupta, D. Fouhey, S. Levine, and J. Malik. Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125, 2017.
  • J. F. Henriques and A. Vedaldi. MapNet: An allocentric spatial memory for mapping environments. In CVPR, 2018.
  • F. Hill, S. Clark, K. M. Hermann, and P. Blunsom. Understanding early word learning in situated artificial agents. arXiv preprint arXiv:1710.09867, 2017.
  • U. Jain, L. Weihs, E. Kolve, M. Rastegari, S. Lazebnik, A. Farhadi, A. G. Schwing, and A. Kembhavi. Two body problem: Collaborative visual task completion. In CVPR, 2019. first two authors contributed equally.
  • U. Jain, L. Weihs, E. Kolve, A. Farhadi, S. Lazebnik, A. Kembhavi, and A. G. Schwing. A cordial sync: Going beyond marginal policies for multi-agent embodied tasks. In ECCV, 2020. first two authors contributed equally.
  • M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In Proc. IEEE Conf. on Computational Intelligence and Games, 2016.
  • E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi. AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474, 2017.
  • P. Mirowski, M. Grimes, M. Malinowski, K. M. Hermann, K. Anderson, D. Teplyashin, K. Simonyan, A. Zisserman, and R. Hadsell. Learning to navigate in cities without a map. In NeurIPS, 2018.
  • J. Oh, V. Chockalingam, S. Singh, and H. Lee. Control of memory, active perception, and action in minecraft. In ICML, 2016.
  • E. Parisotto and R. Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. In ICLR, 2018.
  • P. Rodríguez, M. A. Bautista, J. Gonzalez, and S. Escalera. Beyond one-hot encoding: Lower dimensional target embedding. Image and Vision Computing, 75:21–31, 2018.
  • N. Savinov, A. Dosovitskiy, and V. Koltun. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.
  • M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931, 2017.
  • M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra. Habitat: A platform for embodied AI research. In ICCV, 2019.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In CVPR, 2020.
  • X. Wang, W. Xiong, H. Wang, and W. Y. Wang. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In ECCV, pages 37-53, 2018.
  • L. Weihs, J. Salvador, K. Kotar, U. Jain, K.-H. Zeng, R. Mottaghi, and A. Kembhavi. AllenAct: A framework for embodied AI research. arXiv preprint arXiv:2008.12760, 2020.
  • E. Wijmans, S. Datta, O. Maksymets, A. Das, G. Gkioxari, S. Lee, I. Essa, D. Parikh, and D. Batra. Embodied question answering in photorealistic environments with point cloud perception. In CVPR, 2019.
  • E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra. DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames. In ICLR, 2020.
  • Y. Wu, Y. Wu, A. Tamar, S. Russell, G. Gkioxari, and Y. Tian. Bayesian relational memory for semantic visual navigation. ICCV, 2019.
  • F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese. Gibson Env: Real-world perception for embodied agents. In CVPR, 2018.
  • F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese. Interactive Gibson: A benchmark for interactive navigation in cluttered environments. arXiv preprint arXiv:1910.14442, 2019.
  • W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi. Visual semantic navigation using scene priors. In ICLR, 2019.
  • L. Yu, X. Chen, G. Gkioxari, M. Bansal, T. L. Berg, and D. Batra. Multi-target embodied question answering. In CVPR, 2019.
  • J. Zhang, L. Tai, J. Boedecker, W. Burgard, and M. Liu. Neural SLAM: Learning to explore with external memory. arXiv preprint arXiv:1706.09520, 2017.
  • Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mottaghi, and A. Farhadi. Visual Semantic Planning using Deep Successor Representations. In ICCV, 2017.
  • Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017.
Authors
Saim Wani
Shivansh Patel
Unnat Jain
Angel X. Chang
Manolis Savva