Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation

European Conference on Computer Vision (ECCV), pp. 513–529, 2020.


Abstract:

We introduce a learning-based approach for room navigation using semantic maps. Our proposed architecture learns to predict top-down belief maps of regions that lie beyond the agent's field of view while modeling architectural and stylistic regularities in houses. First, we train a model to generate amodal semantic top-down maps indicating…

Introduction
  • Humans have an uncanny ability to seamlessly navigate in unseen environments by quickly understanding their surroundings.
  • A possible, but tedious solution is to head in a random direction and exhaustively search the space until you end up in the kitchen.
  • Another option, and most probably the one you’d pick, is to walk towards the dining room as you are more likely to find the kitchen near the dining room rather than the bedroom.
  • The agent models correlations between the appearance and architectural layout of houses to efficiently navigate in unseen scenes.
Highlights
  • Humans have an uncanny ability to seamlessly navigate in unseen environments by quickly understanding their surroundings
  • (2) Through carefully designed ablations, we show that our model trained to predict semantic maps as intermediate representations achieves better performance on unseen environments compared to a baseline which doesn’t explicitly generate semantic top-down maps
  • We develop a room navigation framework with an explicit mapping strategy that predicts amodal semantic maps by learning underlying correlations in houses and navigates using these maps (see the pipeline sketch after this list)
  • Our room navigation framework described in Sec. 4 achieves an SPL (Success weighted by Path Length) of 0.31 on validation and 0.29 on the test set
  • We proposed a novel learning-based approach for Room Navigation which models architectural and stylistic regularities in houses
  • Our results using ground truth maps indicate that there is a large scope for improvement in room navigation performance by improving the intermediate map prediction step
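To make the framework referenced above concrete, here is a minimal sketch of how the three stages could be composed. It is illustrative only: Observation, MapGenerator, PointPredictor, and PointNavPolicy are hypothetical placeholder interfaces, not the authors' released code, and the map sizes are arbitrary.

```python
# Minimal sketch of the three-stage room navigation pipeline (illustrative).
# All classes below are hypothetical placeholders, not the authors' code.
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray   # egocentric RGB frame, e.g. (H, W, 3)
    pose: np.ndarray  # agent state (x, y, heading)


class MapGenerator:
    """Stage 1: maintain an amodal top-down semantic belief map (rooms x H x W)."""
    def predict(self, obs: Observation, prev_map: np.ndarray) -> np.ndarray:
        return prev_map  # placeholder: a learned model would refine the belief


class PointPredictor:
    """Stage 2: pick a target point believed to lie inside the goal room."""
    def predict(self, belief_map: np.ndarray, room_id: int) -> tuple:
        channel = belief_map[room_id]
        return np.unravel_index(np.argmax(channel), channel.shape)


class PointNavPolicy:
    """Stage 3: a point navigation policy that walks to a map coordinate."""
    def act(self, obs: Observation, goal_xy: tuple) -> str:
        return "MOVE_FORWARD"  # placeholder action


def room_nav_step(obs, belief_map, room_id, mapper, pointer, policy):
    belief_map = mapper.predict(obs, belief_map)    # update amodal belief map
    goal_xy = pointer.predict(belief_map, room_id)  # point in the target room
    return policy.act(obs, goal_xy), belief_map     # navigate towards it


obs = Observation(rgb=np.zeros((256, 256, 3)), pose=np.zeros(3))
action, belief = room_nav_step(obs, np.zeros((9, 64, 64)), room_id=2,
                               mapper=MapGenerator(), pointer=PointPredictor(),
                               policy=PointNavPolicy())
```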
Results
  • [Fig. 5: SPL for Room Navigation. Legend: starting location, agent path, shortest path, closest point in room; target: bedroom.]
  • The authors design baselines to evaluate the effectiveness of each component of the proposed room navigation framework and to validate the approach.
  • Table 1 shows the RoomNav-SPL and Success scores on the room navigation task; the SPL is computed w.r.t. the ground-truth target point.
  • The authors' room navigation framework described in Sec. 4 achieves an SPL of 0.31 on validation and 0.29 on the test set.
  • Fine-tuning the point navigation policy on points predicted by the point prediction network improves the SPL to 0.35 on validation and 0.33 on test, making this the best performing model.
  • The authors compare to an approach that does not use semantic maps to model correlations and does not use point navigation.
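For reference, SPL here is the standard Success weighted by Path Length metric of Anderson et al. [1]. A minimal sketch of how it could be computed:

```python
# SPL (Success weighted by Path Length), Anderson et al. [1]:
# SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is binary episode
# success, l_i the shortest-path length, and p_i the agent's actual path length.
def spl(successes, shortest_lengths, agent_lengths):
    terms = [
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, agent_lengths)
    ]
    return sum(terms) / len(terms)


# Example: one success with a 20% longer path than optimal, one failure.
print(spl([1, 0], [10.0, 8.0], [12.0, 30.0]))  # ~0.417
```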
Conclusion
  • The authors proposed a novel learning-based approach for Room Navigation which models architectural and stylistic regularities in houses.
  • The authors' approach consists of predicting top-down belief maps containing room semantics beyond the field of view of the agent, finding a point in the specified target room, and navigating to that point using a point navigation policy (a simplified map-network sketch follows this list).
  • The authors' model’s improved performance (SPL) compared to the baselines confirms that learning to generate amodal semantic belief maps of room layouts improves room navigation performance in unseen environments.
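The map generation step could be realized with an encoder-decoder segmentation network in the spirit of U-Net [30], which the paper cites. Below is a heavily simplified sketch; the layer widths, 64 × 64 input, and nine room classes are illustrative assumptions, not the authors' architecture.

```python
# Heavily simplified encoder-decoder in the spirit of U-Net [30]: maps an
# egocentric input to per-room top-down belief logits. Layer sizes and the
# nine room classes are illustrative assumptions, not the authors' model.
import torch
import torch.nn as nn


class TinyMapNet(nn.Module):
    def __init__(self, in_ch=3, num_rooms=9):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1),
                                  nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)  # decoder upsampling
        self.head = nn.Conv2d(64, num_rooms, 1)  # logits after skip concat

    def forward(self, x):
        e1 = self.enc1(x)                          # (B, 32, H, W)
        e2 = self.enc2(e1)                         # (B, 64, H/2, W/2)
        d = self.up(e2)                            # (B, 32, H, W)
        return self.head(torch.cat([e1, d], 1))   # (B, num_rooms, H, W)


logits = TinyMapNet()(torch.randn(1, 3, 64, 64))  # -> torch.Size([1, 9, 64, 64])
```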
Summary
  • Introduction:

    Humans have an uncanny ability to seamlessly navigate in unseen environments by quickly understanding their surroundings.
  • A possible, but tedious solution is to head in a random direction and exhaustively search the space until you end up in the kitchen.
  • Another option, and most probably the one you’d pick, is to walk towards the dining room as you are more likely to find the kitchen near the dining room rather than the bedroom.
  • The agent models correlations between the appearance and architectural layout of houses to efficiently navigate in unseen scenes.
  • Objectives:

    The goal of this work is to elicit a similar behaviour in embodied agents by enabling them to predict regions which lie beyond their field of view through learned scene priors.
  • Results:

    [Fig. 5: SPL for Room Navigation. Legend: starting location, agent path, shortest path, closest point in room; target: bedroom.]

    The authors design baselines to evaluate the effectiveness of each component of the proposed room navigation framework and to validate the approach.
  • Table 1 shows the RoomNav-SPL and Success scores on the room navigation task; the SPL is computed w.r.t. the ground-truth target point.
  • The authors' room navigation framework described in Sec. 4 achieves an SPL of 0.31 on validation and 0.29 on the test set.
  • Fine-tuning the point navigation policy on points predicted by the point prediction network improves the SPL to 0.35 on validation and 0.33 on test, making this the best performing model.
  • The authors compare to an approach that does not use semantic maps to model correlations and does not use point navigation.
  • Conclusion:

    The authors proposed a novel learning-based approach for Room Navigation which models architectural and stylistic regularities in houses.
  • The authors' approach consists of predicting top-down belief maps containing room semantics beyond the field of view of the agent, finding a point in the specified target room, and navigating to that point using a point navigation policy.
  • The authors' model’s improved performance (SPL) compared to the baselines confirms that learning to generate amodal semantic belief maps of room layouts improves room navigation performance in unseen environments.
Tables
  • Table 1: RoomNav-SPL for our approach, baselines, and oracle methods on the test and validation sets of the Room Navigation Dataset. Our proposed model (Map Generation + Point Prediction + PointNav + Fine-tune) achieves 0.35 RoomNav-SPL and outperforms all other baselines.
  • Table 2: Ablation study of our mapping model and point prediction model.
Related work
  • Navigation in mobile robotics. Conventional solutions to the navigation problem in robotics consist of two main steps: (1) mapping the environment while simultaneously localizing the agent in the generated map, and (2) planning a path towards the target using the generated map. Geometric solutions to the mapping problem include (i) structure from motion and (ii) simultaneous localization and mapping (SLAM) [4, 7, 13, 16, 18, 36]. Various SLAM algorithms have been developed for the different sensory inputs available to the agent. Using the generated map, a path to the target location can be computed via several path planning algorithms [25]. These approaches fall under the passive SLAM category, where a human navigates around the environment beforehand to generate the maps. Active SLAM research, on the other hand, focuses on dynamically controlling the camera to build spatial representations of the environment. Some works formulate active SLAM as a Partially Observable Markov Decision Process (POMDP) and use either Bayesian optimization [27] or reinforcement learning [23] to plan trajectories that lead to accurate maps. [8] and [35] use Rao-Blackwellized particle filters to choose the set of actions that maximize the information gain and minimize the uncertainty of the predicted maps.
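As a concrete illustration of the planning step mentioned above, breadth-first search over a 2D occupancy grid is among the simplest members of the planner family surveyed in [25]; the sketch below is illustrative and not a specific algorithm from the paper.

```python
# Minimal breadth-first-search planner on a 2D occupancy grid (illustrative;
# not a specific algorithm from [25]). 0 = free cell, 1 = obstacle.
from collections import deque


def bfs_path(grid, start, goal):
    """Return the shortest 4-connected path from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:  # reconstruct path by walking parents backwards
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)
    return None  # goal unreachable


print(bfs_path([[0, 0, 1],
                [1, 0, 0],
                [0, 0, 0]], (0, 0), (2, 2)))
```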
Funding
  • The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, Amazon
  • Darrell's group was supported in part by DoD, NSF, BAIR, and BDD
Study subjects and analysis
workers: 64
We use Adam [22] with a learning rate of 2.5 × 10⁻⁴. We use DD-PPO [41] to train 64 workers on 64 GPUs.
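As a rough sketch of the optimizer setup quoted above (PyTorch assumed; the stand-in model below and the omitted distributed 64-worker DD-PPO [41] loop are not the authors' code):

```python
# Rough sketch of the optimizer configuration: Adam [22] with lr 2.5e-4.
# The tiny policy network is a stand-in; the DD-PPO [41] training loop that
# would drive 64 distributed workers is not reproduced here.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=2.5e-4)
```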

Reference
  • [1] Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., Savva, M., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018)
  • [2] Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sunderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: CVPR (2018)
  • [3] Aydemir, A., Gobelbecker, M., Pronobis, A., Sjoo, K., Jensfelt, P.: Plan-based object search and exploration using semantic spatial knowledge in the real world. In: ECMR (2011)
  • [4] Bailey, T., Durrant-Whyte, H.: Simultaneous localization and mapping (SLAM): Part II. IEEE Robotics & Automation Magazine (2006)
  • [5] Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. In: Advances in Neural Information Processing Systems (NeurIPS) (2015)
  • [6] Bowman, S.L., Atanasov, N., Daniilidis, K., Pappas, G.J.: Probabilistic data association for semantic SLAM. In: International Conference on Robotics and Automation (ICRA) (2017)
  • [7] Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., Leonard, J.J.: Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Transactions on Robotics (2016)
  • [8] Carlone, L., Du, J., Ng, M.K., Bona, B., Indri, M.: Active SLAM and exploration with particle filters using Kullback-Leibler divergence. Journal of Intelligent & Robotic Systems (2014)
  • [9] Chang, A., Dai, A., Funkhouser, T., Halber, M., Nießner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158 (2017). Matterport3D dataset available at https://niessner.github.io/Matterport/
  • [10] Chen, T., Gupta, S., Gupta, A.: Learning exploration policies for navigation. arXiv preprint arXiv:1903.01959 (2019)
  • [11] Crespo, J., Castillo, J.C., Mozos, O.M., Barber, R.: Semantic information for robot navigation: A survey. Applied Sciences (2020)
  • [12] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
  • [13] Durrant-Whyte, H., Bailey, T.: Simultaneous localization and mapping: Part I. IEEE Robotics & Automation Magazine (2006)
  • [14] Fang, K., Toshev, A., Fei-Fei, L., Savarese, S.: Scene memory transformer for embodied agents in long-horizon tasks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [15] Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., Darrell, T.: Speaker-follower models for vision-and-language navigation. In: Advances in Neural Information Processing Systems (NeurIPS) (2018)
  • [16] Fuentes-Pacheco, J., Ruiz-Ascencio, J., Rendon-Mancha, J.M.: Visual simultaneous localization and mapping: A survey. Artificial Intelligence Review (2015)
  • [17] Gupta, S., Davidson, J., Levine, S., Sukthankar, R., Malik, J.: Cognitive mapping and planning for visual navigation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [18] Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge University Press (2003)
  • [19] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [20] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation (1997)
  • [21] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
  • [22] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [23] Kollar, T., Roy, N.: Trajectory optimization using reinforcement learning for map exploration. The International Journal of Robotics Research (2008)
  • [24] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV) (2017)
  • [25] LaValle, S.M.: Planning algorithms. Cambridge University Press (2006)
  • [26] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
  • [27] Martinez-Cantin, R., de Freitas, N., Brochu, E., Castellanos, J., Doucet, A.: A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots (2009)
  • [28] Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A.J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al.: Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673 (2016)
  • [29] Pronobis, A., Jensfelt, P.: Large-scale semantic mapping and reasoning with heterogeneous modalities. In: International Conference on Robotics and Automation (ICRA) (2012)
  • [30] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2015)
  • [31] Savinov, N., Dosovitskiy, A., Koltun, V.: Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653 (2018)
  • [32] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., et al.: Habitat: A platform for embodied AI research. arXiv preprint arXiv:1904.01201 (2019)
  • [33] Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015)
  • [34] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
  • [35] Stachniss, C., Grisetti, G., Burgard, W.: Information gain-based exploration using Rao-Blackwellized particle filters. In: Robotics: Science and Systems (2005)
  • [36] Thrun, S., Burgard, W., Fox, D.: Probabilistic robotics. MIT Press (2005)
  • [37] Walter, M.R., Hemachandra, S., Homberg, B., Tellex, S., Teller, S.: Learning semantic maps from natural language descriptions. In: Robotics: Science and Systems (2013)
  • [38] Wang, X., Huang, Q., Celikyilmaz, A., Gao, J., Shen, D., Wang, Y.F., Wang, W.Y., Zhang, L.: Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [39] Wang, X., Xiong, W., Wang, H., Wang, W.Y.: Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In: European Conference on Computer Vision (ECCV) (2018)
  • [40] Wang, Z., Zhang, Q., Li, J., Zhang, S., Liu, J.: A computationally efficient semantic SLAM solution for dynamic scenes. Remote Sensing (2019)
  • [41] Wijmans, E., Kadian, A., Morcos, A., Lee, S., Essa, I., Parikh, D., Savva, M., Batra, D.: DD-PPO: Learning near-perfect PointGoal navigators from 2.5 billion frames. In: International Conference on Learning Representations (ICLR) (2020)
  • [42] Wu, Y., Wu, Y., Gkioxari, G., Tian, Y.: Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209 (2018)
  • [43] Wu, Y., Wu, Y., Tamar, A., Russell, S., Gkioxari, G., Tian, Y.: Bayesian relational memory for semantic visual navigation. arXiv preprint arXiv:1909.04306 (2019)
  • [44] Wu, Y., He, K.: Group normalization. In: European Conference on Computer Vision (ECCV) (2018)
  • [45] Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [46] Yang, W., Wang, X., Farhadi, A., Gupta, A., Mottaghi, R.: Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543 (2018)
  • [47] Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A., Fei-Fei, L., Farhadi, A.: Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: International Conference on Robotics and Automation (ICRA) (2017)