Audio-Visual Floorplan Reconstruction

Sebastian Vicenc Amengual Gari
Philip Robinson
Other Links: arxiv.org

Abstract:

Given only a few glimpses of an environment, how much can we infer about its entire floorplan? Existing methods can map only what is visible or immediately apparent from context, and thus require substantial movements through a space to fully map it. We explore how both audio and visual sensing together can provide rapid floorplan reconstruction…

Introduction
  • Floorplans of complex 3D environments—such as homes, offices, shops, and churches—are a compact ground-plane representation of their overall layout, showing the different rooms and their connectivity (a sketch of such a grid representation follows this list).
  • Existing methods are limited to mapping the regions they directly observe.
  • They either require a dense walk-through for the camera to capture most of the space—wasteful, if not impossible, for a robotic agent trying to immediately perform tasks in a new environment—or else they fail to map rooms beyond those where the camera was placed.
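To make the floorplan representation above concrete, here is a minimal sketch, purely illustrative and not from the paper, of a floorplan encoded as two top-down grids: a binary interior mask and a per-cell room-label map. The cell size, grid extent, and label set are all assumptions.

```python
import numpy as np

# Assumed label set for illustration; the paper's room taxonomy may differ.
ROOM_LABELS = {0: "outside", 1: "hallway", 2: "kitchen", 3: "bedroom"}

# 20 x 20 grid of 0.5 m cells covering a 10 m x 10 m area (cell size assumed).
interior = np.zeros((20, 20), dtype=bool)    # True where a cell lies inside the building
rooms = np.zeros((20, 20), dtype=np.uint8)   # per-cell room label, 0 = outside

interior[2:18, 2:18] = True    # building footprint
rooms[2:18, 2:10] = 2          # kitchen occupies the left half
rooms[2:18, 10:18] = 3         # bedroom occupies the right half
rooms[8:12, 8:12] = 1          # a hallway patch connects them

# Room connectivity can then be recovered from adjacency of labeled regions.
```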
Results
  • Through extensive qualitative and quantitative results, the authors demonstrate that the proposed model can effectively leverage both audio and visual signals to reason about the extent of the interior of environments (Sec 4.1) and classify regions of the interior into the associated rooms (Sec 4.2).

    Baselines: To conduct a thorough analysis of the proposed model, the authors consider several baselines.
  • Acoustic Echoes [13]: This method assumes that all room shapes are convex polyhedra and estimates room shape by listening to audio echoes.
  • The approach requires knowing the ground-truth impulse response at each microphone location, which the authors' method does not have access to.
  • While its setup is therefore artificial, the authors use it as an upper bound on what an existing audio-only method could provide.
  • Ours (audio-only) and Ours (RGB-only): as ablations of the model, the authors train variants with one modality removed (a minimal sketch of such a two-branch design follows this list).
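As a rough illustration of the modality ablations just described, the following sketch shows a generic two-branch audio-visual encoder in PyTorch whose RGB or audio branch can be disabled. This is not the authors' AV-Map architecture; all layer shapes, module names, and the simple average fusion are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AudioVisualMapper(nn.Module):
    def __init__(self, use_rgb=True, use_audio=True, feat_dim=128, map_size=32):
        super().__init__()
        assert use_rgb or use_audio, "at least one modality is required"
        self.use_rgb, self.use_audio = use_rgb, use_audio
        # Toy encoders: RGB frames and audio spectrograms -> shared feature size.
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 16, 5, 2), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(16, feat_dim))
        self.audio_enc = nn.Sequential(nn.Conv2d(1, 16, 5, 2), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                       nn.Linear(16, feat_dim))
        # Decoder: fused features -> per-cell interior logits on a top-down grid.
        self.decoder = nn.Linear(feat_dim, map_size * map_size)
        self.map_size = map_size

    def forward(self, rgb, spec):
        feats = []
        if self.use_rgb:
            feats.append(self.rgb_enc(rgb))
        if self.use_audio:
            feats.append(self.audio_enc(spec))
        fused = torch.stack(feats).mean(0)   # simple average fusion
        logits = self.decoder(fused)
        return logits.view(-1, self.map_size, self.map_size)

# Ablation variants: full model, audio-only, RGB-only.
full = AudioVisualMapper(use_rgb=True, use_audio=True)
audio_only = AudioVisualMapper(use_rgb=False, use_audio=True)
rgb_only = AudioVisualMapper(use_rgb=True, use_audio=False)

rgb = torch.randn(2, 3, 64, 64)    # batch of RGB frames
spec = torch.randn(2, 1, 64, 64)   # batch of audio spectrograms
print(full(rgb, spec).shape)       # torch.Size([2, 32, 32])
```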
Conclusion
  • The authors proposed a new research direction: audio-visual floorplan reconstruction from short video sequences.
  • The authors developed a multi-modal model to estimate the floorplan around and far beyond the camera trajectory.
  • The authors' AV-Map model successfully infers the structure and semantics of areas that are not visible, outperforming the state-of-the-art in extrapolated visual maps.
  • In future work, the authors plan to consider extensions to multi-level floorplans and to connect the mapping idea to a robotic agent actively controlling the camera.
Summary
  • Objectives: In contrast to navigation, where an intelligent agent controls the camera and builds its map in service of reaching a target, the goal here is to transform a passive video sequence into a map.
  • The authors aim to estimate the 2D layout of an environment depicted in a short video.
Tables
  • Table 1: Interior reconstruction evaluation. The proposed AV-Map model (here with device-generated sounds) outperforms existing methods and the baselines. Methods producing only a binary map output cannot be scored by AP (marked NA).
  • Table 2: Interior reconstruction under different audio settings. The AV-Map model is applicable with either device-generated or environment-generated sounds.
  • Table 3: Impact of sequence modeling. The AV-Map model improves significantly when trained on sequences rather than making independent predictions at each step.
  • Table 4: Interior map average precision, analyzing various models trained to predict an interior area covering 164 m² at each time step (a sketch of a per-cell AP metric follows this list).
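The AP numbers in the tables suggest a per-cell evaluation: every grid cell is a binary target (interior vs. not), scored by average precision over the model's predicted probabilities. Below is a hedged sketch of such a metric using scikit-learn; the grid size and scoring code are assumptions, not the paper's evaluation script. It also illustrates why binary-only map outputs get NA: AP needs continuous scores to rank cells.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def interior_map_ap(pred_probs: np.ndarray, gt_interior: np.ndarray) -> float:
    """Average precision over all cells of a predicted top-down interior map."""
    return average_precision_score(gt_interior.ravel().astype(int), pred_probs.ravel())

# Toy example: a 32 x 32 map whose true interior is a centered square.
rng = np.random.default_rng(0)
gt = np.zeros((32, 32), dtype=bool)
gt[4:28, 4:28] = True
pred = np.clip(gt.astype(float) + 0.3 * rng.standard_normal(gt.shape), 0.0, 1.0)
print(f"interior AP: {interior_map_ap(pred, gt):.3f}")
```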
Related work
  • Floorplan and room layout reconstruction: The vision and graphics communities have explored various ways to use visual data, depth sensors, and laser scanners to build floorplans. Geometric approaches use 3D point-cloud inputs to construct building-wide floorplans [41, 30]. Given RGB-D scans, FloorNet [28] and Floor-SP [8] estimate 2D floorplans and rooms' semantic labels using a mix of deep learning and optimization. Given monocular RGB images [26] or 360° panoramas [45, 48, 42, 49], other methods estimate a 3D indoor Manhattan room layout. Using only a small portion of a 360° panorama, models can be trained to infer missing viewpoints [22] and/or semantic labels [39]. Unlike any of the above, our approach leverages both audio and visual sensing to infer a 2D floorplan map and its semantic room labels. As our results show, audio offers the advantage of sensing far beyond the field of view of visual sensors.
Key result
  • Our results on 85 large real-world environments show the impact: with just a few glimpses spanning 26% of an area, we can estimate the whole area with 66% accuracy, substantially better than the state-of-the-art approach for extrapolating visual maps (a toy accuracy computation follows below).
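To make the 66% figure concrete, here is a toy computation reading "accuracy" as the fraction of all floorplan cells predicted correctly, including cells the camera never observed. The grid and error pattern below are invented for illustration, not the paper's data.

```python
import numpy as np

def map_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of floorplan cells predicted correctly, seen or unseen."""
    return float((pred == gt).mean())

gt = np.zeros((10, 10), dtype=bool)
gt[1:9, 1:9] = True            # ground-truth interior: an 8 x 8 block
pred = gt.copy()
pred[7:9, 1:9] = False         # suppose the model misses a 2 x 8 strip
print(f"accuracy: {map_accuracy(pred, gt):.2f}")   # 0.84 on this toy map
```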
Reference
  • [1] T. Abhayapala and D. Ward. Theory and design of higher order sound field microphones using spherical microphone array. In ICASSP, 2002.
  • [2] Marcus A. Brubaker, Andreas Geiger, and Raquel Urtasun. Lost! Leveraging the crowd for probabilistic visual self-localization. In CVPR, 2013.
  • [3] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017. Matterport3D license available at http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf.
  • [4] Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural SLAM. arXiv preprint arXiv:2004.05155, 2020.
  • [5] Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. Audio-visual embodied navigation. arXiv preprint arXiv:1912.11474, 2019.
  • [6] Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. SoundSpaces: Audio-visual navigation in 3D environments. In ECCV, 2020.
  • [7] Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Learning to set waypoints for audio-visual navigation. arXiv preprint, 2020.
  • [8] Jiacheng Chen, Chen Liu, Jiaye Wu, and Yasutaka Furukawa. Floor-SP: Inverse CAD for floorplans by sequential room-wise shortest path. In ICCV, pages 2661–2670, 2019.
  • [9] Tao Chen, Saurabh Gupta, and Abhinav Gupta. Learning exploration policies for navigation. In ICLR, 2019.
  • [10] Jesper Haahr Christensen, Sascha Hornauer, and Stella X. Yu. BatVision: Learning to see 3D spatial layout with two ears. In ICRA, pages 1581–1587, 2020.
  • [11] Marco Crocco, Samuele Martelli, Andrea Trucco, Andrea Zunino, and Vittorio Murino. Audio tracking in noisy environments by acoustic map and spectral signature. IEEE Transactions on Cybernetics, 2018.
  • [12] Victoria Dean, Shubham Tulsiani, and Abhinav Gupta. See, hear, explore: Curiosity via audio-visual association. In NeurIPS, 2020.
  • [13] Ivan Dokmanic, Reza Parhizkar, Andreas Walther, Yue M. Lu, and Martin Vetterli. Acoustic echoes reveal room shape. Proceedings of the National Academy of Sciences, 110(30):12186–12191, 2013.
  • [14] Itamar Eliakim, Zahi Cohen, Gabor Kosa, and Yossi Yovel. A fully autonomous terrestrial bat-like acoustic robot. PLoS Computational Biology, 14(9):e1006406, 2018.
  • [15] Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, and Joshua B. Tenenbaum. Look, listen, and act: Towards audio-visual embodied navigation. In ICRA, pages 9701–9707, 2020.
  • [16] Chuang Gan, Hang Zhao, Peihao Chen, David Cox, and Antonio Torralba. Self-supervised moving vehicle tracking with stereo sound. In ICCV, 2019.
  • [17] R. Gao, C. Chen, Z. Al-Halah, C. Schissler, and K. Grauman. VisualEchoes: Spatial image representation learning through echolocation. In ECCV, 2020.
  • [18] Jort Gemmeke, Daniel Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017.
  • [19] Dimitrios Giannoulis, Emmanouil Benetos, Dan Stowell, Mathias Rossignol, Mathieu Lagrange, and Mark D. Plumbley. Detection and classification of acoustic scenes and events: An IEEE AASP challenge. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013.
  • [20] Saurabh Gupta, David Fouhey, Sergey Levine, and Jitendra Malik. Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125, 2017.
  • [21] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
  • [22] D. Jayaraman and K. Grauman. Learning to look around: Intelligently exploring unseen environments for unknown tasks. In CVPR, 2018.
  • [23] H. Jeong and Y. Lam. Source implementation to eliminate low-frequency artifacts in finite difference time domain room acoustic simulation. Journal of the Acoustical Society of America, 2012.
  • [24] Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexander Clegg, Erik Wijmans, Stefan Lee, Manolis Savva, Sonia Chernova, and Dhruv Batra. Are we making real progress in simulated environments? Measuring the sim2real gap in embodied visual navigation. arXiv preprint arXiv:1912.06321, 2019.
  • [25] Hansung Kim, Luca Remaggi, Philip J. B. Jackson, Filippo Maria Fazi, and Adrian Hilton. 3D room geometry reconstruction using audio-visual sensors. In 3DV, pages 621–629, 2017.
  • [26] Chen-Yu Lee, Vijay Badrinarayanan, Tomasz Malisiewicz, and Andrew Rabinovich. RoomNet: End-to-end room layout estimation. In ICCV, 2017.
  • [27] David B. Lindell, Gordon Wetzstein, and Vladlen Koltun. Acoustic non-line-of-sight imaging. In CVPR, pages 6780–6789, 2019.
  • [28] Chen Liu, Jiaye Wu, and Yasutaka Furukawa. FloorNet: A unified framework for floorplan reconstruction from 3D scans. In ECCV, pages 201–217, 2018.
  • [29] Medhini Narasimhan, Erik Wijmans, Xinlei Chen, Trevor Darrell, Dhruv Batra, Devi Parikh, and Amanpreet Singh. Seeing the un-scene: Learning amodal semantic maps for room navigation. arXiv preprint arXiv:2007.09841, 2020.
  • [30] Brian Okorn, Xuehan Xiong, Burcu Akinci, and Daniel Huber. Toward automated modeling of floor plans. In 3DPVT, 2009.
  • [31] Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. Visually indicated sounds. In CVPR, pages 2405–2413, 2016.
  • [32] B. Rafaely. Analysis and design of spherical microphone arrays. IEEE Transactions on Speech and Audio Processing, 2005.
  • [33] Santhosh K. Ramakrishnan, Ziad Al-Halah, and Kristen Grauman. Occupancy anticipation for efficient exploration and navigation. arXiv preprint arXiv:2008.09285, 2020.
  • [34] J. Santos, D. Portugal, and R. Rocha. An evaluation of 2D SLAM techniques available in Robot Operating System. In IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), 2013.
  • [35] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.
  • [36] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied AI research. In ICCV, pages 9339–9347, 2019.
  • [37] S. Se, D. Lowe, and J. Little. Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks. IJRR, 2002.
  • [38] Jascha Sohl-Dickstein, Santani Teng, Benjamin M. Gaub, Chris C. Rodgers, Crystal Li, Michael R. DeWeese, and Nicol S. Harper. A device for human ultrasonic echolocation. IEEE Transactions on Biomedical Engineering, 62(6):1526–1534, 2015.
  • [39] Shuran Song, Andy Zeng, Angel X. Chang, Manolis Savva, Silvio Savarese, and Thomas Funkhouser. Im2Pano3D: Extrapolating 360° structure and semantics beyond the field of view. In CVPR, 2018.
  • [40] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
  • [41] Wei Sui, Lingfeng Wang, Bin Fan, Hongfei Xiao, Huaiyu Wu, and Chunhong Pan. Layer-wise floorplan extraction for automatic urban building reconstruction. IEEE Transactions on Visualization and Computer Graphics, 2016.
  • [42] Cheng Sun, Chi-Wei Hsiao, Min Sun, and Hwann-Tzong Chen. HorizonNet: Learning room layout with 1D representation and pano stretch data augmentation. In CVPR, 2019.
  • [43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
  • [44] Antonio Pico Villalpando, Guido Schillaci, Verena V. Hafner, and Bruno Lara Guzmán. Ego-noise predictions for echolocation in wheeled robots. In Artificial Life Conference Proceedings, pages 567–573. MIT Press, 2019.
  • [45] S. Yang, F. Wang, C. Peng, P. Wonka, M. Sun, and H. Chu. DuLa-Net: A dual-projection network for estimating room layouts from a single RGB panorama. In CVPR, 2019.
  • [46] Mao Ye, Yu Zhang, Ruigang Yang, and Dinesh Manocha. 3D reconstruction in the presence of glasses by acoustic and stereo fusion. In CVPR, pages 4885–4893, 2015.
  • [47] Zhoutong Zhang, Jiajun Wu, Qiujia Li, Zhengjia Huang, James Traer, Josh H. McDermott, Joshua B. Tenenbaum, and William T. Freeman. Generative modeling of audible shapes for object perception. In ICCV, pages 1251–1260, 2017.
  • [48] Chuhang Zou, Alex Colburn, Qi Shan, and Derek Hoiem. LayoutNet: Reconstructing the 3D room layout from a single RGB image. In CVPR, 2018.
  • [49] Chuhang Zou, Jheng-Wei Su, Chi-Han Peng, Alex Colburn, Qi Shan, Peter Wonka, Hung-Kuo Chu, and Derek Hoiem. 3D Manhattan room layout reconstruction from a single 360° image. arXiv preprint arXiv:1910.04099, 2019.