VisualEchoes: Spatial Image Representation Learning through Echolocation
European Conference on Computer Vision (ECCV), pp. 658-676, 2020.
Abstract:
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. [...]
Introduction
- The perceptual and cognitive abilities of embodied agents (a person, robot, or simulated robot) are inextricably tied to their physical being.
- The authors note that humans perceive spatial sound: we can identify a sound-emitting object and determine its location from the time difference between when the sound reaches each ear (interaural time difference, ITD) and the difference in sound level at each ear (interaural level difference, ILD); a minimal signal-processing sketch of these two cues follows this list.
- Some animals capitalize on these cues by using echolocation: actively emitting sounds to perceive the 3D spatial layout of their surroundings [62].
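To make the two binaural cues concrete, here is a minimal signal-processing sketch (not code from the paper) that estimates ITD from the cross-correlation peak between the ear signals and ILD from their level ratio; the sampling rate and toy signal are assumptions.

```python
import numpy as np

def binaural_cues(left, right, sr=44100):
    """Estimate ITD (seconds) and ILD (dB) from a two-channel recording."""
    # ITD: lag of the cross-correlation peak between the two ear signals.
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)     # in samples; negative = left ear leads
    itd = lag / sr

    # ILD: level difference between the ears, in decibels.
    rms_l = np.sqrt(np.mean(left ** 2) + 1e-12)
    rms_r = np.sqrt(np.mean(right ** 2) + 1e-12)
    ild = 20 * np.log10(rms_l / rms_r)
    return itd, ild

# Toy example: the same pulse reaches the right ear ~0.5 ms later and ~6 dB quieter.
sr = 44100
t = np.arange(sr // 10) / sr
pulse = np.sin(2 * np.pi * 1000 * t) * np.exp(-200 * t)
left = np.pad(pulse, (0, 22))
right = 0.5 * np.pad(pulse, (22, 0))
print(binaural_cues(left, right, sr))            # ~(-0.0005 s, +6 dB)
```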
Highlights
- The perceptual and cognitive abilities of embodied agents are inextricably tied to their physical being
- Our contributions are threefold: 1) We explore the spatial cues contained in echoes, analyzing how they inform depth prediction; 2) We propose VisualEchoes, a novel interaction-based feature learning framework that uses echoes to learn an image representation and does not require audio at test time; 3) We successfully validate the learned spatial representation on the fundamental downstream vision tasks of monocular depth prediction, surface normal estimation, and visual navigation, with results comparable to, or even outperforming, heavily supervised pre-training baselines.
- We evaluate on a held-out set of three Replica environments with standard metrics: root mean squared error (RMS), mean relative error (REL), mean log10 error, and thresholded accuracy [36,16] (see the metric sketch after these highlights).
- We presented an approach to learn spatial image representations via echolocation.
- We performed an in-depth study of the spatial cues contained in echoes and how they can inform single-view depth estimation.
- We showed that the learned spatial features can benefit three downstream vision tasks.
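For reference, the depth metrics named above follow standard definitions [16,36]; the sketch below is one plausible NumPy implementation of them, not the authors' evaluation code.

```python
import numpy as np

def depth_metrics(pred, gt, thresholds=(1.25, 1.25**2, 1.25**3)):
    """Standard single-image depth metrics: RMS, REL, log10, and delta accuracy."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    valid = gt > 0                                     # ignore missing ground-truth pixels
    pred, gt = np.clip(pred[valid], 1e-3, None), gt[valid]

    rms = np.sqrt(np.mean((pred - gt) ** 2))                   # root mean squared error
    rel = np.mean(np.abs(pred - gt) / gt)                      # mean relative error
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))     # mean log10 error
    ratio = np.maximum(pred / gt, gt / pred)
    acc = {f"delta<{t:.4g}": float(np.mean(ratio < t)) for t in thresholds}
    return {"RMS": float(rms), "REL": float(rel), "log10": float(log10), **acc}
```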
Methods
- The authors present experiments to validate VisualEchoes for three tasks and three datasets. The goal is to examine the impact of the features compared to either learning features for that task from scratch or learning features with manual semantic supervision.
Datasets: 1) Replica [67]: contains 18 3D scenes with 1,740 navigable locations × 4 orientations = 6,960 agent states in total (an illustrative enumeration follows this list). 2) NYU-V2 [66]: the authors use the standard split of 464 scenes, with 249 scenes for training and 215 for testing, following [36].
- The authors use the dataset split as formulated in [33].
- 3) DIODE [72]: the first public dataset that includes RGB-D images of both indoor and outdoor scenes; the authors use only the indoor scenes with the official train/val split.
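Purely as an illustration of how the Replica agent-state count decomposes, the snippet below pairs every navigable location with four headings; the 90-degree orientation convention is an assumption.

```python
from itertools import product

# Illustrative only: Replica agent states = navigable locations x headings.
num_locations = 1740
orientations = (0, 90, 180, 270)                  # assumed 90-degree heading convention

agent_states = list(product(range(num_locations), orientations))
assert len(agent_states) == 6960                  # matches 1,740 x 4 stated in the text
```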
Results
- The authors show some qualitative results for the downstream tasks described in the last section.
- Fig. 6 and Fig. 7 show example results on monocular depth prediction and surface normal estimation, respectively.
- Using the pre-trained VisualEchoes network as initialization leads to much more accurate depth prediction and surface normal estimation than no pre-training, demonstrating the usefulness of the learned spatial features (a fine-tuning sketch follows the table excerpt below).
- Table excerpt, (a) depth prediction results on NYU-V2: ImageNet Pre-trained [36]: 0.555, 0.126; MIT Indoor Scene Pre-trained: 0.711, 0.180.
- Surface normal estimation is reported with Mean Dist. ↓ and Median Dist. ↓.
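To make the pre-training-as-initialization comparison concrete, the sketch below shows one common way such a transfer could be set up in PyTorch: load encoder weights learned from echoes into the RGB encoder of a depth network, then fine-tune. SpatialEncoder, DepthNet, and visualechoes_encoder.pth are hypothetical names, not the authors' released code.

```python
import os
import torch
import torch.nn as nn

class SpatialEncoder(nn.Module):
    """Hypothetical RGB encoder whose weights come from echo-based pre-training."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.features(x)

class DepthNet(nn.Module):
    """RGB-to-depth network that reuses the (optionally pre-trained) encoder."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Conv2d(128, 1, kernel_size=3, padding=1)  # coarse depth map

    def forward(self, x):
        return self.head(self.encoder(x))

encoder = SpatialEncoder()
ckpt = "visualechoes_encoder.pth"        # hypothetical checkpoint from echo pre-training
if os.path.exists(ckpt):                 # the "from scratch" baseline simply skips this load
    encoder.load_state_dict(torch.load(ckpt))
model = DepthNet(encoder)                # then fine-tune end to end on the target dataset
```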
Conclusion
- The authors presented an approach to learn spatial image representations via echolocation.
- The authors showed that the learned spatial features can benefit three downstream vision tasks.
- The authors' results show that the learned spatial features already benefit transfer to vision-only tasks in real photos outside of the scanned environments, indicating the realism of what the system learned.
- The authors are interested in pursuing these ideas within a sequential model, such that the agent could actively decide when to emit chirps and what type of chirps to emit to get the most informative echo responses (a generic chirp-and-echo simulation sketch follows).
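Since the conclusion refers to emitting chirps and reading off the echoes, the sketch below illustrates the general room-acoustics picture: the received echo is the emitted frequency sweep convolved with the room impulse response at each ear. The sweep range and the toy impulse response are assumptions, not the authors' simulation pipeline.

```python
import numpy as np
from scipy.signal import chirp, fftconvolve

sr = 44100
t = np.linspace(0, 0.003, int(sr * 0.003), endpoint=False)
sweep = chirp(t, f0=20, f1=20000, t1=t[-1], method="linear")  # short frequency sweep (assumed)

# Toy binaural room impulse response: a direct path plus a few decaying reflections.
rir = np.zeros((2, sr // 2))
for delay, gain in [(0, 1.0), (2000, 0.4), (5500, 0.25), (9000, 0.1)]:
    rir[0, delay] += gain
    rir[1, delay + 30] += gain * 0.8     # small interaural delay and level offset

# The echo at each ear is the emitted sweep convolved with that ear's impulse response.
echo = np.stack([fftconvolve(sweep, rir[ch]) for ch in range(2)])
print(echo.shape)                        # (2, n_samples): a binaural echo "recording"
```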
Objectives
- Whereas prior self-supervised methods aim to learn features generically useful for recognition, the objective here is to learn features generically useful for spatial estimation tasks.
- The authors' goals are to show that echoes convey spatial information, to learn visual representations by echolocation, and to leverage the learned representations for downstream visual spatial tasks.
Tables
- Table 1: Case study depth prediction results. ↓ lower is better, ↑ higher is better.
- Table 2: Depth prediction on the Replica, NYU-V2, and DIODE datasets. We use the RGB2Depth network from Sec. 3.2 for all models. Our VisualEchoes pre-training transfers well, consistently predicting depth better than the model trained from scratch. Furthermore, it is even competitive with the supervised models, whether they are pre-trained on ImageNet (1M manually labeled images) or MIT Indoor Scenes (16K manually labeled images). ↓ lower is better, ↑ higher is better. (Un)sup = (un)supervised. We boldface the best unsupervised method.
- Table 3: Results for the three downstream tasks. ↓ lower is better, ↑ higher is better.
Related work
- Auditory Scene Analysis using Echoes. Previous work shows that, using echo responses alone, one can predict the 2D [5] or 3D [14] room geometry. Additionally, echoes can complement vision, especially when vision-based depth estimates are unreliable, e.g., on transparent windows or featureless walls [42,80]. In dynamic environments, autonomous robots can leverage echoes for obstacle avoidance [71] and for mapping and navigation [17] using a bat-like echolocation model. Concurrently with our work, predicting depth maps purely from echo responses is also explored in [12] with a low-cost audio system called BatVision. Our work explores a novel direction for auditory scene analysis by employing echoes for spatial visual feature learning, and the learned features are applicable in the absence of any audio.
- Self-Supervised Image Representation Learning. Self-supervised image feature learning methods leverage structured information within the data itself to generate labels for representation learning [63,33]. To this end, many “pretext” tasks have been explored—for example, predicting the rotation applied to an input image [31,1], discriminating image instances [19], colorizing images [45,82], solving a jigsaw puzzle from image patches [51], or multi-task learning using synthetic imagery [60]. Temporal information in videos also permits self-supervised tasks, for example, by predicting whether a frame sequence is in the correct order [49,20] or ensuring visual coherence of tracked objects [76,28,38]. Whereas these methods aim to learn features generically useful for recognition, our objective is to learn features generically useful for spatial estimation tasks. Accordingly, our echolocation objective is well-aligned with our target family of spatial tasks (depth, surfaces, navigation), consistent with findings that task similarity is important for positive transfer [81]. Furthermore, unlike any of the above, rather than learn from massive repositories of human-taken photos, the proposed approach learns from interactions with the scene via echolocation.
References
- 1. Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: ICCV (2015)
- 2. Agrawal, P., Nair, A.V., Abbeel, P., Malik, J., Levine, S.: Learning to poke by poking: Experiential learning of intuitive physics. In: NeurIPS (2016)
- 3. Alameda-Pineda, X., Staiano, J., Subramanian, R., Batrinca, L., Ricci, E., Lepri, B., Lanz, O., Sebe, N.: Salsa: A novel dataset for multimodal group behavior analysis. TPAMI (2015)
- 4. Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., Savva, M., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018)
- 5. Antonacci, F., Filos, J., Thomas, M.R., Habets, E.A., Sarti, A., Naylor, P.A., Tubaro, S.: Inference of room geometry from acoustic impulse responses. IEEE Transactions on Audio, Speech, and Language Processing (2012)
- 6. Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: ICCV (2017)
- 7. Arandjelovic, R., Zisserman, A.: Objects that sound. In: ECCV (2018)
- 8. Aytar, Y., Vondrick, C., Torralba, A.: Soundnet: Learning sound representations from unlabeled video. In: NeurIPS (2016)
- 9. Ban, Y., Li, X., Alameda-Pineda, X., Girin, L., Horaud, R.: Accounting for room acoustics in audio-visual multi-speaker tracking. In: ICASSP (2018)
- 10. Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3d: Learning from rgb-d data in indoor environments. 3DV (2017)
- 11. Chen, C., Jain, U., Schissler, C., Gari, S.V.A., Al-Halah, Z., Ithapu, V.K., Robinson, P., Grauman, K.: Audio-Visual Embodied Navigation. arXiv preprint arXiv:1912.11474 (2019)
- 12. Christensen, J., Hornauer, S., Yu, S.: Batvision - learning to see 3d spatial layout with two ears. In: ICRA (2020)
- 13. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
- 14. Dokmanic, I., Parhizkar, R., Walther, A., Lu, Y.M., Vetterli, M.: Acoustic echoes reveal room shape. Proceedings of the National Academy of Sciences (2013)
- 15. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV (2015)
- 16. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NeurIPS (2014)
- 17. Eliakim, I., Cohen, Z., Kosa, G., Yovel, Y.: A fully autonomous terrestrial bat-like acoustic robot. PLoS computational biology (2018)
- 18. Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W.T., Rubinstein, M.: Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. In: SIGGRAPH (2018)
- 19. Feng, Z., Xu, C., Tao, D.: Self-supervised representation learning by rotation feature decoupling. In: CVPR (2019)
- 20. Fernando, B., Bilen, H., Gavves, E., Gould, S.: Self-supervised video representation learning with odd-one-out networks. In: CVPR (2017)
- 21. Fouhey, D.F., Gupta, A., Hebert, M.: Data-driven 3d primitives for single image understanding. In: ICCV (2013)
- 22. Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: CVPR (2018)
- 23. Gan, C., Zhao, H., Chen, P., Cox, D., Torralba, A.: Self-supervised moving vehicle tracking with stereo sound. In: ICCV (2019)
- 24. Gandhi, D., Pinto, L., Gupta, A.: Learning to fly by crashing. In: IROS (2017)
- 25. Gao, R., Feris, R., Grauman, K.: Learning to separate object sounds by watching unlabeled video. In: ECCV (2018)
- 26. Gao, R., Grauman, K.: 2.5d visual sound. In: CVPR (2019)
- 27. Gao, R., Grauman, K.: Co-separating sounds of visual objects. In: ICCV (2019)
- 28. Gao, R., Jayaraman, D., Grauman, K.: Object-centric representation learning from unlabeled videos. In: ACCV (2016)
- 29. Garg, R., BG, V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In: ECCV (2016)
- 30. Gebru, I.D., Ba, S., Evangelidis, G., Horaud, R.: Tracking the active speaker based on a joint audio-visual observation model. In: ICCV Workshops (2015)
- 31. Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: ICLR (2018)
- 32. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR (2017)
- 33. Goyal, P., Mahajan, D., Gupta, A., Misra, I.: Scaling and benchmarking self-supervised visual representation learning. In: ICCV (2019)
- 34. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
- 35. Hershey, J.R., Movellan, J.R.: Audio vision: Using audio-visual synchrony to locate sounds. In: NeurIPS (2000)
- 36. Hu, J., Ozay, M., Zhang, Y., Okatani, T.: Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In: WACV (2019)
- 37. Irie, G., Ostrek, M., Wang, H., Kameoka, H., Kimura, A., Kawanishi, T., Kashino, K.: Seeing through sounds: Predicting visual semantic segmentation results from multichannel audio signals. In: ICASSP (2019)
- 38. Jayaraman, D., Grauman, K.: Slow and steady feature analysis: Higher order temporal coherence in video. In: CVPR (2016)
- 39. Jayaraman, D., Grauman, K.: Learning image representations equivariant to egomotion. In: ICCV (2015)
- 40. Jiang, H., Larsson, G., Maire, M., Shakhnarovich, G., Learned-Miller, E.: Self-supervised relative depth learning for urban scene understanding. In: ECCV (2018)
- 41. Karsch, K., Liu, C., Kang, S.B.: Depth transfer: Depth extraction from video using non-parametric sampling. TPAMI (2014)
- 42. Kim, H., Remaggi, L., Jackson, P.J., Fazi, F.M., Hilton, A.: 3d room geometry reconstruction using audio-visual sensors. In: 3DV (2017)
- 43. Korbar, B., Tran, D., Torresani, L.: Co-training of audio and video representations from self-supervised temporal synchronization. In: NeurIPS (2018)
- 44. Kuttruff, H.: Room Acoustics. CRC Press (2017)
- 45. Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: CVPR (2017)
- 46. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sanchez, C.I.: A survey on deep learning in medical image analysis. Medical image analysis (2017)
- 47. Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. In: CVPR (2015)
- 48. McGuire, K., De Wagter, C., Tuyls, K., Kappen, H., de Croon, G.: Minimal navigation solution for a swarm of tiny flying robots to explore an unknown environment. Science Robotics (2019)
- 49. Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: ECCV (2016)
- 50. Morgado, P., Vasconcelos, N., Langlois, T., Wang, O.: Self-supervised generation of spatial audio for 360° video. In: NeurIPS (2018)
- 51. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)
- 52. Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisensory features. In: ECCV (2018)
- 53. Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E.H., Freeman, W.T.: Visually indicated sounds. In: CVPR (2016)
- 54. Owens, A., Wu, J., McDermott, J.H., Freeman, W.T., Torralba, A.: Ambient sound provides supervision for visual learning. In: ECCV (2016)
- 55. Palossi, D., Loquercio, A., Conti, F., Flamand, E., Scaramuzza, D., Benini, L.: A 64-mw dnn-based visual navigation engine for autonomous nano-drones. IEEE Internet of Things Journal (2019)
- 56. Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In: ICRA (2016)
- 57. Purushwalkam, S., Gupta, A., Kaufman, D.M., Russell, B.: Bounce and learn: Modeling scene dynamics with real-world bounces. In: ICLR (2019)
- 58. Quattoni, A., Torralba, A.: Recognizing indoor scenes. In: CVPR (2009)
- 59. Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., Black, M.J.: Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In: CVPR (2019)
- 60. Ren, Z., Jae Lee, Y.: Cross-domain self-supervised multi-task feature learning using synthetic imagery. In: CVPR (2018)
- 61. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
- 62. Rosenblum, L.D., Gordon, M.S., Jarquin, L.: Echolocating distance by moving and stationary listeners. Ecological Psychology (2000)
- 63. de Sa, V.R.: Learning classification with unlabeled data. In: NeurIPS (1994)
- 64. Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., Batra, D.: Habitat: A Platform for Embodied AI Research. In: ICCV (2019)
- 65. Senocak, A., Oh, T.H., Kim, J., Yang, M.H., So Kweon, I.: Learning to localize sound source in visual scenes. In: CVPR (2018)
- 66. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV (2012)
- 67. Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., et al.: The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797 (2019)
- 68. Stoffregen, T.A., Pittenger, J.B.: Human echolocation as a basic form of perception and action. Ecological Psychology (1995)
- 69. Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C.: Audio-visual event localization in unconstrained videos. In: ECCV (2018)
- 70. Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.: Demon: Depth and motion network for learning monocular stereo. In: CVPR (2017)
- 71. Vanderelst, D., Holderied, M.W., Peremans, H.: Sensorimotor model of obstacle avoidance in echolocating bats. PLoS computational biology (2015)
- 72. Vasiljevic, I., Kolkin, N., Zhang, S., Luo, R., Wang, H., Dai, F.Z., Daniele, A.F., Mostajabi, M., Basart, S., Walter, M.R., Shakhnarovich, G.: DIODE: A Dense Indoor and Outdoor DEpth Dataset. arXiv preprint arXiv:1908.00463 (2019)
- 73. Veach, E., Guibas, L.: Bidirectional estimators for light transport. In: Photorealistic Rendering Techniques (1995)
- 74. Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., Fragkiadaki, K.: Sfmnet: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804 (2017)
- 75. Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A.L.: Towards unified depth and semantic prediction from a single image. In: CVPR (2015)
- 76. Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: ICCV (2015)
- 77. Xia, F., Zamir, A.R., He, Z., Sax, A., Malik, J., Savarese, S.: Gibson env: Real-world perception for embodied agents. In: CVPR (2018)
- 78. Xu, D., Ricci, E., Ouyang, W., Wang, X., Sebe, N.: Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In: CVPR (2017)
- 79. Yang, Z., Wang, P., Xu, W., Zhao, L., Nevatia, R.: Unsupervised learning of geometry with edge-aware depth-normal consistency. In: AAAI (2018)
- 80. Ye, M., Zhang, Y., Yang, R., Manocha, D.: 3d reconstruction in the presence of glasses by acoustic and stereo fusion. In: ICCV (2015)
- 81. Zamir, A.R., Sax, A., Shen, W., Guibas, L.J., Malik, J., Savarese, S.: Taskonomy: Disentangling task transfer learning. In: CVPR (2018)
- 82. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: ECCV (2016)
- 83. Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., Torralba, A.: The sound of pixels. In: ECCV (2018)
- 84. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
- 85. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR (2017)
- 86. Zhou, Y., Wang, Z., Fang, C., Bui, T., Berg, T.L.: Visual to sound: Generating natural sound for videos in the wild. In: CVPR (2018)