VisualEchoes: Spatial Image Representation Learning through Echolocation

European Conference on Computer Vision (ECCV), pp. 658-676, 2020.


Abstract:

Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. [...]

Introduction
  • The perceptual and cognitive abilities of embodied agents are inextricably tied to their physical being.
  • Humans perceive spatial sound.
  • Listeners can identify a sound-emitting object and determine its location from the time difference between when the sound reaches each ear (Interaural Time Difference, ITD) and the difference in sound level as it enters each ear (Interaural Level Difference, ILD); a small illustrative sketch follows this list.
  • Some animals capitalize on these cues by using echolocation—actively emitting sounds to perceive the 3D spatial layout of their surroundings [62]
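To make the ITD/ILD cues above concrete, here is a minimal, illustrative sketch (not from the paper): ITD is estimated from the lag of the cross-correlation between the two ear signals, and ILD as their level ratio in decibels. The signal names, toy source, and sample rate are assumptions made for this example.

```python
import numpy as np

def itd_ild(left, right, sample_rate):
    """Estimate interaural time and level differences for a binaural clip.

    left, right : 1-D arrays holding the two ear (microphone) signals.
    Returns (itd_seconds, ild_db). Illustrative only; practical systems
    typically operate per frequency band with sub-sample interpolation.
    """
    # ITD: lag (in samples) of the peak of the cross-correlation.
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)
    itd_seconds = lag / sample_rate

    # ILD: level difference in dB from the RMS energy of each channel.
    rms_l = np.sqrt(np.mean(left ** 2) + 1e-12)
    rms_r = np.sqrt(np.mean(right ** 2) + 1e-12)
    ild_db = 20.0 * np.log10(rms_l / rms_r)
    return itd_seconds, ild_db

# Toy usage: a source closer to the left ear arrives there earlier and louder.
sr = 44100
t = np.arange(sr // 10) / sr
src = np.sin(2 * np.pi * 500 * t)
left = src.copy()
right = 0.7 * np.roll(src, 20)   # ~0.45 ms later and ~3 dB quieter on the right
print(itd_ild(left, right, sr))
```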
Highlights
  • The perceptual and cognitive abilities of embodied agents are inextricably tied to their physical being
  • Our contributions are threefold: 1) We explore the spatial cues contained in echoes, analyzing how they inform depth prediction; 2) We propose VisualEchoes, a novel interaction-based feature learning framework that uses echoes to learn an image representation and does not require audio at test time; 3) We successfully validate the learned spatial representation on the fundamental downstream vision tasks of monocular depth prediction, surface normal estimation, and visual navigation, with results comparable to or even outperforming heavily supervised pre-training baselines
  • We evaluate on a held-out set of three Replica environments with standard metrics: root mean squared error (RMS), mean relative error (REL), mean log10 error, and thresholded accuracy [36,16]; these metrics are written out in the sketch after this list
  • We presented an approach to learn spatial image representations via echolocation
  • We performed an in-depth study of the spatial cues contained in echoes and how they can inform single-view depth estimation
  • We showed that the learned spatial features can benefit three downstream vision tasks
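For reference, the evaluation metrics named in the highlights can be written compactly as below. This is a generic sketch of the standard definitions with hypothetical array names, not code from the paper.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-8):
    """Standard single-view depth metrics over valid pixels.

    pred, gt : arrays of predicted / ground-truth depth (same shape, in meters).
    Returns RMS, REL, mean log10 error, and thresholded accuracies delta < 1.25**k.
    """
    valid = gt > eps
    pred, gt = pred[valid], gt[valid]

    rms = np.sqrt(np.mean((pred - gt) ** 2))            # root mean squared error
    rel = np.mean(np.abs(pred - gt) / gt)               # mean relative error
    log10 = np.mean(np.abs(np.log10(pred + eps) - np.log10(gt + eps)))

    ratio = np.maximum(pred / gt, gt / pred)            # per-pixel max ratio
    acc = {f"delta<1.25^{k}": np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)}
    return rms, rel, log10, acc

# Toy usage with random positive depths.
rng = np.random.default_rng(0)
gt = rng.uniform(0.5, 10.0, size=(240, 320))
pred = gt * np.clip(rng.normal(1.0, 0.1, size=gt.shape), 0.05, None)
print(depth_metrics(pred, gt))
```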
Methods
  • The authors present experiments to validate VisualEchoes on three tasks and three datasets. The goal is to examine the impact of the learned features compared to either learning features for each task from scratch or learning features with manual semantic supervision (see the initialization sketch after the dataset list).

    Datasets: 1) Replica [67]: contains 18 3D scenes with 1,740 navigable locations × 4 orientations = 6,960 agent states in total.
  • 2) NYU-V2 [66]: the authors use the standard split of 464 scenes, with 249 scenes for training and 215 for testing, following [36].
  • The authors use the dataset split as formulated in [33].
  • 3) DIODE [72]: the first public dataset that includes RGB-D images of both indoor and outdoor scenes.
  • The authors use only the indoor scenes, with the official train/val split.
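As context for these comparisons, the sketch below shows the generic transfer recipe being evaluated: the same RGB encoder is either trained from scratch or initialized from pretrained weights (e.g., VisualEchoes, ImageNet, or MIT Indoor Scenes pre-training) before fine-tuning on the downstream task. The encoder architecture and checkpoint path are placeholders, not the paper's actual code.

```python
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    """Placeholder CNN standing in for the RGB encoder whose initialization is compared."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.features(x)

def build_encoder(pretrained_ckpt=None):
    """Return the encoder either from scratch or initialized from a checkpoint."""
    enc = SimpleEncoder()
    if pretrained_ckpt is not None:
        state = torch.load(pretrained_ckpt, map_location="cpu")
        enc.load_state_dict(state, strict=False)  # reuse whatever layers match
    return enc

# From-scratch baseline:
scratch_encoder = build_encoder()
# Pretrained initialization (hypothetical checkpoint path):
# pretrained_encoder = build_encoder("visualechoes_encoder.pth")
```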
Results
  • The authors show some qualitative results for the downstream tasks described in the last section.
  • Fig. 6 and Fig. 7 show example results on monocular depth prediction and surface normal estimation, respectively.
  • Using the pre-trained VisualEchoes network as initialization leads to much more accurate depth prediction and surface normal estimation results compared to no pre-training, demonstrating the usefulness of the learned spatial features.
  • Excerpt of the depth prediction results on NYU-V2: ImageNet Pre-trained [36]: 0.555 / 0.126; MIT Indoor Scene Pre-trained: 0.711 / 0.180 (metric column headers are not recoverable here; see Table 2 for the full comparison).
  • Surface normal estimation is reported with Mean Dist. ↓ and Median Dist. ↓ (see Table 3).
Conclusion
  • The authors presented an approach to learn spatial image representations via echolocation.
  • The authors showed that the learned spatial features can benefit three downstream vision tasks.
  • The authors' results show that the learned spatial features already benefit transfer to vision-only tasks in real photos outside of the scanned environments, indicating the realism of what the system learned.
  • The authors are interested in pursuing these ideas within a sequential model, such that the agent could actively decide when to emit chirps and what type of chirps to emit to get the most informative echo responses
Summary
  • Introduction:

    The perceptual and cognitive abilities of embodied agents are inextricably tied to their physical being.
  • Humans perceive spatial sound.
  • Listeners can identify a sound-emitting object and determine its location from the time difference between when the sound reaches each ear (Interaural Time Difference, ITD) and the difference in sound level as it enters each ear (Interaural Level Difference, ILD).
  • Some animals capitalize on these cues by using echolocation—actively emitting sounds to perceive the 3D spatial layout of their surroundings [62]
  • Objectives:

    Whereas these methods aim to learn features generically useful for recognition, the authors' objective is to learn features generically useful for spatial estimation tasks.
  • The authors' goals are to show that echoes convey spatial information, to learn visual representations by echolocation, and to leverage the learned representations for downstream visual spatial tasks; a sketch of one possible echo-based pretext setup follows below.
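To make "learning visual representations by echolocation" concrete, here is a minimal sketch of one plausible pretext setup: an image encoder and a binaural-echo encoder are trained jointly to classify the relative orientation (assumed 4-way, matching the four agent orientations mentioned in the Methods) at which an echo was received for a given view. The module names, input shapes, and exact objective are assumptions made for illustration; the paper's actual VisualEchoes formulation may differ in its details.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Toy CNN over RGB frames (stand-in for the visual encoder being learned)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

class EchoEncoder(nn.Module):
    """Toy CNN over a 2-channel (binaural) echo spectrogram."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

class OrientationMatcher(nn.Module):
    """Fuse image and echo features and classify their relative orientation
    (assumed 4-way: 0, 90, 180, 270 degrees)."""
    def __init__(self, dim=128, n_orientations=4):
        super().__init__()
        self.img_enc = ImageEncoder(dim)
        self.echo_enc = EchoEncoder(dim)
        self.classifier = nn.Linear(2 * dim, n_orientations)

    def forward(self, rgb, echo_spec):
        feat = torch.cat([self.img_enc(rgb), self.echo_enc(echo_spec)], dim=1)
        return self.classifier(feat)

# Dummy batch: 8 RGB views, 8 binaural echo spectrograms, random orientation labels.
model = OrientationMatcher()
rgb = torch.randn(8, 3, 128, 128)
echo = torch.randn(8, 2, 64, 64)
labels = torch.randint(0, 4, (8,))
loss = nn.CrossEntropyLoss()(model(rgb, echo), labels)
loss.backward()  # after pretext training, the image encoder weights are the transferable features
```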
  • Methods:

    The authors present experiments to validate VisualEchoes on three tasks and three datasets. The goal is to examine the impact of the learned features compared to either learning features for each task from scratch or learning features with manual semantic supervision.

    Datasets: 1) Replica [67]: contains 18 3D scenes with 1,740 navigable locations × 4 orientations = 6,960 agent states in total.
  • 2) NYU-V2 [66]: the authors use the standard split of 464 scenes, with 249 scenes for training and 215 for testing, following [36].
  • The authors use the dataset split as formulated in [33].
  • 3) DIODE [72]: the first public dataset that includes RGB-D images of both indoor and outdoor scenes.
  • The authors use only the indoor scenes, with the official train/val split.
  • Results:

    The authors show some qualitative results for the downstream tasks described in the last section.
  • Fig. 6 and Fig. 7 show example results on monocular depth prediction and surface normal estimation, respectively.
  • Using the pre-trained VisualEchoes network as initialization leads to much more accurate depth prediction and surface normal estimation results compared to no pre-training, demonstrating the usefulness of the learned spatial features.
  • Excerpt of the depth prediction results on NYU-V2: ImageNet Pre-trained [36]: 0.555 / 0.126; MIT Indoor Scene Pre-trained: 0.711 / 0.180 (metric column headers are not recoverable here; see Table 2 for the full comparison).
  • Surface normal estimation is reported with Mean Dist. ↓ and Median Dist. ↓ (see Table 3).
  • Conclusion:

    The authors presented an approach to learn spatial image representations via echolocation.
  • The authors showed that the learned spatial features can benefit three downstream vision tasks.
  • The authors' results show that the learned spatial features already benefit transfer to vision-only tasks in real photos outside of the scanned environments, indicating the realism of what the system learned.
  • The authors are interested in pursuing these ideas within a sequential model, such that the agent could actively decide when to emit chirps and what type of chirps to emit to get the most informative echo responses
Tables
  • Table1: Case study depth prediction results. ↓ lower better, ↑ higher better
  • Table2: Depth prediction on Replica, NYU-V2, and DIODE datasets. We use the RGB2Depth network from Sec. 3.2 for all models. Our VisualEchoes pretraining transfers well, consistently predicting depth better than the model trained from scratch. Furthermore, it is even competitive with the supervised models, whether they are pre-trained for ImageNet (1M manually labeled images) or MIT Indoor Scenes (16K manually labeled images). ↓ lower better, ↑ higher better. (Un)sup = (un)supervised. We boldface the best unsupervised method
  • Table3: Results for three downstream tasks. ↓ lower better, ↑ higher better
Related work
  • Auditory Scene Analysis using Echoes. Previous work shows that, using echo responses only, one can predict the 2D [5] or 3D [14] room geometry. Additionally, echoes can complement vision, especially when vision-based depth estimates are not reliable, e.g., on transparent windows or featureless walls [42,80]. In dynamic environments, autonomous robots can leverage echoes for obstacle avoidance [71] and for mapping and navigation [17] using a bat-like echolocation model. Concurrently with our work, predicting depth maps purely from echo responses is also explored in [12] with a low-cost audio system called BatVision. Our work explores a novel direction for auditory scene analysis by employing echoes for spatial visual feature learning, and the features are applicable in the absence of any audio. (Footnote 1 of the paper: an agent here may be a person, robot, or simulated robot.)

    Self-Supervised Image Representation Learning. Self-supervised image feature learning methods leverage structured information within the data itself to generate labels for representation learning [63,33]. To this end, many "pretext" tasks have been explored, for example predicting the rotation applied to an input image [31,1] (see the sketch after this paragraph), discriminating image instances [19], colorizing images [45,82], solving a jigsaw puzzle from image patches [51], or multi-task learning using synthetic imagery [60]. Temporal information in videos also permits self-supervised tasks, for example by predicting whether a frame sequence is in the correct order [49,20] or by ensuring visual coherence of tracked objects [76,28,38]. Whereas these methods aim to learn features generically useful for recognition, our objective is to learn features generically useful for spatial estimation tasks. Accordingly, our echolocation objective is well aligned with our target family of spatial tasks (depth, surfaces, navigation), consistent with findings that task similarity is important for positive transfer [81]. Furthermore, unlike any of the above, rather than learn from massive repositories of human-taken photos, the proposed approach learns from interactions with the scene via echolocation.
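As a concrete illustration of one of the cited image-only pretext tasks (rotation prediction [31]), the sketch below generates rotation labels from unlabeled images. It is a generic, assumed example, not code from any of the cited papers.

```python
import torch

def make_rotation_batch(images):
    """Build a rotation-prediction pretext batch from unlabeled images.

    images : tensor of shape (B, C, H, W).
    Returns (rotated_images, labels), where label k encodes a k*90 degree rotation;
    a classifier trained to predict k learns features without manual annotation.
    """
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Toy usage with random "images".
x = torch.randn(16, 3, 32, 32)
batch, labels = make_rotation_batch(x)   # shapes: (64, 3, 32, 32) and (64,)
```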
Reference
  • 1. Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: ICCV (2015)
  • 2. Agrawal, P., Nair, A.V., Abbeel, P., Malik, J., Levine, S.: Learning to poke by poking: Experiential learning of intuitive physics. In: NeurIPS (2016)
  • 3. Alameda-Pineda, X., Staiano, J., Subramanian, R., Batrinca, L., Ricci, E., Lepri, B., Lanz, O., Sebe, N.: Salsa: A novel dataset for multimodal group behavior analysis. TPAMI (2015)
  • 4. Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., Savva, M., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018)
  • 5. Antonacci, F., Filos, J., Thomas, M.R., Habets, E.A., Sarti, A., Naylor, P.A., Tubaro, S.: Inference of room geometry from acoustic impulse responses. IEEE Transactions on Audio, Speech, and Language Processing (2012)
  • 6. Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: ICCV (2017)
  • 7. Arandjelovic, R., Zisserman, A.: Objects that sound. In: ECCV (2018)
  • 8. Aytar, Y., Vondrick, C., Torralba, A.: Soundnet: Learning sound representations from unlabeled video. In: NeurIPS (2016)
  • 9. Ban, Y., Li, X., Alameda-Pineda, X., Girin, L., Horaud, R.: Accounting for room acoustics in audio-visual multi-speaker tracking. In: ICASSP (2018)
  • 10. Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3d: Learning from rgb-d data in indoor environments. In: 3DV (2017)
  • 11. Chen, C., Jain, U., Schissler, C., Gari, S.V.A., Al-Halah, Z., Ithapu, V.K., Robinson, P., Grauman, K.: Audio-visual embodied navigation. arXiv preprint arXiv:1912.11474 (2019)
  • 12. Christensen, J., Hornauer, S., Yu, S.: BatVision: Learning to see 3d spatial layout with two ears. In: ICRA (2020)
  • 13. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
  • 14. Dokmanic, I., Parhizkar, R., Walther, A., Lu, Y.M., Vetterli, M.: Acoustic echoes reveal room shape. Proceedings of the National Academy of Sciences (2013)
  • 15. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV (2015)
  • 16. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NeurIPS (2014)
  • 17. Eliakim, I., Cohen, Z., Kosa, G., Yovel, Y.: A fully autonomous terrestrial bat-like acoustic robot. PLoS Computational Biology (2018)
  • 18. Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W.T., Rubinstein, M.: Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. In: SIGGRAPH (2018)
  • 19. Feng, Z., Xu, C., Tao, D.: Self-supervised representation learning by rotation feature decoupling. In: CVPR (2019)
  • 20. Fernando, B., Bilen, H., Gavves, E., Gould, S.: Self-supervised video representation learning with odd-one-out networks. In: CVPR (2017)
  • 21. Fouhey, D.F., Gupta, A., Hebert, M.: Data-driven 3d primitives for single image understanding. In: ICCV (2013)
  • 22. Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: CVPR (2018)
  • 23. Gan, C., Zhao, H., Chen, P., Cox, D., Torralba, A.: Self-supervised moving vehicle tracking with stereo sound. In: ICCV (2019)
  • 24. Gandhi, D., Pinto, L., Gupta, A.: Learning to fly by crashing. In: IROS (2017)
  • 25. Gao, R., Feris, R., Grauman, K.: Learning to separate object sounds by watching unlabeled video. In: ECCV (2018)
  • 26. Gao, R., Grauman, K.: 2.5d visual sound. In: CVPR (2019)
  • 27. Gao, R., Grauman, K.: Co-separating sounds of visual objects. In: ICCV (2019)
  • 28. Gao, R., Jayaraman, D., Grauman, K.: Object-centric representation learning from unlabeled videos. In: ACCV (2016)
  • 29. Garg, R., BG, V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In: ECCV (2016)
  • 30. Gebru, I.D., Ba, S., Evangelidis, G., Horaud, R.: Tracking the active speaker based on a joint audio-visual observation model. In: ICCV Workshops (2015)
  • 31. Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: ICLR (2018)
  • 32. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR (2017)
  • 33. Goyal, P., Mahajan, D., Gupta, A., Misra, I.: Scaling and benchmarking self-supervised visual representation learning. In: ICCV (2019)
  • 34. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • 35. Hershey, J.R., Movellan, J.R.: Audio vision: Using audio-visual synchrony to locate sounds. In: NeurIPS (2000)
  • 36. Hu, J., Ozay, M., Zhang, Y., Okatani, T.: Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In: WACV (2019)
  • 37. Irie, G., Ostrek, M., Wang, H., Kameoka, H., Kimura, A., Kawanishi, T., Kashino, K.: Seeing through sounds: Predicting visual semantic segmentation results from multichannel audio signals. In: ICASSP (2019)
  • 38. Jayaraman, D., Grauman, K.: Slow and steady feature analysis: Higher order temporal coherence in video. In: CVPR (2016)
  • 39. Jayaraman, D., Grauman, K.: Learning image representations equivariant to ego-motion. In: ICCV (2015)
  • 40. Jiang, H., Larsson, G., Maire, M., Shakhnarovich, G., Learned-Miller, E.: Self-supervised relative depth learning for urban scene understanding. In: ECCV (2018)
  • 41. Karsch, K., Liu, C., Kang, S.B.: Depth transfer: Depth extraction from video using non-parametric sampling. TPAMI (2014)
  • 42. Kim, H., Remaggi, L., Jackson, P.J., Fazi, F.M., Hilton, A.: 3d room geometry reconstruction using audio-visual sensors. In: 3DV (2017)
  • 43. Korbar, B., Tran, D., Torresani, L.: Co-training of audio and video representations from self-supervised temporal synchronization. In: NeurIPS (2018)
  • 44. Kuttruff, H.: Room Acoustics. CRC Press (2017)
  • 45. Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: CVPR (2017)
  • 46. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sanchez, C.I.: A survey on deep learning in medical image analysis. Medical Image Analysis (2017)
  • 47. Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. In: CVPR (2015)
  • 48. McGuire, K., De Wagter, C., Tuyls, K., Kappen, H., de Croon, G.: Minimal navigation solution for a swarm of tiny flying robots to explore an unknown environment. Science Robotics (2019)
  • 49. Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: Unsupervised learning using temporal order verification. In: ECCV (2016)
  • 50. Morgado, P., Vasconcelos, N., Langlois, T., Wang, O.: Self-supervised generation of spatial audio for 360° video. In: NeurIPS (2018)
  • 51. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)
  • 52. Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisensory features. In: ECCV (2018)
  • 53. Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E.H., Freeman, W.T.: Visually indicated sounds. In: CVPR (2016)
  • 54. Owens, A., Wu, J., McDermott, J.H., Freeman, W.T., Torralba, A.: Ambient sound provides supervision for visual learning. In: ECCV (2016)
  • 55. Palossi, D., Loquercio, A., Conti, F., Flamand, E., Scaramuzza, D., Benini, L.: A 64-mW DNN-based visual navigation engine for autonomous nano-drones. IEEE Internet of Things Journal (2019)
  • 56. Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In: ICRA (2016)
  • 57. Purushwalkam, S., Gupta, A., Kaufman, D.M., Russell, B.: Bounce and learn: Modeling scene dynamics with real-world bounces. In: ICLR (2019)
  • 58. Quattoni, A., Torralba, A.: Recognizing indoor scenes. In: CVPR (2009)
  • 59. Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., Black, M.J.: Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In: CVPR (2019)
  • 60. Ren, Z., Jae Lee, Y.: Cross-domain self-supervised multi-task feature learning using synthetic imagery. In: CVPR (2018)
  • 61. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
  • 62. Rosenblum, L.D., Gordon, M.S., Jarquin, L.: Echolocating distance by moving and stationary listeners. Ecological Psychology (2000)
  • 63. de Sa, V.R.: Learning classification with unlabeled data. In: NeurIPS (1994)
  • 64. Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., Batra, D.: Habitat: A platform for embodied AI research. In: ICCV (2019)
  • 65. Senocak, A., Oh, T.H., Kim, J., Yang, M.H., So Kweon, I.: Learning to localize sound source in visual scenes. In: CVPR (2018)
  • 66. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV (2012)
  • 67. Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., et al.: The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797 (2019)
  • 68. Stroffregen, T.A., Pittenger, J.B.: Human echolocation as a basic form of perception and action. Ecological Psychology (1995)
  • 69. Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C.: Audio-visual event localization in unconstrained videos. In: ECCV (2018)
  • 70. Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.: Demon: Depth and motion network for learning monocular stereo. In: CVPR (2017)
  • 71. Vanderelst, D., Holderied, M.W., Peremans, H.: Sensorimotor model of obstacle avoidance in echolocating bats. PLoS Computational Biology (2015)
  • 72. Vasiljevic, I., Kolkin, N., Zhang, S., Luo, R., Wang, H., Dai, F.Z., Daniele, A.F., Mostajabi, M., Basart, S., Walter, M.R., Shakhnarovich, G.: DIODE: A Dense Indoor and Outdoor DEpth Dataset. arXiv preprint arXiv:1908.00463 (2019)
  • 73. Veach, E., Guibas, L.: Bidirectional estimators for light transport. In: Photorealistic Rendering Techniques (1995)
  • 74. Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., Fragkiadaki, K.: Sfm-net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804 (2017)
  • 75. Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A.L.: Towards unified depth and semantic prediction from a single image. In: CVPR (2015)
  • 76. Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: ICCV (2015)
  • 77. Xia, F., Zamir, A.R., He, Z., Sax, A., Malik, J., Savarese, S.: Gibson env: Real-world perception for embodied agents. In: CVPR (2018)
  • 78. Xu, D., Ricci, E., Ouyang, W., Wang, X., Sebe, N.: Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. In: CVPR (2017)
  • 79. Yang, Z., Wang, P., Xu, W., Zhao, L., Nevatia, R.: Unsupervised learning of geometry with edge-aware depth-normal consistency. In: AAAI (2018)
  • 80. Ye, M., Zhang, Y., Yang, R., Manocha, D.: 3d reconstruction in the presence of glasses by acoustic and stereo fusion. In: ICCV (2015)
  • 81. Zamir, A.R., Sax, A., Shen, W., Guibas, L.J., Malik, J., Savarese, S.: Taskonomy: Disentangling task transfer learning. In: CVPR (2018)
  • 82. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: ECCV (2016)
  • 83. Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., Torralba, A.: The sound of pixels. In: ECCV (2018)
  • 84. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
  • 85. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR (2017)
  • 86. Zhou, Y., Wang, Z., Fang, C., Bui, T., Berg, T.L.: Visual to sound: Generating natural sound for videos in the wild. In: CVPR (2018)