
Structure-Guided Ranking Loss for Single Image Depth Prediction

CVPR, pp. 608–617, 2020


Abstract

Single image depth prediction is a challenging task due to its ill-posed nature and the difficulty of capturing ground truth for supervision. Large-scale disparity data generated from stereo photos and 3D videos is a promising source of supervision; however, such disparity data can only approximate the inverse ground-truth depth up to an affine transformation.

Introduction
  • Single image depth prediction is a challenging task due to its ill-posed nature and the difficulty of capturing ground truth for supervision.
  • Large-scale disparity data generated from stereo photos and 3D videos is a promising source of supervision; however, such disparity data can only approximate the inverse ground-truth depth up to an affine transformation.
  • To learn more effectively from such pseudo-depth data, the authors propose a simple pair-wise ranking loss with a novel sampling strategy.
  • The authors show that the pair-wise ranking loss, combined with the structure-guided sampling strategies, can significantly improve the quality of depth map prediction.
  • The authors conduct cross-dataset evaluation on six benchmark datasets and show that the method consistently improves over the baselines, leading to superior quantitative and qualitative results.
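The pair-wise ranking loss described above compares the predicted depth at sampled point pairs against their ordinal relation in the pseudo ground truth. The following is a minimal NumPy sketch, not the paper's exact formulation: the threshold `tau` deciding when a pair counts as "roughly equal" and the squared penalty for equal pairs are assumptions modeled on common ranking-loss variants.

```python
import numpy as np

def pairwise_ranking_loss(pred_a, pred_b, gt_a, gt_b, tau=0.03):
    """Pair-wise ranking loss over sampled point pairs (illustrative sketch).

    pred_a, pred_b: predicted depth at the two points of each pair, shape (N,)
    gt_a, gt_b:     pseudo ground-truth depth at the same points, shape (N,)
    tau:            relative threshold for treating a pair as 'roughly equal'
    """
    # Ordinal label: +1 if point a is deeper, -1 if point b is deeper, 0 if close.
    ratio = gt_a / np.maximum(gt_b, 1e-8)
    label = np.where(ratio > 1 + tau, 1.0,
                     np.where(ratio < 1 / (1 + tau), -1.0, 0.0))
    diff = pred_a - pred_b
    ordered = np.log1p(np.exp(-label * diff))  # hinge-like term for ordered pairs
    equal = diff ** 2                          # pull near-equal pairs together
    return np.mean(np.where(label != 0, ordered, equal))
```

A correctly ordered pair yields a small loss, while a reversed prediction is penalized heavily, which is what drives the network toward the right depth ordering without needing metric depth.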
Highlights
  • Single image depth prediction is a challenging task due to its ill-posed nature and the difficulty of capturing ground truth for supervision
  • We propose a structure-guided ranking loss formulation with two novel sampling strategies for monocular depth prediction
  • We introduce a large relative depth dataset of about 21K high-resolution web stereo photos
  • We propose to train a model from web stereo images, where only derived disparity maps are present for supervision
  • Edge- and instance-guided point-pair sampling can greatly enhance the local details of the depth prediction, but we find that it is not very effective at preserving global structures, such as ground planes and walls
  • Disparity data generated from stereo images and videos is a promising source of supervision for depth prediction methods, but it can only approximate the true inverse depth
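The edge-guided sampling mentioned in the highlights draws point pairs that straddle depth discontinuities, so the ranking loss directly supervises object boundaries. This is a simplified sketch of the idea, not the paper's procedure: the gradient threshold, the fixed pixel `offset`, and the rounding of the step direction are all assumptions made for illustration.

```python
import numpy as np

def edge_guided_pairs(depth, n_pairs=4, thresh=0.4, offset=2, seed=0):
    """Sample point pairs straddling depth edges (simplified sketch).

    Returns a list of ((row, col), (row, col)) pairs; each pair lies on
    opposite sides of a strong-gradient location, along the gradient direction.
    """
    rng = np.random.default_rng(seed)
    gy, gx = np.gradient(depth.astype(float))
    mag = np.hypot(gx, gy)
    ys, xs = np.nonzero(mag > thresh)
    h, w = depth.shape
    pairs = []
    for y, x in zip(ys, xs):
        # Step 'offset' pixels along the normalised gradient, in both directions.
        dy = int(round(offset * gy[y, x] / mag[y, x]))
        dx = int(round(offset * gx[y, x] / mag[y, x]))
        a, b = (y - dy, x - dx), (y + dy, x + dx)
        if 0 <= a[0] < h and 0 <= a[1] < w and 0 <= b[0] < h and 0 <= b[1] < w:
            pairs.append((a, b))
    if not pairs:
        return []
    idx = rng.choice(len(pairs), size=min(n_pairs, len(pairs)), replace=False)
    return [pairs[i] for i in idx]
```

On a synthetic step-edge depth map, every sampled pair crosses the discontinuity, which is exactly the kind of pair a random sampler would rarely hit.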
Methods
  • For DIW [3], the authors use the released model trained on DIW with a ranking loss.
  • RW [36] was trained with a ranking loss on the RW dataset which is derived from web stereo images.
  • To further improve the consistency of predicted depth on object instances, the authors incorporate instance-guided sampling into the sampling strategy (i.e., Ours ERI); the gains show up in the iBims boundary metrics εacc and εcomp.
  • The importance of instance-guided sampling is reflected in the improvement of Ours ERIM over Ours ERM, as the only difference between the two is whether instance guidance is used.
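Instance-guided sampling restricts point-pair selection using instance masks (the paper obtains them with Mask R-CNN [13]). A minimal sketch of one plausible variant, sampling both endpoints inside the same instance so the ranking loss can enforce depth consistency within an object; the background label `0`, the per-instance pair count, and the uniform sampling are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def instance_guided_pairs(mask, n_pairs=4, seed=0):
    """Sample point pairs within each instance of a label mask (sketch).

    mask: integer instance-label map, 0 = background.
    Returns a list of ((row, col), (row, col)) pairs, n_pairs per instance.
    """
    rng = np.random.default_rng(seed)
    pairs = []
    for inst in np.unique(mask):
        if inst == 0:  # assumed background label, skipped
            continue
        ys, xs = np.nonzero(mask == inst)
        pts = np.stack([ys, xs], axis=1)
        # Draw n_pairs random index pairs uniformly from this instance's pixels.
        idx = rng.integers(0, len(pts), size=(n_pairs, 2))
        pairs.extend((tuple(pts[i]), tuple(pts[j])) for i, j in idx)
    return pairs
```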
Results
  • With the proposed loss and the new dataset, the model achieves state-of-the-art cross-dataset generalization performance.
  • Although both MiDaS and YT3D mix different sources of data for training, the model still achieves the best performance in this setting.
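Because web-stereo disparity matches inverse depth only up to an affine transformation, cross-dataset comparison of such predictions is typically done after a least-squares scale-and-shift alignment against the ground truth (as in the affine-invariant losses mentioned in the tables below). A minimal sketch of that alignment step, assuming a closed-form least-squares fit:

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Fit scale s and shift t minimising ||s*pred + t - gt||^2 and return
    the aligned prediction (illustrative affine-invariant alignment)."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s * pred + t
```

After this alignment, standard error metrics can be computed even though the raw prediction is only defined up to scale and shift.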
Tables
  • Table 1: Ordinal error (%) of zero-shot cross-dataset evaluation. Existing monodepth methods were trained on different sources of data: DIW, ReDWeb (RW), MegaDepth (MD), 3D Movies (MV), iPhone Depth (ID), YouTube3D (YT3D), and MannequinChallenge (MC); with different losses: pair-wise ranking loss (PR), affine-invariant MSE loss (AI), multi-scale gradient matching loss (MGM), L1 loss (L1), scale-invariant loss (SI), robust ordinal depth loss (ROD), and multi-scale edge-aware smoothness loss (MES). Our structure-guided ranking loss is denoted as SR. To disentangle the effect of datasets from that of losses, we also evaluate three baseline models: 1) Ours AI: using the same losses as MiDaS; 2) Ours†: using our final loss on the RW dataset; 3) Ours R: using the pair-wise ranking loss [36]. To evaluate the robustness of trained models, we compare our models with the state-of-the-art methods on six RGBD datasets that were unseen during training. The lowest error is boldfaced and the second lowest is underlined.
  • Table 2: Quantitative evaluation and an ablation of variants of our loss function, including: Ours AI: the baseline model trained on our data with affine-invariant and multi-scale gradient losses as in [19]; Ours R: random-sampling ranking loss; Ours E: edge-guided sampling; Ours ER: edge-guided sampling + Ours R; Ours ERI: instance-guided sampling + Ours ER; Ours ERM: multi-scale gradient matching term + Ours ER; Ours ERIM: our model trained with our final loss functions. For all metrics, lower is better.
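The ordinal error reported in Table 1 measures the fraction of sampled point pairs whose predicted depth ordering disagrees with the ground-truth ordering. A minimal sketch of this metric; the ratio-based threshold `tau` for declaring a pair "roughly equal" is an assumption for illustration:

```python
import numpy as np

def ordinal_error(pred, gt, pairs, tau=0.03):
    """Ordinal error (%) over sampled point pairs (illustrative sketch).

    pairs: list of ((row, col), (row, col)) index pairs into the depth maps.
    """
    def ordinal(vals):
        a = np.array([vals[p] for p, _ in pairs], dtype=float)
        b = np.array([vals[q] for _, q in pairs], dtype=float)
        r = a / np.maximum(b, 1e-8)
        # +1 / -1 / 0 label, mirroring the ranking-loss labelling.
        return np.where(r > 1 + tau, 1, np.where(r < 1 / (1 + tau), -1, 0))
    return 100.0 * np.mean(ordinal(pred) != ordinal(gt))
```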
Related work
  • Monodepth methods: Traditional monodepth methods rely on direct supervision [15, 23, 27], mainly through handcrafted features, to learn 3D priors from images. In recent years, supervised deep learning models [2, 5, 6, 7, 18, 20, 24, 25, 37, 38, 39] have achieved state-of-the-art performance in the task of monocular metric depth prediction. These methods, trained on RGB-D datasets, learn a mapping function from RGB to depth. Although these models can predict accurate depth when tested on the same or similar datasets, they do not easily generalize to novel scenes. In addition to these supervised methods, unsupervised and semi-supervised algorithms have also been studied. The key idea behind these methods [9, 12, 17, 35, 40] is an image reconstruction loss for view synthesis, requiring calibrated stereo pairs or video sequences for training. However, these models share the same limitation as the supervised deep learning methods: they do not generalize to new datasets. To address this issue, multiple in-the-wild (i.e., containing both indoor and outdoor scenes) RGB-D datasets [3, 4, 19, 21, 22, 33, 34, 36] have been proposed.
Funding
  • This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61876211 and U1913602) and in part by an Adobe gift.
Reference
  • [1] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Proc. European Conf. on Computer Vision (ECCV), pages 611–625, 2012.
  • [2] Ayan Chakrabarti, Jingyu Shao, and Gregory Shakhnarovich. Depth from a single image by harmonizing overcomplete local network predictions. In Advances in Neural Information Processing Systems, 2016.
  • [3] Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In Advances in Neural Information Processing Systems, pages 730–738, 2016.
  • [4] Weifeng Chen, Shengyi Qian, and Jia Deng. Learning single-image depth from videos using quality assessment networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5604–5613, 2019.
  • [5] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. Int. Conf. on Computer Vision (ICCV), 2015.
  • [6] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, 2014.
  • [7] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
  • [8] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
  • [9] Ravi Garg and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proc. European Conf. on Computer Vision (ECCV), 2016.
  • [10] Rahul Garg, Neal Wadhwa, Sameer Ansari, and Jonathan T. Barron. Learning single camera depth estimation using dual-pixels. arXiv preprint arXiv:1904.05822, 2019.
  • [11] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • [12] Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
  • [13] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2961–2969, 2017.
  • [14] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
  • [15] Kevin Karsch, Ce Liu, and Sing Bing Kang. DepthTransfer: Depth extraction from video using non-parametric sampling. Trans. Pattern Analysis and Machine Intelligence, 2014.
  • [16] Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of CNN-based single-image depth estimation methods. In Proc. European Conf. on Computer Vision Workshops (ECCV-WS), pages 331–348, 2018.
  • [17] Yevhen Kuznietsov, Jorg Stuckler, and Bastian Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
  • [18] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In Proc. IEEE Int. Conf. on 3D Vision (3DV), 2016.
  • [19] Katrin Lasinger, Rene Ranftl, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. arXiv preprint arXiv:1907.01341, 2019.
  • [20] Jae-Han Lee, Minhyeok Heo, Kyung-Rae Kim, and Chang-Su Kim. Single-image depth estimation based on Fourier domain analysis. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
  • [21] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T. Freeman. Learning the depths of moving people by watching frozen people. In Proc. Computer Vision and Pattern Recognition (CVPR), 2019.
  • [22] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
  • [23] Beyang Liu, Stephen Gould, and Daphne Koller. Single image depth estimation from predicted semantic labels. In Proc. Computer Vision and Pattern Recognition (CVPR), 2010.
  • [24] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. Trans. Pattern Analysis and Machine Intelligence, 2015.
  • [25] Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, Raquel Urtasun, and Jiaya Jia. GeoNet: Geometric neural network for joint depth and surface normal estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
  • [26] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
  • [27] Ashutosh Saxena, Min Sun, and Andrew Y. Ng. Make3D: Learning 3D scene structure from a single still image. Trans. Pattern Analysis and Machine Intelligence, 2009.
  • [28] Thomas Schops, Johannes L. Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
  • [29] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Proc. European Conf. on Computer Vision (ECCV), 2012.
  • [30] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proc. Int. Conf. on Intelligent Robot Systems (IROS), 2012.
  • [31] Jonas Uhrig, Nick Schneider, Lucas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In Proc. IEEE Int. Conf. on 3D Vision (3DV), 2017.
  • [32] Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z. Dai, Andrea F. Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R. Walter, and Gregory Shakhnarovich. DIODE: A dense indoor and outdoor depth dataset. arXiv preprint arXiv:1908.00463, 2019.
  • [33] Chaoyang Wang, Simon Lucey, Federico Perazzi, and Oliver Wang. Web stereo video supervision for depth prediction from dynamic scenes. In Proc. IEEE Int. Conf. on 3D Vision (3DV), 2019.
  • [34] Lijun Wang, Xiaohui Shen, Jianming Zhang, Oliver Wang, Zhe Lin, Chih-Yao Hsieh, Sarah Kong, and Huchuan Lu. DeepLens: Shallow depth of field from a single image. ACM Trans. Graph. (Proc. SIGGRAPH Asia), 37(6):6:1–6:11, 2018.
  • [35] Jamie Watson, Michael Firman, Gabriel J. Brostow, and Daniyar Turmukhambetov. Self-supervised monocular depth hints. In Proc. Int. Conf. on Computer Vision (ICCV), 2019.
  • [36] Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Yang Xiao, Ruibo Li, and Zhenbo Luo. Monocular relative depth perception with web stereo data supervision. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
  • [37] Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
  • [38] Dan Xu, Wei Wang, Hao Tang, Hong Liu, Nicu Sebe, and Elisa Ricci. Structured attention guided convolutional neural fields for monocular depth estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
  • [39] Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In Proc. Int. Conf. on Computer Vision (ICCV), 2019.
  • [40] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
  • [41] Daniel Zoran, Phillip Isola, Dilip Krishnan, and William T. Freeman. Learning ordinal relationships for mid-level vision. In Proc. Int. Conf. on Computer Vision (ICCV), 2015.