Structure-Guided Ranking Loss for Single Image Depth Prediction
CVPR, pp. 608–617, 2020
Single image depth prediction is a challenging task due to its ill-posed nature and challenges with capturing ground truth for supervision. Large-scale disparity data generated from stereo photos and 3D videos is a promising source of supervision; however, such disparity data can only approximate the inverse ground truth depth up to an affine transformation.
- Single image depth prediction is a challenging task due to its ill-posed nature and challenges with capturing ground truth for supervision.
- Large-scale disparity data generated from stereo photos and 3D videos is a promising source of supervision; however, such disparity data can only approximate the inverse ground truth depth up to an affine transformation.
- To more effectively learn from such pseudo-depth data, the authors propose to use a simple pair-wise ranking loss with a novel sampling strategy.
- The authors show that the pair-wise ranking loss, combined with the structure-guided sampling strategies, can significantly improve the quality of depth map prediction.
- The authors conduct cross-dataset evaluation on six benchmark datasets and show that the method consistently improves over the baselines, leading to superior quantitative and qualitative results.
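The pair-wise ranking loss described above can be sketched in a few lines. This is a minimal, hypothetical illustration in the spirit of DIW-style ordinal supervision, not the authors' exact formulation; the function name and the +1 / -1 / 0 label convention are our assumptions:

```python
import numpy as np

def pairwise_ranking_loss(d, pairs, labels):
    """Pair-wise ranking loss over sampled point pairs.

    d      : (N,) predicted depth values at the sampled points
    pairs  : (M, 2) indices (i, j) of sampled point pairs
    labels : (M,) ordinal ground truth: +1 if point i is deeper than j,
             -1 if shallower, 0 if the two are at roughly the same depth
    """
    di = d[pairs[:, 0]]
    dj = d[pairs[:, 1]]
    diff = di - dj
    ordered = labels != 0
    # Ranking term: logistic loss on the signed difference, penalizing
    # pairs whose predicted order disagrees with the ordinal label.
    loss_rank = np.log1p(np.exp(-labels[ordered] * diff[ordered]))
    # Equality term: pull predictions together for "same depth" pairs.
    loss_eq = diff[~ordered] ** 2
    return (loss_rank.sum() + loss_eq.sum()) / len(pairs)
```

The structure-guided variants change only how `pairs` is sampled (e.g., across image edges or within object instances), not the loss itself; ordinal labels can be derived by comparing ground-truth disparity at the two points.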
- We propose a structure-guided ranking loss formulation with two novel sampling strategies for monocular depth prediction
- We introduce a large relative depth dataset of about 21K high-resolution web stereo photos.
- We propose to train a model from web stereo images, where only derived disparity maps are present for supervision
- Point-pair sampling: Edge- and instance-guided sampling can greatly enhance the local details of the depth prediction, but we find that they are not very effective at preserving global structures, such as ground planes and walls.
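As a rough illustration of edge-guided point-pair sampling, the sketch below picks pairs that straddle strong image gradients, so each pair probes a likely depth discontinuity. The paper's actual strategy (and its instance-guided counterpart, which uses segmentation masks) is more involved; the function name and the gradient-magnitude edge proxy here are our assumptions:

```python
import numpy as np

def edge_guided_pairs(gray, n_pairs=100, offset=3, thresh=0.1, rng=None):
    """Sample point pairs straddling image edges (hypothetical sketch).

    gray : (H, W) grayscale image with values in [0, 1]
    Returns two (n_pairs, 2) arrays of (row, col) coordinates, one point
    on each side of an edge pixel, stepped along the gradient direction.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Finite-difference gradients as a stand-in for a real edge detector.
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    edges = np.argwhere(mag > thresh)
    if len(edges) == 0:
        return np.empty((0, 2), int), np.empty((0, 2), int)
    idx = rng.integers(0, len(edges), size=n_pairs)
    rows, cols = edges[idx, 0], edges[idx, 1]
    # Step across the edge along the local gradient direction.
    theta = np.arctan2(gy[rows, cols], gx[rows, cols])
    dr = np.round(offset * np.sin(theta)).astype(int)
    dc = np.round(offset * np.cos(theta)).astype(int)
    h, w = gray.shape
    a = np.stack([np.clip(rows + dr, 0, h - 1), np.clip(cols + dc, 0, w - 1)], 1)
    b = np.stack([np.clip(rows - dr, 0, h - 1), np.clip(cols - dc, 0, w - 1)], 1)
    return a, b
```

Pairs sampled this way concentrate the ranking loss on depth boundaries, which is why they sharpen local detail while leaving large smooth regions (ground, walls) mostly unconstrained.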
- Disparity data generated from stereo images and videos is a promising source of supervision for depth prediction methods, but it can only approximate the true inverse depth
- For DIW, the authors use the released model that was trained on DIW with a ranking loss.
- RW was trained with a ranking loss on the RW dataset, which is derived from web stereo images.
- To further improve the depth consistency on object instances (reflected in the iBims-1 εacc and εcomp metrics), the authors incorporate instance-guided sampling into the sampling strategy (i.e., Ours ERI).
- The importance of instance-guided sampling is reflected in the improvement of Ours ERIM over Ours ERM, as the only difference between the two is the use of instance guidance.
- With the proposed loss and the new dataset, the model achieves state-of-the-art cross-dataset generalization performance.
- Although both MiDaS and YT3D mix different sources of data for training, the model still achieves the best performance in this setting.
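Since disparity supervises inverse depth only up to an affine transformation, losses such as the affine-invariant MSE used by the Ours AI baseline first align the prediction to the target with a per-image scale and shift. A minimal least-squares sketch of that alignment (the function name is ours):

```python
import numpy as np

def align_scale_shift(pred, target, mask=None):
    """Least-squares scale s and shift t so that s*pred + t ~= target.

    Closed-form solution of min_{s,t} sum((s*pred + t - target)^2),
    the alignment step underlying affine-invariant losses for
    disparity supervision.
    """
    if mask is None:
        mask = np.ones_like(pred, dtype=bool)
    p, g = pred[mask], target[mask]
    # Linear least squares for the 2-parameter fit [s, t].
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s, t
```

The ranking loss sidesteps this alignment entirely: ordinal relations between point pairs are invariant to any positive affine transformation of the predictions.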
- Table1: Ordinal error (%) of zero-shot cross-dataset evaluation. Existing monodepth methods were trained on different sources of data: DIW, ReDWeb (RW), MegaDepth (MD), 3D Movies (MV), iPhone Depth (ID), YouTube3D (YT3D), and MannequinChallenge (MC); with different losses: pair-wise ranking loss (PR), affine-invariant MSE loss (AI), multi-scale gradient matching loss (MGM), L1 loss (L1), scale-invariant loss (SI), robust ordinal depth loss (ROD) and multi-scale edge-aware smoothness loss (MES). Our structure-guided ranking loss is denoted as SR. To disentangle the effect of datasets from that of losses, we also evaluate three baseline models: 1) Ours AI: using the same losses as MiDaS; 2) Ours†: using our final loss on the RW dataset; 3) Ours R: using the pair-wise ranking loss [<a class="ref-link" id="c36" href="#r36">36</a>]. To evaluate the robustness of trained models, we compare our models with the state-of-the-art methods on six RGBD datasets that were unseen during training. The lowest error is boldfaced and the second lowest is underlined
- Table 2: Quantitative evaluation, and an ablation of variants on our loss function, including: Ours AI: the baseline model trained on our data with affine-invariant and multi-scale gradient losses as in [<a class="ref-link" id="c19" href="#r19">19</a>]; Ours R: random sampling ranking loss; Ours E: edge-guided sampling; Ours ER: edge-guided sampling + Ours R; Ours ERI: instance-guided sampling + Ours ER; Ours ERM: multi-scale gradient matching term + Ours ER; Ours ERIM: our model trained with our final loss functions. For all metrics, lower is better.
- Monodepth methods: Traditional monodepth methods rely on direct supervision [15, 23, 27], mainly through handcrafted features, to learn 3D priors from images. In recent years, supervised deep learning models [2, 5, 6, 7, 18, 20, 24, 25, 37, 38, 39] have achieved state-of-the-art performance in monocular metric depth prediction. These methods, trained on RGB-D datasets, learn a mapping function from RGB to depth. Although such models predict accurate depth when tested on the same or similar datasets, they do not generalize easily to novel scenes. In addition to these supervised methods, unsupervised and semi-supervised algorithms have also been studied. The key idea behind these methods [9, 12, 17, 35, 40] is an image reconstruction loss for view synthesis, which requires calibrated stereo pairs or video sequences for training. However, these models share the same limitation as the supervised deep learning methods: they do not generalize to new datasets. To address this issue, multiple in-the-wild (i.e., containing both indoor and outdoor scenes) RGB-D datasets [3, 4, 19, 21, 34, 22, 33, 36] have been proposed.
- This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61876211 and U1913602) and in part by an Adobe Gift.
- D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Proc. European Conf. on Computer Vision (ECCV), pages 611–625, 2012.
- Ayan Chakrabarti, Jingyu Shao, and Gregory Shakhnarovich. Depth from a single image by harmonizing overcomplete local network predictions. In Advances in Neural Information Processing Systems, 2016.
- Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In Advances in Neural Information Processing Systems, pages 730–738, 2016.
- Weifeng Chen, Shengyi Qian, and Jia Deng. Learning single-image depth from videos using quality assessment networks. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 5604–5613, 2019.
- David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. Int. Conf. on Computer Vision (ICCV), 2015.
- David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, 2014.
- Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
- A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
- Ravi Garg and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proc. European Conf. on Computer Vision (ECCV), 2016.
- Rahul Garg, Neal Wadhwa, Sameer Ansari, and Jonathan T. Barron. Learning single camera depth estimation using dual-pixels. arXiv:1904.05822, 2019.
- Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
- Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
- Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proc. Int. Conf. on Computer Vision (ICCV), pages 2961–2969, 2017.
- E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
- Kevin Karsch, Ce Liu, and Sing Bing Kang. DepthTransfer: Depth extraction from video using non-parametric sampling. Trans. Pattern Analysis and Machine Intelligence, 2014.
- Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Korner. Evaluation of CNN-based single-image depth estimation methods. In Proc. European Conf. on Computer Vision Workshops (ECCV-WS), pages 331–348, 2018.
- Yevhen Kuznietsov, Jorg Stuckler, and Bastian Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
- Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In Proc. IEEE Int. Conf. on 3D Vision (3DV), 2016.
- Katrin Lasinger, Rene Ranftl, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. arXiv:1907.01341, 2019.
- Jae-Han Lee, Minhyeok Heo, Kyung-Rae Kim, and Chang-Su Kim. Single-image depth estimation based on Fourier domain analysis. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
- Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T. Freeman. Learning the depths of moving people by watching frozen people. In Proc. Computer Vision and Pattern Recognition (CVPR), 2019.
- Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
- Beyang Liu, Stephen Gould, and Daphne Koller. Single image depth estimation from predicted semantic labels. In Proc. Computer Vision and Pattern Recognition (CVPR), 2010.
- Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. Trans. Pattern Analysis and Machine Intelligence, 2015.
- Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, Raquel Urtasun, and Jiaya Jia. GeoNet: Geometric neural network for joint depth and surface normal estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
- German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), 2016.
- Ashutosh Saxena, Min Sun, and Andrew Y. Ng. Make3D: Learning 3D scene structure from a single still image. Trans. Pattern Analysis and Machine Intelligence, 2009.
- Thomas Schops, Johannes L. Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
- Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Proc. European Conf. on Computer Vision (ECCV), 2012.
- J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proc. Int. Conf. on Intelligent Robot Systems (IROS), 2012.
- Jonas Uhrig, Nick Schneider, Lucas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In Proc. IEEE Int. Conf. on 3D Vision (3DV), 2017.
- Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z. Dai, Andrea F. Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R. Walter, and Gregory Shakhnarovich. DIODE: A dense indoor and outdoor depth dataset. arXiv:1908.00463, 2019.
- Chaoyang Wang, Simon Lucey, Federico Perazzi, and Oliver Wang. Web stereo video supervision for depth prediction from dynamic scenes. In Proc. IEEE Int. Conf. on 3D Vision (3DV), 2019.
- Lijun Wang, Xiaohui Shen, Jianming Zhang, Oliver Wang, Zhe Lin, Chih-Yao Hsieh, Sarah Kong, and Huchuan Lu. DeepLens: Shallow depth of field from a single image. ACM Trans. Graph. (Proc. SIGGRAPH Asia), 37(6):6:1–6:11, 2018.
- Jamie Watson, Michael Firman, Gabriel J. Brostow, and Daniyar Turmukhambetov. Self-supervised monocular depth hints. In Proc. Int. Conf. on Computer Vision (ICCV), 2019.
- Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Yang Xiao, Ruibo Li, and Zhenbo Luo. Monocular relative depth perception with web stereo data supervision. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
- Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
- Dan Xu, Wei Wang, Hao Tang, Hong Liu, Nicu Sebe, and Elisa Ricci. Structured attention guided convolutional neural fields for monocular depth estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), 2018.
- Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In Proc. Int. Conf. on Computer Vision (ICCV), 2019.
- Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proc. Computer Vision and Pattern Recognition (CVPR), 2017.
- Daniel Zoran, Phillip Isola, Dilip Krishnan, and William T. Freeman. Learning ordinal relationships for mid-level vision. In Proc. Int. Conf. on Computer Vision (ICCV), 2015.