SDC-Depth: Semantic Divide-and-Conquer Network for Monocular Depth Estimation

CVPR 2020, pp. 538–547

Abstract

Monocular depth estimation is an ill-posed problem, and as such critically relies on scene priors and semantics. Due to its complexity, we propose a deep neural network model based on a semantic divide-and-conquer approach. Our model decomposes a scene into semantic segments, such as object instances and background stuff classes, and then…

Introduction
  • Depth estimation is an important component of 3D perception. Compared to reconstruction techniques based on active sensors or multi-view geometry, monocular depth estimation is significantly more ill-posed and relies critically on learning strong scene priors and semantics.

    Recent works studying this problem [4, 14, 39] have achieved significant progress using deep convolutional neural networks (CNNs) supervised by depth data, showing that they are able to capture complex high-level scene semantics.
Highlights
  • Depth estimation is an important component of 3D perception
  • We propose a Semantic Divide-and-Conquer Network (SDC-Depth Net) for monocular depth estimation
  • We present SDC-Depth Net, an end-to-end trainable depth prediction network based on the aforementioned Semantic Divide-and-Conquer strategy
  • The depth estimation module infers a category-specific depth map in a canonical space, as well as scale and shift parameters based on the global context (see the sketch after this list)
  • The fully convolutional network performs semantic segmentation for C categories, where the first K categories are object classes and the rest belong to stuff classes
  • We present a semantic divide-and-conquer strategy to reduce monocular depth estimation into that of individual semantic segments
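As a rough illustration only (not the authors' released code), the following sketch shows how per-category canonical depths, together with predicted scale and shift parameters, could be combined through soft segmentation masks into a single depth map; all tensor names and shapes here are our own assumptions:

    import torch

    def assemble_depth(masks, canon_depths, scales, shifts):
        # masks:         (B, C, H, W) soft segmentation masks over C categories
        # canon_depths:  (B, C, H, W) scale-and-shift invariant depth per category
        # scales, shifts:(B, C) per-segment transform inferred from global context
        s = scales.unsqueeze(-1).unsqueeze(-1)           # (B, C, 1, 1)
        t = shifts.unsqueeze(-1).unsqueeze(-1)           # (B, C, 1, 1)
        segment_depths = s * canon_depths + t            # re-anchor each segment in metric space
        # Normalize the masks into per-pixel weights and sum over categories.
        weights = masks / masks.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return (weights * segment_depths).sum(dim=1)     # (B, H, W) global depth

With the COCO super-class setup described in the Results section (C = 27 categories), assemble_depth would fuse 27 category-specific depth maps into one global prediction.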
Methods
  • [Figure: per-image comparison with Laina et al. [14], Xu et al. [39], and Zhang et al. [41] in terms of error (RMSE, Abs Rel) and accuracy (δ < 1.25²), indicating where ours is worse or better.]
  • The authors compare the method against three state-of-the-art approaches, including Chen et al. [2], Xian et al. [37], and Xu et al. [39], where Xu et al. [39] is trained on both the DIW and COCO datasets using the same training strategy as theirs.
  • The authors' proposed method adopts a divide-and-conquer strategy that estimates depth for each segment independently, and achieves the best performance (the standard metrics used above are sketched below).
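For reference, the evaluation measures named above are the standard monocular-depth metrics; a minimal NumPy sketch (the function name is ours, and masking of invalid pixels is omitted for brevity):

    import numpy as np

    def depth_metrics(pred, gt):
        # pred, gt: positive depth arrays of identical shape.
        rmse = np.sqrt(np.mean((pred - gt) ** 2))            # RMSE
        abs_rel = np.mean(np.abs(pred - gt) / gt)            # Abs Rel
        ratio = np.maximum(pred / gt, gt / pred)
        acc = {k: np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)}  # accuracy delta < 1.25^k
        return rmse, abs_rel, acc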
Results
  • Category- and instance-specific depth: for each semantic segment, the depth estimation module infers a category-specific depth map in a canonical space, as well as scale and shift parameters based on the global context.
  • Cityscapes [3] is a large dataset for urban scene understanding, containing both depth and panoptic segmentation annotations of 20 semantic categories.
  • Since the DIW dataset does not contain segmentation annotations and the COCO panoptic segmentation dataset [23] contains images of unconstrained scenes, the authors simultaneously train the model on DIW and COCO for relative depth estimation and segmentation, respectively.
  • To reduce computational complexity, the authors adopt the super-class annotations of the COCO dataset to train the segmentation module, covering 15 stuff and 12 object classes.
  • The authors sequentially feed training images from both datasets to the network in each iteration (a minimal sketch of this alternating scheme follows this list).
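A minimal sketch of such an alternating scheme, under stated assumptions (the loader, loss, and model interfaces below are hypothetical, not the authors' code):

    import itertools

    def train_alternating(model, diw_loader, coco_loader, optimizer,
                          relative_depth_loss, segmentation_loss, num_iters):
        diw_batches = itertools.cycle(diw_loader)    # DIW: ordinal relative-depth labels
        coco_batches = itertools.cycle(coco_loader)  # COCO: 27 super-class segmentation labels
        for _ in range(num_iters):
            # One DIW batch supervises the depth branch...
            images, ordinal_pairs = next(diw_batches)
            loss = relative_depth_loss(model(images)["depth"], ordinal_pairs)
            # ...and one COCO batch supervises the segmentation branch.
            images, seg_labels = next(coco_batches)
            loss = loss + segmentation_loss(model(images)["segmentation"], seg_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()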
Conclusion
  • The authors present a semantic divide-and-conquer strategy to reduce monocular depth estimation into that of individual semantic segments.
  • Based on this idea, the SDC-Depth Net is designed, which decomposes an input image into segments of different categories and instances, and infers the canonical depth as well as the scale-and-shift transformation for each segment using trained parameters.
  • Experiments on three popular benchmarks demonstrate the effectiveness of the method
Tables
  • Table 1: Comparison with state-of-the-art methods on the Cityscapes test set [3]. Best results are in bold font, second best are underlined
  • Table 2: Comparison with state-of-the-art methods on the DIW dataset [2]. The best result is in bold font
  • Table 3: Comparison with state-of-the-art methods on the NYU-Depth V2 dataset [32]
  • Table 4: Ablation study on the Cityscapes dataset [3]. Components tested are category (Cat.) and instance (Ins.) depth estimation, and disentangling canonical depth and scale inference (DEnt). The best results are in bold font
Related Work
  • Single Image Depth Prediction. There has been a long history of methods that have attempted to predict depth from a single image [11, 31, 24, 26]. Recently, monocular depth estimation has gained popularity due to the ability of CNNs to learn strong priors relating images to geometric layout. Among others, Laina et al. [14] propose a fully convolutional architecture with up-projection blocks to handle high-dimensional depth regression. In [19], a two-stream convolutional network is proposed, which simultaneously predicts depth and depth gradients to preserve more depth details. Besides using deep networks alone, recent works have shown that the combination of deep networks and shallow models [18, 25, 36, 40, 30] can also deliver superior depth estimation performance. Meanwhile, different forms of supervision and learning techniques have also been explored in recent works to improve the generalization ability of depth estimation models, including self-supervised learning with photometric losses from stereo images [6, 8] or multiple views [43, 34, 7], transfer learning using synthetic images [42, 1], and those using sparse [2, 37] or dense [21, 35, 33, 20] relative depth as supervision.
Funding
  • This work is supported by the National Key R&D Program of China (2018AAA0102001), National Natural Science Foundation of China (61725202, U1903215, 61829102, 91538201, 61751212, 61906031), Fundamental Research Funds for the Central Universities (DUT19GJ201), China Postdoctoral Science Foundation (2019M661095), National Postdoctoral Program for Innovative Talent (BX20190055), and Adobe Research.
References
  • Amir Atapour-Abarghouei and Toby P. Breckon. Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2800–2810, 2018.
  • Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In Advances in Neural Information Processing Systems, pages 730–738, 2016.
  • Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
  • David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
  • Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018.
  • Ravi Garg, Vijay Kumar B G, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.
  • Clement Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3828–3838, 2019.
  • Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279, 2017.
  • Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Derek Hoiem, Alexei A. Efros, and Martial Hebert. Geometric context from a single image. In Tenth IEEE International Conference on Computer Vision (ICCV'05), volume 1, pages 654–661. IEEE, 2005.
  • Jianbo Jiao, Ying Cao, Yibing Song, and Rynson Lau. Look deeper into depth: Monocular depth estimation with semantic booster and attention-driven loss. In Proceedings of the European Conference on Computer Vision (ECCV), pages 53–69, 2018.
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth International Conference on 3D Vision (3DV), pages 239–248. IEEE, 2016.
  • Jae-Han Lee, Minhyeok Heo, Kyung-Rae Kim, and Chang-Su Kim. Single-image depth estimation based on Fourier domain analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 330–339, 2018.
  • Jae-Han Lee and Chang-Su Kim. Monocular depth estimation using relative depth maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2019.
  • Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, 2001.
  • Bo Li, Chunhua Shen, Yuchao Dai, Anton Van Den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1119–1127, 2015.
  • Jun Li, Reinhard Klein, and Angela Yao. A two-streamed network for estimating fine-scaled depth maps from single RGB images. In Proceedings of the IEEE International Conference on Computer Vision, pages 3372–3380, 2017.
  • Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T. Freeman. Learning the depths of moving people by watching frozen people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4521–4530, 2019.
  • Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
  • Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  • Beyang Liu, Stephen Gould, and Daphne Koller. Single image depth estimation from predicted semantic labels. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1253–1260. IEEE, 2010.
  • Fayao Liu, Chunhua Shen, and Guosheng Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5162–5170, 2015.
  • Miaomiao Liu, Mathieu Salzmann, and Xuming He. Discrete-continuous depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 716–723, 2014.
  • Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • Yue Meng, Yongxi Lu, Aman Raj, Samuel Sunarjo, Rui Guo, Tara Javidi, Gaurav Bansal, and Dinesh Bharadia. SIGNet: Semantic instance aided unsupervised 3D geometry perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9810–9820, 2019.
  • Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, Raquel Urtasun, and Jiaya Jia. GeoNet: Geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 283–291, 2018.
  • Anirban Roy and Sinisa Todorovic. Monocular depth estimation using neural regression forest. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5506–5514, 2016.
  • Ashutosh Saxena, Min Sun, and Andrew Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2008.
  • Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.
  • Chaoyang Wang, Simon Lucey, Federico Perazzi, and Oliver Wang. Web stereo video supervision for depth prediction from dynamic scenes, 2019.
  • Chaoyang Wang, Jose Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2022–2030, 2018.
  • Lijun Wang, Xiaohui Shen, Jianming Zhang, Oliver Wang, Zhe L. Lin, Chih-Yao Hsieh, Sarah Kong, and Huchuan Lu. DeepLens: Shallow depth of field from a single image. ACM Transactions on Graphics, 37(6):245:1–245:11, 2018.
  • Peng Wang, Xiaohui Shen, Zhe Lin, Scott Cohen, Brian Price, and Alan L. Yuille. Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2800–2809, 2015.
  • Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, Yang Xiao, Ruibo Li, and Zhenbo Luo. Monocular relative depth perception with web stereo data supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 311–320, 2018.
  • Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. UPSNet: A unified panoptic segmentation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8818–8826, 2019.
  • Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 675–684, 2018.
  • Dan Xu, Wei Wang, Hao Tang, Hong Liu, Nicu Sebe, and Elisa Ricci. Structured attention guided convolutional neural fields for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3917–3925, 2018.
  • Zhenyu Zhang, Zhen Cui, Chunyan Xu, Zequn Jie, Xiang Li, and Jian Yang. Joint task-recursive learning for semantic segmentation and depth estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 235–251, 2018.
  • Chuanxia Zheng, Tat-Jen Cham, and Jianfei Cai. T2Net: Synthetic-to-realistic translation for solving single-image depth estimation tasks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 767–783, 2018.
  • Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1851–1858, 2017.