Semantically-Guided Representation Learning for Self-Supervised Monocular Depth

ICLR, 2020.

TL;DR:
This paper introduces a novel architecture for self-supervised monocular depth estimation that leverages semantic information from a fixed pretrained network to guide the generation of multi-level depth features via pixel-adaptive convolutions

Abstract:

Self-supervised learning is showing great promise for monocular depth estimation, using geometry as the only source of supervision. Depth networks are indeed capable of learning representations that relate visual appearance to 3D properties by implicitly leveraging category-level patterns. In this work we investigate how to leverage more ...
Introduction
  • Accurate depth estimation is a key problem in computer vision and robotics, as it is instrumental for perception, navigation, and planning.
  • Although depth from a single image is an ill-posed inverse problem, monocular depth networks are able to make accurate predictions by learning representations that connect the appearance of scenes and objects with their geometry in Euclidean 3D space.
  • Current depth estimation methods either do not leverage this structure explicitly or rely on strong semantic supervision to jointly optimize geometric consistency and a semantic proxy task in a multi-task objective (Ochs et al, 2019; Chen et al, 2019), departing from the self-supervised paradigm
Highlights
  • Accurate depth estimation is a key problem in computer vision and robotics, as it is instrumental for perception, navigation, and planning
  • Our method improves upon the state of the art in self-supervised monocular depth estimation on the standard KITTI benchmark (Geiger et al, 2013), on average over pixels, on average over classes, and for dynamic categories in particular
  • In Figure 5 we present qualitative results showing the improvements in depth estimation generated by our proposed framework, compared to our baseline
  • This paper introduces a novel architecture for self-supervised monocular depth estimation that leverages semantic information from a fixed pretrained network to guide the generation of multi-level depth features via pixel-adaptive convolutions (a minimal illustrative sketch of this mechanism follows this list)
  • Our experiments on challenging real-world data show that our proposed architecture consistently improves the performance of different monodepth architectures, establishing a new state of the art in self-supervised monocular depth estimation
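The semantic guidance named above is based on pixel-adaptive convolutions (Su et al, 2019). Below is a minimal, hedged PyTorch sketch of such a layer, written for this summary rather than taken from the authors' code: the spatially shared convolution weights are modulated per pixel by a Gaussian kernel computed on guidance (semantic) features. Class and variable names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAdaptiveConv2d(nn.Module):
    """Sketch of a pixel-adaptive convolution (Su et al, 2019): a shared
    convolution whose filter is modulated per pixel by a Gaussian kernel
    computed on guidance features (here, semantic features)."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.pad = kernel_size // 2
        self.weight = nn.Parameter(0.01 * torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x, guidance):
        # x:        (B, C, H, W) depth features
        # guidance: (B, G, H, W) semantic features at the same resolution
        B, C, H, W = x.shape
        G = guidance.shape[1]
        k2 = self.k * self.k

        # Gather k x k neighborhoods of the input and of the guidance.
        x_unf = F.unfold(x, self.k, padding=self.pad).view(B, C, k2, H * W)
        g_unf = F.unfold(guidance, self.k, padding=self.pad).view(B, G, k2, H * W)
        g_center = guidance.view(B, G, 1, H * W)

        # Fixed Gaussian kernel K(f_i, f_j) on guidance-feature differences.
        kernel = torch.exp(-0.5 * ((g_unf - g_center) ** 2).sum(dim=1, keepdim=True))

        # Modulate each neighborhood, then apply the shared filter weights.
        out = torch.einsum('bckp,ock->bop', x_unf * kernel, self.weight.flatten(2))
        return out.view(B, -1, H, W) + self.bias.view(1, -1, 1, 1)
```

In the architecture described above, layers of this kind would replace standard convolutions in the depth network, with `guidance` taken from the fixed semantic network's features at the corresponding resolution; the class here is only a simplified stand-in for that design.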
Results
  • 5.1 DATASETS

    We use the standard KITTI benchmark (Geiger et al, 2013) for self-supervised training and evaluation.
  • We adopt the training, validation and test splits used in Eigen et al (2014) with the pre-processing from Zhou et al (2017) to remove static frames, which is better suited to self-supervised training from monocular videos (a sketch of the standard evaluation metrics on this split follows this list)
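For reference, here is a hedged NumPy sketch of the depth metrics commonly reported on this split (Abs Rel, Sq Rel, RMSE, RMSE log, and the delta < 1.25 accuracies), including the 80 m depth cap used in Table 1 and per-image median scaling for scale-ambiguous monocular predictions. Exact crops and validity masks vary between papers, so this is the generic protocol rather than the paper's exact evaluation code.

```python
import numpy as np

def evaluate_depth(pred, gt, max_depth=80.0, min_depth=1e-3):
    """Standard per-image depth metrics on the KITTI Eigen split (sketch)."""
    # Evaluate only on valid ground-truth pixels, capped at 80 m.
    valid = (gt > min_depth) & (gt < max_depth)
    pred, gt = pred[valid], gt[valid]

    # Median scaling: self-supervised monocular depth is defined up to scale.
    pred = pred * (np.median(gt) / np.median(pred))
    pred = np.clip(pred, min_depth, max_depth)

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean(((pred - gt) ** 2) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))

    # Threshold accuracies: fraction of pixels with max(pred/gt, gt/pred) < 1.25^n.
    ratio = np.maximum(pred / gt, gt / pred)
    a1, a2, a3 = (np.mean(ratio < 1.25 ** n) for n in (1, 2, 3))
    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse,
                rmse_log=rmse_log, a1=a1, a2=a2, a3=a3)
```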
Conclusion
  • This paper introduces a novel architecture for self-supervised monocular depth estimation that leverages semantic information from a fixed pretrained network to guide the generation of multi-level depth features via pixel-adaptive convolutions.
  • Our monodepth network learns semantic-aware geometric representations that can disambiguate photometric ambiguities in a self-supervised structure-from-motion learning context.
  • Our experiments on challenging real-world data show that our proposed architecture consistently improves the performance of different monodepth architectures, establishing a new state of the art in self-supervised monocular depth estimation.
  • Future directions of research include leveraging other sources of guidance, as well as avenues for self-supervised fine-tuning of the semantic network.
Tables
  • Table1: Quantitative performance comparison of our proposed architecture on KITTI for depths up to 80m. M refers to methods that train using monocular images, S refers to methods that train using stereo pairs, D refers to methods that use ground-truth depth supervision, Sem refers to methods that include semantic information, and Inst refers to methods that include semantic and instance information. MR indicates 640 x 192 input images, and HR indicates 1280 x 384 input images. Our proposed architecture is able to further improve the current state of the art in self-supervised monocular depth estimation, and outperforms other methods that exploit semantic information (including ground truth labels) by a substantial margin
  • Table2: Ablative analysis of our semantic guidance (SEM) and two-stage-training (TST) contributions. The last column indicates the class-average Abs. Rel. obtained by averaging all class-specific depth errors in Figure 4, while the other columns indicate pixel-average metrics (a sketch of this class-average computation follows this list)
  • Table3: Analysis of the impact of pre-training the semantic segmentation network. In the PreTrain column, I indicates ImageNet (Deng et al, 2009) pretraining and CS indicates CityScapes (Cordts et al, 2016) pretraining, with 1/2 indicating the use of only half the dataset (samples chosen randomly). In the Fine-Tune column, D indicates fine-tuning the depth network and S indicates fine-tuning the semantic network (note that this is a self-supervised fine-tuning for the depth task, using the objective described in Section 3.1)
  • Table4: Generalization capability of different networks, trained on both KITTI and CityScapes datasets and evaluated on the nuScenes (Caesar et al, 2019) dataset. Our proposed semantically-guided architecture is able to further improve upon the baseline from Guizilini et al (2019), which only used unlabeled image sequences for self-supervised depth training
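To clarify the last column of Table 2, the snippet below sketches how a class-average Abs. Rel. can be computed from per-class errors, so that small (often dynamic) categories are not dominated by road or building pixels. Function and argument names are illustrative and not taken from the paper's code.

```python
import numpy as np

def class_average_abs_rel(pred, gt, sem_labels, class_ids):
    """Average the per-class Abs. Rel. values over semantic classes (sketch)."""
    per_class = []
    for c in class_ids:
        # Restrict the error to pixels of class c with valid ground truth.
        mask = (sem_labels == c) & (gt > 0)
        if mask.any():
            per_class.append(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))
    return float(np.mean(per_class)) if per_class else float('nan')
```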
Related work
  • Since the seminal work of Eigen et al (2014), substantial progress has been made in improving the accuracy of supervised depth estimation from monocular images, including the use of Conditional Random Fields (CRFs) (Li et al, 2015), joint optimization of surface normals (Qi et al, 2018), fusion of multiple depth maps (Lee et al, 2018), and ordinal classification (Fu et al, 2018). As supervised techniques advanced rapidly, however, the availability of large-scale depth labels became a bottleneck, especially for outdoor applications.
  • Garg et al (2016) and Godard et al (2017) proposed an alternative self-supervised strategy involving stereo cameras, where Spatial Transformer Networks (Jaderberg et al, 2015) are used to geometrically warp, in a differentiable way, the right image into a synthesized left image, using the depth predicted from the left image. The photometric consistency loss between the synthesized and original left images can then be minimized end-to-end using a Structural Similarity term (Wang et al, 2004) and additional depth regularization terms. Following Godard et al (2017) and Ummenhofer et al (2017), Zhou et al (2017) generalized this approach to the purely monocular setting, where a depth network and a pose network are learned jointly from unlabeled monocular videos. Rapid progress in architectures and objective functions (Yin & Shi, 2018; Mahjourian et al, 2018; Casser et al, 2019; Zou et al, 2018; Klodt & Vedaldi, 2018; Wang et al, 2018; Yang et al, 2018) has since turned monocular depth estimation into one of the most successful applications of self-supervised learning, even outperforming supervised methods (Guizilini et al, 2019). A minimal sketch of this photometric objective follows this section.
  • The introduction of semantic information to improve depth estimates has been explored in prior work and can be broadly divided into two categories. The first uses semantic (or instance) information to mask out or properly model dynamic portions of the image, which are not accounted for in the photometric loss calculation. Guney & Geiger (2015) leveraged object knowledge in a Markov Random Field (MRF) to resolve stereo ambiguities, while Bai et al (2016) used a conjunction of instance-level segmentation and epipolar constraints to reduce uncertainty in optical flow estimation. Casser et al (2019) used instance-level masks to estimate motion models for different objects in the environment and to account for their external motion in the resulting warped image. The second category attempts to learn both tasks in a single framework, using consistency losses to ensure that they are optimized simultaneously and regularize each other, so the information contained in one task can be transferred to improve the other. For instance, Ochs et al (2019) estimated depth with an ordinal classification loss similar to the standard semantic classification loss, and used empirical weighting to combine the two into a single optimization objective. Similarly, Chen et al (2019) used a unified conditional decoder that can generate either semantic or depth estimates, with both outputs producing a series of losses that are again combined through empirical weighting into the final objective.
  • Our approach focuses instead on representation learning, injecting semantic features into the self-supervised depth network by using a pretrained semantic segmentation network to guide the generation of depth features. This is done using pixel-adaptive convolutions, recently proposed by Su et al (2019) and applied to tasks such as depth upsampling with RGB images as feature guidance. We show that different depth networks can be readily modified to leverage this semantic feature guidance, ranging from widely used ResNets (He et al, 2016) to the current state-of-the-art PackNet (Guizilini et al, 2019), with a consistent gain in performance across these architectures.
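For completeness, here is a minimal sketch of the SSIM + L1 photometric objective that the self-supervised methods above minimize between a target frame and a source frame warped into the target view using predicted depth and ego-motion. This is a generic formulation following common practice (Wang et al, 2004; Godard et al, 2017), not the paper's exact implementation; the weighting `alpha` and the 3x3 averaging window are typical choices, not values reported here.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=1e-4, c2=9e-4):
    """Simplified per-pixel SSIM (Wang et al, 2004) with 3x3 average pooling."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def photometric_loss(target, warped, alpha=0.85):
    """SSIM + L1 photometric consistency between the target frame and the
    source frame warped into the target view (a generic formulation)."""
    l1 = torch.abs(target - warped).mean(1, keepdim=True)
    ssim_term = torch.clamp((1 - ssim(target, warped)) / 2, 0, 1).mean(1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1
```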
References
  • Min Bai, Wenjie Luo, Kaustav Kundu, and Raquel Urtasun. Exploiting semantic information and deep matching for optical flow. In ECCV, 2016.
  • Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. CoRR, 2019.
  • Vincent Casser, Soeren Pirk, Reza Mahjourian, and Anelia Angelova. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In AAAI, 2019.
  • Po-Yi Chen, Alexander H. Liu, Yen-Cheng Liu, and Yu-Chiang Frank Wang. Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. In CVPR, 2019.
  • Djork-Arne Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In ICLR, 2016.
  • Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223, 2016.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
  • Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, pp. 2002–2011, 2018.
  • Ravi Garg, Vijay Kumar B. G., Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV. Springer, 2016.
  • Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
  • Clement Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth prediction. arXiv:1806.01260, 2018.
  • Vitor Guizilini, Sudeep Pillai, Rares Ambrus, and Adrien Gaidon. PackNet-SfM: 3D packing for self-supervised monocular depth estimation. arXiv preprint arXiv:1905.02693, 2019.
  • Fatma Guney and Andreas Geiger. Displets: Resolving stereo ambiguities using object knowledge. In CVPR, 2015.
  • Xiaoyang Guo, Hongsheng Li, Shuai Yi, Jimmy Ren, and Xiaogang Wang. Learning monocular depth by distilling cross-domain stereo networks. In ECCV, pp. 484–500, 2018.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
  • Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, pp. 2017–2025, 2015.
  • Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pp. 7482–7491, 2018.
  • Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollar. Panoptic feature pyramid networks. In CVPR, pp. 6399–6408, 2019.
  • Maria Klodt and Andrea Vedaldi. Supervising the new with the old: Learning SfM from SfM. In ECCV. Springer, 2018.
  • Jae-Han Lee, Minhyeok Heo, Kyung-Rae Kim, and Chang-Su Kim. Single-image depth estimation based on Fourier domain analysis. In CVPR, pp. 330–339, 2018.
  • Kuan-Hui Lee, German Ros, Jie Li, and Adrien Gaidon. SPIGAN: Privileged adversarial learning from simulation. In ICLR, 2019.
  • Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In CVPR, pp. 1119–1127, 2015.
  • Jie Li, Allan Raventos, Arjun Bhargava, Takaaki Tagawa, and Adrien Gaidon. Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192, 2018.
  • Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE TPAMI, 40(12):2935–2947, 2017.
  • Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pp. 2117–2125, 2017.
  • Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In CVPR, pp. 5667–5675, 2018.
  • Fabian Manhardt, Wadim Kehl, and Adrien Gaidon. ROI-10D: Monocular lifting of 2D detection to 6D pose and metric shape. In CVPR, 2019.
  • Jeff Michels, Ashutosh Saxena, and Andrew Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In ICML, pp. 593–600. ACM, 2005.
  • Matthias Ochs, Adrian Kretz, and Rudolf Mester. SDNet: Semantically guided depth estimation network. arXiv:1907.10659, 2019.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Workshops, 2017.
  • Sudeep Pillai, Rares Ambrus, and Adrien Gaidon. SuperDepth: Self-supervised, super-resolved monocular depth estimation. In ICRA, 2019.
  • Lorenzo Porzi, Samuel Rota Bulo, Aleksander Colovic, and Peter Kontschieder. Seamless scene segmentation. In CVPR, pp. 8277–8286, 2019.
  • Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, Raquel Urtasun, and Jiaya Jia. GeoNet: Geometric neural network for joint depth and surface normal estimation. In CVPR, pp. 283–291, 2018.
  • Hang Su, Varun Jampani, Deqing Sun, Orazio Gallo, Erik Learned-Miller, and Jan Kautz. Pixel-adaptive convolutional neural networks. In CVPR, 2019.
  • Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In CVPR, 2017.
  • Chaoyang Wang, Jose Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In CVPR, pp. 2022–2030, 2018.
  • Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 2004.
  • Yuxin Wu and Kaiming He. Group normalization. In ECCV, pp. 3–19, 2018.
  • Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. UPSNet: A unified panoptic segmentation network. In CVPR, pp. 8818–8826, 2019.
  • Nan Yang, Rui Wang, Jorg Stuckler, and Daniel Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. arXiv:1807.02570, 2018.
  • Zhenheng Yang, Peng Wang, Wei Xu, Liang Zhao, and Ramakant Nevatia. Unsupervised learning of geometry with edge-aware depth-normal consistency. arXiv preprint arXiv:1711.03665, 2017.
  • Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In CVPR, 2018.
  • Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, pp. 340–349, 2018.
  • Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
  • Yuliang Zou, Zelun Luo, and Jia-Bin Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In ECCV, 2018.
  • In the first two rows of Table 3, an untrained semantic network is utilized, with only its encoder initialized from ImageNet (Deng et al., 2009) weights. Two different scenarios are explored: in the first one (D) only the depth network is fine-tuned in a self-supervised fashion, while in D+S both networks are fine-tuned together in the same way. As expected, using untrained features as guidance leads to significantly worse results, since there is no structure encoded in the secondary network and the primary network needs to learn to filter out all this spurious information. When both networks are fine-tuned simultaneously, results improve because the added complexity of the secondary network can now be leveraged for the task of depth estimation; however, there is still no improvement over the baseline.
  • Next, the semantic network was pre-trained on only half of the CityScapes (Cordts et al., 2016) dataset (samples chosen randomly), leading to worse semantic segmentation performance (validation mIoU of around 70% vs. 75% for the fully trained one). This partial pre-training stage was enough to enable the transfer of useful information between networks, leading to improvements over the baseline. Interestingly, fine-tuning both networks for the task of depth estimation actually hurt performance this time, which we attribute to forgetting the information contained in the secondary network, as both networks are optimized for the depth task. When the semantic network is pre-trained with all of CityScapes (last two rows), these effects are magnified, with fine-tuning only the depth network leading to our best reported performance (Table 1) and fine-tuning both networks again leading to results similar to the baseline (a minimal sketch of the corresponding fine-tuning configurations is given below).
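The configurations discussed above can be summarized with a small, hypothetical PyTorch snippet: configuration D freezes the pretrained semantic network and fine-tunes only the depth network self-supervisedly, while D+S optimizes both. Module names, the placeholder networks, and the learning rate are assumptions for illustration, not the authors' settings.

```python
import torch
import torch.nn as nn

# Placeholder networks standing in for the depth network and the pretrained
# semantic segmentation network (illustrative only).
depth_net = nn.Conv2d(3, 1, 3, padding=1)
sem_net = nn.Conv2d(3, 19, 3, padding=1)

# Configuration "D": keep the semantic network fixed so its pretrained features
# are not forgotten, and fine-tune only the depth network self-supervisedly.
for p in sem_net.parameters():
    p.requires_grad = False
sem_net.eval()  # also freezes normalization statistics during fine-tuning

optimizer = torch.optim.Adam(depth_net.parameters(), lr=2e-4)

# Configuration "D+S" would instead optimize both parameter sets, which the
# ablation above found can erase useful semantic structure:
# import itertools
# optimizer = torch.optim.Adam(
#     itertools.chain(depth_net.parameters(), sem_net.parameters()), lr=2e-4)
```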