
3D Packing for Self-Supervised Monocular Depth Estimation

CVPR 2020, pp. 2482–2491


Abstract

Although cameras are ubiquitous, robotic platforms typically rely on active sensors like LiDAR for direct 3D perception. In this work, we propose a novel self-supervised monocular depth estimation method combining geometry with a new deep network, PackNet, learned only from unlabeled monocular videos. Our architecture leverages novel symmetrical packing and unpacking blocks to jointly learn to compress and decompress detail-preserving representations using 3D convolutions. Although self-supervised, our method outperforms other self, semi, and fully supervised methods on the KITTI benchmark. The 3D inductive bias in PackNet enables it to scale with input resolution and number of parameters without overfitting, generalizing better to out-of-domain data such as the NuScenes dataset. Furthermore, it does not require large-scale supervised pretraining on ImageNet and can run in real time. Finally, we release DDAD (Dense Depth for Automated Driving), a new urban driving dataset with more challenging and accurate depth evaluation, thanks to longer-range and denser ground-truth depth generated from high-density LiDARs mounted on a fleet of self-driving cars operating world-wide.

Introduction
  • Accurate depth estimation is a key prerequisite in many robotics tasks, including perception, navigation, and planning.
  • Going beyond image classification models like ResNet [21], the authors' main contribution is a new convolutional network architecture, called PackNet, for high-resolution self-supervised monocular depth estimation.
  • The authors' third contribution is a new dataset: Dense Depth for Automated Driving (DDAD)
  • It leverages diverse logs from a fleet of well-calibrated self-driving cars equipped with cameras and high-accuracy long-range LiDARs. Compared to existing benchmarks, DDAD enables much more accurate depth evaluation at range, which is key for high-resolution monocular depth estimation methods
Highlights
  • Accurate depth estimation is a key prerequisite in many robotics tasks, including perception, navigation, and planning
  • We propose new packing and unpacking blocks that jointly leverage 3D convolutions to learn representations that maximally propagate dense appearance and geometric information while still being able to run in real time (a minimal sketch follows this list)
  • Dense Depth for Automated Driving enables much more accurate depth evaluation at range, which is key for high-resolution monocular depth estimation methods (cf. Table 2)
  • We evaluate models trained on a combination of CityScapes and KITTI (CS+K) on the recent NuScenes dataset [5], without any fine-tuning
  • We propose a new convolutional network architecture for self-supervised monocular depth estimation: PackNet
  • Purely trained on unlabeled monocular videos, our approach outperforms other existing self- and semi-supervised methods and is even competitive with fully-supervised methods while being able to run in real time
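
The packing and unpacking blocks fold spatial resolution into channels (Space2Depth) and process the folded tensor with a 3D convolution, so downsampling compresses detail instead of discarding it, and unpacking decompresses it back. Below is a minimal PyTorch sketch of that idea, not the authors' released implementation; the r = 2 folding factor and D = 8 3D-filter count follow the paper's description, while class and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Packing(nn.Module):
    """Downsample by folding space into channels, then learn across the
    folded axis with a 3D convolution (instead of striding/pooling)."""
    def __init__(self, in_ch, out_ch, r=2, d=8, k=3):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(r)             # (C, H, W) -> (C*r^2, H/r, W/r)
        self.conv3d = nn.Conv3d(1, d, k, padding=k // 2)  # mixes the packed channel axis
        self.conv2d = nn.Conv2d(in_ch * r * r * d, out_ch, k, padding=k // 2)

    def forward(self, x):
        x = self.unshuffle(x)
        b, c, h, w = x.shape
        x = self.conv3d(x.unsqueeze(1))                   # (B, d, C*r^2, H/r, W/r)
        x = x.reshape(b, c * self.conv3d.out_channels, h, w)
        return F.elu(self.conv2d(x))

class Unpacking(nn.Module):
    """Inverse block: expand channels with 2D + 3D convolutions, then
    unfold channels back into space (Depth2Space) to upsample."""
    def __init__(self, in_ch, out_ch, r=2, d=8, k=3):
        super().__init__()
        assert (out_ch * r * r) % d == 0
        self.conv2d = nn.Conv2d(in_ch, out_ch * r * r // d, k, padding=k // 2)
        self.conv3d = nn.Conv3d(1, d, k, padding=k // 2)
        self.shuffle = nn.PixelShuffle(r)                 # (C*r^2, H, W) -> (C, H*r, W*r)

    def forward(self, x):
        x = F.elu(self.conv2d(x))
        b, c, h, w = x.shape
        x = self.conv3d(x.unsqueeze(1)).reshape(b, c * self.conv3d.out_channels, h, w)
        return self.shuffle(x)

x = torch.randn(1, 64, 96, 160)
y = Packing(64, 64)(x)        # -> (1, 64, 48, 80)
z = Unpacking(64, 64)(y)      # -> (1, 64, 96, 160)
```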
Methods
  • Depth is evaluated with the standard metrics Abs Rel, Sq Rel, RMSE, RMSElog, and δ < 1.25 (see the sketch after this list), comparing Monodepth2 (ResNet18 and ResNet50 backbones, with and without ImageNet pretraining, denoted ‡) against PackNet-SfM (Section 5.5).
  • To ensure that the gain in performance shown in the experiments is not merely due to an increase in model capacity, the authors compare different variations of the PackNet architecture against available ResNet architectures.
  • These results are depicted in Fig. 8 and show that, while the ResNet family stabilizes with diminishing returns as the number of parameters increases, the PackNet family matches its performance at around 70M parameters and further improves as more complexity is added.
  • PackNet generalizes across vehicles and countries (Germany for CS+K, USA + Singapore for NuScenes), outperforming standard ResNet50 architectures (with and without ImageNet pretraining, denoted ‡) in all considered metrics, without the need for large-scale supervised pretraining on ImageNet
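
The metric abbreviations above are the standard depth-evaluation measures of Eigen et al. [14]. As a reference, here is a minimal NumPy sketch of how they are typically computed, assuming `gt` and `pred` are matched arrays of positive depths already masked to valid ground-truth points and capped to the evaluation range:

```python
import numpy as np

def depth_metrics(gt, pred):
    # thresholded accuracy: fraction of pixels with max ratio below 1.25^n
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = (thresh < 1.25).mean()
    d2 = (thresh < 1.25 ** 2).mean()
    d3 = (thresh < 1.25 ** 3).mean()
    abs_rel = np.mean(np.abs(gt - pred) / gt)              # Abs Rel
    sq_rel = np.mean((gt - pred) ** 2 / gt)                # Sq Rel
    rmse = np.sqrt(np.mean((gt - pred) ** 2))              # RMSE
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))  # RMSElog
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3
```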
Conclusion
  • The authors propose a new convolutional network architecture for self-supervised monocular depth estimation: PackNet.
  • Purely trained on unlabeled monocular videos, the approach outperforms other existing self- and semi-supervised methods and is even competitive with fully-supervised methods while being able to run in real time.
  • It generalizes better to different datasets and unseen environments without the need for ImageNet pretraining, especially when considering longer depth ranges, as assessed up to 200m on the new DDAD dataset.
  • By leveraging only weak velocity information during training, the authors make the model scale-aware, i.e. it produces metrically accurate depth maps from a single image (a sketch of this supervision follows this list)
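
The velocity supervision mentioned above only constrains the magnitude of the pose network's predicted translation, tying it to the measured speed times the inter-frame interval so the whole depth-and-pose system becomes metric. A hedged sketch of such a loss term, with illustrative names (the paper's exact weighting is not reproduced here):

```python
import torch

def velocity_loss(pred_translation, speed, dt):
    """pred_translation: (B, 3) translation predicted by the pose network.
    speed: (B,) instantaneous velocity magnitude (e.g. from the vehicle bus).
    dt: (B,) time elapsed between the two frames."""
    # penalize mismatch between predicted translation magnitude
    # and the metric distance actually traveled
    return torch.abs(pred_translation.norm(dim=-1) - speed * dt).mean()
```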
Tables
  • Table 1: Summary of our PackNet architecture for self-supervised monocular depth estimation. The Packing and Unpacking blocks are described in Fig. 3, with kernel size K = 3 and D = 8. Conv2d blocks include GroupNorm [46] with G = 16 and ELU non-linearities [8]. InvDepth blocks include a 2D convolutional layer with K = 3 and sigmoid non-linearities (a sketch of this head follows this list). Each ResidualBlock is a sequence of three 2D convolutional layers with K = 3/3/1 and ELU non-linearities, followed by GroupNorm with G = 16 and Dropout [41] of 0.5 in the final layer. Upsample is a nearest-neighbor resizing operation. Numbers in parentheses indicate input layers, with ⊕ as channel concatenation. Bold numbers indicate the four inverse depth output scales
  • Table 2: Depth evaluation on DDAD, for 640×384 resolution and distances up to 200m. While the ResNet family heavily relies on large-scale supervised ImageNet [10] pretraining (denoted by ‡), PackNet achieves significantly better results despite being trained from scratch
  • Table 3: Quantitative performance comparison of PackNet-SfM on the KITTI dataset for distances up to 80m. For Abs Rel, Sq Rel, RMSE and RMSElog lower is better, and for δ < 1.25, δ < 1.25² and δ < 1.25³ higher is better. In the Dataset column, CS+K refers to pretraining on CityScapes (CS) and fine-tuning on KITTI (K). M refers to methods that train using monocular (M) images, and M+v refers to added velocity weak supervision (v), as shown in Section 3.2. ‡ indicates ImageNet pretraining
  • Table 4: Ablation study on the PackNet architecture, on the standard KITTI benchmark for 640×192 resolution. ResNetXX indicates that specific architecture [21] as encoder, with and without ImageNet [10] pretraining (denoted with ‡). We also show results with the proposed PackNet architecture, first without packing and unpacking (replaced respectively with convolutional striding and bilinear upsampling) and then with increasing numbers of 3D convolutional filters (D = 0 indicates no 3D convolutions and the corresponding reshape operations)
  • Table 5: Generalization capability of different depth networks, trained on both KITTI and CityScapes and evaluated on NuScenes [5], for 640×192 resolution and distances up to 80m. ‡ denotes ImageNet [10] pretraining
  • Table 6: Average Absolute Trajectory Error (ATE) in meters on the KITTI Odometry Benchmark [17]: all methods are trained on Sequences 00-08 and evaluated on Sequences 09-10. The ATE numbers are averaged over all overlapping 5-frame snippets in the test sequences. M+v refers to velocity supervision (v) in addition to monocular images (M). The GT checkmark indicates the use of ground-truth translation to scale the estimates at test-time
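
Based on the InvDepth description in Table 1, the output head is a K = 3 convolution squashed with a sigmoid to produce bounded inverse depth. A minimal sketch of such a head follows; the min/max depth bounds used to rescale the sigmoid output are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class InvDepth(nn.Module):
    """K=3 conv + sigmoid producing a bounded inverse-depth map."""
    def __init__(self, in_ch, min_depth=0.5, max_depth=200.0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)
        self.min_inv, self.max_inv = 1.0 / max_depth, 1.0 / min_depth

    def forward(self, x):
        s = torch.sigmoid(self.conv(x))                        # (B, 1, H, W) in (0, 1)
        inv_depth = self.min_inv + (self.max_inv - self.min_inv) * s
        return 1.0 / inv_depth                                 # metric depth map
```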
Related work
  • Depth estimation from a single image poses several challenges due to its ill-posed and ambiguous nature. However, modern convolutional networks have shown that it is possible to successfully leverage appearance-based patterns in large scale datasets in order to make accurate predictions.

    Depth Network Architectures. Eigen et al. [14] proposed one of the earliest works in convolutional depth estimation, using a multi-scale deep network trained on RGB-D sensor data to regress depth directly from single images. Subsequent works extended these network architectures to perform two-view stereo disparity estimation [36] using techniques developed in the flow estimation literature [13]. Following [13, 36], Ummenhofer et al. [43] applied these concepts to simultaneously train a depth and a pose network to predict depth and camera ego-motion between successive unconstrained image pairs.

    Independently, dense pixel-prediction networks [3, 32, 49] have made significant progress towards improving the flow of information between encoding and decoding layers. Fractional pooling [20] was introduced to amortize the rapid spatial reduction during downsampling. Lee et al. [30] generalized the pooling function to allow the learning of more complex patterns, including linear combinations and learnable pooling operations. Shi et al. [40] used sub-pixel convolutions to perform single-image super-resolution, synthesizing and super-resolving images beyond their input resolution while still operating at lower resolutions. Recent works [39, 52] in self-supervised monocular depth estimation use this concept to super-resolve estimates and further improve performance. Here, we go one step further and introduce new operations relying on 3D convolutions for learning to preserve and process spatial information in the features of encoding and decoding layers.
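
For contrast with packing/unpacking, the sub-pixel convolution of Shi et al. [40], which [39, 52] use to super-resolve depth estimates, can be sketched in a few lines: a convolution synthesizes r² channels per output pixel, and PixelShuffle rearranges them into an r-times larger feature map.

```python
import torch
import torch.nn as nn

def subpixel_upsample(in_ch, out_ch, r=2, k=3):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch * r * r, k, padding=k // 2),  # synthesize sub-pixels
        nn.PixelShuffle(r),                                   # -> (B, out_ch, H*r, W*r)
    )

feat = torch.randn(1, 64, 48, 80)
up = subpixel_upsample(64, 32)(feat)   # -> (1, 32, 96, 160)
```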
Contributions
  • Proposes a novel self-supervised monocular depth estimation method combining geometry with a new deep network, PackNet, learned only from unlabeled monocular videos
  • Addresses the problem of jointly estimating scene structure and camera motion across RGB image sequences using a self-supervised deep network
  • Proposes new packing and unpacking blocks that jointly leverage 3D convolutions to learn representations that maximally propagate dense appearance and geometric information while still being able to run in real time
  • Shows that, by using the instantaneous velocity of the camera during training, the authors are able to learn a scale-aware depth and pose model, alleviating the impractical need to use LiDAR ground-truth depth measurements at test time
  • Introduces PackNet as a novel depth network, and optionally includes weak velocity supervision at training time to produce scale-aware depth and pose models (the underlying self-supervised objective is sketched below)
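
Underlying all of these contributions is the usual self-supervised objective: warp a source frame into the target view using the predicted depth and ego-motion, then penalize appearance differences. A hedged sketch of the standard SSIM + L1 photometric loss (Wang et al. [45]) commonly used for this, with a simplified local SSIM and the conventional α = 0.85 weighting (an assumption, not a value quoted from this summary):

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # local means/variances via 3x3 average pooling (simplified SSIM)
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sig_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sig_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sig_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sig_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sig_x + sig_y + c2)
    return num / den

def photometric_loss(target, warped, alpha=0.85):
    """Per-pixel SSIM+L1 mismatch between the target frame and the
    source frame warped into the target view."""
    l1 = (target - warped).abs().mean(1, keepdim=True)
    ssim_term = ((1 - ssim(target, warped)) / 2).mean(1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1
```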
References
  • [1] TensorRT Python library. https://developer.nvidia.com/tensorrt. Accessed: 2019-11-09.
  • [2] Rares Ambrus, Vitor Guizilini, Jie Li, Sudeep Pillai, and Adrien Gaidon. Two stream networks for self-supervised ego-motion estimation. In Proceedings of the Conference on Robot Learning (CoRL), 2019.
  • [3] Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, and Deva Ramanan. PixelNet: Representation of the pixels, by the pixels, and for the pixels. arXiv preprint arXiv:1702.06506, 2017.
  • [4] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
  • [5] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. CoRR, 2019.
  • [6] Vincent Casser, Soeren Pirk, Reza Mahjourian, and Anelia Angelova. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In AAAI, 2019.
  • [7] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1256–1272, 2014.
  • [8] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In ICLR, 2016.
  • [9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
  • [10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [11] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In ICLR, 2017.
  • [12] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., 38(2):295–307, Feb. 2016.
  • [13] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
  • [14] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
  • [15] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [16] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
  • [17] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
  • [18] Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
  • [19] Clement Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. In ICCV, 2019.
  • [20] Benjamin Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
  • [21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [22] Jörn-Henrik Jacobsen, Arnold W. M. Smeulders, and Edouard Oyallon. i-RevNet: Deep invertible networks. In ICLR, 2018.
  • [23] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
  • [24] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.
  • [25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [26] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, 2018.
  • [27] Maria Klodt and Andrea Vedaldi. Supervising the new with the old: Learning SfM from SfM. In European Conference on Computer Vision (ECCV), 2018.
  • [28] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
  • [29] Yevhen Kuznietsov, Jorg Stuckler, and Bastian Leibe. Semi-supervised deep learning for monocular depth map prediction. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [30] Chen-Yu Lee, Patrick Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
  • [31] Kuan-Hui Lee, German Ros, Jie Li, and Adrien Gaidon. SPIGAN: Privileged adversarial learning from simulation. In ICLR, 2019.
  • [32] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • [33] C. Luo, Z. Yang, P. Wang, Y. Wang, W. Xu, R. Nevatia, and A. Yuille. Every pixel counts++: Joint learning of geometry and motion with 3D holistic understanding. arXiv preprint arXiv:1810.06125, 2018.
  • [34] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5667–5675, 2018.
  • [35] Fabian Manhardt, Wadim Kehl, and Adrien Gaidon. ROI-10D: Monocular lifting of 2D detection to 6D pose and metric shape. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [36] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
  • [37] Jeff Michels, Ashutosh Saxena, and Andrew Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In 22nd International Conference on Machine Learning, pages 593–600. ACM, 2005.
  • [38] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
  • [39] Sudeep Pillai, Rares Ambrus, and Adrien Gaidon. SuperDepth: Self-supervised, super-resolved monocular depth estimation. In IEEE International Conference on Robotics and Automation (ICRA), 2019.
  • [40] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
  • [41] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [42] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity invariant CNNs. In 3DV, 2017.
  • [43] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [44] Chaoyang Wang, Jose Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2022–2030, 2018.
  • [45] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • [46] Yuxin Wu and Kaiming He. Group normalization. In Computer Vision – ECCV 2018, pages 3–19, 2018.
  • [47] Nan Yang, Rui Wang, Jorg Stuckler, and Daniel Cremers. Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. arXiv preprint arXiv:1807.02570, 2018.
  • [48] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [49] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [50] Hao Zhang and Jianwei Ma. Hartley spectral pooling for deep learning. Computing Research Repository, abs/1810.04028, 2018.
  • [51] Junsheng Zhou, Yuwang Wang, Naiyan Wang, and Wenjun Zeng. Unsupervised high-resolution depth learning from videos with dual networks. In International Conference on Computer Vision (ICCV), 2019.
  • [52] Lipu Zhou, Jiamin Ye, Montiel Abello, Shengze Wang, and Michael Kaess. Unsupervised learning of monocular depth estimation with bundle adjustment, super-resolution and clip loss. arXiv preprint arXiv:1812.03368, 2018.
  • [53] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
  • [54] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In ECCV, 2018.