We proposed the first end-to-end hierarchical Neural Architecture Search framework for deep stereo matching, which incorporates task-specific human knowledge into the architecture search framework

Hierarchical Neural Architecture Search for Deep Stereo Matching

NeurIPS 2020


Abstract

To reduce human effort in neural network design, Neural Architecture Search (NAS) has been applied with remarkable success to various high-level vision tasks such as classification and semantic segmentation. The underlying idea of the NAS algorithm is straightforward, namely, to give the network the ability to choose among a set...

Introduction
  • Stereo matching attempts to find dense correspondences between a pair of rectified stereo images and estimate a dense disparity map.
  • Since MC-CNN [2], a large number of deep neural network architectures [3, 4, 5, 6] have been proposed for solving the stereo matching problem.
  • Direct regression methods are based on direct regression of dense per-pixel disparity from the input images, without taking into account the geometric constraints in stereo matching [7].
  • Although these methods are fully data-driven, recent studies raise concerns about their generalization ability.
  • For example, DispNet [3] fails the random dot stereo tests [8].
Highlights
  • Unlike previous Neural Architecture Search (NAS) algorithms, which search over only a single encoder or encoder-decoder architecture [18, 25, 20], our algorithm searches over the structure of both networks, the size of the feature maps, the size of the feature volume and the size of the output disparity
  • We propose the first end-to-end hierarchical NAS framework for deep stereo matching, which incorporates task-specific human knowledge into the architecture search framework
  • Our searched network outperforms all state-of-the-art deep stereo matching architectures and ranks first in accuracy on the KITTI stereo 2012, KITTI stereo 2015 and Middlebury benchmarks, while substantially improving on network size and inference speed
  • From traditional methods to deep-learning-based methods, researchers have kept setting new state-of-the-art results over the years
  • Deep-learning-based methods have become more popular than traditional methods because they are more accurate and faster
Methods
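  • Compared methods: GCNet [4], iResNet [5], PSMNet [11], GANet-deep [6], AANet [26], AutoDispNet [25] and LEAStereo. Reported metrics: Params [M], EPE [px], bad 1.0 [%] and Runtime [s]. The only value that survives extraction is 0.78, listed after EPE [px].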
  • KITTI benchmarks: as shown in Table 2 and on the leaderboard, LEAStereo ranks first, ahead of human-designed architectures, on the KITTI 2012 and KITTI 2015 benchmarks.
Results
  • Extensive experiments show that the searched network outperforms all state-of-the-art deep stereo matching architectures. It ranks first in accuracy on the KITTI stereo 2012, KITTI stereo 2015 and Middlebury benchmarks, and first on the SceneFlow dataset, with a substantial improvement in network size and inference speed.
Conclusion
  • LEAStereo vs. AutoDispNet: AutoDispNet [25] follows a very different network design philosophy from ours.
  • The authors' design benefits from task-specific physics and inductive bias, i.e., the gold-standard pipeline for deep stereo matching and the refined search space, and achieves full architecture search within current physical constraints.
  • In this paper, the authors propose the first end-to-end hierarchical NAS framework for deep stereo matching, which incorporates task-specific human knowledge into the architecture search framework.
  • Rather than designing a handcrafted architecture with trial and error, the authors propose to allow the network to learn a good architecture by itself in an end-to-end manner.
  • The authors' method cuts search time by more than two thirds compared with the previous method [25] while performing much better, which saves considerable energy and reduces the associated carbon footprint.
Summary
  • Objectives:

    Drawing inspiration from [18], the authors aim to find an optimal path within a pre-defined L-layer trellis, as shown in Figure 3.
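The idea of finding an optimal path within an L-layer trellis, stated under Objectives, can be illustrated with a small dynamic-programming sketch. This is not the authors' search procedure: the function name `best_trellis_path`, the per-node `scores`, and the rule that a layer may only stay at the same resolution level or move to an adjacent one are all assumptions made for illustration.

```python
from typing import List


def best_trellis_path(scores: List[List[float]], start_level: int = 0) -> List[int]:
    """Return the highest-scoring level sequence through an L-layer trellis.

    scores[l][s] is a hypothetical score for placing layer l at level s.
    Assumption: between consecutive layers the level may stay the same
    or move to an adjacent level (s - 1, s, s + 1).
    """
    num_layers = len(scores)
    num_levels = len(scores[0])
    neg_inf = float("-inf")

    # best[l][s]: best total score of any valid path ending at level s in layer l.
    best = [[neg_inf] * num_levels for _ in range(num_layers)]
    # back[l][s]: level chosen in layer l - 1 on that best path.
    back = [[-1] * num_levels for _ in range(num_layers)]
    best[0][start_level] = scores[0][start_level]

    for l in range(1, num_layers):
        for s in range(num_levels):
            for prev in (s - 1, s, s + 1):
                if 0 <= prev < num_levels and best[l - 1][prev] > neg_inf:
                    cand = best[l - 1][prev] + scores[l][s]
                    if cand > best[l][s]:
                        best[l][s] = cand
                        back[l][s] = prev

    # Backtrack from the best level in the final layer.
    level = max(range(num_levels), key=lambda s: best[-1][s])
    path = [level]
    for l in range(num_layers - 1, 0, -1):
        level = back[l][level]
        path.append(level)
    return path[::-1]


example_scores = [
    [0.5, 0.1, 0.1],
    [0.2, 0.9, 0.1],
    [0.1, 0.3, 0.8],
    [0.1, 0.7, 0.2],
]
example_path = best_trellis_path(example_scores)
```

With these made-up scores the decoded path visits levels 0, 1, 2, 1, never jumping more than one level between layers.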
Tables
  • Table 1: Quantitative results on the Scene Flow dataset. Our method achieves state-of-the-art performance with only a fraction of the parameters. Parentheses indicate that the test set was used for hyperparameter tuning
  • Table 2: Quantitative results on the KITTI 2012 and 2015 benchmarks. Bold indicates the best
  • Table 3: Quantitative results on the Middlebury 2014 benchmark. Bold indicates the best. The red number at the top right of each value indicates the actual ranking on the benchmark
  • Table 4: Ablation studies of different search strategies. The input resolution is 576 × 960, and EPE is measured on the full SceneFlow test set
  • Table 5: LEAStereo vs. AutoDispNet
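The tables report EPE [px] and bad 1.0 [%]. The page does not define them, so the sketch below assumes the commonly used definitions: EPE as the mean absolute disparity error, and bad 1.0 as the percentage of pixels whose error exceeds 1 px. The function name `disparity_metrics` is hypothetical, and validity masks for pixels without ground truth are left out.

```python
from typing import Sequence, Tuple


def disparity_metrics(
    pred: Sequence[float], gt: Sequence[float], threshold: float = 1.0
) -> Tuple[float, float]:
    """Return (EPE in pixels, percentage of pixels with error above threshold).

    Assumed definitions (not given in the source):
      EPE   = mean of |pred - gt|
      bad t = 100 * fraction of pixels with |pred - gt| > t
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    epe = sum(errors) / len(errors)
    bad = 100.0 * sum(1 for e in errors if e > threshold) / len(errors)
    return epe, bad


# Four flattened pixels: one of them is off by 3 px.
epe, bad_1 = disparity_metrics([10.0, 20.5, 33.0, 40.0], [10.0, 20.0, 30.0, 40.0])
```

Here the toy prediction gives an EPE of 0.875 px and a bad 1.0 rate of 25%.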
Related work
  • Deep Stereo Matching: MC-CNN [2] is the first deep-learning-based stereo matching method. It replaces handcrafted features with learned features and achieves better performance. DispNet [3] is the first end-to-end deep stereo matching approach. It tries to regress the disparity maps directly from stereo pairs. The overall architecture is a large U-shaped encoder-decoder network with skip connections. Since it does not leverage prior human knowledge of stereo matching, this network is entirely data-driven, requires a large amount of training data and is often hard to train. GC-Net [4] uses a 4D feature volume to mimic the first step of the conventional stereo matching pipeline and a soft-argmin process to mimic the second step. Encoding such human knowledge in the network design makes training easier while maintaining high accuracy. Similar to our work, GC-Net also consists of two sub-networks to predict disparities. GA-Net [33] proposes a semi-global aggregation layer and a local guided aggregation layer to capture the local and the whole-image cost dependencies, respectively. Generally speaking, and as alluded to earlier, designing a good structure for stereo matching is very difficult, despite the considerable effort put in by the vision community.
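The soft-argmin step attributed to GC-Net [4] can be sketched for a single pixel as follows. It takes the matching costs of that pixel over the candidate disparities, turns the negated costs into softmax weights, and returns the weighted average disparity, which is differentiable and sub-pixel. The function name `soft_argmin` and the assumption that candidates are the integers 0..D-1 are illustrative, and the sketch uses plain Python rather than the PyTorch tensors a real network would use.

```python
import math
from typing import Sequence


def soft_argmin(costs: Sequence[float]) -> float:
    """Expected disparity under softmax(-cost) over candidates 0..D-1."""
    # Shift by the minimum cost so the exponentials stay in a safe range.
    min_cost = min(costs)
    weights = [math.exp(-(c - min_cost)) for c in costs]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total


# A sharp minimum at disparity 2 yields an estimate very close to 2.
sharp = soft_argmin([10.0, 10.0, 0.0, 10.0])
# Two equally good candidates yield the value halfway between them.
tie = soft_argmin([5.0, 5.0])
```

Unlike a hard argmin, the result varies smoothly with the costs, which is what allows the disparity to be trained end to end.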
Funding
  • Yuchao Dai’s research was supported in part by the Natural Science Foundation of China (61871325, 61671387) and the National Key Research and Development Program of China under Grant 2018AAA0102803
  • Hongdong Li’s research was supported in part by the ARC Centre of Excellence for Robotics Vision (CE140100016) and ARC-Discovery (DP 190102261), ARC-LIEF (190100080) grants
  • Zongyuan Ge and Xuelian Cheng were supported by Airdoc Research Australia Centre Funding
  • Zongyuan Ge was also supported by Monash-NVIDIA joint Research Centre
Study subjects and analysis
image pairs: 20000
We use the “finalpass” version as it is more realistic. We randomly select 20,000 image pairs from the training set as our search-training-set, and another 1,000 image pairs from the training set are used as the search-validation-set, following [18]. Implementation: We implement our LEAStereo network in PyTorch
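The split described above, 20,000 randomly chosen pairs for search training and another 1,000 for search validation, can be sketched as one draw without replacement followed by a cut. The function name `split_search_sets`, the fixed `seed` and the toy ID list are assumptions for illustration; the source does not say how the random selection was implemented.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_search_sets(
    pairs: Sequence[T], n_train: int = 20_000, n_val: int = 1_000, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Draw disjoint search-training and search-validation subsets."""
    if n_train + n_val > len(pairs):
        raise ValueError("not enough image pairs for the requested split")
    rng = random.Random(seed)  # hypothetical fixed seed for repeatability
    # One sample without replacement guarantees the two subsets do not overlap.
    chosen = rng.sample(pairs, n_train + n_val)
    return chosen[:n_train], chosen[n_train:]


# Toy stand-in for the training set: integer IDs instead of image pairs.
all_pair_ids = list(range(30_000))
search_train, search_val = split_search_sets(all_pair_ids)
```

Sampling both subsets in a single draw is a simple way to make sure no validation pair is also used for search training.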

Reference
  • D. Marr and T. Poggio. Cooperative computation of stereo disparity. Science, 1976.
  • Jure Žbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res., 17(1):2287–2318, January 2016.
  • N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), June 2016.
  • Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. In Proc. IEEE Int. Conf. Comp. Vis. (ICCV), pages 66–75, 2017.
  • Zhengfa Liang, Yiliu Feng, Yulan Guo, Hengzhu Liu, Wei Chen, Linbo Qiao, Li Zhou, and Jianfeng Zhang. Learning for disparity estimation through feature constancy. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), 2018.
  • Feihu Zhang, Victor Prisacariu, Ruigang Yang, and Philip HS Torr. Ga-net: Guided aggregation net for end-to-end stereo matching. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), pages 185–194, 2019.
  • R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
  • Yiran Zhong, Hongdong Li, and Yuchao Dai. Open-world stereo video matching with deep RNN. In Proc. Eur. Conf. Comp. Vis. (ECCV), September 2018.
  • Heiko Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell., 30(2):328–341, 2007.
  • Yiran Zhong, Yuchao Dai, and Hongdong Li. Self-supervised learning for stereo matching with self-improving ability. In arXiv:1709.00930, 2017.
  • Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), pages 5410–5418, 2018.
  • Sameh Khamis, Sean Fanello, Christoph Rhemann, Adarsh Kowdle, Julien Valentin, and Shahram Izadi. Stereonet: Guided hierarchical refinement for real-time edge-aware depth prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 573–590, 2018.
  • Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In Proc. Int. Conf. Learning Representations (ICLR), 2019.
  • Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), pages 10734–10742, 2019.
  • Jiemin Fang, Yuzhu Sun, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. Densely connected search space for more flexible neural architecture search. arXiv preprint arXiv:1906.09607, 2019.
  • Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070, 2019.
  • Junran Peng, Ming Sun, ZHAO-XIANG ZHANG, Tieniu Tan, and Junjie Yan. Efficient neural architecture transformation search in channel-level for object detection. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), pages 14290–14299, 2019.
  • Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), pages 82–92, 2019.
  • Yiheng Zhang, Zhaofan Qiu, Jingen Liu, Ting Yao, Dong Liu, and Tao Mei. Customizable architecture search for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), pages 11641–11650, 2019.
  • Wuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, and Zhangyang Wang. Fasterseg: Searching for faster real-time semantic segmentation. International Conference on Learning Representation, 2020.
  • Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
  • Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), pages 8697–8710, 2018.
  • Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Feihu Zhang, Victor Prisacariu, Ruigang Yang, and Philip HS Torr. GA-Net: Guided aggregation net for end-to-end stereo matching. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), pages 185–194, 2019.
  • Tonmoy Saikia, Yassine Marrakchi, Arber Zela, Frank Hutter, and Thomas Brox. Autodispnet: Improving disparity estimation with automl. In Proc. IEEE Int. Conf. Comp. Vis. (ICCV), pages 1812–1823, 2019.
  • Haofei Xu and Juyong Zhang. Aanet: Adaptive aggregation network for efficient stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), pages 770–778, 2016.
  • Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for Autonomous Driving? The KITTI vision benchmark suite. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), 2012.
  • Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), 2015.
  • Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešic, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German conference on pattern recognition, pages 31–42.
  • Shivam Duggal, Shenlong Wang, Wei-Chiu Ma, Rui Hu, and Raquel Urtasun. DeepPruner: Learning efficient stereo matching via differentiable PatchMatch. In Proc. IEEE Int. Conf. Comp. Vis. (ICCV), Nov 2019.
  • Gengshan Yang, Joshua Manela, Michael Happold, and Deva Ramanan. Hierarchical deep stereo matching on high-resolution images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), June 2019.
  • Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hutter. Understanding and robustifying differentiable architecture search. In International Conference on Learning Representations, 2020.
  • Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), pages 4053–4061. Curran Associates, Inc., 2016.
  • Damien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Trémeau, and Christian Wolf. Residual conv-deconv grid network for semantic segmentation. In Proc. Brit. Mach. Vis. Conf. (BMVC), 2017.
  • Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), pages 8699–8710, 2018.
  • Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (CVPR), pages 9126–9135, 2019.
  • Jianyuan Wang, Yiran Zhong, Yuchao Dai, Kaihao Zhang, Pan Ji, and Hongdong Li. Displacement-invariant matching cost learning for accurate optical flow estimation. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2020.
  • Y. Dai, Z. Zhu, Z. Rao, and B. Li. MVS2: Deep unsupervised multi-view stereo with multi-view symmetry. In International Conference on 3D Vision, pages 1–8, 2019.