We present one of the first attempts to extend Neural Architecture Search beyond image classification to dense image prediction problems
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation.
CVPR, (2019): 82-92
Recently, Neural Architecture Search (NAS) has successfully identified neural network architectures that exceed human designed ones on large-scale image classification. In this paper, we study NAS for semantic image segmentation. Existing works often focus on searching the repeatable cell structure, while hand-designing the outer network …
- Deep neural networks have proved successful across a large variety of artificial intelligence tasks, including image recognition [38, 25], speech recognition, machine translation [73, 81], etc.
- While better optimizers and better normalization techniques [32, 80] certainly played an important role, a lot of the progress comes from the design of neural network architectures.
- This holds true for both image classification [38, 72, 75, 76, 74, 25, 85, 31, 30] and dense image prediction [16, 51, 7, 64, 56, 55].
- For the outer network level (Sec. 3.2), we propose a novel search space based on observing and summarizing many popular designs
- We report our architecture search implementation details as well as the search results
- We report semantic segmentation results on benchmark datasets with our best found architecture
- We present one of the first attempts to extend Neural Architecture Search beyond image classification to dense image prediction problems
- The result of the search, Auto-DeepLab, is evaluated by training on benchmark semantic segmentation datasets from scratch
- The authors begin by introducing a continuous relaxation of the discrete architectures that exactly matches the hierarchical architecture search described above.
- Continuous Relaxation of Architectures.
- The authors reuse the continuous relaxation described in prior work on differentiable architecture search.
- Every block's output tensor $H_i^l$ is connected to all hidden states in $\mathcal{I}_i^l$:
$$H_i^l = \sum_{H_j^l \in \mathcal{I}_i^l} O_{j \to i}(H_j^l) \tag{1}$$
In addition, the authors approximate each $O_{j \to i}$ with its continuous relaxation $\bar{O}_{j \to i}$, defined as:
$$\bar{O}_{j \to i}(H_j^l) = \sum_k \alpha_{j \to i}^k \, O^k(H_j^l) \tag{2}$$
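The relaxation in Eqs. (1)-(2) can be sketched in a few lines. This is a minimal illustration using scalar stand-ins for feature maps and toy candidate operations; the function names and the toy ops are illustrative, not the paper's implementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over the architecture parameters alpha.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mixed_op(h, alphas, ops):
    """Eq. (2): the continuous relaxation of a discrete operation choice is
    a softmax-weighted sum of every candidate op applied to the input."""
    weights = softmax(alphas)
    return sum(w * op(h) for w, op in zip(weights, ops))

def block_output(hidden_states, alpha_table, ops):
    """Eq. (1): a block's output sums the (relaxed) connections coming from
    all hidden states in its input set."""
    return sum(mixed_op(h, alpha_table[j], ops)
               for j, h in enumerate(hidden_states))

# Toy example: three candidate "operations" on scalar features.
ops = [lambda x: x,        # identity / skip connection
       lambda x: 2.0 * x,  # stand-in for a convolution
       lambda x: 0.0]      # zero op (no connection)
out = block_output([1.0, 3.0], [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]], ops)
```

With a large alpha on the identity op (second input), the mixed operation concentrates almost all its weight there, which is how the discrete architecture is recovered by argmax after search.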
- [Figure: the best found cell consists of atrous and depthwise-separable convolutions with 3x3 and 5x5 kernels connecting hidden states H^{l-2}, H^{l-1}, and H^l.]
- Architecture Search Implementation Details.
- The authors evaluate the performance of the found best architecture (Fig. 3) on the Cityscapes, PASCAL VOC 2012, and ADE20K datasets
- Instead of fixating on the cell level, the authors acknowledge the importance of spatial resolution changes, and embrace the architectural variations by incorporating the network level into the search space.
- The authors develop a differentiable formulation that allows efficient architecture search over the two-level hierarchical search space.
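The network-level part of the two-level search space constrains how the spatial downsampling rate may change between consecutive layers (halve, stay, or double). The following enumeration sketch of that trellis is a hedged reading of the search space; the actual method relaxes it with continuous weights rather than enumerating paths, and the rate set and layer count here are illustrative:

```python
def network_level_paths(num_layers=4, rates=(4, 8, 16, 32), start=4):
    """Enumerate network-level architectures: each layer's downsampling
    rate must be reachable from the previous one by a factor of at most 2,
    and must stay inside the allowed set of rates."""
    paths = [[start]]
    for _ in range(num_layers - 1):
        nxt = []
        for p in paths:
            r = p[-1]
            for cand in (r // 2, r, r * 2):  # halve, keep, or double
                if cand in rates:
                    nxt.append(p + [cand])
        paths = nxt
    return paths

paths = network_level_paths(num_layers=3)  # 5 valid 3-layer paths from rate 4
```

Even for short trellises the number of valid paths grows quickly, which is one motivation for the differentiable formulation instead of exhaustive search.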
- On Cityscapes, Auto-DeepLab significantly outperforms the previous state-of-the-art by 8.6%, and performs comparably with ImageNet-pretrained top models when exploiting the coarse annotations.
- On PASCAL VOC 2012 and ADE20K, Auto-DeepLab outperforms several ImageNet-pretrained state-of-the-art models
- Table1: Comparing our work against other CNN architectures with two-level hierarchy. The main differences include: (1) we directly search the CNN architecture for semantic segmentation, (2) we search the network-level architecture as well as the cell-level one, and (3) our efficient search only requires 3 P100 GPU days
- Table2: Cityscapes validation set results with different Auto-DeepLab model variants. F : the filter multiplier controlling the model capacity. All our models are trained from scratch and with single-scale input during inference
- Table3: Cityscapes validation set results. We experiment with the effect of adopting different training iterations (500K, 1M, and 1.5M iterations) and the Scheduled Drop Path method (SDP). All models are trained from scratch
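The Scheduled Drop Path (SDP) regularizer mentioned in Table 3 ramps the drop-path probability over the course of training. A minimal scalar sketch, assuming a linear schedule and inverse-keep-probability rescaling (the paper applies this per cell branch on feature maps, not to scalars):

```python
import random

def scheduled_drop_path(value, step, total_steps,
                        final_drop_prob=0.3, training=True, rng=random):
    """Drop-path with a linearly increasing drop probability:
    0 at the start of training, final_drop_prob at the end.
    Kept values are rescaled so the expected output is unchanged."""
    if not training:
        return value  # inference: all paths kept, no rescaling
    drop_prob = final_drop_prob * step / total_steps
    if rng.random() < drop_prob:
        return 0.0  # this path is dropped for the current step
    return value / (1.0 - drop_prob)
```

The schedule lets the network train undisturbed early on and only regularizes heavily once the weights are established.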
- Table4: Cityscapes test set results with multi-scale inputs during inference. ImageNet: Models pretrained on ImageNet. Coarse: Models exploit coarse annotations
- Table5: PASCAL VOC 2012 validation set results. We experiment with the effect of adopting multi-scale inference (MS) and COCO-pretrained checkpoints (COCO). Without any pretraining, our best model (Auto-DeepLab-L) outperforms DropBlock by 20.36%. None of our models is pretrained with ImageNet images
- Table6: PASCAL VOC 2012 test set results. Our AutoDeepLab-L attains comparable performance with many state-of-the-art models which are pretrained on both ImageNet and COCO datasets. We refer readers to the official leader-board for other state-of-the-art models
- Table7: ADE20K validation set results. We employ multi-scale inputs during inference. †: Results are obtained from their respective up-to-date model zoo websites. ImageNet: Models pretrained on ImageNet. Avg: Average of mIoU and Pixel-Accuracy
- Semantic Image Segmentation. Convolutional neural networks deployed in a fully convolutional manner (FCNs [68, 51]) have achieved remarkable performance on several semantic segmentation benchmarks. Within the state-of-the-art systems, there are two essential components: the multi-scale context module and the neural network design. It has been known that context information is crucial for pixel labeling tasks [26, 70, 37, 39, 16, 54, 14, 10]. Therefore, PSPNet performs spatial pyramid pooling [21, 41, 24] at several grid scales (including image-level pooling), while DeepLab [8, 9] applies several parallel atrous convolutions [28, 20, 68, 57, 7] with different rates. On the other hand, improvements in neural network design have significantly driven the performance from AlexNet, VGG, Inception [32, 76, 74], and ResNet to more recent architectures, such as Wide ResNet, ResNeXt, DenseNet, and Xception [12, 61]. In addition to adopting those networks as backbones for semantic segmentation, one could employ encoder-decoder structures [64, 2, 55, 44, 60, 58, 33, 79, 18, 11, 87, 83], which efficiently capture long-range context information while keeping detailed object boundaries. Nevertheless, most of the models require initialization from ImageNet-pretrained checkpoints, except FRRN and GridNet for the task of semantic segmentation. Specifically, FRRN employs a two-stream system, where full-resolution information is carried in one stream and context information in the other, pooling stream. GridNet, building on top of a similar idea, contains multiple streams with different resolutions. In this work, we apply neural architecture search to network backbones specifically for semantic segmentation. We further show state-of-the-art performance without ImageNet pretraining, and significantly outperform FRRN and GridNet on Cityscapes.
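The atrous convolution used in the DeepLab branches above can be illustrated in one dimension. This is a hedged sketch of the idea only (the paper uses 2-D atrous and depthwise-separable convolutions): dilation spaces the kernel taps `rate` samples apart, enlarging the receptive field without adding parameters:

```python
def atrous_conv1d(signal, kernel, rate=1):
    """1-D atrous (dilated) convolution, valid padding: each output sample
    combines input samples spaced `rate` apart, so a k-tap kernel covers a
    receptive field of (k - 1) * rate + 1 samples."""
    span = (len(kernel) - 1) * rate
    out = []
    for i in range(len(signal) - span):
        out.append(sum(kernel[k] * signal[i + k * rate]
                       for k in range(len(kernel))))
    return out

atrous_conv1d([1, 2, 3, 4, 5], [1, 1], rate=2)  # taps 2 samples apart
```

Running several such convolutions in parallel with different rates, as DeepLab's ASPP does in 2-D, aggregates context at multiple scales from the same feature map.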
- Our light-weight model attains performance only 1.2% lower than DeepLabv3+, while requiring 76.7% fewer parameters and 4.65 times fewer Multi-Adds
- On PASCAL VOC 2012 and ADE20K, our best model also outperforms several state-of-the-art models
- Our models outperform some state-of-the-art models, including RefineNet, UPerNet, and PSPNet (ResNet-152); however, without any ImageNet pretraining, our performance lags behind the latest work
- K. Ahmed and L. Torresani. Maskconnect: Connectivity learning by gradient descent. In ECCV, 2018. 3
- V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015. 2
- B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017. 3
- S. R. Bulo, L. Porzi, and P. Kontschieder. In-place activated batchnorm for memory-optimized training of dnns. In CVPR, 2018. 2, 7
- H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. In AAAI, 2018. 3
- L.-C. Chen, M. D. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NIPS, 2018. 1, 2, 3, 7, 8
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015. 1, 2
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017. 2, 7
- L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017. 2, 4, 6, 7
- L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016. 2
- L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018. 1, 2, 7, 8
- F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017. 2
- M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 2, 3, 6, 7
- J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015. 2
- M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge a retrospective. IJCV, 2014. 2, 3, 6, 7
- C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 2013. 1, 2
- D. Fourure, R. Emonet, E. Fromont, D. Muselet, A. Tremeau, and C. Wolf. Residual conv-deconv grid network for semantic segmentation. In BMVC, 2017. 2, 3, 7
- J. Fu, J. Liu, Y. Wang, and H. Lu. Stacked deconvolutional network for semantic segmentation. arXiv:1708.04943, 2017. 2
- G. Ghiasi, T.-Y. Lin, and Q. V. Le. Dropblock: A regularization method for convolutional networks. In NIPS, 2018. 7, 8
- A. Giusti, D. Ciresan, J. Masci, L. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In ICIP, 2013. 2
- K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, 2005. 2
- K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. Lstm: A search space odyssey. arXiv:1503.04069, 2015. 3
- B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011. 7
- K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 2
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 1, 2
- X. He, R. S. Zemel, and M. Carreira-Perpindn. Multiscale conditional random fields for image labeling. In CVPR, 2004. 2
- G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012. 1
- M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, pages 289–297. Springer, 1989. 2
- A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017. 7
- J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018. 1
- G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017. 1, 2
- S. Ioffe and C. Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 1, 2, 7
- M. A. Islam, M. Rochan, N. D. Bruce, and Y. Wang. Gated feedback refinement network for dense image labeling. In CVPR, 2017. 2
- R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. In ICML, 2015. 3
- A. Kae, K. Sohn, H. Lee, and E. Learned-Miller. Augmenting crfs with boltzmann machine shape priors for image labeling. In CVPR, 2013. 3
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 1, 6
- P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. IJCV, 82(3):302–324, 2009. 2
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 1, 2
- L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Associative hierarchical crfs for object class image segmentation. In ICCV, 2009. 2
- G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. In ICLR, 2017. 7
- S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006. 2
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989. 2
- D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang. Multi-scale context intertwining for semantic segmentation. In ECCV, 2018. 8
- G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multipath refinement networks with identity mappings for highresolution semantic segmentation. In CVPR, 2017. 2, 8
- T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017. 4
- T.-Y. Lin et al. Microsoft coco: Common objects in context. In ECCV, 2014. 7
- C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In ECCV, 2018. 1, 2, 3
- H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. In ICLR, 2018. 3
- H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv:1806.09055, 2018. 1, 2, 3, 5
- W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. arXiv:1506.04579, 2015. 2, 7
- J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 1, 2
- R. Luo, F. Tian, T. Qin, and T.-Y. Liu. Neural architecture optimization. In NIPS, 2018. 3
- R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat. Evolving deep neural networks. arXiv:1703.00548, 2017. 3
- M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015. 2
- A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016. 1, 2, 4
- H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015. 1, 4
- G. Papandreou, I. Kokkinos, and P.-A. Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, 2015. 2
- C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In CVPR, 2017. 2
- H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In ICML, 2018. 2, 3
- T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Fullresolution residual networks for semantic segmentation in street scenes. In CVPR, 2017. 2, 3, 7
- H. Qi, Z. Zhang, B. Xiao, H. Hu, B. Cheng, Y. Wei, and J. Dai. Deformable convolutional networks – coco detection and segmentation challenge 2017 entry. ICCV COCO Challenge Workshop, 2017. 2
- E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv:1802.01548, 2018. 1, 2, 3
- E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and A. Kurakin. Large-scale evolution of image classifiers. In ICML, 2017. 2, 3
- O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 1, 2, 4
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015. 2, 7, 8
- M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018. 7
- S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, 2016. 3
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014. 2
- R. Shin, C. Packer, and D. Song. Differentiable neural network architecture search. In ICLR Workshop, 2018. 2, 3
- J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 2009. 2
- A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016. 4
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 1, 2
- I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014. 1
- C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2017. 1, 2
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015. 1
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016. 1, 2
- M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv:1807.11626, 2018. 3, 8
- P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. In WACV, 2018. 7
- Z. Wojna, V. Ferrari, S. Guadarrama, N. Silberman, L.-C. Chen, A. Fathi, and J. Uijlings. The devil is in the decoder. In BMVC, 2017. 2
- Y. Wu and K. He. Group normalization. In ECCV, 2018. 1
- Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016. 1
- Z. Wu, C. Shen, and A. van den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. arXiv:1611.10080, 2016. 2, 7, 8
- T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018. 2, 8
- L. Xie and A. Yuille. Genetic cnn. In ICCV, 2017. 3
- S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017. 1, 2
- S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016. 2
- Z. Zhang, X. Zhang, C. Peng, D. Cheng, and J. Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018. 2
- H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017. 2, 7, 8
- Z. Zhong, J. Yan, W. Wu, J. Shao, and C.-L. Liu. Practical block-wise neural network architecture generation. In CVPR, 2018. 3
- B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In CVPR, 2017. 2, 3, 6, 8
- Y. Zhuang, F. Yang, L. Tao, C. Ma, Z. Zhang, Y. Li, H. Jia, X. Xie, and W. Gao. Dense relation network: Learning consistent and context-aware representation for semantic image segmentation. In ICIP, 2018. 7
- B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017. 2, 3
- B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018. 1, 2, 3, 7