Self-Ensembling Attention Networks: Addressing Domain Shift for Semantic Segmentation

National Conference on Artificial Intelligence (AAAI), 2019.

Abstract:

Recent years have witnessed the great success of deep learning models in semantic segmentation. Nevertheless, these models may not generalize well to unseen image domains due to the phenomenon of domain shift. Since pixel-level annotations are laborious to collect, developing algorithms which can adapt labeled data from the source domain to the target domain…

Introduction
  • Semantic segmentation is a fundamental task in computer vision, which assigns a class label to each pixel of a given image (Tao et al 2017).
  • Given a series of networks trained on source-domain data, the ensemble of their predictions on target-domain images is likely to be closer to the ground truth than the prediction of any single network.
  • The target-domain images are sent to the teacher network to obtain domain-invariant segmentation maps, as shown in Figure 1(b); see the teacher-update sketch after this list.
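In self-ensembling approaches of this kind (the reference list below includes the mean teacher of Tarvainen and Valpola 2017), the teacher is typically maintained as an exponential moving average (EMA) of the student's weights, so it behaves like an ensemble of past student networks. Below is a minimal PyTorch-style sketch of that update under this assumption; `student`, `teacher`, and the decay `alpha` are illustrative names, not the authors' code.

```python
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module,
                   teacher: torch.nn.Module,
                   alpha: float = 0.99) -> None:
    """EMA update: teacher <- alpha * teacher + (1 - alpha) * student.

    Averaging the student's weights over training steps makes the teacher
    an implicit ensemble of the student networks seen so far.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```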
Highlights
  • Semantic segmentation is a fundamental task in computer vision, which assigns a class label to each pixel of a given image (Tao et al 2017)
  • Deep learning is a powerful tool for the semantic segmentation task
  • Since labels from the target domain are not available in most practical situations, this study focuses on unsupervised domain adaptation, where the source domain supplies both images and annotations while the target domain supplies only unlabeled images
  • The main contributions of this study are summarized as follows: (1) We propose a self-ensembling model to address domain shift in the semantic segmentation task for the first time
  • While the aforementioned domain adaptation methods mainly rely on adversarial training to reduce the domain gap, we propose a self-ensembling model to address this problem, which provides a different viewpoint on how to learn domain-invariant features for semantic segmentation
Results
  • The main contributions of this study are summarized as follows: (1) The authors propose a self-ensembling model to address domain shift in the semantic segmentation task for the first time.
  • The learnt attention maps are further utilized to guide the calculation of the consistency loss in the target domain, which improves the performance of the model.
  • Another related line of work is curriculum-style learning, where curriculum domain adaptation solves easy tasks first to infer necessary properties of the target domain, such as label distributions over images and local label distributions over landmark superpixels.
  • A segmentation network is then trained with the regularization that its predictions in the target domain follow those inferred properties (Zhang, David, and Gong 2017).
  • While the aforementioned domain adaptation methods mainly utilize adversarial training to reduce the domain gap, the authors propose a self-ensembling model to address this problem, which provides a different viewpoint on how to learn domain-invariant features for semantic segmentation.
  • The source-domain images are fed only into the student network to calculate the segmentation loss.
  • The target-domain images are input to both student and teacher networks to calculate the consistency loss.
  • Owing to the regularization of the consistency loss, the student network can learn from the output of the teacher network, which is likely to be closer to the ground truth in the target domain.
  • The target-domain images are sent to the teacher network to perform the semantic segmentation.
  • Optimization: given an image $X_s$ and a corresponding label map $Y_s$ in the source domain, the authors first define the segmentation loss with cross-entropy as $\mathcal{L}_{seg}(X_s) = -\sum_{u,v}\sum_{c} Y_s^{(u,v,c)} \log P_S(X_s)^{(u,v,c)}$, where $P_S$ denotes the probability map generated by the student network. A minimal training-step sketch combining this loss with the consistency loss follows this list.
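Putting the pieces above together, below is a hedged PyTorch-style sketch of one training step: supervised cross-entropy on source images through the student only, a squared-difference consistency loss between student and teacher on a transformed target image, and an EMA teacher update. `student`, `teacher`, `augment`, `lambda_con`, and `alpha` are assumed names for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, optimizer,
               x_src, y_src, x_tgt,
               augment, lambda_con=1.0, alpha=0.99):
    # Source domain: supervised segmentation loss on the student only.
    seg_logits = student(x_src)                    # (N, C, H, W) logits
    loss_seg = F.cross_entropy(seg_logits, y_src)  # y_src: (N, H, W) class ids

    # Target domain: both networks see the same transformed image g(X_t).
    x_aug = augment(x_tgt)
    p_student = torch.softmax(student(x_aug), dim=1)
    with torch.no_grad():                          # no gradient through the teacher
        p_teacher = torch.softmax(teacher(x_aug), dim=1)

    # Consistency loss: squared difference of the two probability maps.
    loss_con = F.mse_loss(p_student, p_teacher)

    loss = loss_seg + lambda_con * loss_con
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher follows the student via the EMA update sketched earlier.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(alpha).add_(s, alpha=1.0 - alpha)
    return loss.item()
```

Note that only the student receives gradients; the teacher improves solely through weight averaging, which is what makes its target-domain predictions a stable training signal.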
Conclusion
  • The consistency loss on a target-domain image $X_t$ is the squared difference between the probability maps of the two networks, $\mathcal{L}_{con}(X_t) = \sum_{u,v,c} \big( P_S(g(X_t))^{(u,v,c)} - P_T(g(X_t))^{(u,v,c)} \big)^2$, where $P_T$ denotes the probability map generated by the teacher network and $g(\cdot)$ denotes the transformation applied to the input image. An attention-weighted variant of this loss is sketched after this list.
  • Curriculum domain adaptation (CDA) (Zhang, David, and Gong 2017), presented at ICCV 2017, proposes a curriculum-style learning approach to minimize the domain gap in semantic segmentation.
  • According to the results in Table 1, even without domain adaptation, the segmentation network trained directly on the source domain achieves an IoU of 66.8% for the sky class in the target domain, which is already a very high score.
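The Results above note that the learnt attention maps guide the calculation of the consistency loss in the target domain. The exact weighting scheme is not given in this summary, so the sketch below shows one plausible form under that assumption: a per-pixel attention map in [0, 1] scales the squared difference before averaging. `attention` is a hypothetical input, e.g. produced by an attention branch of the network.

```python
import torch

def attention_weighted_consistency(p_student: torch.Tensor,
                                   p_teacher: torch.Tensor,
                                   attention: torch.Tensor) -> torch.Tensor:
    """Consistency loss with a spatial attention weighting.

    p_student, p_teacher: (N, C, H, W) probability maps.
    attention:            (N, 1, H, W) weights in [0, 1]; low-confidence
                          regions contribute less to the loss.
    """
    sq_diff = (p_student - p_teacher).pow(2)  # per-pixel, per-class difference
    weighted = attention * sq_diff            # broadcast the map over channels
    return weighted.mean()
```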
Tables
  • Table 1: Results of semantic segmentation by adapting from SYNTHIA to Cityscapes. MCD (Saito et al 2018) and CyCADA (Hoffman et al 2018) do not report experimental results on the GTA-5 dataset with VGG-16 backbone networks, so we omit them in this table. The IoUs of wall, fence, and pole for CCA are not reported (Chen et al 2017). For the remaining 13 classes, the mean IoU of CCA is 35.7%, while our method achieves 43.6% in this case.
  • Table 3: Parameter analysis of the weighting factor λ_con for the consistency loss.
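For context, λ_con weights the consistency term against the segmentation term; assuming the standard combination used by mean-teacher methods, the overall objective would read:

```latex
\mathcal{L}(X_s, X_t) = \mathcal{L}_{seg}(X_s) + \lambda_{con}\,\mathcal{L}_{con}(X_t)
```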
Related work
  • Semantic Segmentation: Unlike the traditional image classification task, where each image carries a single label, semantic segmentation requires pixel-level predictions, which is more challenging. Inspired by the work in (Long, Shelhamer, and Darrell 2015), numerous deep models have been proposed to tackle the semantic segmentation task with fully convolutional networks (Noh, Hong, and Han 2015; Chen et al 2018). Training these deep models usually requires abundant pixel-level annotations, which are hard to collect in real-world applications (Tao et al 2017).

    An alternative approach is to train these deep models on synthetic data. Recent research has made it feasible to automatically generate dense, pixel-accurate semantic label maps for photo-realistic images extracted from computer games. It is reported that labeling 25 thousand images obtained from the game Grand Theft Auto V took only 49 hours, dramatically reducing the amount of human effort required (Richter et al 2016). However, due to the phenomenon of domain shift, models trained on such synthetic data can hardly yield satisfactory performance in real scenarios (Chang et al 2017).
Funding
  • This work is supported by the National Natural Science Foundation of China under Grant 61822113, Grant 41431175, Grant 61471274, and Grant U1536204.
References
  • Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3213–3223.
  • Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 248–255.
  • Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88(2):303–338.
  • French, G.; Mackiewicz, M.; and Fisher, M. 2018. Self-ensembling for visual domain adaptation. In International Conference on Learning Representations (ICLR).
  • Hoffman, J.; Wang, D.; Yu, F.; and Darrell, T. 2016. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649.
  • Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.-Y.; Isola, P.; Saenko, K.; Efros, A. A.; and Darrell, T. 2018. CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML).
  • Killian, T.; Daulton, S.; Konidaris, G.; and Doshi-Velez, F. 2017. Robust and efficient transfer learning with hidden-parameter Markov decision processes. In AAAI Conference on Artificial Intelligence (AAAI).
  • Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
  • Laine, S., and Aila, T. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
  • Liu, J.; Wang, Y.; and Qiao, Y. 2017. Sparse deep transfer learning for convolutional neural network. In AAAI Conference on Artificial Intelligence (AAAI).
  • Long, M.; Cao, Y.; Wang, J.; and Jordan, M. I. 2015. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791.
  • Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3431–3440.
  • Mnih, V.; Heess, N.; Graves, A.; et al. 2014. Recurrent models of visual attention. In International Conference on Neural Information Processing Systems (NIPS), 2204–2212.
  • Noh, H.; Hong, S.; and Han, B. 2015. Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision (ICCV), 1520–1528.
  • Richter, S. R.; Vineet, V.; Roth, S.; and Koltun, V. 2016. Playing for data: Ground truth from computer games. In European Conference on Computer Vision (ECCV), 102–118. Springer.
  • Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; and Lopez, A. M. 2016. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3234–3243.
  • Saito, K.; Watanabe, K.; Ushiku, Y.; and Harada, T. 2018. Maximum classifier discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Singh, S.; Hoiem, D.; and Forsyth, D. 2016. Swapout: Learning an ensemble of deep architectures. In International Conference on Neural Information Processing Systems (NIPS).
  • Tan, B.; Zhang, Y.; Pan, S. J.; and Yang, Q. 2017. Distant domain transfer learning. In AAAI Conference on Artificial Intelligence (AAAI).
  • Tao, Z.; Liu, H.; Fu, H.; and Fu, Y. 2017. Image cosegmentation via saliency-guided constrained clustering with cosine similarity. In AAAI Conference on Artificial Intelligence (AAAI).
  • Tarvainen, A., and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In International Conference on Neural Information Processing Systems (NIPS), 1195–1204.
  • Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; and Darrell, T. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
  • Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; and Tang, X. 2017. Residual attention network for image classification. arXiv preprint arXiv:1704.06904.
  • Xu, H.; Gao, Y.; Yu, F.; and Darrell, T. 2017. End-to-end learning of driving models from large-scale video datasets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; and Agrawal, A. 2018. Context encoding for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zhang, Y.; David, P.; and Gong, B. 2017. Curriculum domain adaptation for semantic segmentation of urban scenes. In IEEE International Conference on Computer Vision (ICCV).
  • Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV).