Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Sixiao Zheng
Jiachen Lu
Hengshuang Zhao
Zekun Luo
Yabiao Wang
Jianfeng Feng
Tao Xiang

Abstract:

Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been…

Introduction
  • Since the seminal work of [37], existing semantic segmentation models have been dominated by those based on the fully convolutional network (FCN).
  • Due to concerns about computational cost, the resolution of the feature maps is reduced progressively, and the encoder is able to learn more abstract/semantic visual concepts with a gradually increasing receptive field.
  • Such a design is popular thanks to two favourable merits, namely translation equivariance and locality.
  • However, it also imposes a fundamental limitation: learning long-range dependency information, critical for semantic segmentation in unconstrained scene images [2, 50], becomes challenging because the receptive field remains limited.
Highlights
  • Since the seminal work of [37], existing semantic segmentation models have been dominated by those based on the fully convolutional network (FCN)
  • Our SETR-MLA achieves a superior mean Intersection over Union (mIoU) of 48.64% with single-scale (SS) inference, surpassing ACNet by a large margin
  • When multi-scale inference is adopted, our method achieves a new state of the art, with mIoU reaching 50.28%
  • Using the same training schedule, our proposed SEgmentation TRansformer (SETR) significantly outperforms this baseline, achieving an mIoU of 54.40% (SETR-PUP) and 54.87% (SETR-MLA, Multi-Level feature Aggregation)
  • We can see that our model SETR-PUP is superior to FCN baselines as well as FCN-plus-attention approaches such as Non-local [48] and CCNet [26], and its performance is on par with the best results reported so far
  • In contrast to existing FCN-based methods, which typically enlarge the receptive field with dilated convolutions and attention modules at the component level, we make a step change at the architectural level to completely eliminate the reliance on FCN and solve the limited-receptive-field challenge
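The comparisons above are all reported as mean Intersection over Union. For reference, below is a minimal NumPy sketch of how mIoU is typically computed from predicted and ground-truth label maps; the function name, ignore-index convention and toy arrays are illustrative assumptions, not taken from the paper or its released code.

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean Intersection over Union from integer label maps of equal shape.

    Pixels labelled `ignore_index` in the ground truth are excluded, as is
    common for Cityscapes/ADE20K-style annotations.
    """
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example: two 4x4 label maps with 3 classes.
pred = np.random.randint(0, 3, (4, 4))
gt = np.random.randint(0, 3, (4, 4))
print(mean_iou(pred, gt, num_classes=3))
```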
Methods
  • Due to the local nature of the convolution operation, the receptive field grows only linearly with the depth of the network, conditioned on the kernel sizes (see the sketch after this list).
  • Only higher layers with large receptive fields can model long-range dependencies in this FCN architecture.
  • Having a limited receptive field for context modeling is thus an intrinsic limitation of the vanilla FCN architecture.
  • The ablations compare FCN [40], Semantic FPN [40], SETR-Naive, SETR-PUP and SETR-MLA, their smaller "-S" variants, and SETR-Hybrid.
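To make the linear-growth point concrete, here is a minimal sketch of the standard receptive-field recursion for a stack of convolutions; the function and the example numbers are illustrative, not part of the paper.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of a stack of convolution layers (no dilation).

    With stride-1 convolutions of kernel size k, each layer adds only k-1
    pixels, so the receptive field grows linearly with depth.
    """
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k-1)*jump
        jump *= s
    return rf

# Ten stacked 3x3, stride-1 convolutions still only cover a 21x21 window.
print(receptive_field([3] * 10))   # -> 21
```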
Results
  • Results on ADE20K: Table 3 presents the results on the more challenging ADE20K dataset.
  • The authors can see that SETR-PUP is superior to FCN baselines and to FCN-plus-attention approaches such as Non-local [48] and CCNet [26], and that its performance is on par with the best results reported so far
  • On this dataset the authors can also compare with the closely related Axial-DeepLab [11, 47], which aims for an attention-alone model but still follows the basic structure of FCN.
Conclusion
  • The authors have presented an alternative perspective on semantic segmentation in images by introducing a sequence-to-sequence prediction framework.
  • Along with a set of decoder designs of varying complexity, strong segmentation models are established without any of the bells and whistles deployed by recent methods.
  • Extensive experiments demonstrate that the models set a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), with competitive results on Cityscapes.
  • The authors' method is ranked first (44.42% mIoU) on the highly competitive ADE20K test server leaderboard
Summary
  • Objectives:

    The authors aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task (a patch-tokenization sketch illustrating this view follows below).
  • The authors aim to offer a rethinking of semantic segmentation model design and to contribute an alternative architecture
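As a concrete illustration of the sequence-to-sequence view, here is a minimal PyTorch sketch of turning an image into a sequence of patch tokens that a transformer encoder can then process globally at every layer. The patch size, embedding dimension and class name are illustrative assumptions in the spirit of ViT-style tokenization, not the authors' released code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flatten an image into a sequence of patch tokens.

    An H x W image with patch size P yields (H/P) * (W/P) tokens, each a
    linear projection of one P x P x 3 patch; a transformer encoder over
    this sequence models global context without any receptive-field limit.
    """
    def __init__(self, patch_size=16, in_channels=3, embed_dim=1024):
        super().__init__()
        # A strided convolution is equivalent to a per-patch linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, N, D) token sequence

tokens = PatchEmbedding()(torch.randn(1, 3, 512, 512))
print(tokens.shape)   # (1, 1024, 1024): 32 * 32 = 1024 tokens of dimension 1024
```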
Tables
  • Table 1: Ablation studies. All methods are evaluated using mean IoU (%) under the single-scale test protocol. Unless otherwise specified, all models are trained on the Cityscapes train (fine) set for 40,000 iterations with batch size 8, and evaluated on the Cityscapes validation set
  • Table 2: Configuration of Transformer variants
  • Table 3: State-of-the-art comparison on the ADE20K dataset. Performance of different model variants and batch sizes (e.g., 8 or 16) is reported. SS: single-scale inference. MS: multi-scale inference
  • Table 4: State-of-the-art comparison on the Pascal Context dataset. Performance of different model variants and batch sizes is reported
  • Table 5: State-of-the-art comparison on the Cityscapes validation set. Performance of different training schedules (e.g., 40k and 80k iterations) is reported. SS: single-scale inference. MS: multi-scale inference
  • Table6: Comparison on the Cityscapes test set. ‡: trained on fine and coarse annotated data
Related work
  • Semantic segmentation: Semantic image segmentation has been significantly boosted by the development of deep neural networks. By removing the fully connected layers, the fully convolutional network (FCN) [37] is able to achieve pixel-wise prediction. Since the predictions of FCN are relatively coarse, several CRF/MRF-based approaches [6, 36, 62] were developed to refine them. To address the inherent tension between semantics and location [37], coarse and fine layers need to be aggregated in both the encoder and the decoder. This has led to different variants of the encoder-decoder structure [2, 39, 43] for multi-level feature fusion.

    Many recent efforts have focused on addressing the limited receptive field/context modeling problem of FCN. To enlarge the receptive field, DeepLab [7] and Dilation [53] introduce the dilated convolution. Context modeling is instead the focus of PSPNet [60] and DeepLabV2 [9]: the former proposes the PPM module to aggregate contextual information from different regions, while the latter develops the ASPP module, which applies pyramid dilated convolutions with different dilation rates. Decomposed large kernels [41] have also been used to capture context. More recently, attention-based models have become popular for capturing long-range context. PSANet [61] develops a point-wise spatial attention module to capture long-range context dynamically. DANet [18] embeds both spatial attention and channel attention. CCNet [27] focuses instead on reducing the heavy computational cost introduced by full spatial attention. DGMN [57] builds a dynamic graph message-passing network for scene modeling, which significantly reduces the computational complexity. Note that all these approaches are still based on FCNs, in which feature encoding and extraction rely on classical ConvNets such as VGG [44] and ResNet [21]. In this work, we instead rethink the semantic segmentation task from a different perspective.
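For a concrete picture of the dilated-convolution route these FCN-based methods take, below is a minimal PyTorch sketch of parallel dilated 3x3 branches in the spirit of ASPP. The dilation rates, channel widths and class name are illustrative assumptions rather than any particular paper's configuration.

```python
import torch
import torch.nn as nn

class DilatedContext(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates (ASPP-style).

    A 3x3 kernel with dilation d covers a (2d+1) x (2d+1) window, so large
    rates enlarge the receptive field while keeping the feature resolution.
    """
    def __init__(self, in_ch=512, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # Concatenate the multi-rate context branches, then fuse with a 1x1 conv.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

y = DilatedContext()(torch.randn(1, 512, 64, 64))
print(y.shape)   # torch.Size([1, 256, 64, 64]) -- spatial resolution preserved
```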
Key results
  • We achieve the first place (44.42% mIoU) on the highly competitive ADE20K test server leaderboard
  • Our SETR-MLA achieves a superior mIoU of 48.64% with single-scale (SS) inference, surpassing ACNet by a large margin
  • When multi-scale inference is adopted, our method achieves a new state of the art, with mIoU reaching 50.28%
  • Using the same training schedule, our proposed SETR significantly outperforms this baseline, achieving an mIoU of 54.40% (SETR-PUP) and 54.87% (SETR-MLA)
  • SETR-MLA further improves the performance to 55.83% when multi-scale (MS) inference is adopted, outperforming the nearest rival, APCNet, by a clear margin
Study subjects and analysis
Datasets: 3 (ADE20K, Pascal Context, and Cityscapes)
We adopt a polynomial learning-rate decay schedule [60] and employ SGD as the optimizer. Momentum and weight decay are set to 0.9 and 0, respectively, for all experiments on the three datasets. The initial learning rate is set to 0.001 on ADE20K and Pascal Context, and to 0.01 on Cityscapes.
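Written out, the polynomial schedule scales the base learning rate by (1 - iter/max_iter) raised to a power; the sketch below shows how it could plug into an SGD optimizer in PyTorch. The decay power of 0.9 and the stand-in model are assumptions for illustration (0.9 is a common default for this schedule, not a value quoted here); the momentum, weight-decay and learning-rate values are the ones stated above.

```python
import torch

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# Stand-in single-layer model; momentum and weight decay follow the text above.
model = torch.nn.Conv2d(3, 19, kernel_size=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0)

max_iter = 40_000   # e.g. the 40k Cityscapes schedule referenced in Table 1
for it in (0, 10_000, 20_000, 39_999):
    lr = poly_lr(base_lr=0.01, cur_iter=it, max_iter=max_iter)
    for group in optimizer.param_groups:   # update the optimizer in place
        group["lr"] = lr
    print(it, round(lr, 5))
```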

Reference
  • Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. arXiv preprint, 2020. 11
  • Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017. 1, 2
  • Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In ICCV, 2019. 2
  • Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. arXiv preprint, 2019. 7
  • Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. 2
  • Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICLR, 2015. 2
  • Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015. 2
  • Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017.
  • Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2018. 2
  • Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint, 2017. 7
  • Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020. 8
  • Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 5, 7
  • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019. 2
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019. 2
  • Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. Semantic correlation promoted shape-variant context for segmentation. In CVPR, 2019. 6
  • Xiaohan Ding, Yuchen Guo, Guiguang Ding, and Jungong Han. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In CVPR, 2019. 6, 8
  • Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint, 2020. 2, 3, 6, 7
  • Jun Fu, Jing Liu, Haijie Tian, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019. 2, 6, 7
  • Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multiscale filters for semantic segmentation. In ICCV, 2019. 6
  • Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for semantic segmentation. In CVPR, 2019. 6
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 2, 3
  • Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint, 2019. 3
  • Matthias Holschneider, Richard Kronland-Martinet, Jean Morlet, and Ph Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, 1990. 1
  • Qinbin Hou, Li Zhang, Ming-Ming Cheng, and Jiashi Feng. Strip pooling: Rethinking spatial pooling for scene parsing. In CVPR, 2020. 6
  • Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In ICCV, 2019. 2
  • Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019. 1, 3, 6, 7, 8
  • Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019. 2
  • Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollar. Panoptic feature pyramid networks. In CVPR, 2019. 4, 5, 6
  • Xiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, and Yunhai Tong. Improving semantic segmentation via decoupled body and edge supervision. In ECCV, 2020. 1
  • Xiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, and Yunhai Tong. Global aggregation then local distribution in fully convolutional networks. In BMVC, 2019. 1
  • Xiangtai Li, Houlong Zhao, Lei Han, Yunhai Tong, and Kuiyuan Yang. Gff: Gated fully fusion for semantic segmentation. In AAAI, 2020. 6
  • Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. Expectation-maximization attention networks for semantic segmentation. In CVPR, 2019. 6
  • Zhaoshuo Li, Xingtong Liu, Francis X Creighton, Russell H Taylor, and Mathias Unberath. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. arXiv preprint, 2020. 3
  • Tsung-Yi Lin, Piotr Dollar, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017. 4
  • Ruijin Liu, Zejian Yuan, Tie Liu, and Zhiliang Xiong. End-to-end lane shape prediction with transformers. arXiv preprint, 2020. 3
  • Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015. 2
  • Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 1, 2, 3, 6
  • Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014. 5
  • Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015. 2
  • OpenMMLab. mmsegmentation. https://github.com/open-mmlab/mmsegmentation, 2020. 5, 6, 7
  • Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters — improve semantic segmentation by global convolutional network. In CVPR, 2017. 1, 2
  • Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019. 2
  • Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. MICCAI, 2015. 2
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 2
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2
  • Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018. 4
  • Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Standalone axial-attention for panoptic segmentation. In ECCV, 2020. 1, 2, 3, 7, 8
  • Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018. 1, 2, 7, 8
  • Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In ICLR, 2019. 2
  • Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In CVPR, 2018. 1, 7
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019. 2
  • Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 2018. 7
  • Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. ICLR, 2016. 2
  • Yuhui Yuan and Jingdong Wang. Ocnet: Object context network for scene parsing. arXiv preprint, 2018. 6, 7
  • Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, 2018. 6
  • Li Zhang, Xiangtai Li, Anurag Arnab, Kuiyuan Yang, Yunhai Tong, and Philip HS Torr. Dual graph convolutional network for semantic segmentation. In BMVC, 2019. 3
  • Li Zhang, Dan Xu, Anurag Arnab, and Philip HS Torr. Dynamic graph message passing networks. In CVPR, 2020. 1, 2, 3
  • Richard Zhang. Making convolutional networks shift-invariant again. In ICML, 2019. 1
  • Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In CVPR, 2020. 2
  • Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017. 1, 2, 5, 6, 7
  • Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Point-wise spatial attention network for scene parsing. In ECCV, 2018. 2, 7
  • Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015. 2
  • Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. arXiv preprint, 2016. 5