IMP: Instance Mask Projection for High Accuracy Semantic Segmentation of Things.

CoRR, (2019): 5178-5187

Abstract

In this work, we present a new operator, called Instance Mask Projection (IMP), which projects a predicted instance segmentation as a new feature for semantic segmentation. It also supports back-propagation, so it is trainable end-to-end. Our experiments show the effectiveness of IMP on both clothing parsing (with complex layering, large de…

Introduction
  • This paper addresses producing pixel-accurate semantic segmentations.
  • It combines detection results (bounding box and instance mask prediction), as in Mask R-CNN [18], with semantic segmentation.
  • The core of the approach is a new operator, Instance Mask Projection (IMP), that projects the predicted masks from Mask R-CNN for each detection into a feature map to use as an auxiliary input for semantic segmentation, significantly increasing accuracy.
  • In the implementations the semantic segmentation pipeline shares a trunk with the detector, as in Panoptic FPN [21], resulting in a fast solution
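The paper does not include code, but the projection step described above can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions: the function name, score-weighted pasting, and max aggregation over overlapping instances are our assumptions for illustration; the actual IMP operator runs inside the network on soft mask logits and supports back-propagation.

```python
import numpy as np

def instance_mask_projection(masks, classes, scores, num_classes):
    """Project per-detection instance masks into a per-class feature map.

    masks:   list of (H, W) binary arrays, one per detection
    classes: list of class indices in [0, num_classes), one per detection
    scores:  list of detection confidences, one per detection
    Returns a (num_classes, H, W) array usable as an auxiliary input
    for the semantic segmentation head.
    """
    H, W = masks[0].shape
    feat = np.zeros((num_classes, H, W), dtype=np.float32)
    for mask, cls, score in zip(masks, classes, scores):
        # Paste each predicted mask into its class channel; where
        # instances of the same class overlap, keep the higher score.
        feat[cls] = np.maximum(feat[cls], mask.astype(np.float32) * score)
    return feat

# Toy example: two overlapping detections of class 0 in a 4x4 image.
m1 = np.zeros((4, 4)); m1[0:2, 0:2] = 1   # detection with score 0.9
m2 = np.zeros((4, 4)); m2[1:3, 1:3] = 1   # detection with score 0.6
feat = instance_mask_projection([m1, m2], [0, 0], [0.9, 0.6], num_classes=2)
```

The resulting per-class map can then be concatenated with (or added to) the bottom-up semantic features, which is how the paper describes using detector output as a top-down prior.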
Highlights
  • This paper addresses producing pixel-accurate semantic segmentations
  • Many other potential applications can be envisioned, especially in real-world scenarios where intelligent agents are using vision to perceive their surrounding environments, but for this paper we focus on two areas, street scenes and fashion outfits, as two widely differing settings to demonstrate the generality of our method
  • In our implementations the semantic segmentation pipeline shares a trunk with the detector, as in Panoptic Feature Pyramid Network (FPN) [21], resulting in a fast solution
  • Adding our proposed Instance Mask Projection (IMP) operator significantly increases semantic segmentation performance when incorporated into each of these base models, improving absolute performance of Panoptic-P2 by 9.45 mIOU and 1.42 in mAcc, and improving Panoptic-FPN by 2.02 mIOU and 4.44 in mAcc
  • We propose a new operator, Instance Mask Projection, which projects the results of instance segmentation as a feature representation for semantic segmentation
  • On the ModaNet clothing parsing dataset, we show a dramatic improvement of 20.4% compared to existing baseline semantic segmentation results
  • Experiments adding IMP to Panoptic-P2/Panoptic-FPN show consistent improvements, with negligible increases in inference time. Although we only apply it to Panoptic-P2/Panoptic-FPN, this operator can generally be applied to other architectures as well
Methods
  • Baselines compared: PSANet101 [41], Mapillary [3], DeepLabV3+ [8], and Panoptic FPN [21]; ours: Panoptic-FPN-IMP.
  • Backbones used across these models: ResNet-101-D8, WideResNet-38-D8, X-71-D16, ResNet-50-FPN, ResNet-101-FPN, and ResNeXt-101-FPN (see Sec. 4.4).
  • Due to the different number of instance classes and input resolutions, the speed performance of models can vary.
  • The authors find the results are quite consistent and very efficient: adding IMP costs only ∼1-2 ms of inference time on top of each baseline model.
  • The inference time of all the models used in the experiments can be found in Table 6 in the Appendix
Results
  • On the ModaNet clothing parsing dataset, the authors show a dramatic improvement of 20.4% compared to existing baseline semantic segmentation results.
  • It shows the best results on ModaNet [43], improving mean IOU from 51% for DeepLabV3+ to 71.4%.
  • The authors' model achieves 20.4% absolute mIOU improvement compared to the best performing semantic segmentation algorithm, DeepLabV3+, provided by ModaNet. Plus, the authors achieve more consistent results, scoring over 50% IOU for each class.
  • Compared to the baseline results, the model does extremely well on small objects, e.g., belt, sunglasses, headwear, and scarf & tie
Conclusion
  • The authors propose a new operator, Instance Mask Projection, which projects the results of instance segmentation as a feature representation for semantic segmentation.
  • It combines top-down and bottom-up information for semantic segmentation.
  • This operator is simple but powerful.
  • Experiments adding IMP to Panoptic-P2/Panoptic-FPN show consistent improvements, with negligible increases in inference time.
  • Although the authors only apply it to Panoptic-P2/Panoptic-FPN, the operator can generally be applied to other architectures as well
Summary
  • Objectives: The authors' goal is to develop a joint instance/semantic segmentation framework that can directly integrate predictions from instance segmentation to produce a more accurate semantic segmentation labeling.
Tables
  • Table1: Ablation Study on Varied Clothing Datasetwith ResNet-50 as the backbone network. We train the model with different settings, Panoptic-P2 vs Panoptic-FPN, w/wo Instance Mask Projection(IMP), w/wo BBox/Mask prediction head. For the BBox, and Mask, we use the COCO evaluation metric. For the semantic segmentation metric, we use mean IOU and mean Accuracy
  • Table2: Comparison to the baseline models provided by ModaNet. Our model shows 20.4% absolute improvement for mean IOU. For certain categories, especially those whose size is quite small such as belt, sunglasses, headwear and scarf & tie, our models show dramatic improvement. For simplicity, we use R50 and R101 to represent ResNet-50 and ResNet-101
  • Table3: Results on ModaNet with ResNet-50 as the backbone model. Panoptic-P2-IMP and Mask R-CNN-IMP both provide improvements on semantic segmentation compared to Semantic-P2 and Panoptic-P2
  • Table4: Comparisons of per Class IOU with and without IMP on Cityscapes. We show two scenarios without (top) and with (bottom) data augmentation. We see Instance Mask Projection(IMP) improves both scenarios. For Thing classes, we see 4.2/3.2 mIOU improvement with/without all data augmentation
  • Table5: Comparisons on Cityscapes val set. Our models obtain 0.6 and 0.3 mIOU improvement over Panoptic FPN [21] on the same backbone architectures
Related Work
  • Our work builds on current state-of-the-art object detection and semantic segmentation models which have benefited greatly from recent advances in convolution neural network architectures. In this section, we first review recent progress on object localization and semantic segmentation. Then, we describe how our proposed model fits in with other works which integrate both object detection and semantic segmentation.

    2.1. Localizing Things

    Initially, methods to localize objects in images mainly focused on predicting a tight bounding box around each object of interest. As the accuracy matured, research in object localization expanded to produce not only a rectangular bounding box but also an instance segmentation, identifying which pixels correspond to each object.

    Object Detection: R-CNN [16] has been one of the most foundational lines of research driving recent developments in detection, initiating work on using the feature representations learned in CNNs for localization. Many related works continued this progress in two-stage detection approaches, including SPP Net [19], Fast R-CNN [15], and Faster R-CNN [34]. In addition, single-shot detectors, YOLO [33] and SSD [28], have been proposed to achieve real-time speed. Many other recent methods have been proposed to improve accuracy. R-FCN [11] pools position-sensitive class maps to make predictions more robust. FPN [24] and DSSD [14] add top-down connections to bring semantic information from deep layers to shallow layers. Focal Loss [25] reduces the extreme class imbalance by decreasing influence from well-predicted examples.
Funding
  • On the ModaNet clothing parsing dataset, we show a dramatic improvement of 20.4% compared to existing baseline semantic segmentation results
  • Showing the best results on ModaNet [43], improving mean IOU from 51% for DeepLabV3+ to 71.4%
  • Adding our proposed IMP operator significantly increases semantic segmentation performance when incorporated into each of these base models (rows 6 and 7), improving absolute performance of Panoptic-P2 by 9.45 mIOU and 1.42 in mAcc, and improving Panoptic-FPN by 2.02 mIOU and 4.44 in mAcc
  • Adding IMP to Panoptic-P2, Panoptic-P2-IMP achieves a semantic performance of 69.65%, outperforming Panoptic-P2 by 3.72% mIOU, and Panoptic-FPN-IMP further improves mIOU to 71.41%
  • Our model achieves 20.4% absolute mIOU improvement compared to the best performing semantic segmentation algorithm, DeepLabV3+, provided by ModaNet
  • Plus, we achieve more consistent results, scoring over 50% IOU for each class
  • Compared to the baseline results, our model does extremely well on small objects, e.g., belt, sunglasses, headwear, and scarf & tie (on scarf & tie we achieve 97.9% mIOU)
Study Subjects and Analysis
  • Datasets: 3 (see Sec. 4.2)
  • Across three datasets, using features from IMP improves significantly over a panoptic segmentation baseline (the same system without IMP) and produces state-of-the-art results (see Sec. 4.3)

References
  • [1] Yagız Aksoy, Tae-Hyun Oh, Sylvain Paris, Marc Pollefeys, and Wojciech Matusik. Semantic Soft Segmentation. ACM Trans. Graph. (Proc. SIGGRAPH), 2018. 3
  • [2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. PAMI, 2017. 3
  • [3] Samuel Rota Bulo, Lorenzo Porzi, and Peter Kontschieder. In-Place Activated BatchNorm for Memory-Optimized Training of DNNs. In CVPR, 2018. 3, 8
  • [4] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and Stuff Classes in Context. In CVPR, 2018. 3
  • [5] Liang-Chieh Chen, Jonathan T. Barron, George Papandreou, Kevin Murphy, and Alan L. Yuille. Semantic Image Segmentation with Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform. In CVPR, 2016. 3
  • [6] Liang-Chieh* Chen, George* Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. In ICLR, 2015. 3
  • [7] Liang-Chieh* Chen, George* Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. PAMI, 2018. 3
  • [8] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In ECCV, 2018. 3, 7, 8
  • [9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In CVPR, 2016. 1, 6, 7
  • [10] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware Semantic Segmentation via Multi-task Network Cascades. In CVPR, 2016. 2
  • [11] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object Detection via Region-based Fully Convolutional Networks. In NeurIPS, 2016. 2
  • [12] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable Convolutional Networks. In ICCV, 2017. 3
  • [13] Nikita Dvornik, Konstantin Shmelkov, Julien Mairal, and Cordelia Schmid. BlitzNet: A Real-Time Deep Network for Scene Understanding. In ICCV, 2017. 3
  • [14] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD: Deconvolutional Single Shot Detector. arXiv:1701.06659, 2017. 2, 3
  • [15] Ross Girshick. Fast R-CNN. In ICCV, 2015. 2
  • [16] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 2
  • [17] Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677, 2017. 5
  • [18] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017. 1, 2, 3
  • [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. PAMI, 2015. 2
  • [20] Sina Honari, Jason Yosinski, Pascal Vincent, and Christopher Pal. Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation. In CVPR, 2016. 3
  • [21] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollar. Panoptic Feature Pyramid Networks. In CVPR, 2019. 1, 3, 4, 5, 8
  • [22] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollar. Panoptic Segmentation. In CVPR, 2019. 3
  • [23] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully Convolutional Instance-aware Semantic Segmentation. In CVPR, 2017. 2
  • [24] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature Pyramid Networks for Object Detection. In CVPR, 2017. 2, 3, 4
  • [25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal Loss for Dense Object Detection. In ICCV, 2017. 2
  • [26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014. 5
  • [27] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path Aggregation Network for Instance Segmentation. In CVPR, 2018. 2
  • [28] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single Shot MultiBox Detector. In ECCV, 2016. 2
  • [29] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018. Accessed: 03/22/2019. 5
  • [30] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked Hourglass Networks for Human Pose Estimation. In ECCV, 2016. 3
  • [31] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS-W, 2017. 5
  • [32] Pedro O. Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollar. Learning to Refine Object Segments. In ECCV, 2016. 3
  • [33] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In CVPR, 2016. 2
  • [34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS, 2015. 2
  • [35] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI, 2015. 3
  • [36] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. PAMI, 2016. 3, 7
  • [37] Yuxin Wu and Kaiming He. Group Normalization. In ECCV, 2018. 5
  • [38] Yuwen Xiong*, Renjie Liao*, Hengshuang Zhao*, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. UPSNet: A Unified Panoptic Segmentation Network. In CVPR, 2019. 3
  • [39] Fisher Yu and Vladlen Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. In ICLR, 2016. 3
  • [40] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid Scene Parsing Network. In CVPR, 2017. 3
  • [41] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. PSANet: Pointwise Spatial Attention Network for Scene Parsing. In ECCV, 2018. 3, 8
  • [42] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional Random Fields as Recurrent Neural Networks. In ICCV, 2015. 7
  • [43] Shuai Zheng, Fan Yang, M. Hadi Kiapour, and Robinson Piramuthu. ModaNet: A Large-Scale Street Fashion Dataset with Polygon Annotations. In ACM Multimedia, 2018. 1, 2, 7, 11