Real-Time Panoptic Segmentation from Dense Detections

CVPR, pp. 8520–8529 (2019)


Abstract

Panoptic segmentation is a complex full scene parsing task requiring simultaneous instance and semantic segmentation at high resolution. Current state-of-the-art approaches cannot run in real-time, and simplifying these architectures to improve efficiency severely degrades their accuracy. In this paper, we propose a new single-shot panoptic segmentation network that leverages dense detections and a global self-attention mechanism to operate in real-time with performance approaching the state of the art.

Introduction
  • Scene understanding is the basis of many real-life applications, including autonomous driving, robotics, and image editing.
  • Panoptic segmentation, proposed by Kirillov et al. [14], aims to provide a complete 2D description of a scene.
  • This task requires each pixel in an input image to be assigned to a semantic class and each object instance to be identified and segmented.
  • Pixels are categorized into two high-level classes: stuff, representing amorphous and uncountable regions, and things, covering countable objects.
  • Most recent approaches use a single backbone for feature extraction and add various branches on top of the shared representations to perform each downstream task separately, generating the final panoptic prediction with fusion heuristics [13, 36, 28].
Highlights
  • Scene understanding is the basis of many real-life applications, including autonomous driving, robotics, and image editing
  • Our main contributions are threefold: (i) we introduce a novel panoptic segmentation method extending dense object detection and semantic segmentation by reusing otherwise-discarded object detection outputs via parameter-free global self-attention; (ii) we propose a single-shot framework for real-time panoptic segmentation that achieves performance comparable with the current state of the art, as depicted in Figure 1, but with up to 4x faster inference; (iii) we provide a natural extension of our method that works in a weakly supervised scenario.
  • Given the semantic segmentation output S and the dense bounding-box predictions B, we introduce a parameter-free mask reconstruction algorithm that produces instance masks based on a global self-attention mechanism (a sketch follows this list).
  • To refine instance masks, we introduce a loss, L_mask, that aims to reduce false positive (FP) and false negative (FN) pixel counts in predicted masks (also sketched below).
  • We provide standard metrics on the sub-tasks, including mean Intersection over Union for semantic segmentation and the average precision over regions, AP^r [9], for instance segmentation.
  • Our architecture dramatically decreases computational complexity associated with instance segmentation in conventional panoptic segmentation algorithms
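
To make the mask-reconstruction idea concrete, here is a minimal sketch in PyTorch. It assumes the per-pixel box predictions B have already been decoded to absolute (x1, y1, x2, y2) coordinates, and that each pixel's "attention" to a detected instance is its box IoU with the instance box, gated by the semantic probability of the instance's class. The function names, the multiplicative combination, and the 0.5 threshold are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def box_iou_dense(boxes_per_pixel, query_box):
    """IoU between every per-pixel predicted box and one query box.

    boxes_per_pixel: (N, 4) tensor of (x1, y1, x2, y2) boxes, one per pixel.
    query_box:       (4,) tensor for a single detected instance.
    """
    x1 = torch.maximum(boxes_per_pixel[:, 0], query_box[0])
    y1 = torch.maximum(boxes_per_pixel[:, 1], query_box[1])
    x2 = torch.minimum(boxes_per_pixel[:, 2], query_box[2])
    y2 = torch.minimum(boxes_per_pixel[:, 3], query_box[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = ((boxes_per_pixel[:, 2] - boxes_per_pixel[:, 0])
              * (boxes_per_pixel[:, 3] - boxes_per_pixel[:, 1]))
    area_q = (query_box[2] - query_box[0]) * (query_box[3] - query_box[1])
    return inter / (area_p + area_q - inter + 1e-6)

def reconstruct_masks(S, B, detections, threshold=0.5):
    """Parameter-free instance mask reconstruction (hypothetical sketch).

    S: (C, H, W) semantic class probabilities.
    B: (H*W, 4) dense per-pixel box predictions, decoded to x1y1x2y2.
    detections: list of (box, class_id) pairs surviving NMS.
    Returns one boolean (H, W) mask per detection.
    """
    C, H, W = S.shape
    masks = []
    for box, cls in detections:
        # Global "self-attention" of every pixel to this instance:
        # agreement between the pixel's own box vote and the instance box...
        box_agreement = box_iou_dense(B, box).view(H, W)
        # ...gated by the semantic probability of the instance's class.
        score = box_agreement * S[cls]
        masks.append(score > threshold)
    return masks
```

Because the same dense detection outputs are reused for every instance, no extra learned parameters or per-RoI feature re-pooling are needed, which is where the speedup over two-stage mask heads comes from.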
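The exact form of L_mask is not given in this extract; a common differentiable surrogate counts false-positive and false-negative pixels softly, as in the sketch below. The per-instance normalization by ground-truth mask area is an assumption for illustration.

```python
import torch

def soft_fp_fn_mask_loss(pred, target, eps=1e-6):
    """Differentiable surrogate for the FP + FN pixel counts.

    pred:   (N, H, W) predicted mask probabilities in [0, 1].
    target: (N, H, W) binary ground-truth masks.
    """
    fp = (pred * (1.0 - target)).sum(dim=(1, 2))   # soft false positives
    fn = ((1.0 - pred) * target).sum(dim=(1, 2))   # soft false negatives
    # Normalize by mask area so large instances do not dominate the loss.
    return ((fp + fn) / (target.sum(dim=(1, 2)) + eps)).mean()
```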
Methods
  • Comparison against Panoptic-FPN [13], AdaptIS† [30], AUNet [16], UPSNet [36], DeeperLab [38], and SSAP [8] across backbones (ResNet-50, ResNet-50-FPN, Xception-71), reporting PQ, PQ^Th, PQ^St, and inference time; full numbers in Tables 1 and 2, with PQ defined below.
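
For reference, the panoptic quality (PQ) reported in these comparisons is the metric defined by Kirillov et al. [14]: predicted and ground-truth segments matching with IoU > 0.5 count as true positives (TP), and per class

```latex
\mathrm{PQ} = \frac{\sum_{(p,g)\in \mathit{TP}} \mathrm{IoU}(p,g)}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}
```

PQ^Th and PQ^St average this quantity over thing and stuff classes, respectively.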
Results
  • The authors' experiments on the Cityscapes and COCO benchmarks show that the network runs at 30 FPS on 1024 × 2048 resolution, trading a 3% relative performance degradation from the current state of the art for up to 440% faster inference.
Conclusion
  • The authors propose a single-stage panoptic segmentation framework that achieves real-time inference with a performance competitive with the current state of the art.
  • The authors first introduce a novel parameter-free mask construction operation that reuses predictions from dense object detection via a global self-attention mechanism.
  • The authors' architecture dramatically decreases computational complexity associated with instance segmentation in conventional panoptic segmentation algorithms.
  • The authors develop an explicit mask loss which improves panoptic segmentation quality.
  • The authors evaluate the potential of the method in weakly supervised settings, showing that it can outperform some recent fully supervised methods.
Tables
  • Table 1: Performance on the Cityscapes validation set. We bold the best number across single-stage methods and underline the best number across the two categories. †: method includes multiple forward passes. ∗: our replicated result from official sources using the same evaluation environment as our model.
  • Table 2: Performance on the COCO validation set. We bold the best number across single-stage methods and underline the best across the two categories. †: method includes multiple forward passes. ∗: our replicated result from official sources using the same evaluation environment as our model.
  • Table 3: Ablative analysis. We compare the impact of different key modules/designs in our proposed network. We also present a weakly supervised model trained without using instance masks.
  • Table 4: Per-class performance on Cityscapes.
  • Table 5: Performance of our proposed framework with different configurations on Cityscapes.
  • Table 6: Per-class performance on COCO.
Related work
  • 2.1. Instance Segmentation

    Instance segmentation requires distinct object instances in images to be localized and segmented. Recent works can be categorized into two types: two-stage and single-stage methods. Represented by Mask R-CNN [10] and its variations [22, 4, 12], two-stage algorithms currently claim the state of the art in accuracy. The first stage proposes a set of regions of interest (RoIs) and the second predicts instance masks from features extracted using RoIAlign [10]. This feature re-pooling and re-sampling operation incurs large computational costs that significantly decrease efficiency, rendering two-stage models challenging to deploy in real-time systems. Single-stage methods, on the other hand, predict instance location and shape simultaneously. Some single-stage methods follow the detect-then-segment approach, with additional convolutional heads attached to single-stage object detectors to predict mask shapes [37, 32, 3, 35]. Others learn representations for each foreground pixel and perform pixel clustering to assemble instance masks during post-processing [24, 8, 6, 25, 17]. The final representation can be explicit instance-aware features [24, 17], implicitly learned embeddings [6, 25], or affinity maps with surrounding locations at each pixel [8].
Contributions
  • Proposes a new single-shot panoptic segmentation network that leverages dense detections and a global self-attention mechanism to operate in real-time with performance approaching the state of the art.
  • Introduces a novel parameter-free mask construction method that substantially reduces computational complexity by efficiently reusing information from the object detection and semantic segmentation sub-tasks.
  • Identifies two key opportunities for streamlining existing frameworks.
  • Explores how to maximally reuse information in a single-shot, fully-convolutional panoptic segmentation framework that achieves real-time inference speeds while obtaining performance comparable with the state of the art.
References
  • TensorRT Python library. https://developer.nvidia.com/tensorrt.
  • Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. YOLACT: real-time instance segmentation. CoRR, abs/1904.02689, 2019.
  • Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
  • Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551, 2017.
  • Daan de Geus, Panagiotis Meletis, and Gijs Dubbelman. Fast panoptic segmentation network. arXiv preprint arXiv:1910.03892, 2019.
  • Naiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, and Kaiqi Huang. Ssap: Single-shot instance segmentation with affinity pyramid. arXiv preprint arXiv:1909.01616, 2019.
  • Bharath Hariharan, Pablo Arbelaez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision, pages 297–312. Springer, 2014.
  • Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring r-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6409–6418, 2019.
  • Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollar. Panoptic feature pyramid networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6399–6408, 2019.
  • Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollar. Panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019.
  • Jie Li, Allan Raventos, Arjun Bhargava, Takaaki Tagawa, and Adrien Gaidon. Learning to fuse things and stuff. arXiv preprint arXiv:1812.01192, 2018.
  • Yanwei Li, Xinze Chen, Zheng Zhu, Lingxi Xie, Guan Huang, Dalong Du, and Xingang Wang. Attention-guided unified network for panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7026–7035, 2019.
  • Xiaodan Liang, Liang Lin, Yunchao Wei, Xiaohui Shen, Jianchao Yang, and Shuicheng Yan. Proposal-free network for instance-level object segmentation. IEEE transactions on pattern analysis and machine intelligence, 40(12):2978– 2991, 2017.
  • Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • Huanyu Liu, Chao Peng, Changqian Yu, Jingbo Wang, Xu Liu, Gang Yu, and Wei Jiang. An end-to-end network for panoptic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6181, 2019.
  • Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8759–8768, 2018.
  • Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 4990– 4999, 2017.
  • Davy Neven, Bert De Brabandere, Marc Proesmans, and Luc Van Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8837–8845, 2019.
  • Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, pages 2277–2287, 2017.
  • Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alche-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4151–4160, 2017.
  • Lorenzo Porzi, Samuel Rota Bulo, Aleksander Colovic, and Peter Kontschieder. Seamless scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8277–8286, 2019.
  • Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • Konstantin Sofiiuk, Olga Barinova, and Anton Konushin. Adaptis: Adaptive instance selection network. In Proceedings of the IEEE International Conference on Computer Vision, pages 7355–7363, 2019.
  • Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355, 2019.
  • Jonas Uhrig, Eike Rehder, Bjorn Frohlich, Uwe Franke, and Thomas Brox. Box2pix: Single-shot instance segmentation by assigning pixels to object boxes. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 292–299. IEEE, 2018.
  • Mark Weber, Jonathon Luiten, and Bastian Leibe. Single-shot panoptic segmentation, 2019.
  • Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885, 2016.
  • Enze Xie, Peize Sun, Xiaoge Song, Wenhai Wang, Xuebo Liu, Ding Liang, Chunhua Shen, and Ping Luo. Polarmask: Single shot instance segmentation with polar representation. arXiv preprint arXiv:1909.13226, 2019.
  • Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. Upsnet: A unified panoptic segmentation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8818–8826, 2019.
  • Wenqiang Xu, Haiyang Wang, Fubo Qi, and Cewu Lu. Explicit shape encoding for real-time instance segmentation. arXiv preprint arXiv:1908.04067, 2019.
  • Tien-Ju Yang, Maxwell D. Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu, Xiao Zhang, Vivienne Sze, George Papandreou, and Liang-Chieh Chen. Deeperlab: Single-shot image parser. CoRR, abs/1902.05093, 2019.
  • Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM international conference on Multimedia, pages 516–520. ACM, 2016.
Authors
Arjun Bhargava
Jerome Lynch