Segmenting Transparent Objects in the Wild with Transformer.

IJCAI, pp. 1194-1200, 2021

Abstract

This work presents a new fine-grained transparent object segmentation dataset, termed Trans10K-v2, extending Trans10K-v1, the first large-scale transparent object segmentation dataset. Unlike Trans10K-v1, which has only two limited categories, our new dataset has several appealing benefits. (1) It has 11 fine-grained categories of transparent objects...

Introduction
  • Modern robots, mainly mobile robots and mechanical manipulators, would benefit greatly from efficient perception of transparent objects in residential environments, since these environments vary drastically.
  • The increasing use of glass walls and transparent doors in building interiors, and of glass cups and bottles in residential rooms, leads to erroneous detections by various range sensors.
  • Most systems perceive the environment through multi-sensor data fusion with sonars or lidars.
  • These sensors are relatively reliable at detecting opaque objects, but still suffer from scan mismatching caused by transparent objects.
  • The reflection, refraction, and projection of light from transparent objects may confuse the sensors.
  • A reliable vision-based method, which is much cheaper and more robust than high-precision sensors, would therefore be an efficient alternative.
Highlights
  • Modern robots, mainly mobile robots and mechanical manipulators, would benefit greatly from efficient perception of transparent objects in residential environments, since these environments vary drastically
  • By formulating semantic segmentation as a problem of dictionary look-up, we design a set of learnable prototypes as the queries of Trans2Seg's transformer decoder, where each prototype learns the statistics of one category over the whole dataset (see the decoder sketch after this list)
  • We propose a fine-grained transparent object segmentation dataset termed Trans10K-v2 with more elaborately defined categories
  • We evaluate more than 20 semantic segmentation methods on Trans10K-v2, and our Trans2Seg significantly outperforms them
  • We present a new fine-grained transparent object segmentation dataset with 11 common categories, termed Trans10K-v2, whose data is based on the previous Trans10K
  • In Trans2Seg, the transformer encoder provides a global receptive field, which is essential for transparent object segmentation
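To make the dictionary look-up idea above concrete, here is a minimal PyTorch sketch of a query-based mask decoder. It is an illustration under stated assumptions, not the authors' released code: each of 12 classes (11 transparent categories plus an assumed background class) owns one learnable prototype that cross-attends over the encoder's pixel features, and its dot product with every pixel feature gives that category's mask logits. The module name, feature size, and head count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PrototypeMaskDecoder(nn.Module):
    """Sketch of segmentation as dictionary look-up: one learnable
    query (prototype) per category attends over the encoder's pixel
    features; its similarity to each pixel yields the mask logits."""

    def __init__(self, num_classes=12, dim=256, num_heads=8):
        super().__init__()
        # One learnable prototype per category (11 transparent classes + background).
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):
        # feats: (B, HW, C) flattened encoder features.
        queries = self.prototypes.unsqueeze(0).expand(feats.size(0), -1, -1)
        # Each prototype gathers the statistics of its category from all pixels.
        queries, _ = self.cross_attn(queries, feats, feats)
        # Dictionary look-up: mask logits = pixel-prototype similarity.
        return torch.einsum('bkc,bnc->bkn', queries, feats)  # (B, K, HW)

feats = torch.randn(2, 32 * 32, 256)        # a 32x32 feature map, flattened
print(PrototypeMaskDecoder()(feats).shape)  # torch.Size([2, 12, 1024])
```

The (B, K, HW) logits can then be reshaped to (B, K, H, W) and upsampled for per-pixel classification.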
Results
  • From these results, the authors observe that Trans2Seg outputs much higher-quality transparent object segmentation masks than other methods.
  • In some hard cases, even humans would fail to distinguish these transparent objects
Conclusion
  • The authors present a new fine-grained transparent object segmentation dataset with 11 common categories, termed Trans10K-v2, whose data is based on the previous Trans10K.
  • The authors discuss the challenges and practical value of the proposed dataset.
  • The authors propose a transformer-based pipeline, termed Trans2Seg, to solve this challenging task.
  • In Trans2Seg, the transformer encoder provides a global receptive field, which is essential for transparent object segmentation (see the encoder sketch after this list).
  • The authors model segmentation as dictionary look-up with a set of learnable queries, where each query represents one category.
  • The authors evaluate more than 20 mainstream semantic segmentation methods and show that Trans2Seg clearly surpasses these CNN-based segmentation methods
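As a companion to the decoder sketch above, the global-receptive-field claim can be illustrated with a hybrid CNN + transformer encoder: CNN features are flattened into a token sequence so that a single self-attention layer connects every spatial location to every other one, which a convolution's local window cannot do. This is a hedged sketch under assumed sizes (backbone channels, embedding dim, layer count), not the exact Trans2Seg encoder; positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class GlobalContextEncoder(nn.Module):
    """Sketch of a hybrid CNN + transformer encoder: self-attention over
    flattened CNN features gives every pixel a global receptive field."""

    def __init__(self, in_ch=2048, dim=256, num_layers=4, num_heads=8):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=1)  # channel reduction
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, cnn_feats):
        # cnn_feats: (B, C, H, W) from a CNN backbone, e.g. a ResNet stage.
        x = self.proj(cnn_feats)               # (B, dim, H, W)
        tokens = x.flatten(2).transpose(1, 2)  # (B, HW, dim) token sequence
        return self.encoder(tokens)            # globally mixed features

out = GlobalContextEncoder()(torch.randn(2, 2048, 16, 16))
print(out.shape)  # torch.Size([2, 256, 256]) -- (B, HW, dim)
```

The output tokens play the role of the `feats` consumed by the decoder sketch earlier.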
Tables
  • Table 1: Statistics of Trans10K-v2. 'CMCC' denotes the mean number of connected components of each category. 'image num' denotes the number of images. 'pixel ratio' is the fraction of all transparent-object pixels in Trans10K-v2 that a given category accounts for
  • Table 2: Effectiveness of the transformer encoder and decoder. 'Trans.' indicates transformer; 'Enc.' and 'Dec.' mean encoder and decoder
  • Table 3: Performance of the transformer at different scales. 'e{a}-n{b}-m{c}' denotes a transformer with 'a' embedding dims, 'b' layers, and an MLP ratio of 'c' (see the parsing helper after this list)
  • Table 4: Evaluated state-of-the-art semantic segmentation methods, sorted by FLOPs. Our proposed Trans2Seg surpasses all the other methods in pixel accuracy and mean IoU, as well as in most of the per-category IoUs (8 of 11)
  • Table 5: The upper part of the table gives the number of scenes; the lower part gives the interaction pattern of each category
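The 'e{a}-n{b}-m{c}' convention from the Table 3 caption can be decoded mechanically. The helper below is purely illustrative (the function name and the example tag are assumptions, not values from the paper):

```python
import re

def parse_scale_tag(tag: str) -> dict:
    """Split a scale tag like 'e256-n4-m8' into embedding dims,
    transformer layers, and MLP ratio, per the Table 3 caption."""
    match = re.fullmatch(r'e(\d+)-n(\d+)-m(\d+)', tag)
    if match is None:
        raise ValueError(f'unrecognized scale tag: {tag!r}')
    embed, layers, mlp = map(int, match.groups())
    return {'embed_dim': embed, 'num_layers': layers, 'mlp_ratio': mlp}

print(parse_scale_tag('e256-n4-m8'))
# {'embed_dim': 256, 'num_layers': 4, 'mlp_ratio': 8}
```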
Related Work
  • Semantic Segmentation. In the deep learning era, convolutional neural networks (CNNs) have driven the development of semantic segmentation on various datasets, such as ADE20K, Cityscapes, and PASCAL VOC. One of the pioneering approaches, FCN [Long et al, 2015], turns semantic segmentation into an end-to-end fully convolutional classification network. To improve performance, especially around object boundaries, [Chen et al, 2017; Lin et al, 2016; Zheng et al, 2015] propose to use a structured prediction module, conditional random fields (CRFs) [Chen et al, 2014], to refine the network output. Dramatic improvements in performance and inference speed have been driven by aggregating features at multiple scales, for example in PSPNet [Zhao et al, 2017] and DeepLab [Chen et al, 2017; Chen et al, 2018b], and by propagating structured information across intermediate CNN representations [Gadde et al, 2016; Liu et al, 2017; Wang et al, 2018].

    Transparent Object Datasets. [Xu et al, 2015] introduces the TransCut dataset, which contains only 49 images of 7 unique objects. To generate segmentation results, [Xu et al, 2015] optimizes an energy function based on LF-linearity, which also requires light-field cameras. [Chen et al, 2018a] proposed TOM-Net, which contains 876 real images and 178K synthetic images generated by POV-Ray; however, only 4 unique objects are used to synthesize the training data. Recently, [Xie et al, 2020] introduced the first large-scale real-world transparent object segmentation dataset, termed Trans10K, with more than 10K images. However, it has only two categories, which limits its practical use. In this work, our Trans10K-v2 inherits the data and annotates 11 fine-grained categories.
Key Results
  • As shown in Table 2, the FCN baseline without the transformer encoder achieves 62.7% mIoU; adding the transformer encoder improves mIoU by 6.1%, to 68.8%
  • With our transformer decoder, mIoU further boosts to 72.1%, a 3.3% improvement
  • Our method is 2.1% higher than TransLab, the previous SOTA method
  • We also find that our method tends to perform much better on small objects, such as 'bottle' and 'eyeglass' (10.0% and 5.0% higher than the previous SOTA); the mIoU metric behind these numbers is sketched below
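All of the numbers above are mean IoU. For readers unfamiliar with the metric, here is a generic NumPy sketch of pixel accuracy and mIoU computed from a confusion matrix; it is a standard formulation, not the paper's evaluation script, and the class count of 12 is an assumption (11 categories plus background).

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes=12):
    """Pixel accuracy and mean IoU from flat label arrays,
    via a num_classes x num_classes confusion matrix."""
    valid = (gt >= 0) & (gt < num_classes)
    conf = np.bincount(num_classes * gt[valid] + pred[valid],
                       minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)  # rows: gt, cols: pred
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    pixel_acc = inter.sum() / conf.sum()
    miou = (inter / np.maximum(union, 1)).mean()   # guard empty classes
    return pixel_acc, miou

pred = np.random.randint(0, 12, 4096)
gt = np.random.randint(0, 12, 4096)
acc, miou = segmentation_metrics(pred, gt)
print(f'pixel acc: {acc:.3f}, mIoU: {miou:.3f}')
```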
References
  • [Carion et al., 2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [Chao et al., 2019] Ping Chao, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, and Youn-Long Lin. HarDNet: A low memory traffic network. In ICCV, 2019.
  • [Chen et al., 2014] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv, 2014.
  • [Chen et al., 2017] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2017.
  • [Chen et al., 2018a] Guanying Chen, Kai Han, and Kwan-Yee K. Wong. TOM-Net: Learning transparent object matting from a single image. In CVPR, 2018.
  • [Chen et al., 2018b] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  • [Chen et al., 2020] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364, 2020.
  • [Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
  • [Dosovitskiy et al., 2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [Everingham and Winn, 2011] Mark Everingham and John Winn. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) development kit. Tech. Rep., 2011.
  • [Foster et al., 2013] Paul Foster, Zhenghong Sun, Jong Jin Park, and Benjamin Kuipers. VisAGGE: Visible angle grid for glass environments. In ICRA, 2013.
  • [Fu et al., 2019] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
  • [Gadde et al., 2016] Raghudeep Gadde, Varun Jampani, Martin Kiefel, Daniel Kappler, and Peter V. Gehler. Superpixel convolutional networks using bilateral inceptions. In ECCV, 2016.
  • [Gehring et al., 2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
  • [Han et al., 2020] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on visual transformer. arXiv preprint arXiv:2012.12556, 2020.
  • [He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [Jin et al., 2019] Qiangguo Jin, Zhaopeng Meng, Tuan D. Pham, Qi Chen, Leyi Wei, and Ran Su. DUNet: A deformable network for retinal vessel segmentation. Knowledge-Based Systems, 2019.
  • [Kim and Chung, 2016] Jiwoong Kim and Woojin Chung. Localization of a mobile robot using a laser range finder in a glass-walled environment. IEEE Transactions on Industrial Electronics, 63(6):3616-3627, 2016.
  • [Klank et al., 2011] Ulrich Klank, Daniel Carton, and Michael Beetz. Transparent object detection and reconstruction on a mobile platform. In ICRA, 2011.
  • [Li et al., 2019a] Gen Li, Inyoung Yun, Jonghyun Kim, and Joongkyu Kim. DABNet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. arXiv, 2019.
  • [Li et al., 2019b] Hanchao Li, Pengfei Xiong, Haoqiang Fan, and Jian Sun. DFANet: Deep feature aggregation for real-time semantic segmentation. In CVPR, 2019.
  • [Lin et al., 2016] Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Ian Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, 2016.
  • [Lin et al., 2017] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
  • [Liu and Yin, 2019] Mengyu Liu and Hujun Yin. Feature pyramid encoding network for real-time semantic segmentation. arXiv, 2019.
  • [Liu et al., 2017] Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, and Jan Kautz. Learning affinity via spatial propagation networks. In NIPS, 2017.
  • [Long et al., 2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [Mehta et al., 2019] Sachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh Hajishirzi. ESPNetv2: A light-weight, power efficient, and general purpose convolutional neural network. In CVPR, 2019.
  • [Mei et al., 2020] Haiyang Mei, Xin Yang, Yang Wang, Yuanyuan Liu, Shengfeng He, Qiang Zhang, Xiaopeng Wei, and Rynson W.H. Lau. Don't hit me! Glass detection in real-world scenes. In CVPR, 2020.
  • [Meinhardt et al., 2021] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. TrackFormer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702, 2021.
  • [Poudel et al., 2018] Rudra P.K. Poudel, Ujwal Bonde, Stephan Liwicki, and Christopher Zach. ContextNet: Exploring context and detail for semantic segmentation in real-time. arXiv, 2018.
  • [Poudel et al., 2019] Rudra P.K. Poudel, Stephan Liwicki, and Roberto Cipolla. Fast-SCNN: Fast semantic segmentation network. arXiv, 2019.
  • [Ronneberger et al., 2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [Singh et al., 2018] Ravinder Singh, Kuldeep Singh Nagla, John Page, and John Page. Multi-data sensor fusion framework to detect transparent object for the efficient mobile robot mapping. International Journal of Intelligent Unmanned Systems, 2018.
  • [Spataro et al., 2015] R. Spataro, R. Sorbello, S. Tramonte, G. Tumminello, M. Giardina, A. Chella, and V. La Bella. Reaching and grasping a glass of water by locked-in ALS patients through a BCI-controlled humanoid robot. Frontiers in Human Neuroscience, 2015.
  • [Sun et al., 2020] Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, and Ping Luo. TransTrack: Multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
  • [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • [Wang et al., 2018] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
  • [Wang et al., 2019a] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. arXiv, 2019.
  • [Wang et al., 2019b] Yu Wang, Quan Zhou, Jia Liu, Jian Xiong, Guangwei Gao, Xiaofu Wu, and Longin Jan Latecki. LEDNet: A lightweight encoder-decoder network for real-time semantic segmentation. In ICIP, 2019.
  • [Wang et al., 2020] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503, 2020.
  • [Xie et al., 2020] Enze Xie, Wenjia Wang, Wenhai Wang, Mingyu Ding, Chunhua Shen, and Ping Luo. Segmenting transparent objects in the wild. arXiv preprint arXiv:2003.13948, 2020.
  • [Xu et al., 2015] Yichao Xu, Hajime Nagahara, Atsushi Shimada, and Rin-ichiro Taniguchi. TransCut: Transparent object segmentation from a light-field image. In ICCV, 2015.
  • [Yang et al., 2018] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. DenseASPP for semantic segmentation in street scenes. In CVPR, 2018.
  • [Yu et al., 2018] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 2018.
  • [Yuan and Wang, 2018] Yuhui Yuan and Jingdong Wang. OCNet: Object context network for scene parsing. arXiv, 2018.
  • [Zhao et al., 2017] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
  • [Zhao et al., 2018] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. ICNet for real-time semantic segmentation on high-resolution images. In ECCV, 2018.
  • [Zheng et al., 2015] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H.S. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
  • [Zheng et al., 2020] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2012.15840, 2020.
  • [Zhou et al., 2018] Zheming Zhou, Zhiqiang Sui, and Odest Chadwicke Jenkins. Plenoptic Monte Carlo object localization for robot grasping under layered translucency. In IROS, 2018.
  • [Zhu et al., 2020] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
Authors
Wenjia Wang
Wenhai Wang
Peize Sun
Hang Xu
Ding Liang