
Temporally Distributed Networks for Fast Video Semantic Segmentation

CVPR, pp. 8815-8824, 2020


Abstract

We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Leveraging the inherent temporal continuity in videos, we...

Introduction
  • Video semantic segmentation aims to assign pixel-wise semantic labels to video frames.
  • The recent successes in dense labeling tasks [4, 20, 25, 28, 52, 56, 58, 61] have revealed that strong feature representations are critical for accurate segmentation results.
  • The most straightforward strategy for video semantic segmentation is to apply a deep image segmentation model to each frame independently, but this strategy does not leverage the temporal information present in dynamic video scenes.
Highlights
  • Video semantic segmentation aims to assign pixel-wise semantic labels to video frames
  • We propose an Attention Propagation Module (APM), which is based on the non-local attention mechanism [49, 51, 59], but extended to deal with spatio-temporal variations for the video semantic segmentation task
  • In addition to transferring knowledge in the full feature space [13, 15, 29], we propose a grouped knowledge distillation loss to further transfer knowledge at the subspace level, making the information extracted from different paths more complementary to one another
  • We found that different orders of sub-networks achieve very similar mean Intersection-over-Union (mIoU) values, which indicates that the Temporally Distributed Network is stable with respect to sub-feature paths
  • As most previous methods for video semantic segmentation do not evaluate on this dataset, we found only one related work to compare against: STD2P [14]
  • We presented a novel temporally distributed network for fast semantic video segmentation
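The non-local attention idea behind the Attention Propagation Module can be sketched as dot-product attention between the current frame's pixels and features from a previous frame. The following is a minimal single-head NumPy sketch with illustrative shapes and names; it is not the paper's exact module, which additionally handles spatio-temporal variations:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_propagation(query_feat, memory_feat):
    # query_feat: (N, C) pixels of the current frame; memory_feat: (M, C)
    # pixels from a previous frame. Each query pixel aggregates memory
    # features weighted by normalized dot-product affinity (non-local attention).
    affinity = query_feat @ memory_feat.T                       # (N, M)
    weights = softmax(affinity / np.sqrt(query_feat.shape[1]), axis=1)
    return weights @ memory_feat                                # (N, C)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))   # 4 query pixels, 8 channels
m = rng.standard_normal((6, 8))   # 6 memory pixels from an earlier frame
out = attention_propagation(q, m)
print(out.shape)  # (4, 8)
```

Because the attention weights form a convex combination, each propagated feature stays within the range of the memory features, which is one reason this kind of aggregation is stable across frames.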
Methods
  • The knowledge distillation based training loss (Eq 6) consistently helps to improve performance on the three datasets.
  • When combined with the grouped knowledge distillation, performance is further boosted by nearly half a percent in terms of mIoU.
  • This shows the effectiveness of the grouped knowledge distillation loss.
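One way a grouped distillation term can complement a full-feature term is sketched below. This is a hedged illustration only: it assumes MSE distances and per-group pixel-affinity matrices as the transferred statistic, which may differ from the paper's Eq. 6:

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def affinity(f):
    # f: (c, H, W) -> (HW, HW) self-similarity of L2-normalized pixel features.
    x = f.reshape(f.shape[0], -1)                           # (c, HW)
    x = x / (np.linalg.norm(x, axis=0, keepdims=True) + 1e-8)
    return x.T @ x

def grouped_kd_loss(student, teacher, groups=2, alpha=1.0):
    # Full-feature distillation term plus an affinity-matching term
    # computed separately on each channel group (subspace level).
    full = mse(student, teacher)
    s_groups = np.array_split(student, groups, axis=0)
    t_groups = np.array_split(teacher, groups, axis=0)
    grouped = sum(mse(affinity(s), affinity(t))
                  for s, t in zip(s_groups, t_groups)) / groups
    return full + alpha * grouped

rng = np.random.default_rng(1)
s = rng.standard_normal((8, 4, 4))   # student feature map (C, H, W)
t = rng.standard_normal((8, 4, 4))   # teacher feature map
print(grouped_kd_loss(s, s))         # 0.0 when student matches teacher
print(grouped_kd_loss(s, t) > 0.0)   # True
```

Matching per-group affinities rather than raw per-group values avoids the grouped term collapsing into a rescaled copy of the full-feature MSE.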
Results
  • Cityscapes dataset: The authors compare the method with recent state-of-the-art models for semantic video segmentation in Table 1.
  • Compared with LVS [27], TD4-PSP18 achieves similar performance at half the average time cost, and TD2-PSP50 further improves accuracy by 3 percent in terms of mIoU.
  • With a total number of parameters similar to PSPNet101 [58], TD2-PSP50 halves the per-frame time cost from 360 ms.
  • Among high-efficiency models, BiseNet∗34 [54] reaches 76.0 mIoU and BiseNet∗101 [54] reaches 76.5 mIoU; TD4-Bise18 is evaluated against these.
Conclusion
  • The authors presented a novel temporally distributed network for fast semantic video segmentation.
  • By distributing the computation of feature maps across frames and merging them with a novel attention propagation module, the method retains high accuracy while significantly reducing the latency of processing video frames.
  • The authors show that a grouped knowledge distillation loss further boosts performance.
  • TDNet consistently outperforms previous methods in both accuracy and efficiency.
  • The authors thank Kate Saenko for the useful discussions and suggestions.
Tables
  • Table1: Evaluation on the Cityscapes dataset. The “Speed” and “Max Latency” represent the average and maximum per-frame time cost respectively
  • Table2: Evaluation of high-efficiency approaches on the Cityscapes dataset
  • Table3: Evaluation on the Camvid dataset
  • Table4: Evaluation on the NYUDepth dataset
  • Table5: The mIoU (%) for different components in our knowledge distillation loss (Eq 6) for TD4-PSP18
  • Table6: Effect of different downsampling stride n on Cityscapes
  • Table7: Comparisons on Cityscapes for using a shared subnetwork or independent sub-networks. The last column shows the baseline model corresponding to TDNet’s sub-network
  • Table8: Ablation study on TD4-PSP18 showing how performance decreases with progressively fewer sub-features accumulated
Related work
  • Image semantic segmentation is an active area of research that has witnessed significant improvements in performance with the success of deep learning [12, 16, 28, 43]. As a pioneering work, the Fully Convolutional Network (FCN) [30] replaced the last fully connected layer for classification with convolutional layers, thus allowing for dense label prediction. Based on this formulation, follow-up methods have been proposed for efficient segmentation [24, 37, 38, 39, 54, 57] or high-quality segmentation [4, 7, 11, 26, 34, 40, 42, 45, 46, 47].

    Semantic segmentation has also been widely applied to videos [14, 23, 31, 48], with different approaches employed to balance the trade-off between quality and speed. A number of methods leverage temporal context in a video by repeatedly applying the same deep model to each frame and temporally aggregating features with additional network layers [10, 19, 35]. Although these methods improve accuracy over single frame approaches, they incur additional computation over a per-frame model.
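The FCN conversion described above, reusing classifier weights as a 1×1 convolution so predictions become dense, can be illustrated with a toy NumPy example (shapes and names here are illustrative, not from any specific model):

```python
import numpy as np

rng = np.random.default_rng(2)
C_in, n_classes, H, W = 8, 3, 5, 5

# Weights that would sit in a fully connected classifier after pooling;
# reused directly as a 1x1 convolution over the spatial feature map.
W_fc = rng.standard_normal((n_classes, C_in))
features = rng.standard_normal((C_in, H, W))

scores = np.einsum('kc,chw->khw', W_fc, features)  # dense class scores (3, 5, 5)
labels = scores.argmax(axis=0)                     # per-pixel label map (5, 5)

# The 1x1 "convolution" at any pixel equals the FC classifier applied there.
print(np.allclose(scores[:, 2, 3], W_fc @ features[:, 2, 3]))  # True
```

This equivalence is why a pretrained image classifier can be "convolutionized" into a dense predictor without retraining its final layer from scratch.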
Funding
  • This work was supported in part by DARPA and NSF, and by gift funding from Adobe Research
References
  • [1] Gabriel J Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, 2008.
  • [2] Joao Carreira, Viorica Patraucean, Laurent Mazare, Andrew Zisserman, and Simon Osindero. Massively parallel video networks. In ECCV, 2018.
  • [3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE T-PAMI, 2017.
  • [4] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  • [5] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [7] Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In CVPR, 2018.
  • [8] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
  • [9] Howard Whitley Eves. Elementary matrix theory. 1980.
  • [10] Raghudeep Gadde, Varun Jampani, and Peter V Gehler. Semantic video CNNs through representation warping. In CVPR, 2017.
  • [11] Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multi-scale filters for semantic segmentation. In ICCV, 2019.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [13] Tong He, Chunhua Shen, Zhi Tian, Dong Gong, Changming Sun, and Youliang Yan. Knowledge adaptation for efficient semantic segmentation. In CVPR, 2019.
  • [14] Yang He, Wei-Chen Chiu, Margret Keuper, and Mario Fritz. STD2P: RGBD semantic segmentation using spatio-temporal data-driven pooling. In CVPR, 2017.
  • [15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [16] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [17] Yani Ioannou, Duncan Robertson, Roberto Cipolla, and Antonio Criminisi. Deep roots: Improving CNN efficiency with hierarchical filter groups. In CVPR, 2017.
  • [18] Samvit Jain, Xin Wang, and Joseph E Gonzalez. Accel: A corrective fusion network for efficient semantic segmentation on video. In CVPR, 2019.
  • [19] Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, et al. Video scene parsing with predictive feature learning. In ICCV, 2017.
  • [20] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
  • [21] Ivan Kreso, Sinisa Segvic, and Josip Krapac. Ladder-style DenseNets for semantic segmentation of large natural images. In ICCV Workshop, 2017.
  • [22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [23] Abhijit Kundu, Vibhav Vineet, and Vladlen Koltun. Feature space optimization for semantic video segmentation. In CVPR, 2016.
  • [24] Hanchao Li, Pengfei Xiong, Haoqiang Fan, and Jian Sun. DFANet: Deep feature aggregation for real-time semantic segmentation. In CVPR, 2019.
  • [25] Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. Expectation-maximization attention networks for semantic segmentation. In ICCV, 2019.
  • [26] Yanwei Li, Xinze Chen, Zheng Zhu, Lingxi Xie, Guan Huang, Dalong Du, and Xingang Wang. Attention-guided unified network for panoptic segmentation. In CVPR, 2019.
  • [27] Yule Li, Jianping Shi, and Dahua Lin. Low-latency video semantic segmentation. In CVPR, 2018.
  • [28] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019.
  • [29] Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In CVPR, 2019.
  • [30] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [31] Behrooz Mahasseni, Sinisa Todorovic, and Alan Fern. Budget-aware deep semantic video segmentation. In CVPR, 2017.
  • [32] Davide Mazzini. Guided upsampling network for real-time semantic segmentation. In BMVC, 2018.
  • [33] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  • [34] Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid. Architecture search of dynamic cells for semantic video segmentation. In WACV, 2020.
  • [35] David Nilsson and Cristian Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. In CVPR, 2018.
  • [36] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, 2019.
  • [37] Marin Orsic, Ivan Kreso, Petra Bevandic, and Sinisa Segvic. In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images. In CVPR, 2019.
  • [38] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
  • [39] Matthieu Paul, Christoph Mayer, Luc Van Gool, and Radu Timofte. Efficient video semantic segmentation with labels propagation and refinement. In WACV, 2020.
  • [40] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters - improve semantic segmentation by global convolutional network. In CVPR, 2017.
  • [41] Evan Shelhamer, Kate Rakelly, Judy Hoffman, and Trevor Darrell. Clockwork convnets for video semantic segmentation. In ECCV, 2016.
  • [42] Bing Shuai, Zhen Zuo, Bing Wang, and Gang Wang. DAG-recurrent neural networks for scene labeling. In CVPR, 2016.
  • [43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [44] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
  • [45] Towaki Takikawa, David Acuna, Varun Jampani, and Sanja Fidler. Gated-SCNN: Gated shape CNNs for semantic segmentation. In ICCV, 2019.
  • [46] Meng Tang, Abdelaziz Djelouah, Federico Perazzi, Yuri Boykov, and Christopher Schroers. Normalized cut loss for weakly-supervised CNN segmentation. In CVPR, 2018.
  • [47] Meng Tang, Federico Perazzi, Abdelaziz Djelouah, Ismail Ben Ayed, Christopher Schroers, and Yuri Boykov. On regularized losses for weakly-supervised CNN segmentation. In ECCV, 2018.
  • [48] Subarna Tripathi, Serge Belongie, Youngbae Hwang, and Truong Nguyen. Semantic video segmentation: Exploring inference efficiency. In ISOCC, IEEE.
  • [49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • [50] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In NIPS, 2016.
  • [51] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
  • [52] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. Pattern Recognition, 2019.
  • [53] Yu-Syuan Xu, Tsu-Jui Fu, Hsuan-Kung Yang, and Chun-Yi Lee. Dynamic video segmentation network. In CVPR, 2018.
  • [54] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 2018.
  • [55] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
  • [56] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.
  • [57] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. ICNet for real-time semantic segmentation on high-resolution images. In ECCV, 2018.
  • [58] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
  • [59] Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, and Jifeng Dai. An empirical study of spatial attention mechanisms in deep networks. In ICCV, 2019.
  • [60] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In CVPR, 2017.
  • [61] Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, and Bryan Catanzaro. Improving semantic segmentation via video propagation and label relaxation. In CVPR, 2019.
  • [62] Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai. Asymmetric non-local neural networks for semantic segmentation. In ICCV, 2019.