Memory Enhanced Global-Local Aggregation for Video Object Detection

CVPR, pp. 10334-10343, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.01035

Abstract:

How do humans recognize an object in a piece of video? Due to the deteriorated quality of a single frame, it may be hard for people to identify an occluded object in this frame by using information within one image alone. We argue that there are two important cues for humans to recognize objects in videos: the global semantic information and the local localization information.

Introduction
  • What distinguishes detecting objects in videos from detecting them in static images? A quick answer: the information that lies in the temporal dimension.
  • When people are not certain about the identity of an object, they seek a distinct object in other frames that shares high semantic similarity with the current object and assign the two the same identity (a minimal sketch of this matching idea follows this list)
  • The authors refer to this cue as global semantic information, since in principle every frame in the video can supply it
  • The authors note that semantic information alone cannot tell where an instance is, as its existence has not yet been confirmed in the key frame
  • This problem could be alleviated if nearby frames were given.
  • People identify objects mainly with these two sources of information
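A minimal sketch of the "global semantic similarity" cue described above, assuming ROI features are available as plain tensors; the function name, feature dimension, and shapes are illustrative assumptions, not the authors' code:

```python
# Hypothetical illustration: rank proposals from other frames by semantic
# similarity to one key-frame proposal (the "global" cue described above).
import torch
import torch.nn.functional as F

def most_similar_proposals(key_feat, other_feats, top_k=5):
    """Return indices of the top_k proposals in other_feats that are most
    semantically similar to key_feat (cosine similarity on ROI features).

    key_feat:    (d,)   feature of one proposal in the key frame
    other_feats: (N, d) features of proposals gathered from other frames
    """
    sims = F.cosine_similarity(key_feat.unsqueeze(0), other_feats, dim=1)  # (N,)
    return sims.topk(min(top_k, other_feats.size(0))).indices

# Toy usage: one key-frame proposal against 100 proposals from other frames.
key = torch.randn(1024)
others = torch.randn(100, 1024)
print(most_similar_proposals(key, others))
```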
Highlights
  • What distinguishes detecting objects in videos from detecting them in static images? A quick answer: the information that lies in the temporal dimension
  • In the second stage, we introduce a novel Long Range Memory (LRM) module which enables the key frame to access much more content than any previous method
  • Thanks to the enormous aggregation size of memory enhanced global-local aggregation empowered by Long Range Memory, we achieve 85.4% mAP on the ImageNet VID dataset, which is the best reported result to date
  • We argue that the superior performance of memory enhanced global-local aggregation is brought by the novel Long Range Memory module, which enables one frame to gather information efficiently from much longer content, both globally and locally (see the aggregation sketch after this list)
  • We present Memory Enhanced Global-Local Aggregation Network (MEGA), which takes a joint view of both global and local information aggregation to solve video object detection
  • Experiments conducted on ImageNet VID dataset validate the effectiveness of our method
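The following is a hedged sketch of what attention-style global-local aggregation could look like: key-frame proposal features attend to a pooled set of local, global, and memory features. A standard multi-head attention layer stands in for the paper's relation modules; all names, shapes, and sizes are assumptions, not MEGA's actual implementation.

```python
# Sketch only: key-frame proposal features are enhanced by attending to a
# reference pool (local + global + memory features). Requires PyTorch >= 1.9.
import torch
import torch.nn as nn

class SimpleRelationAggregator(nn.Module):
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        # Multi-head attention stands in for the relation module used in MEGA.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, key_feats, ref_feats):
        """key_feats: (1, Nk, d) proposal features of the key frame
           ref_feats: (1, Nr, d) reference pool (local + global + memory)"""
        enhanced, _ = self.attn(key_feats, ref_feats, ref_feats)
        return self.norm(key_feats + enhanced)  # residual update of key features

# Toy usage: 30 key-frame proposals attend to a pool of 300 reference features.
agg = SimpleRelationAggregator()
key = torch.randn(1, 30, 1024)
pool = torch.randn(1, 300, 1024)  # a larger pool mimics an enlarged aggregation size
print(agg(key, pool).shape)  # torch.Size([1, 30, 1024])
```

The point this illustrates is that enlarging the reference pool (the second input) is cheap at aggregation time, which is what the Long Range Memory exploits to give each key frame access to far more content.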
Methods
  • Compared methods and backbones (flattened rows from Tables 1 and 2): FGFA [36], MANet [27], THP [35], STSN [1], OGEMN [6], SELSA [30], RDN [7], and MEGA, evaluated with ResNet-101, ResNet-101+DCN, ResNeXt-101, Inception-ResNet, and Inception-v4 backbones; see the Tables section below for the structured comparison.
  • This behavior is similar to [5] but with a different motivation: the authors would like the model to pay more attention to the most adjacent frames (a toy sketch of such a memory follows)
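Below is a toy sketch of the long-range caching idea referenced above, in the spirit of Transformer-XL [5]: features of already-processed frames are detached and cached so later key frames can read a much longer context without recomputation. The class, sizes, and dimensions are hypothetical illustrations, not the paper's module.

```python
# Hypothetical Long Range Memory sketch: cache detached per-frame features and
# expose them as extra reference content for later frames.
import torch
from collections import deque

class LongRangeMemory:
    def __init__(self, memory_size=25):
        # Each element holds the enhanced proposal features of one processed frame.
        self.frames = deque(maxlen=memory_size)

    def update(self, frame_feats):
        # Detach so cached features are reused without gradient flow,
        # mirroring the memory mechanism of Transformer-XL [5].
        self.frames.append(frame_feats.detach())

    def read(self):
        # Concatenate all cached frames into one reference pool: (sum Ni, d).
        if not self.frames:
            return torch.empty(0, 0)
        return torch.cat(list(self.frames), dim=0)

# Toy usage: memory keeps at most `memory_size` most recent frames.
mem = LongRangeMemory(memory_size=3)
for _ in range(5):
    mem.update(torch.randn(30, 1024))
print(mem.read().shape)  # torch.Size([90, 1024]) -- only the last 3 frames kept
```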
Results
  • With the ResNet-101 backbone, MEGA achieves 82.9% mAP, a 1.1% absolute improvement over the strongest competitor RDN.
  • By replacing the backbone feature extractor from ResNet-101 with the stronger ResNeXt-101, the method achieves a better performance of 84.1% mAP, as expected.
  • RDN is the most representative method of the local aggregation scheme, while SELSA is the most representative of the global aggregation scheme.
  • RDN only models relationships within a short local temporal range, while SELSA only builds sparse global connections.
  • As discussed in Section 1, these methods may suffer from ineffective and insufficient approximation.
Conclusion
  • The authors present Memory Enhanced Global-Local Aggregation Network (MEGA), which takes a joint view of both global and local information aggregation to solve video object detection.
  • The authors first thoroughly analyze the ineffective and insufficient problems that exist in recent methods.
  • Afterwards, a novel Long Range Memory is introduced to solve the insufficient problem.
  • Experiments conducted on ImageNet VID dataset validate the effectiveness of the method
Tables
  • Table1: Performance comparison with state-of-the-art end-to-end video object detection models on ImageNet VID validation set
  • Table2: Performance comparison with state-of-the-art video object detection models with post-processing methods (e.g. SeqNMS, Tube Rescoring, BLR)
  • Table3: Performance of base model and MEGA
  • Table4: Ablation study on global and local feature aggregation. Nl and Ng are the numbers of relation modules in the local and global aggregation stages, respectively. Setting Nl or Ng to 0 removes the influence of local or global information
  • Table5: Ablation study on different global reference frame number Tg, local reference frame number Tl, number of relation modules in the local aggregation stage Nl, and memory size Tm. Default parameters are indicated by *
Related work
  • Object Detection from Images. Current leading object detectors are built upon deep Convolutional Neural Networks (CNNs) [17, 24, 25, 14, 3] and can be classified into two main families, namely, anchor-based detectors (e.g., R-CNN [11], Fast(er) R-CNN [10, 21], Cascade R-CNN [2]) and anchor-free detectors (e.g., CornerNet [18], ExtremeNet [34]). Our method is built upon Faster R-CNN with ResNet-101, one of the state-of-the-art object detectors.

    Object Detection in Videos. Due to the complex variations in videos, e.g., motion blur, occlusion and out of focus, it is not trivial to generalize the success of image detectors to the video domain. The main focus of recent methods [16, 12, 37, 36, 35, 9, 27, 1, 31, 7, 30] for video object detection is improving per-frame detection performance by exploiting information in the temporal dimension. These methods can be categorized into local aggregation methods and global aggregation methods.
Funding
  • This work is supported by National Key R&D Program of China (2018YFB1402600), BJNSF (L172037), Beijing Academy of Artificial Intelligence and Zhejiang Lab
Reference
  • [1] Gedas Bertasius, Lorenzo Torresani, and Jianbo Shi. Object detection in video with spatiotemporal sampling networks. In ECCV, pages 342–357, 2018.
  • [2] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: delving into high quality object detection. In CVPR, pages 6154–6162, 2018.
  • [3] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In ICCV Workshops, 2019.
  • [4] Kai Chen, Jiaqi Wang, Shuo Yang, Xingcheng Zhang, Yuanjun Xiong, Chen Change Loy, and Dahua Lin. Optimizing video object detection via a scale-time lattice. In CVPR, pages 7814–7823, 2018.
  • [5] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, pages 2978–2988, 2019.
  • [6] Hanming Deng, Yang Hua, Tao Song, Zongpu Zhang, Zhengui Xue, Ruhui Ma, Neil Robertson, and Haibing Guan. Object guided external memory network for video object detection. In ICCV, 2019.
  • [7] Jiajun Deng, Yingwei Pan, Ting Yao, Wengang Zhou, Houqiang Li, and Tao Mei. Relation distillation networks for video object detection. In ICCV, 2019.
  • [8] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, pages 2758–2766, 2015.
  • [9] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In ICCV, pages 3057–3065, 2017.
  • [10] Ross B. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
  • [11] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
  • [12] Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S. Huang. Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465, 2016.
  • [13] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross B. Girshick. Mask R-CNN. In ICCV, pages 2980–2988, 2017.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [15] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In CVPR, pages 3588–3597, 2018.
  • [16] Kai Kang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Object detection from video tubelets with convolutional neural networks. In CVPR, pages 817–825, 2016.
  • [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1106–1114, 2012.
  • [18] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, pages 765–781, 2018.
  • [19] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018.
  • [20] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, pages 9225–9234, 2019.
  • [21] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, pages 91–99, 2015.
  • [22] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [23] Mykhailo Shvets, Wei Liu, and Alexander C. Berg. Leveraging long-range temporal relationships between proposals for video object detection. In ICCV, pages 9755–9763, 2019.
  • [24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
  • [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
  • [27] Shiyao Wang, Yucong Zhou, Junjie Yan, and Zhidong Deng. Fully motion-aware network for video object detection. In ECCV, 2018.
  • [28] Sanghyun Woo, Dahun Kim, Donghyeon Cho, and In So Kweon. LinkNet: Relational embedding for scene graph. In NeurIPS, pages 558–568, 2018.
  • [29] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross B. Girshick. Long-term feature banks for detailed video understanding. In CVPR, pages 284–293, 2019.
  • [30] Haiping Wu, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Sequence level semantics aggregation for video object detection. In ICCV, 2019.
  • [31] Fanyi Xiao and Yong Jae Lee. Video object detection with an aligned spatial-temporal memory. In ECCV, pages 494–510, 2018.
  • [32] Saining Xie, Ross B. Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 5987–5995, 2017.
  • [33] Jiarui Xu, Yue Cao, Zheng Zhang, and Han Hu. Spatial-temporal relation networks for multi-object tracking. In ICCV, pages 3988–3998, 2019.
  • [34] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krahenbuhl. Bottom-up object detection by grouping extreme and center points. In CVPR, pages 850–859, 2019.
  • [35] Xizhou Zhu, Jifeng Dai, Lu Yuan, and Yichen Wei. Towards high performance video object detection. In CVPR, pages 7210–7218, 2018.
  • [36] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In ICCV, 2017.
  • [37] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In CVPR, 2017.