Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification
CVPR, pp. 10404-10413, 2020.
Abstract:
Video-based person re-identification (reID) aims at matching the same person across video clips. It is a challenging task due to the existence of redundancy among frames, newly revealed appearance, occlusion, and motion blur. In this paper, we propose an attentive feature aggregation module, namely Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA), for aggregating spatial-temporal features into a discriminative video-level feature representation.
Introduction
- Person re-identification aims at matching persons in different positions, times, and camera views.
- With the prevalence of video capturing systems, video-based person reID offers a larger capacity for achieving robust performance.
- As illustrated in Figure 1, for a video clip, the visible contents of different frames differ but there are overlaps/redundancy.
- The multiple frames of a video clip/sequence can provide more comprehensive information, e.g., attribute cues such as gender and age.
- [Figure 1: video sequence of an example person]
Highlights
- Person re-identification aims at matching persons in different positions, times, and camera views
- Considering the characteristics of video, we propose to construct a small but representative set of reference feature nodes (S-RFNs) for globally modelling the pairwise relations, instead of using all the original feature nodes
- We propose an effective attention module, namely Multi-Granularity Reference-aided Global Attention (MG-RAFA), for spatial-temporal feature aggregation to obtain a video-level feature vector
- We propose the Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA), which derives the attention from global relations and introduces a hierarchical design, aiming at capturing discriminative spatial and temporal information at different semantic levels (a minimal sketch follows this list)
- We propose a Multi-Granularity Reference-aided Attentive Feature Aggregation scheme (MG-RAFA) for video-based person re-identification, which effectively enhances discriminative features and suppresses identity-irrelevant features on the spatial and temporal feature representations
- Our scheme achieves state-of-the-art performance on three benchmark datasets
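The highlights above outline the mechanism; the following PyTorch sketch makes it concrete under stated assumptions. It splits the channels into N granularity groups, builds a small set of reference nodes by temporally averaging the frame features (only one of the S-RFN selection strategies compared in Table 3), scores every spatial-temporal feature node from its vector of relations to the references, and aggregates the nodes with a global softmax. The class names, the embedding dimension, and the two-layer scoring head are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class RefAidedAttentionSketch(nn.Module):
    """Reference-aided attentive aggregation for one channel group
    (one granularity level). A minimal sketch, not the paper's exact design."""

    def __init__(self, channels: int, h: int, w: int, embed_dim: int = 64):
        super().__init__()
        n_refs = h * w  # one reference node per spatial position (assumption)
        self.query = nn.Conv2d(channels, embed_dim, 1)  # embed feature nodes
        self.key = nn.Conv2d(channels, embed_dim, 1)    # embed reference nodes
        # Map each node's relation vector (affinities to all references)
        # to a scalar attention score.
        self.score = nn.Sequential(
            nn.Linear(n_refs, n_refs // 4), nn.ReLU(inplace=True),
            nn.Linear(n_refs // 4, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) frame-level feature maps of one tracklet.
        T, C, H, W = feats.shape
        # S-RFNs approximated here by temporal average pooling -> H*W references.
        refs = feats.mean(dim=0, keepdim=True)                   # (1, C, H, W)
        q = self.query(feats).flatten(2).transpose(1, 2)         # (T, HW, E)
        k = self.key(refs).flatten(2).squeeze(0)                 # (E, R)
        rel = q @ k                                              # (T, HW, R)
        scores = self.score(rel).flatten()                       # (T*HW,)
        attn = scores.softmax(dim=0)                             # global softmax
        nodes = feats.flatten(2).transpose(1, 2).reshape(-1, C)  # (T*HW, C)
        return (attn.unsqueeze(1) * nodes).sum(dim=0)            # (C,)


class MGAggregatorSketch(nn.Module):
    """Multi-granularity wrapper: split channels into N groups and attend
    to each group independently, as in the S = N setting of Table 2."""

    def __init__(self, channels: int, n_granularities: int, h: int, w: int):
        super().__init__()
        assert channels % n_granularities == 0
        per_group = channels // n_granularities
        self.heads = nn.ModuleList(
            RefAidedAttentionSketch(per_group, h, w)
            for _ in range(n_granularities))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        chunks = feats.chunk(len(self.heads), dim=1)  # split channel dim
        return torch.cat([head(c) for head, c in zip(self.heads, chunks)])
```

With, e.g., T = 8 frames of 2048-channel, 16 × 8 feature maps (a ResNet-50 backbone [7] is a plausible but assumed choice), MGAggregatorSketch(2048, 4, 16, 8)(feats) returns a 2048-D video-level descriptor whose four channel groups are attended independently.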
Methods
- Datasets and evaluation metrics: experiments are conducted on three public video reID benchmarks (Table 1); a sketch of the evaluation metrics follows this list.
- MARS [44]: 1,261 identities, 20,715 tracklets, 128 × 256 resolution, detected bounding boxes.
- iLIDS-VID [31]: 300 identities, 600 tracklets, 64 × 128 resolution, manually labeled bounding boxes.
- PRID2011 [9]: 200 identities, 400 tracklets, 64 × 128 resolution, manually labeled bounding boxes.
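Since the section names datasets and evaluation metrics together, the sketch below spells out the two standard video-reID metrics, CMC rank-k accuracy and mAP, as computed from a query-gallery distance matrix. The function name cmc_map is ours, and the omission of same-camera filtering (which official protocols such as MARS apply) is a simplification.

```python
import numpy as np


def cmc_map(dist, q_ids, g_ids, topk=(1, 5, 20)):
    """CMC rank-k accuracies and mAP from a (num_query, num_gallery)
    distance matrix. Simplified protocol: no filtering of same-camera
    gallery samples."""
    order = np.argsort(dist, axis=1)              # gallery ranked per query
    matches = g_ids[order] == q_ids[:, None]      # (Q, G) boolean hit matrix
    # CMC: a query is correct at rank k if any of its top-k entries match.
    cmc = {k: matches[:, :k].any(axis=1).mean() for k in topk}
    aps = []
    for row in matches:                           # average precision per query
        hit_ranks = np.where(row)[0]
        if hit_ranks.size == 0:                   # query with no true match
            continue
        precision_at_hits = np.arange(1, hit_ranks.size + 1) / (hit_ranks + 1)
        aps.append(precision_at_hits.mean())
    return cmc, float(np.mean(aps))
```

Here cmc[1] corresponds to the kind of Rank-1 numbers reported in Table 6.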
Conclusion
- The authors propose a Multi-Granularity Reference-aided Attentive Feature Aggregation scheme (MG-RAFA) for video-based person re-identification, which effectively enhances discriminative features and suppresses identity-irrelevant features on the spatial and temporal feature representations.
- To reduce the optimization difficulty, the authors propose to use a representative set of reference feature nodes (S-RFNs) for modeling the global relations.
- The authors propose multi-granularity attention, exploring relations at different granularity levels to capture semantics at different levels.
- The authors' scheme achieves state-of-the-art performance on three benchmark datasets.
Tables
- Table 1: Three public datasets for video-based person reID
- Table 2: The ablation study for our proposed multi-granularity reference-aided global attention (MG-RAFA) module. Here, "SG" denotes "Single-Granularity" and "MG" denotes "Multi-Granularity". N denotes the number of granularities. S denotes the number of splits (groups) along the channel dimension, with attention applied on each split respectively. In a multi-granularity setting, the number of splits is equal to the number of granularities (i.e., S = N) since each split corresponds to a granularity level. We use "MG-AFA" to represent the attention module without relations, in which attention values are inferred from RGB information alone
- Table 3: Comparison of different strategies on selection of the reference feature nodes (S-RFNs). Different spatial (S) and temporal …
- Table 4: Comparison with non-local related schemes
- Table 5: Evaluation of the multi-granularity (MG) design when other attention methods are used on the extracted feature maps F_t, t = 1, ..., T. Granularity is set to N = 4
- Table 6: Performance (%) comparison of our scheme with the state-of-the-art methods on three benchmark datasets
Related work
- In many practical scenarios, video is readily accessible and contains more comprehensive information than a single image, so video-based person reID offers a larger optimization space for achieving high reID performance and has attracted increasing interest in recent years.
- Video-based Person ReID. Some works simply formulate the video-based person reID problem as an image-based reID problem: they extract a feature representation for each frame and aggregate the representations of all frames by temporal average pooling [27, 6] (a minimal sketch of this baseline follows this list). McLaughlin et al. apply a Recurrent Neural Network to the frame-wise features extracted from a CNN to allow information to flow among different frames, and then temporally pool the output features to obtain the final feature representation [24]. Similarly, Yan et al. leverage an LSTM network to aggregate the frame-wise features into a sequence-level feature representation [37]. Liu et al. propose a two-stream network in which motion context and appearance features are accumulated by a recurrent neural network [19]. Inspired by the exploration of 3D Convolutional Neural Networks for learning spatial-temporal representations in other video-related tasks such as action recognition [11, 3], 3D convolution networks have also been used to extract sequence-level features [17, 14]. All of these works treat features as equally important, even though features at different spatial and temporal positions contribute differently to video-based person reID.
- Attention for Image-based Person ReID. For image-based person reID, many attention mechanisms have been designed to emphasize important features and suppress irrelevant ones in order to obtain discriminative features. Some works use human part/pose/mask information to infer attention regions for extracting part/foreground features [26, 12, 35]. Other works learn attention over spatial positions or channels in end-to-end frameworks [18, 42, 16, 30, 38]. In [16], spatial attention and channel attention are adopted to modulate the features; in general, convolutional layers with limited receptive fields are used to learn the spatial attention. Zhang et al. propose a relation-aware global attention that learns attention globally by exploiting pairwise relations [41] and achieves significant improvement for image-based person reID. Despite wide exploration in image-based reID, attention designs remain under-explored for video-based reID, with much less effort on globally derived attention.
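As a contrast to the attention-based schemes discussed above, the temporal-average-pooling baseline from the first paragraph reduces to a one-liner; the function name is ours:

```python
import torch


def temporal_average_pooling(frame_feats: torch.Tensor) -> torch.Tensor:
    """Aggregate per-frame feature vectors (T, D) into one clip-level
    descriptor (D,) by uniform averaging. Every frame gets equal weight,
    which is exactly the limitation attention-based aggregation targets."""
    return frame_feats.mean(dim=0)
```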
Funding
- This work was supported in part by NSFC under Grants U1908209 and 61632001, and by the National Key Research and Development Program of China under Grant 2018AAA0101400.
References
- Jon Almazan, Bojana Gajic, Naila Murray, and Diane Larlus. Re-id done right: towards good practices for person re-identification. arXiv preprint arXiv:1801.05339, 2018.
- Aaron F. Bobick and James W. Davis. The recognition of human movement using temporal templates. TPAMI, 23(3):257–267, 2001.
- Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, pages 6299–6308, 2017.
- Dapeng Chen, Hongsheng Li, Tong Xiao, Shuai Yi, and Xiaogang Wang. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In CVPR, pages 1169–1178, 2018.
- Yang Fu, Xiaoyang Wang, Yunchao Wei, and Thomas Huang. STA: Spatial-temporal attention for large-scale video-based person re-identification. In AAAI, 2019.
- Jiyang Gao and Ram Nevatia. Revisiting temporal modeling for video-based person reid. In BMVC, 2018.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
- Martin Hirzer, Csaba Beleznai, Peter M. Roth, and Horst Bischof. Person re-identification by descriptive and discriminative classification. In Scandinavian Conference on Image Analysis, pages 91–102. Springer, 2011.
- Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.
- Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2012.
- Mahdi M. Kalayeh, Emrah Basaran, Muhittin Gokmen, Mustafa E. Kamasak, and Mubarak Shah. Human semantic parsing for person re-identification. In CVPR, 2018.
- Jianing Li, Jingdong Wang, Qi Tian, Wen Gao, and Shiliang Zhang. Global-local temporal representations for video person re-identification. In ICCV, pages 3958–3967, 2019.
- Jianing Li, Shiliang Zhang, and Tiejun Huang. Multi-scale 3D convolution network for video based person re-identification. In AAAI, 2019.
- Shuang Li, Slawomir Bak, Peter Carr, and Xiaogang Wang. Diversity regularized spatiotemporal attention for video-based person re-identification. In CVPR, pages 369–378, 2018.
- Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In CVPR, pages 2285–2294, 2018.
- Xingyu Liao, Lingxiao He, Zhouwang Yang, and Chi Zhang. Video-based person re-identification via 3D convolutional networks and non-local attention. In ACCV, pages 620–634, 2018.
- Hao Liu, Jiashi Feng, Meibin Qi, Jianguo Jiang, and Shuicheng Yan. End-to-end comparative attention networks for person re-identification. TIP, pages 3492–3506, 2017.
- Hao Liu, Zequn Jie, Karlekar Jayashree, Meibin Qi, Jianguo Jiang, Shuicheng Yan, and Jiashi Feng. Video-based person re-identification with accumulative motion context. TCSVT, 28(10):2788–2802, 2018.
- Yu Liu, Junjie Yan, and Wanli Ouyang. Quality aware network for set to set recognition. In CVPR, pages 5790–5799, 2017.
- Yiheng Liu, Zhenxun Yuan, Wengang Zhou, and Houqiang Li. Spatial and temporal mutual promotion for video-based person re-identification. In AAAI, 2019.
- Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. arXiv preprint arXiv:1903.07071, 2019.
- Niall McLaughlin, Jesus Martinez del Rincon, and Paul Miller. Recurrent convolutional network for video-based person re-identification. In CVPR, pages 1325–1334, 2016.
- Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C. Kot, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR, 2018.
- Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. Mask-guided contrastive attention model for person re-identification. In CVPR, 2018.
- Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. Part-aligned bilinear representations for person re-identification. In ECCV, 2018.
- Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling. In ECCV, 2018.
- Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
- Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In ECCV, 2018.
- Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. Person re-identification by video ranking. In ECCV, pages 688–703. Springer, 2014.
- Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
- Yan Wang, Lequn Wang, Yurong You, Xu Zou, Vincent Chen, Serena Li, Gao Huang, Bharath Hariharan, and Kilian Q. Weinberger. Resource aware person re-identification across multiple resolutions. In CVPR, 2018.
- Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In ECCV, pages 3–19, 2018.
- Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-aware compositional network for person re-identification. In CVPR, pages 2119–2128, 2018.
- Shuangjie Xu, Yu Cheng, Kang Gu, Yang Yang, Shiyu Chang, and Pan Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In ICCV, pages 4733–4742, 2017.
- Yichao Yan, Bingbing Ni, Zhichao Song, Chao Ma, Yan Yan, and Xiaokang Yang. Person re-identification via recurrent feature aggregation. In ECCV, pages 701–716, 2016.
- Fan Yang, Ke Yan, Shijian Lu, Huizhu Jia, Xiaodong Xie, and Wen Gao. Attention driven person re-identification. Pattern Recognition, 86:143–155, 2019.
- Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, and Zhibo Chen. Densely semantically aligned person re-identification. In CVPR, 2019.
- Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen. Relation-aware global attention. arXiv preprint arXiv:1904.02998, 2019.
- Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen. Relation-aware global attention for person re-identification. In CVPR, 2020.
- Liming Zhao, Xi Li, Yueting Zhuang, and Jingdong Wang. Deeply-learned part-aligned representations for person re-identification. In ICCV, pages 3239–3248, 2017.
- Yiru Zhao, Xu Shen, Zhongming Jin, Hongtao Lu, and Xian-sheng Hua. Attribute-driven feature disentangling and temporal aggregation for video person re-identification. In CVPR, pages 4913–4922, 2019.
- Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. MARS: A video benchmark for large-scale person re-identification. In ECCV, pages 868–884. Springer, 2016.
- Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. arXiv preprint arXiv:1701.07717, 2017.
- Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
- Zhen Zhou, Yan Huang, Wei Wang, Liang Wang, and Tieniu Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In CVPR, pages 4747–4756, 2017.