Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification

CVPR, pp. 10404-10413, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.01042

Abstract:

Video-based person re-identification (reID) aims at matching the same person across video clips. It is a challenging task due to the existence of redundancy among frames, newly revealed appearance, occlusion, and motion blur. In this paper, we propose an attentive feature aggregation module, namely Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA), ...

Introduction
  • Person re-identification aims at matching persons in different positions, times, and camera views.
  • With the prevalence of video capturing systems, person reID based on video offers larger capacity for achieving more robust performance.
  • As illustrated in Figure 1, for a video clip, the visible contents of different frames differ but there are overlaps/redundancy.
  • The multiple frames of a video clip/sequence can provide more comprehensive information, e.g., gender and age (Figure 1 illustrates this with a video sequence of an example person).
Highlights
  • Person re-identification aims at matching persons in different positions, times, and camera views
  • Considering the characteristics of video, we propose to construct a small but representative set of reference feature nodes (S-RFNs) for globally modelling the pairwise relations, instead of using all the original feature nodes
  • We propose an effective attention module, namely Multi-Granularity Reference-aided Global Attention (MG-RAFA), for spatial-temporal feature aggregation to obtain a video-level feature vector (a minimal sketch of this idea follows this list)
  • We propose the Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA), which derives the attention from global relations and introduces a hierarchical design, aiming at capturing the discriminative spatial and temporal information at different semantic levels
  • We propose a Multi-Granularity Reference-aided Attentive Feature Aggregation scheme (MG-RAFA) for video-based person re-identification, which effectively enhances discriminative features and suppresses identity-irrelevant features in the spatial and temporal feature representations
  • Our scheme achieves the state-of-the-art performance on three benchmark datasets
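As a rough illustration of the idea in these highlights, the following PyTorch-style sketch derives per-node attention from pairwise relations to a small set of reference feature nodes (S-RFNs) and uses it to aggregate a clip-level feature. It is a minimal sketch under assumed shapes and names (e.g., ReferenceAidedAttention, score), not the authors' implementation.

```python
# Minimal, hypothetical sketch (PyTorch) of reference-aided attentive aggregation:
# attention for each spatio-temporal feature node is inferred from its pairwise
# relations to a small set of reference feature nodes (S-RFNs). Names and shapes
# are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceAidedAttention(nn.Module):
    def __init__(self, channels: int, num_refs: int):
        super().__init__()
        # Maps [relation vector (num_refs) ; node feature (channels)] -> scalar score.
        self.score = nn.Sequential(
            nn.Linear(num_refs + channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 1),
        )

    def forward(self, nodes: torch.Tensor, refs: torch.Tensor) -> torch.Tensor:
        # nodes: (B, M, C) spatio-temporal feature nodes, M = T * H * W
        # refs:  (B, R, C) reference feature nodes with R << M
        rel = torch.einsum('bmc,brc->bmr',
                           F.normalize(nodes, dim=-1),
                           F.normalize(refs, dim=-1))          # pairwise relations to S-RFNs
        scores = self.score(torch.cat([rel, nodes], dim=-1))   # (B, M, 1)
        attn = torch.softmax(scores, dim=1)                    # normalize over the M nodes
        return (attn * nodes).sum(dim=1)                       # (B, C) clip-level feature
```

A clip-level vector produced this way would then be matched against gallery clips with the usual reID distances (e.g., cosine distance).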
Methods
  • Datasets and Evaluation Metrics: experiments are conducted on three public video reID datasets, MARS [44], iLIDS-VID [31], and PRID2011 [9]; Table 1 summarizes their numbers of identities and tracklets.
  • Resolution: 128 × 256 for MARS; 64 × 128 for iLIDS-VID and PRID2011.
  • Box type: detected bounding boxes for MARS; manually annotated boxes for iLIDS-VID and PRID2011.
Conclusion
  • The authors propose a Multi-Granularity Reference-aided Attentive Feature Aggregation scheme (MG-RAFA) for video-based person re-identification, which effectively enhances discriminative features and suppresses identity-irrelevant features in the spatial and temporal feature representations.
  • To reduce the optimization difficulty, the authors propose to use a representative set of reference feature nodes (S-RFNs) for modeling the global relations (a hedged construction sketch follows this list).
  • The authors propose multi-granularity attention by exploring the relations at different granularity levels to capture semantics at different levels.
  • The authors' scheme achieves state-of-the-art performance on three benchmark datasets.
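The S-RFNs themselves are a small but representative set of feature nodes; the paper compares several selection strategies (Table 3). The sketch below shows only one plausible construction, assumed here to be spatial average pooling to a coarse grid followed by temporal averaging; the function name and grid size are illustrative, not the authors' chosen strategy.

```python
# Hypothetical sketch of building a small set of reference feature nodes (S-RFNs)
# by pooling the frame-level feature maps; this pooled-grid variant is only one
# plausible choice among the selection strategies the paper evaluates.
import torch
import torch.nn.functional as F

def build_srfn(feats: torch.Tensor, grid=(4, 2)) -> torch.Tensor:
    # feats: (B, T, C, H, W) frame-level feature maps from a per-frame backbone.
    b, t, c, h, w = feats.shape
    pooled = F.adaptive_avg_pool2d(feats.reshape(b * t, c, h, w), grid)  # (B*T, C, gh, gw)
    pooled = pooled.reshape(b, t, c, -1).mean(dim=1)                     # average over time -> (B, C, gh*gw)
    return pooled.transpose(1, 2)                                        # (B, R, C) with R = gh * gw
```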
Tables
  • Table 1: Three public datasets for video-based person reID.
  • Table 2: Ablation study for the proposed multi-granularity reference-aided global attention (MG-RAFA) module. Here, "SG" denotes "Single-Granularity" and "MG" denotes "Multi-Granularity". N denotes the number of granularities. S denotes the number of splits (groups) along the channel dimension, with attention applied to each split respectively. In a multi-granularity setting, the number of splits equals the number of granularities (i.e., S = N) since each split corresponds to a granularity level (a channel-split sketch follows this list). "MG-AFA" denotes the attention module without relations, in which attention values are inferred from RGB information alone.
  • Table 3: Comparison of different strategies on the selection of the reference feature nodes (S-RFNs), with different spatial (S) and temporal ...
  • Table 4: Comparison with non-local related schemes.
  • Table 5: Evaluation of the multi-granularity (MG) design when other attention methods are applied to the extracted feature maps F_t, t = 1, ..., T. Granularity is set to N = 4.
  • Table 6: Performance (%) comparison of our scheme with state-of-the-art methods on three benchmark datasets.
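As a rough illustration of the multi-granularity design described for Table 2, the sketch below splits the channel dimension into N groups (S = N) and applies a separate attention function to each split before concatenating the per-split aggregates. The attention callables are left abstract, and all names are assumptions rather than the authors' code.

```python
# Hypothetical sketch of multi-granularity channel splitting: each channel group
# (one per granularity level) gets its own attention over the feature nodes.
import torch

def multi_granularity_aggregate(nodes, attn_fns):
    # nodes: (B, M, C) spatio-temporal feature nodes; C must be divisible by N = len(attn_fns).
    # attn_fns: list of N callables, each mapping a (B, M, C // N) split to (B, M, 1) scores.
    splits = torch.chunk(nodes, chunks=len(attn_fns), dim=-1)
    outs = []
    for split, attn_fn in zip(splits, attn_fns):
        attn = torch.softmax(attn_fn(split), dim=1)   # normalize attention over the M nodes
        outs.append((attn * split).sum(dim=1))        # (B, C // N) per-granularity feature
    return torch.cat(outs, dim=-1)                    # (B, C) aggregated video-level feature
```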
Related work
  • In many practical scenarios, video is readily accessible and contains more comprehensive information than a single image. Video-based person reID therefore offers a larger optimization space for achieving high reID performance and has attracted increasing interest in recent years.
  • Video-based Person ReID. Some works simply formulate video-based person reID as an image-based reID problem: they extract a feature representation for each frame and aggregate the representations of all frames by temporal average pooling [27, 6] (a minimal sketch of this baseline follows this list). McLaughlin et al. apply a Recurrent Neural Network to the frame-wise features extracted from a CNN to allow information to flow among different frames, and then temporally pool the output features to obtain the final feature representation [24]. Similarly, Yan et al. leverage an LSTM network to aggregate the frame-wise features into a sequence-level feature representation [37]. Liu et al. propose a two-stream network in which motion context and appearance features are accumulated by a recurrent neural network [19]. Inspired by the exploration of 3D Convolutional Neural Networks for learning spatial-temporal representations in other video-related tasks such as action recognition [11, 3], 3D convolution networks have also been used to extract sequence-level features [17, 14]. These works treat all features as equally important, even though features at different spatial and temporal positions contribute differently to video-based person reID.
  • Attention for Image-based Person ReID. For image-based person reID, many attention mechanisms have been designed to emphasize important features and suppress irrelevant ones in order to obtain discriminative features. Some works use human part/pose/mask information to infer the attention regions for extracting part/foreground features [26, 12, 35, 26]. Others learn attention over spatial positions or channels in end-to-end frameworks [18, 42, 16, 30, 38]. In [16], spatial attention and channel attention are adopted to modulate the features; in general, convolutional layers with limited receptive fields are used to learn the spatial attention. Zhang et al. propose a relation-aware global attention that globally learns attention by exploiting pairwise relations [41], achieving significant improvement for image-based person reID. Despite the wide exploration in image-based reID, attention designs remain under-explored for video-based reID, with much less effort on globally derived attention.
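For concreteness, a minimal sketch of the temporal-average-pooling baseline described above (frame-wise CNN features averaged over time, as in [27, 6]) might look like the following; the class and argument names are illustrative and do not come from any of the cited codebases.

```python
# Hypothetical sketch of the simplest video reID baseline: per-frame features
# from a shared backbone are averaged over time to form the clip representation.
import torch
import torch.nn as nn

class TemporalAvgPoolReID(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_ids: int):
        super().__init__()
        self.backbone = backbone                        # any per-frame CNN producing (N, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_ids)  # identity classifier for training

    def forward(self, clip: torch.Tensor):
        # clip: (B, T, 3, H, W) video clip of T frames
        b, t = clip.shape[:2]
        frame_feats = self.backbone(clip.flatten(0, 1))          # (B*T, feat_dim) frame-wise features
        video_feat = frame_feats.reshape(b, t, -1).mean(dim=1)   # temporal average pooling -> (B, feat_dim)
        return video_feat, self.classifier(video_feat)           # embedding for matching + ID logits
```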
Funding
  • This work was supported in part by NSFC under Grants U1908209 and 61632001, and by the National Key Research and Development Program of China under Grant 2018AAA0101400.
References
  • Jon Almazan, Bojana Gajic, Naila Murray, and Diane Larlus. Re-ID done right: towards good practices for person re-identification. arXiv preprint arXiv:1801.05339, 2018.
  • Aaron F Bobick and James W Davis. The recognition of human movement using temporal templates. TPAMI, (3):257–267, 2001.
  • Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, pages 6299–6308, 2017.
  • Dapeng Chen, Hongsheng Li, Tong Xiao, Shuai Yi, and Xiaogang Wang. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In CVPR, pages 1169–1178, 2018.
  • Yang Fu, Xiaoyang Wang, Yunchao Wei, and Thomas Huang. STA: Spatial-temporal attention for large-scale video-based person re-identification. In AAAI, 2019.
  • Jiyang Gao and Ram Nevatia. Revisiting temporal modeling for video-based person reID. In BMVC, 2018.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • Martin Hirzer, Csaba Beleznai, Peter M Roth, and Horst Bischof. Person re-identification by descriptive and discriminative classification. In Scandinavian Conference on Image Analysis, pages 91–102. Springer, 2011.
  • Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.
  • Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2012.
  • Mahdi M Kalayeh, Emrah Basaran, Muhittin Gokmen, Mustafa E Kamasak, and Mubarak Shah. Human semantic parsing for person re-identification. In CVPR, 2018.
  • Jianing Li, Jingdong Wang, Qi Tian, Wen Gao, and Shiliang Zhang. Global-local temporal representations for video person re-identification. In ICCV, pages 3958–3967, 2019.
  • Jianing Li, Shiliang Zhang, and Tiejun Huang. Multi-scale 3D convolution network for video-based person re-identification. In AAAI, 2019.
  • Shuang Li, Slawomir Bak, Peter Carr, and Xiaogang Wang. Diversity regularized spatiotemporal attention for video-based person re-identification. In CVPR, pages 369–378, 2018.
  • Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In CVPR, pages 2285–2294, 2018.
  • Xingyu Liao, Lingxiao He, Zhouwang Yang, and Chi Zhang. Video-based person re-identification via 3D convolutional networks and non-local attention. In ACCV, pages 620–634, 2018.
  • Hao Liu, Jiashi Feng, Meibin Qi, Jianguo Jiang, and Shuicheng Yan. End-to-end comparative attention networks for person re-identification. TIP, pages 3492–3506.
  • Hao Liu, Zequn Jie, Karlekar Jayashree, Meibin Qi, Jianguo Jiang, Shuicheng Yan, and Jiashi Feng. Video-based person re-identification with accumulative motion context. TCSVT, 28(10):2788–2802, 2018.
  • Yu Liu, Junjie Yan, and Wanli Ouyang. Quality aware network for set to set recognition. In CVPR, pages 5790–5799, 2017.
  • Yiheng Liu, Zhenxun Yuan, Wengang Zhou, and Houqiang Li. Spatial and temporal mutual promotion for video-based person re-identification. In AAAI, 2019.
  • Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bags of tricks and a strong baseline for deep person re-identification. arXiv preprint arXiv:1903.07071, 2019.
  • Niall McLaughlin, Jesus Martinez del Rincon, and Paul Miller. Recurrent convolutional network for video-based person re-identification. In CVPR, pages 1325–1334, 2016.
  • Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C Kot, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR, 2018.
  • Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. Mask-guided contrastive attention model for person re-identification. In CVPR, 2018.
  • Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. Part-aligned bilinear representations for person re-identification. In ECCV, 2018.
  • Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling. In ECCV, 2018.
  • Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In ECCV, 2018.
  • Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. Person re-identification by video ranking. In ECCV, pages 688–703. Springer, 2014.
  • Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
  • Yan Wang, Lequn Wang, Yurong You, Xu Zou, Vincent Chen, Serena Li, Gao Huang, Bharath Hariharan, and Kilian Q Weinberger. Resource aware person re-identification across multiple resolutions. In CVPR, 2018.
  • Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In ECCV, pages 3–19, 2018.
  • Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-aware compositional network for person re-identification. In CVPR, pages 2119–2128, 2018.
  • Shuangjie Xu, Yu Cheng, Kang Gu, Yang Yang, Shiyu Chang, and Pan Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In ICCV, pages 4733–4742, 2017.
  • Yichao Yan, Bingbing Ni, Zhichao Song, Chao Ma, Yan Yan, and Xiaokang Yang. Person re-identification via recurrent feature aggregation. In ECCV, pages 701–716, 2016.
  • Fan Yang, Ke Yan, Shijian Lu, Huizhu Jia, Xiaodong Xie, and Wen Gao. Attention driven person re-identification. Pattern Recognition, 86:143–155, 2019.
  • Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, and Zhibo Chen. Densely semantically aligned person re-identification. In CVPR, 2019.
  • Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen. Relation-aware global attention. arXiv preprint arXiv:1904.02998, 2019.
  • Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen. Relation-aware global attention for person re-identification. In CVPR, 2020.
  • Liming Zhao, Xi Li, Yueting Zhuang, and Jingdong Wang. Deeply-learned part-aligned representations for person re-identification. In ICCV, pages 3239–3248, 2017.
  • Yiru Zhao, Xu Shen, Zhongming Jin, Hongtao Lu, and Xian-sheng Hua. Attribute-driven feature disentangling and temporal aggregation for video person re-identification. In CVPR, pages 4913–4922, 2019.
  • Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. MARS: A video benchmark for large-scale person re-identification. In ECCV, pages 868–884. Springer, 2016.
  • Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. arXiv preprint arXiv:1701.07717, 2017.
  • Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
  • Zhen Zhou, Yan Huang, Wei Wang, Liang Wang, and Tieniu Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In CVPR, pages 4747–4756, 2017.