Relation-Aware Global Attention for Person Re-Identification

CVPR, pp. 3183-3192, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.00325

Abstract:

For person re-identification (re-id), attention mechanisms have become attractive as they aim at strengthening discriminative features and suppressing irrelevant ones, which matches well the key goal of re-id, i.e., discriminative feature learning. Previous approaches typically learn attention using local convolutions, ignoring the mining of knowledge from global structure patterns. […]

Introduction
  • Person re-identification aims to match a specific person across different times, places, or cameras, and has drawn a surge of interest from both industry and academia.
  • The studies in [24] show that the effective receptive field of a CNN only takes up a fraction of the full theoretical receptive field.
  • These solutions cannot ensure effective exploration of global-scope information for person re-id
Highlights
  • Person re-identification aims to match a specific person across different times, places, or cameras, and has drawn a surge of interest from both industry and academia
  • We propose an effective Relation-Aware Global Attention (RGA) module to efficiently learn discriminative features for person re-id
  • As illustrated in Fig. 2 (c), for each feature node (e.g., the feature vector at a spatial position of a feature map), we model its pairwise relations with respect to all nodes, compactly stack these relations into a vector, and combine it with the feature of the node itself to infer the attention intensity via a small learnable model (see the sketch after this list)
  • For discriminative feature extraction in person re-id, we propose a Relation-Aware Global Attention (RGA) module which makes use of compact global-scope structural relation information to infer the attention
  • For person re-id, in order to learn more discriminative features, we propose a simple yet effective Relation-Aware Global Attention module which models global-scope structural information and infers the attention from it through a learned model
  • The structural patterns provide a kind of global-scope semantics that is helpful for inferring attention
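A minimal PyTorch-style sketch of this spatial relation-aware attention, written only from the description above; the embedding widths, the single-channel compression, and the two-channel inference head are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class SpatialRGA(nn.Module):
    """Sketch of spatial Relation-Aware Global Attention (RGA-S).

    For each spatial feature node, pairwise affinities to all n = h*w
    nodes are computed, stacked into a 2n-dim relation vector, combined
    with the node's own (embedded) feature, and fed to a small model
    that outputs the attention intensity. Reduction ratio and head
    design are assumptions for illustration.
    """

    def __init__(self, in_channels, height, width, reduction=8):
        super().__init__()
        n = height * width  # number of spatial feature nodes (fixed here)
        inter = in_channels // reduction
        # embedding functions for computing pairwise affinities
        self.theta = nn.Conv2d(in_channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, inter, kernel_size=1)
        # compress the node's own feature and its 2n-dim relation vector
        self.embed_feat = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.embed_rel = nn.Conv2d(2 * n, 1, kernel_size=1)
        # small learned model inferring attention from [feature, relations]
        self.infer = nn.Sequential(nn.Conv2d(2, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        t = self.theta(x).view(b, -1, n)        # (b, c', n)
        p = self.phi(x).view(b, -1, n)          # (b, c', n)
        rel = torch.bmm(t.transpose(1, 2), p)   # affinities r_ij: (b, n, n)
        # for node i, stack outgoing r_{i,*} and incoming r_{*,i} relations
        rel_stack = torch.cat([rel, rel.transpose(1, 2)], dim=2)  # (b, n, 2n)
        rel_stack = rel_stack.transpose(1, 2).reshape(b, 2 * n, h, w)
        joint = torch.cat([self.embed_feat(x), self.embed_rel(rel_stack)], dim=1)
        attn = self.infer(joint)                # (b, 1, h, w), values in [0, 1]
        return x * attn                         # reweight features spatially
```

For a typical re-id backbone stage with 16×8 feature maps, `SpatialRGA(256, 16, 8)` would modulate a (b, 256, 16, 8) tensor in place of a purely local attention block; note the module is tied to a fixed spatial size because the relation-vector length 2n enters a convolution.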
Methods
  • [Table 2: Performance (%) comparisons on CUHK03 (L) and Market1501 (R1 / mAP) of spatial attention schemes (CBAM-S [38], FC-S [23], NL [35], SNL [5], RGA-S (Ours)), channel attention schemes (SE [18], CBAM-C [38], FC-C [23], RGA-C (Ours)), and schemes combining both (FC-S//C [23], RGA-SC (Ours)). RGA-SC (Ours) reaches 81.1 R1 / 77.4 mAP on CUHK03 (L) and 96.1 R1 / 88.4 mAP on Market1501. The observed trend is similar to the observation made by Cao et al. [5].]
  • The Squeeze-and-Excitation module (SE [18]) computes channel-wise attention from spatially global average-pooled features using two fully connected (FC) layers with a non-linearity (a reference sketch follows this list).
  • FC-C [23] uses an FC layer over spatially average-pooled features.
  • Thanks to the exploration of pairwise relations, the scheme RGA-C outperforms FC-C [23] and SE [18], which use global information, by 1.9% and 3.0% in Rank-1 accuracy on CUHK03.
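For reference, the SE channel attention described above admits this standard sketch (the reduction ratio 16 is the common default from the SE paper, not a value taken from this page):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (SE [18])."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))           # squeeze: global average pooling per channel
        w = self.fc(s).view(b, c, 1, 1)  # excitation: two FC layers + non-linearity
        return x * w                     # channel-wise reweighting
```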
Results
  • Extensive ablation studies demonstrate that RGA can significantly enhance the feature representation power and help achieve state-of-the-art performance on several popular benchmarks.
  • The authors' scheme empowered by RGA modules achieves state-of-the-art performance on the benchmark datasets CUHK03 [21], Market1501 [46], and MSMT17 [37].
  • RGA-SC achieves the best performance, 2.7% and 1.8% higher than RGA-S and RGA-C, respectively, in mAP accuracy on CUHK03.
  • Thanks to the exploration of global structural information and its mining through a learnable modeling function, RGA-S achieves the best performance among the spatial schemes, about 2% better than the others in mAP accuracy on CUHK03 (L).
Conclusion
  • For the i-th feature node x_i, its corresponding relation vector r_i provides a compact representation that captures the global structural information, i.e., both the position information and the pairwise affinities with respect to all feature nodes (formalized in the sketch after this list).
  • In the spatial RGA (RGA-S), for each spatial position, the authors jointly exploit the feature nodes at all spatial positions to globally determine the attention.
  • The authors achieve this through simple 1 × 1 convolutional operations on the vector of stacked relations.
  • Such a feature representation facilitates the use of shallow convolutional layers to globally infer the attention.
  • The authors apply this module to the spatial and channel dimensions of CNN features and demonstrate its effectiveness in both cases.
  • Extensive ablation studies validate the effectiveness and efficiency of the designs, and state-of-the-art performance is achieved.
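The relation vector and attention inference described above can be formalized as follows; the dot-product affinity on embedded features matches the paper's general formulation, while the inference model g and the embeddings ψ, φ are written generically:

```latex
% Pairwise affinity between feature nodes i and j,
% computed on learned embeddings \theta and \phi:
r_{i,j} = \theta(x_i)^{\top} \phi(x_j)

% Relation vector of node i: its affinities to all N nodes,
% stacked with all N nodes' affinities to it (2N dimensions):
r_i = \left[\, r_{i,1}, \dots, r_{i,N},\; r_{1,i}, \dots, r_{N,i} \,\right]

% Attention of node i, inferred by a small learned model g from the
% node's own embedded feature together with its relation vector:
a_i = \sigma\big( g\big([\, \psi(x_i),\; \varphi(r_i) \,]\big) \big)
```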
Tables
  • Table1: Performance (%) comparisons of our models with the baseline, and the effectiveness of the global relation representation (Rel.) and the feature itself (Ori.). w/o: without
  • Table2: Performance (%) comparisons of our attention and other approaches, applied on top of our baseline
  • Table3: Number of parameters for different schemes (Million)
  • Table4: Influence of embedding functions on performance (%)
  • Table5: Performance (%) comparisons with the state-of-the-art on CUHK03, Market1501 and MSMT17.
Related work
  • 2.1. Attention and Person Re-id

    Attention aims to focus on important features and suppress irrelevant ones. This matches well the goal of handling the aforementioned challenges in person re-id and is thus attractive. Many works learn the attention using convolutional operations with small receptive fields on feature maps [32, 45, 22, 6]. However, intuitively, to judge whether a feature node is important or not, one should know the features of global scope, which facilitates the comparisons needed for the decision.

    In order to introduce more contextual information, Wang et al. and Yang et al. stack many convolutional layers in their encoder-decoder style attention modules to obtain larger receptive fields [33, 40]. Woo et al. use a large 7×7 filter over the spatial features in their Convolutional Block Attention Module (CBAM) to produce a spatial attention map [38], as sketched below. In [42], a non-local block [35] is inserted before the encoder-decoder style attention module to enable attention learning based on globally refined features. Limited by their practical receptive fields, these approaches are not efficient at capturing large-scope information to globally determine the spatial attention.
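As a concrete contrast with RGA-S, CBAM's spatial attention reduces to a few lines; each output value only sees a 7×7 local neighborhood of the pooled maps, which is exactly the limited practical receptive field criticized above (standard CBAM formulation, not code from this paper):

```python
import torch
import torch.nn as nn

class CBAMSpatialAttention(nn.Module):
    """Spatial attention of CBAM [38]: channel-wise average- and
    max-pooled maps are concatenated and passed through one 7x7 conv."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)      # (b, 1, h, w)
        mx, _ = x.max(dim=1, keepdim=True)     # (b, 1, h, w)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                        # local-receptive-field reweighting
```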
Funding
  • This work was supported in part by NSFC under Grants U1908209 and 61632001, and by the National Key Research and Development Program of China under Grant 2018AAA0101400.
References
  • [1] Jon Almazan, Bojana Gajic, Naila Murray, and Diane Larlus. Re-ID done right: Towards good practices for person re-identification. arXiv preprint arXiv:1801.05339, 2018.
  • [2] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM TOG, 28(3):24, 2009.
  • [3] Antoni Buades, Bartomeu Coll, and J.-M. Morel. A non-local algorithm for image denoising. In CVPR, volume 2, pages 60–65, 2005.
  • [4] Stamatios Lefkimmiatis. Non-local color image denoising with convolutional neural networks. In CVPR, 2017.
  • [5] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. GCNet: Non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492, 2019.
  • [6] Binghui Chen, Weihong Deng, and Jiani Hu. Mixed high-order attention network for person re-identification. In ICCV, pages 371–381, 2019.
  • [7] Peihao Chen, Chuang Gan, Guangyao Shen, Wenbing Huang, Runhao Zeng, and Mingkui Tan. Relation attention for temporal action localization. TMM, 2019.
  • [8] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. TIP, 16(8):2080–2095, 2007.
  • [9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [10] Alexei A. Efros and Thomas K. Leung. Texture synthesis by non-parametric sampling. In ICCV, volume 2, pages 1033–1038, 1999.
  • [11] Pengfei Fang, Jieming Zhou, Soumava Kumar Roy, Lars Petersson, and Mehrtash Harandi. Bilinear attention networks for person retrieval. In ICCV, pages 8030–8039, 2019.
  • [12] Yang Fu, Xiaoyang Wang, Yunchao Wei, and Thomas Huang. STA: Spatial-temporal attention for large-scale video-based person re-identification. In AAAI, 2019.
  • [13] Yang Fu, Yunchao Wei, Yuqian Zhou, Honghui Shi, Gao Huang, Xinchao Wang, Zhiqiang Yao, and Thomas Huang. Horizontal pyramid matching for person re-identification. In AAAI, 2019.
  • [14] Daniel Glasner, Shai Bagon, and Michal Irani. Super-resolution from a single image. In ICCV, pages 349–356, 2009.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [16] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [17] Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, and Xilin Chen. Interaction-and-aggregation network for person re-identification. In CVPR, pages 9317–9326, 2019.
  • [18] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.
  • [19] Mahdi M. Kalayeh, Emrah Basaran, Muhittin Gokmen, Mustafa E. Kamasak, and Mubarak Shah. Human semantic parsing for person re-identification. In CVPR, 2018.
  • [20] Shuang Li, Slawomir Bak, Peter Carr, and Xiaogang Wang. Diversity regularized spatiotemporal attention for video-based person re-identification. In CVPR, pages 369–378, 2018.
  • [21] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
  • [22] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In CVPR, pages 2285–2294, 2018.
  • [23] Yiheng Liu, Zhenxun Yuan, Wengang Zhou, and Houqiang Li. Spatial and temporal mutual promotion for video-based person re-identification. In AAAI, 2019.
  • [24] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In NeurIPS, pages 4898–4906, 2016.
  • [25] Lei Qi, Jing Huo, Lei Wang, Yinghuan Shi, and Yang Gao. MaskReID: A mask based deep ranking neural network for person re-identification. In ICME, 2019.
  • [26] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  • [27] Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C. Kot, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR, 2018.
  • [28] Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. Mask-guided contrastive attention model for person re-identification. In CVPR, 2018.
  • [29] Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. Part-aligned bilinear representations for person re-identification. In ECCV, 2018.
  • [30] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling. In ECCV, 2018.
  • [31] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [32] Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In ECCV, 2018.
  • [33] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In CVPR, pages 3156–3164, 2017.
  • [34] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In ACM Multimedia, 2018.
  • [35] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
  • [36] Yan Wang, Lequn Wang, Yurong You, Xu Zou, Vincent Chen, Serena Li, Gao Huang, Bharath Hariharan, and Kilian Q. Weinberger. Resource aware person re-identification across multiple resolutions. In CVPR, 2018.
  • [37] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In CVPR, 2018.
  • [38] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In ECCV, pages 3–19, 2018.
  • [39] Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-aware compositional network for person re-identification. In CVPR, pages 2119–2128, 2018.
  • [40] Fan Yang, Ke Yan, Shijian Lu, Huizhu Jia, Xiaodong Xie, and Wen Gao. Attention driven person re-identification. Pattern Recognition, 86:143–155, 2019.
  • [41] Xuan Zhang, Hao Luo, Xing Fan, Weilai Xiang, Yixiao Sun, Qiqi Xiao, Wei Jiang, Chi Zhang, and Jian Sun. AlignedReID: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.
  • [42] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. In ICLR, 2019.
  • [43] Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, and Zhibo Chen. Densely semantically aligned person re-identification. In CVPR, 2019.
  • [44] Haiyu Zhao, Maoqing Tian, Shuyang Sun, Jing Shao, Junjie Yan, Shuai Yi, Xiaogang Wang, and Xiaoou Tang. Spindle Net: Person re-identification with human body region guided feature decomposition and fusion. In CVPR, 2017.
  • [45] Liming Zhao, Xi Li, Yueting Zhuang, and Jingdong Wang. Deeply-learned part-aligned representations for person re-identification. In ICCV, pages 3239–3248, 2017.
  • [46] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
  • [47] Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. Joint discriminative and generative learning for person re-identification. In CVPR, pages 2138–2147, 2019.
  • [48] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. arXiv preprint arXiv:1701.07717, 2017.
  • [49] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, 2017.
  • [50] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.
  • [51] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang. Omni-scale feature learning for person re-identification. In ICCV, 2019.