Compressed Self-Attention for Deep Metric Learning with Low-Rank Approximation

IJCAI, pp. 2058-2064, 2020.

DOI: https://doi.org/10.24963/ijcai.2020/285

Abstract:

In this paper, we apply the self-attention (SA) mechanism to boost the performance of deep metric learning. However, due to the pairwise similarity measurement, the cost of storing and manipulating the complete attention maps makes it infeasible for large inputs. To solve this problem, we propose a compressed self-attention with low-rank approximation (CSALR) module…

Introduction
  • Metric learning aims to construct well-structured distance metrics, which can be used to perform various tasks, such as k-NN classification, clustering, and information retrieval.
  • Due to various geometric and photometric changes, such as scale, viewpoint, and illumination changes, and the limited receptive field of convolutional kernels, the learned embedding features are often not discriminative enough to ensure intra-class compactness and inter-class discrepancy, which limits the performance of deep metric learning.
  • To tackle this problem, the authors enhance the discriminative power of CNNs with the self-attention (SA) mechanism [Vaswani et al., 2017], which can capture long-range contextual dependencies adaptively (a minimal sketch of generic SA follows this list).
  • Although [Chen et al., 2020] proposed a compressed self-attention (CSA) module, it lacks a theoretical guarantee for the accurate reconstruction of the original attention maps.
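The SA module itself is not spelled out on this page; below is a minimal sketch of generic dot-product self-attention over a flattened CNN feature map, in the spirit of [Vaswani et al., 2017] and the non-local block of [Wang et al., 2018b]. All shapes and names here are illustrative assumptions, not the authors' code; the point is that the intermediate attention map is N × N for N spatial positions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feat, w_q, w_k, w_v):
    """Generic dot-product self-attention (illustrative sketch, not the paper's module).

    feat: (N, C) array, N = H*W flattened spatial positions of a CNN feature map.
    w_q, w_k, w_v: (C, C) projection matrices.
    """
    q, k, v = feat @ w_q, feat @ w_k, feat @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]))  # (N, N) pairwise attention map
    return attn @ v                                # contextualized features, (N, C)

# Even a modest 64x32 feature map yields a 2048 x 2048 attention map.
rng = np.random.default_rng(0)
H, W, C = 64, 32, 256
feat = rng.standard_normal((H * W, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
out = self_attention(feat, w_q, w_k, w_v)
print(out.shape)  # (2048, 256); the N x N map alone holds ~4.2M entries
```

This quadratic cost in N is exactly what makes full SA infeasible on large inputs and motivates the compressed variant the paper proposes.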
Highlights
  • Metric learning aims to construct well-structured distance metrics, which can be used to perform various tasks, such as k-NN classification, clustering, and information retrieval
  • Deep metric learning with convolutional neural networks (CNNs) has shown a large improvement in learning embedding features that have small intra-class and large inter-class distances
  • Deep metric learning has a wide range of application in computer vision, such as person re-identification [Ye et al, 2018; Wang et al, 2018c], face recognition [Deng et al, 2019; Liu et al, 2018], and keypoint descriptor learning [Mishchuk et al, 2017; Xu et al, 2019]
  • We propose a compressed self-attention with low-rank approximation (CSALR) module, which reduces the computation and memory costs greatly without sacrificing accuracy compared with the original SA (a landmark-style sketch of the low-rank idea follows this list)
  • We modify the backbone networks of the baseline models without changing other settings, and provide both qualitative and quantitative comparisons to demonstrate the effectiveness and efficiency of compressed self-attention with low-rank approximation in deep metric learning
  • Qualitative and quantitative experiments demonstrate the significance of compressed self-attention with low-rank approximation in deep metric learning
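The page does not reproduce the CSALR construction itself (its grouping and sampling scheme, or the reconstruction guarantee). As a hedged illustration of the general family it belongs to, the sketch below approximates the N × N attention map from m ≪ N sampled landmark points, Nyström-style (cf. [Baker, 1977] for the Nyström method); the function and parameter names are hypothetical, and this is not the authors' exact module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def landmark_low_rank_attention(q, k, v, num_landmarks, rng):
    """Nystrom-style low-rank approximation of softmax(q @ k.T / sqrt(d)) @ v.

    Never materializes the N x N attention map: only two thin N x m factors
    and one small m x m block are stored, so cost drops from O(N^2) to O(N*m).
    """
    n, d = q.shape
    idx = rng.choice(n, size=num_landmarks, replace=False)  # sampled points
    q_m, k_m = q[idx], k[idx]
    f = softmax(q @ k_m.T / np.sqrt(d))    # (N, m) queries vs. landmark keys
    a = softmax(q_m @ k_m.T / np.sqrt(d))  # (m, m) landmark-landmark block
    b = softmax(q_m @ k.T / np.sqrt(d))    # (m, N) landmark queries vs. all keys
    # attention map ~= f @ pinv(a) @ b; apply it to v factor by factor
    return f @ (np.linalg.pinv(a) @ (b @ v))

rng = np.random.default_rng(0)
n, d, m = 2048, 64, 64
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = landmark_low_rank_attention(q, k, v, m, rng)
print(out.shape)  # (2048, 64), computed without ever forming the 2048 x 2048 map
```

Tables 2 and 3 below ablate the analogous knobs in the authors' module: the number of sampled points in each group and the number of groups.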
Methods
  • The authors validate the proposed CSALR on person re-identification, which is a typical metric learning task (a generic sketch of the query/gallery retrieval evaluation used in re-identification follows this list).
  • The authors modify the backbone networks of the baseline models without changing other settings
  • The authors provide both qualitative and quantitative comparisons to demonstrate the effectiveness and efficiency of CSALR in deep metric learning.
  • Market-1501 contains 751 training IDs with 12,936 images and 750 query IDs with 3,368 query images and 19,732 gallery images, which are captured by 6 cameras.
  • DukeMTMC-ReID contains 702 training IDs with 16,522 images and 702 query IDs with 2,228 query images and 17,661 gallery images, which are captured by 8 cameras.
  • CUHK03-NP contains 767 training IDs with 7,365 images and 700 query IDs with 1,400 query images and 5,332 gallery images.
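For readers unfamiliar with the query/gallery protocol these dataset splits imply, here is a minimal sketch of retrieval evaluation (rank-1 accuracy) on embedding features. It is a generic illustration, not the authors' evaluation code; the function name rank1_accuracy is hypothetical, and the sketch ignores re-ID-specific details such as excluding same-camera matches.

```python
import numpy as np

def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """Generic retrieval sketch: for each query embedding, find the nearest
    gallery embedding (Euclidean) and check whether the person IDs match."""
    # Pairwise squared distances between queries and gallery, shape (Nq, Ng).
    d2 = ((query_feats[:, None, :] - gallery_feats[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    return (gallery_ids[nearest] == query_ids).mean()

# Toy example with random embeddings and IDs (illustrative only).
rng = np.random.default_rng(0)
q, g, dim = 8, 100, 128
qf, gf = rng.standard_normal((q, dim)), rng.standard_normal((g, dim))
qid, gid = rng.integers(0, 10, q), rng.integers(0, 10, g)
print("rank-1:", rank1_accuracy(qf, qid, gf, gid))
```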
Conclusion
  • The authors aim to boost the performance of deep metric learning with the self-attention (SA) mechanism, which can capture long-range contextual dependencies adaptively.
  • The authors propose a compressed self-attention with low-rank approximation (CSALR) module, which significantly reduces the computation and memory costs without sacrificing accuracy.
  • Qualitative and quantitative experiments demonstrate the significance of CSALR in deep metric learning
Tables
  • Table 1: Comparison of the models with and without the proposed CSALR. SA means the original self-attention
  • Table 2: Comparison of the models with different numbers of sampled points in each group of CSALR
  • Table 3: Comparison of the models with different numbers of groups in CSALR
  • Table 4: Comparison of the models with CSALR on different layers of the backbone
  • Table 5: Comparison of the speed and memory cost of the models with and without SA or CSALR (a back-of-envelope memory estimate follows below)
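Table 5's actual numbers are not reproduced on this page, but the scale of the saving can be estimated with back-of-envelope arithmetic; the feature-map size, float32 storage, single head, and landmark count below are assumptions for illustration only.

```python
# Back-of-envelope memory estimate in the spirit of Table 5 (illustrative
# assumptions: float32, one attention head, a 128 x 64 feature map).
n = 128 * 64                              # spatial positions N
m = 64                                    # sampled landmark points (hypothetical)
full_bytes = n * n * 4                    # full N x N attention map
low_rank_bytes = (2 * n * m + m * m) * 4  # two N x m factors + one m x m block
print(f"full SA map:   {full_bytes / 2**20:.1f} MiB")      # 256.0 MiB
print(f"low-rank form: {low_rank_bytes / 2**20:.1f} MiB")  # ~4.0 MiB
```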
Related work
  • 2.1 Deep Metric Learning

    Deep metric learning employs deep convolutional neural networks (CNNs) to learn embedding features that have small intra-class and large inter-class distances. However, due to the limited receptive field of CNNs and challenging geometric and photometric changes, the learned embedding features are often not discriminative enough. Recently, some researchers have focused on designing more effective loss functions to force the network to learn more representative embedding features. For example, [Fan et al., 2019] proposed a modified softmax function to learn a discriminative hypersphere manifold embedding for person re-identification. [Deng et al., 2019] proposed an additive angular margin loss to enhance the discriminative power of the network for face recognition.

    Other researchers focus on designing more robust network architectures. For instance, [Sun et al., 2018] proposed a body partition strategy and a partition refinement method for person re-identification. [Kalayeh et al., 2018] adopted a semantic parsing strategy to extract the features of human body parts for person re-identification. There are also some works applying SA in deep metric learning, such as [Si et al., 2018] and [Han et al., 2018]. However, due to its high computation and memory costs, SA is only applied to part-level or global-level features, which makes it difficult to make full use of the SA mechanism. [Chen et al., 2020] proposed a compressed form of self-attention; however, it cannot guarantee that the reconstructed attention maps actually approximate the original attention maps, which may degrade performance.
Funding
  • This work was supported in part by the National Natural Science Foundation of China under Grants 61822113 and 62041105, the Science and Technology Major Project of Hubei Province (Next-Generation AI Technologies) under Grant 2019AEA170, the Natural Science Foundation of Hubei Province under Grant 2018CFA050, and the Fundamental Research Funds for the Central Universities under Grants 413000092 and 413000082.
References
  • [Baker, 1977] Christopher T. H. Baker. The Numerical Treatment of Integral Equations. 1977.
  • [Chen et al., 2020] Ziye Chen, Yanwu Xu, Mingming Gong, Chaohui Wang, Bo Du, and Kun Zhang. Compressed self-attention for deep metric learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
  • [Deng et al., 2019] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
  • [Fan et al., 2019] Xing Fan, Wei Jiang, Hao Luo, and Mengjuan Fei. SphereReID: Deep hypersphere manifold embedding for person re-identification. Journal of Visual Communication and Image Representation, 60:51–58, 2019.
  • [Fang et al., 2020a] Yixiang Fang, Xin Huang, Lu Qin, Ying Zhang, Wenjie Zhang, Reynold Cheng, and Xuemin Lin. A survey of community search over big graphs. The VLDB Journal, 29(1):353–392, 2020.
  • [Fang et al., 2020b] Yixiang Fang, Yixing Yang, Wenjie Zhang, Xuemin Lin, and Xin Cao. Effective and efficient community search over large heterogeneous information networks. Proceedings of the VLDB Endowment, 13(6):854–867, 2020.
  • [Fu et al., 2019] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3146–3154, 2019.
  • [Han et al., 2018] Kai Han, Jianyuan Guo, Chao Zhang, and Mingjian Zhu. Attribute-aware attention model for fine-grained representation learning. pages 2040–2048, 2018.
  • [Kalayeh et al., 2018] Mahdi M. Kalayeh, Emrah Basaran, Muhittin Gokmen, Mustafa E. Kamasak, and Mubarak Shah. Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1062–1071, 2018.
  • [Liu et al., 2018] Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, and Le Song. Learning towards minimum hyperspherical energy. In Advances in Neural Information Processing Systems, pages 6222–6233, 2018.
  • [Mishchuk et al., 2017] Anastasiia Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working hard to know your neighbor’s margins: Local descriptor learning loss. In Advances in Neural Information Processing Systems, pages 4826–4837, 2017.
  • [Ristani et al., 2016] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pages 17–35, 2016.
  • [Si et al., 2018] Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C. Kot, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5363–5372, 2018.
  • [Sun et al., 2018] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pages 480–496, 2018.
  • [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [Wang et al., 2018a] Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 365–381, 2018.
  • [Wang et al., 2018b] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [Wang et al., 2018c] Zheng Wang, Xiang Bai, Mang Ye, and Shinichi Satoh. Incremental deep hidden attribute learning. pages 72–80, 2018.
  • [Xu et al., 2019] Yanwu Xu, Mingming Gong, Tongliang Liu, Kayhan Batmanghelich, and Chaohui Wang. Robust angular local descriptor learning. arXiv preprint arXiv:1901.07076, 2019.
  • [Ye et al., 2018] Mang Ye, Zheng Wang, Xiangyuan Lan, and Pong C. Yuen. Visible thermal person re-identification via dual-constrained top-ranking. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1092–1099, 2018.
  • [Zhang et al., 2018] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
  • [Zheng et al., 2015] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pages 1116–1124, 2015.
  • [Zheng et al., 2017] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, pages 3754–3762, 2017.
  • [Zhong et al., 2017] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1318–1327, 2017.