Range Loss for Deep Face Recognition with Long-Tailed Training Data

ICCV, pp. 5419-5428, 2017.

Abstract:

Deep convolutional neural networks have achieved significant improvements on the face recognition task due to their ability to learn highly discriminative features from tremendous amounts of face images. Many large-scale face datasets exhibit a long-tail distribution, where a small number of entities (persons) have a large number of face images wh…

Introduction
  • Recent years have witnessed remarkable progress in applying deep learning models to various computer vision tasks such as classification [14, 26, 29, 10, 9], scene understanding [37, 36], and action recognition [13].
  • According to [20], the performance can improve slightly if one preserves only 40% of the positive samples to make the training samples more uniform.
  • The flaw of such a disposal strategy is obvious: by partially abandoning the data, the information contained in it is discarded.
  • Poor classes can contain knowledge complementary to that of rich classes, which can boost the performance of the final models
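The tail-trimming experiments referenced above (and the A-0 to A-4 splits in Table 1) amount to a simple resampling step. The sketch below is an illustrative reconstruction, not the authors' code: `tail_threshold` is a hypothetical cutoff, and the choice to drop whole tail classes rather than individual images is an assumption.

```python
import random
from collections import defaultdict

def make_splits(samples, tail_threshold=20,
                keep_ratios=(1.0, 0.8, 0.5, 0.3, 0.0), seed=0):
    """Build splits A-0..A-4 that retain different fractions of tail classes.

    `samples` is a list of (image_id, person_id) pairs. Classes with fewer
    than `tail_threshold` images (hypothetical cutoff) are treated as
    'poor'/tailed classes; head classes are always kept in full.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for img, pid in samples:
        by_class[pid].append(img)

    head = {p: imgs for p, imgs in by_class.items() if len(imgs) >= tail_threshold}
    tail = {p: imgs for p, imgs in by_class.items() if len(imgs) < tail_threshold}

    splits = []
    for ratio in keep_ratios:
        kept = dict(head)                        # head classes kept as-is
        n_keep = int(round(len(tail) * ratio))   # fraction of tail classes to retain
        for p in rng.sample(sorted(tail), n_keep):
            kept[p] = tail[p]
        splits.append(kept)
    return splits
```

With `keep_ratios=(1.0, 0.8, 0.5, 0.3, 0.0)`, the five returned splits mirror groups A-0 (all tailed data) through A-4 (no tailed data).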
Highlights
  • Recent years have witnessed remarkable progress in applying deep learning models to various computer vision tasks such as classification [14, 26, 29, 10, 9], scene understanding [37, 36], and action recognition [13]
  • This paper addresses the long tail problem in the context of deep face recognition from two aspects
  • The main contributions of this paper are summarized as follows: 1) We empirically find that popular training losses for deep face recognition, i.e. contrastive loss [27], triplet loss [24], and center loss [32], all suffer from long-tailed distributions, while removing long-tailed data can improve recognition performance
  • Similar problems exist for triplet loss and center loss due to insufficient samples in the tailed parts. This paper addresses this challenge by proposing range loss to handle imbalanced data
  • We deeply explore the effects of long-tailed data in the context of training deep CNNs for face recognition
  • We propose a new loss function, namely range loss, to effectively exploit the tailed data in training deep networks
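The range-loss idea outlined above can be sketched in NumPy. This is a minimal illustration assuming the published formulation (harmonic mean of the k largest intra-class distances, plus a hinge on the smallest distance between class centers); `k`, `margin`, `alpha`, and `beta` are illustrative values, not the authors' exact settings.

```python
import numpy as np

def range_loss(features, labels, k=2, margin=10.0, alpha=1.0, beta=1.0):
    """Sketch of range loss on one mini-batch.

    Intra term: harmonic mean of the k largest intra-class pairwise
    distances, summed over classes. Inter term: hinge that pushes the
    two closest class centers at least `margin` apart.
    """
    classes = np.unique(labels)
    centers, intra = [], 0.0
    for c in classes:
        x = features[labels == c]
        centers.append(x.mean(axis=0))
        if len(x) < 2:
            continue
        # all pairwise distances within the class
        d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        top = np.sort(d[np.triu_indices(len(x), k=1)])[-k:]
        intra += len(top) / np.sum(1.0 / top)   # harmonic mean of largest ranges
    centers = np.stack(centers)
    dc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    dmin = dc[np.triu_indices(len(classes), k=1)].min()
    inter = max(0.0, margin - dmin)             # hinge on closest center pair
    return alpha * intra + beta * inter
```

Because only the few largest intra-class distances enter the intra term, already well-clustered samples are not over-penalized; the harmonic mean keeps the term dominated by the smaller of the selected ranges.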
Methods
  • The comparison covers DeepID-2+ [27], FaceNet [24], Baidu [17], Deep FR [21], DeepFace [30], Center Loss [32], a softmax-loss baseline, and the proposed Range Loss, evaluated on LFW and YTF with training data drawn from MS-Celeb-1M [6] and CASIA-WebFace [35].
  • The authors compare with a number of state-of-the-art methods, including DeepID-2+ [28], FaceNet [24], Baidu [17], DeepFace [30], and the residual net structure trained with softmax loss only.
  • Range loss again achieves better performance than the baseline softmax by a clear margin.
  • This indicates that the joint supervision of range loss and softmax loss consistently enhances the deep network's ability to extract discriminative representations.
  • Although FaceNet performs better than ours, it is trained on a much larger dataset, about 133 times the size of ours.
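The joint supervision mentioned above is a weighted sum of softmax cross-entropy and an auxiliary term. A minimal sketch, where `lam` is a hypothetical trade-off weight, not the paper's setting:

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Plain softmax cross-entropy, averaged over the batch."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def joint_loss(logits, labels, aux_loss, lam=0.01):
    """Joint supervision: softmax loss plus a weighted auxiliary term
    (e.g. the range-loss value computed on the same mini-batch)."""
    return softmax_cross_entropy(logits, labels) + lam * aux_loss
```

In a real training loop the auxiliary term would be the range loss evaluated on the batch's deep features, and both terms would be backpropagated together.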
Conclusion
  • The authors deeply explore the effects of long-tailed data in the context of training deep CNNs for face recognition.
  • The authors propose a new loss function, namely range loss, to effectively exploit the tailed data in training deep networks.
  • The authors' range loss helps reduce the intra-class variations and enlarge the inter-class distances on imbalanced, long-tailed datasets.
  • Experiments on two large-scale face benchmarks, i.e. LFW and YTF, demonstrate the effectiveness of the proposed methods, which clearly outperform baseline methods under long-tailed conditions.
Tables
  • Table1: Training set with long-tail distribution. The control groups' division proportions are shown in Fig. 2; classes with few images are regarded as poor classes (tailed data). In Table1, group A-0 contains all tailed data while A-4 includes no tailed data; A-1, A-2, and A-3 retain 80%, 50%, and 30% of the tailed data, respectively
  • Table2: Performance comparison of softmax loss on LFW and YTF with/without long-tail data. VGG Net is used
  • Table3: Performance comparison of softmax loss on LFW with/without long-tail data. AlexNet is used. Since AlexNet has fewer layers and weights than VGG Net, its baseline is lower, which makes the long-tail effect more obvious
  • Table4: Long-tail effect of Contrastive Loss, Triplet Loss, and Center Loss. Evaluated on LFW and YTF with VGG Nets
  • Table5: The intra-class and inter-class statistics expose differences between the long-tail model and the cut-tail model. Here SD is the standard deviation and EM is the average Euclidean metric. Good CNN models are expected to have a small intra-class standard deviation and average Euclidean metric, and large inter-class values. Kurtosis describes the 4th-order statistics of the feature distribution; infrequent, extremely deviated vectors lead to high kurtosis. A low kurtosis is always preferred because infrequent extreme deviations are harmful for the face recognition task. Range loss resists the increase of kurtosis and restrains the extension of inter-class distance
  • Table6: Verification accuracy of Range Loss, Contrastive Loss, Triplet Loss, and Center Loss on LFW and YTF. A-0 contains all tailed data while A-2 includes 50% tailed data
  • Table7: Comparison with state of the art methods on LFW and YTF datasets
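The Table 5 statistics (intra-class SD, average Euclidean metric, inter-class center distances, and kurtosis) can be computed from extracted features along the following lines. This is an illustrative sketch using plain 4th-moment kurtosis, not the authors' evaluation script.

```python
import numpy as np

def feature_stats(features, labels):
    """Intra-/inter-class statistics of deep features, Table-5 style.

    Intra-class distances are measured from each sample to its class
    center; inter-class distances between class centers.
    """
    centers, intra_d = {}, []
    for c in np.unique(labels):
        x = features[labels == c]
        centers[c] = x.mean(axis=0)
        intra_d.extend(np.linalg.norm(x - centers[c], axis=1))
    intra_d = np.asarray(intra_d)
    cs = np.stack(list(centers.values()))
    inter_d = np.linalg.norm(cs[:, None] - cs[None, :],
                             axis=-1)[np.triu_indices(len(cs), k=1)]
    mu, sd = intra_d.mean(), intra_d.std()
    return {
        "intra SD": sd,
        "intra EM": mu,                                  # average Euclidean metric
        "inter EM": inter_d.mean(),
        "kurtosis": ((intra_d - mu) ** 4).mean() / sd ** 4,  # plain 4th-moment kurtosis
    }
```

A low kurtosis here means few extreme outliers among the intra-class distances, which is the behavior the caption above describes as desirable.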
Related work
  • Deep neural networks, with their great ability to learn representations from data, have achieved remarkable success in a series of vision tasks such as recognition and detection [7, 25, 16, 8, 29] and face recognition [21, 24, 27, 3, 34, 19, 31]. By increasing depth, VGG [26] and GoogLeNet [4] achieved significant improvements on ImageNet [22] and the PASCAL VOC dataset [5]. More recently, Residual Networks exploit shortcut connections to ease the training of substantially deeper networks [9]. Deep architectures like DeepID2+ [27], FaceNet [24], DeepFace [30], and Deep FR [21] significantly boost face recognition performance over previous shallow models. The loss function is important for training powerful deep models. DeepID2 utilized both verification and identification losses to enhance the training of CNNs [28]. FaceNet further showed that a triplet loss helps improve performance. More recently, [32] proposed the center loss, which takes class clusters into account during CNN training. Different from these loss functions, range loss is defined on a new measure to minimize the within-person variations of deep representations.
Funding
  • This work was supported in part by National HighTech Research and Development Program of China (2015AA042303), National Natural Science Foundation of China (U1613211), and External Cooperation Program of BIC Chinese Academy of Sciences (172644KYSB20160033)
References
  • Springer, 2003.
  • A. Bingham and D. Spradlin. The long tail of expertise. 2011.
  • D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3025–3032, 2012.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
  • M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
  • Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
  • S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision, pages 345–360. Springer, 2014.
  • K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  • K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
  • G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
  • Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • S. Liao, Z. Lei, D. Yi, and S. Z. Li. A benchmark study of large-scale unconstrained face recognition. In IEEE International Joint Conference on Biometrics (IJCB), pages 1–8, 2014.
  • M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
  • J. Liu, Y. Deng, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv preprint arXiv:1506.07310, 2015.
  • L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
  • W. Ouyang, X. Wang, C. Zhang, and X. Yang. Factors in finetuning deep model for object detection with long-tail distribution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 864–873, 2016.
  • O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, volume 1, page 6, 2015.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • S. Bengio. The battle against the long tail. Talk at the Workshop on Big Data and Statistical Machine Learning, 2015.
  • F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
  • P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988–1996, 2014.
  • Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2892–2900, 2015.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
  • Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.
  • Y. Wen, Z. Li, and Y. Qiao. Latent factor guided convolutional neural networks for age-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4893–4901, 2016.
  • Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
  • L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 529–534, 2011.
  • J. Yang, B. Price, S. Cohen, and M.-H. Yang. Context driven scene parsing with attention to rare classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3294–3301, 2014.
  • D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856, 2014.
  • B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
  • E. Zhou, Z. Cao, and Q. Yin. Naive-deep face recognition: Touching the limit of LFW benchmark or not? arXiv preprint arXiv:1501.04690, 2015.