Semi-Supervised Contrastive Learning with Generalized Contrastive Loss and Its Application to Speaker Recognition

Nakamasa Inoue
Keita Goto
Other Links: arxiv.org

Abstract:

This paper introduces a semi-supervised contrastive learning framework and its application to text-independent speaker verification. The proposed framework employs generalized contrastive loss (GCL). GCL unifies losses from two different learning frameworks, supervised metric learning and unsupervised contrastive learning, and thus it naturally works as a loss function for semi-supervised learning.

Introduction
  • With the development of various optimization techniques, deep learning has become a powerful tool for numerous applications, including speech and image recognition.
  • In recent years, supervised metric learning methods for deep neural networks have attracted attention.
  • Examples of these include triplet loss [2] and prototypical episode loss [3], which predispose a network to minimize within-class distance and maximize between-class distance.
  • They are effective for text-independent speaker verification, as shown in [4], because cosine similarity between utterances from the same speaker is directly maximized in the training phase
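The triplet-style objective mentioned above can be made concrete with a minimal sketch. This is an illustrative example, not the paper's exact formulation: it uses cosine similarity, as in the speaker-verification setting of [4], and pushes the same-speaker pair to be more similar than the cross-speaker pair by a margin.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on cosine similarity: require the same-speaker pair
    (anchor, positive) to be more similar than the cross-speaker pair
    (anchor, negative) by at least `margin`."""
    s_pos = cosine_similarity(anchor, positive)
    s_neg = cosine_similarity(anchor, negative)
    return max(0.0, margin - s_pos + s_neg)
```

When the positive already beats the negative by the margin, the loss is zero, so training focuses on the violating triplets.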
Highlights
  • With the development of various optimization techniques, deep learning has become a powerful tool for numerous applications, including speech and image recognition
  • We propose a semi-supervised contrastive learning framework based on generalized contrastive loss (GCL)
  • We demonstrated that GCL enables the network to learn speaker embeddings in three manners, supervised learning, semi-supervised learning, and unsupervised learning, without any changes in the definition of the loss function
  • The results demonstrate that GCL enables the learning of speaker embeddings in the three different settings without any changes in the definition of the loss function
  • This paper proposed a semi-supervised contrastive learning framework with GCL
  • We showed via experiments on the VoxCeleb dataset that the proposed GCL enables a network to learn speaker embeddings in three manners, namely, supervised learning, semi-supervised learning, and unsupervised learning
Methods
  • The authors present 1) Generalized contrastive loss (GCL) and 2) GCL for semi-supervised learning.
  • GCL unifies losses from two different learning frameworks, supervised metric learning and unsupervised contrastive learning, and it naturally works as a loss function for semi-supervised learning.
  • Let Z = {zki : i = 1, 2, · · · , N, k = 1, 2} be a representation batch obtained from a mini-batch for either supervised metric learning or unsupervised contrastive learning.
  • Table II places recent losses in the GCL formulation, e.g., the cross-modal loss [19] as unsupervised learning and AM-Softmax as supervised learning.
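A representation batch Z = {z_ki : i = 1..N, k = 1, 2}, where (z_1i, z_2i) is a positive pair, can be fed to a contrastive objective. The following is a minimal NT-Xent-style sketch in the spirit of [5], an illustrative stand-in rather than the paper's exact GCL:

```python
import numpy as np

def contrastive_loss(Z1, Z2, temperature=0.1):
    """NT-Xent-style loss over a representation batch Z = {z_ki}.
    Z1[i] and Z2[i] form a positive pair (e.g., two segments of the
    same utterance, or two samples of the same speaker); all other
    rows act as negatives. Z1, Z2: arrays of shape (N, d)."""
    Z = np.concatenate([Z1, Z2], axis=0)                # (2N, d)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)    # cosine space
    sim = Z @ Z.T / temperature                         # (2N, 2N) similarities
    np.fill_diagonal(sim, -np.inf)                      # exclude self-pairs
    N = Z1.shape[0]
    # Index of each row's positive: row i pairs with row i+N and vice versa.
    pos = np.concatenate([np.arange(N, 2 * N), np.arange(N)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * N), pos].mean())
```

The loss is near zero when every representation is closest to its own positive and large when a negative outranks it.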
Results
  • Table I reports the results of semi-supervised, unsupervised, and supervised learning; equal error rate (EER) on the VoxCeleb 1 test is reported.
Conclusion
  • This paper proposed a semi-supervised contrastive learning framework with GCL.
  • The authors showed via experiments on the VoxCeleb dataset that the proposed GCL enables a network to learn speaker embeddings in three manners, namely, supervised learning, semi-supervised learning, and unsupervised learning.
  • This was accomplished without making any changes to the definition of the loss function
Tables
  • Table 1: Results of semi-supervised, unsupervised, and supervised learning. Equal error rate (EER) on the VoxCeleb 1 test is reported.
  • Table 2: Comparison of recent loss definitions in the GCL formulation. The affinity tensor makes pairs, triplets, (N + 1)-tuples, or 2N-tuples, as shown in Figure 3. Representation batch Z is constructed from labeled samples, unlabeled samples, and/or parameters. See the definition of GCL in Sec. IV for the meaning of s, α, and Ψ(v). m is a margin hyper-parameter.
Related work
  • A. Supervised Metric Learning

    Supervised metric learning is a framework to learn a metric space from a given set of labeled training samples. For recognition problems, such as audio and image recognition, the goal is typically to learn the semantic distance between samples.

    A recent trend in supervised metric learning is to design a loss function at the top of a deep neural network. Examples include contrastive loss for Siamese networks [6], triplet loss for triplet networks [2], and episode loss for prototypical networks [3]. To measure the distance between samples, Euclidean distance is often used with these losses.
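The pairwise contrastive loss of Hadsell et al. [6] with Euclidean distance can be sketched as follows. This is a minimal illustrative version: it pulls same-class pairs together and pushes different-class pairs at least a margin apart.

```python
import numpy as np

def siamese_contrastive_loss(x1, x2, same_class, margin=1.0):
    """Pairwise contrastive loss in the style of Hadsell et al. [6]:
    minimize squared Euclidean distance for same-class pairs, and
    penalize different-class pairs closer than `margin`."""
    d = float(np.linalg.norm(x1 - x2))
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

Different-class pairs already separated by more than the margin contribute nothing, so the embedding is not pushed apart without bound.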
Funding
  • This work was partially supported by the Japan Science and Technology Agency, ACT-X Grant JPMJAX1905, and the Japan Society for the Promotion of Science, KAKENHI Grant
Study subjects and analysis
speakers: 5,994
We used the VoxCeleb dataset [23], [24] for evaluating our proposed framework. The training set (VoxCeleb 2 dev) consists of 1,092,009 utterances from 5,994 speakers.

enrollment-test utterance pairs: 37,611
The test set (VoxCeleb 1 test) consists of 37,611 enrollment-test utterance pairs. The equal error rate (EER) was used as the evaluation measure.
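The EER is the operating point at which the false acceptance rate equals the false rejection rate. A simple threshold-sweep sketch (one of several common ways to estimate it from trial scores) follows; labels mark target trials (same speaker) as 1 and non-target trials as 0.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the equal error rate (EER) by sweeping a decision
    threshold over similarity scores. labels: 1 = target trial
    (same speaker), 0 = non-target trial (different speakers)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 0.5
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # false acceptance rate
        frr = np.mean(scores[labels == 1] < t)   # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)
```

On perfectly separable scores the estimate is 0; on random scores it approaches 0.5.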

Reference
  • [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [2] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In Proceedings of the International Workshop on Similarity-Based Pattern Recognition (SIMBAD), pages 84–92, 2015.
  • [3] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 4077–4087, 2017.
  • [4] Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han. In defence of metric learning for speaker recognition. arXiv preprint arXiv:2003.11982, 2020.
  • [5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
  • [6] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 1735–1742, 2006.
  • [7] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 4690–4699, 2019.
  • [8] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 5265–5274, 2018.
  • [9] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 212–220, 2017.
  • [10] Yutong Zheng, Dipan K. Pal, and Marios Savvides. Ring loss: Convex feature normalization for face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 5089–5097, 2018.
  • [11] Yi Liu, Liang He, and Jia Liu. Large margin softmax loss for speaker verification. In Proceedings of Interspeech, 2019.
  • [12] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, vol. 28, no. 2, pages 129–137, 1982.
  • [13] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision (ECCV), pages 69–84, 2016.
  • [14] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [15] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • [16] Philip Bachman, R. Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 15509–15519, 2019.
  • [17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
  • [18] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  • [19] Arsha Nagrani, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. Disentangled speech embeddings using cross-modal self-supervision. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6829–6833, 2020.
  • [20] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 1163–1171, 2016.
  • [21] Themos Stafylakis, Johan Rohdin, Oldrich Plchot, Petr Mizera, and Lukas Burget. Self-supervised speaker embeddings. In Proceedings of Interspeech, 2019.
  • [22] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883, 2018.
  • [23] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: A large-scale speaker identification dataset. In Proceedings of Interspeech, 2017.
  • [24] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. VoxCeleb2: Deep speaker recognition. In Proceedings of Interspeech, 2018.
  • [25] Shaojin Ding, Tianlong Chen, Xinyu Gong, Weiwei Zha, and Zhangyang Wang. AutoSpeech: Neural architecture search for speaker recognition. arXiv preprint arXiv:2005.03215, 2020.
  • [26] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143, 2020.