Semi-Supervised Contrastive Learning with Generalized Contrastive Loss and Its Application to Speaker Recognition
Abstract:
This paper introduces a semi-supervised contrastive learning framework and its application to text-independent speaker verification. The proposed framework employs generalized contrastive loss (GCL). GCL unifies losses from two different learning frameworks, supervised metric learning and unsupervised contrastive learning, and thus it naturally works as a loss function for semi-supervised learning.
Introduction
- With the development of various optimization techniques, deep learning has become a powerful tool for numerous applications, including speech and image recognition.
- In recent years, supervised metric learning methods for deep neural networks have attracted attention.
- Examples of these include triplet loss [2] and prototypical episode loss [3], which predispose a network to minimize within-class distance and maximize between-class distance.
- They are effective for text-independent speaker verification, as shown in [4], because cosine similarity between utterances from the same speaker is directly maximized in the training phase.
Highlights
- With the development of various optimization techniques, deep learning has become a powerful tool for numerous applications, including speech and image recognition
- We propose a semi-supervised contrastive learning framework based on generalized contrastive loss (GCL)
- We demonstrated that GCL enables the network to learn speaker embeddings in three manners, namely supervised learning, semi-supervised learning, and unsupervised learning, without any changes in the definition of the loss function
- This paper proposed a semi-supervised contrastive learning framework with GCL
- We showed via experiments on the VoxCeleb dataset that the proposed GCL enables a network to learn speaker embeddings in three manners, namely, supervised learning, semi-supervised learning, and unsupervised learning
Methods
- The authors present 1) Generalized contrastive loss (GCL) and 2) GCL for semi-supervised learning.
- GCL unifies losses from two different learning frameworks, supervised metric learning and unsupervised contrastive learning, and it naturally works as a loss function for semi-supervised learning.
- Let Z = {z_i^(k) : i = 1, 2, ..., N; k = 1, 2} be a representation batch obtained from a mini-batch for either supervised metric learning or unsupervised contrastive learning.
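To illustrate how a contrastive loss operates on such a representation batch, the sketch below computes an NT-Xent-style loss over Z = {z_i^(k)}, where z_i^(1) and z_i^(2) are two views of sample i. This is a generic sketch of the unsupervised contrastive case, not the paper's exact GCL formulation; the temperature value is illustrative.

```python
import numpy as np

def ntxent_loss(z1, z2, temperature=0.1):
    """NT-Xent-style contrastive loss over a representation batch.

    z1, z2: arrays of shape (N, d); row i of each is one view of sample i.
    For each embedding, the other view of the same sample is the positive
    and all remaining 2N - 2 embeddings are negatives.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm -> dot = cosine
    sim = z @ z.T / temperature                        # (2N, 2N) similarity matrix
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = z1.shape[0]
    # the positive partner of index i is i + n (and vice versa)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)   # cross-entropy per embedding
    return loss.mean()

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
print(ntxent_loss(z1, z2))
```

In the supervised metric-learning case, the same structure applies but positives are drawn from same-speaker labels rather than from augmented views.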
Results
- Table 1: Results of semi-supervised, unsupervised, and supervised learning. Equal error rate (EER) on the VoxCeleb 1 test is reported.
Conclusion
- This paper proposed a semi-supervised contrastive learning framework with GCL.
- The authors showed via experiments on the VoxCeleb dataset that the proposed GCL enables a network to learn speaker embeddings in three manners, namely, supervised learning, semi-supervised learning, and unsupervised learning.
- This was accomplished without making any changes to the definition of the loss function
Tables
- Table 1: Results of semi-supervised, unsupervised, and supervised learning. Equal error rate (EER) on the VoxCeleb 1 test is reported.
- Table 2: Comparison of recent loss definitions in the GCL formulation. The affinity tensor makes pairs, triplets, (N + 1)-tuples, or 2N-tuples, as shown in Figure 3. The representation batch Z is constructed from labeled samples, unlabeled samples, and/or parameters. See the definition of GCL in Sec. IV for the meaning of s, α, and Ψ(v). m is a margin hyper-parameter.
Related work
- A. Supervised Metric Learning
Supervised metric learning is a framework to learn a metric space from a given set of labeled training samples. For recognition problems, such as audio and image recognition, the goal is typically to learn the semantic distance between samples.
A recent trend in supervised metric learning is to design a loss function at the top of a deep neural network. Examples include contrastive loss for Siamese networks [6], triplet loss for triplet networks [2], and episode loss for prototypical networks [3]. To measure the distance between samples, Euclidean distance is often used with these losses.
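For concreteness, the triplet loss mentioned above can be sketched as follows. This is a minimal Euclidean-distance variant; the margin value is illustrative and not taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss with Euclidean distance.

    Pulls each anchor toward its positive (same class) and pushes it away
    from its negative (different class) until the gap exceeds the margin.
    All inputs have shape (N, d).
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # within-class distance
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # between-class distance
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```

When the negative is already farther from the anchor than the positive by more than the margin, the hinge term is zero and the triplet contributes no gradient.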
Funding
- This work was partially supported by the Japan Science and Technology Agency, ACT-X Grant JPMJAX1905, and the Japan Society for the Promotion of Science, KAKENHI Grant
Study subjects and analysis
speakers: 5994
We used the VoxCeleb dataset [23], [24] for evaluating our proposed framework. The training set (VoxCeleb 2 dev) consists of 1,092,009 utterances from 5,994 speakers.
enrollment-test utterance pairs: 37611
The test set (VoxCeleb 1 test) consists of 37,611 enrollment-test utterance pairs. The equal error rate (EER) was used as the evaluation measure.
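The EER used above is the operating point at which the false-acceptance rate (FAR) and false-rejection rate (FRR) are equal. A minimal way to estimate it from trial scores is sketched below; this is a generic threshold sweep, not the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the equal error rate from verification trial scores.

    scores: similarity score per enrollment-test pair (higher = more similar).
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Sweeps thresholds over the observed scores and returns the point where
    FAR and FRR are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```

A perfectly separable score distribution yields an EER of 0; chance-level scores yield an EER near 0.5.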
Reference
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In Proceedings of the International Workshop on Similarity-Based Pattern Recognition (SIMBAD), pages 84–92, 2015.
- Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 4077–4087, 2017.
- Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han. In defence of metric learning for speaker recognition. arXiv preprint arXiv:2003.11982, 2020.
- Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. Proceedings of the International Conference on Machine Learning (ICML), 2020.
- Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 1735–1742, 2006.
- Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 4690–4699, 2019.
- Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 5265–5274, 2018.
- Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 212–220, 2017.
- Yutong Zheng, Dipan K. Pal, and Marios Savvides. Ring loss: Convex feature normalization for face recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), pages 5089–5097, 2018.
- Yi Liu, Liang He, and Jia Liu. Large Margin Softmax Loss for Speaker Verification. In Proceedings of Interspeech, 2019.
- Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, vol. 28, no. 2, pages 129–137, 1982.
- Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision (ECCV), pages 69–84, 2016.
- Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In Proceedings of the International Conference on Learning Representations, 2018.
- R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In Proceedings of the International Conference on Learning Representations, 2019.
- Philip Bachman, R. Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 15509–15519, 2019.
- Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
- Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
- Arsha Nagrani, Joon Son Chung, Samuel Albanie, and Andrew Zisserman. Disentangled speech embeddings using cross-modal self-supervision. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6829–6833, 2020.
- Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 1163–1171, 2016.
- Themos Stafylakis, Johan Rohdin, Oldrich Plchot, Petr Mizera, and Lukas Burget. Self-supervised speaker embeddings. Proceedings of Interspeech, 2019.
- Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883, 2018.
- Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Voxceleb: A large-scale speaker identification dataset. In Proceedings of Interspeech, 2017.
- Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep speaker recognition. In Proceedings of Interspeech, 2018.
- Shaojin Ding, Tianlong Chen, Xinyu Gong, Weiwei Zha, and Zhangyang Wang. Autospeech: Neural architecture search for speaker recognition. arXiv preprint arXiv:2005.03215, 2020.
- Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143, 2020.