Speaker Diarization For Vietnamese Conversations Using Deep Neural Network Embeddings

2022 IEEE Ninth International Conference on Communications and Electronics (ICCE), 2022

Abstract
Speaker diarization, also known as finding “who spoke when”, is the task of dividing a conversation into segments spoken by the same speaker. While speaker diarization has numerous applications, there are few to no reports of its application in Vietnamese speech processing systems. The key to performing this task accurately is learning discriminative speaker representations, or speaker embeddings. Recently, X-Vectors and ECAPA-TDNN, both based on deep neural networks, have emerged as state-of-the-art speaker embedding networks for English corpora. In this work, we build a speaker diarization system for Vietnamese telephone conversations and explore the capabilities of X-Vectors and ECAPA-TDNN in that system. We also evaluate the discriminative characteristics of these speaker embedding networks on a bare-bones speaker verification system. The data used include proprietary datasets (IPCC-110000, IPCC-2000, VTR-1350) and a public dataset (ZALO-400). While these datasets can be used directly for training and testing on the speaker verification task, for the speaker diarization task we have to simulate multi-way conversations. Our experiments show that the ECAPA-TDNN system outperforms the X-Vectors system on both the speaker verification and speaker diarization tasks.
Keywords
Vietnamese,speaker diarization,speaker embeddings,clustering
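
As a rough illustration of how speaker embeddings can serve both tasks named in the abstract, the sketch below scores a verification trial and groups per-segment embeddings into speakers. The abstract does not state which scoring function or clustering algorithm the system uses, so cosine similarity, average-linkage agglomerative clustering, the threshold value, and the function names `cosine_score` and `cluster_segments` are all assumptions made for this example. The toy 2-dimensional vectors stand in for real X-Vectors or ECAPA-TDNN embeddings.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two embeddings (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_segments(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering on cosine similarity.

    Hypothetical helper: merges the two most similar clusters until the
    best average similarity drops below `threshold` (an assumed value).
    Returns one speaker label per input segment.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best, pair = -2.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                scores = [
                    cosine_score(embeddings[p], embeddings[q])
                    for p in clusters[i]
                    for q in clusters[j]
                ]
                avg = sum(scores) / len(scores)
                if avg > best:
                    best, pair = avg, (i, j)
        if best < threshold:
            break  # remaining clusters are treated as distinct speakers
        i, j = pair
        clusters[i].extend(clusters.pop(j))
    labels = [0] * len(embeddings)
    for k, members in enumerate(clusters):
        for m in members:
            labels[m] = k
    return labels


# Toy embeddings for four segments: two from each of two speakers.
segments = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
labels = cluster_segments(segments)
```

In a verification setting, the same `cosine_score` would be compared against a decision threshold to accept or reject a claimed identity; in diarization, the labels assign each segment to a speaker cluster.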