Multi-Speaker End-to-End Multi-Modal Speaker Diarization System for the MISP 2022 Challenge

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Abstract
This paper presents the design and implementation of our system for Track 1 of the Multi-modal Information based Speech Processing (MISP) 2022 Challenge. We design an end-to-end, transformer-based multi-talker system. The transformer backbone is well suited to capturing long-term features, which is crucial for multi-modal speaker diarization when a modality is temporarily missing. In addition, we employ several loss functions and image data augmentation techniques to prevent over-fitting during training. To further improve the system’s performance, we incorporate the Inter-channel Phase Difference (IPD) to model location features and pre-train an ECAPA-TDNN-based model to extract speaker embeddings. Our system achieved a diarization error rate (DER) of 10.82% on the evaluation set, which earned us second place in the audio-visual speaker diarization task of the MISP 2022 Challenge.
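The abstract does not say how the IPD features are computed, so the following is only a minimal sketch of one common formulation: the phase of one microphone channel's STFT minus the phase of another's, wrapped to (-pi, pi] and usually fed to a network as cosine and sine. The function names `stft` and `interchannel_phase_difference`, the FFT size, and the hop length are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def interchannel_phase_difference(ch_a, ch_b, n_fft=512, hop=256):
    """Phase of channel A minus phase of channel B, per time-frequency bin.

    Multiplying by the complex conjugate subtracts the phases, and
    np.angle wraps the result to (-pi, pi]. (Illustrative formulation only.)
    """
    spec_a = stft(ch_a, n_fft, hop)
    spec_b = stft(ch_b, n_fft, hop)
    return np.angle(spec_a * np.conj(spec_b))


def ipd_features(ch_a, ch_b, n_fft=512, hop=256):
    """Stack cos(IPD) and sin(IPD) along the last axis, avoiding phase wrap jumps."""
    ipd = interchannel_phase_difference(ch_a, ch_b, n_fft, hop)
    return np.stack([np.cos(ipd), np.sin(ipd)], axis=-1)
```

Because the phase difference between two microphones depends on the direction a sound arrives from, features of this kind give a model a rough spatial cue that complements the magnitude spectrum.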
Keywords
MISP Challenge, Audio-visual, Speaker Diarization