Multi-Speaker End-to-End Multi-Modal Speaker Diarization System for the MISP 2022 Challenge

Tao Liu, Zhengyang Chen, Yanmin Qian, Kai Yu

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023)

Abstract

This paper presents the design and implementation of our system for Track 1 of the Multi-modal Information based Speech Processing (MISP) 2022 Challenge. We design an end-to-end transformer-based multi-talker system. The transformer backbone is well suited to capturing long-term features, which is crucial for multi-modal speaker diarization in cases where temporal modalities are missing. In addition, we employ several loss functions and image data augmentation techniques to prevent over-fitting during training. To further improve the system's performance, we incorporate the Inter-channel Phase Difference (IPD) to model location features and pre-train an ECAPA-TDNN-based model to extract speaker embedding features. Our system achieved a diarization error rate (DER) of 10.82% on the evaluation set, which earned second place in the audio-visual speaker diarization task of the MISP 2022 Challenge.
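As a rough illustration of the location feature named in the abstract, the sketch below computes an inter-channel phase difference for one microphone pair and encodes it as cosine and sine values. It is not the authors' implementation: the function names `stft` and `ipd_features`, the frame length, the hop size, and the cos/sin encoding are all assumptions chosen for the example.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Naive STFT (assumed settings): Hann-windowed frames followed by a real FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)  # shape: (n_frames, n_fft // 2 + 1)

def ipd_features(ch_a, ch_b, n_fft=512, hop=256):
    """Hypothetical helper: cos/sin of the phase difference between two channels."""
    spec_a = stft(ch_a, n_fft, hop)
    spec_b = stft(ch_b, n_fft, hop)
    # The phase of X_a * conj(X_b) equals phase(X_a) - phase(X_b), wrapped to (-pi, pi].
    ipd = np.angle(spec_a * np.conj(spec_b))
    # Encoding the angle as cos/sin avoids the discontinuity at +/- pi.
    return np.concatenate([np.cos(ipd), np.sin(ipd)], axis=-1)
```

For two identical channels the phase difference is zero everywhere, so the cosine half of the feature is all ones and the sine half is all zeros; a delayed copy of the signal yields a phase difference that grows with frequency, which is what makes the feature informative about source direction.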

Keywords

MISP Challenge, Audio-visual, Speaker Diarization
