Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription

Luzhen Xu, Haoyin Yan, Maokui He, Zixian Guo, Yeping Zhou, Peiqi Liu, Jie Zhang, Lirong Dai

Journal of Shanghai Jiaotong University (Science) (2024)

Abstract
This paper describes a speaker-attributed automatic speech recognition (SA-ASR) system submitted to the multi-channel multi-party meeting transcription challenge, which aims to address the “who spoke what” problem. We align the multi-speaker ASR hypotheses, obtained from a model based on serialized output training, with the speaker diarization (SD) results to obtain speaker-attributed transcriptions. We use a pre-trained multi-frame cross-channel attention (MFCCA) model as the ASR module. For the SD module, we build a cascade system which includes a pre-trained speaker overlap-aware neural diarization and target-speaker voice activity detection model. Decoding and alignment strategies are further used to improve the SA-ASR performance. Our proposed system outperforms the baseline with a relative improvement of 40.3
Keywords
multi-channel multi-party meeting transcription (M2MET2.0), speaker-attributed automatic speech recognition (SA-ASR), serialized output training, speaker diarization, concatenated minimum-permutation character error rate
CLC number: TN912.34
Document code: A
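
The last keyword, the concatenated minimum-permutation character error rate (cpCER), is commonly computed by concatenating each speaker's utterances into one transcript, trying every assignment of hypothesis speakers to reference speakers, and keeping the assignment with the fewest character errors. The following is a minimal sketch under that assumption; the function names `edit_distance` and `cp_cer` are hypothetical, and this is not the challenge's official scoring tool.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # character missing from hyp
                curr[j - 1] + 1,         # extra character in hyp
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def cp_cer(refs, hyps):
    """cpCER sketch.

    refs, hyps: lists holding one concatenated transcript per speaker
    (assumed to be already concatenated in time order).
    """
    # Pad the shorter side with empty transcripts so that missing or
    # extra speakers are counted as errors.
    n = max(len(refs), len(hyps))
    refs = list(refs) + [""] * (n - len(refs))
    hyps = list(hyps) + [""] * (n - len(hyps))

    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference transcripts are empty")

    # Exhaustive search over speaker permutations; fine for a handful of
    # speakers, but factorial in the number of speakers.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars
```

For example, `cp_cer(["abc", "de"], ["de", "abc"])` evaluates to 0.0 because the speaker labels are only swapped, which the permutation search absorbs.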