Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription

Luzhen Xu, Haoyin Yan, Maokui He, Zixian Guo, Yeping Zhou, Peiqi Liu, Jie Zhang, Lirong Dai

Journal of Shanghai Jiaotong University (Science) (2024)

Abstract

This paper describes a speaker-attributed automatic speech recognition (SA-ASR) system submitted to the multi-channel multi-party meeting transcription challenge, which aims to address the “who spoke what” problem. We align the serialized output training-based multi-speaker ASR hypotheses and speaker diarization (SD) results to obtain speaker-attributed transcriptions. We use a pre-trained multi-frame cross-channel attention (MFCCA) model as the ASR module. We build a cascade system which includes a pre-trained speaker overlap-aware neural diarization and target-speaker voice activity detection model as the SD module. Decoding and alignment strategies are further used to improve the SA-ASR performance. Our proposed system outperforms the baseline with a relative improvement of 40.3
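The alignment of multi-speaker ASR hypotheses with diarization results can be illustrated with a minimal sketch. This is not the authors' alignment strategy, which the abstract does not detail. It assumes that the serialized output is a single string in which speakers are separated by a speaker-change token (here the hypothetical `<sc>`), that speakers appear in the order of their first onset, and that diarization yields `(speaker, start, end)` tuples. All function names are illustrative.

```python
from typing import Dict, List, Tuple

SC_TOKEN = "<sc>"  # hypothetical speaker-change token; the real token is an assumption

Segment = Tuple[str, float, float]  # (speaker label, start time, end time) in seconds


def split_sot(hyp: str, sc_token: str = SC_TOKEN) -> List[str]:
    """Split a serialized multi-speaker hypothesis into per-speaker texts."""
    return [part.strip() for part in hyp.split(sc_token) if part.strip()]


def speakers_by_first_onset(segments: List[Segment]) -> List[str]:
    """Order diarized speakers by the start time of their earliest segment."""
    first_onset: Dict[str, float] = {}
    for speaker, start, _end in segments:
        if speaker not in first_onset or start < first_onset[speaker]:
            first_onset[speaker] = start
    return sorted(first_onset, key=lambda s: first_onset[s])


def align(hyp: str, segments: List[Segment]) -> List[Tuple[str, str]]:
    """Pair the i-th serialized text with the i-th speaker by onset order.

    If the two counts differ, zip() silently drops the surplus items;
    a real system would need an explicit policy for that mismatch.
    """
    return list(zip(speakers_by_first_onset(segments), split_sot(hyp)))


if __name__ == "__main__":
    diarization = [("spk2", 1.5, 3.0), ("spk1", 0.0, 2.0), ("spk1", 3.5, 4.0)]
    hypothesis = "hello everyone <sc> good morning"
    print(align(hypothesis, diarization))
```

In this sketch, a mismatch between the number of serialized texts and the number of diarized speakers is simply truncated, which is the kind of case a dedicated alignment strategy would have to handle explicitly.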

Keywords

multi-channel multi-party meeting transcription (M2MET2.0), speaker-attributed automatic speech recognition (SA-ASR), serialized output training, speaker diarization, concatenated minimum-permutation character error rate, TN912.34, A
