All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection

INTERSPEECH (2020)

Abstract
Automatic speech recognition (ASR), audio tagging (AT), and acoustic event detection (AED) are typically treated as separate problems, where each task is tackled using specialized system architectures. This is in contrast with the way the human auditory system uses a single (binaural) pathway to process sound signals from different sources. In addition, an acoustic model trained to recognize speech as well as sound events could leverage multi-task learning to alleviate data scarcity problems in individual tasks. In this work, an all-in-one (AIO) acoustic model based on the Transformer architecture is trained to solve ASR, AT, and AED tasks simultaneously, where model parameters are shared across all tasks. For the ASR and AED tasks, the Transformer model is combined with the connectionist temporal classification (CTC) objective to enforce a monotonic ordering and to utilize timing information. Our experiments demonstrate that the AIO Transformer achieves better performance compared to all baseline systems of various recent DCASE challenge tasks and is suitable for the total transcription of an acoustic scene, i.e., to simultaneously transcribe speech and recognize the acoustic events occurring in it.
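
The multi-task setup described in the abstract can be pictured as a single shared Transformer encoder feeding lightweight task-specific heads: frame-wise CTC heads for ASR and AED, and a clip-level classification head for AT. The sketch below illustrates this idea in PyTorch; it is not the authors' implementation, and the feature dimension, model sizes, vocabulary and class counts, and head names are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a shared Transformer encoder with
# task-specific heads for ASR, AED, and AT. All sizes are assumed values.
import torch
import torch.nn as nn

class AllInOneTransformer(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_heads=4, n_layers=6,
                 asr_vocab=5000, aed_vocab=30, at_classes=30):
        super().__init__()
        self.frontend = nn.Linear(feat_dim, d_model)            # acoustic feature projection
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, n_layers)  # shared across all tasks
        self.asr_ctc_head = nn.Linear(d_model, asr_vocab)        # frame-wise token logits for CTC
        self.aed_ctc_head = nn.Linear(d_model, aed_vocab)        # frame-wise event logits for CTC
        self.at_head = nn.Linear(d_model, at_classes)            # clip-level tagging logits

    def forward(self, feats):
        h = self.encoder(self.frontend(feats))                  # (batch, time, d_model)
        return {
            "asr": self.asr_ctc_head(h).log_softmax(-1),         # CTC loss expects log-probs
            "aed": self.aed_ctc_head(h).log_softmax(-1),
            "at": self.at_head(h.mean(dim=1)),                   # pool over time for tagging
        }

# One possible multi-task objective: CTC for ASR and AED, binary cross-entropy for AT.
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
bce_loss = nn.BCEWithLogitsLoss()
```

In this sketch, parameter sharing happens entirely in the encoder, while each task only adds a linear head, which matches the paper's premise that one acoustic model can serve all three tasks.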
Keywords
automatic speech recognition, DCASE challenge, acoustic event detection, audio tagging, Transformer