Streaming Attention-Based Models with Augmented Memory for End-To-End Speech Recognition

2021 IEEE Spoken Language Technology Workshop (SLT)

Abstract
Attention-based models have been gaining popularity recently for their strong performance demonstrated in fields such as machine translation [1] and automatic speech recognition [2]. One major challenge of attention-based models is the need for access to the full sequence and the computational cost that grows quadratically with the sequence length. These characteristics pose challenges, especially for low-latency scenarios, where the system is often required to be streaming. In this paper, we build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture [3] with attention-based modules augmented with convolution [2]. The proposed system equips the end-to-end models with streaming capability and reduces the large footprint of the streaming attention-based model using augmented memory [4], [5]. On the LibriSpeech [6] dataset, our proposed system achieves word error rates of 2.7% on test-clean and 5.8% on test-other, to the best of our knowledge the lowest among streaming approaches reported so far.
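The core idea behind augmented-memory attention [4], [5] can be sketched as follows: the input is cut into fixed-length segments, and each segment attends only to itself plus a bank of one summary vector per past segment, keeping the per-step cost bounded instead of quadratic in the full sequence length. This is a minimal illustrative sketch, not the paper's implementation; the function and parameter names (`augmented_memory_stream`, `seg_len`) are assumptions, and real systems add projections, multi-head attention, and left/right context frames.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # scaled dot-product attention
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def augmented_memory_stream(frames, seg_len=4):
    """Process frames segment by segment: each segment attends to
    itself plus summary vectors of all past segments (the memory bank),
    so attention never needs the full future sequence."""
    memory, outputs = [], []
    for start in range(0, len(frames), seg_len):
        seg = frames[start:start + seg_len]                    # current segment
        keys = np.vstack(memory + [seg]) if memory else seg    # memory bank + segment
        out = attend(seg, keys, keys)
        outputs.append(out)
        memory.append(out.mean(axis=0, keepdims=True))         # one summary vector per segment
    return np.vstack(outputs)
```

Because the memory bank holds one vector per segment rather than all past frames, the key/value length grows far more slowly than the raw sequence, which is what makes the attention layer streamable with a small footprint.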
Keywords
transformer,transducer,end-to-end,self-attention,speech recognition