Effective Decoder Masking for Transformer Based End-to-End Speech Recognition

arxiv(2021)

Abstract
The attention-based encoder-decoder modeling paradigm has achieved promising results on a variety of speech processing tasks, such as automatic speech recognition (ASR) and text-to-speech (TTS), among others. This paradigm takes advantage of the generalization ability of neural networks to learn a direct mapping from an input sequence to an output sequence, without recourse to prior knowledge such as audio-text alignments or pronunciation lexicons. However, ASR models stemming from this paradigm are prone to overfitting, especially when the training data is limited. Inspired by SpecAugment and BERT-like masked language modeling, we propose in this paper a decoder-masking-based training approach for end-to-end (E2E) ASR models. During the training phase, we randomly replace some portions of the decoder's historical text input with the symbol [mask], in order to encourage the decoder to robustly output a correct token even when parts of its decoding history are masked or corrupted. The proposed approach is instantiated with a state-of-the-art transformer-based E2E ASR model. Extensive experiments on the LibriSpeech 960h and TED-LIUM 2 benchmark datasets demonstrate the superior performance of our approach in comparison to some existing strong E2E ASR systems.
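The masking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the masking rate or whether tokens are masked individually or in spans, so the independent per-token masking, the `mask_prob` default, the `<sos>`/`<eos>` symbols, and the function name `mask_decoder_history` are all assumptions made for illustration. Only the `[mask]` symbol and the idea of corrupting the decoder's history while still predicting the correct tokens come from the text.

```python
import random

MASK = "[mask]"  # mask symbol named in the abstract
SOS = "<sos>"    # hypothetical start-of-sequence symbol
EOS = "<eos>"    # hypothetical end-of-sequence symbol


def mask_decoder_history(tokens, mask_prob=0.15, rng=None):
    """Build teacher-forcing inputs with a randomly masked history.

    Assumption: each history token is masked independently with
    probability `mask_prob`; the paper may use a different scheme.
    Returns (decoder_inputs, labels), two lists of equal length.
    """
    if rng is None:
        rng = random.Random()
    # Corrupt only the text history; the start symbol is left intact.
    masked = [MASK if rng.random() < mask_prob else tok for tok in tokens]
    decoder_inputs = [SOS] + masked
    # Labels stay uncorrupted, so the decoder must still emit the
    # correct token when parts of its history are masked.
    labels = list(tokens) + [EOS]
    return decoder_inputs, labels


if __name__ == "__main__":
    inputs, labels = mask_decoder_history(
        ["the", "cat", "sat"], mask_prob=0.5, rng=random.Random(0)
    )
    print(inputs)
    print(labels)
```

Under these assumptions the masking would be applied only during training; at inference time the decoder consumes its own unmasked predictions.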
Keywords
effective decoder masking, recognition, end-to-end