End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

arXiv (2019)

Abstract
We study ResNet-, Time-Depth Separable ConvNet-, and Transformer-based acoustic models, trained with CTC or Seq2Seq criteria. We perform experiments on the LibriSpeech dataset, with and without LM decoding, optionally with beam rescoring. We reach 5.18% WER with external language models for decoding and rescoring. Additionally, we leverage the unlabeled data from LibriVox through semi-supervised training and show that it is possible to reach 5.29% WER on test-other without decoding, and 4.11% WER with decoding and rescoring, using only the standard 960 hours from LibriSpeech as labeled data.
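To make the training setup concrete, below is a minimal sketch of acoustic-model training with the CTC criterion in PyTorch. The tiny convolutional encoder, token inventory, and hyperparameters are illustrative assumptions, not the paper's ResNet/TDS/Transformer architectures or recipe; it only shows how per-frame log-probabilities are fed to a CTC loss.

```python
# Illustrative CTC training sketch (assumed toy model, not the paper's setup).
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy convolutional encoder mapping filterbank frames to token log-probs."""
    def __init__(self, n_mels=80, n_tokens=29, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_tokens)  # token set includes the CTC blank at index 0

    def forward(self, feats):                     # feats: (batch, n_mels, time)
        h = self.encoder(feats)                   # (batch, hidden, time)
        logits = self.head(h.transpose(1, 2))     # (batch, time, n_tokens)
        return logits.log_softmax(dim=-1)

model = TinyAcousticModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

# Dummy batch: 2 utterances of 100 frames, padded targets of true lengths 12 and 9.
feats = torch.randn(2, 80, 100)
targets = torch.randint(1, 29, (2, 12))           # label ids exclude the blank (0)
input_lengths = torch.full((2,), 100, dtype=torch.long)
target_lengths = torch.tensor([12, 9], dtype=torch.long)

log_probs = model(feats).transpose(0, 1)          # CTCLoss expects (time, batch, tokens)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
opt.step()
```

In the Seq2Seq variant the frame-level CTC loss would be replaced by an attention-based decoder trained with cross-entropy on the output token sequence; external LM decoding and beam rescoring are applied at inference time, not in this training loop.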