Emitting Word Timings with HMM-free End-to-End System in Automatic Speech Recognition

Xianzhao Chen,Hao Ni,Yi He,Kang Wang,Zejun Ma,Zongxia Xie

Interspeech（2021）

引用 0|浏览7

暂无评分

摘要

Word timings, which mark the start and end times of each word in ASR results, play an important part in many applications, such as computer assisted language learning. To date, end-to-end (E2E) systems outperform conventional DNNHMM hybrid systems in ASR accuracy but have challenges to obtain accurate word timings. In this paper, we propose a two-pass method to estimate word timings under an E2Ebased LAS modeling framework, which is completely free of using the DNN-HMM ASR system. Specifically, we first employ the LAS system to obtain word-piece transcripts of the input audio, we then compute forced-alignments with a framelevel-based word-piece classifier. In order to make the classifier yield accurate word-piece timing results, we propose a novel objective function to learn the classifier, utilizing the spike timings of the connectionist temporal classification (CTC) model. On Librispeech data, our E2E-based LAS system achieves 2.8%/7.0% WERs, while its word timing (start/end) accuracy are 99.0%/95.3% and 98.6%/93.7% on test-clean and test-other two test sets respectively. Compared with a DNN-HMM hybrid ASR system (here, TDNN), the LAS system is better in ASR performance, and the generated word timings are close to what the TDNN ASR system presents.

查看译文

关键词

ASR,End-to-end,Listen Attend and Spell,connectionist temporal classification,forced alignment

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要