Phoebe: Pronunciation-Aware Contextualization For End-To-End Speech Recognition

2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract
End-to-End (E2E) automatic speech recognition (ASR) systems learn word spellings directly from text-audio pairs, in contrast to traditional ASR systems which incorporate a separate pronunciation lexicon. The lexicon allows a traditional system to correctly spell rare words observed only in LM training, if their phonetic pronunciation is known during inference. E2E systems, however, are more likely to misspell rare words. We propose an E2E model which benefits from the best of both worlds: it outputs graphemes, and thus learns to spell words directly, while leveraging pronunciations for words which might be likely in a given context. Our model is based on the recently proposed Contextual Listen, Attend, and Spell (CLAS) model. As in CLAS, our model accepts a set of bias phrases, which are first converted into fixed-length embeddings which are provided as additional inputs to the model. Unlike CLAS, which accepts only the textual form of the bias phrases, the proposed model also has access to the corresponding phonetic pronunciations, which improves performance on challenging sets which include words unseen in training. The proposed model provides a 16% relative word-error-rate reduction over CLAS when both the phonetic and written representation of the context bias phrases are used.
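To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of a pronunciation-aware bias encoder in the spirit of the abstract: each bias phrase is encoded once from its written (grapheme) form and once from its phonetic (phoneme) form, and the two views are combined into a single fixed-length embedding that a CLAS-style decoder could attend over as an additional input. All names and dimensions here (PronunciationBiasEncoder, grapheme_vocab, phoneme_vocab, embed_dim, hidden_dim) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a pronunciation-aware bias encoder (not the paper's code).
import torch
import torch.nn as nn

class PronunciationBiasEncoder(nn.Module):
    """Encodes each bias phrase (graphemes + phonemes) into one fixed-length vector."""

    def __init__(self, grapheme_vocab, phoneme_vocab, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.grapheme_embed = nn.Embedding(grapheme_vocab, embed_dim)
        self.phoneme_embed = nn.Embedding(phoneme_vocab, embed_dim)
        # Separate recurrent encoders for the written and phonetic forms of a phrase.
        self.grapheme_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.phoneme_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project the concatenated final states into a single bias embedding.
        self.proj = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, grapheme_ids, phoneme_ids):
        # grapheme_ids, phoneme_ids: (num_phrases, max_len) integer tensors.
        _, (g_state, _) = self.grapheme_rnn(self.grapheme_embed(grapheme_ids))
        _, (p_state, _) = self.phoneme_rnn(self.phoneme_embed(phoneme_ids))
        # Final hidden states summarize each phrase; concatenate both views.
        combined = torch.cat([g_state[-1], p_state[-1]], dim=-1)
        return self.proj(combined)  # (num_phrases, hidden_dim) bias embeddings

# As in CLAS, the decoder would attend over these embeddings (plus a no-bias
# option) at every output step; here each embedding additionally carries
# pronunciation information for its phrase.
```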
Keywords
speech recognition, sequence-to-sequence, biasing, pronunciation, LAS