Overcoming Data Sparsity in Automatic Transcription of Dictated Medical Findings

2022 30th European Signal Processing Conference (EUSIPCO)

Abstract
This paper presents a method for introducing class n-gram language models as a means of overcoming data sparsity in the training of an automatic speech recognition (ASR) system aimed at the transcription of dictated medical findings composed predominantly in the Serbian language, with occasional phrases in Latin. The classes used by the model are defined with the specific aim of avoiding the need to identify an appropriate orthographic expansion of each abbreviation, number or other non-orthographic element in a particular context. The generated language models are decoded in Kaldi using token passing, and the lattices generated in this way are rescored using recurrent neural network language models (RNNLM). Although the proposed approach requires extensive effort for the initial definition of classes based on existing text corpora of medical findings, it improves the quality of the model and increases the degree of automation in the processing of future training corpora. As such, the proposed method is particularly suitable for training on noisy data full of misspellings and other errors, such as medical findings. The feasibility of the approach has been tested on a corpus of medical findings in the domain of radiology, where a perplexity of 59.55 and a word error rate of 1.4% have been achieved.
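The core idea, replacing numbers, abbreviations and similar non-orthographic elements with class symbols before estimating n-gram counts, can be illustrated with a minimal sketch. The class names and regular expressions below are illustrative assumptions, not the class definitions used in the paper, and the counts shown would feed a smoothed class n-gram model rather than serve as a complete LM.

```python
# Minimal sketch (not the authors' implementation): map non-orthographic tokens
# such as numbers and short abbreviations to class symbols so the n-gram model
# sees e.g. "<NUM>" instead of every distinct numeral spelling.
import re
from collections import Counter, defaultdict

# Hypothetical class definitions; a real system would derive these from the
# corpus of medical findings (abbreviations, measurements, Latin phrases, ...).
CLASS_PATTERNS = [
    ("<NUM>",  re.compile(r"^\d+([.,]\d+)?$")),    # numbers such as 12 or 3,5
    ("<ABBR>", re.compile(r"^[a-zA-Z]{1,4}\.$")),  # short abbreviations like "tzv."
]

def to_class_token(token: str) -> str:
    """Replace a raw token with its class symbol if any pattern matches."""
    for class_symbol, pattern in CLASS_PATTERNS:
        if pattern.match(token):
            return class_symbol
    return token.lower()

def bigram_counts(sentences):
    """Count class-mapped bigrams; these counts would feed a smoothed n-gram LM."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + [to_class_token(t) for t in sentence.split()] + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

if __name__ == "__main__":
    corpus = ["Dijametar aorte 3,5 cm .", "Nalaz tzv. uredan ."]
    for history, followers in bigram_counts(corpus).items():
        print(history, dict(followers))
```

In a full pipeline, class-internal word distributions and the mapping back from class symbols to surface forms would be handled separately, e.g. during Kaldi decoding and subsequent RNNLM rescoring of the lattices.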
Keywords
the Serbian language, language modeling, class-based LM, code switching