Preparation of Bangla Speech Corpus from Publicly Available Audio & Text.

LREC (2020)

Abstract
Automatic speech recognition (ASR) systems require large annotated speech corpora, but manually annotating a large corpus is very difficult. In this paper, we focus on the automatic preparation of a speech corpus for Bangladeshi Bangla. We used publicly available Bangla audiobooks and TV news recordings as audio sources. We designed and implemented an iterative algorithm that takes as input a speech corpus and a large amount of raw audio (without transcription) and outputs a much larger speech corpus with reasonable confidence. We leveraged speaker diarization, gender detection, and related techniques to prepare the annotated corpus. We also prepared a synthetic speech corpus to handle the out-of-vocabulary word problem in Bangla. Our corpus is suitable for training with Kaldi. Experimental results show that using our corpus in addition to the Google Speech corpus (229 hours) significantly improves the performance of the ASR system.
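The abstract does not spell out the iterative algorithm, so the following is only a minimal sketch of one common way such a loop can be organized (confidence-thresholded self-training), not the authors' actual method. The names `Utterance`, `grow_corpus`, `toy_train`, the `threshold` value, and the stopping rule are all hypothetical; `toy_train` is a stand-in for a real ASR trainer such as a Kaldi recipe.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Utterance:
    """One annotated item: an audio identifier and its transcript."""
    audio_id: str
    transcript: str


# A "model" here is any callable mapping an audio id to (hypothesis, confidence).
Model = Callable[[str], Tuple[str, float]]
Trainer = Callable[[List[Utterance]], Model]


def grow_corpus(
    seed: List[Utterance],
    raw_audio: List[str],
    train: Trainer,
    threshold: float = 0.9,   # assumed acceptance threshold
    max_rounds: int = 5,      # assumed iteration cap
) -> Tuple[List[Utterance], List[str]]:
    """Iteratively enlarge a corpus with confidently transcribed raw audio."""
    corpus = list(seed)
    remaining = list(raw_audio)
    for _ in range(max_rounds):
        if not remaining:
            break
        model = train(corpus)  # retrain on everything accepted so far
        accepted: List[Utterance] = []
        rejected: List[str] = []
        for audio_id in remaining:
            text, confidence = model(audio_id)
            if confidence >= threshold:
                accepted.append(Utterance(audio_id, text))
            else:
                rejected.append(audio_id)
        if not accepted:
            break  # no progress in this round, so further rounds cannot help
        corpus.extend(accepted)
        remaining = rejected
    return corpus, remaining


def toy_train(corpus: List[Utterance]) -> Model:
    """Hypothetical trainer: a larger corpus can handle "harder" clips."""
    strength = len(corpus)

    def model(audio_id: str) -> Tuple[str, float]:
        # The number after the underscore encodes a made-up difficulty level.
        difficulty = int(audio_id.split("_")[1])
        confidence = 1.0 if strength >= difficulty else 0.0
        return f"hypothesis for {audio_id}", confidence

    return model
```

In this sketch, each round retrains on the enlarged corpus, so clips rejected earlier get another chance once the model has improved; clips that never reach the threshold are returned separately rather than being added with low confidence.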
Keywords
Bangla speech corpus, publicly available audio, text