Preparation of Bangla Speech Corpus from Publicly Available Audio & Text.

Shafayat Ahmed, Nafis Sadeq, Sudipta Saha Shubha, Md. Nahidul Islam, Muhammad Abdullah Adnan, Mohammad Zuberul Islam

LREC (2020)

Abstract

Automatic speech recognition (ASR) systems require large annotated speech corpora, and manually annotating a large corpus is very difficult. In this paper, we focus on the automatic preparation of a speech corpus for Bangladeshi Bangla. We use publicly available Bangla audiobooks and TV news recordings as audio sources. We designed and implemented an iterative algorithm that takes as input a speech corpus and a huge amount of raw audio (without transcription) and outputs a much larger speech corpus with reasonable confidence. We leverage speaker diarization, gender detection, and other techniques to prepare the annotated corpus. We have also prepared a synthetic speech corpus to handle the out-of-vocabulary word problem in the Bangla language. Our corpus is suitable for training with Kaldi. Experimental results show that using our corpus in addition to the Google Speech corpus (229 hours) significantly improves the performance of the ASR system.
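
The abstract does not spell out the iterative algorithm, so the following is only a minimal sketch of one common way such a loop can be organized (confidence-thresholded self-training), not the authors' implementation. The names `Utterance`, `expand_corpus`, `toy_train`, the `threshold` value, and the `clip_N` identifiers are all hypothetical; the toy trainer merely stands in for a real ASR training step such as a Kaldi recipe.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Utterance:
    audio_id: str
    transcript: str


# A recognizer maps an audio id to (hypothesis text, confidence in [0, 1]).
Recognizer = Callable[[str], Tuple[str, float]]
# A trainer builds a recognizer from the current labeled corpus.
Trainer = Callable[[List[Utterance]], Recognizer]


def expand_corpus(
    seed: List[Utterance],
    raw_audio: List[str],
    train: Trainer,
    threshold: float = 0.5,
    max_rounds: int = 5,
) -> Tuple[List[Utterance], List[str]]:
    """Grow a labeled corpus by repeatedly transcribing raw audio and
    keeping only hypotheses whose confidence reaches the threshold."""
    corpus = list(seed)
    remaining = list(raw_audio)
    for _ in range(max_rounds):
        recognize = train(corpus)  # retrain on the enlarged corpus
        accepted, rejected = [], []
        for audio_id in remaining:
            text, confidence = recognize(audio_id)
            if confidence >= threshold:
                accepted.append(Utterance(audio_id, text))
            else:
                rejected.append(audio_id)
        if not accepted:  # nothing new was learned, so stop early
            break
        corpus.extend(accepted)
        remaining = rejected
    return corpus, remaining


def toy_train(corpus: List[Utterance]) -> Recognizer:
    """Hypothetical stand-in for ASR training: confidence rises as the
    corpus grows and falls with the clip's made-up difficulty number."""
    size = len(corpus)

    def recognize(audio_id: str) -> Tuple[str, float]:
        difficulty = int(audio_id.split("_")[1])
        return f"hypothesis for {audio_id}", size / (size + difficulty)

    return recognize


seed = [Utterance("seed_0", "reference transcript")]
corpus, remaining = expand_corpus(
    seed, ["clip_1", "clip_2", "clip_4", "clip_9"], toy_train
)
print(len(corpus), remaining)
```

In this toy run each round admits clips that the previous model could not label confidently, which is the property that lets a small seed corpus grow; a real pipeline would replace `toy_train` with actual model training and decoding.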

Keywords

Bangla speech corpus, publicly available audio, text
