Speech recognition datasets for low-resource Congolese languages

Ussen Kimanuka, Ciira wa Maina, Osman Buyuk

Data in Brief (2024)

Abstract
Large pre-trained Automatic Speech Recognition (ASR) models have shown improved performance in low-resource languages due to the increased availability of benchmark corpora and the advantages of transfer learning. However, only a limited number of languages possess ample resources to fully leverage transfer learning. In such contexts, benchmark corpora become crucial for advancing methods. In this article, we introduce two new benchmark corpora designed for low-resource languages spoken in the Democratic Republic of the Congo: the Lingala Read Speech Corpus, with 4 h of labelled audio, and the Congolese Speech Radio Corpus, which offers 741 h of unlabelled audio spanning four significant low-resource languages of the region. For the Lingala Read Speech Corpus, thirty-two distinct adult speakers were recorded, each with a unique context, in various settings and with different accents. The Congolese Speech Radio raw data were taken from the archive of a broadcast station and then passed through a designed curation process. During data preparation, numerous strategies were used to pre-process the data. The datasets, which have been made freely accessible to all researchers, serve as a valuable resource for investigating and developing not only monolingual methods and approaches that employ linguistically distant languages but also multilingual approaches with linguistically similar languages. Using supervised and self-supervised learning, we develop the first speech recognition benchmarks for Lingala and the first multilingual model tailored to four Congolese languages spoken by a combined population of 95 million. Two models were applied to the datasets: the first uses supervised learning and the second uses self-supervised pre-training. (c) 2023 The Author(s). Published by Elsevier Inc.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
Keywords
Automatic speech recognition, Pre-trained models, Transfer learning, Self-supervised learning, Cross-lingual acoustic model, Multilingual acoustic model