BULBasaa: A Bilingual Basaa-French Speech Corpus for the Evaluation of Language Documentation Tools.

LREC (2018)

Abstract
Basaa is one of the three Bantu languages of BULB (Breaking the Unwritten Language Barrier), a project whose aim is to provide NLP-based tools to support linguists in documenting under-resourced and unwritten languages. To develop technologies such as automatic phone transcription or machine translation, a massive amount of speech data is needed. Approximately 50 hours of Basaa speech were thus collected and then carefully re-spoken and orally translated into French in a controlled environment by a few bilingual speakers. For a subset of approximately 10 hours of the corpus, each utterance was additionally phonetically transcribed to establish a gold standard for the output of our NLP tools. The experiments described in this paper are meant to provide an automatic phonetic transcription using a set of derived phone-like units. As every language features a specific set of idiosyncrasies, automating the process of phonetic unit discovery in its entirety is a challenging task. Within BULB, we envision a workflow where linguists are able to refine the set of automatically discovered units and the system is then able to iterate again on the data, providing a better approximation of the actual phone set.
Keywords
Basaa, Northwest Bantu, Computational linguistics, unsupervised phone discovery, under-resourced languages