Leveraging Text Data For Word Segmentation For Underresourced Languages

Thomas Glarner,Benedikt T. Boenninghoff,Oliver Walter,Reinhold Haeb-Umbach

18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION（2017）

引用 3|浏览11

暂无评分

摘要

In this contribution we show how to exploit text data to support word discovery from audio input in an underresourced target language. Given audio, of which a certain amount is transcribed at the word level, and additional unrelated text data, the approach is able to learn a probabilistic mapping from acoustic units to characters and utilize it to segment the audio data into words without the need of a pronunciation dictionary. This is achieved by three components: an unsupervised acoustic unit discovery system, a supervisedly trained acoustic unit-to-grapheme converter, and a word discovery system, which is initialized with a language model trained on the text data. Experiments for multiple setups show that the initialization of the language model with text data improves the word segementation performance by a large margin.

查看译文

关键词

underresourced speech recognition, nonparametric Bayesian estimation, unsupervised word segmentation, unsupervised learning

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要