Exploiting Monolingual Speech Corpora for Code-Mixed Speech Recognition

Karan Taneja, Satarupa Guha, Preethi Jyothi, Basil Abraham

INTERSPEECH (2019)

Abstract

One of the main challenges in building code-mixed ASR systems is the lack of annotated speech data. Often, however, monolingual speech corpora are available in abundance for the languages in the code-mixed speech. In this paper, we explore different techniques that use monolingual speech to create synthetic code-mixed speech and examine their effect on training models for code-mixed ASR. We assume access to a small amount of real code-mixed text, from which we extract probability distributions that govern the transition of phones across languages at code-switch boundaries and the span lengths corresponding to a particular language. We extract segments from monolingual data and concatenate them to form code-mixed utterances such that these probability distributions are preserved. Using this synthetic speech, we show significant improvements in Hindi-English code-mixed ASR performance compared to using synthetic speech naively constructed from complete utterances in different languages. We also present language modelling experiments that use synthetically constructed code-mixed text and discuss their benefits.
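
As a rough illustration of the span-length idea described above, the following Python sketch estimates per-language span-length counts from language-tagged code-mixed text and then samples an alternating plan of spans that a synthetic utterance could follow. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `(word, lang)` input format, the toy sentences, and the strict alternation between two languages are all hypothetical, and the phone-transition distribution at code-switch boundaries and the extraction of audio segments from monolingual corpora are not modelled here.

```python
import random
from collections import Counter, defaultdict


def span_length_counts(tagged_sentences):
    """Count how often each same-language run (span) of a given length occurs.

    tagged_sentences: list of sentences, each a list of (word, lang) pairs.
    Returns a dict mapping lang -> Counter({span_length: count}).
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        run_lang, run_len = None, 0
        for _, lang in sentence:
            if lang == run_lang:
                run_len += 1
            else:
                # A language switch closes the previous run, if there was one.
                if run_lang is not None:
                    counts[run_lang][run_len] += 1
                run_lang, run_len = lang, 1
        # Close the final run of the sentence.
        if run_lang is not None:
            counts[run_lang][run_len] += 1
    return counts


def sample_span_plan(counts, langs, n_spans, rng):
    """Sample n_spans (lang, span_length) pairs, cycling through langs.

    Span lengths are drawn in proportion to the empirical counts, so the
    sampled plan follows the span-length distribution of the real text.
    """
    plan = []
    for i in range(n_spans):
        lang = langs[i % len(langs)]
        lengths = list(counts[lang].keys())
        weights = list(counts[lang].values())
        plan.append((lang, rng.choices(lengths, weights=weights, k=1)[0]))
    return plan


# Hypothetical toy data: two Hindi-English sentences with word-level tags.
toy_text = [
    [("main", "hi"), ("kal", "hi"), ("meeting", "en"), ("mein", "hi")],
    [("the", "en"), ("report", "en"), ("bhej", "hi"), ("do", "hi")],
]
toy_counts = span_length_counts(toy_text)
toy_plan = sample_span_plan(toy_counts, ("hi", "en"), 4, random.Random(0))
```

Each `(lang, span_length)` pair in the plan would then have to be filled with a segment of that length taken from the corresponding monolingual corpus. How segments are selected and joined is left open in this sketch.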

Keywords

code-mixed speech recognition, synthetic code-mixed speech from monolingual data
