IIITH MM2 Speech-Text: A preliminary data for automatic spoken data validation with matched and mismatched speech-text content.

Nayan Anand, Meenakshi Sirigiraju, Chiranjeevi Yarra

Oriental COCOSDA International Conference on Speech Database and Assessments (2023)

Abstract
The demand for high-quality speech data has been increasing as deep-learning approaches gain popularity in speech applications. Among these, automatic speech recognition (ASR) and text-to-speech (TTS) require large amounts of data containing speech and the corresponding text. For these applications, high-quality data is often obtained through manual validation, which ensures matching between speech and text. However, manual validation does not scale with this demand because of the cost and time involved. To cater to the demand for high-quality data, validating the data automatically could be useful. In this work, for automatic data validation, a spoken English corpus named IIITH MM2 Speech-Text is created, containing matched and mismatched speech-text pairs under read-speech conditions from Indian speakers of different nativities. For its creation, we consider 100 unique stimuli selected from the TIMIT corpus to ensure phonetic richness, for which a joint entropy maximization is proposed. These stimuli are recorded from 50 speakers, resulting in matched and mismatched sets containing 5000 and 764 utterances with total durations of 6 hours and 1 hour, respectively. The mismatched set contains speech from instances where the speakers naturally made spoken errors while reading the reference text. It also contains two stimuli per utterance: one is the reference text, and the other is manually annotated text that reflects the erroneous speech. Thus, the reference and the annotated text are used for building models of speech-text mismatch detection and correction, respectively. To the best of our knowledge, no such corpora exist containing both matched and mismatched speech-text. As a preliminary analysis for speech-text mismatch detection, a baseline considering Wav2Vec-2.0 representations and dynamic time warping (DTW) results in a detection F1-score of 0.87.
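The abstract states that the 100 stimuli are chosen from TIMIT through a joint entropy maximization, but does not give the objective in detail. The Python sketch below is only an illustration, assuming a greedy selection that maximizes the Shannon entropy of the phone-bigram distribution over the selected set; the input format (a mapping from stimulus id to phone sequence) and the greedy strategy are assumptions, not the paper's exact method.

import math
from collections import Counter

def bigrams(phones):
    # Adjacent phone pairs from one stimulus, e.g. ["sh", "iy", "hh"] -> [("sh","iy"), ("iy","hh")].
    return list(zip(phones, phones[1:]))

def joint_entropy(counts):
    # Shannon entropy (bits) of the empirical phone-bigram distribution.
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_stimuli(candidates, k=100):
    # candidates: dict mapping a stimulus id to its phone sequence (hypothetical input format).
    selected, pool, counts = [], dict(candidates), Counter()
    for _ in range(k):
        best_id, best_h = None, float("-inf")
        for sid, phones in pool.items():
            # Entropy of the selected set if this stimulus were added.
            h = joint_entropy(counts + Counter(bigrams(phones)))
            if h > best_h:
                best_id, best_h = sid, h
        selected.append(best_id)
        counts += Counter(bigrams(pool.pop(best_id)))
    return selected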
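For the mismatch-detection baseline, the abstract names Wav2Vec-2.0 representations and DTW but not how the reference text enters the comparison. The sketch below is a minimal example under the assumption that the test utterance is aligned against a second audio rendition of the reference text, with the path-length-normalised DTW cost thresholded as a mismatch score; the checkpoint name and threshold are illustrative, not taken from the paper.

import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def w2v_features(wav_path):
    # Frame-level Wav2Vec-2.0 representations, shaped (dim, frames) for librosa's DTW.
    wav, _ = librosa.load(wav_path, sr=16000)
    inputs = extractor(wav, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, T, 768)
    return hidden.squeeze(0).T.numpy()

def mismatch_score(test_wav, reference_wav):
    # Cumulative DTW cost between the two feature sequences, normalised by path length.
    X, Y = w2v_features(test_wav), w2v_features(reference_wav)
    D, wp = librosa.sequence.dtw(X=X, Y=Y, metric="cosine")
    return D[-1, -1] / len(wp)

# Flag an utterance as mismatched when the cost exceeds a tuned threshold (value illustrative).
# is_mismatch = mismatch_score("utt.wav", "reference_rendition.wav") > 0.3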