Cross database audio visual speech adaptation for phonetic spoken term detection.

Computer Speech & Language (2017)

Abstract
We show that the use of visual information improves both phone recognition and spoken term detection accuracy. Fused HMM adaptation can be used to benefit from multiple databases when training audio-visual phone models. An additional audio adaptation step improves cross-database training accuracy for phone recognition and spoken term detection. A post-training step can be used to update all HMM parameters and further improve phone recognition accuracy.

Spoken term detection (STD), the process of finding all occurrences of a specified search term in a large collection of speech segments, has many applications in multimedia search and information retrieval. It is known that visual information in the form of lip movements can improve STD performance in the presence of audio noise. However, research in this direction has been hampered by the lack of large annotated audio-visual databases for development. We propose a novel approach to developing audio-visual spoken term detection when only a small (low-resource) audio-visual database is available. First, cross-database training is proposed as a novel framework based on the fused hidden Markov model (HMM) technique: an audio model is trained on large, publicly available audio databases and then adapted to the visual data of the given audio-visual database. This approach is shown to perform better than the standard HMM joint-training method, and it also improves spoken term detection performance when used in the indexing stage. In a second approach, the external audio models are first adapted to the audio data of the given audio-visual database and then adapted to the visual data. This approach also improves both phone recognition and spoken term detection accuracy. Finally, the cross-database training technique is used for HMM initialization, and an extra parameter re-estimation step is applied to the initialized models using the Baum-Welch algorithm.

The proposed approaches to audio-visual model training make it possible to exploit both the large out-of-domain audio databases that are available and the small audio-visual database given for development, yielding more accurate audio-visual models.
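
To make the search stage of phonetic STD concrete, the following is a minimal Python sketch and not the method used in the paper. It assumes a hypothetical index that maps each speech segment ID to a single best phone sequence produced by a phone recognizer, and it reports only exact matches. Practical systems typically search richer structures such as phone lattices and tolerate recognition errors. The names `find_term` and `index`, and the phone labels, are invented for illustration.

```python
from typing import Dict, List, Tuple


def find_term(index: Dict[str, List[str]],
              term_phones: List[str]) -> List[Tuple[str, int]]:
    """Return (segment_id, start_position) for each exact occurrence
    of term_phones in the indexed phone sequences."""
    hits: List[Tuple[str, int]] = []
    n = len(term_phones)
    if n == 0:
        return hits  # an empty query matches nothing
    for seg_id, phones in index.items():
        # Slide a window of length n over the segment's phone sequence.
        for start in range(len(phones) - n + 1):
            if phones[start:start + n] == term_phones:
                hits.append((seg_id, start))
    return hits


# Hypothetical index: segment ID -> 1-best phone sequence (assumed data).
index = {
    "seg1": ["sil", "k", "ae", "t", "s", "ae", "t", "sil"],
    "seg2": ["dh", "ax", "k", "ae", "t", "sil"],
}

# Search for the phone sequence of the term "cat".
hits = find_term(index, ["k", "ae", "t"])
print(hits)
```

In this sketch, more accurate phone models matter because every recognition error in the indexed sequences can hide a true occurrence or create a false one.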
Keywords
Spoken term detection, Synchronous hidden Markov model, Cross-database training, Phone recognition