Large Scale Deep Neural Network Acoustic Modeling With Semi-Supervised Training Data For YouTube Video Transcription
2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013
Abstract
YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Improving accessibility to these videos for the hearing impaired and for search and indexing purposes is an excellent application of automatic speech recognition. However, YouTube videos are extremely challenging for automatic speech recognition systems. Standard adapted Gaussian Mixture Model (GMM) based acoustic models can have word error rates above 50%, making this one of the most difficult reported tasks. Since 2009, YouTube has provided automatic generation of closed captions for videos detected to have English speech; the service now supports ten different languages. This paper describes recent improvements to the original system, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data and deep neural network acoustic models with large state inventories. Applying an "island of confidence" filtering heuristic to select useful training segments, and increasing the model size by using 44,526 context dependent states with a low-rank final layer weight matrix approximation, improved performance by about 13% relative compared to previously reported sequence trained DNN results for this task.
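The low-rank final-layer approximation mentioned in the abstract can be sketched as follows. The 44,526 context dependent output states come from the paper; the hidden-layer width and the rank are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Sketch of a low-rank final-layer weight approximation: the dense
# hidden-to-output matrix of a DNN with a very large state inventory
# is replaced by a product of two thin matrices.
hidden = 2048    # assumed hidden-layer width (illustrative)
states = 44526   # context dependent state count from the abstract
rank = 256       # assumed bottleneck rank (illustrative)

rng = np.random.default_rng(0)
A = rng.standard_normal((hidden, rank))   # hidden -> bottleneck
B = rng.standard_normal((rank, states))   # bottleneck -> states

# Parameter counts: the factorization replaces hidden*states weights
# with hidden*rank + rank*states, a large reduction when rank is small.
full_params = hidden * states
low_rank_params = hidden * rank + rank * states

# The factorized layer still produces logits over all states:
# h @ A @ B has the same shape as h @ W would for a dense W.
h = rng.standard_normal((1, hidden))
logits = h @ A @ B
print(logits.shape)  # (1, 44526)
print(full_params, low_rank_params)
```

With these illustrative sizes the factorization cuts the final-layer parameter count from roughly 91M to roughly 12M, which is what makes such a large state inventory practical.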
Keywords
Large vocabulary speech recognition, deep neural networks, deep learning, audio indexing