
Recent advances in deep learning for speech research at Microsoft.

ICASSP, pp. 8604–8608, 2013

Cited by: 751 | Views: 438

Abstract

Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light on the basic capabilities and limitations of the current deep learning technology. We organize this overview along the feature-domain and model-domain dimensions.

Introduction
  • Speech recognition technology has been dominated by a “shallow” architecture using many Gaussians in the mixtures associated with HMM states to represent acoustic variability in the speech signal.
  • Since 2009, in collaboration with researchers at the University of Toronto and other organizations, the authors at Microsoft have developed deep learning technology that has successfully replaced Gaussian mixtures for speech recognition and feature coding on an increasingly large scale (e.g., [24][19][53][39][7][8][44][54][13][56][30][48]); the basic hybrid computation is sketched after this list.
  • Representative experimental results are shown to facilitate analysis of the strengths and weaknesses of the techniques the authors have developed and illustrate in this paper
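
As background for the hybrid computation mentioned above: in hybrid DNN-HMM systems of the kind cited, the network outputs senone posteriors p(s|x), which are divided by the senone priors p(s) to give the scaled likelihoods that standard HMM decoding consumes in place of GMM likelihoods. A minimal numpy sketch of that conversion; the function name, shapes, and prior values below are illustrative, not taken from the paper:

    import numpy as np

    def senone_scaled_likelihoods(posteriors, priors, eps=1e-10):
        # DNN gives senone posteriors p(s|x); hybrid decoding needs scaled
        # likelihoods p(x|s) ~ p(s|x) / p(s). Work in the log domain.
        return np.log(posteriors + eps) - np.log(priors + eps)

    # Toy numbers: 3 frames, 4 senones; priors come from alignment counts.
    posteriors = np.array([[0.7, 0.1, 0.1, 0.1],
                           [0.2, 0.5, 0.2, 0.1],
                           [0.1, 0.1, 0.2, 0.6]])
    priors = np.array([0.4, 0.3, 0.2, 0.1])
    log_scaled = senone_scaled_likelihoods(posteriors, priors)  # shape (3, 4)
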
Highlights
  • For many years, speech recognition technology has been dominated by a “shallow” architecture using many Gaussians in the mixtures associated with HMM states to represent acoustic variability in the speech signal
  • We provide an overview of this body of work, with emphasis on more recent experiments which shed light on the basic capabilities and limitations of the current deep learning technology for speech recognition and related applications
  • In Sections 2-5, we focus on several aspects of deep learning in the feature-domain with the theme of how deep models can enable the effective use of primitive, information-rich spectral features
  • We developed and experimentally evaluated the multilingual deep neural net architecture shown in Figure 1b
  • We presented experimental evidence that spectrogram features of speech are superior to Mel-frequency cepstral coefficients (MFCCs) with deep neural nets, in contrast to the earlier long-standing practice with Gaussian mixture model-HMMs; the relationship between the two feature types is made concrete in the sketch after this list
  • Our work and that of others over the past few years has demonstrated that deep learning is a powerful technology; e.g., on the Switchboard ASR task the word error rate has been reduced sharply from 23% with the prior-art Gaussian mixture model-HMM system to as low as 13% currently [32][48]
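
On the spectrogram-versus-MFCC point above: MFCCs are obtained by applying a truncated discrete cosine transform (DCT) to log Mel filter-bank energies, a lossy decorrelation step that suited diagonal-covariance GMMs but discards spectral detail a DNN can exploit directly. A minimal sketch of that relationship, assuming log filter-bank features have already been computed; the 40-filter and 13-coefficient sizes are common defaults, not necessarily the paper's configuration:

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_from_log_fbank(log_fbank, num_ceps=13):
        # MFCCs are a truncated DCT of log filter-bank energies; keeping
        # only the first num_ceps coefficients throws away spectral detail.
        return dct(log_fbank, type=2, axis=-1, norm='ortho')[..., :num_ceps]

    # Toy log filter-bank features: 100 frames x 40 Mel filters.
    log_fbank = np.random.default_rng(0).standard_normal((100, 40))
    mfcc = mfcc_from_log_fbank(log_fbank)  # (100, 13): lossy, decorrelated
    # A DNN front end can instead consume log_fbank (or the spectrogram)
    # directly, which is the practice the highlights above argue for.
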
Results
  • When the data sets used in training the DNN system are obtained from the same source, similar error rates are obtained with and without applying a sentence-level spectral feature normalization procedure (30.0% vs. 30.1%); one common form of such normalization is sketched below
  • The authors also explored a primitive convolutional neural net in which the pooling configuration is fixed
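
One common realization of sentence-level spectral feature normalization is per-utterance mean (optionally also variance) normalization of each feature dimension; the paper's exact procedure is not spelled out here, so the following is a generic sketch under that assumption:

    import numpy as np

    def per_utterance_normalize(feats, use_variance=False, eps=1e-10):
        # Subtract the per-utterance mean of each feature dimension;
        # optionally also scale each dimension to unit variance.
        normed = feats - feats.mean(axis=0, keepdims=True)
        if use_variance:
            normed = normed / (feats.std(axis=0, keepdims=True) + eps)
        return normed

    # One utterance: 250 frames x 40 spectral dimensions (toy data).
    utt = np.random.default_rng(1).standard_normal((250, 40)) + 3.0
    normed = per_utterance_normalize(utt)  # each dimension now zero-mean
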
Conclusion
  • A major theme the authors adopt in writing this overview goes to the very core of deep learning: automatic learning of representations in place of hand-tuned feature engineering
  • To this end, the authors presented experimental evidence that spectrogram features of speech are superior to MFCCs with DNNs, in contrast to the earlier long-standing practice with GMM-HMMs. New improvements in DNN architectures and learning are needed to push the features even further back toward the raw level of acoustic measurements.
Tables
  • Table 1: Comparing MFCC with filter-bank features
  • Table 2: DNN performance on wideband and narrowband test sets
  • Table 3: Comparing DNN word error rates on a resource-rich task (FRA training data = 138 hrs) with & without other languages
  • Table 4: Comparing DNN word error rates on a resource-limited task (ENU training data = 9 hrs) with & without other languages. Retraining only the top layer gives lower errors than retraining all layers due to the data sparsity in ENU. Adding three more source languages in training further reduces recognition errors. The multilingual DNN thus provides an effective structure for transferring information learnt from multiple languages to the DNN for a resource-limited target language, thanks to phonetic information sharing (this shared-hidden-layer architecture is sketched after this list)
  • Table 5: Word error rate (%) for all four test sets (A, B, C, and D) of the Aurora 4 task. The DNN outperforms the GMM systems
  • Table 6: DNN adaptation using SGD and batch implementations
  • Table 7: Word error rates for varying numbers (200, 50, and 5) of adaptation utterances. DNN baseline error rate: 34.1% (one such adaptation scheme is also sketched below)
  • Table 8: Goal-tracking accuracy for five slots using a baseline maximum entropy model and a DSN. Experiments were done on a fixed corpus of dialogs with real users
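
On the multilingual results (Tables 3 and 4): the architecture shares all hidden layers across languages and attaches a language-specific softmax output layer for each language, so a resource-limited language can reuse the shared layers and retrain only its own top layer. A minimal forward-pass sketch; the layer sizes, senone counts, and the ReLU nonlinearity are illustrative assumptions rather than the paper's configuration:

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    class MultilingualDNN:
        # Shared hidden layers act as a language-universal feature
        # transform; each language owns only its softmax output layer.
        def __init__(self, dims, num_senones, rng):
            self.shared = [rng.standard_normal((i, o)) * 0.01
                           for i, o in zip(dims[:-1], dims[1:])]
            self.heads = {lang: rng.standard_normal((dims[-1], n)) * 0.01
                          for lang, n in num_senones.items()}

        def forward(self, x, lang):
            h = x
            for W in self.shared:
                h = np.maximum(0.0, h @ W)        # hidden nonlinearity
            return softmax(h @ self.heads[lang])  # senone posteriors

    rng = np.random.default_rng(0)
    net = MultilingualDNN([40 * 11, 512, 512],
                          {"FRA": 3000, "ENU": 1500}, rng)
    frames = rng.standard_normal((8, 40 * 11))  # 11-frame context window
    post = net.forward(frames, "ENU")           # (8, 1500) posteriors

For a new resource-limited language, only a fresh entry in heads needs training, which mirrors the top-layer-only retraining result reported for ENU in Table 4.
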
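
On the adaptation results (Tables 6 and 7): one way to adapt a DNN with very little data, in the spirit of the KL-divergence regularized adaptation of Yu et al. (see the references), is to train toward targets interpolated between the hard labels and the unadapted speaker-independent (SI) model's posteriors, which keeps the adapted model close to the SI model. A minimal sketch; the interpolation weight and toy numbers are illustrative:

    import numpy as np

    def kl_regularized_targets(one_hot_labels, si_posteriors, rho):
        # Cross-entropy training against these interpolated targets is
        # equivalent to adding a KL penalty that ties the adapted model's
        # outputs to the unadapted SI model's outputs.
        return (1.0 - rho) * one_hot_labels + rho * si_posteriors

    # Toy example: 4 frames, 3 senones.
    labels = np.eye(3)[[0, 1, 1, 2]]      # hard alignment labels
    si_post = np.full((4, 3), 1.0 / 3.0)  # SI model posteriors (toy)
    targets = kl_regularized_targets(labels, si_post, rho=0.5)

The larger rho is, the more the adapted model is held to the SI behavior, which matters when, as in Table 7, only a handful of adaptation utterances are available.
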
References
  • O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural network concepts to hybrid NN-HMM model for speech recognition,” ICASSP, 2012.
  • V. Abrash, H. Franco, A. Sankar, and M. Cohen, “Connectionist speaker normalization and adaptation,” Eurospeech, 1995.
  • Y. Bengio, “Representation learning: A review and new perspectives,” IEEE Trans. PAMI, special issue on Learning Deep Architectures, 2013.
  • Y. Bengio, N. Boulanger, and R. Pascanu, “Advances in optimizing recurrent networks,” ICASSP, 2013.
  • A. Black et al., “Spoken dialog challenge 2010: Comparison of live and control test results,” SIGdial Workshop, 2011.
  • X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide, “Pipelined back-propagation for context-dependent deep neural networks,” Interspeech, 2012.
  • G. Dahl, D. Yu, L. Deng, and A. Acero, “Large vocabulary continuous speech recognition with context-dependent DBN-HMMs,” ICASSP, 2011.
  • G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large vocabulary speech recognition,” IEEE Trans. Audio, Speech, Lang. Proc., vol. 20, pp. 30–42, 2012.
  • G. Dahl, T. Sainath, and G. Hinton, “Improving DNNs for LVCSR using RELU and dropout,” ICASSP, 2013.
  • J. Dean et al., “Large scale distributed deep networks,” NIPS, 2012.
  • L. Deng and X. Li, “Machine learning paradigms for speech recognition: An overview,” IEEE Trans. Audio, Speech & Lang. Proc., vol. 21, no. 5, May 2013.
  • L. Deng, A. Acero, M. Plumpe, and X. Huang, “Large-vocabulary speech recognition under adverse acoustic environments,” ICSLP, 2000.
  • L. Deng, D. Yu, and J. Platt, “Scalable stacking and learning for building deep architectures,” ICASSP, 2012.
  • L. Deng and D. Yu, “Deep convex net: A scalable architecture for speech pattern classification,” Interspeech, 2011.
  • L. Deng, G. Tur, X. He, and D. Hakkani-Tur, “Use of kernel deep convex networks and end-to-end learning for spoken language understanding,” IEEE SLT, 2012.
  • L. Deng, “Integrated-multilingual speech recognition using universal phonological features in a functional speech production model,” ICASSP, 1997.
  • L. Deng, O. Abdel-Hamid, and D. Yu, “A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion,” ICASSP, 2013.
  • L. Deng, X. He, and J. Gao, “Deep stacking networks for information retrieval,” ICASSP, 2013.
  • L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, “Binary coding of speech spectrograms using a deep auto-encoder,” Interspeech, 2010.
  • X. Fan, M. Seltzer, J. Droppo, H. Malvar, and A. Acero, “Joint encoding of the waveform and speech recognition features using a transform codec,” ICASSP, 2011.
  • F. Flego and M. Gales, “Factor analysis based VTS and JUD noise estimation and compensation,” Cambridge University, Tech. Rep. CUED/F-INFENG/TR653, 2011.
  • X. He, L. Deng, D. Hakkani-Tur, and G. Tur, “Multi-style adaptive training for robust cross-lingual spoken language understanding,” ICASSP, 2013.
  • G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, “Multilingual acoustic models using distributed deep neural networks,” ICASSP, 2013.
  • G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Sig. Proc. Mag., vol. 29, 2012.
  • G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv:1207.0580v1, 2012.
  • H. Hermansky, “Speech recognition from spectral dynamics,” Sadhana (Indian Academy of Sciences), 2011, pp. 729–744.
  • J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers,” ICASSP, 2013.
  • P. Huang, K. Kumar, C. Liu, Y. Gong, and L. Deng, “Predicting speech recognition confidence using deep learning with word identity and score features,” ICASSP, 2013.
  • P. Huang, L. Deng, M. Hasegawa-Johnson, and X. He, “Random features for kernel deep convex networks,” ICASSP, 2013.
  • B. Hutchinson, L. Deng, and D. Yu, “Tensor deep stacking networks,” IEEE Trans. PAMI, 2013, to appear.
  • O. Kalinli, M. L. Seltzer, J. Droppo, and A. Acero, “Noise adaptive training for robust automatic speech recognition,” IEEE Trans. Audio, Speech & Lang. Proc., vol. 18, no. 8, pp. 1889–1901, 2010.
  • B. Kingsbury, T. Sainath, and H. Soltau, “Scalable minimum Bayes risk training of DNN acoustic models using distributed Hessian-free optimization,” Interspeech, 2012.
  • X. Li and J. Bilmes, “Regularized adaptation of discriminative classifiers,” ICASSP, 2006.
  • J. Li, D. Yu, J.-T. Huang, and Y. Gong, “Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM,” IEEE SLT, 2012.
  • H. Lin, L. Deng, D. Yu, Y. Gong, A. Acero, and C.-H. Lee, “A study on multilingual acoustic modeling for large vocabulary ASR,” ICASSP, 2009.
  • Z. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis,” ICASSP, 2013.
  • T. Mikolov and G. Zweig, “Context dependent recurrent neural network language model,” IEEE SLT, 2012.
  • T. Mikolov, M. Karafiat, J. Cernocky, and S. Khudanpur, “Recurrent neural network based language model,” Interspeech, 2010.
  • A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training of deep belief networks for speech recognition,” Interspeech, 2010.
  • J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng, “Multimodal deep learning,” ICML, 2011.
  • A. Ragni and M. Gales, “Derivative kernels for noise robust ASR,” ASRU, 2011.
  • T. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for LVCSR,” ICASSP, 2013.
  • T. Schultz and A. Waibel, “Multilingual and cross-lingual speech recognition,” DARPA Workshop on Broadcast News Transcription and Understanding, 1998.
  • F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” Interspeech, 2011.
  • M. L. Seltzer, D. Yu, and Y. Wang, “An investigation of deep neural networks for noise robust speech recognition,” ICASSP, 2013.
  • H. Sheikhzadeh and L. Deng, “Waveform-based speech recognition using hidden filter models: Parameter selection and sensitivity to power normalization,” IEEE Trans. Speech & Audio Proc., vol. 2, pp. 80–91, 1994.
  • Y. Shi, P. Wiggers, and C. M. Jonker, “Towards recurrent neural network language models with linguistic and contextual features,” Interspeech, 2012.
  • H. Su, G. Li, D. Yu, and F. Seide, “Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription,” ICASSP, 2013.
  • P. Swietojanski, A. Ghoshal, and S. Renals, “Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR,” IEEE SLT, 2012.
  • G. Tur, L. Deng, D. Hakkani-Tur, and X. He, “Towards deeper understanding: Deep convex networks for semantic utterance classification,” ICASSP, 2012.
  • N. Vu, W. Breiter, F. Metze, and T. Schultz, “An investigation on initialization schemes for multilayer perceptron training using multilingual data and their effect on ASR performance,” Interspeech, 2012.
  • K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong, “Adaptation of context-dependent deep neural networks for automatic speech recognition,” IEEE SLT, 2012.
  • D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition,” NIPS Workshop on Deep Learning, 2010.
  • D. Yu, F. Seide, G. Li, and L. Deng, “Exploiting sparseness in deep neural networks for large vocabulary speech recognition,” ICASSP, 2012, pp. 4409–4412.
  • D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” ICASSP, 2013.
  • D. Yu, L. Deng, and F. Seide, “The deep tensor neural network with applications to large vocabulary speech recognition,” IEEE Trans. Audio, Speech, & Lang. Proc., vol. 21, no. 2, pp. 388–396, Feb. 2013.
  • F. Zamora-Martinez, S. Espana-Boquera, M. J. Castro-Bleda, and R. De-Mori, “Cache neural network language models based on long-distance dependencies for a spoken dialog system,” ICASSP, 2012.
  • Y. Zhang, L. Deng, X. He, and A. Acero, “A novel decision function and the associated decision-feedback learning for speech translation,” ICASSP, 2011.