# Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Nov. 2012.



Abstract:

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
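The frame-fit evaluation described in the abstract can be sketched concretely: each HMM state owns a GMM, and the emission score for an acoustic frame is that GMM's log-likelihood. A minimal diagonal-covariance version (illustrative only; the function name and parameter layout are assumptions, not from the article):

```python
import numpy as np

def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one acoustic frame under a diagonal-covariance GMM.

    frame:     (D,) feature vector (e.g., MFCCs)
    weights:   (K,) mixture weights summing to 1
    means:     (K, D) component means
    variances: (K, D) per-dimension (diagonal) variances
    """
    # Per-component log Gaussian density, summed over feature dimensions
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_expo = -0.5 * np.sum((frame - means) ** 2 / variances, axis=1)
    # log sum_k w_k N(frame; mu_k, sigma_k) via log-sum-exp for stability
    log_comp = np.log(weights) + log_norm + log_expo
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))
```

In a GMM-HMM decoder this score is computed per state per frame; the DNN alternative instead outputs posterior probabilities over all states at once.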


Introduction

- New machine learning algorithms can lead to significant advances in automatic speech recognition (ASR).
- Neither the hardware nor the learning algorithms were adequate for training neural networks with many hidden layers on large amounts of data.
- The acoustic input was typically represented by perceptual linear predictive coefficients (PLPs) [4] computed from the raw waveform, together with their first- and second-order temporal differences [5]. These representations discard information considered irrelevant for discrimination and express the remaining information in a form that facilitates discrimination with GMM-HMMs.
- The main practical contribution of neural networks at that time was to provide extra features in tandem systems.
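The first- and second-order temporal differences mentioned above can be sketched as follows. This is a simplified two-point slope with edge replication, not the multi-frame regression formula used by standard front ends:

```python
import numpy as np

def deltas(feats):
    """First-order temporal differences of a (T, D) feature matrix.

    Simplified two-point slope (c[t+1] - c[t-1]) / 2 with edge replication;
    production front ends typically regress over +/-2 frames instead.
    """
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def full_features(static):
    """Concatenate static coefficients with their first- and second-order
    differences, giving a (T, 3*D) matrix (delta-deltas are deltas of deltas)."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])
```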

Highlights

- New machine learning algorithms can lead to significant advances in automatic speech recognition (ASR)
- Using DBN-DNNs to provide input features for GMM-HMM systems: a class of methods is described in which neural networks are used to provide the feature vectors that the GMM in a GMM-HMM system is trained to model.
- We have described how three major speech research groups …
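The tandem/bottleneck idea in the highlight above can be sketched as follows, assuming a hypothetical already-trained MLP stored as a list of weight/bias pairs (the function name and layout are illustrative, not from the article): the activations of an intermediate layer are tapped and handed to the GMM-HMM as its input features.

```python
import numpy as np

def bottleneck_features(x, layers, tap):
    """Forward a batch of frames through an MLP and return the activations
    of layer `tap` for use as tandem/bottleneck features in a GMM-HMM.

    x:      (N, D) input frames
    layers: list of (W, b) weight/bias pairs from a trained network
    tap:    index of the hidden layer whose activations become the features
    """
    h = x
    for i, (W, b) in enumerate(layers):
        h = np.tanh(h @ W + b)  # hidden nonlinearity (sigmoid/tanh are typical)
        if i == tap:
            return h
    return h
```

In practice such features are usually decorrelated (e.g., by PCA) before diagonal-covariance GMMs model them.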

Methods

Methods compared on the TIMIT phone recognition task, reported as phone error rate (PER):

- CD-HMM [26]
- Augmented conditional random fields [26]
- Randomly initialized recurrent neural nets [27]
- Bayesian triphone GMM-HMM [28]
- Monophone HTMs [29]
- Heterogeneous classifiers [30]
- Monophone randomly initialized DNNs (six layers) [13]
- Monophone DBN-DNNs (six layers) [13]
- Monophone DBN-DNNs with MMI training [31]
- Triphone GMM-HMMs DT w/ BMMI [32]
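The DBN-DNN entries above refer to networks whose layers are pretrained as a stack of restricted Boltzmann machines before discriminative fine-tuning. A minimal sketch of one-step contrastive divergence (CD-1) and greedy layer-wise stacking, under simplifying assumptions (binary units throughout, full-batch updates, no momentum or weight decay; real systems use a Gaussian-Bernoulli first layer for acoustic input):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=5):
    """CD-1 training of a binary RBM: the building block of DBN pretraining."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)  # visible biases
    b = np.zeros(n_hidden)   # hidden biases
    for _ in range(epochs):
        v0 = data
        # Positive phase: hidden probabilities and a sample given the data
        ph0 = sigmoid(v0 @ W + b)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one-step reconstruction of visibles, then hiddens
        pv1 = sigmoid(h0 @ W.T + a)
        ph1 = sigmoid(pv1 @ W + b)
        n = len(data)
        W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        a += lr * (v0 - pv1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
    return W, b

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pretraining: each RBM is trained on the hidden
    activities of the one below; the stack then initializes a DNN."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden)
        weights.append((W, b))
        x = sigmoid(x @ W + b)
    return weights
```

After stacking, a softmax output layer is added and the whole network is fine-tuned with backpropagation.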

Results

**RECOGNITION RESULTS ON TIMIT AND SUBSEQUENTLY ON A VARIETY OF LVCSR TASKS**

- This was used to create a new baseline system for which the input was nine frames of MFCCs that were transformed by LDA. SA training was performed, and decision tree clustering was used to obtain 17,552 triphone states. STCs were used in the GMMs to model the features. The acoustic models were further improved with BMMI. During decoding, ML linear regression (MLLR) and feature-space MLLR transforms were applied.
- A GMM-HMM model was composed of context-dependent crossword triphone HMMs with a left-to-right, three-state topology. The model has a total of 7,969 senone states and uses as acoustic input PLP features that have been transformed by LDA. Semi-tied covariances (STCs) are used in the GMMs to model the transformed features, and BMMI [46] was used to train the model discriminatively.
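The "nine frames of MFCCs" input described above is produced by splicing each frame with its neighbors before the LDA projection. A minimal sketch (the function name and edge-replication padding are illustrative choices):

```python
import numpy as np

def splice(feats, context=4):
    """Stack each frame with its +/-`context` neighbors (nine frames for
    context=4), the usual windowing step before an LDA projection.

    feats: (T, D) feature matrix -> returns (T, (2*context+1)*D)
    """
    T, _ = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])
```

LDA then projects the spliced vector to a lower dimension; estimating that projection requires class labels (e.g., clustered HMM states), which is why it is trained as part of the GMM-HMM build.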

Conclusion

**SUMMARY OF THE MAIN RESULTS FOR DBN-DNN ACOUSTIC MODELS ON LVCSR TASKS**

- Table 3 summarizes the acoustic modeling results described above. It shows that DNN-HMMs consistently outperform GMM-HMMs that are trained on the same amount of data, sometimes by a large margin.
- The discriminative pretraining is stopped after a single epoch instead of multiple epochs as reported in [45]. It has also been found effective for the architectures called "deep convex network" [51] and "deep stacking network" [52], where pretraining is accomplished by convex optimization.
- Purely discriminative training from random initial weights works much better than had been thought, provided the scales of the initial weights are set carefully.

[TABLE 3] A comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large vocabulary tasks.
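The WER figures compared in Table 3 are computed by Levenshtein alignment of the hypothesis against the reference word sequence. A minimal self-contained implementation:

```python
def wer(ref, hyp):
    """Word error rate in percent: (substitutions + deletions + insertions)
    divided by the number of reference words, via edit-distance alignment."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return 100.0 * d[len(r)][len(h)] / len(r)
```

For example, a hypothesis that drops one of three reference words scores 33.3% WER.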


Reference

- J. Baker, L. Deng, J. Glass, S. Khudanpur, Chin Hui Lee, N. Morgan, and D. O’Shaughnessy, “Developments and directions in speech recognition and understanding, part 1,” IEEE Signal Processing Mag., vol. 26, no. 3, pp. 75–80, May 2009.
- S. Furui, Digital Speech Processing, Synthesis, and Recognition. New York: Marcel Dekker, 2000.
- B. H. Juang, S. Levinson, and M. Sondhi, “Maximum likelihood estimation for multivariate mixture observations of Markov chains,” IEEE Trans. Inform. Theory, vol. 32, no. 2, pp. 307–309, 1986.
- H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738–1752, 1990.
- S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoust., Speech, Signal, Processing, vol. 29, no. 2, pp. 254–272, 1981.
- S. Young, “Large vocabulary continuous speech recognition: A review,” IEEE Signal Processing Mag., vol. 13, no. 5, pp. 45–57, 1996.
- L. Bahl, P. Brown, P. de Souza, and R. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Proc. ICASSP, 1986, pp. 49–52.
- H. Hermansky, D. P. W. Ellis, and S. Sharma, “Tandem connectionist feature extraction for conventional HMM systems,” in Proc. ICASSP. Los Alamitos, CA: IEEE Computer Society, 2000, vol. 3, pp. 1635–1638.
- H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Norwell, MA: Kluwer, 1993.
- L. Deng, “Computational models for speech production,” in Computational Models of Speech Pattern Processing, K. M. Ponting, Ed. New York: SpringerVerlag, 1999, pp. 199–213.
- L. Deng, “Switching dynamic system models for speech articulation and acoustics,” in Mathematical Foundations of Speech and Language Processing, M. Johnson, S. P. Khudanpur, M. Ostendorf, and R. Rosenfeld, Eds. New York: Springer-Verlag, 2003, pp. 115–134.
- A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.
- A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14–22, Jan. 2012.
- D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
- X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. AISTATS, 2010, pp. 249–256.
- D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep, big, simple neural nets for handwritten digit recognition,” Neural Comput., vol. 22, no. 12, pp. 3207–3220, 2010.
- G. E. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
- H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, “An empirical evaluation of deep architectures on problems with many factors of variation,” in Proc. 24th Int. Conf. Machine Learning, 2007, pp. 473–480.
- J. Pearl, Probabilistic Inference in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988.
- G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Comput., vol. 14, pp. 1771–1800, 2002.
- G. E. Hinton, “A practical guide to training restricted Boltzmann machines,” Tech. Rep. UTML TR 2010-003, Dept. Comput. Sci., Univ. Toronto, 2010.
- G. E. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
- T. N. Sainath, B. Ramabhadran, and M. Picheny, “An exploration of large vocabulary tools for small vocabulary phonetic recognition,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 2009, pp. 359–364.
- A. Mohamed, T. N. Sainath, G. E. Dahl, B. Ramabhadran, G. E. Hinton, and M. Picheny, “Deep belief networks using discriminative features for phone recognition,” in Proc. ICASSP, 2011, pp. 5060–5063.
- A. Mohamed, G. Hinton, and G. Penn, “Understanding how deep belief networks perform acoustic modelling,” in Proc. ICASSP, 2012, pp. 4273–4276.
- Y. Hifny and S. Renals, “Speech recognition using augmented conditional random fields,” IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp. 354–365, 2009.
- A. Robinson, “An application of recurrent nets to phone probability estimation,” IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
- J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone models,” in Proc. ICASSP, 1998, pp. 409–412.
- L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp. 445–448.
- A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple classifiers for speech recognition,” in Proc. ICSLP, 1998.
- A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp. 2846–2849.
- T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky, “Exemplar-based sparse representation features: From TIMIT to LVCSR,” IEEE Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598–2613, Nov. 2011.
- G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition with the mean-covariance restricted Boltzmann machine,” in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp. 469–477.
- O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in Proc. ICASSP, 2012, pp. 4277–4280.
- X. He, L. Deng, and W. Chou, “Discriminative learning in sequential pattern recognition—A unifying review for optimization-oriented speech recognition,” IEEE Signal Processing Mag., vol. 25, no. 5, pp. 14–36, 2008.
- Y. Bengio, R. De Mori, G. Flammia, and F. Kompe, “Global optimization of a neural network—Hidden Markov model hybrid,” in Proc. EuroSpeech, 1991.
- B. Kingsbury, “Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling,” in Proc. ICASSP, 2009, pp. 3761–3764.
- R. Prabhavalkar and E. Fosler-Lussier, “Backpropagation training for multilayer conditional random field based phone recognition,” in Proc. ICASSP, 2010, pp. 5534–5537.
- H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for audio classification using convolutional deep belief networks,” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009, pp. 1096–1104.
- L. Deng, D. Yu, and A. Acero, “Structured speech modeling,” IEEE Trans. Audio Speech Lang. Processing, vol. 14, no. 5, pp. 1492–1504, 2006.
- H. Zen, M. Gales, Y. Nankaku, and K. Tokuda, “Product of experts for statistical parametric speech synthesis,” IEEE Trans. Audio Speech and Lang. Processing, vol. 20, no. 3, pp. 794–805, Mar. 2012.
- G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pretrained deep neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.
- F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. Interspeech, 2011, pp. 437–440.
- D. Yu, L. Deng, and G. Dahl, “Roles of pretraining and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition,” in Proc. NIPS Workshop Deep Learning and Unsupervised Feature Learning, 2010.
- F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in Proc. IEEE ASRU, 2011, pp. 24–29.
- D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, “Boosted MMI for model and feature-space discriminative training,” in Proc. ICASSP, 2008, pp. 4057–4060.
- N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “An application of pretrained deep neural networks to large vocabulary speech recognition,” submitted for publication.
- G. Zweig, P. Nguyen, D. V. Compernolle, K. Demuynck, L. Atlas, P. Clark, G. Sell, M. Wang, F. Sha, H. Hermansky, D. Karakos, A. Jansen, S. Thomas, G. S. V. S. Sivaram, S. Bowman, and J. Kao, “Speech recognition with segmental conditional random fields: A summary of the JHU CLSP 2010 summer workshop,” in Proc. ICASSP, 2011, pp. 5044–5047.
- V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on CPUs,” in Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011 [Online]. Available: http://research.google.com/pubs/archive/37631.pdf
- T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Improvements in using deep belief networks for large vocabulary continuous speech recognition,” Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech. Rep., Feb. 2011.
- L. Deng and D. Yu, “Deep convex network: A scalable architecture for speech pattern classification,” in Proc. Interspeech, 2011, pp. 2285–2288.
- L. Deng, D. Yu, and J. Platt, “Scalable stacking and learning for building deep architectures,” in Proc. ICASSP, 2012, pp. 2133–2136.
- D. Yu, L. Deng, G. Li, and F. Seide, “Discriminative pretraining of deep neural networks,” U.S. Patent Filing, Nov. 2011.
- P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion,” J. Mach. Learn. Res., vol. 11, no. 11, pp. 3371–3408, 2010.
- S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive autoencoders: Explicit invariance during feature extraction,” in Proc. 28th Int. Conf. Machine Learning, 2011, pp. 833–840.
- C. Plahl, T. N. Sainath, B. Ramabhadran, and D. Nahamoo, “Improved pretraining of deep belief networks using sparse encoding symmetric machines,” in Proc. ICASSP, 2012, pp. 4165–4168.
- B. Hutchinson, L. Deng, and D. Yu, “A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition,” in Proc. ICASSP, 2012, pp. 4805–4808.
- Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, “On optimization methods for deep learning,” in Proc. 28th Int. Conf. Machine Learning, 2011, pp. 265–272.
- J. Martens, “Deep learning via Hessian-free optimization,” in Proc. 27th Int. Conf. Machine learning, 2010, pp. 735–742.
- N. Morgan, “Deep and wide: Multiple layers in automatic speech recognition,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, Jan. 2012, pp. 7–13.
- G. Sivaram and H. Hermansky, “Sparse multilayer perceptron for phoneme recognition,” IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, Jan. 2012, pp. 23–29.
- T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Auto-encoder bottleneck features using deep belief networks,” in Proc. ICASSP, 2012, pp. 4153–4156.
- N. Morgan, Q. Zhu, A. Stolcke, K. Sonmez, S. Sivadas, T. Shinozaki, M. Ostendorf, P. Jain, H. Hermansky, D. Ellis, G. Doddington, B. Chen, O. Cretin, H. Bourlard, and M. Athineos, “Pushing the envelope aside,” IEEE Signal Processing Mag., vol. 22, no. 5, pp. 81–88, Sept. 2005.
- O. Vinyals and S. V. Ravuri, “Comparing multilayer perceptron to deep belief network tandem features for robust ASR,” in Proc. ICASSP, 2011, pp. 4596–4599.
- D. Yu, S. Siniscalchi, L. Deng, and C. Lee, “Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition,” in Proc. ICASSP, 2012, pp. 4169–4172.
- L. Deng and D. Sun, “A statistical approach to automatic speech recognition using the atomic speech units constructed from overlapping articulatory features,” J. Acoust. Soc. Amer., vol. 85, no. 5, pp. 2702–2719, 1994.
- J. Sun and L. Deng, “An overlapping-feature based phonological model incorporating linguistic constraints: Applications to speech recognition,” J. Acoustic. Soc. Amer., vol. 111, no. 2, pp. 1086–1101, 2002.
- P. C. Woodland and D. Povey, “Large scale discriminative training of hidden Markov models for speech recognition,” Comput Speech Lang., vol. 16, no. 1, pp. 25–47, 2002.
- F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, “Probabilistic and bottle-neck features for LVCSR of meetings,” in Proc. ICASSP, 2007.
