Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, Nov. 2012.

Keywords:
deep neural networks, HMM states, acoustic modeling, speech recognition, temporal variability

Abstract:

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
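The hybrid scoring step the abstract alludes to is easy to state concretely: the DNN outputs a posterior p(state | acoustics), while the HMM decoder needs a likelihood p(acoustics | state), so the posteriors are divided by the state priors (Bayes' rule, up to a constant that does not affect decoding). A minimal numpy sketch of this standard conversion; the function name and toy numbers are ours, not the paper's:

```python
import numpy as np

def posteriors_to_scaled_loglikes(posteriors, state_priors, eps=1e-10):
    """Turn DNN state posteriors p(s|x) into scaled log-likelihoods
    log p(x|s) + const for HMM decoding, via p(x|s) ~ p(s|x) / p(s)."""
    log_post = np.log(np.maximum(posteriors, eps))    # (frames, states)
    log_prior = np.log(np.maximum(state_priors, eps))  # (states,)
    return log_post - log_prior

# Toy usage: 3 frames, 4 HMM states.
post = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.2, 0.5, 0.2, 0.1],
                 [0.1, 0.1, 0.2, 0.6]])
priors = np.array([0.4, 0.3, 0.2, 0.1])
print(posteriors_to_scaled_loglikes(post, priors))
```

In practice the priors are typically estimated from the state frequencies of a forced alignment of the training data.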


Introduction
  • New machine learning algorithms can lead to significant advances in automatic speech recognition (ASR).
  • Neither the hardware nor the learning algorithms were adequate at the time for training neural networks with many hidden layers on large amounts of data.
  • Typical acoustic inputs were perceptual linear predictive coefficients (PLPs) [4] computed from the raw waveform, together with their first- and second-order temporal differences [5] (a delta-feature sketch follows this list).
  • The main practical contribution of neural networks at that time was to provide extra features in tandem systems: features designed to discard information irrelevant for discrimination and to express the remaining information in a form that facilitates discrimination with GMM-HMMs.
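For concreteness, here is a minimal sketch of the first- and second-order temporal differences mentioned above, in the standard regression-window form; the window size, dimensions, and function name are our own illustration, not the paper's:

```python
import numpy as np

def deltas(feats, N=2):
    """Regression-window deltas over +/-N frames.
    feats: (T, D) array of static coefficients (e.g., PLPs or MFCCs)."""
    T = feats.shape[0]
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")  # repeat edges
    d = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return d / denom

# Static + first- and second-order differences, as in classic front ends.
static = np.random.randn(100, 13)        # stand-in for 13-dim PLPs
d1 = deltas(static)                      # delta
d2 = deltas(d1)                          # delta-delta
features = np.hstack([static, d1, d2])   # (100, 39)
```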
Highlights
  • New machine learning algorithms can lead to significant advances in automatic speech recognition (ASR)
  • Using DBN-DNNs to provide input features for GMM-HMM systems: the article describes a class of methods where neural networks are used to provide the feature vectors that the GMM in a GMM-HMM system is trained to model (a bottleneck-feature sketch follows this list).
  • We have described how three major speech research groups achieved these advances, as reported at meetings of the International Speech Communication Association (ISCA) and the IEEE.
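The tandem/bottleneck idea in the second highlight can be sketched as follows: train a network to predict HMM states, then truncate it at a narrow hidden layer and use that layer's activations as input features for a conventional GMM-HMM. The layer sizes, names, and random weights below are our own toy illustration, not the configuration used by any of the groups:

```python
import numpy as np

def bottleneck_features(x, weights):
    """Forward a (T, D) batch through the layers up to and including a
    narrow 'bottleneck' layer; return its activations as feature
    vectors for a GMM-HMM front end. weights is a list of (W, b)."""
    h = x
    for W, b in weights:
        h = np.tanh(h @ W + b)   # nonlinear hidden units
    return h                      # (T, bottleneck_dim) tandem features

# Toy shapes: 39-dim input -> 512 -> 512 -> 40-dim bottleneck.
rng = np.random.default_rng(0)
dims = [39, 512, 512, 40]
weights = [(rng.normal(0.0, 1.0 / np.sqrt(i), (i, o)), np.zeros(o))
           for i, o in zip(dims[:-1], dims[1:])]
feats = bottleneck_features(rng.normal(size=(100, 39)), weights)
```

In a real system the full network (bottleneck plus the layers after it) is first trained to predict HMM states, and only then truncated.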
Methods
  • Systems compared on TIMIT by phone error rate (PER), as reviewed in the paper (a toy PER computation follows this list):

    CD-HMM [26]
    Augmented conditional random fields [26]
    Randomly initialized recurrent neural nets [27]
    Bayesian triphone GMM-HMM [28]
    Monophone HTMs [29]
    Heterogeneous classifiers [30]
    Monophone randomly initialized DNNs (six layers) [13]
    Monophone DBN-DNNs (six layers) [13]
    Monophone DBN-DNNs with MMI training [31]
    Triphone GMM-HMMs DT w/ BMMI [32]
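All of the systems above are ranked by PER, which is an edit-distance metric over phone sequences. A self-contained sketch of the computation (our own helper, not code from the paper):

```python
def phone_error_rate(ref, hyp):
    """Percent PER: Levenshtein distance (substitutions + insertions
    + deletions) between reference and hypothesis phone sequences,
    divided by the reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return 100.0 * d[m][n] / m

print(phone_error_rate("sil k ae t sil".split(),
                       "sil k eh t sil".split()))  # -> 20.0
```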
Results
  • Recognition results were obtained first on TIMIT data and subsequently on a variety of LVCSR tasks.
  • This was used to create a new baseline system for which the input was nine frames of MFCCs that were transformed by LDA; the GMM-HMM model was composed of context-dependent crossword triphone HMMs that have a left-to-right, three-state topology.
  • SA training was performed, and decision tree clustering was used to obtain 17,552 triphone states.
  • The model has a total of 7,969 senone states and uses as acoustic input PLP features that have been transformed by LDA.
  • Semitied covariances (STCs) were used in the GMMs to model the transformed features, and BMMI [46] was used to train the model discriminatively.
  • During decoding, ML linear regression (MLLR) and feature-space MLLR (fMLLR) were applied (a frame-splicing and LDA sketch follows this list).
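Two recurring ingredients of these baselines are frame splicing (stacking each frame with its neighbors, e.g., nine frames of MFCCs) and an LDA projection of the spliced vector. A compact numpy/scipy sketch of both steps; the dimensions, regularizer, and function names are our own illustration:

```python
import numpy as np
from scipy.linalg import eigh

def splice(feats, context=4):
    """Stack each frame with +/-context neighbors (nine frames for
    context=4). feats: (T, D) -> (T, (2*context+1)*D)."""
    T = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

def lda_transform(x, labels, out_dim=40):
    """Estimate an LDA projection of spliced features to out_dim
    discriminative dimensions using class (e.g., HMM-state) labels."""
    mu = x.mean(axis=0)
    Sw = np.zeros((x.shape[1], x.shape[1]))  # within-class scatter
    Sb = np.zeros_like(Sw)                   # between-class scatter
    for c in np.unique(labels):
        xc = x[labels == c]
        mc = xc.mean(axis=0)
        Sw += (xc - mc).T @ (xc - mc)
        Sb += len(xc) * np.outer(mc - mu, mc - mu)
    Sw += 1e-4 * np.eye(Sw.shape[0])         # regularize for stability
    w, v = eigh(Sb, Sw)                      # generalized eigenproblem
    return v[:, ::-1][:, :out_dim]           # top out_dim directions

# Toy usage with random stand-ins for 13-dim MFCCs and state labels.
rng = np.random.default_rng(0)
mfccs = rng.normal(size=(200, 13))
labels = rng.integers(0, 5, size=200)
A = lda_transform(splice(mfccs), labels)     # (117, 40)
projected = splice(mfccs) @ A                # (200, 40)
```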
Conclusion
  • Summary of the main results for DBN-DNN acoustic models on LVCSR tasks:
  • Table 3 summarizes the acoustic modeling results described above (Table 3: a comparison of the percentage WERs using DNN-HMMs and GMM-HMMs on five different large-vocabulary tasks). It shows that DNN-HMMs consistently outperform GMM-HMMs that are trained on the same amount of data, sometimes by a large margin.
  • Stopping the discriminative pretraining after a single epoch, instead of the multiple epochs reported in [45], has been found effective; purely discriminative pretraining is also used in the architectures called "deep convex network" [51] and "deep stacking network" [52], where pretraining is accomplished by convex optimization.
  • Training from random initial weights works much better than had been thought, provided the scales of the initial weights are set carefully (a minimal initialization sketch follows this list).
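The last point, that random initialization works well "provided the scales of the initial weights are set carefully," is what scale-aware initializers such as the one in [15] formalize: pick the range so that activation and gradient variances stay roughly constant across layers. A minimal sketch; the layer sizes are illustrative only, not the paper's configuration:

```python
import numpy as np

def glorot_init(fan_in, fan_out, rng):
    """Uniform init on [-r, r] with r = sqrt(6 / (fan_in + fan_out)),
    in the spirit of [15], to keep per-layer variance roughly stable."""
    r = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_in, fan_out))

# Illustrative net: 429-dim input (11 spliced 39-dim frames),
# five hidden layers of 2,048 units, 183 softmax outputs.
rng = np.random.default_rng(0)
dims = [429] + [2048] * 5 + [183]
layers = [glorot_init(i, o, rng) for i, o in zip(dims[:-1], dims[1:])]
```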
References
  • [1] J. Baker, L. Deng, J. Glass, S. Khudanpur, C.-H. Lee, N. Morgan, and D. O'Shaughnessy, "Developments and directions in speech recognition and understanding, part 1," IEEE Signal Processing Mag., vol. 26, no. 3, pp. 75-80, May 2009.
  • [2] S. Furui, Digital Speech Processing, Synthesis, and Recognition. New York: Marcel Dekker, 2000.
  • [3] B. H. Juang, S. Levinson, and M. Sondhi, "Maximum likelihood estimation for multivariate mixture observations of Markov chains," IEEE Trans. Inform. Theory, vol. 32, no. 2, pp. 307-309, 1986.
  • [4] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738-1752, 1990.
  • [5] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 29, no. 2, pp. 254-272, 1981.
  • [6] S. Young, "Large vocabulary continuous speech recognition: A review," IEEE Signal Processing Mag., vol. 13, no. 5, pp. 45-57, 1996.
  • [7] L. Bahl, P. Brown, P. de Souza, and R. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. ICASSP, 1986, pp. 49-52.
  • [8] H. Hermansky, D. P. W. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," in Proc. ICASSP, 2000, vol. 3, pp. 1635-1638.
  • [9] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach. Norwell, MA: Kluwer, 1993.
  • [10] L. Deng, "Computational models for speech production," in Computational Models of Speech Pattern Processing, K. M. Ponting, Ed. New York: Springer-Verlag, 1999, pp. 199-213.
  • [11] L. Deng, "Switching dynamic system models for speech articulation and acoustics," in Mathematical Foundations of Speech and Language Processing, M. Johnson, S. P. Khudanpur, M. Ostendorf, and R. Rosenfeld, Eds. New York: Springer-Verlag, 2003, pp. 115-134.
  • [12] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications, 2009.
  • [13] A. Mohamed, G. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 14-22, Jan. 2012.
  • [14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
  • [15] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. AISTATS, 2010, pp. 249-256.
  • [16] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Comput., vol. 22, no. 12, pp. 3207-3220, 2010.
  • [17] G. E. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
  • [18] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in Proc. 24th Int. Conf. Machine Learning, 2007, pp. 473-480.
  • [19] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988.
  • [20] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Comput., vol. 14, pp. 1771-1800, 2002.
  • [21] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," Dept. Comput. Sci., Univ. Toronto, Tech. Rep. UTML TR 2010-003, 2010.
  • [22] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527-1554, 2006.
  • [23] T. N. Sainath, B. Ramabhadran, and M. Picheny, "An exploration of large vocabulary tools for small vocabulary phonetic recognition," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 2009, pp. 359-364.
  • [24] A. Mohamed, T. N. Sainath, G. E. Dahl, B. Ramabhadran, G. E. Hinton, and M. Picheny, "Deep belief networks using discriminative features for phone recognition," in Proc. ICASSP, 2011, pp. 5060-5063.
  • [25] A. Mohamed, G. Hinton, and G. Penn, "Understanding how deep belief networks perform acoustic modelling," in Proc. ICASSP, 2012, pp. 4273-4276.
  • [26] Y. Hifny and S. Renals, "Speech recognition using augmented conditional random fields," IEEE Trans. Audio Speech Lang. Processing, vol. 17, no. 2, pp. 354-365, 2009.
  • [27] A. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 298-305, 1994.
  • [28] J. Ming and F. J. Smith, "Improved phone recognition using Bayesian triphone models," in Proc. ICASSP, 1998, pp. 409-412.
  • [29] L. Deng and D. Yu, "Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition," in Proc. ICASSP, 2007, pp. 445-448.
  • [30] A. Halberstadt and J. Glass, "Heterogeneous measurements and multiple classifiers for speech recognition," in Proc. ICSLP, 1998.
  • [31] A. Mohamed, D. Yu, and L. Deng, "Investigation of full-sequence training of deep belief networks for speech recognition," in Proc. Interspeech, 2010, pp. 2846-2849.
  • [32] T. N. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky, "Exemplar-based sparse representation features: From TIMIT to LVCSR," IEEE Trans. Audio Speech Lang. Processing, vol. 19, no. 8, pp. 2598-2613, Nov. 2011.
  • [33] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, "Phone recognition with the mean-covariance restricted Boltzmann machine," in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2010, pp. 469-477.
  • [34] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in Proc. ICASSP, 2012, pp. 4277-4280.
  • [35] X. He, L. Deng, and W. Chou, "Discriminative learning in sequential pattern recognition: A unifying review for optimization-oriented speech recognition," IEEE Signal Processing Mag., vol. 25, no. 5, pp. 14-36, 2008.
  • [36] Y. Bengio, R. De Mori, G. Flammia, and F. Kompe, "Global optimization of a neural network-hidden Markov model hybrid," in Proc. EuroSpeech, 1991.
  • [37] B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in Proc. ICASSP, 2009, pp. 3761-3764.
  • [38] R. Prabhavalkar and E. Fosler-Lussier, "Backpropagation training for multilayer conditional random field based phone recognition," in Proc. ICASSP, 2010, pp. 5534-5537.
  • [39] H. Lee, P. Pham, Y. Largman, and A. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009, pp. 1096-1104.
  • [40] L. Deng, D. Yu, and A. Acero, "Structured speech modeling," IEEE Trans. Audio Speech Lang. Processing, vol. 14, no. 5, pp. 1492-1504, 2006.
  • [41] H. Zen, M. Gales, Y. Nankaku, and K. Tokuda, "Product of experts for statistical parametric speech synthesis," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 3, pp. 794-805, Mar. 2012.
  • [42] G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pretrained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 30-42, Jan. 2012.
  • [43] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proc. Interspeech, 2011, pp. 437-440.
  • [44] D. Yu, L. Deng, and G. Dahl, "Roles of pretraining and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition," in Proc. NIPS Workshop Deep Learning and Unsupervised Feature Learning, 2010.
  • [45] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. IEEE ASRU, 2011, pp. 24-29.
  • [46] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, "Boosted MMI for model and feature-space discriminative training," in Proc. ICASSP, 2008, pp. 4057-4060.
  • [47] N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, "An application of pretrained deep neural networks to large vocabulary speech recognition," submitted for publication.
  • [48] G. Zweig, P. Nguyen, D. V. Compernolle, K. Demuynck, L. Atlas, P. Clark, G. Sell, M. Wang, F. Sha, H. Hermansky, D. Karakos, A. Jansen, S. Thomas, G. S. V. S. Sivaram, S. Bowman, and J. Kao, "Speech recognition with segmental conditional random fields: A summary of the JHU CLSP 2010 summer workshop," in Proc. ICASSP, 2011, pp. 5044-5047.
  • [49] V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Proc. NIPS Workshop Deep Learning and Unsupervised Feature Learning, 2011. [Online]. Available: http://research.google.com/pubs/archive/37631.pdf
  • [50] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, "Improvements in using deep belief networks for large vocabulary continuous speech recognition," Speech and Language Algorithm Group, IBM, Yorktown Heights, NY, Tech. Rep. UTML TR 2010-003, Feb. 2011.
  • [51] L. Deng and D. Yu, "Deep convex network: A scalable architecture for speech pattern classification," in Proc. Interspeech, 2011, pp. 2285-2288.
  • [52] L. Deng, D. Yu, and J. Platt, "Scalable stacking and learning for building deep architectures," in Proc. ICASSP, 2012, pp. 2133-2136.
  • [53] D. Yu, L. Deng, G. Li, and F. Seide, "Discriminative pretraining of deep neural networks," U.S. Patent Filing, Nov. 2011.
  • [54] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, pp. 3371-3408, 2010.
  • [55] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, "Contractive autoencoders: Explicit invariance during feature extraction," in Proc. 28th Int. Conf. Machine Learning, 2011, pp. 833-840.
  • [56] C. Plahl, T. N. Sainath, B. Ramabhadran, and D. Nahamoo, "Improved pretraining of deep belief networks using sparse encoding symmetric machines," in Proc. ICASSP, 2012, pp. 4165-4168.
  • [57] B. Hutchinson, L. Deng, and D. Yu, "A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition," in Proc. ICASSP, 2012, pp. 4805-4808.
  • [58] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, "On optimization methods for deep learning," in Proc. 28th Int. Conf. Machine Learning, 2011, pp. 265-272.
  • [59] J. Martens, "Deep learning via Hessian-free optimization," in Proc. 27th Int. Conf. Machine Learning, 2010, pp. 735-742.
  • [60] N. Morgan, "Deep and wide: Multiple layers in automatic speech recognition," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 7-13, Jan. 2012.
  • [61] G. Sivaram and H. Hermansky, "Sparse multilayer perceptron for phoneme recognition," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 1, pp. 23-29, Jan. 2012.
  • [62] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, "Auto-encoder bottleneck features using deep belief networks," in Proc. ICASSP, 2012, pp. 4153-4156.
  • [63] N. Morgan, Q. Zhu, A. Stolcke, K. Sonmez, S. Sivadas, T. Shinozaki, M. Ostendorf, P. Jain, H. Hermansky, D. Ellis, G. Doddington, B. Chen, O. Cretin, H. Bourlard, and M. Athineos, "Pushing the envelope aside," IEEE Signal Processing Mag., vol. 22, no. 5, pp. 81-88, Sept. 2005.
  • [64] O. Vinyals and S. V. Ravuri, "Comparing multilayer perceptron to deep belief network tandem features for robust ASR," in Proc. ICASSP, 2011, pp. 4596-4599.
  • [65] D. Yu, S. Siniscalchi, L. Deng, and C. Lee, "Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition," in Proc. ICASSP, 2012, pp. 4169-4172.
  • [66] L. Deng and D. Sun, "A statistical approach to automatic speech recognition using the atomic speech units constructed from overlapping articulatory features," J. Acoust. Soc. Amer., vol. 95, no. 5, pp. 2702-2719, 1994.
  • [67] J. Sun and L. Deng, "An overlapping-feature based phonological model incorporating linguistic constraints: Applications to speech recognition," J. Acoust. Soc. Amer., vol. 111, no. 2, pp. 1086-1101, 2002.
  • [68] P. C. Woodland and D. Povey, "Large scale discriminative training of hidden Markov models for speech recognition," Comput. Speech Lang., vol. 16, no. 1, pp. 25-47, 2002.
  • [69] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, "Probabilistic and bottle-neck features for LVCSR of meetings," in Proc. ICASSP, 2007.