Speech recognition with deep recurrent neural networks

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649. arXiv: abs/1303.5778.

Cited by: 4913
Keywords:
speech recognition, connectionist temporal classification, deep recurrent neural networks, end-to-end training methods, long short-term memory RNN architecture

Abstract:

Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

Introduction
  • Neural networks have a long history in speech recognition, usually in combination with hidden Markov models [1, 2].
  • The combination of Long Short-term Memory [11], an RNN architecture with an improved memory, with end-to-end training has proved especially effective for cursive handwriting recognition [12, 13].
  • However, it has so far made little impact on speech recognition.
Highlights
  • Neural networks have a long history in speech recognition, usually in combination with hidden Markov models [1, 2].
  • The combination of Long Short-term Memory [11], a recurrent neural network architecture with an improved memory, with end-to-end training has proved especially effective for cursive handwriting recognition [12, 13].
  • The question that inspired this paper was whether recurrent neural networks could benefit from depth in space; that is, from stacking multiple recurrent hidden layers on top of each other, just as feedforward layers are stacked in conventional deep networks. To answer this question we introduce deep Long Short-term Memory recurrent neural networks and assess their potential for speech recognition (see the sketch after this list).
  • As far as we are aware this is the first time deep Long Short-term Memory has been applied to speech recognition, and we find that it yields a dramatic improvement over single-layer Long Short-term Memory.
  • We have shown that the combination of deep, bidirectional Long Short-term Memory recurrent neural networks with end-to-end training and weight noise gives state-of-the-art results in phoneme recognition on the TIMIT database.
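The stacked architecture described above can be made concrete. The following is a minimal sketch in PyTorch, a hypothetical re-implementation: the paper predates PyTorch and used its own RNN code. The input size of 123 and the 62-way output (61 phonemes plus the CTC blank) follow the TIMIT setup described in the paper; the hidden size, layer count, class names, and dummy data are illustrative assumptions.

    import torch
    import torch.nn as nn

    class DeepBiLSTM(nn.Module):
        """Deep bidirectional LSTM for end-to-end phoneme recognition.

        Illustrative sketch only: depth 'in space' comes from stacking
        recurrent hidden layers (num_layers), just as feedforward layers
        are stacked in conventional deep networks.
        """
        def __init__(self, n_in=123, n_hidden=250, n_layers=3, n_classes=62):
            super().__init__()
            self.lstm = nn.LSTM(n_in, n_hidden, num_layers=n_layers,
                                bidirectional=True, batch_first=True)
            # Forward and backward outputs are concatenated: 2 * n_hidden.
            self.out = nn.Linear(2 * n_hidden, n_classes)

        def forward(self, x):          # x: (batch, time, 123)
            h, _ = self.lstm(x)
            return self.out(h)         # (batch, time, n_classes)

    model = DeepBiLSTM()
    ctc = nn.CTCLoss(blank=0)          # CTC copes with unknown alignment

    x = torch.randn(4, 300, 123)                 # dummy feature sequences
    y = torch.randint(1, 62, (4, 50))            # dummy phoneme targets
    log_probs = model(x).log_softmax(-1).transpose(0, 1)  # (time, batch, classes)
    loss = ctc(log_probs, y,
               torch.full((4,), 300, dtype=torch.long),
               torch.full((4,), 50, dtype=torch.long))
    loss.backward()

In the paper's terms, increasing num_layers is what turns a bidirectional LSTM into a deep bidirectional LSTM; CTC (or the transducer of [10]) supplies the end-to-end training signal over the unaligned phoneme sequence.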
Methods
  • Phoneme recognition experiments were performed on the TIMIT corpus [25].
  • Each input vector was of size 123.
  • The data were normalised so that every element of the input vectors had zero mean and unit variance over the training set.
  • All 61 phoneme labels were used during training and decoding, then mapped down to 39 classes for scoring [26] (see the sketch after this list).
  • Note that all experiments were run only once, so the variance due to random weight initialisation and weight noise is unknown.
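A minimal sketch of this preprocessing, assuming NumPy. The feature extraction itself is omitted, and the 61-to-39 folding table of [26] is shown with only a few illustrative entries rather than in full.

    import numpy as np

    # One (time, 123) array of acoustic features per training utterance
    # (dummy stand-ins here; the real features come from a filterbank).
    train_feats = [np.random.randn(np.random.randint(200, 400), 123)
                   for _ in range(8)]

    # Normalise so that every element of the input vectors has zero mean
    # and unit variance over the training set.
    stacked = np.concatenate(train_feats, axis=0)    # (total_frames, 123)
    mean, std = stacked.mean(axis=0), stacked.std(axis=0)

    def normalise(utt):
        # The same training-set statistics are applied to all data.
        return (utt - mean) / std

    # Train and decode with all 61 labels; fold to 39 classes for scoring.
    # Only a few entries of the standard folding [26] are shown here.
    FOLD_61_TO_39 = {"ao": "aa", "axr": "er", "zh": "sh"}

    def score_label(ph):
        return FOLD_61_TO_39.get(ph, ph)   # unfolded labels map to themselves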
Conclusion
  • The authors have shown that the combination of deep, bidirectional Long Short-term Memory RNNs with end-to-end training and weight noise gives state-of-the-art results in phoneme recognition on the TIMIT database (a sketch of the weight-noise step follows this list).
  • An obvious next step is to extend the system to large vocabulary speech recognition.
  • Another interesting direction would be to combine frequency-domain convolutional neural networks [27] with deep LSTM.
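Weight noise [22, 23, 24] regularises training by adding Gaussian noise to the network weights during training. Below is an illustrative PyTorch sketch of one such training step, not the authors' code; the default sigma of 0.075 matches the standard deviation reported in the paper, and training_step, loss_fn, and the optimiser wiring are assumptions.

    import torch

    def training_step(model, loss_fn, batch, targets, optimiser, sigma=0.075):
        """One update with weight noise: perturb the weights, take the
        gradient at the noisy point, restore the clean weights, then step."""
        noises = []
        with torch.no_grad():
            for p in model.parameters():
                eps = torch.randn_like(p) * sigma
                p.add_(eps)              # fresh noise every training step
                noises.append(eps)

        loss = loss_fn(model(batch), targets)   # forward at noisy weights
        optimiser.zero_grad()
        loss.backward()

        with torch.no_grad():
            for p, eps in zip(model.parameters(), noises):
                p.sub_(eps)              # remove the noise again

        optimiser.step()                 # update the underlying clean weights
        return loss.item()

Because the noise is redrawn each step and never accumulates in the stored weights, this acts as a regulariser rather than a random walk; the paper links it to minimum-description-length, i.e. variational, regularisation [23, 24].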
Tables
  • Table 1: TIMIT phoneme recognition results. 'Epochs' is the number of passes through the training set before convergence; 'PER' is the phoneme error rate on the core test set.
Contributions
  • Investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs.
  • Finds that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to the authors' knowledge is the best recorded score.
  • Introduces deep Long Short-term Memory RNNs and assesses their potential for speech recognition.
  • Presents an enhancement to a recently introduced end-to-end learning method that jointly trains two separate RNNs as acoustic and linguistic models.
  • Notes that, as far as the authors are aware, this is the first time deep LSTM has been applied to speech recognition, and finds that it yields a dramatic improvement over single-layer LSTM.
References
  • [1] H. A. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, 1994.
  • [2] Q. Zhu, B. Chen, N. Morgan, and A. Stolcke, "Tandem connectionist feature extraction for conversational speech recognition," in Machine Learning for Multimodal Interaction (MLMI'04), Springer-Verlag, Berlin, Heidelberg, 2005, pp. 223–231.
  • [3] A. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, Jan. 2012.
  • [4] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Nov. 2012.
  • [5] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
  • [6] O. Vinyals, S. Ravuri, and D. Povey, "Revisiting recurrent neural networks for robust ASR," in ICASSP, 2012.
  • [7] A. Maas, Q. Le, T. O'Neil, O. Vinyals, P. Nguyen, and A. Ng, "Recurrent neural networks for noise reduction in robust ASR," in INTERSPEECH, 2012.
  • [8] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ICML, Pittsburgh, USA, 2006.
  • [9] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, vol. 385, Springer, 2012.
  • [10] A. Graves, "Sequence transduction with recurrent neural networks," in ICML Representation Learning Workshop, 2012.
  • [11] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [12] A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, "Unconstrained online handwriting recognition with recurrent neural networks," in NIPS, 2008.
  • [13] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in NIPS, 2009.
  • [14] F. Gers, N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," Journal of Machine Learning Research, vol. 3, pp. 115–143, 2002.
  • [15] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.
  • [16] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5–6, pp. 602–610, 2005.
  • [17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," pp. 696–699, MIT Press, 1988.
  • [18] G. Zweig and P. Nguyen, "SCARF: A segmental CRF speech recognition system," Tech. Rep., Microsoft Research, 2009.
  • [19] A. W. Senior and A. J. Robinson, "Forward-backward retraining of recurrent neural networks," in NIPS, 1995, pp. 743–749.
  • [20] A. Mohamed, D. Yu, and L. Deng, "Investigation of full-sequence training of deep belief networks for speech recognition," in INTERSPEECH, 2010.
  • [21] M. Lehr and I. Shafran, "Discriminatively estimated joint acoustic, duration, and language model for speech recognition," in ICASSP, 2010, pp. 5542–5545.
  • [22] K.-C. Jim, C. L. Giles, and B. G. Horne, "An analysis of noise in recurrent neural networks: Convergence and generalization," IEEE Transactions on Neural Networks, vol. 7, no. 6, pp. 1424–1438, Nov. 1996.
  • [23] G. E. Hinton and D. van Camp, "Keeping the neural networks simple by minimizing the description length of the weights," in COLT, 1993, pp. 5–13.
  • [24] A. Graves, "Practical variational inference for neural networks," in NIPS, 2011, pp. 2348–2356.
  • [25] DARPA-ISTO, The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT), speech disc cd1-1.1 edition, 1990.
  • [26] K.-F. Lee and H.-W. Hon, "Speaker-independent phone recognition using hidden Markov models," IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.
  • [27] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in ICASSP, Mar. 2012, pp. 4277–4280.