End-to-End Attention-based Large Vocabulary Speech Recognition

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945-4949.

DOI: https://doi.org/10.1109/ICASSP.2016.7472618

Abstract:

Many of the current state-of-the-art Large Vocabulary Continuous Speech Recognition Systems (LVCSR) are hybrids of neural networks and Hidden Markov Models (HMMs). Most of these systems contain separate components that deal with the acoustic modelling, language modelling and sequence decoding. We investigate a more direct approach in which the HMM is replaced with a Recurrent Neural Network (RNN) that performs sequence prediction directly at the character level. …

Introduction
  • Deep neural networks have become a popular replacement for Gaussian Mixture Models for acoustic modeling in state-of-the-art large vocabulary speech recognition systems [4].
  • Both of these CTC-based models were trained to predict sequences of characters and were later combined with a word-level language model.
  • While the results of CTC-based systems are promising, alternative methods for neural sequence modelling have recently been proposed.
Highlights
  • Deep neural networks have become a popular replacement for Gaussian Mixture Models for acoustic modeling in state-of-the-art large vocabulary speech recognition systems [4]
  • We investigate the application of an Attention-based Recurrent Sequence Generator (ARSG) as part of an end-to-end Large Vocabulary Continuous Speech Recognition system
  • We show how a character-level Attention-based Recurrent Sequence Generator and an n-gram word-level language model can be combined into a complete system using the Weighted Finite State Transducers (WFST) framework
  • The attention mechanism selects the temporal locations over the input sequence that should be used to update the hidden state of the Recurrent Neural Network and to predict the output value (a minimal sketch follows this list)
  • We investigate how a character-level Attention-based Recurrent Sequence Generator can be integrated with a word-level language model
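A minimal numpy sketch of one content-based attention step as described in the highlights above (the names, dimensions, and plain tanh update are illustrative assumptions, not the authors' exact parameterization): the decoder scores every encoded frame against its current hidden state, normalizes the scores into attention weights, and uses the resulting weighted sum of frames to update its state and predict the next character.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(encoded, s_prev, W_score, W_state, W_out):
    """One illustrative decoder step with content-based attention.

    encoded : (T, d_enc) encoder outputs, one row per input frame
    s_prev  : (d_dec,)   previous decoder hidden state
    """
    scores = encoded @ W_score @ s_prev            # (T,) relevance of each frame
    alpha = softmax(scores)                        # attention weights over frames
    context = alpha @ encoded                      # (d_enc,) weighted sum of frames

    # Update the decoder state from the previous state and the context
    # (a plain tanh cell stands in for the gated unit used in practice).
    s_new = np.tanh(W_state @ np.concatenate([s_prev, context]))
    char_probs = softmax(W_out @ s_new)            # distribution over characters
    return s_new, char_probs, alpha

# Toy dimensions and random weights, just to show the shapes involved.
T, d_enc, d_dec, n_chars = 50, 8, 6, 30
rng = np.random.default_rng(0)
encoded = rng.normal(size=(T, d_enc))
W_score = 0.1 * rng.normal(size=(d_enc, d_dec))
W_state = 0.1 * rng.normal(size=(d_dec, d_dec + d_enc))
W_out = 0.1 * rng.normal(size=(n_chars, d_dec))
s, char_probs, alpha = attention_step(encoded, np.zeros(d_dec), W_score, W_state, W_out)
```

In the ARSG of the paper the scoring function is also location-aware, conditioning on the previous attention weights; the sketch shows only the content-based selection of temporal locations.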
Results
  • The authors show how a character-level ARSG and an n-gram word-level language model can be combined into a complete system using the Weighted Finite State Transducers (WFST) framework.
  • The proposed system is an encoder-decoder [15, 16] network that can map sequences of speech frames to sequences of characters.
  • It consists of a deep bi-directional RNN that encodes the speech signal into a suitable feature representation and of an Attention-based Recurrent Sequence Generator that decodes this representation into a sequence of characters.
  • The decoder network in the system is an Attention-based Recurrent Sequence Generator (ARSG) [6, 11].
  • The attention mechanism selects the temporal locations over the input sequence that should be used to update the hidden state of the RNN and to predict the output value.
  • The authors use the Weighted Finite State Transducer (WFST) framework [18, 19] to build a character-level language model from a word-level one.
  • The decoding algorithm uses a left-to-right beam search [16] to find the transcript y that minimizes a cost L combining the encoder-decoder (ED) and the language model (LM) outputs: $L = -\log p_{ED}(y|x) - \beta \log p_{LM}(y) - \gamma T$ (Eq. 4); a toy scoring example follows this list.
  • Unlike CTC, RNN transduction systems can generate output sequences that are longer than the input.
  • RNN Transducers have led to state-of-the-art results in phoneme recognition on the TIMIT dataset [17], which were recently matched by an ARSG network [11].
  • Reference [14] proposes a character-based encoder-decoder network similar to ours that employs pooling between BiRNN layers.
  • The improvement from adding an external language model is much larger for CTC-based systems.
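The following toy sketch shows how Eq. (4) ranks candidate transcripts by combining the two models' log-probabilities; the numbers and the values of β and γ are made up for illustration, not the settings tuned in the paper.

```python
def combined_cost(log_p_ed, log_p_lm, length, beta=1.0, gamma=0.1):
    """Cost from Eq. (4): L = -log p_ED(y|x) - beta*log p_LM(y) - gamma*T.

    log_p_ed : encoder-decoder log-probability of the transcript given the audio
    log_p_lm : language-model log-probability of the transcript
    length   : number of output characters T; the gamma*T bonus counteracts
               the bias of both models towards short hypotheses
    """
    return -log_p_ed - beta * log_p_lm - gamma * length

# Rank two candidate transcripts from a beam (toy log-probabilities).
candidates = [
    ("the cat sat", -12.3, -9.1),    # slightly worse acoustics, fluent text
    ("the cat sad", -11.8, -14.0),   # slightly better acoustics, unlikely text
]
best = min(candidates, key=lambda c: combined_cost(c[1], c[2], len(c[0])))
print(best[0])  # -> "the cat sat": the language-model term tips the balance
```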
Conclusion
  • The authors propose pooling over time between BiRNN layers to reduce the length of the encoded input sequence (see the sketch after this list).
  • The authors propose to use windowing during training to ensure that the decoder network performs a constant number of operations for each output character.
  • It is possible to use the states of a pre-trained language model as additional inputs to an ARSG, possibly reducing the incentive to memorize the training prompts.
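A minimal sketch of the pooling idea from the first conclusion above (illustrative only; in the paper the pooling sits between specific BiRNN layers, and the windowing trick mentioned next is a separate mechanism not shown here): keeping every second frame after a layer halves the number of temporal locations the attention mechanism must scan.

```python
import numpy as np

def pool_over_time(hidden, stride=2):
    """Shorten a layer's output sequence by keeping every `stride`-th frame.

    hidden : (T, d) outputs of one BiRNN layer
    returns: (ceil(T / stride), d) shorter sequence fed to the next layer
    """
    return hidden[::stride]

# A 2000-frame utterance shrinks 4x after pooling below two of the layers,
# so the decoder's attention scans 500 locations instead of 2000.
T, d = 2000, 16
layer1_out = np.random.randn(T, d)
layer2_in = pool_over_time(layer1_out)                 # 1000 frames
layer2_out = np.random.randn(len(layer2_in), d)        # stand-in for the next BiRNN
layer3_in = pool_over_time(layer2_out)                 # 500 frames
print(layer1_out.shape, layer2_in.shape, layer3_in.shape)
```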
Tables
  • Table 1: Character Error Rate (CER) and Word Error Rate (WER)
Funding
  • The experiments were conducted using the Theano [26, 27] and Blocks and Fuel [28] libraries. The authors would like to acknowledge the support of the following agencies for research funding and computing support: National Science Center (Poland) grant 2014/15/D/ST6/04402, NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR.
References
  • [1] A. Y. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng, "First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs," arXiv preprint arXiv:1408.2873, 2014.
  • [2] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al., "Deep Speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
  • [3] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006.
  • [4] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
  • [5] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in ICML, 2014.
  • [6] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in International Conference on Learning Representations, 2015.
  • [7] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015.
  • [8] A. Graves, "Generating sequences with recurrent neural networks," arXiv preprint arXiv:1308.0850, 2013.
  • [9] V. Mnih, N. Heess, A. Graves, et al., "Recurrent models of visual attention," in NIPS, 2014, pp. 2204–2212.
  • [10] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, "End-to-end continuous speech recognition using attention-based recurrent NN: First results," arXiv preprint arXiv:1412.1602, 2014.
  • [11] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in NIPS, 2015.
  • [12] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber, "A clockwork RNN," in ICML, 2014.
  • [13] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Gated feedback recurrent neural networks," in ICML, 2015.
  • [14] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell," arXiv preprint arXiv:1508.01211, 2015.
  • [15] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in EMNLP, 2014.
  • [16] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in NIPS, 2014.
  • [17] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in ICASSP, 2013, pp. 6645–6649.
  • [18] M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Computer Speech & Language, vol. 16, no. 1, pp. 69–88, 2002.
  • [19] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, "OpenFst: A general and efficient weighted finite-state transducer library," in Implementation and Application of Automata, Lecture Notes in Computer Science 4783, pp. 11–23, Springer, 2007.
  • [20] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," arXiv preprint arXiv:1507.08240, 2015.
  • [21] A. Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
  • [22] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "High-dimensional sequence transduction," in ICASSP, 2013, pp. 3178–3182.
  • [23] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
  • [24] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
  • [25] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, "On using monolingual corpora in neural machine translation," arXiv preprint arXiv:1503.03535, 2015.
  • [26] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: A CPU and GPU math expression compiler," in Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.
  • [27] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio, "Theano: New features and speed improvements," in Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2012.
  • [28] B. van Merrienboer, D. Bahdanau, V. Dumoulin, D. Serdyuk, D. Warde-Farley, J. Chorowski, and Y. Bengio, "Blocks and Fuel: Frameworks for deep learning," arXiv preprint arXiv:1506.00619, 2015.