Attention-Based Models for Speech Recognition

Annual Conference on Neural Information Processing Systems, 2015.


Abstract:

Recurrent sequence generators conditioned on input data through an attention mechanism have recently shown very good performance on a range of tasks including machine translation, handwriting synthesis [1,2] and image caption generation [3]. We extend the attention mechanism with features needed for speech recognition. We show that while an adaptation of the model used for machine translation reaches a competitive 18.7% phoneme error rate (PER) on the TIMIT phoneme recognition task, it can only be applied to utterances which are roughly as long as the ones it was trained on. We offer a qualitative explanation of this failure and propose a novel and generic method of adding location-awareness to the attention mechanism to alleviate this issue. The new method yields a model that is robust to long inputs and achieves 18% PER in single utterances and 20% in 10-times longer (repeated) utterances. Finally, we propose a change to the attention mechanism that prevents it from concentrating too much on single frames, which further reduces PER to the 17.6% level.

Introduction
  • Attention-based recurrent networks have been successfully applied to a wide variety of tasks, such as handwriting synthesis [1], machine translation [2], image caption generation [3] and visual object classification [4].
  • Such models iteratively process their input by selecting relevant content at every step.
  • For these reasons, speech recognition is an interesting testbed for developing new attention-based architectures capable of processing long and noisy inputs.
Highlights
  • Attention-based recurrent networks have been successfully applied to a wide variety of tasks, such as handwriting synthesis [1], machine translation [2], image caption generation [3] and visual object classification [4].
  • Compared to machine translation, speech recognition differs principally in requiring much longer input sequences, which introduces the challenge of distinguishing similar speech fragments within a single utterance.
  • In the case of speech recognition, this type of location-based attention mechanism would have to predict the distance between consecutive phonemes using the previous state s_{i-1} alone, which we expect to be hard due to the large variance of this quantity. Because of these limitations of both content-based and location-based mechanisms, we argue that a hybrid attention mechanism is a natural candidate for speech recognition (see the sketch after this list).
  • We proposed and evaluated a novel end-to-end trainable speech recognition architecture based on a hybrid attention mechanism which combines both content and location information in order to select the position in the input sequence for decoding
  • The proposed attention mechanism can be used without modification in neural Turing machines, or, by using 2-D instead of 1-D convolution, to improve image caption generation [3].
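The following is a minimal NumPy sketch of one step of such a hybrid attention, as the mechanism is described in the paper: scores take the form e_{i,j} = w^T tanh(W s_{i-1} + V h_j + U f_{i,j}), where the location features f_i are obtained by convolving the previous alignment alpha_{i-1} with a bank of 1-D filters. The function name, shapes, and toy usage are illustrative assumptions, not the authors' Theano implementation.

```python
import numpy as np

def hybrid_attention_step(s_prev, h, alpha_prev, W, V, U, F, w, b):
    """One step of hybrid (content + location) attention.

    s_prev     : (n,)   previous decoder state        -- content cue
    h          : (L, m) encoder states, one per frame -- content cue
    alpha_prev : (L,)   previous alignment            -- location cue
    W : (n, d), V : (m, d), U : (k, d), w : (d,), b : (d,)
    F : (k, r)  bank of k 1-D convolution filters of width r
    """
    # Location features: convolve the previous alignment with each filter,
    # giving one k-dimensional feature vector per input position.
    f = np.stack([np.convolve(alpha_prev, F[i], mode="same")
                  for i in range(F.shape[0])], axis=1)        # (L, k)
    # Scores e_j = w^T tanh(W s_{i-1} + V h_j + U f_j + b).
    e = np.tanh(s_prev @ W + h @ V + f @ U + b) @ w           # (L,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                      # softmax over positions
    c = alpha @ h                                             # glimpse / context, (m,)
    return alpha, c

# Toy usage: 50 input frames, uniform initial alignment, random parameters.
rng = np.random.default_rng(0)
L, n, m, d, k, r = 50, 32, 64, 16, 8, 11
alpha, c = hybrid_attention_step(
    rng.normal(size=n), rng.normal(size=(L, m)), np.full(L, 1.0 / L),
    *(rng.normal(size=s) * 0.1 for s in
      [(n, d), (m, d), (k, d), (k, r), (d,), (d,)]))
print(alpha.shape, c.shape)  # (50,) (64,)
```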
Methods
  • The authors closely followed the procedure in [16]. All experiments were performed on the TIMIT corpus [19].
  • Networks were trained on the full 61-phone set extended with an extra “end-of-sequence” token that was appended to each target sequence.
  • A complication of ARSG models is that different subsets of parameters are reused different numbers of times: L times for the encoder parameters, LT times for the attention weights, and T times for all other parameters of the ARSG (L being the input length and T the output length).
  • This makes the scales of the derivatives w.r.t. different parameters vary significantly.
  • As shown in Fig. 2, decoding with a wider beam gives little to no benefit (a beam-search sketch follows this list).
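Below is a generic beam-search sketch of the decoding setup that Fig. 2 evaluates. The `decode_step(state, token)` interface, the `sos`/`eos` tokens, and the defaults are hypothetical; the point is that, because the ARSG aligns deterministically, decoding reduces to a plain beam search over output tokens, and per Fig. 2 a narrow beam already suffices.

```python
import numpy as np

def beam_search(decode_step, init_state, sos, eos, beam_width=10, max_len=100):
    """Generic beam search over output tokens.

    decode_step(state, token) -> (log_probs over vocab, new_state)  # assumed interface
    Returns the highest-scoring token sequence (including sos/eos).
    """
    beams = [(0.0, [sos], init_state)]          # (log-prob, tokens, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, toks, state in beams:
            log_probs, new_state = decode_step(state, toks[-1])
            # Expanding only the top beam_width tokens per hypothesis is enough.
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((logp + log_probs[tok], toks + [int(tok)], new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates:
            if cand[1][-1] == eos:
                finished.append(cand)           # hypothesis is complete
            else:
                beams.append(cand)
            if len(beams) == beam_width:
                break
        if not beams:                           # every hypothesis ended in eos
            break
    return max(finished + beams, key=lambda c: c[0])[1]
```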
Results
  • Models compared (Table 1): Baseline, Baseline + Conv. Features, Baseline + Conv. Features + Smooth Focus, and the RNN Transducer [16].
  • All the models achieved competitive PERs. With the convolutional features, the authors see a 3.7% relative improvement over the baseline, rising to 5.9% once smoothing is added (a worked computation follows this list).
  • The baseline model learned to align properly.
  • An alignment produced by the baseline model on a sequence with repeated phonemes is presented in Fig. 3, which demonstrates that the baseline model is not confused by short-range repetitions.
  • The alignments produced by the other models were visually very similar.
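For concreteness, the relative figures can be reproduced from the test-set PERs reported in Table 1 of the paper (18.7% baseline, 18.0% with convolutional features, 17.6% with smoothing added); quoting those numbers here is an assumption the reader should verify against the table itself.

```python
# Test-set PERs (%), assumed from Table 1 of the paper.
baseline, conv, smooth = 18.7, 18.0, 17.6

print((baseline - conv) / baseline)    # 0.0374... -> ~3.7% relative improvement
print((baseline - smooth) / baseline)  # 0.0588... -> ~5.9% relative improvement
```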
Conclusion
  • The authors proposed and evaluated a novel end-to-end trainable speech recognition architecture based on a hybrid attention mechanism which combines both content and location information in order to select the position in the input sequence for decoding.
  • This work has contributed two novel ideas for attention mechanisms: a better normalization approach yielding smoother alignments (sketched after this list) and a generic principle for extracting and using features from the previous alignments.
  • Both of these can potentially be applied beyond speech recognition.
  • The proposed attention mechanism can be used without modification in neural Turing machines, or, by using 2-D instead of 1-D convolution, to improve image caption generation [3].
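A minimal sketch of the "smoother alignments" idea as described in the paper: replacing the exponential in the softmax normalization with a logistic sigmoid yields a less peaked alignment. The toy scores are illustrative; this is a reading of the proposal, not the authors' exact code.

```python
import numpy as np

def softmax_alignment(e):
    """Standard normalization: the exponential makes the alignment peaky."""
    a = np.exp(e - e.max())
    return a / a.sum()

def smooth_alignment(e):
    """'Smooth focus': use the logistic sigmoid instead of exp before
    normalizing, which yields a less peaked (smoother) alignment."""
    a = 1.0 / (1.0 + np.exp(-e))
    return a / a.sum()

e = np.array([2.0, 1.0, 0.0, -1.0])
print(softmax_alignment(e))  # ~[0.64, 0.24, 0.09, 0.03] -- concentrated
print(smooth_alignment(e))   # ~[0.37, 0.31, 0.21, 0.11] -- noticeably flatter
```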
Tables
  • Table1: Phoneme error rates (PER). The bold-faced PER corresponds to the best error rate with an attention-based recurrent sequence generator (ARSG) incorporating convolutional attention features and a smooth focus
Related work
  • Speech recognizers based on the connectionist temporal classification (CTC, [13]) and its extension, the RNN Transducer [14], are the closest to the ARSG model considered in this paper. They follow earlier work on end-to-end trainable deep learning over sequences with gradient signals flowing through the alignment process [15]. They have been shown to perform well on the phoneme recognition task [16]. Furthermore, the CTC was recently found to be able to directly transcribe text from speech without any intermediate phonetic representation [17].

    [Figure 2: Phoneme error rate (%) vs. beam search width for the Baseline, Conv. Features, and Smooth Focus models, on the dev and test sets.]

    The considered ARSG differs from both the CTC and the RNN Transducer in two ways. First, whereas the attention mechanism deterministically aligns the input and the output sequences, the CTC and RNN Transducer treat the alignment as a latent random variable over which MAP (maximum a posteriori) inference is performed. The deterministic nature of the ARSG's alignment mechanism allows the beam search procedure to be simpler. Furthermore, we empirically observe that a much smaller beam width can be used with the deterministic mechanism, which allows faster decoding (see Sec. 4 and Fig. 2). Second, the alignment mechanism of both the CTC and the RNN Transducer is constrained to be "monotonic" to keep marginalization of the alignment tractable. In contrast, the proposed attention mechanism can produce non-monotonic alignments, which makes it suitable for a wider variety of tasks beyond speech recognition (a small illustration follows).
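A small illustration of the distinction: the helper below (hypothetical, for exposition only) checks whether the argmax path of a soft alignment is monotonic, the property that CTC and the RNN Transducer enforce by construction and that the attention mechanism does not.

```python
import numpy as np

def is_monotonic(alpha):
    """alpha: (T, L) attention weights, one row per output step.
    True iff the argmax input position never moves backwards, i.e. the
    alignment path is monotonic, as CTC / RNN Transducer require."""
    path = alpha.argmax(axis=1)
    return bool(np.all(np.diff(path) >= 0))

# Attention over 3 output steps and 4 input frames:
mono     = np.array([[.7, .2, .1, .0], [.1, .6, .2, .1], [.0, .1, .2, .7]])
non_mono = np.array([[.1, .6, .2, .1], [.7, .2, .1, .0], [.0, .1, .2, .7]])
print(is_monotonic(mono))      # True  -- representable by CTC-style models
print(is_monotonic(non_mono))  # False -- only the attention mechanism allows this
```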
Funding
  • All experiments were conducted using the Theano [27, 28], PyLearn2 [29], and Blocks [30] libraries. The authors would like to acknowledge the support of the following agencies for research funding and computing support: National Science Center (Poland) grant Sonata 8 2014/15/D/ST6/04402, NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR.
References
  • [1] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, August 2013.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proc. of the 3rd ICLR, 2015. arXiv:1409.0473.
  • [3] K. Xu, J. Ba, R. Kiros, et al. Show, attend and tell: Neural image caption generation with visual attention. In Proc. of the 32nd ICML, 2015. arXiv:1502.03044.
  • [4] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Proc. of the 27th NIPS, 2014. arXiv:1406.6247.
  • [5] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio. End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv:1412.1602, December 2014.
  • [6] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv:1410.5401, 2014.
  • [7] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv:1410.3916, 2014.
  • [8] M. Gales and S. Young. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3):195–304, January 2007.
  • [9] G. Hinton, L. Deng, D. Yu, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, November 2012.
  • [10] A. Hannun, C. Case, J. Casper, et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv:1412.5567, 2014.
  • [11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [12] K. Cho, B. van Merrienboer, C. Gulcehre, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP, October 2014.
  • [13] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proc. of the 23rd ICML, 2006.
  • [14] A. Graves. Sequence transduction with recurrent neural networks. In Proc. of the 29th ICML, 2012.
  • [15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • [16] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proc. of ICASSP 2013, pages 6645–6649. IEEE, 2013.
  • [17] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proc. of the 31st ICML, 2014.
  • [18] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. Weakly supervised memory networks. arXiv:1503.08895, 2015.
  • [19] J. S. Garofolo, L. F. Lamel, W. M. Fisher, et al. DARPA TIMIT acoustic-phonetic continuous speech corpus, 1993.
  • [20] D. Povey, A. Ghoshal, G. Boulianne, et al. The Kaldi speech recognition toolkit. In Proc. of ASRU, 2011.
  • [21] M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv:1212.5701, 2012.
  • [22] A. Graves. Practical variational inference for neural networks. In Proc. of the 24th NIPS, 2011.
  • [23] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
  • [24] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proc. of the 27th NIPS, 2014. arXiv:1409.3215.
  • [25] L. Toth. Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition. In Proc. of ICASSP, 2014.
  • [26] C. Gulcehre, O. Firat, K. Xu, et al. On using monolingual corpora in neural machine translation. arXiv:1503.03535, 2015.
  • [27] J. Bergstra, O. Breuleux, F. Bastien, et al. Theano: a CPU and GPU math expression compiler. In Proc. of the Python for Scientific Computing Conference (SciPy), 2010.
  • [28] F. Bastien, P. Lamblin, R. Pascanu, et al. Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
  • [29] I. J. Goodfellow, D. Warde-Farley, P. Lamblin, et al. Pylearn2: a machine learning research library. arXiv:1308.4214, 2013.
  • [30] B. van Merrienboer, D. Bahdanau, V. Dumoulin, et al. Blocks and Fuel: Frameworks for deep learning. arXiv:1506.00619, June 2015.