Towards End-To-End Speech Recognition with Recurrent Neural Networks.

ICML, pp. 1764–1772, 2014


Abstract

This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function allows the network to be directly optimised for word error rate, and the network outputs are integrated with a language model during decoding.

Introduction
  • Recent advances in algorithms and computer hardware have made it possible to train neural networks in an end-to-end fashion for tasks that previously required significant human expertise.
  • Convolutional neural networks are able to directly classify raw pixels into high-level concepts such as object categories (Krizhevsky et al., 2012) and messages on traffic signs (Ciresan et al., 2011), without using hand-designed feature extraction algorithms.
  • Not only do such networks require less human effort than traditional approaches, they generally deliver superior performance.
  • These techniques are only suitable for retraining a system already trained at frame-level, and require the careful tuning of a large number of hyper-parameters, typically even more than the tuning required for deep neural networks.
Highlights
  • Recent advances in algorithms and computer hardware have made it possible to train neural networks in an end-to-end fashion for tasks that previously required significant human expertise
  • In the standard hybrid approach, neural networks are trained to classify individual frames of acoustic data, and their output distributions are reformulated as emission probabilities for a hidden Markov model (HMM)
  • These techniques are only suitable for retraining a system already trained at frame-level, and require the careful tuning of a large number of hyper-parameters, typically even more than the tuning required for deep neural networks
  • The recurrent neural network was retrained to minimise the expected word error rate using the method from Section 4, with five alignment samples per sequence (a sketch of this objective follows the list)
  • This paper has demonstrated that character-level speech transcription can be performed by a recurrent neural network with minimal preprocessing and no explicit phonetic representation
  • We have introduced a novel objective function that allows the network to be directly optimised for word error rate, and shown how to integrate the network outputs with a language model during decoding
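The expected-loss objective referred to above is not reproduced in this extract. A plausible sketch, consistent with the sampling scheme described (five alignment samples per sequence), is:

$$\mathcal{L}(x) \;=\; \sum_{y} \Pr(y \mid x)\, \mathrm{WER}(y, z) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \mathrm{WER}\big(\mathcal{B}(a^{(i)}), z\big), \qquad a^{(i)} \sim \Pr(a \mid x),$$

where $z$ is the target transcript, $\mathcal{B}$ is the CTC map that collapses a frame-level alignment $a^{(i)}$ into a transcription, and the $N$ alignment samples ($N = 5$ in the experiments above) are drawn from the network's per-frame output distributions. The gradient of such an expectation can be estimated in the policy-gradient style of Peters & Schaal (2008), cited below, with each sample weighted by its transcription loss.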
Methods
  • For the version of LSTM used in this paper (Gers et al., 2002), H is implemented by a composite function whose first equation (the input gate) is $i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$; the full set of equations is reconstructed after this list.
  • The RNN was trained on both the 14 hour subset ‘train-si84’ and the full 81 hour set, with the ‘test-dev93’ development set used for validation
  • For both training sets, the RNN was trained with CTC, as described in Section 3, using the characters in the transcripts as the target sequences (the collapsing map that CTC uses to turn frame-level paths into character strings is sketched after this list).
  • The input data were presented as spectrograms derived from the raw audio files using the ‘specgram’ function of the ‘matplotlib’ python toolkit, with width-254 Fourier windows and an overlap of 127 frames, giving 128 inputs per frame (a sketch of this front end also follows the list).
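Only the input-gate line of the composite function H survives in this extract. The remaining equations of the LSTM variant cited in the first bullet above (Gers et al., 2002, with peephole connections), as used throughout Graves' related work, are:

$$\begin{aligned} i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\ f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\ c_t &= f_t\, c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\ o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1} + b_o) \\ h_t &= o_t \tanh(c_t), \end{aligned}$$

where $\sigma$ is the logistic sigmoid and $i$, $f$, $o$ and $c$ are the input gate, forget gate, output gate and cell activations, each the same size as the hidden vector $h$.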
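CTC (Graves et al., 2006) scores a character sequence by summing over all frame-level paths that collapse to it: repeated labels are merged, then blanks are deleted. A minimal sketch of that collapsing map (function and blank-symbol names are illustrative, not from the paper):

```python
def ctc_collapse(path: str, blank: str = "_") -> str:
    """Map a frame-level CTC path to a transcription:
    merge consecutive repeats, then delete blank symbols."""
    merged = []
    prev = None
    for label in path:
        if label != prev:  # drop consecutive repeats
            merged.append(label)
        prev = label
    return "".join(label for label in merged if label != blank)

# Example: "hh_e_ll_llo" collapses to "hello".
assert ctc_collapse("hh_e_ll_llo") == "hello"
```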
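The spectrogram front end in the last bullet can be reproduced directly with matplotlib. A minimal sketch, assuming 16 kHz mono audio in a 1-D numpy array and log compression of the magnitudes (the sample rate and the log scaling are assumptions, not stated in this extract):

```python
import numpy as np
from matplotlib import mlab

def audio_to_inputs(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Spectrogram features: 254-point Fourier windows with an overlap
    of 127 samples give floor(254 / 2) + 1 = 128 bins per frame."""
    spectrum, _freqs, _times = mlab.specgram(
        audio, NFFT=254, Fs=sample_rate, noverlap=127
    )
    # specgram returns a (128, num_frames) power spectrum; transpose to
    # time-major order and log-compress the dynamic range.
    return np.log(spectrum.T + 1e-10)
```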
Results
  • The improvement of more than 1% absolute over the baseline is considerably larger than the slight gains usually seen with model averaging; this is presumably due to the greater difference between the systems.
  • By combining the new model with a baseline, the authors have achieved state-of-the-art accuracy on the Wall Street Journal corpus for speaker independent recognition
Conclusion
  • To provide character-level transcriptions, the network must not only learn how to recognise speech sounds, but also how to transform them into letters.
  • The following example from the evaluation set, decoded with no dictionary or language model, gives some insight into how the network operates:
    target: TO ILLUSTRATE THE POINT A PROMINENT MIDDLE EAST ANALYST IN WASHINGTON RECOUNTS A CALL FROM ONE CAMPAIGN
    output: TWO ALSTRAIT THE POINT A PROMINENT MIDILLE EAST ANA-...
  • By combining the new model with a baseline, the authors have achieved state-of-the-art accuracy on the Wall Street Journal corpus for speaker independent recognition
Tables
  • Table 1: Wall Street Journal Results. All scores are word error rate/character error rate (where known) on the evaluation set. ‘LM’ is the language model used for decoding. ‘14 Hr’ and ‘81 Hr’ refer to the amount of data used for training (a sketch of the error-rate metric follows)
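Both scores in the table are normalised edit distances: word error rate over word tokens, character error rate over characters. A minimal sketch of the standard computation (the textbook definition, not code from the paper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            diag, row[j] = row[j], min(
                row[j] + 1,          # deletion
                row[j - 1] + 1,      # insertion
                diag + (r != h),     # substitution (free if tokens match)
            )
    return row[-1]

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def character_error_rate(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / len(reference)
```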
Funding
  • This work was partially supported by the Canadian Institute for Advanced Research
References
  • Bahl, L., Brown, P., De Souza, P. V., and Mercer, R. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In ICASSP '86, volume 11, pp. 49–52, April 1986. doi: 10.1109/ICASSP.1986.1169179.
  • Bisani, Maximilian and Ney, Hermann. Open vocabulary speech recognition with flat hybrid models. In INTERSPEECH, pp. 725–728, 2005.
  • Bourlard, Herve A. and Morgan, Nelson. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, Norwell, MA, USA, 1993. ISBN 0792393961.
  • Ciresan, Dan C., Meier, Ueli, Masci, Jonathan, and Schmidhuber, Jürgen. A committee of neural networks for traffic sign classification. In IJCNN, pp. 1918–1921. IEEE, 2011.
  • Davis, S. and Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4):357–366, August 1980.
  • Eyben, F., Wöllmer, M., Schuller, B., and Graves, A. From speech to letters - using a novel neural network architecture for grapheme based ASR. In Proc. Automatic Speech Recognition and Understanding Workshop (ASRU 2009), Merano, Italy. IEEE, 2009.
  • Galescu, Lucian. Recognition of out-of-vocabulary words with sub-lexical language models. In INTERSPEECH, 2003.
  • Gers, F., Schraudolph, N., and Schmidhuber, J. Learning Precise Timing with LSTM Recurrent Networks. Journal of Machine Learning Research, 3:115–143, 2002.
  • Graves, A. and Schmidhuber, J. Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures. Neural Networks, 18(5-6):602–610, June/July 2005.
  • Graves, A., Fernandez, S., Gomez, F., and Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML, Pittsburgh, USA, 2006.
  • Graves, A., Mohamed, A., and Hinton, G. Speech recognition with deep recurrent neural networks. In Proc. ICASSP 2013, Vancouver, Canada, May 2013.
  • Graves, Alex. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385 of Studies in Computational Intelligence. Springer, 2012.
  • Hinton, G. E. and Salakhutdinov, R. R. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, July 2006.
  • Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
  • Jaitly, Navdeep and Hinton, Geoffrey E. Learning a better representation of speech soundwaves using restricted Boltzmann machines. In ICASSP, pp. 5884–5887, 2011.
  • Jaitly, Navdeep, Nguyen, Patrick, Senior, Andrew W., and Vanhoucke, Vincent. Application of pretrained deep neural networks to large vocabulary speech recognition. In INTERSPEECH, 2012.
  • Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
  • Lee, Li and Rose, R. A frequency warping approach to speaker normalization. IEEE Transactions on Speech and Audio Processing, 6(1):49–60, January 1998.
  • Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.
  • Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., and Vesely, K. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011.
  • Schuster, M. and Paliwal, K. K. Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing, 45:2673–2681, 1997.