EESEN: End-to-End Speech Recognition Using Deep RNN Models and WFST-Based Decoding

2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 167–174

Abstract

The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs). Despite this progress, building a new ASR system remains a challenging task, requiring various resources, multiple training stages and significant expertise. This paper presents our Eesen framework, which …

Introduction
  • Automatic speech recognition (ASR) has traditionally leveraged the hidden Markov model/Gaussian mixture model (HMM/GMM) paradigm for acoustic modeling.
  • On a variety of ASR tasks, DNN models have shown significant gains over GMM models.
  • Despite these advances, building a state-of-the-art ASR system remains a complicated, expertise-intensive task.
  • In the hybrid approach, training of DNNs still relies on GMM models to obtain frame-level labels.
  • The development of ASR systems relies heavily on ASR experts to determine the optimal configurations of a multitude of hyper-parameters, for instance, the number of senones and Gaussians in the GMM models.
Highlights
  • Automatic speech recognition (ASR) has traditionally leveraged the hidden Markov model/Gaussian mixture model (HMM/GMM) paradigm for acoustic modeling
  • The development of ASR systems relies heavily on ASR experts to determine the optimal configurations of a multitude of hyper-parameters, for instance, the number of senones and Gaussians in the GMM models
  • We present our Eesen framework to build end-to-end ASR systems
  • We train the recurrent neural network models in a single step, and are able to reduce the complexity of ASR system development (a minimal training sketch follows this list)
  • Because it is open source, Eesen can serve as a shared benchmark platform for research on end-to-end ASR
  • We are interested in applying Eesen to various languages [32, 33, 34] and different types of speech, and in investigating how end-to-end ASR performs under these conditions
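For illustration, the following is a minimal sketch of what such single-step CTC training looks like, written with PyTorch. This is not Eesen's actual implementation (Eesen ships its own GPU training code); the layer sizes, label inventory, and toy batch are assumptions made for the example.

import torch
import torch.nn as nn

class CTCAcousticModel(nn.Module):
    """Deep bidirectional LSTM over filterbank frames, following the paper's model type."""
    def __init__(self, num_feats=120, hidden=320, num_labels=72):
        super().__init__()
        self.rnn = nn.LSTM(num_feats, hidden, num_layers=4,
                           bidirectional=True, batch_first=True)
        # Output layer over the labels plus the CTC blank (index 0).
        self.proj = nn.Linear(2 * hidden, num_labels + 1)

    def forward(self, x):
        out, _ = self.rnn(x)                      # (batch, time, 2*hidden)
        return self.proj(out).log_softmax(dim=-1)

model = CTCAcousticModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
opt = torch.optim.SGD(model.parameters(), lr=1e-4)

# One toy batch: 2 utterances of 100 frames, 120-dim features (hypothetical data).
feats = torch.randn(2, 100, 120)
labels = torch.randint(1, 73, (2, 20))            # target label ids (phonemes/characters)
feat_lens = torch.tensor([100, 100])
label_lens = torch.tensor([20, 20])

opt.zero_grad()
log_probs = model(feats).transpose(0, 1)          # CTCLoss expects (time, batch, classes)
loss = ctc(log_probs, labels, feat_lens, label_lens)
loss.backward()                                   # one step: no GMM bootstrapping required
opt.step()

Because CTC aligns the label sequence to the frames internally, no frame-level labels from a prior GMM system are needed, which is where the single-step property comes from.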
Methods
  • The experiments are conducted on the Wall Street Journal (WSJ) corpus that can be obtained from LDC under the catalog numbers LDC93S6B and LDC94S13B.
  • Data preparation gives 81 hours of transcribed speech, from which the authors select 95% as the training set and the remaining 5% for cross-validation.
  • As discussed in Section 2, the authors apply deep RNNs as the acoustic models.
  • Inputs of the RNNs are 40-dimensional filterbank features together with their first and second-order derivatives.
  • The features are normalized via mean subtraction and variance normalization on a per-speaker basis (see the sketch after this list).
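To make the input pipeline concrete, here is a NumPy-only sketch of the feature processing just described: 40-dimensional filterbank features extended with first- and second-order derivatives (40 → 120 dims), then mean/variance normalized per speaker. Real systems would compute the filterbanks with a toolkit such as Kaldi; the delta window and the random toy data are assumptions.

import numpy as np

def add_deltas(feats, window=2):
    """Append first- and second-order derivatives (edges wrap, for brevity)."""
    def delta(x):
        num = sum(n * (np.roll(x, -n, axis=0) - np.roll(x, n, axis=0))
                  for n in range(1, window + 1))
        return num / (2 * sum(n * n for n in range(1, window + 1)))
    d1 = delta(feats)
    d2 = delta(d1)
    return np.concatenate([feats, d1, d2], axis=1)

def speaker_cmvn(utts):
    """Mean subtraction and variance normalization over all frames of one speaker."""
    stacked = np.vstack(utts)
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0) + 1e-8               # avoid division by zero
    return [(u - mean) / std for u in utts]

# Two toy utterances (frames x 40 filterbank coefficients) from the same speaker.
utts = [np.random.randn(200, 40), np.random.randn(150, 40)]
utts = speaker_cmvn([add_deltas(u) for u in utts])
print(utts[0].shape)                               # (200, 120)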
Conclusion
  • The authors present the Eesen framework to build end-to-end ASR systems.
  • Eesen exploits deep RNNs as the acoustic models, CTC as the training objective function, and WFST-based decoding (the search-graph construction is summarized after this list).
  • The authors plan to further improve the WERs of Eesen systems via more advanced learning techniques and alternative decoding approaches.
  • Due to the removal of GMMs, acoustic modeling in Eesen cannot leverage speaker-adapted front-ends.
  • The authors will study new speaker adaptation [35, 36] and adaptive training [37, 38] techniques for the CTC models.
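For context on the WFST-based decoding named in the title: Eesen represents the CTC labels, the lexicon, and the language model as finite-state transducers and composes them into a single search graph. In the paper's notation,

S = T ∘ min(det(L ∘ G))

where the token FST T collapses frame-level CTC outputs (blanks and repeated labels) into lexicon units, the lexicon FST L maps those units to words, the grammar FST G encodes the word-level language model, and det/min denote determinization and minimization.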
Tables
  • Table 1: Performance of the phoneme-based Eesen system, and its comparison with the hybrid HMM/DNN system built with Kaldi. “#Param” means the number of parameters
  • Table 2: Comparisons of decoding speed between the phoneme-based Eesen system and the hybrid HMM/DNN system. “RTF” refers to the real-time factor in decoding. “Graph Size” means the size of the decoding graph in megabytes
  • Table 3: Performance of the character-based Eesen system using different vocabularies and language models, and its comparison with results presented in previous work
Funding
  • This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575
  • This research was performed as part of the Speech Recognition Virtual Kitchen project, which is supported by the United States National Science Foundation under grant number CNS-1305365
  • This work was partially funded by Facebook, Inc
References
  • [1] George E. Dahl, Dong Yu, Li Deng, and Alex Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
  • [2] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
  • [3] Frank Seide, Gang Li, Xie Chen, and Dong Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in 2011 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011, pp. 24–29.
  • [4] Andrew Senior, Georg Heigold, Michiel Bacchiani, and Hank Liao, “GMM-free DNN training,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 5639–5643.
  • [5] Michiel Bacchiani, Andrew Senior, and Georg Heigold, “Asynchronous, online, GMM-free training of a context dependent acoustic model for speech recognition,” in Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
  • [6] Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1764–1772.
  • [7] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al., “Deep Speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
  • [8] Awni Y. Hannun, Andrew L. Maas, Daniel Jurafsky, and Andrew Y. Ng, “First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs,” arXiv preprint arXiv:1408.2873, 2014.
  • [9] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: First results,” arXiv preprint arXiv:1412.1602, 2014.
  • [10] Andrew L. Maas, Ziang Xie, Dan Jurafsky, and Andrew Y. Ng, “Lexicon-free conversational speech recognition with neural networks,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015.
  • [11] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, “End-to-end attention-based large vocabulary speech recognition,” arXiv preprint arXiv:1508.04395, 2015.
  • [12] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
  • [13] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006, pp. 369–376.
  • [14] Haşim Sak, Andrew Senior, Kanishka Rao, Ozan Irsoy, Alex Graves, Françoise Beaufays, and Johan Schalkwyk, “Learning acoustic frame labeling for speech recognition with recurrent neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4280–4284.
  • [15] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.
  • [16] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 273–278.
  • [17] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [18] Haşim Sak, Andrew Senior, and Françoise Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
  • [19] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
  • [20] Yajie Miao and Florian Metze, “On speaker adaptation of long short-term memory recurrent neural networks,” in Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015.
  • [21] Yoshua Bengio, Patrice Simard, and Paolo Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
  • [22] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber, “Learning precise timing with LSTM recurrent networks,” The Journal of Machine Learning Research, vol. 3, pp. 115–143, 2003.
  • [23] Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 3, pp. 328–339, 1989.
  • [24] John B. Hampshire, Alexander H. Waibel, et al., “A novel objective function for improved phoneme recognition using time-delay neural networks,” IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 216–228, 1990.
  • [25] Tara N. Sainath, Brian Kingsbury, George Saon, Hagen Soltau, Abdel-rahman Mohamed, George Dahl, and Bhuvana Ramabhadran, “Deep convolutional neural networks for large-scale speech tasks,” Neural Networks, 2014.
  • [26] Lawrence R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
  • [27] Mehryar Mohri, Fernando Pereira, and Michael Riley, “Weighted finite-state transducers in speech recognition,” Computer Speech & Language, vol. 16, no. 1, pp. 69–88, 2002.
  • [28] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Veselý, “The Kaldi speech recognition toolkit,” in 2011 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011, pp. 1–4.
  • [29] Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri, “OpenFst: A general and efficient weighted finite-state transducer library,” in Implementation and Application of Automata, pp. 11–23.
  • [30] Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
  • [31] Hagen Soltau, Florian Metze, Christian Fügen, and Alex Waibel, “A one-pass decoder based on polymorphic linguistic context assignment,” in 2001 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2001, pp. 214–217.
  • [32] Yajie Miao, Hao Zhang, and Florian Metze, “Distributed learning of multilingual DNN feature extractors using GPUs,” in Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
  • [33] Yajie Miao and Florian Metze, “Improving language-universal feature extraction with deep maxout and convolutional neural networks,” in Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
  • [34] Jie Li, Heng Zhang, Xinyuan Cai, and Bo Xu, “Towards end-to-end speech recognition for Chinese Mandarin using long short-term memory recurrent neural networks,” in Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015.
  • [35] Hank Liao, “Speaker adaptation of context dependent deep neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7947–7951.
  • [36] Kaisheng Yao, Dong Yu, Frank Seide, Hang Su, Li Deng, and Yifan Gong, “Adaptation of context-dependent deep neural networks for automatic speech recognition,” in 2012 IEEE Spoken Language Technology Workshop (SLT), 2012.
  • [37] Yajie Miao, Hao Zhang, and Florian Metze, “Towards speaker adaptive training of deep neural network acoustic models,” in Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
  • [38] Yajie Miao, Hao Zhang, and Florian Metze, “Speaker adaptive training of deep neural network acoustic models using i-vectors,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 11, pp. 1938–1949, 2015.