A Comparison of Transformer and LSTM Encoder Decoder Models for ASR.

ASRU, pp. 8-15, 2019


Abstract

We present competitive results using a Transformer encoder-decoder-attention model for end-to-end speech recognition, needing less training time compared to a similarly performing LSTM model. We observe that the Transformer training is in general more stable compared to the LSTM, although it also seems to overfit more, and thus shows more problems with generalization.

Introduction
  • End-to-end automatic speech recognition (ASR) tries to simplify the modeling aspect, the training pipeline, and the decoding algorithm [1,2,3,4,5,6,7,8].
  • The motivation for self-attention is two-fold: it allows for more direct information flow across the whole sequence, and more direct gradient flow.
  • It also allows for faster training, as most operations can be calculated in parallel for the standard cross-entropy loss (a minimal sketch of this parallelism follows this list).
  • More recent work suggested using the Transformer directly as the acoustic model [34] or as an end-to-end ASR model [34,35,36,37,38,39,40,41,42].
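As a rough illustration of that parallelism, here is a minimal NumPy sketch of scaled dot-product self-attention over a whole sequence, assuming a single head, no masking, and toy dimensions; it is not the authors' RETURNN model, only an illustration of how every position is processed at once.

```python
# Minimal self-attention sketch (single head, no masking); the toy dimensions
# are arbitrary assumptions, not the configuration used in the paper.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (T, d) sequence of encoder frames; w_q/w_k/w_v: (d, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # all T positions projected in parallel
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (T, T) pairwise interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                               # each output sees the whole sequence directly

rng = np.random.default_rng(0)
T, d, d_k = 10, 16, 8
x = rng.normal(size=(T, d))
out = self_attention(x, *(rng.normal(size=(d, d_k)) for _ in range(3)))
print(out.shape)  # (10, 8)
```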
Highlights
  • End-to-end automatic speech recognition (ASR) tries to simplify the modeling aspect, the training pipeline, and the decoding algorithm [1,2,3,4,5,6,7,8]. It tries to reduce the assumptions made about the data, and better performance can potentially be expected compared with the conventional system, the hybrid hidden Markov model (HMM) / neural network (NN) approach on the phoneme level [9,10,11,12]
  • Initial works on encoder-decoder models were usually based on long short-term memory (LSTM) [17] networks [18]
  • We explore an LSTM variant where the decoder LSTM is decoupled from the loop, which allows for faster training
  • Inspired by [40], and similar to the Transformer, we investigate a variant where the decoder LSTM depends only on the ground truth instead of the attention context (see the sketch after this list)
  • We have shown competitive results with Transformer models for end-to-end ASR, which can be trained faster and more stably than LSTMs
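The decoupled decoder idea above can be sketched as follows: because the decoder LSTM reads only the ground-truth label embeddings, its recurrence can be run over the full target sequence during cross-entropy training, and the attention context is combined with its outputs only afterwards. The class and function names and all sizes below are illustrative assumptions, not the paper's exact model.

```python
# Hedged sketch of a "decoupled" decoder LSTM: the recurrence depends only on
# ground-truth inputs; attention context is mixed in after the loop.
import numpy as np

class ToyLSTMStep:
    """A single LSTM cell step with random toy weights (illustrative only)."""
    def __init__(self, d_in, hidden, rng):
        self.hidden = hidden
        self.w = rng.normal(scale=0.1, size=(d_in + hidden, 4 * hidden))

    def __call__(self, x_t, h, c):
        z = np.concatenate([x_t, h]) @ self.w
        i, f, o, g = np.split(z, 4)
        sig = lambda a: 1.0 / (1.0 + np.exp(-a))
        c = sig(f) * c + sig(i) * np.tanh(g)
        return sig(o) * np.tanh(c), c

def run_decoder_lstm(label_embeddings, step):
    """Recurrence over ground-truth embeddings only; no attention inside the loop."""
    h = np.zeros(step.hidden)
    c = np.zeros(step.hidden)
    outputs = []
    for x_t in label_embeddings:
        h, c = step(x_t, h, c)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
T_out, d_emb, hidden, d_enc = 5, 8, 16, 12          # toy sizes (assumptions)
labels = rng.normal(size=(T_out, d_emb))            # ground-truth label embeddings
dec_states = run_decoder_lstm(labels, ToyLSTMStep(d_emb, hidden, rng))
context = rng.normal(size=(T_out, d_enc))           # attention context, computed outside the LSTM
readout = np.concatenate([dec_states, context], axis=-1)  # combined after the recurrence
print(readout.shape)                                # (5, 28)
```

Because nothing inside the loop depends on the attention output, the per-step computation stays simple and, in a real implementation, the whole recurrence can be handled by a fast fused LSTM kernel.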
Results
  • Data augmentation with a variant of SpecAugment improves both models, the Transformer by 33% and the LSTM by 15% relative (a sketch of this masking scheme follows below).
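As a rough sketch of this kind of augmentation, the function below applies SpecAugment-style time and frequency masking to a log-mel feature matrix; the number and width of the masks are arbitrary assumptions, not the exact variant used in the paper.

```python
# SpecAugment-style masking sketch: zero out random time spans and frequency
# bands of a (T, F) feature matrix. Mask counts/widths are illustrative only.
import numpy as np

def spec_augment(features, n_time_masks=2, max_t=20, n_freq_masks=2, max_f=8, rng=None):
    """features: (T, F) log-mel spectrogram; returns a masked copy."""
    rng = rng or np.random.default_rng()
    x = features.copy()
    T, F = x.shape
    for _ in range(n_time_masks):                    # mask random time spans
        t = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, max(T - t, 1))
        x[t0:t0 + t, :] = 0.0
    for _ in range(n_freq_masks):                    # mask random frequency bands
        f = rng.integers(0, max_f + 1)
        f0 = rng.integers(0, max(F - f, 1))
        x[:, f0:f0 + f] = 0.0
    return x

masked = spec_augment(np.random.default_rng(0).normal(size=(300, 80)))
print(masked.shape)  # (300, 80)
```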
Conclusion
  • The authors have shown competitive results with Transformer models for end-to-end ASR, which can be trained faster and more stably than LSTMs. Overfitting and generalization might be a problem, though, which can partly be overcome with SpecAugment.
  • Detailed word error rates and training speeds, including comparisons with the LSTM and Transformer results of [35] and [65] and the effect of speed perturbation versus no augmentation, are given in the tables below.
Tables
  • Table 1: LibriSpeech 1000h results. Comparing different encoder and decoder depths (number of layers)
  • Table 2: LibriSpeech 1000h results (12.5 epochs). Comparing the Transformer to the LSTM and its decoupled decoder LSTM (DecLSTM) variant; comparing to other results in the literature; data augmentation most similar to SpecAugment; initial convolutional network (Conv); stable MLP attention projection (ExpV); LongTrain trains for twice as long (25 epochs)
  • Table 3: Switchboard 300h results
  • Table 4: TED-LIUM-v2 results. Superscript 1 denotes a Transformer with an LSTM on top; superscript 2 a Transformer with interleaved LSTM. The training speed is with a single Nvidia 1080 Ti GPU. We use our own native CUDA LSTM implementation. All our models use a variant of SpecAugment
Funding
  • This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 694537, project ”SEQCLAS”) and from a Google Focused Award
  • The work reflects only the authors’ views and none of the funding parties is responsible for any use that may be made of the information it contains
References
  • [2] Hagen Soltau, Hank Liao, and Hasim Sak, “Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition,” in Proc. Interspeech, 2017, pp. 3707–3711.
  • [3] Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, and David Nahamoo, “Direct acoustics-to-word models for English conversational speech recognition,” in Proc. Interspeech, 2017, pp. 959–963.
  • [4] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,” arXiv preprint arXiv:1609.03193, 2016.
  • [5] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: first results,” arXiv preprint arXiv:1412.1602, 2014.
  • [6] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP, 2016.
  • [7] Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, Anuroop Sriram, and Zhenyao Zhu, “Exploring neural transducers for end-to-end speech recognition,” in ASRU, Okinawa, Japan, Dec. 2017, pp. 206–213.
  • [8] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, “End-to-end attention-based large vocabulary speech recognition,” in ICASSP, 2016, pp. 4945–4949.
  • [9] Herve Bourlard and Nelson Morgan, Connectionist speech recognition: a hybrid approach, vol. 247, Springer, 1994.
  • [10] Anthony J Robinson, “An application of recurrent nets to phone probability estimation,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
  • [11] Hasim Sak, Andrew W. Senior, and Francoise Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Interspeech, Singapore, Sept. 2014, pp. 338–342.
  • [12] Albert Zeyer, Patrick Doetsch, Paul Voigtlaender, Ralf Schluter, and Hermann Ney, “A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition,” in ICASSP, New Orleans, LA, USA, Mar. 2017, pp. 2462–2466.
  • [13] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” Proc. ICLR, 2015.
  • [14] Albert Zeyer, Kazuki Irie, Ralf Schluter, and Hermann Ney, “Improved training of end-to-end attention models for speech recognition,” in Interspeech, Hyderabad, India, Sept. 2018.
  • [15] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in ICASSP, 2018, pp. 4774–4778.
  • [16] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Interspeech, Graz, Austria, Sept. 2019, pp. 2613–2617.
  • [17] Sepp Hochreiter and Jurgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [18] Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” EMNLP, 2014.
  • [19] Yu Zhang, William Chan, and Navdeep Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in ICASSP, New Orleans, LA, USA, Mar. 2017, pp. 4845–4849.
  • [20] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin, “Convolutional sequence to sequence learning,” in ICML, 2017, pp. 1243–1252, arXiv:1705.03122.
  • [21] Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M Cohen, Huyen Nguyen, and Ravi Teja Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” arXiv preprint arXiv:1904.03288, 2019.
  • [22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 6000–6010.
  • [23] Jianpeng Cheng, Li Dong, and Mirella Lapata, “Long short-term memory-networks for machine reading,” in EMNLP, Austin, TX, USA, Nov. 2016, pp. 551–561.
  • [24] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio, “A structured self-attentive sentence embedding,” ICLR, Apr. 2017.
  • [25] Ankur P. Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit, “A decomposable attention model for natural language inference,” in EMNLP, Austin, TX, USA, Nov. 2016, pp. 2249–2255.
  • [26] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: pre-training of deep bidirectional Transformers for language understanding,” in Proc. NAACL-HLT, Minneapolis, MN, USA, June 2019, pp. 4171–4186.
  • [27] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” CoRR, vol. abs/1906.08237, 2019.
  • [28] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, “Improving language understanding by generative pre-training,” OpenAI Blog, 2018.
  • [29] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, 2019.
  • [30] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in ACL, Florence, Italy, July 2019.
  • [31] Alexei Baevski and Michael Auli, “Adaptive input representations for neural language modeling,” in ICLR, New Orleans, LA, USA, May 2019.
  • [32] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones, “Character-level language modeling with deeper self-attention,” in AAAI Conf. on AI., Honolulu, HI, USA, Jan. 2019.
  • [33] Kazuki Irie, Albert Zeyer, Ralf Schluter, and Hermann Ney, “Language modeling with deep Transformers,” in Interspeech, Graz, Austria, Sept. 2019, pp. 3905–3909.
  • [34] Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur, “A time-restricted self-attention layer for ASR,” in ICASSP, 2018, pp. 5874–5878.
  • [35] Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stuker, and Alex Waibel, “Self-attentional acoustic models,” in Proc. Interspeech, 2018, pp. 3723–3727.
  • [36] Linhao Dong, Shuang Xu, and Bo Xu, “Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in ICASSP, 2018, pp. 5884–5888.
  • [37] Julian Salazar, Katrin Kirchhoff, and Zhiheng Huang, “Self-attention networks for connectionist temporal classification in speech recognition,” in Proc. ICASSP, Brighton, UK, May 2019, pp. 7115–7119.
  • [38] Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer, “Transformers with convolutional context for ASR,” arXiv preprint arXiv:1904.11660, 2019.
  • [39] Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Muller, and Alex Waibel, “Very deep self-attention networks for end-to-end speech recognition,” in Interspeech, Graz, Austria, Sept. 2019, pp. 66–70.
  • [40] Awni Hannun, Ann Lee, Qiantong Xu, and Ronan Collobert, “Sequence-to-sequence speech recognition with time-depth separable convolutions,” in Interspeech, Graz, Austria, Sept. 2019, pp. 3785–3789.
  • [41] Jie Li, Xiaorui Wang, Yan Li, et al., “The Speech-Transformer for large-scale Mandarin Chinese speech recognition,” in ICASSP. IEEE, 2019, pp. 7095–7099.
  • [42] Linhao Dong, Feng Wang, and Bo Xu, “Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping,” arXiv preprint arXiv:1902.06450, 2019.
  • [43] Surafel M Lakew, Mauro Cettolo, and Marcello Federico, “A comparison of Transformer and recurrent neural networks on multilingual neural machine translation,” in International Conference on Computational Linguistics, 2018.
  • [44] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes, “The best of both worlds: Combining recent advances in neural machine translation,” in Proc. ACL, Melbourne, Australia, July 2018, pp. 76–86.
  • [45] Mia Xu Chen, Benjamin N Lee, Gagan Bansal, Yuan Cao, Shuyuan Zhang, Justin Lu, Jackie Tsay, Yinan Wang, Andrew M Dai, Zhifeng Chen, et al., “Gmail smart compose: Real-time assisted writing,” arXiv preprint arXiv:1906.00080, 2019.
  • [46] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al., “A comparative study on Transformer vs RNN in speech applications,” in ASRU, 2019.
  • [47] Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu, “Efficient training of BERT by progressively stacking,” in ICML, 2019, pp. 2337–2346.
  • [48] Albert Zeyer, Andre Merboldt, Ralf Schluter, and Hermann Ney, “A comprehensive analysis on attention models,” in IRASL Workshop, NeurIPS, Montreal, Canada, Dec. 2018.
  • [49] Christoph Luscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schluter, and Hermann Ney, “RWTH ASR systems for librispeech: Hybrid vs attention,” in Interspeech, Graz, Austria, Sept. 2019, pp. 231–235.
  • [50] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li, “Modeling coverage for neural machine translation,” in ACL, 2016.
  • [51] Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in Interspeech, 2017.
  • [52] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Mach. Learn. Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [53] Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton, “Regularizing neural networks by penalizing confident output distributions,” CoRR, vol. abs/1701.06548, 2017.
  • [54] Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston, “Curriculum learning,” in Proc. ICML, Montreal, Canada, June 2009, pp. 41–48.
  • [55] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016, Version 1.
  • [56] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “On using monolingual corpora in neural machine translation,” Computer Speech & Language, vol. 45, pp. 137–148, Sept. 2017.
  • [57] Shubham Toshniwal, Anjuli Kannan, Chung-Cheng Chiu, Yonghui Wu, Tara N Sainath, and Karen Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in Proc. SLT, Athens, Greece, Dec. 2018.
  • [58] Kenton Murray and David Chiang, “Correcting length bias in neural machine translation,” in Proc. WMT, Belgium, Brussels, Oct. 2018, pp. 212–223.
  • [59] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in ICASSP. IEEE, 2015, pp. 5206–5210.
  • [60] John J Godfrey, Edward C Holliman, and Jane McDaniel, “Switchboard: Telephone speech corpus for research and development,” in Proc. ICASSP, San Francisco, CA, USA, Mar. 1992, vol. 1, pp. 517–520.
  • [61] Anthony Rousseau, Paul Deleglise, and Yannick Esteve, “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in LREC, 2014, pp. 3935–3939.
  • [62] Albert Zeyer, Tamer Alkhouli, and Hermann Ney, “RETURNN as a generic flexible neural toolkit with application to translation and speech recognition,” in ACL, Melbourne, Australia, July 2018.
  • [63] Norman P Jouppi, Cliff Young, et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. ISCA, Toronto, Canada, June 2017, pp. 1–12.
  • [64] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014, Version 9.
  • [65] Kyu J Han, Akshay Chandrashekaran, Jungsuk Kim, and Ian Lane, “The CAPIO 2017 conversational speech recognition system,” arXiv preprint arXiv:1801.00059, 2018.
Authors
Albert Zeyer
Parnia Bahar
Kazuki Irie
Ralf Schlüter
Hermann Ney