We have shown competitive results with Transformer models for end-to-end automatic speech recognition, which can be trained faster and more stably than long short-term memory (LSTM) models.
A Comparison of Transformer and LSTM Encoder Decoder Models for ASR.
ASRU, pp. 8-15, 2019
We present competitive results using a Transformer encoder-decoder-attention model for end-to-end speech recognition, needing less training time compared to a similarly performing LSTM model. We observe that the Transformer training is in general more stable compared to the LSTM, although it also seems to overfit more, and thus shows more p...
- End-to-end automatic speech recognition (ASR) tries to simplify the modeling aspect, the training pipeline, and the decoding algorithm [1,2,3,4,5,6,7,8].
- The motivation for self-attention is two-fold: it allows for more direct information flow across the whole sequence, and more direct gradient flow.
- It also allows for faster training, as most operations can be computed in parallel under the standard cross-entropy loss (see the minimal attention sketch below).
- More recent work suggested using the Transformer directly as the acoustic model or as an end-to-end ASR model [34,35,36,37,38,39,40,41,42].
- End-to-end ASR also tries to reduce the assumptions made about the data, and better performance can potentially be expected compared with the conventional system, the hybrid hidden Markov model (HMM) / neural network (NN) approach on the phoneme level [9,10,11,12].
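To make the parallelism point above concrete, here is a minimal single-head scaled dot-product self-attention sketch in NumPy. It is not the paper's implementation (the paper uses multi-head attention inside a full Transformer built in RETURNN); the projection sizes and the toy input are placeholders.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a whole sequence.

    x: (T, d) input sequence; w_q, w_k, w_v: (d, d_k) projection matrices.
    Every output position attends to every input position via one matrix
    product, so the computation parallelizes over time and gradients flow
    directly between any two positions.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # (T, d_k) each
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (T, T) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ v                              # (T, d_k) context vectors

# Toy usage: 5 frames of 8-dim features with 8-dim projections.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
w_q, w_k, w_v = (0.1 * rng.standard_normal((8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (5, 8)
```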
- Initial works on encoder-decoder models were usually based on long short-term memory (LSTM) networks.
- We explore an LSTM variant in which the decoder LSTM is decoupled from the attention loop, which allows for faster training.
- Inspired by earlier work, and similar to the Transformer, we investigate a variant where the decoder LSTM depends only on the ground truth rather than on the attention context (see the decoder sketch below).
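The following is a minimal PyTorch sketch of that decoupled-decoder idea, assuming teacher forcing and simple dot-product attention. The paper's actual decoder uses MLP attention and further components, so the class and parameter names here are illustrative only.

```python
import torch
import torch.nn as nn

class DecoupledDecoder(nn.Module):
    """Illustrative decoder whose LSTM sees only the label embeddings.

    Because the attention context is not fed back into the LSTM input, the
    whole label-side recurrence can be run in one call during training, and
    attention over the encoder states is applied afterwards.
    """
    def __init__(self, vocab_size, enc_dim=512, dec_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dec_dim)
        self.lstm = nn.LSTM(dec_dim, dec_dim, batch_first=True)
        self.query = nn.Linear(dec_dim, enc_dim)          # project LSTM states to query space
        self.out = nn.Linear(dec_dim + enc_dim, vocab_size)

    def forward(self, enc_states, labels):
        # enc_states: (B, T_enc, enc_dim); labels: (B, T_dec) ground-truth tokens
        s, _ = self.lstm(self.embed(labels))              # one pass over all labels
        scores = torch.bmm(self.query(s), enc_states.transpose(1, 2))      # (B, T_dec, T_enc)
        context = torch.bmm(torch.softmax(scores, dim=-1), enc_states)     # (B, T_dec, enc_dim)
        return self.out(torch.cat([s, context], dim=-1))  # (B, T_dec, vocab_size) logits
```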
- We have shown competitive results with Transformer models for end-to-end ASR, which can be trained faster and more stably than LSTMs.
- Data augmentation with a variant of SpecAugment improves both the Transformer (by 33% relative) and the LSTM (by 15% relative); a SpecAugment-style masking sketch follows below.
- Overfitting and generalization might be a problem, though, which can partly be overcome with SpecAugment.
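As a rough illustration of the SpecAugment-style augmentation mentioned above, the sketch below zeroes random blocks of time frames and feature channels in a (time, feature) matrix. The masking policy and the parameter values are placeholders; the paper uses its own variant of SpecAugment, not this exact code.

```python
import numpy as np

def spec_mask(feats, num_masks=2, max_t=20, max_f=8, rng=None):
    """Zero out random time and frequency blocks of a (T, F) feature matrix."""
    rng = rng or np.random.default_rng()
    x = feats.copy()
    T, F = x.shape
    for _ in range(num_masks):
        t_len = int(rng.integers(1, max_t + 1))       # length of the time mask
        t0 = int(rng.integers(0, max(1, T - t_len)))
        x[t0:t0 + t_len, :] = 0.0                     # mask consecutive frames
        f_len = int(rng.integers(1, max_f + 1))       # length of the frequency mask
        f0 = int(rng.integers(0, max(1, F - f_len)))
        x[:, f0:f0 + f_len] = 0.0                     # mask consecutive channels
    return x

# Toy usage: mask a 100-frame, 40-dim feature matrix of ones.
masked = spec_mask(np.ones((100, 40)), rng=np.random.default_rng(0))
print(masked.mean())  # fraction of cells left unmasked
```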
- Table fragment (flattened in extraction): speed perturbation vs. no augmentation; training time [days]: 8.
- Table fragment (flattened in extraction): WER [%] dev / test and train speed [char/sec] by model and LM — LSTM with RNN LM: 7.1 / 7.7; LSTM, no LM: 14.6 / 14.7; Transformer², no LM: 15.3 / 16.7.
- Table 1: LibriSpeech 1000h results. Comparing different encoder and decoder depths (number of layers).
- Table 2: LibriSpeech 1000h results (12.5 epochs). Comparing Transformer to LSTM, and its decoupled decoder LSTM (DecLSTM) variant; comparing to other results in the literature; data augmentation most similar to SpecAugment; initial convolutional network (Conv); stable MLP attention projection (ExpV); LongTrain trains for twice as long (25 epochs).
- Table 3: Switchboard 300h results.
- Table 4: TED-LIUM-v2 results. ¹ is a Transformer with an LSTM on top; ² is a Transformer with interleaved LSTM. The training speed is measured on a single Nvidia 1080 Ti GPU. We use our own native CUDA LSTM implementation. All our models use a variant of SpecAugment.
- This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694537, project "SEQCLAS") and from a Google Focused Award.
- The work reflects only the authors' views, and none of the funding parties is responsible for any use that may be made of the information it contains.
-  Hagen Soltau, Hank Liao, and Hasim Sak, “Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition,” in Proc. Interspeech, 2017, pp. 3707–3711.
-  Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, and David Nahamoo, “Direct acoustics-to-word models for english conversational speech recognition,” in Proc. Interspeech, 2017, pp. 959–963.
-  Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,” arXiv preprint arXiv:1609.03193, 2016.
-  Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: first results,” arXiv preprint arXiv:1412.1602, 2014.
-  William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP, 2016.
-  Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, Anuroop Sriram, and Zhenyao Zhu, “Exploring neural transducers for end-to-end speech recognition,” in ASRU, Okinawa, Japan, Dec. 2017, pp. 206–213.
-  Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, “End-to-end attention-based large vocabulary speech recognition,” in ICASSP, 2016, pp. 4945–4949.
-  Hervé Bourlard and Nelson Morgan, Connectionist speech recognition: a hybrid approach, vol. 247, Springer, 1994.
-  Anthony J Robinson, “An application of recurrent nets to phone probability estimation,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
-  Haşim Sak, Andrew W. Senior, and Françoise Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Interspeech, Singapore, Sept. 2014, pp. 338–342.
-  Albert Zeyer, Patrick Doetsch, Paul Voigtlaender, Ralf Schlüter, and Hermann Ney, “A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition,” in ICASSP, New Orleans, LA, USA, Mar. 2017, pp. 2462–2466.
-  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” Proc. ICLR, 2015.
-  Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney, “Improved training of end-to-end attention models for speech recognition,” in Interspeech, Hyderabad, India, Sept. 2018.
-  Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in ICASSP, 2018, pp. 4774–4778.
-  Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Interspeech, Graz, Austria, Sept. 2019, pp. 2613–2617.
-  Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” EMNLP, 2014.
-  Yu Zhang, William Chan, and Navdeep Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in ICASSP, New Orleans, LA, USA, Mar. 2017, pp. 4845–4849.
-  Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin, “Convolutional sequence to sequence learning,” in ICML, 2017, pp. 1243–1252, arXiv:1705.03122.
-  Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M Cohen, Huyen Nguyen, and Ravi Teja Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” arXiv preprint arXiv:1904.03288, 2019.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 6000–6010.
-  Jianpeng Cheng, Li Dong, and Mirella Lapata, “Long short-term memory-networks for machine reading,” in EMNLP, Austin, TX, USA, Nov. 2016, pp. 551–561.
-  Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio, “A structured self-attentive sentence embedding,” ICLR, Apr. 2017.
-  Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit, “A decomposable attention model for natural language inference,” in EMNLP, Austin, TX, USA, Nov. 2016, pp. 2249–2255.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: pre-training of deep bidirectional Transformers for language understanding,” in Proc. NAACL-HLT, Minneapolis, MN, USA, June 2019, pp. 4171–4186.
-  Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” CoRR, vol. abs/1906.08237, 2019.
-  Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, “Improving language understanding by generative pre-training,” OpenAI Blog, 2018.
-  Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, 2019.
-  Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in ACL, Florence, Italy, July 2019.
-  Alexei Baevski and Michael Auli, “Adaptive input representations for neural language modeling,” in ICLR, New Orleans, LA, USA, May 2019.
-  Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones, “Character-level language modeling with deeper self-attention,” in AAAI Conf. on AI., Honolulu, HI, USA, Jan. 2019.
-  Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Language modeling with deep Transformers,” in Interspeech, Graz, Austria, Sept. 2019, pp. 3905–3909.
-  Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur, “A time-restricted self-attention layer for ASR,” in ICASSP, 2018, pp. 5874–5878.
-  Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stüker, and Alex Waibel, “Self-attentional acoustic models,” in Proc. Interspeech, 2018, pp. 3723–3727.
-  Linhao Dong, Shuang Xu, and Bo Xu, “Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in ICASSP, 2018, pp. 5884–5888.
-  Julian Salazar, Katrin Kirchhoff, and Zhiheng Huang, “Self-attention networks for connectionist temporal classification in speech recognition,” in Proc. ICASSP, Brighton, UK, May 2019, pp. 7115–7119.
-  Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer, “Transformers with convolutional context for ASR,” arXiv preprint arXiv:1904.11660, 2019.
-  Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, and Alex Waibel, “Very deep self-attention networks for end-to-end speech recognition,” in Interspeech, Graz, Austria, Sept. 2019, pp. 66–70.
-  Awni Hannun, Ann Lee, Qiantong Xu, and Ronan Collobert, “Sequence-to-sequence speech recognition with time-depth separable convolutions,” in Interspeech, Graz, Austria, Sept. 2019, pp. 3785–3789.
-  Jie Li, Xiaorui Wang, Yan Li, et al., “The Speech-Transformer for large-scale Mandarin Chinese speech recognition,” in ICASSP. IEEE, 2019, pp. 7095–7099.
-  Linhao Dong, Feng Wang, and Bo Xu, “Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping,” arXiv preprint arXiv:1902.06450, 2019.
-  Surafel M Lakew, Mauro Cettolo, and Marcello Federico, “A comparison of Transformer and recurrent neural networks on multilingual neural machine translation,” in International Conference on Computational Linguistics, 2018.
-  Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes, “The best of both worlds: Combining recent advances in neural machine translation,” in Proc. ACL, Melbourne, Australia, July 2018, pp. 76–86.
-  Mia Xu Chen, Benjamin N Lee, Gagan Bansal, Yuan Cao, Shuyuan Zhang, Justin Lu, Jackie Tsay, Yinan Wang, Andrew M Dai, Zhifeng Chen, et al., “Gmail smart compose: Real-time assisted writing,” arXiv preprint arXiv:1906.00080, 2019.
-  Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al., “A comparative study on Transformer vs RNN in speech applications,” in ASRU, 2019.
-  Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu, “Efficient training of BERT by progressively stacking,” in ICML, 2019, pp. 2337–2346.
-  Albert Zeyer, André Merboldt, Ralf Schlüter, and Hermann Ney, “A comprehensive analysis on attention models,” in IRASL Workshop, NeurIPS, Montreal, Canada, Dec. 2018.
-  Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “RWTH ASR systems for LibriSpeech: Hybrid vs attention,” in Interspeech, Graz, Austria, Sept. 2019, pp. 231–235.
-  Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li, “Modeling coverage for neural machine translation,” in ACL, 2016.
-  Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in Interspeech, 2017.
-  Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Mach. Learn. Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton, “Regularizing neural networks by penalizing confident output distributions,” CoRR, vol. abs/1701.06548, 2017.
-  Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston, “Curriculum learning,” in Proc. ICML, Montreal, Canada, June 2009, pp. 41–48.
-  Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016, Version 1.
-  Çağlar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “On using monolingual corpora in neural machine translation,” Computer Speech & Language, vol. 45, pp. 137–148, Sept. 2017.
-  Shubham Toshniwal, Anjuli Kannan, Chung-Cheng Chiu, Yonghui Wu, Tara N Sainath, and Karen Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in Proc. SLT, Athens, Greece, Dec. 2018.
-  Kenton Murray and David Chiang, “Correcting length bias in neural machine translation,” in Proc. WMT, Belgium, Brussels, Oct. 2018, pp. 212–223.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in ICASSP. IEEE, 2015, pp. 5206–5210.
-  John J Godfrey, Edward C Holliman, and Jane McDaniel, “Switchboard: Telephone speech corpus for research and development,” in Proc. ICASSP, San Francisco, CA, USA, Mar. 1992, vol. 1, pp. 517–520.
-  Anthony Rousseau, Paul Deléglise, and Yannick Estève, “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in LREC, 2014, pp. 3935–3939.
-  Albert Zeyer, Tamer Alkhouli, and Hermann Ney, “RETURNN as a generic flexible neural toolkit with application to translation and speech recognition,” in ACL, Melbourne, Australia, July 2018.
-  Norman P Jouppi, Cliff Young, et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. ISCA, Toronto, Canada, June 2017, pp. 1–12.
-  Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014, Version 9.
-  Kyu J Han, Akshay Chandrashekaran, Jungsuk Kim, and Ian Lane, “The CAPIO 2017 conversational speech recognition system,” arXiv preprint arXiv:1801.00059, 2018.