A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition

ICASSP 2021, pp. 6538–6542


Abstract

Building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, many unsupervised pre-training methods have been proposed. Among these methods, Masked Predictive Coding achieved significant improvements on various speech recognition datasets with BERT-like […]

Introduction
  • Current industrial end-to-end automatic speech recognition (ASR) systems rely heavily on large amounts of high quality transcribed audio data.
  • Unsupervised pre-training has shown promising results in several areas, including Computer Vision (CV) [1] and Natural Language Processing (NLP) [2].
  • One work that stands out among these methods is Bidirectional Encoder Representations from Transformers (BERT) [2], which used a Masked Language Model (MLM) pre-training objective and obtained new state-of-the-art results on eleven NLP benchmarks.
  • Other work [15, 16, 17, 18] drew motivation from NLP and applied similar methods to speech tasks.
Highlights
  • Current industrial end-to-end automatic speech recognition (ASR) systems rely heavily on large amounts of high quality transcribed audio data
  • It is worthwhile to explore how to effectively use un-transcribed data to improve the performance of speech recognition systems when labeled data are limited
  • Despite abundant work [21, 22, 23] in Natural Language Processing (NLP) on knowledge transfer between pre-trained models and downstream tasks, there is little work exploring how to perform better knowledge transfer in the area of speech. We investigate these aspects of Masked Predictive Coding (MPC) and discuss how MPC can be extended for better speech recognition (a minimal MPC sketch follows this list)
  • To apply MPC in streaming models, the Transformer encoder needs to be restricted to only use information that has appeared before
  • Inspired by [28], we propose to use a unified training objective that combines MPC and Autoregressive Predictive Coding (APC)
  • Pre-training data with matching speaking style was found to be more useful on downstream recognition tasks
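The highlights above describe MPC as a BERT-style masked-reconstruction objective applied to acoustic features. Below is a minimal, hypothetical sketch of what one MPC pre-training step could look like on log-mel filterbank inputs. The 80-dim features, 15% frame-masking ratio, model sizes, and L1 reconstruction loss are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of a Masked Predictive Coding (MPC) step, in the spirit of
# BERT's masked-LM objective applied to acoustic features. Shapes, the 15% mask
# ratio, and the L1 loss are assumptions; the paper's masking scheme may differ.
import torch
import torch.nn as nn

class MPCModel(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.output_proj = nn.Linear(d_model, feat_dim)  # reconstruct fbank frames

    def forward(self, feats, attn_mask=None):
        h = self.input_proj(feats)            # (batch, time, d_model)
        h = self.encoder(h, mask=attn_mask)
        return self.output_proj(h)            # (batch, time, feat_dim)

def mpc_step(model, feats, mask_ratio=0.15):
    """One MPC pre-training step: zero out random frames, predict the originals."""
    mask = torch.rand(feats.shape[:2], device=feats.device) < mask_ratio  # (B, T)
    corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)  # zero the masked frames
    pred = model(corrupted)
    # L1 reconstruction loss computed only on the masked positions
    loss = (pred - feats).abs()[mask].mean()
    return loss
```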
Methods
  • To apply MPC in streaming models, the Transformer encoder needs to be restricted to only use information that has appeared before.
  • With probability p, the authors apply a triangular (causal) attention mask to the Transformer encoder and use the APC objective; with probability 1 - p, they use the Transformer encoder as-is with the MPC objective (see the sketch after this list).
  • This parameter sharing framework has the advantage of making the learned speech representations more general because they are jointly optimized for different pre-training objectives where context is utilized in different ways
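The following is a hedged sketch of the unified objective described in the bullets above: with probability p the encoder runs under a triangular (causal) attention mask with an APC-style future-frame prediction loss, otherwise it runs bidirectionally with the MPC loss from the earlier sketch. The values of p and the prediction step k are illustrative choices, not the paper's hyper-parameters.

```python
# Hedged sketch of the unified MPC + APC training objective. `mpc_step` and the
# model's `attn_mask` argument come from the previous sketch; p and k are
# illustrative values only.
import random
import torch

def unified_step(model, feats, p=0.5, k=3):
    batch, time, _ = feats.shape
    if random.random() < p:
        # APC branch: triangular (causal) mask so position t only attends to <= t
        causal = torch.triu(
            torch.full((time, time), float("-inf"), device=feats.device), diagonal=1
        )
        pred = model(feats, attn_mask=causal)
        # predict the frame k steps in the future from each position
        loss = (pred[:, :-k] - feats[:, k:]).abs().mean()
    else:
        # MPC branch: bidirectional encoder with masked reconstruction
        loss = mpc_step(model, feats)
    return loss
```

Because both branches share the same encoder parameters, the learned representations are jointly optimized for left-to-right and bidirectional use of context, which is exactly the generality argument made in the last bullet.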
Conclusion
  • The authors investigated three important aspects of MPC.
  • Pre-training data with matching speaking style was found to be more useful on downstream recognition tasks.
  • Using MPC directly on streaming models helps, but combining MPC with APC brings further improvements on streaming models.
  • The combination of target data adaption and layer-wise discriminative training provides consistent gains in knowledge transfer to downstream tasks (a sketch of layer-wise discriminative training follows below)
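Layer-wise discriminative training is commonly implemented (in the spirit of ULMFiT [21]) by giving each encoder layer its own learning rate, decayed from the output side toward the input so that lower, more general layers change more slowly during fine-tuning. The sketch below shows one plausible way to build such parameter groups for the encoder defined earlier; the decay factor, base learning rate, and the helper name `layerwise_param_groups` are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of layer-wise discriminative fine-tuning: per-layer learning
# rates decayed geometrically from the top encoder layer down. Values are
# illustrative, not the paper's.
import torch

def layerwise_param_groups(model, base_lr=1e-3, decay=0.75):
    groups = []
    layers = list(model.encoder.layers)  # nn.TransformerEncoder exposes .layers
    num = len(layers)
    for i, layer in enumerate(layers):
        # top layer (i = num - 1) keeps base_lr; each layer below is scaled by `decay`
        lr = base_lr * (decay ** (num - 1 - i))
        groups.append({"params": layer.parameters(), "lr": lr})
    # remaining parameters (input/output projections, etc.) train at the base rate
    encoder_ids = {id(p) for layer in layers for p in layer.parameters()}
    rest = [p for p in model.parameters() if id(p) not in encoder_ids]
    groups.append({"params": rest, "lr": base_lr})
    return groups

# usage with the sketch model defined earlier
optimizer = torch.optim.Adam(layerwise_param_groups(MPCModel()))
```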
Tables
  • Table1: Details of pre-training corpora
  • Table2: Character Error Rates (%) and Relative Error Rate Reduction (%) on the HKUST and AISHELL test sets
  • Table3: Word Error Rates (%) and Relative Error Rate Reduction (%) on the Switchboard and CallHome test sets
  • Table4: Character Error Rates (%) and Relative Error Rate Reduction (%) for uni-directional CTC and uni-directional RNN-T with pre-trained MPC
  • Table5: MPC + APC is the model pre-trained with the APC objective 50% of the time and the MPC objective 50% of the time. Relative Error Rate Reduction (%) is calculated against the HKUST baseline without MPC
  • Table6: Results on HKUST and AISHELL with different knowledge transfer methods. HKUST is pre-trained with Didi Callcenter. AISHELL is pre-trained with Didi Dictation. Target Data + Layer-wise is the combination of target data adaption and layer-wise discriminative training
Reference
  • [1] C. Doersch, A. Gupta, and A. Efros, "Unsupervised visual representation learning by context prediction," in ICCV, 2015, pp. 1422–1430.
  • [2] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT, 2019, pp. 4171–4186.
  • [3] A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
  • [4] A. Baevski, S. Schneider, and M. Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," arXiv preprint arXiv:1910.05453, 2019.
  • [5] A. Baevski, M. Auli, and A. Mohamed, "Effectiveness of self-supervised pre-training for speech recognition," arXiv preprint arXiv:1911.03912, 2019.
  • [6] M. Ravanelli and Y. Bengio, "Learning speaker representations with mutual information," in Interspeech, 2019.
  • [7] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," in Interspeech, 2019.
  • [8] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, et al., "Learning problem-agnostic speech representations from multiple self-supervised tasks," in Interspeech, 2019.
  • [9] K. Kawakami, L. Wang, C. Dyer, P. Blunsom, and A. van den Oord, "Learning robust and multilingual speech representations," arXiv preprint arXiv:2001.11128, 2020.
  • [10] M. Riviere, A. Joulin, P.-E. Mazare, and E. Dupoux, "Unsupervised pretraining transfers well across languages," arXiv preprint arXiv:2002.02848, 2020.
  • [11] Z. Lian, Y. Li, J. Tao, and J. Huang, "Improving speech emotion recognition via transformer-based predictive coding through transfer learning," arXiv preprint arXiv:1811.07691, 2018.
  • [12] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, "An unsupervised autoregressive model for speech representation learning," in Interspeech, 2019.
  • [13] Y.-A. Chung and J. Glass, "Generative pre-training for speech with autoregressive predictive coding," arXiv preprint arXiv:1910.12607, 2019.
  • [14] Y.-A. Chung and J. Glass, "Improved speech representations with multi-target autoregressive predictive coding," arXiv preprint arXiv:2004.05274, 2020.
  • [15] A. T. Liu, S. Yang, P.-H. Chi, P.-C. Hsu, et al., "Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders," arXiv preprint arXiv:1910.12638, 2019.
  • [16] W. Wang, Q. Tang, and K. Livescu, "Unsupervised pre-training of bidirectional speech encoders via masked reconstruction," arXiv preprint arXiv:2001.10603, 2020.
  • [17] X. Song, G. Wang, Z. Wu, Y. Huang, et al., "Speech-XLNet: Unsupervised acoustic model pretraining for self-attention networks," arXiv preprint arXiv:1910.10387, 2019.
  • [18] D. Jiang, X. Lei, W. Li, N. Luo, et al., "Improving transformer-based speech recognition using unsupervised pre-training," arXiv preprint arXiv:1910.09932, 2019.
  • [19] M. Benzeghiba, R. De Mori, O. Deroo, S. Dupont, et al., "Automatic speech recognition and speech variability: A review," Speech Communication, vol. 49, no. 10-11, pp. 763–786, 2007.
  • [20] M. Weintraub, K. Taussig, K. Hunicke-Smith, and A. Snodgrass, "Effect of speaking style on LVCSR performance," in Proc. ICSLP, vol. 96, 1996.
  • [21] J. Howard and S. Ruder, "Universal language model fine-tuning for text classification," in ACL, 2018.
  • [22] A. Chronopoulou, C. Baziotis, and A. Potamianos, "An embarrassingly simple approach for transfer learning from pretrained language models," in NAACL-HLT, 2019.
  • [23] C. Sun, X. Qiu, Y. Xu, and X. Huang, "How to fine-tune BERT for text classification?" arXiv preprint arXiv:1905.05583, 2019.
  • [24] Y. Liu, M. Ott, N. Goyal, J. Du, et al., "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
  • [25] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in ICASSP, 2016, pp. 4960–4964.
  • [26] T. N. Sainath, C.-C. Chiu, R. Prabhavalkar, A. Kannan, et al., "Improving the performance of online neural transducer models," in ICASSP, 2018, pp. 5864–5868.
  • [27] H. Miao, G. Cheng, C. Gao, P. Zhang, and Y. Yan, "Transformer-based online CTC/attention end-to-end speech recognition architecture," arXiv preprint arXiv:2001.08290, 2020.
  • [28] L. Dong, N. Yang, W. Wang, F. Wei, et al., "Unified language model pre-training for natural language understanding and generation," in NeurIPS, 2019.
  • [29] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, et al., "An empirical investigation of catastrophic forgetting in gradient-based neural networks," arXiv preprint arXiv:1312.6211, 2013.
  • [30] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
  • [31] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in NIPS, 2014.
  • [32] N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, et al., "Linguistic knowledge and transferability of contextual representations," arXiv preprint arXiv:1903.08855, 2019.
  • [33] X. Shi, I. Padhi, and K. Knight, "Does string-based neural MT learn source syntax?" in EMNLP, 2016.
  • [34] Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, et al., "Fine-grained analysis of sentence embeddings using auxiliary prediction tasks," arXiv preprint arXiv:1608.04207, 2017.
  • [35] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015, pp. 5206–5210.
  • [36] C. Cieri, D. Miller, and K. Walker, "The Fisher corpus: A resource for the next generations of speech-to-text," in LREC, 2004.
  • [37] W. Zou, D. Jiang, S. Zhao, G. Yang, et al., "Comparable study of modeling units for end-to-end Mandarin speech recognition," in ISCSLP, 2018, pp. 369–373.
  • [38] T. Zenkel, R. Sanabria, F. Metze, and A. H. Waibel, "Subword and crossword units for CTC acoustic models," in Interspeech, 2018.
  • [39] S. Karita, N. Chen, T. Hayashi, T. Hori, et al., "A comparative study on Transformer vs RNN in speech applications," arXiv preprint arXiv:1909.06317, 2019.
  • [40] J. Salazar, K. Kirchhoff, and Z. Huang, "Self-attention networks for connectionist temporal classification in speech recognition," in ICASSP, 2019, pp. 7115–7119.
  • [41] C.-F. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, et al., "Transformer-transducer: End-to-end speech recognition with self-attention," arXiv preprint arXiv:1910.12977, 2019.
  • [42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, et al., "Attention is all you need," in NIPS, 2017, pp. 5998–6008.
  • [43] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, et al., "SpecAugment: A simple data augmentation method for automatic speech recognition," in Interspeech, 2019.
  • [44] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in ICASSP, 2017.
Author
Dongwei Jiang
Wubo Li
Ruixiong Zhang
Miao Cao
Ne Luo
Yang Han