Adversarial Meta Sampling for Multilingual Low-Resource Speech Recognition


Abstract:

Low-resource automatic speech recognition (ASR) is challenging, as the low-resource target language data cannot well train an ASR model. To solve this issue, meta-learning formulates ASR for each source language into many small ASR tasks and meta-learns a model initialization on all tasks from different source languages to access fast adaptation to the target language. …

Introduction
  • Automatic Speech Recognition (ASR) has recently attracted much attention and achieved significant improvements (Chan et al. 2016; Graves et al. 2006; Pratap et al. 2019), driven by the success of deep neural networks.
  • As shown in Fig. 1 (a), the initialization learnt by transfer learning ASR (TL-ASR) often overfits to the source language and cannot quickly adapt to a different target language.
  • To resolve this issue, multilingual transfer learning ASR (MTL-ASR) and multilingual meta-learning ASR (MML-ASR) consider multiple source languages.
  • For each sampled task, MTL-ASR directly trains its model on this task, while MML-ASR fine-tunes its model on a few training samples of the task and then minimizes the loss on the task's validation data.
  • In this way, the initializations learnt by MTL-ASR and MML-ASR can usually adapt quickly to the target low-resource language, as both methods learn common knowledge across tasks from different language domains, which facilitates learning target languages (a minimal sketch of this adapt-then-validate loop follows this list).
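To make the adapt-then-validate loop concrete, below is a minimal first-order MAML-style sketch in PyTorch. The toy linear model, random tensors and MSE loss are illustrative placeholders only; the actual MML-ASR setup trains end-to-end ASR networks (e.g., joint attention-CTC) on per-language speech tasks.

    # Minimal first-order sketch of the MML-ASR inner/outer loop described above.
    import copy
    import torch
    import torch.nn as nn

    model = nn.Linear(40, 30)                       # placeholder "ASR model" (40-dim acoustic features)
    meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()                          # placeholder loss; real systems use CTC/attention losses

    def sample_task():
        """Return (support, query) data of one small ASR task from a sampled source language."""
        x_s, y_s = torch.randn(8, 40), torch.randn(8, 30)   # support ("training") split of the task
        x_q, y_q = torch.randn(8, 40), torch.randn(8, 30)   # query ("validation") split of the task
        return (x_s, y_s), (x_q, y_q)

    for meta_step in range(100):
        (x_s, y_s), (x_q, y_q) = sample_task()
        # Inner loop: fine-tune a copy of the shared initialization on the task's few training samples.
        fast_model = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(fast_model.parameters(), lr=1e-2)
        for _ in range(3):
            inner_opt.zero_grad()
            loss_fn(fast_model(x_s), y_s).backward()
            inner_opt.step()
        # Outer loop: evaluate the adapted copy on the query split and move the shared
        # initialization toward parameters that adapt well (first-order approximation:
        # the query gradient of the adapted copy is applied directly to the initialization).
        query_loss = loss_fn(fast_model(x_q), y_q)
        grads = torch.autograd.grad(query_loss, fast_model.parameters())
        meta_opt.zero_grad()
        for p, g in zip(model.parameters(), grads):
            p.grad = g.clone()
        meta_opt.step()

MTL-ASR corresponds to skipping the inner loop and training the shared model directly on each sampled task.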
Highlights
  • Automatic Speech Recognition (ASR) has recently attracted much attention and achieved significant improvements (Chan et al. 2016; Graves et al. 2006; Pratap et al. 2019), driven by the success of deep neural networks
  • The representative methods in this line are transfer learning ASR (TL-ASR) (Hu et al. 2019; Kunze et al. 2017), multilingual transfer learning ASR (MTL-ASR) (Adams et al. 2019; Cho et al. 2018; Tong, Garner, and Bourlard 2017) and multilingual meta-learning ASR (MML-ASR) (Hsu, Chen, and Lee 2020), which all aim to learn an ASR model initialization from source languages such that the initialization can quickly adapt to the target language via fine-tuning on a few data samples
  • Our method consistently achieves state-of-the-art performance on Indo12, which eliminates task-quantity imbalance, and on Indo9, which has much fewer source languages for training
  • The results on Kabyle are much worse than those on Spanish and Dutch because Kabyle is an Afro-Asiatic language while all source languages are Indo-European, which indicates that source languages from the same language family are more helpful for target languages
  • Our AMS achieves the best results for all target languages, showing that the proposed adversarial meta sampling improves the performance of both multilingual meta-learning ASR (MML-ASR) and multilingual transfer learning ASR (MTL-ASR)
  • From Tables 2 and 4, one can observe that with 80% of the training data, our method still works better than most baselines trained on 100% of the data, which demonstrates that our method can effectively alleviate the need for large amounts of annotated training data
  • To tackle the task-imbalance problem caused by differing task difficulties and quantities across languages, we develop a novel Adversarial Meta Sampling framework that adaptively samples language tasks for learning a better model initialization for target low-resource languages (a minimal sampling sketch follows this list)
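The adversarial sampler can be viewed as a min-max game between the ASR model and a task-sampling policy over source languages: the model minimizes the meta-loss on sampled tasks, while the sampler learns to propose tasks the model currently handles poorly. Below is a hypothetical sketch of that idea with a softmax policy and a REINFORCE-style update (Williams 1992); the paper's actual sampler network, reward and update rule may differ.

    # Hypothetical sketch of adversarial task sampling over source languages.
    import torch

    num_langs = 8
    logits = torch.zeros(num_langs, requires_grad=True)      # sampling-policy parameters
    policy_opt = torch.optim.Adam([logits], lr=1e-2)

    def meta_loss_of_sampled_task(lang_id: int) -> float:
        """Placeholder: run one inner/outer meta-learning step on a task of this
        language (as in the earlier sketch) and return its query (meta) loss."""
        return float(torch.rand(()))                          # dummy value for illustration

    for step in range(200):
        probs = torch.softmax(logits, dim=0)
        lang = int(torch.multinomial(probs, 1))               # sample a source language
        loss_value = meta_loss_of_sampled_task(lang)          # the ASR model is updated to reduce this ...
        # ... while the sampler is rewarded for proposing hard (high-loss) languages:
        policy_opt.zero_grad()
        policy_loss = -loss_value * torch.log(probs[lang])    # REINFORCE: ascend the expected meta-loss
        policy_loss.backward()
        policy_opt.step()

In practice, meta_loss_of_sampled_task would run one MML-ASR (or MTL-ASR) update on a task of the sampled language, so the sampler and the ASR model are trained jointly.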
Methods
  • The comparison covers monolingual baselines (Multi-CTC (Hsu et al. 2020), BLSTMP (Cho et al. 2018), VGG-Small and VGG-Large (Chen et al. 2020), and Joint attention-CTC (Hori et al. 2017)), MTL-ASR with Multi-CTC (Hsu et al. 2020) and Joint attention-CTC (Watanabe et al. 2017) backbones, MML-ASR with Multi-CTC and Joint attention-CTC backbones (Hsu et al. 2020), and the proposed AMS applied to both MTL-ASR and MML-ASR, evaluated on the target languages Vietnamese, Swahili and Tamil.
  • The authors select 8 source languages with 10 hours of data each and 3 target languages with less than 10 hours each from CoVoST (Wang et al. 2020), a multilingual speech translation (ST) corpus.
  • This indicates that the method can automatically sample tasks according to task difficulty, alleviating the imbalance caused by different language difficulties.
Results
  • As shown in Table 2, the method consistently achieves state-of-the-art performance on Indo12, which eliminates task-quantity imbalance, and on Indo9, which has much fewer source languages for training (see the toy example after this list).
  • This is because AMS uses adversarial sampling to select better tasks for effective learning and thereby overcomes the task-difficulty imbalance issue.
  • The authors' AMS achieves the best results for all target languages, showing that the proposed adversarial meta sampling improves the performance of both MML-ASR and MTL-ASR.
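Task-quantity imbalance arises because the source languages contribute very different amounts of data (Table 1): if each language's recordings are split into small tasks and tasks are drawn uniformly, high-resource languages dominate meta-training. A toy illustration, with made-up hour counts rather than the paper's dataset statistics:

    # Toy illustration: uniform task sampling is biased toward high-resource languages.
    hours = {"fr": 100, "de": 80, "es": 50, "nl": 10, "ru": 10, "it": 10, "pt": 10, "tr": 5}  # made-up values
    minutes_per_task = 30
    tasks = {lang: int(h * 60 / minutes_per_task) for lang, h in hours.items()}
    total = sum(tasks.values())
    for lang, n in tasks.items():
        print(f"{lang}: {n:4d} tasks, sampled with probability {n / total:.1%} under uniform task sampling")

AMS counteracts this by learning the task-sampling distribution instead of fixing it to the per-language task counts.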
Conclusion
  • To tackle the task-imbalance problem caused by differing task difficulties and quantities across languages, the authors develop a novel Adversarial Meta Sampling framework that adaptively samples language tasks to learn a better model initialization for target low-resource languages.
  • Extensive experimental results validate that the method effectively improves the few-shot learning ability of both meta-learning and transfer learning and generalizes well to other low-resource speech tasks.
Tables
  • Table 1: Multilingual dataset statistics in terms of hours (h)
  • Table 2: Results of low-resource ASR on Diversity11, Indo12 and Indo9 in terms of WER (%)
  • Table 3: Results of low-resource ASR on IARPA BABEL in terms of Character Error Rate (CER, %)
  • Table 4: Ablation study results on Diversity11 in terms of WER (%)
  • Table 5: Results of speech classification in terms of accuracy (%)
  • Table 6: Results of speech translation in terms of BLEU
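For reference, the WER and CER reported in Tables 2-4 are edit-distance error rates over words and characters respectively. A minimal reference implementation (not the paper's evaluation script):

    # Word error rate = Levenshtein distance between word sequences / number of reference words.
    def edit_distance(ref, hyp):
        """Levenshtein distance between two token sequences (rolling single-row DP)."""
        d = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, d[0] = d[0], i
            for j, h in enumerate(hyp, 1):
                prev, d[j] = d[j], min(d[j] + 1,          # delete a reference token
                                       d[j - 1] + 1,      # insert a hypothesis token
                                       prev + (r != h))   # substitute (or match)
        return d[len(hyp)]

    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        return edit_distance(ref, hyp) / max(len(ref), 1)

    # Example: 1 substitution + 1 deletion over 4 reference words -> 50% WER.
    print(wer("the cat sat down", "the dog sat"))   # 0.5
    # CER is the same computation applied to character sequences instead of words.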
Related work
  • Transfer learning ASR. To alleviate the need for labeled data, recent works utilize unsupervised pre-training and semi-supervised methods to exploit unlabeled data, e.g. wav2vec (Schneider et al. 2019), predictive coding (Chung and Glass 2020), self-training (Kahn, Lee, and Hannun 2020) and weak distillation (Li et al. 2019). But they still require substantial unlabeled data, which is unavailable for some minority languages. To solve this issue, transfer learning is explored by using other source languages to improve the performance of low-resource languages (Kunze et al. 2017), which requires that the source and target languages are similar and that the source language has sufficiently large data. Moreover, multilingual transfer learning ASR (Dalmia et al. 2018; Watanabe, Hori, and Hershey 2017; Toshniwal et al. 2018) is developed using different languages to learn language-independent representations for performance improvement under the low-resource setting.
  • Meta-learning ASR. Meta-learning approaches (Zhou et al. 2019, 2020) can meta-learn a model initialization from training tasks with fast adaptation ability to new tasks with only a few data samples and are thus suitable for low-resource learning problems. In particular, Hsu et al. (2020) and Winata et al. (2020) adopted MAML (Finn et al. 2017) for low-resource ASR and code-switched ASR and both achieved promising results. But these methods ignore task imbalance in real-world scenarios and utilize the meta-knowledge across all the languages equally, which leads to performance degradation. To alleviate quantity imbalance, Wang, Tsvetkov, and Neubig (2020) improve differentiable data selection by optimizing a scorer with the average loss from different languages to balance the usage of data in multilingual model training. Besides the language quantity, our AMS also considers the language difficulty and learns the sampling policy in an adversarial manner.
  • Adversarial learning ASR. Inspired by domain-adversarial training (Ganin et al. 2016), recent works introduced adversarial learning into ASR to learn robust features invariant to noise conditions (Shinohara 2016) and accents (Sun et al. 2018b). Besides, some researchers use a domain-adversarial classification objective over many languages in a multilingual ASR framework to force the shared layers to learn language-independent representations (Yi et al. 2018). Differently, our proposed method explores adversarial learning to solve the task-imbalance problem in multilingual meta-learning ASR and learns to adaptively sample the meta-training tasks for effectively training low-resource ASR models.
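For context, the domain-adversarial training mentioned above typically relies on a gradient reversal layer: the language (or domain) classifier is trained normally, while the shared encoder receives the negated gradient and is thereby pushed toward language-invariant features. A minimal PyTorch sketch with a toy encoder and classifier (placeholders, not any cited paper's model):

    # Gradient reversal layer in the style of Ganin et al. (2016).
    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lamb):
            ctx.lamb = lamb
            return x.view_as(x)            # identity in the forward pass

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lamb * grad_output, None   # negate the gradient flowing into the encoder

    encoder = nn.Sequential(nn.Linear(40, 64), nn.ReLU())   # toy shared acoustic encoder
    lang_classifier = nn.Linear(64, 8)                       # predicts which of 8 source languages
    features = torch.randn(16, 40)                           # a batch of acoustic features
    lang_labels = torch.randint(0, 8, (16,))

    shared = encoder(features)
    lang_logits = lang_classifier(GradReverse.apply(shared, 1.0))
    adv_loss = nn.functional.cross_entropy(lang_logits, lang_labels)
    adv_loss.backward()   # classifier gets the normal gradient; encoder gets the reversed one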
Funding
  • This work was supported in part by National Natural Science Foundation of China (NSFC) under Grant No. U19A2073 and No. 61976233, Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No. 2019B1515120039, Nature Science Foundation of Shenzhen under Grant No. 2019191361, Zhijiang Lab's Open Fund (No. 2020AA3AB14) and the CSIG Young Fellow Support Fund.
Study subjects and analysis
AMS on speech classification. Our speech classification datasets contain 5 source datasets and 5 target datasets provided by the AutoSpeech 2020 competition (InterSpeech 2020). The datasets come from different speech classification domains, including speaker identification and emotion classification, and vary in their numbers of classes and examples.

Reference
  • Adams, O.; Wiesner, M.; Watanabe, S.; and Yarowsky, D. 2019. Massively Multilingual Adversarial Speech Recognition. NAACL-HLT 96–108.
  • Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; and Sivic, J. 2016. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 5297–5307.
  • Chan, W.; Jaitly, N.; Le, Q. V.; and Vinyals, O. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4960–4964.
  • Chen, Y.-C.; Hsu, J.-Y.; Lee, C.-W.; and yi Lee, H. 2020. DARTS-ASR: Differentiable Architecture Search for Multilingual Speech Recognition and Adaptation. In INTERSPEECH.
  • Cho, J.; Baskar, M. K.; Li, R.; Wiesner, M.; Mallidi, S. H.; Yalta, N.; Karafiat, M.; Watanabe, S.; and Hori, T. 2018. Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling. 2018 IEEE Spoken Language Technology Workshop (SLT) 521– 527.
  • Chorowski, J.; Bahdanau, D.; Serdyuk, D.; Cho, K.; and Bengio, Y. 2015. Attention-Based Models for Speech Recognition. In NIPS.
  • Chung, Y.-A.; and Glass, J. 2020. Generative Pre-Training for Speech with Autoregressive Predictive Coding. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • Dalmia, S.; Sanabria, R.; Metze, F.; and Black, A. W. 2018. Sequence-Based Multi-Lingual Low Resource Speech Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4909–4913.
  • Dou, Z.-Y.; Yu, K.; and Anastasopoulos, A. 2019. Investigating Meta-Learning Algorithms for Low-Resource Natural Language Understanding Tasks. In EMNLP/IJCNLP.
  • Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML.
  • FSI. 2007. Language Learning Difficulty for English Speakers. https://en.wikibooks.org/wiki/Wikibooks:Language_Learning_Difficulty_for_English_Speakers.
  • Gales, M.; Knill, K.; Ragni, A.; and Rath, S. P. 2014. Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED. In SLTU.
  • Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. S. 2016. Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research vol. 17, no. 1, pp. 2096–2030.
  • Google. 2019. sentencepiece, Unsupervised text tokenizer for Neural Network-based text generation. https://github.com/google/sentencepiece.
  • Graves, A.; Fernandez, S.; Gomez, F. J.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML ’06.
  • Graves, A.; and Jaitly, N. 2014. Towards End-To-End Speech Recognition with Recurrent Neural Networks. In ICML.
  • Graves, A.; Jaitly, N.; and rahman Mohamed, A. 2013. Hybrid speech recognition with Deep Bidirectional LSTM. 2013 IEEE Workshop on Automatic Speech Recognition and Understanding 273–278.
  • Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9: 1735–1780.
  • Hori, T.; Watanabe, S.; Zhang, Y. L.; and Chan, W. 2017. Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM. In INTERSPEECH.
  • Hsu, J.-Y.; Chen, Y.-J.; and yi Lee, H. 2020. Meta Learning for End-to-End Low-Resource Speech Recognition. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • Hu, K.; Bruguier, A.; Sainath, T. N.; Prabhavalkar, R.; and Pundak, G. 2019. Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models. In Proc. Interspeech 2019, 2155–2159.
  • InterSpeech. 2020. AutoSpeech 2020 Challenge. https://www.automl.ai/competitions/2.
  • Kahn, J.; Lee, A.; and Hannun, A. 2020. Self-Training for End-to-End Speech Recognition. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • Kim, S.; Hori, T.; and Watanabe, S. 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4835–4839.
  • Kudo, T. 2018. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. In ACL.
  • Kunze, J.; Kirsch, L.; Kurenkov, I.; Krug, A.; Johannsmeier, J.; and Stober, S. 2017. Transfer Learning for Speech Recognition on a Budget. In Rep4NLP,ACL.
  • Li, B.; Sainath, T. N.; Pang, R.; and Wu, Z. 2019. Semisupervised Training for End-to-end Models via Weak Distillation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2837–2841.
  • Mozilla.org. 2019. Common Voice. https://voice.mozilla.org/en.
  • Nichol, A.; Achiam, J.; and Schulman, J. 2018. On First-Order Meta-Learning Algorithms. ArXiv abs/1803.02999.
  • Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In ACL.
  • Post, M. 2018. A Call for Clarity in Reporting BLEU Scores. In WMT.
  • Pratap, V.; Hannun, A.; Xu, Q.; Cai, J.; Kahn, J.; Synnaeve, G.; Liptchinsky, V.; and Collobert, R. 2019. Wav2Letter++: A Fast Open-source Speech Recognition System. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 6460–6464.
  • Sandler, M.; Howard, A. G.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 4510–4520.
  • Schneider, S.; Baevski, A.; Collobert, R.; and Auli, M. 2019. wav2vec: Unsupervised Pre-training for Speech Recognition. In INTERSPEECH 2019.
  • Senel, L. K.; Utlu, I.; Yucesoy, V.; Koc, A.; and Cukur, T. 2018. Generating Semantic Similarity Atlas for Natural Languages. 2018 IEEE Spoken Language Technology Workshop (SLT) 795–799.
  • Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 1715–1725.
  • Shinohara, Y. 2016. Adversarial Multi-Task Learning of Deep Neural Networks for Robust Speech Recognition. In INTERSPEECH.
  • Simonyan, K.; and Zisserman, A. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556.
  • Sun, Q.; Liu, Y.; Chua, T.-S.; and Schiele, B. 2018a. Meta-Transfer Learning for Few-Shot Learning. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 403–412.
  • Sun, S.; Yeh, C.-F.; Hwang, M.-Y.; Ostendorf, M.; and Xie, L. 2018b. Domain Adversarial Training for Accented Speech Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4854–4858.
  • Tong, S.; Garner, P. N.; and Bourlard, H. 2017. Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model. ArXiv abs/1711.10025.
  • Toshniwal, S.; Sainath, T. N.; Weiss, R. J.; Li, B.; Moreno, P. J.; Weinstein, E.; and Rao, K. 2018. Multilingual Speech Recognition with a Single End-to-End Model. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4904–4908.
  • Waibel, A.; Soltau, H.; Schultz, T.; Schaaf, T.; and Metze, F. 2000. Multilingual Speech Recognition, 33–45. Springer Berlin Heidelberg.
  • Wang, C.; Pino, J.; Wu, A.; and Gu, J. 2020. CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus. ArXiv abs/2002.01320.
  • Wang, X.; Tsvetkov, Y.; and Neubig, G. 2020. Balancing Training for Multilingual Neural Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8526–8537.
  • Watanabe, S.; Hori, T.; and Hershey, J. R. 2017. Language independent end-to-end architecture for joint language identification and speech recognition. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 265–271.
  • Williams, R. J. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning 8: 229–256.
  • Winata, G. I.; Cahyawijaya, S.; Lin, Z.; Liu, Z.; Xu, P.; and Fung, P. 2020. Meta-Transfer Learning for Code-Switched Speech Recognition. In ACL.
  • Yi, J.; Tao, J.; Wen, Z.; and Bai, Y. 2018. Adversarial Multilingual Training for Low-Resource Speech Recognition. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4899–4903.
  • Zhou, P.; Yuan, X.; Xu, H.; Yan, S.; and Feng, J. 2019. Efficient Meta Learning via Minibatch Proximal Update. In NeurIPS.
  • Zhou, P.; Zou, Y.; Yuan, X.; Feng, J.; Xiong, C.; and Hoi, S. C. 2020. Task Similarity Aware Meta Learning: Theory-inspired Improvement on MAML. In 4th Workshop on Meta-Learning at NeurIPS.