Sample Efficient Adaptive Text-to-Speech.

International Conference on Learning Representations (ICLR), 2019


Abstract

We present a meta-learning approach for adaptive text-to-speech (TTS) with few data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights, which is then deployed as a TTS system. …

Introduction
  • Training a large model with lots of data and subsequently deploying this model to carry out classification or regression is an important and common methodology in machine learning
  • It has been successful in speech recognition (Hinton et al, 2012), machine translation (Wu et al, 2016) and image recognition (Krizhevsky et al, 2012; Szegedy et al, 2015).
  • The output of training is no longer a fixed model, but rather a fast learner.
Highlights
  • Training a large model with lots of data and subsequently deploying this model to carry out classification or regression is an important and common methodology in machine learning
  • In this paper we describe a new WaveNet training procedure that facilitates adaptation to new speakers, allowing the synthesis of new voices from no more than 10 minutes of data with high sample quality
  • When fine-tuning by first estimating the speaker embedding and subsequently fine-tuning the entire model, we achieve state-of-the-art results in terms of sample naturalness and voice similarity to target speakers (see the two-stage adaptation sketch after this list).
  • These results are robust across speech datasets recorded under different conditions, and we demonstrate that the generated samples are capable of confusing the state-of-the-art text-independent speaker verification system (Wan et al, 2018).
  • Our attempts at few-shot adaptation involved the attention models of Reed et al (2018) and model-agnostic meta-learning (MAML) (Finn et al, 2017a), but we found that both of these strategies failed to learn informative speaker embeddings in our preliminary experiments.
  • When adapted with a few minutes of data, our model matches the state-of-the-art performance in sample naturalness
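The highlights above contrast adapting only the speaker embedding with subsequently fine-tuning the entire model. Below is a minimal, hedged sketch of that two-stage recipe on a toy conditional model; `ToyConditionalModel`, its `loss` method, and all hyperparameters are illustrative stand-ins, not the authors' WaveNet implementation.

```python
import torch

class ToyConditionalModel(torch.nn.Module):
    """Stand-in for the shared conditional WaveNet core: predicts the next sample from
    the previous sample plus a speaker embedding (purely illustrative, not WaveNet)."""
    def __init__(self, emb_dim=200):
        super().__init__()
        self.net = torch.nn.Linear(1 + emb_dim, 1)

    def loss(self, audio, speaker_embedding):
        # audio: 1-D tensor of samples; condition each step on the previous sample + embedding.
        prev, target = audio[:-1, None], audio[1:, None]
        cond = speaker_embedding.expand(prev.shape[0], -1)
        pred = self.net(torch.cat([prev, cond], dim=-1))
        return torch.nn.functional.mse_loss(pred, target)

def adapt(model, emb, batches, fine_tune_core, lr, steps=100):
    """fine_tune_core=False: SEA-EMB-style adaptation (embedding only, shared core frozen).
    fine_tune_core=True:  SEA-ALL-style adaptation (embedding plus all model weights)."""
    for p in model.parameters():
        p.requires_grad_(fine_tune_core)
    emb = emb.detach().clone().requires_grad_(True)
    params = [emb] + (list(model.parameters()) if fine_tune_core else [])
    opt = torch.optim.Adam(params, lr=lr)
    for _, audio in zip(range(steps), batches):
        opt.zero_grad()
        loss = model.loss(audio, emb)
        loss.backward()
        opt.step()
    return model, emb

# Two-stage recipe from the highlights: estimate the embedding first, then fine-tune everything.
model = ToyConditionalModel()
emb = torch.zeros(1, 200)
batches = [torch.randn(16000) for _ in range(8)]      # stand-in for a few minutes of audio
model, emb = adapt(model, emb, batches, fine_tune_core=False, lr=1e-2)   # SEA-EMB stage
model, emb = adapt(model, emb, batches, fine_tune_core=True,  lr=1e-4)   # SEA-ALL stage
```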
Methods
  • The authors train a WaveNet model for each of the three methods using the same dataset, which combines the high-quality LibriSpeech audiobook corpus (Panayotov et al, 2015) and a proprietary speech corpus.
  • The multi-speaker WaveNet model has the same architecture as van den Oord et al (2016), except that the authors use a 200-dimensional speaker embedding space to model the large diversity of voices; a minimal sketch of this embedding table follows this list.
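As a rough illustration of the shared-core-plus-embedding setup above (a sketch only: the speaker counts follow the text, but the use of `torch.nn.Embedding` and the batch below are illustrative, and the WaveNet core itself is not shown):

```python
import torch

NUM_SPEAKERS = 2302 + 10   # LibriSpeech train speakers plus the proprietary corpus speakers
EMB_DIM = 200              # dimensionality of the learned speaker embedding space

# One learned row per training speaker, trained jointly with the shared WaveNet core
# and used as the global conditioning input of the decoder.
speaker_table = torch.nn.Embedding(NUM_SPEAKERS, EMB_DIM)

speaker_ids = torch.tensor([0, 17, 2311])   # an illustrative mini-batch of speaker indices
cond = speaker_table(speaker_ids)           # conditioning vectors of shape [3, 200]
print(cond.shape)
```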
Results
  • The authors evaluate the quality of samples of SEA-ALL, SEA-EMB and SEA-ENC.
  • The authors evaluate the similarity of generated and real samples subjectively using the MOS test and objectively using a speaker verification system (Wan et al, 2018); an illustrative equal-error-rate computation appears after this list.
  • The authors study these results while varying the size of the adaptation dataset.
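The objective similarity evaluation reports the equal error rate (EER) of a speaker verification system (see also Table 3). Below is a small illustrative EER computation over precomputed verification scores; the score arrays are synthetic and the thresholding scheme is a common convention, not necessarily the exact protocol of Wan et al (2018).

```python
import numpy as np

def equal_error_rate(scores_same, scores_diff):
    """EER: the error rate at the threshold where false accepts and false rejects are equal.

    scores_same: verification scores for pairs from the same (target) speaker.
    scores_diff: verification scores for impostor pairs (different speakers).
    """
    thresholds = np.sort(np.concatenate([scores_same, scores_diff]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(scores_diff >= t)   # false accept rate
        frr = np.mean(scores_same < t)    # false reject rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example: well-separated score distributions give a low EER.
rng = np.random.default_rng(0)
same = rng.normal(0.8, 0.1, size=200)
diff = rng.normal(0.2, 0.1, size=200)
print(equal_error_rate(same, diff))
```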
Conclusion
  • This paper studied three variants of meta-learning for sample efficient adaptive TTS.
  • When adapted with a few minutes of data, the model matches the state-of-the-art performance in sample naturalness.
  • It outperforms other recent works in matching the new speaker’s voice.
  • The authors' paper considers the adaptation to new voices with clean, high-quality training data collected in a controlled environment.
  • The few-shot learning of voices with noisy data is beyond the scope of this paper and remains a challenging open research problem.
Tables
  • Table 1: Naturalness of the adapted voices using a 5-scale MOS score (higher is better) with 95% confidence intervals on the LibriSpeech and VCTK held-out adaptation datasets. Numbers in bold are the best few-shot learning results on each dataset without a statistically significant difference. van den Oord et al (2016) was trained with 24 hours of production-quality data, Nachmani et al (2018) used all samples of each new speaker, Arik et al (2018) used 10 samples, and Jia et al (2018) used 5 seconds.
  • Table 2: Voice similarity of generated voices using a 5-scale MOS score (higher is better) with 95% confidence intervals on the LibriSpeech and VCTK held-out adaptation datasets.
  • Table 3: Equal error rate (EER) of real and few-shot adapted voice samples, for evaluating voice similarity. Varying adaptation dataset sizes were considered.
Related Work
  • Few-shot learning, in which models are built rapidly from only a small amount of available data, is one of the most important open challenges in machine learning. Recent studies have attempted to address the problem of few-shot learning using deep neural networks, and they have shown promising results on classification tasks in vision (Santoro et al, 2016; Shyam et al, 2017) and language (Vinyals et al, 2016). Few-shot learning can also be leveraged in reinforcement learning, for example by imitating human Atari gameplay from a single recorded action sequence (Pohlen et al, 2018) or from online video (Aytar et al, 2018).

    Meta-learning offers a sound framework for addressing few-shot learning. Here, an expensive learning process results in machines with the ability to learn rapidly from few data. Meta-learning has a long history (Harlow, 1949; Thrun and Pratt, 2012), and recent studies include efforts to learn optimization processes (Andrychowicz et al, 2016; Chen et al, 2017) that have been shown to extend naturally to the few-shot setting (Ravi and Larochelle, 2016). An alternative approach is model-agnostic meta learning (MAML) (Finn et al, 2017a), which differs by using a fixed optimizer and learning a set of base parameters that can be adapted to minimize any task loss by few steps of gradient descent. This method has shown promise in robotics (Finn et al, 2017b; Yu et al, 2018).
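For context on the MAML approach described above (which, per the highlights, did not yield informative speaker embeddings in the authors' preliminary experiments), here is a minimal sketch of the MAML idea on a toy 1-D regression task family; the task distribution, model, and hyperparameters are all illustrative.

```python
import random
import torch

def sample_task(rng):
    # A task is "regress y = a * x" for a hidden per-task slope a; draw() yields (x, y) batches.
    a = rng.uniform(-2.0, 2.0)
    def draw(n=20):
        x = torch.randn(n)
        return x, a * x
    return draw

w = torch.zeros(1, requires_grad=True)      # meta-learned base parameter
meta_opt = torch.optim.SGD([w], lr=0.05)
rng = random.Random(0)
inner_lr, inner_steps = 0.1, 3

for meta_step in range(200):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(8):                       # a meta-batch of tasks
        draw = sample_task(rng)
        w_task = w
        for _ in range(inner_steps):         # a few steps of gradient descent per task
            x_s, y_s = draw()
            loss = ((w_task * x_s - y_s) ** 2).mean()
            (g,) = torch.autograd.grad(loss, w_task, create_graph=True)
            w_task = w_task - inner_lr * g
        x_q, y_q = draw()                    # held-out query data from the same task
        meta_loss = meta_loss + ((w_task * x_q - y_q) ** 2).mean()
    (meta_loss / 8).backward()               # backpropagate through the inner-loop updates
    meta_opt.step()

print(w.item())  # w is the meta-learned initialization; per-task adaptation starts from it
```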
Funding
  • When fine-tuning by first estimating the speaker embedding and subsequently fine-tuning the entire model, we achieve state-of-the-art results in terms of sample naturalness and voice similarity to target speakers
  • We also initialize e with the optimal value from the SEA-EMB method, and we find this initialization significantly improves the generalization performance even with a few seconds of adaptation data
Study subjects and analysis
speakers: 2302
We train a WaveNet model for each of our three methods using the same dataset, which combines the high-quality LibriSpeech audiobook corpus (Panayotov et al, 2015) and a proprietary speech corpus. The LibriSpeech dataset consists of 2302 speakers from the train speaker subsets and approximately 500 hours of utterances, sampled at a frequency of 16 kHz. The proprietary speech corpus consists of 10 American English speakers and approximately 300 hours of utterances, and we down-sample the recording frequency to 16 kHz to match LibriSpeech

American English speakers: 10
The LibriSpeech dataset consists of 2302 speakers from the train speaker subsets and approximately 500 hours of utterances, sampled at a frequency of 16 kHz. The proprietary speech corpus consists of 10 American English speakers and approximately 300 hours of utterances, and we down-sample the recording frequency to 16 kHz to match LibriSpeech. The multi-speaker WaveNet model has the same architecture as van den Oord et al (2016) except that we use a 200-dimensional speaker embedding space to model the large diversity of voices.
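The proprietary recordings are down-sampled to 16 kHz to match LibriSpeech. As an illustrative preprocessing step (the original sample rate of the proprietary corpus is not stated; 48 kHz is assumed here, and SciPy's polyphase resampler is just one way to do this):

```python
import numpy as np
from scipy.signal import resample_poly

def to_16khz(waveform, source_rate):
    """Resample a mono waveform to 16 kHz via rational polyphase resampling."""
    gcd = np.gcd(source_rate, 16000)
    return resample_poly(waveform, up=16000 // gcd, down=source_rate // gcd)

# Example: one second of a 440 Hz tone at an assumed 48 kHz source rate becomes 16,000 samples.
t = np.linspace(0, 1, 48000, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)
y = to_16khz(x, source_rate=48000)
print(len(y))   # 16000
```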

[Flattened fragment of Table 1 (MOS naturalness): real utterances and the baselines of van den Oord et al (2016), Nachmani et al (2018), Arik et al (2018) and Jia et al (2018) versus SEA-ALL, SEA-EMB and SEA-ENC on LibriSpeech and VCTK; see the Table 1 caption above.]


speakers: 39
Our few-shot model performance is evaluated using two hold-out datasets. First, the LibriSpeech test corpus consists of 39 speakers, with an average of approximately 52 utterances and 5 minutes of audio per speaker. For every test speaker, we randomly split their demonstration utterances into an adaptation set for adapting our WaveNet models and a test set for evaluation

American English speakers: 21
There are about 4.2 utterances on average per speaker in the test set and the rest in the adaptation set. Second, we consider a subset of the CSTR VCTK corpus (Veaux et al, 2017) consisting of 21 American English speakers, with approximately 368 utterances and 12 minutes of audio per speaker. We also apply an adaptation/test split, holding out 10 utterances per speaker for testing.
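Each held-out speaker's utterances are split into an adaptation set and a test set. A small illustrative sketch of such a per-speaker split (file names, seed, and helper are hypothetical; the held-out sizes follow the text: 10 utterances for VCTK, roughly 4.2 on average for LibriSpeech):

```python
import random

def split_speaker(utterances, num_test, seed=0):
    """Randomly hold out `num_test` utterances per speaker for evaluation;
    the rest form the adaptation set used to fit the embedding / fine-tune the model."""
    utterances = list(utterances)
    random.Random(seed).shuffle(utterances)
    return utterances[num_test:], utterances[:num_test]   # (adaptation set, test set)

# e.g. a VCTK-style speaker with ~368 utterances and 10 test utterances held out
adapt_set, test_set = split_speaker([f"utt_{i:04d}.wav" for i in range(368)], num_test=10)
print(len(adapt_set), len(test_set))   # 358 10
```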

samples: 10
Among the baselines in Table 1, van den Oord et al (2016) was trained with 24 hours of production-quality data, Nachmani et al (2018) used all samples of each new speaker, Arik et al (2018) used 10 samples, and Jia et al (2018) used 5 seconds; Table 2 reports voice-similarity MOS on the same LibriSpeech and VCTK held-out adaptation sets.

References
  • G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information processing Systems, pages 1097–1105, 2012.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In Computer Vision and Pattern Recognition, 2015.
  • T. Dutoit. An Introduction to Text-to-speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997. ISBN 0-7923-4498-7.
  • P. Taylor. Text-to-Speech Synthesis. Cambridge University Press, New York, NY, USA, 1st edition, 2009. ISBN 0521899273, 9780521899277.
  • A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
  • L. Wan, Q. Wang, A. Papir, and I. L. Moreno. Generalized end-to-end loss for speaker verification. In International Conference on Acoustics, Speech, and Signal Processing, pages 4879–4883. IEEE, 2018.
  • H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In INTERSPEECH, pages 2273–2277, 2016.
  • L. Li, Y. Chen, Y. Shi, Z. Tang, and D. Wang. Deep speaker feature learning for text-independent speaker verification. In INTERSPEECH, pages 1542–1546, 2017.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.
  • P. Shyam, S. Gupta, and A. Dukkipati. Attentive recurrent comparators. In International Conference on Machine Learning, pages 3173–3181, 2017.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  • T. Pohlen, B. Piot, T. Hester, M. G. Azar, D. Horgan, D. Budden, G. Barth-Maron, H. van Hasselt, J. Quan, M. Vecerík, et al. Observe and look further: Achieving consistent performance on atari. arXiv preprint arXiv:1805.11593, 2018.
  • Y. Aytar, T. Pfaff, D. Budden, T. L. Paine, Z. Wang, and N. de Freitas. Playing hard exploration games by watching youtube. arXiv preprint arXiv:1805.11592, 2018.
  • H. F. Harlow. The formation of learning sets. Psychological review, 56(1):51, 1949.
  • S. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 2012.
  • M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
  • Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. Freitas. Learning to learn without gradient descent by gradient descent. In International Conference on Machine Learning, pages 748–756, 2017.
  • S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. International Conference on Learning Representations, 2016.
  • C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017a.
  • C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, pages 357–368, 2017b.
  • T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. In International Conference on Learning Representations Workshop, 2018.
  • S. Bartunov and D. P. Vetrov. Fast adaptation in generative models with generative matching networks. In International Conference on Learning Representations Workshop, 2017.
  • J. Bornschein, A. Mnih, D. Zoran, and D. J. Rezende. Variational memory addressing in generative models. In Advances in Neural Information Processing Systems, pages 3923–3932, 2017.
  • D. J. Rezende, S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra. One-shot generalization in deep generative models. In International Conference on Machine Learning, pages 1521–1529, 2016.
  • K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning, pages 1462–1471, 2015.
  • S. Reed, Y. Chen, T. Paine, A. van den Oord, S. M. Eslami, D. Rezende, O. Vinyals, and N. de Freitas. Few-shot autoregressive density estimation: Towards learning to learn distributions. In International Conference on Learning Representations, 2018.
  • A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756, 2016.
  • J. Veness, T. Lattimore, A. Bhoopchand, A. Grabska-Barwinska, C. Mattern, and P. Toth. Online learning with gated linear networks. arXiv preprint arXiv:1712.01897, 2017.
  • R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. arXiv preprint arXiv:1803.09047, 2018.
  • Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous. Tacotron: Towards end-to-end speech synthesis. In INTERSPEECH, pages 4006–4010, 2017.
  • A. Gibiansky, S. Arik, G. Diamos, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou. Deep voice 2: Multispeaker neural text-to-speech. In Advances in Neural Information Processing Systems, pages 2962–2970, 2017.
  • S. Ö. Arık, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, et al. Deep voice: Real-time neural text-to-speech. In International Conference on Machine Learning, pages 195–204, 2017.
  • W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller. Deep voice 3: 2000-speaker neural text-to-speech. In International Conference on Learning Representations, 2018.
  • J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio. Char2wav: End-to-end speech synthesis. In International Conference on Learning Representations Workshop, 2017.
  • Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani. Voiceloop: Voice fitting and synthesis via a phonological loop. In International Conference on Learning Representations, 2018.
  • M. Morise, F. Yokomori, and K. Ozawa. World: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE transactions on Information and Systems, 99(7):1877–1884, 2016.
  • E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf. Fitting new speakers based on a short untranscribed sample. arXiv preprint arXiv:1802.06984, 2018.
  • Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. arXiv preprint arXiv:1806.04558, 2018.
  • S. O. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou. Neural voice cloning with a few samples. arXiv preprint arXiv:1802.06006, 2018.
  • V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: an asr corpus based on public domain audio books. In International Conference on Acoustics, Speech and Signal Processing, pages 5206–5210. IEEE, 2015.
  • C. Veaux, J. Yamagishi, K. MacDonald, et al. CSTR VCTK corpus: English multi-speaker corpus for CSTR Voice Cloning Toolkit, 2017.
Encoding network
Our encoding network is illustrated as the summation of two sub-network outputs in Figure 7. The first sub-network is a pre-trained speaker verification model (TI-SV) (Wan et al., 2018), comprising 3 LSTM layers and a single linear layer. This model maps a waveform sequence of arbitrary length to a fixed 256-dimensional d-vector with a sliding window, and is trained on approximately 36M utterances from 18K speakers extracted from anonymized voice search logs. On top of this we add a shallow MLP to project the output d-vector to the speaker embedding space. The second sub-network comprises 16 1-D convolutional layers. This network reduces the temporal resolution to 256 ms per frame (for 16 kHz audio), then averages across time and projects into the speaker embedding space. The purpose of this network is to extract residual speaker information present in the demonstration waveforms but not captured by the pre-trained TI-SV model.
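A hedged PyTorch-style sketch of the encoding network described above. The pre-trained TI-SV model is replaced by a placeholder (the real one operates with a sliding window over the waveform and is trained on ~36M utterances), and widths not stated in the text (input feature size, MLP hidden size, conv channel counts, strides) are assumptions; only the stated facts (3 LSTM layers plus a linear layer, a 256-dim d-vector, a shallow MLP, 16 1-D conv layers reducing resolution to 256 ms per frame at 16 kHz, time averaging, and summation of the two sub-network outputs) come from the description.

```python
import torch
import torch.nn as nn

EMB_DIM = 200   # speaker embedding dimensionality used by the TTS model

class PretrainedDVector(nn.Module):
    """Placeholder for the pre-trained TI-SV speaker verification model (Wan et al., 2018):
    3 LSTM layers plus a linear layer producing a fixed 256-dim d-vector. The 40-dim frame
    features and last-frame readout are assumptions for this sketch."""
    def __init__(self, feat_dim=40, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, 256)

    def forward(self, frames):                     # frames: [B, T, feat_dim]
        out, _ = self.lstm(frames)
        return self.proj(out[:, -1])               # [B, 256] d-vector

class SpeakerEncoder(nn.Module):
    """Sum of the two sub-networks, following the description above."""
    def __init__(self):
        super().__init__()
        self.dvector = PretrainedDVector()
        self.dvector_mlp = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, EMB_DIM))
        convs, ch_in = [], 1
        for i in range(16):
            # Stride 2 in the first 12 layers gives 2**12 = 4096 samples per frame,
            # i.e. 256 ms at 16 kHz (an assumed way to reach the stated resolution).
            stride = 2 if i < 12 else 1
            convs += [nn.Conv1d(ch_in, 64, kernel_size=3, stride=stride, padding=1), nn.ReLU()]
            ch_in = 64
        self.convs = nn.Sequential(*convs)
        self.conv_proj = nn.Linear(64, EMB_DIM)

    def forward(self, waveform, frames):
        # waveform: [B, 1, num_samples] raw 16 kHz audio; frames: [B, T, 40] features for TI-SV
        e1 = self.dvector_mlp(self.dvector(frames))            # d-vector branch -> [B, EMB_DIM]
        h = self.convs(waveform)                               # [B, 64, num_frames]
        e2 = self.conv_proj(h.mean(dim=-1))                    # average over time, then project
        return e1 + e2                                         # predicted speaker embedding

# Example: one 2-second 16 kHz demonstration utterance with 100 feature frames.
enc = SpeakerEncoder()
emb = enc(torch.randn(1, 1, 32000), torch.randn(1, 100, 40))
print(emb.shape)                                               # torch.Size([1, 200])
```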