AdaSpeech: Adaptive Text to Speech for Custom Voice

ICLR (2021)

We propose AdaSpeech, an adaptive TTS system for high-quality and efficient adaptation to new speakers in custom voice.

Abstract

Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize a personal voice for a target speaker using a small amount of speech from him/her. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse...

Introduction
Highlights
  • Text to speech (TTS) aims to synthesize natural and intelligible voice from text and attracts a lot of interest in the machine learning community (Arik et al, 2017; Wang et al, 2017; Gibiansky et al, 2017; Ping et al, 2018; Shen et al, 2018; Ren et al, 2019)
  • To better trade off adaptation parameters against voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to the speaker embedding during adaptation (see the sketch after this list)
  • We compare AdaSpeech with several settings: 1) GT, the ground-truth recordings; 2) GT mel + Vocoder, using the ground-truth mel-spectrogram to synthesize the waveform with the MelGAN vocoder; 3) Baseline (spk emb), a baseline system based on FastSpeech 2 which only fine-tunes the speaker embedding during adaptation and can be regarded as our lower bound; 4) Baseline (decoder), another baseline system based on FastSpeech 2 which fine-tunes the whole decoder during adaptation and can be regarded as a strong comparison since it uses more parameters during adaptation; 5) AdaSpeech, our proposed system with utterance-/phoneme-level acoustic condition modeling and conditional layer normalization during adaptation
  • We have developed AdaSpeech, an adaptive TTS system to support the distinctive requirements of custom voice
  • We propose acoustic condition modeling to make the source TTS model more adaptable to custom voices with various acoustic conditions
  • Experimental results demonstrate that AdaSpeech can support custom voice with different acoustic conditions with little memory storage while maintaining high voice quality
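
Below is a minimal sketch, assuming a PyTorch-style decoder, of how conditional layer normalization could be implemented: the scale and bias of layer normalization are generated from the speaker embedding by two small linear layers, so adaptation only needs to fine-tune these projections plus the speaker embedding. Module and parameter names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """Layer norm whose scale (gamma) and bias (beta) are predicted from a
    speaker embedding, so only these small projections need fine-tuning."""
    def __init__(self, hidden_size: int, speaker_dim: int):
        super().__init__()
        self.hidden_size = hidden_size
        self.to_scale = nn.Linear(speaker_dim, hidden_size)  # predicts gamma
        self.to_bias = nn.Linear(speaker_dim, hidden_size)   # predicts beta

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden), speaker_emb: (batch, speaker_dim)
        x = nn.functional.layer_norm(x, (self.hidden_size,))  # no learnable affine
        scale = self.to_scale(speaker_emb).unsqueeze(1)       # (batch, 1, hidden)
        bias = self.to_bias(speaker_emb).unsqueeze(1)
        return x * scale + bias

def select_adaptation_params(model: nn.Module) -> None:
    """Illustrative: freeze everything except the speaker embedding and the
    conditional-layer-norm projections before adaptation."""
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in ("speaker_emb", "to_scale", "to_bias"))
```

Because the per-speaker scale and bias vectors can be precomputed from the speaker embedding once adaptation is done, the per-voice storage at inference can be even smaller than the number of fine-tuned parameters (the practice referred to in Section 2.3 and Table 1).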
Methods
  • METHOD ANALYSIS
  • The authors first conduct ablation studies to verify the effectiveness of each component in AdaSpeech, including utterance-level and phoneme-level acoustic condition modeling and conditional layer normalization, and then conduct more detailed analyses of the proposed method (see the sketch after this list).
  • The ablation settings remove one component at a time (AdaSpeech w/o UL-ACM, w/o PL-ACM, w/o CLN) and are compared against the full model in terms of CMOS.
  • Analyses on acoustic condition modeling: the authors plot the training and validation loss curves of the AdaSpeech source model on the LibriTTS dataset in Figure 4a.
  • [Figure axes: mel loss (MAE); first/second principal component; Mean Opinion Score (MOS) vs. number of adaptation samples]
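
As a rough illustration of utterance-level and phoneme-level acoustic condition modeling (UL-ACM / PL-ACM), the sketch below pools a reference mel-spectrogram into one utterance-level vector and pools phoneme-aligned mel frames into per-phoneme vectors. Layer sizes, kernel sizes, and the small phoneme-vector dimension are assumptions; in AdaSpeech these vectors condition the phoneme hidden sequence, and a separate predictor estimates the phoneme-level vectors from phoneme hidden states at inference, which is not shown here.

```python
import torch
import torch.nn as nn

class UtteranceLevelEncoder(nn.Module):
    """Summarizes a reference mel-spectrogram into one vector that captures
    utterance-level acoustic conditions (recording environment, overall prosody)."""
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, hidden)
        return self.conv(mel).mean(dim=-1)  # average-pool over time

class PhonemeLevelEncoder(nn.Module):
    """Encodes phoneme-aligned mel frames into one small vector per phoneme,
    capturing fine-grained acoustic conditions of the target speech."""
    def __init__(self, n_mels: int = 80, out_dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(n_mels, out_dim)

    def forward(self, mel: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        # mel: (n_mels, frames); durations: per-phoneme frame counts summing to `frames`.
        chunks = torch.split(mel, durations.tolist(), dim=-1)
        pooled = torch.stack([c.mean(dim=-1) for c in chunks])  # (n_phonemes, n_mels)
        return self.proj(pooled)                                # (n_phonemes, out_dim)
```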
Results
  • The authors first evaluate the quality of the adaptation voices of AdaSpeech, then conduct an ablation study to verify the effectiveness of each component in AdaSpeech, and finally present some analyses of the method.

    4.1 THE QUALITY OF ADAPTATION VOICE

  • The authors evaluate the quality of the adaptation voices in terms of naturalness and similarity.
  • The authors conduct human evaluations with MOS for naturalness and SMOS for similarity (see the sketch after this list).
  • For VCTK and LibriTTS, the authors average the MOS and SMOS scores of multiple adapted speakers as the final scores.
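
MOS and SMOS are 1-to-5 listener ratings reported with 95% confidence intervals (Table 1). The snippet below is a small illustrative computation, not taken from the paper, of a mean opinion score with a normal-approximation confidence interval, and of averaging per-speaker scores into a final score as described above for VCTK and LibriTTS; the ratings are made-up numbers.

```python
import numpy as np

def mos_with_ci(ratings, z: float = 1.96):
    """Mean opinion score and half-width of a normal-approximation 95% CI
    from a flat list of 1-5 listener scores."""
    r = np.asarray(ratings, dtype=float)
    return r.mean(), z * r.std(ddof=1) / np.sqrt(len(r))

# Hypothetical ratings for two adapted speakers; the final score averages per-speaker MOS.
per_speaker = [[4.1, 4.3, 3.9, 4.0], [3.8, 4.0, 4.2, 4.1]]
final_mos = float(np.mean([mos_with_ci(s)[0] for s in per_speaker]))
```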
Conclusion
  • The authors have developed AdaSpeech, an adaptive TTS system to support the distinctive requirements of custom voice.
  • The authors propose acoustic condition modeling to make the source TTS model more adaptable to custom voices with various acoustic conditions.
  • The authors further design conditional layer normalization to improve adaptation efficiency: fine-tuning only a small number of model parameters while achieving high voice quality.
  • The authors will further improve the modeling of acoustic conditions in the source TTS model and study more diverse acoustic conditions, such as noisy speech, in custom voice.
Tables
  • Table 1: The MOS and SMOS scores with 95% confidence intervals when adapting the source AdaSpeech model (trained on LibriTTS) to the LJSpeech, VCTK and LibriTTS datasets. The third column shows the number of additional parameters for each custom voice during adaptation (the number in brackets shows the number of parameters used at inference, following the practice in Section 2.3); see the parameter-counting sketch after this list
  • Table 2: The CMOS of the ablation study on VCTK. UL-ACM and PL-ACM represent utterance-level and phoneme-level acoustic condition modeling, and CLN represents conditional layer normalization. Ablation study: We compare the CMOS (comparison MOS) of the adaptation voice quality when removing each component in AdaSpeech on the VCTK test set (each sentence is rated by 20 judges). Specifically, when removing conditional layer normalization, we only fine-tune the speaker embedding. From Table 2, we can see that removing utterance-level and phoneme-level acoustic condition modeling, and conditional layer normalization, all result in a performance drop in voice quality, demonstrating the effectiveness of each component in AdaSpeech
  • Table 3: The CMOS on VCTK when the baseline fine-tunes other (similar or even larger amounts of) parameters in the decoder. The CMOS evaluations are shown in Table 3
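
For intuition about the "additional parameters per custom voice" column in Table 1, here is a hypothetical sketch of counting the fine-tuned parameters versus the per-voice storage needed at inference, assuming adaptation touches only the speaker embedding and the conditional-layer-norm projections as in the earlier sketch; the function names and the formula are illustrative, not the paper's exact accounting.

```python
import torch.nn as nn

def count_adaptation_params(model: nn.Module) -> int:
    """Parameters fine-tuned per custom voice: whatever is left trainable
    after freezing the rest of the source model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def count_inference_params(speaker_dim: int, hidden: int, n_cln_layers: int) -> int:
    """Illustrative per-voice storage at inference: the per-speaker scale and bias
    vectors (2 * hidden values per conditional-layer-norm layer) can be precomputed
    from the speaker embedding, so the projections themselves need not be stored."""
    return n_cln_layers * 2 * hidden + speaker_dim
```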
Study subjects and analysis
speakers: 2456
3 EXPERIMENTAL SETUP. Datasets: We train the AdaSpeech source model on the LibriTTS (Zen et al, 2019) dataset, a multi-speaker corpus (2456 speakers) derived from LibriSpeech (Panayotov et al, 2015) that contains 586 hours of speech data.

speakers: 108
In order to evaluate AdaSpeech in the custom voice scenario, we adapt the source model to voices in other datasets, including VCTK (Veaux et al, 2016) (a multi-speaker dataset with 108 speakers and 44 hours of speech data) and LJSpeech (Ito, 2017) (a single-speaker, high-quality dataset with 24 hours of speech data), both of which have acoustic conditions different from LibriTTS. As a comparison, we also adapt the source model to voices in the same LibriTTS dataset.

adaptation datasets: 3
This also confirms the challenge of modeling different acoustic conditions in custom voice scenarios. 2) Compared with only fine-tuning the speaker embedding, i.e., Baseline (spk emb), AdaSpeech achieves significant improvements in terms of both MOS and SMOS on the three adaptation datasets, while leveraging only slightly more parameters in conditional layer normalization. We also analyze in the next subsection (Table 3) that even if we increase the adaptation parameters of the baseline to match or surpass those in AdaSpeech, it still performs much worse than AdaSpeech
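
To make the adaptation procedure concrete under the setup above, the following is a hypothetical minimal fine-tuning loop: the source model is frozen except for the speaker embedding and conditional-layer-norm projections, and a few (phoneme, mel) pairs from the target speaker are iterated with an L1 mel loss. The model interface, batch keys, step count, and learning rate are all assumptions, not the authors' code.

```python
import torch

def adapt_to_speaker(model, adaptation_batches, steps: int = 2000, lr: float = 1e-4):
    """Hypothetical adaptation loop: fine-tune only the parameters left trainable
    (e.g. speaker embedding + conditional-layer-norm projections) on a handful of
    recordings from the target speaker."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)
    for step in range(steps):
        batch = adaptation_batches[step % len(adaptation_batches)]
        pred_mel = model(batch["phonemes"], speaker_emb=batch["speaker_emb"])
        loss = torch.nn.functional.l1_loss(pred_mel, batch["mel"])  # MAE mel loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```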

Reference
  • Sercan Arik, Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou. Neural voice cloning with a few samples. In Advances in Neural Information Processing Systems, pp. 10019–10029, 2018.
  • Sercan O Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep voice: Real-time neural text-to-speech. arXiv preprint arXiv:1702.07825, 2017.
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, and Tao Qin. Multispeech: Multispeaker text to speech with transformer. arXiv preprint arXiv:2006.04664, 2020.
  • Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C Cobo, Andrew Trask, Ben Laurie, et al. Sample efficient adaptive text-to-speech. arXiv preprint arXiv:1809.10460, 2018.
  • Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, and Junichi Yamagishi. Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6184–6188. IEEE, 2020.
  • Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep voice 2: Multi-speaker neural text-to-speech. In Advances in neural information processing systems, pp. 2962–2970, 2017.
  • Keith Ito. The lj speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
  • Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in neural information processing systems, pp. 4480–4490, 2018.
  • Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. arXiv preprint arXiv:2005.11129, 2020.
  • Zvi Kons, Slava Shechtman, Alex Sorin, Carmel Rabinovitz, and Ron Hoory. High quality, lightweight and adaptable tts using lpcnet. arXiv preprint arXiv:1905.00590, 2019.
  • Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron C Courville. Melgan: Generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems, pp. 14910–14921, 2019.
  • Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu. Deep speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304, 2017.
  • Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. Montreal forced aligner: Trainable text-speech alignment using kaldi. In Interspeech, pp. 498–502, 2017.
  • Henry B Moss, Vatsal Aggarwal, Nishant Prateek, Javier González, and Roberto Barra-Chicote. Boffin tts: Few-shot speaker adaptation by bayesian optimization. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7639–7643. IEEE, 2020.
  • Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. IEEE, 2015.
  • Kainan Peng, Wei Ping, Zhao Song, and Kexin Zhao. Non-autoregressive neural text-to-speech. ICML, 2020.
  • Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep voice 3: 2000-speaker neural text-to-speech. In International Conference on Learning Representations, 2018.
  • Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pp. 3165–3174, 2019.
  • Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text-to-speech. arXiv preprint arXiv:2006.04558, 2020.
  • Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. IEEE, 2018.
  • Guangzhi Sun, Yu Zhang, Ron J Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, and Yonghui Wu. Generating diverse and natural text-to-speech samples using a quantized fine-grained vae and autoregressive prosody prior. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6699–6703. IEEE, 2020.
  • Hao Sun, Xu Tan, Jun-Wei Gan, Hongzhi Liu, Sheng Zhao, Tao Qin, and Tie-Yan Liu. Token-level ensemble distillation for grapheme-to-phoneme conversion. In INTERSPEECH, 2019.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
  • Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al. Superseded-cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit. 2016.
  • Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879–4883. IEEE, 2018.
  • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.
  • Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882, 2019.
  • Zhen Zeng, Jianzong Wang, Ning Cheng, and Jing Xiao. Prosody learning mechanism for speech synthesis system without text length limit. arXiv preprint arXiv:2008.05656, 2020.
  • Zewang Zhang, Qiao Tian, Heng Lu, Ling-Hui Chen, and Shan Liu. Adadurian: Few-shot adaptation for neural text-to-speech with durian. arXiv preprint arXiv:2005.05642, 2020.
Author
Mingjian Chen
Bohan Li
Yanqing Liu