FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

ICLR 2021

Cited by: 15 | Views: 283

Summary: We propose a non-autoregressive TTS model named FastSpeech 2 to better solve the one-to-many mapping problem in TTS and surpass autoregressive models in voice quality.

Abstract

Non-autoregressive text to speech (TTS) models such as FastSpeech (Ren et al., 2019) can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of the FastSpeech model relies on an autoregressive teacher model for...

Introduction
  • Neural network based text to speech (TTS) has made rapid progress and attracted much attention in the machine learning and speech communities in recent years (Wang et al., 2017; Shen et al., 2018; Ming et al., 2016; Arik et al., 2017; Ping et al., 2018; Ren et al., 2019; Li et al., 2019).
  • Previous neural TTS models (Wang et al., 2017; Shen et al., 2018; Ping et al., 2018; Li et al., 2019) first generate mel-spectrograms autoregressively from text and then synthesize speech from the generated mel-spectrograms using a separately trained vocoder (Van Den Oord et al., 2016; Oord et al., 2017; Prenger et al., 2019; Kim et al., 2018; Yamamoto et al., 2020; Kumar et al., 2019).
  • These autoregressive models usually suffer from slow inference speed and robustness issues (Ren et al., 2019); FastSpeech addresses them with a non-autoregressive architecture trained through a teacher-student knowledge distillation pipeline.
  • While these designs in FastSpeech ease the learning of the one-to-many mapping problem in TTS, they bring several disadvantages: 1) the two-stage teacher-student training pipeline makes the training process complicated; 2) the target mel-spectrograms generated from the teacher model suffer from information loss compared with the ground-truth ones, since the quality of audio synthesized from the generated mel-spectrograms is usually worse than that synthesized from the ground-truth ones; 3) the duration extracted from the attention map of the teacher model is not accurate enough.
Highlights
  • Neural network based text to speech (TTS) has made rapid progress and attracted much attention in the machine learning and speech communities in recent years (Wang et al., 2017; Shen et al., 2018; Ming et al., 2016; Arik et al., 2017; Ping et al., 2018; Ren et al., 2019; Li et al., 2019)
  • The duration extracted from the attention map of the teacher model is not accurate enough
  • To reduce the information gap between the input and target output and alleviate the one-to-many mapping problem for non-autoregressive TTS model training, we introduce variance information of speech, including pitch, energy, and more accurate duration, into FastSpeech: in training, we extract duration, pitch, and energy from the target speech waveform and directly take them as conditional inputs; in inference, we use values predicted by the predictors that are jointly trained with the FastSpeech 2 model (see the sketch after this list)
  • We replace the duration used in FastSpeech with that extracted by the Montreal Forced Aligner (MFA), and conduct a CMOS (Loizou, 2011) test to compare the voice quality between the two FastSpeech models trained with the different durations
  • We proposed FastSpeech 2, a fast and high-quality end-to-end TTS system, to address the issues in FastSpeech and ease the one-to-many mapping problem: 1) we directly train the model with ground-truth mel-spectrograms to simplify the training pipeline and avoid information loss compared with FastSpeech; and 2) we improve the duration accuracy and introduce more variance information, including pitch and energy, to ease the one-to-many mapping problem, and improve pitch prediction by introducing the continuous wavelet transform
  • Our experimental results show that FastSpeech 2 and 2s outperform FastSpeech, and FastSpeech 2 can even surpass autoregressive models in terms of voice quality, with a much simpler training pipeline, while inheriting the advantages of fast, robust, and controllable speech synthesis from FastSpeech
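The conditioning idea in the third highlight above can be made concrete with a small sketch. This is not the authors' code: module names, layer sizes, and the bucketization of pitch/energy values below are illustrative assumptions, and the duration predictor and length regulator are omitted for brevity.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Two-layer 1D-conv stack mapping hidden states to one scalar per position."""
    def __init__(self, hidden_dim=256, kernel_size=3, dropout=0.5):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, h):                        # h: (batch, time, hidden_dim)
        x = self.convs(h.transpose(1, 2))        # convs expect (batch, dim, time)
        return self.proj(x.transpose(1, 2)).squeeze(-1)   # (batch, time)

class VarianceAdaptor(nn.Module):
    """Adds pitch/energy embeddings to the hidden sequence (duration omitted)."""
    def __init__(self, hidden_dim=256, n_bins=256):
        super().__init__()
        self.pitch_predictor = VariancePredictor(hidden_dim)
        self.energy_predictor = VariancePredictor(hidden_dim)
        # Assumed bucketization of already-normalized pitch/energy values.
        self.register_buffer("bins", torch.linspace(-3.0, 3.0, n_bins - 1))
        self.pitch_embed = nn.Embedding(n_bins, hidden_dim)
        self.energy_embed = nn.Embedding(n_bins, hidden_dim)

    def forward(self, h, pitch_target=None, energy_target=None):
        pitch_pred = self.pitch_predictor(h)
        energy_pred = self.energy_predictor(h)
        # Training: condition on values extracted from the target recording.
        # Inference: fall back to the jointly trained predictors' outputs.
        pitch = pitch_target if pitch_target is not None else pitch_pred
        energy = energy_target if energy_target is not None else energy_pred
        h = h + self.pitch_embed(torch.bucketize(pitch, self.bins))
        h = h + self.energy_embed(torch.bucketize(energy, self.bins))
        return h, pitch_pred, energy_pred        # predictions used for regression losses

# Example shapes: a batch of 2 sequences of 10 phonemes with 256-dim hidden states.
adaptor = VarianceAdaptor()
out, p, e = adaptor(torch.randn(2, 10, 256))    # inference-style call (no targets)
print(out.shape, p.shape, e.shape)
```

In training, regression losses between the predicted and the extracted pitch/energy/duration values would be added to the mel-spectrogram reconstruction loss.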
Methods
  • To tackle the challenges above, the authors make several designs in the waveform decoder: 1) considering that phase information is difficult to predict with a variance predictor (Engel et al., 2020), they introduce adversarial training in the waveform decoder to force it to implicitly recover the phase information by itself (Yamamoto et al., 2020); 2) they leverage the mel-spectrogram decoder of FastSpeech 2, which is trained on the full text sequence, to help with text feature extraction.
  • The authors compute the mean absolute error (MAE) between the frame-wise energy extracted from the generated waveform and that extracted from the ground-truth speech (see the sketch after this list).
  • The authors replace the duration used in FastSpeech with that extracted by MFA, and conduct a CMOS (Loizou, 2011) test to compare the voice quality between the two FastSpeech models trained with the different durations.
  • The results are listed in Table 5b: more accurate duration information improves the voice quality of FastSpeech, which verifies the effectiveness of the improved duration from MFA.
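As a rough illustration of the frame-wise energy MAE mentioned above, the sketch below takes per-frame energy as the L2 norm of each STFT frame's magnitude spectrum and averages the absolute difference. The frame and hop sizes mirror the paper's spectrogram settings, but the exact extraction details are assumptions.

```python
import numpy as np
import librosa

def frame_energy(wav, n_fft=1024, hop_length=256):
    # Per-frame energy: L2 norm of the magnitude spectrum of each STFT frame.
    spec = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length))
    return np.linalg.norm(spec, axis=0)

def energy_mae(generated_wav, ground_truth_wav):
    e_gen = frame_energy(generated_wav)
    e_ref = frame_energy(ground_truth_wav)
    n = min(len(e_gen), len(e_ref))              # guard against off-by-one frame counts
    return float(np.mean(np.abs(e_gen[:n] - e_ref[:n])))
```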
Results
  • Datasets: The authors evaluate FastSpeech 2 and 2s on the LJSpeech dataset (Ito, 2017).
  • The authors randomly choose 100 samples from the test set for subjective evaluation.
  • The authors transform the raw waveform into mel-spectrograms following Shen et al. (2018), setting the frame size to 1024 and the hop size to 256 at a sample rate of 22050 Hz (see the sketch after this list).
  • The compared systems include FastSpeech (Ren et al., 2019) (Mel + PWG), FastSpeech 2 (Mel + PWG), and FastSpeech 2s; the full MOS results are reported in Table 1.
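For reference, the mel-spectrogram settings quoted above (frame size 1024, hop size 256, 22050 Hz sample rate) could be reproduced roughly as follows; the number of mel bins (80), the log compression, and the file name are assumptions, and the paper's exact filterbank and normalization may differ.

```python
import numpy as np
import librosa

# Hypothetical LJSpeech clip, resampled to the quoted 22050 Hz.
wav, sr = librosa.load("LJ001-0001.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, win_length=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log-compressed, as is common
print(log_mel.shape)                              # (80, num_frames)
```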
Conclusion
  • The authors' experimental results show that FastSpeech 2 and 2s outperform FastSpeech, and FastSpeech 2 can even surpass autoregressive models in terms of voice quality, with a much simpler training pipeline, while inheriting the advantages of fast, robust, and controllable speech synthesis from FastSpeech.
  • The authors believe simpler solutions to this goal will emerge in the future, and they will work on fully end-to-end TTS without external alignment models and tools.
  • The authors will consider more variance information to further improve voice quality and will speed up inference with a more lightweight model.
Tables
  • Table 1: Audio quality comparison
  • Table 2: Comparison of training time and inference latency in waveform synthesis. The training time of FastSpeech includes teacher and student training. RTF denotes the real-time factor, i.e., the time (in seconds) required for the system to synthesize one second of waveform. The training and inference latency tests are conducted on a server with 36 Intel Xeon CPUs, 256 GB of memory, 1 NVIDIA V100 GPU, and a batch size of 48 for training and 1 for inference. The time of GPU memory garbage collection and of transferring input and output data between the CPU and the GPU is not included. The speedup in waveform synthesis for FastSpeech is larger than that reported in Ren et al. (2019) since Parallel WaveGAN, which is much faster than WaveGlow, is used as the vocoder
  • Table 3: Standard deviation (σ), skewness (γ), kurtosis (K), and average DTW distance (DTW) of pitch in ground-truth and synthesized audio
  • Table 4: Mean absolute error (MAE) of the energy in synthesized speech audio
  • Table 5: Comparison of the duration from the teacher model and from MFA. ∆ denotes the average absolute boundary difference
  • Table 6: CMOS comparison in the ablation studies
  • Table 7: Hyperparameters of Transformer TTS, FastSpeech, and FastSpeech 2/2s
Funding
  • In this work, we proposed FastSpeech 2, a fast and high-quality end-to-end TTS system, to address the issues in FastSpeech and ease the one-to-many mapping problem: 1) we directly train the model with ground-truth mel-spectrograms to simplify the training pipeline and avoid information loss compared with FastSpeech; and 2) we improve the duration accuracy and introduce more variance information, including pitch and energy, to ease the one-to-many mapping problem, and improve pitch prediction by introducing the continuous wavelet transform (CWT); a sketch of such a CWT decomposition follows below
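To illustrate the continuous wavelet transform used for pitch modeling, here is a minimal sketch that decomposes a normalized log-F0 contour into a multi-scale pitch representation with PyWavelets. The wavelet choice ('mexh'), the dyadic scales, and the 10-scale depth are assumptions rather than the paper's exact configuration.

```python
import numpy as np
import pywt

def pitch_to_cwt(log_f0, num_scales=10):
    # Normalize the contour, then decompose it into num_scales wavelet components.
    contour = (log_f0 - log_f0.mean()) / (log_f0.std() + 1e-8)
    scales = 2.0 ** np.arange(1, num_scales + 1)
    coeffs, _ = pywt.cwt(contour, scales, "mexh")
    return coeffs                                 # shape: (num_scales, len(log_f0))

# Example with a synthetic contour, just to show the shapes involved.
log_f0 = np.log(200 + 20 * np.sin(np.linspace(0, 6.28, 400)))
print(pitch_to_cwt(log_f0).shape)                 # (10, 400)
```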
Study subjects and analysis
samples: 12,228
LJSpeech contains 13,100 English audio clips (about 24 hours) and the corresponding text transcripts. We split the dataset into three sets: 12,228 samples for training, 349 samples (with document title LJ003) for validation, and 523 samples (with document titles LJ001 and LJ002) for testing. For subjective evaluation, we randomly choose 100 samples from the test set

samples: 100
We split the dataset into three sets: 12,228 samples for training, 349 samples (with document title LJ003) for validation, and 523 samples (with document titles LJ001 and LJ002) for testing (see the split sketch below). For subjective evaluation, we randomly choose 100 samples from the test set. To alleviate mispronunciation problems, we convert the text sequence into the phoneme sequence (Arik et al., 2017; Wang et al., 2017; Shen et al., 2018) with an open-source grapheme-to-phoneme tool
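The document-based split described above could be reproduced with a few lines like the following. The directory layout and file naming assume the standard LJSpeech-1.1 release (e.g. "LJ003-0012.wav"); this is a sketch, not the authors' preprocessing script.

```python
from pathlib import Path

val_docs = {"LJ003"}
test_docs = {"LJ001", "LJ002"}

train, val, test = [], [], []
for wav_path in sorted(Path("LJSpeech-1.1/wavs").glob("*.wav")):
    doc = wav_path.stem.split("-")[0]            # e.g. "LJ003" from "LJ003-0012"
    if doc in val_docs:
        val.append(wav_path)
    elif doc in test_docs:
        test.append(wav_path)
    else:
        train.append(wav_path)

print(len(train), len(val), len(test))           # expected roughly 12228 / 349 / 523
```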

native English speakers: 20
Audio quality: To evaluate perceptual quality, we perform a mean opinion score (MOS) (Chu & Peng, 2006) evaluation on the test set. Twenty native English speakers are asked to make quality judgments about the synthesized speech samples. The text content is kept consistent across the different systems so that all testers judge only the audio quality, without interference from other factors

References
  • Bistra Andreeva, Grazyna Demenko, Bernd Mobius, Frank Zimmerer, Jeanin Jugler, and Magdalena Oleskowicz-Popiel. Differences of pitch profiles in germanic and slavic languages. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • Sercan O Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep voice: Real-time neural text-to-speech. arXiv preprint arXiv:1702.07825, 2017.
  • Min Chu and Hu Peng. Objective measure for estimating mean opinion score of synthesized speech, April 4 2006. US Patent 7,024,362.
  • Jeff Donahue, Sander Dieleman, Mikołaj Binkowski, Erich Elsen, and Karen Simonyan. End-to-end adversarial text-to-speech. arXiv preprint arXiv:2006.03575, 2020.
  • Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, and Adam Roberts. Ddsp: Differentiable digital signal processing. arXiv preprint arXiv:2001.04643, 2020.
  • Yuchen Fan, Yao Qian, Feng-Long Xie, and Frank K Soong. Tts synthesis with bidirectional lstm based recurrent neural networks. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • Michael Gadermayr, Maximilian Tschuchnig, Dorit Merhof, Nils Kramer, Daniel Truhn, and Burkhard Gess. An asymetric cycle-consistency loss for dealing with many-to-one mappings in image translation: A study on thigh mr scans. arXiv preprint arXiv:2004.11001, 2020.
  • Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep voice 2: Multi-speaker neural text-to-speech. In Advances in neural information processing systems, pp. 2962–2970, 2017.
  • Alexander Grossmann and Jean Morlet. Decomposition of hardy functions into square integrable wavelets of constant shape. SIAM journal on mathematical analysis, 15(4):723–736, 1984.
  • Keikichi Hirose and Jianhua Tao. Speech Prosody in Speech Synthesis: Modeling and generation of prosody for high quality and flexible speech synthesis. Springer, 2015.
  • Keith Ito. The lj speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
  • Chrisina Jayne, Andreas Lanitis, and Chris Christodoulou. One-to-many neural network mapping techniques for face image synthesis. Expert Systems with Applications, 39(10):9778–9787, 2012.
  • Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. arXiv preprint arXiv:2005.11129, 2020.
  • Sungwon Kim, Sang-gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. Flowavenet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron C Courville. Melgan: Generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems, pp. 14881–14892, 2019.
  • Adrian Łancucki. Fastpitch: Parallel text-to-speech with pitch prediction. arXiv preprint arXiv:2006.06873, 2020.
  • Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6706–6713, 2019.
  • Dan Lim, Won Jang, Hyeyeong Park, Bongwan Kim, Jesam Yoon, et al. Jdi-t: Jointly trained duration informed transformer for text-to-speech without explicit alignment. arXiv preprint arXiv:2005.07799, 2020.
  • Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. Montreal forced aligner: Trainable text-speech alignment using kaldi. In Interspeech, pp. 498– 502, 2017.
  • Chenfeng Miao, Shuang Liang, Minchuan Chen, Jun Ma, Shaojun Wang, and Jing Xiao. Flowtts: A non-autoregressive network for text to speech based on flow. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7209–7213. IEEE, 2020.
  • Huaiping Ming, Dongyan Huang, Lei Xie, Jie Wu, Minghui Dong, and Haizhou Li. Deep bidirectional lstm modeling of timbre and prosody for emotional voice conversion. 2016.
  • Meinard Muller. Dynamic time warping. Information retrieval for music and motion, pp. 69–84, 2007.
  • Oliver Niebuhr and Radek Skarnitzl. Measuring a speaker’s acoustic correlates of pitch–but which? a contrastive analysis based on perceived speaker charisma. In Proceedings of 19th International Congress of Phonetic Sciences, 2019.
  • Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. Parallel wavenet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
  • Kainan Peng, Wei Ping, Zhao Song, and Kexin Zhao. Parallel neural text-to-speech. arXiv preprint arXiv:1905.08459, 2019.
  • Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep voice 3: 2000-speaker neural text-to-speech. In International Conference on Learning Representations, 2018.
  • Wei Ping, Kainan Peng, and Jitong Chen. Clarinet: Parallel wave generation in end-to-end text-tospeech. In International Conference on Learning Representations, 2019.
  • Ryan Prenger, Rafael Valle, and Bryan Catanzaro. Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. IEEE, 2019.
  • Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pp. 3165–3174, 2019.
  • Harold Ryan. Ricker, Ormsby, Klauder, Butterworth - a choice of wavelets, 1994.
  • Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. IEEE, 2018.
  • Antti Santeri Suni, Daniel Aalto, Tuomo Raitio, Paavo Alku, Martti Vainio, et al. Wavelets for intonation modeling in hmm speech synthesis. In 8th ISCA Workshop on Speech Synthesis, Proceedings, Barcelona, August 31-September 2, 2013. ISCA, 2013.
  • Franz B Tuteur. Wavelet transformations in signal detection. IFAC Proceedings Volumes, 21(9): 1061–1065, 1988.
  • Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016.
  • Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pp. 4790–4798, 2016.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
  • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.
  • Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203. IEEE, 2020.
  • Heiga Ze, Andrew Senior, and Mike Schuster. Statistical parametric speech synthesis using deep neural networks. In 2013 ieee international conference on acoustics, speech and signal processing, pp. 7962–7966. IEEE, 2013.
  • Zhen Zeng, Jianzong Wang, Ning Cheng, Tian Xia, and Jing Xiao. Aligntts: Efficient feed-forward text-to-speech system without explicit alignment. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6714–6718. IEEE, 2020.
  • Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in neural information processing systems, pp. 465–476, 2017.