End-to-End Adversarial Text-to-Speech

ICLR 2021.

Efficient, adversarially trained feed-forward text-to-speech model producing high-quality speech, learnt end-to-end in a single stage.

Abstract:

Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs.

Introduction
  • A text-to-speech (TTS) system processes natural language text inputs to produce synthetic human-like speech outputs.
  • Typical TTS pipelines consist of a number of stages trained or designed independently – e.g. text normalisation, aligned linguistic featurisation, mel-spectrogram synthesis, and raw audio waveform synthesis (Taylor, 2009)
  • While these pipelines have proven capable of realistic and high-fidelity speech synthesis and enjoy wide real-world use today, these modular approaches come with a number of drawbacks.
Highlights
  • A text-to-speech (TTS) system processes natural language text inputs to produce synthetic human-like speech outputs
  • The architecture and training setup of each ablation is identical to our base EATS model except in terms of the differences described by the columns in Table 1
  • Our main result achieved by the base multi-speaker model is a mean opinion score (MOS) of 4.083
  • We have presented an adversarial approach to text-to-speech synthesis which can learn from a relatively weak supervisory signal – normalised text or phonemes paired with corresponding speech audio
  • While there remains a gap between the fidelity of the speech produced by our method and the state-of-the-art systems, we believe that the end-to-end problem setup is a promising avenue for future advancements and research in text-to-speech
  • End-to-end learning enables the system as a whole to benefit from large amounts of training data, freeing models to optimise their intermediate representations for the task at hand, rather than constraining them to work with the typical bottlenecks imposed by most TTS pipelines today
Methods
  • The authors' goal is to learn a neural network which maps an input sequence of characters or phonemes to raw audio at 24 kHz.
  • The entire generator architecture is differentiable, and is trained end to end
  • It is a feed-forward convolutional network, which makes it well-suited for applications where fast batched inference is important: the EATS implementation generates speech at a speed of 200× realtime on a single NVIDIA V100 GPU (a minimal illustrative sketch of such a feed-forward generator follows this list).
  • It is illustrated in Figure 1
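The sketch below is not the authors' architecture; it is only meant to illustrate the overall shape of such a feed-forward, fully convolutional text-to-waveform generator: phoneme token IDs go in, and 24 kHz audio comes out of a single differentiable forward pass. The class name, layer sizes, and the assumed 240 output samples per input token are illustrative placeholders, and PyTorch is used purely for brevity.

import torch
import torch.nn as nn

class TinyFeedForwardTTS(nn.Module):
    """Illustrative stand-in for a feed-forward text-to-waveform generator."""
    def __init__(self, vocab_size=256, channels=128, samples_per_token=240):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, channels)
        # Convolutions over the token axis stand in for the text encoder.
        self.conv_stack = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Upsample from token rate to audio sample rate (240 samples per
        # token here is an arbitrary choice, i.e. 10 ms per token at 24 kHz).
        self.upsample = nn.Upsample(scale_factor=samples_per_token, mode="linear",
                                    align_corners=False)
        self.to_audio = nn.Conv1d(channels, 1, kernel_size=3, padding=1)

    def forward(self, phoneme_ids):                      # (batch, tokens)
        x = self.embed(phoneme_ids).transpose(1, 2)      # (batch, channels, tokens)
        x = self.conv_stack(x)
        x = self.upsample(x)                             # (batch, channels, samples)
        return torch.tanh(self.to_audio(x)).squeeze(1)   # waveform in [-1, 1]

# A batch of 2 utterances, 600 phoneme tokens each -> 24 kHz waveforms.
model = TinyFeedForwardTTS()
audio = model(torch.randint(0, 256, (2, 600)))           # shape (2, 600 * 240)

In the actual system an aligner predicts a duration for each token before upsampling; the fixed samples_per_token above merely stands in for that step.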
Results
  • Although it is difficult to compare directly with prior results from the literature due to dataset differences, the authors include MOS results from prior works (Binkowski et al., 2020; van den Oord et al., 2016; 2018), which report MOS in the 4.2 to 4.4+ range (a toy illustration of how a mean opinion score is computed follows this list)
  • Compared to these prior models, which rely on aligned linguistic features, EATS uses substantially less supervision
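As a reminder of what these MOS numbers measure, the toy snippet below averages a handful of made-up 1-to-5 naturalness ratings; the ratings are invented for illustration and are unrelated to the paper's evaluations.

# Mean opinion score: human raters score each sample on a 1-5 naturalness
# scale and the per-sample scores are averaged. Hypothetical ratings only.
ratings = [5, 4, 4, 5, 3, 4, 5, 4]
mos = sum(ratings) / len(ratings)
print(f"MOS = {mos:.3f}")  # 4.250 on this toy data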
Conclusion
  • The authors have presented an adversarial approach to text-to-speech synthesis which can learn from a relatively weak supervisory signal – normalised text or phonemes paired with corresponding speech audio.
  • The speech generated by the proposed model matches the given conditioning texts and generalises to unobserved texts, with naturalness judged by human raters approaching state-of-the-art systems with multi-stage training pipelines or additional supervision.
  • The proposed system described in Section 2 is efficient in both training and inference
  • It does not rely on autoregressive sampling or teacher forcing, avoiding issues like exposure bias (Bengio et al., 2015; Ranzato et al., 2016) and reduced parallelism at inference time, or the complexities introduced by distillation to a more efficient feed-forward model after the fact (van den Oord et al., 2018; Ping et al., 2019a).
  • The authors believe that a fully data-driven approach could prevail even in this setup given sufficient training data and model capacity
Summary
  • Introduction:

    A text-to-speech (TTS) system processes natural language text inputs to produce synthetic human-like speech outputs.
  • Typical TTS pipelines consist of a number of stages trained or designed independently – e.g. text normalisation, aligned linguistic featurisation, mel-spectrogram synthesis, and raw audio waveform synthesis (Taylor, 2009)
  • While these pipelines have proven capable of realistic and high-fidelity speech synthesis and enjoy wide real-world use today, these modular approaches come with a number of drawbacks.
  • Objectives:

    The authors aim to simplify the TTS pipeline and take on the challenging task of synthesising speech from text or phonemes in an end-to-end manner.
  • The authors' goal is to learn a neural network which maps an input sequence of characters or phonemes to raw audio at 24 kHz
  • Methods:

    The authors' goal is to learn a neural network which maps an input sequence of characters or phonemes to raw audio at 24 kHz.
  • The entire generator architecture is differentiable, and is trained end to end
  • It is a feed-forward convolutional network, which makes it well-suited for applications where fast batched inference is important: the EATS implementation generates speech at a speed of 200× realtime on a single NVIDIA V100 GPU.
  • It is illustrated in Figure 1
  • Results:

    Although it is difficult to compare directly with prior results from the literature due to dataset differences, the authors include MOS results from prior works (Binkowski et al., 2020; van den Oord et al., 2016; 2018), which report MOS in the 4.2 to 4.4+ range
  • Compared to these prior models, which rely on aligned linguistic features, EATS uses substantially less supervision
  • Conclusion:

    The authors have presented an adversarial approach to text-to-speech synthesis which can learn from a relatively weak supervisory signal – normalised text or phonemes paired with corresponding speech audio.
  • The speech generated by the proposed model matches the given conditioning texts and generalises to unobserved texts, with naturalness judged by human raters approaching state-of-the-art systems with multi-stage training pipelines or additional supervision.
  • The proposed system described in Section 2 is efficient in both training and inference
  • It does not rely on autoregressive sampling or teacher forcing, avoiding issues like exposure bias (Bengio et al., 2015; Ranzato et al., 2016) and reduced parallelism at inference time, or the complexities introduced by distillation to a more efficient feed-forward model after the fact (van den Oord et al., 2018; Ping et al., 2019a).
  • The authors believe that a fully data-driven approach could prevail even in this setup given sufficient training data and model capacity
Tables
  • Table 1: Mean Opinion Scores (MOS) for our final EATS model and the ablations described in Section 4
  • Table 2: Mean Opinion Scores (MOS) for the top four speakers with the most data in our training set. All evaluations are done using our single multi-speaker EATS model
  • Table 3: EATS batched inference benchmarks, timing inference (speech generation) on a Google Cloud TPU v3 (1 chip with 2 cores), a single NVIDIA V100 GPU, or an Intel Xeon E5-1650 v4 CPU at 3.60 GHz (6 physical cores). We use a batch size of 2, 8, or 16 utterances (Utt.), each 30 seconds long (input length of 600 phoneme tokens, padded if necessary). One “run” consists of 10 consecutive forward passes at the given batch size. We perform 101 such runs and report the median run time (Med. Run Time (s)) and the resulting Realtime Factor, the ratio of the total duration of the generated speech (Length / Run (s)) to the run time. (Note: GPU benchmarking is done using single precision (IEEE FP32) floating point; switching to half precision (IEEE FP16) could yield further speedups.) A worked example of the Realtime Factor calculation follows these table captions.
  • Table 4: The symbols in this table are replaced or removed when they appear in phonemizer’s output
  • Table 5: Mean Opinion Scores (MOS) and Fréchet DeepSpeech Distances (FDSD) for our final EATS model and the ablations described in Section 4, sorted by MOS. FDSD scores presented here were computed on the held-out multi-speaker validation set and therefore could not be obtained for the Single Speaker ablation. Due to dataset differences, these are also not comparable with the FDSD values reported for GAN-TTS by Binkowski et al. (2020)
  • Table 6: A comparison of TTS methods. The model stages described in each paper are shown by linking together the inputs, outputs and intermediate representations that are used: characters (Ch), phonemes (Ph), mel-spectrograms (MelS), magnitude spectrograms (MagS), cepstral features (Cep), linguistic features (Ling, such as phoneme durations and fundamental frequencies, or WORLD (Morise et al., 2016) features for Char2wav (Sotelo et al., 2017) and VoiceLoop (Taigman et al., 2017)), and audio (Au). Arrows with various superscripts describe model components: autoregressive (AR), feed-forward (FF), or feed-forward requiring distillation (FF*). Arrows without a superscript indicate components that do not require learning. 1 Stage means the model is trained in a single stage to map from unaligned text/phonemes to audio (without, e.g., distillation or separate vocoder training). EATS is the only feed-forward model that fulfills this requirement
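To make the Realtime Factor column of Table 3 concrete, the snippet below reproduces the calculation for a batch of 16 utterances of 30 seconds each with 10 forward passes per run; the 12-second run time is a hypothetical placeholder, not a measured number from the paper.

# Realtime factor = total duration of generated speech per run / run time.
batch_size = 16           # utterances per forward pass
utterance_seconds = 30.0  # each generated utterance is 30 s long
passes_per_run = 10       # one "run" = 10 consecutive forward passes

audio_seconds_per_run = batch_size * utterance_seconds * passes_per_run  # 4800 s
median_run_time_s = 12.0  # hypothetical median run time

realtime_factor = audio_seconds_per_run / median_run_time_s
print(f"Length / Run: {audio_seconds_per_run:.0f} s, "
      f"Realtime Factor: {realtime_factor:.0f}x")  # 4800 s, 400x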
Related work
  • Speech generation saw significant quality improvements once treating it as a generative modelling problem became the norm (Zen et al., 2009; van den Oord et al., 2016). Likelihood-based approaches dominate, but generative adversarial networks (GANs) (Goodfellow et al., 2014) have been making significant inroads recently. A common thread through most of the literature is a separation of the speech generation process into multiple stages: coarse-grained, temporally aligned intermediate representations, such as mel-spectrograms, are used to divide the task into more manageable subproblems. Many works focus exclusively on either spectrogram generation or vocoding (generating a waveform from a spectrogram). Our work is different in this respect, and we will point out which stages of the generation process are addressed by each model. In Table 6 (Appendix J) we compare these methods in terms of the inputs and outputs to each stage of their pipelines. A small illustration of the mel-spectrogram intermediate representation follows this paragraph.
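For readers unfamiliar with this intermediate representation, the snippet below computes a mel-spectrogram from a synthetic tone using librosa; the sample rate, number of mel bands, and hop length are arbitrary illustrative choices, not settings from any of the cited systems.

import numpy as np
import librosa

sr = 24_000
t = np.arange(sr) / sr
waveform = 0.5 * np.sin(2 * np.pi * 220.0 * t).astype(np.float32)  # 1 s, 220 Hz tone

# 80 mel bands, one frame every 300 samples (12.5 ms): a far coarser,
# temporally aligned summary of the 24 000 raw audio samples.
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=80, hop_length=300)
print(mel.shape)  # (80, 81)

Two-stage pipelines train one model to predict such a representation from text and a separate vocoder to turn it back into audio, which is exactly the kind of fixed bottleneck the end-to-end approach avoids.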
Study subjects and analysis
speakers: 4
At training time, we sample 2-second windows from the individual clips, post-padding those shorter than 2 seconds with silence. For evaluation, we focus on the single most prolific speaker in our dataset, with all our main MOS results reported with the model conditioned on that speaker ID, but also report MOS results for each of the top four speakers using our main multi-speaker model.
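A minimal sketch of the windowing step just described, assuming each clip is a 1-D NumPy array at 24 kHz; the function name and every detail beyond "sample a 2-second window, post-pad short clips with silence" are assumptions for illustration.

import numpy as np

SAMPLE_RATE = 24_000
WINDOW = 2 * SAMPLE_RATE  # 2 seconds of samples

def sample_training_window(clip: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    if len(clip) < WINDOW:
        return np.pad(clip, (0, WINDOW - len(clip)))   # post-pad with silence (zeros)
    start = rng.integers(0, len(clip) - WINDOW + 1)    # random window start
    return clip[start:start + WINDOW]

rng = np.random.default_rng(0)
window = sample_training_window(np.zeros(30_000, dtype=np.float32), rng)
assert window.shape == (WINDOW,)  # 48 000 samples = 2 s at 24 kHz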

most prolific speakers: 4
We demonstrate that the aligner learns to use the latent vector z to vary the predicted token lengths in Appendix H. In Table 2 we present additional MOS results from our main multi-speaker EATS model for the four most prolific speakers in our training data. MOS generally improves with more training data, although the correlation is imperfect (e.g., Speaker #3 achieves the highest MOS with only the third most training data)

speakers: 69
End-to-end learning enables the system as a whole to benefit from large amounts of training data, freeing models to optimise their intermediate representations for the task at hand, rather than constraining them to work with the typical bottlenecks (e.g., mel-spectrograms, aligned linguistic features) imposed by most TTS pipelines today. We see some evidence of this occurring in the comparison between our main result, trained using data from 69 speakers, and the Single Speaker ablation: the former is trained using roughly four times the data and synthesises more natural speech in the single voice on which the latter is trained. Notably, our current approach does not attempt to address the text normalisation and phonemisation problems, relying on a separate, fixed system for these aspects, while a fully end-to-end TTS system could operate on unnormalised raw text.

Reference
  • Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems. arXiv:1603.04467, 2015.
  • Sercan Ö Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, Shubho Sengupta, and Mohammad Shoeybi. Deep Voice: Real-time neural text-to-speech. In ICML, 2017.
  • Sercan Ö Arık, Heewoo Jun, and Gregory Diamos. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 2018.
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
  • Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, and Tom Bagby. Location-relative attention mechanisms for robust long-form speech synthesis. In ICASSP, 2020.
  • Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NeurIPS, 2015.
  • Mathieu Bernard. Phonemizer. https://github.com/bootphon/phonemizer, 2020.
  • Mikołaj Binkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, and Karen Simonyan. High fidelity speech synthesis with adversarial networks. In ICLR, 2020.
  • Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2018.
  • Jonathan Chevelu, Damien Lolive, Sébastien Le Maguer, and David Guennec. How to compare TTS systems: A new subjective evaluation methodology focused on differences. In International Speech Communication Association, 2015.
  • Chung-Cheng Chiu and Colin Raffel. Monotonic chunkwise attention. In ICLR, 2018.
  • Marco Cuturi and Mathieu Blondel. Soft-DTW: a differentiable loss function for time-series. In ICML, 2017.
  • Harm De Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C Courville. Modulating early visual processing by language. In NeurIPS, 2017.
  • Alexandre Défossez, Neil Zeghidour, Nicolas Usunier, Léon Bottou, and Francis Bach. SING: Symbol-to-instrument neural generator. In NeurIPS, 2018.
  • Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. In ICLR, 2019.
  • Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2017.
  • Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu, and Adam Roberts. DDSP: Differentiable digital signal processing. In ICLR, 2020.
  • Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep Voice 2: Multi-speaker neural text-to-speech. In NeurIPS, 2017.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • Alex Graves. Sequence transduction with recurrent neural networks. arXiv:1211.3711, 2012.
  • Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
  • Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.
  • Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984.
  • Haohan Guo, Frank K Soong, Lei He, and Lei Xie. A new GAN-based end-to-end TTS training algorithm. In Interspeech, 2019.
  • Mutian He, Yan Deng, and Lei He. Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS. In Interspeech, 2019.
  • Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
  • Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • Fumitada Itakura. Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):67–72, 1975.
  • Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In ICML, 2018.
  • Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. arXiv:2005.11129, 2020.
  • Sungwon Kim, Sang-Gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. FloWaveNet: A generative flow for raw audio. In ICML, 2019.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. In NeurIPS, 2019.
  • Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. A large-scale study on regularization and normalization in GANs. In ICML, 2019.
  • Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. In AAAI, 2019.
  • Jae Hyun Lim and Jong Chul Ye. Geometric GAN. arXiv:1705.02894, 2017.
  • Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
  • Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. In ICLR, 2017.
  • Chenfeng Miao, Shuang Liang, Minchuan Chen, Jun Ma, Shaojun Wang, and Jing Xiao. Flow-TTS: A non-autoregressive network for text to speech based on flow. In ICASSP, 2020.
  • Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In ICLR, 2018.
  • Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
  • Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 2016.
  • Paarth Neekhara, Chris Donahue, Miller Puckette, Shlomo Dubnov, and Julian McAuley. Expediting TTS synthesis with adversarial vocoding. In Interspeech, 2019.
  • George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv:1912.02762, 2019.
  • Kainan Peng, Wei Ping, Zhao Song, and Kexin Zhao. Parallel neural text-to-speech. arXiv:1905.08459, 2019.
  • Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep Voice 3: 2000-speaker neural text-to-speech. In ICLR, 2018.
  • Wei Ping, Kainan Peng, and Jitong Chen. ClariNet: Parallel wave generation in end-to-end text-tospeech. In ICLR, 2019a.
  • Wei Ping, Kainan Peng, Kexin Zhao, and Zhao Song. Waveflow: A compact flow-based model for raw audio. arXiv:1912.01219, 2019b.
  • Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP, 2019.
  • Colin Raffel, Minh-Thang Luong, Peter J Liu, Ron J Weiss, and Douglas Eck. Online and linear-time attention by enforcing monotonic alignments. In ICML, 2017.
  • Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
  • Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech: Fast, robust and controllable text to speech. In NeurIPS, 2019.
  • Hardik B Sailor and Hemant A Patil. Fusion of magnitude and phase-based features for objective evaluation of TTS voice. In International Symposium on Chinese Spoken Language Processing, 2014.
  • Hiroaki Sakoe. Dynamic-programming approach to continuous speech recognition. In International Congress of Acoustics, 1971.
  • Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1978.
  • Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerrv-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In ICASSP, 2018.
  • Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. In ICLR, 2017.
  • Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani. VoiceLoop: Voice fitting and synthesis via a phonological loop. arXiv:1705.03122, 2017.
  • Paul Taylor. Text-to-speech synthesis. Cambridge University Press, 2009.
  • Dustin Tran, Rajesh Ranganath, and David M. Blei. Hierarchical implicit models and likelihood-free variational inference. In NeurIPS, 2017.
  • Jean-Marc Valin and Jan Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. In ICASSP, 2019.
  • Rafael Valle, Kevin Shih, Ryan Prenger, and Bryan Catanzaro. Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis. arXiv:2005.05957, 2020.
  • Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
  • Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. In ICML, 2018.
  • Sean Vasquez and Mike Lewis. MelNet: A generative model for audio in the frequency domain. arXiv:1906.01083, 2019.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Xin Wang, Shinji Takaki, and Junichi Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. In ICASSP, 2019.
  • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. Tacotron: Towards end-to-end speech synthesis. In Interspeech, 2017.
  • Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1989.
  • Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Probability density distillation with generative adversarial networks for high-quality parallel waveform generation. In Interspeech, 2019.
  • Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP, 2020.
  • Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, and Lei Xie. Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech. arXiv:2005.05106, 2020.
  • Heiga Zen, Keiichi Tokuda, and Alan W Black. Statistical parametric speech synthesis. Speech Communication, 2009.
  • Hao Zhang, Richard Sproat, Axel H Ng, Felix Stahlberg, Xiaochang Peng, Kyle Gorman, and Brian Roark. Neural models of text normalization for speech applications. Computational Linguistics, 2019.
  • Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai. Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. In ICASSP, 2018.