High Fidelity Speech Synthesis with Adversarial Networks
ICLR, 2020.
Abstract:
Generative adversarial networks have seen rapid development in recent years and have led to remarkable improvements in generative modelling of images. However, their application in the audio domain has received limited attention, and autoregressive models, such as WaveNet, remain the state of the art in generative modelling of audio...
Introduction
- The Text-to-Speech (TTS) task consists in the conversion of text into speech audio. In recent years, the TTS field has seen remarkable progress, sparked by the development of neural autoregressive models for raw audio waveforms such as WaveNet (van den Oord et al, 2016), SampleRNN (Mehri et al, 2017) and WaveRNN (Kalchbrenner et al, 2018).
- A lot of recent research on neural models for TTS has focused on improving parallelism by predicting multiple time steps in parallel, e.g. using flow-based models (van den Oord et al., 2018; Ping et al., 2019; Prenger et al., 2019; Kim et al., 2019).
- Such highly parallelisable models are better suited to running efficiently on modern hardware.
- GANs currently constitute one of the dominant paradigms for generative modelling of images, and they are able to produce high-fidelity samples that are almost indistinguishable from real data.
- Their application to audio generation tasks has seen relatively limited success so far.
Highlights
- The Text-to-Speech (TTS) task consists in the conversion of text into speech audio
- We propose a family of quantitative metrics for speech generation based on Frechet Inception Distance (FID, Heusel et al, 2017) and Kernel Inception Distance (KID, Binkowski et al, 2018), where we replace the Inception image recognition network with the DeepSpeech audio recognition network
- We provide subjective human evaluation of our model using Mean Opinion Scores (MOS), as well as quantitative metrics
- Our architectural exploration led to the development of a model with an ensemble of unconditional and conditional Random Window Discriminators operating at different window sizes, which respectively assess the realism of the generated speech and its correspondence with the input text
- We have proposed a family of quantitative metrics for generative models of speech: Frechet DeepSpeech Distance and Kernel DeepSpeech Distance, and demonstrated experimentally that these metrics rank models in line with Mean Opinion Scores obtained through human evaluation
- The metrics are publicly available to the machine learning community, as is the DeepSpeech recognition model they are based on (a short sketch of the Frechet-style computation is given after this list)
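As a rough illustration of how a Frechet-style distance over recogniser features can be computed, the sketch below fits Gaussians to feature vectors extracted from real and generated audio and evaluates the Frechet distance between them. The feature-extraction step (running clips through a DeepSpeech-style network and pooling activations over time) is assumed and not shown; this is an illustrative sketch, not the authors' released implementation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two sets of feature vectors.

    real_feats, gen_feats: arrays of shape (num_clips, feature_dim), e.g. recogniser
    activations pooled over time for each audio clip.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the covariance product; discard tiny imaginary parts
    # that can appear due to numerical error.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

The conditional variants reported in the paper (cFDSD, cKDSD) compare real and generated audio for the same input text, while the unconditional ones compare samples irrespective of the text.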
Methods
- The authors discuss the experiments, comparing GAN-TTS with WaveNet and carrying out ablations that validate the architectural choices.
- As mentioned in Section 3, the main architectural choices made in the model include the use of multiple RWDs, conditional and unconditional, with a number of different downsampling factors. The ablated discriminator configurations are listed below (a rough sketch of such an ensemble follows the list):
- Single conditional RWD: cRWD1.
- Multiple conditional RWDs: cRWD{1,2,4,8,15} = ⋃_{k∈{1,2,4,8,15}} cRWDk.
- Single conditional and single unconditional RWD: cRWD1 + uRWD1.
- 10 RWDs without downsampling but with different window sizes: RWD{1,240×{1,2,4,8,15}} = ⋃_{k∈{1,2,4,8,15}} RWD{1,240k} (conditional and unconditional at each window size).
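To make the ensemble structure concrete, below is a minimal, hypothetical PyTorch sketch of combining conditional and unconditional random window discriminators at several downsampling factors. The module names, layer sizes and the striding-based downsampling are illustrative assumptions, not the architecture described in the paper.

```python
import torch
import torch.nn as nn

class RandomWindowDiscriminator(nn.Module):
    """Illustrative RWD: scores a random window of the (optionally downsampled) waveform."""

    def __init__(self, window_size: int, downsample: int, conditional: bool):
        super().__init__()
        self.window_size = window_size
        self.downsample = downsample
        self.conditional = conditional
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=9, padding=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=9, padding=4),
        )

    def forward(self, audio, linguistic_features=None):
        # audio: (batch, 1, samples); naive downsampling by striding, for illustration only.
        x = audio[:, :, ::self.downsample]
        start = torch.randint(0, x.shape[-1] - self.window_size + 1, (1,)).item()
        window = x[:, :, start:start + self.window_size]
        # A conditional RWD would additionally consume the aligned linguistic features here.
        return self.net(window).mean(dim=(1, 2))

class RWDEnsemble(nn.Module):
    """Ensemble of conditional and unconditional RWDs at several downsampling factors."""

    def __init__(self, window_size: int = 240, factors=(1, 2, 4, 8, 15)):
        super().__init__()
        self.discriminators = nn.ModuleList(
            [RandomWindowDiscriminator(window_size, k, conditional=c)
             for k in factors for c in (True, False)]
        )

    def forward(self, audio, linguistic_features=None):
        # The adversarial loss aggregates the scores of all ensemble members.
        return [d(audio, linguistic_features) for d in self.discriminators]
```

Each member sees only a short random window at its own scale, which keeps every individual discriminator cheap while the ensemble as a whole covers both fine detail and longer-range structure.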
Results
- The authors provide subjective human evaluation of the model using Mean Opinion Scores (MOS), as well as quantitative metrics.
- The authors evaluate the model on a set of 1000 sentences, using human evaluators. Each evaluator was asked to rate the subjective naturalness of a sentence on a 1-5 Likert scale; the scores are compared to those reported by van den Oord et al. (2018) for WaveNet and Parallel WaveNet.
- Although the model was trained to generate 2-second audio clips whose starting points are not necessarily aligned with the beginning of a sentence, the authors are able to generate samples of arbitrary length.
- Human evaluators scored full sentences with a length of up to 15 seconds (a rough sketch of MOS aggregation is given after this list).
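For reference, a Mean Opinion Score and an approximate 95% confidence interval can be aggregated from raw 1-5 ratings as in the sketch below; this is only an illustration of the statistic, not the evaluation pipeline used by the authors.

```python
import numpy as np

def mean_opinion_score(ratings):
    """Mean Opinion Score with a normal-approximation 95% confidence half-width.

    ratings: iterable of naturalness scores on a 1-5 Likert scale.
    """
    r = np.asarray(ratings, dtype=float)
    mos = r.mean()
    half_width = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mos, half_width

# Example with a handful of hypothetical ratings.
score, ci = mean_opinion_score([4, 5, 4, 3, 5, 4, 4])
print(f"MOS = {score:.2f} +/- {ci:.2f}")
```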
Conclusion
- Unlike state-of-the-art text-to-speech models, GAN-TTS is adversarially trained and the resulting generator is a feed-forward convolutional network (a minimal sketch of such a generator is given after this list).
- This allows for very efficient audio generation, which is important in practical applications.
- The authors' architectural exploration led to the development of a model with an ensemble of unconditional and conditional Random Window Discriminators operating at different window sizes, which respectively assess the realism of the generated speech and its correspondence with the input text.
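A minimal sketch of a feed-forward convolutional generator of this kind is shown below. It upsamples low-rate conditioning features to raw waveform samples with transposed convolutions; the 200 Hz-to-24 kHz rates (a 120x factor), the layer structure, the channel counts and feature_dim=567 are placeholder assumptions for illustration, not the paper's GBlock-based architecture (cf. Table 2).

```python
import torch
import torch.nn as nn

class FeedForwardGenerator(nn.Module):
    """Illustrative feed-forward convolutional waveform generator.

    Maps conditioning features at a low frame rate to raw audio with a stack of
    transposed convolutions (here 5 * 4 * 3 * 2 = 120x upsampling). The actual
    GAN-TTS generator instead uses residual GBlocks with dilated convolutions and
    conditional batch normalisation.
    """

    def __init__(self, feature_dim: int, hidden: int = 256):
        super().__init__()
        layers = [nn.Conv1d(feature_dim, hidden, kernel_size=3, padding=1)]
        ch = hidden
        for factor in (5, 4, 3, 2):  # total upsampling factor of 120
            layers += [nn.ReLU(),
                       nn.ConvTranspose1d(ch, ch // 2, kernel_size=factor, stride=factor)]
            ch //= 2
        layers += [nn.ReLU(), nn.Conv1d(ch, 1, kernel_size=3, padding=1), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim, frames) -> waveform: (batch, 1, 120 * frames)
        return self.net(features)

# Example: 400 conditioning frames -> 48000 samples (2 seconds at 24 kHz),
# with a hypothetical feature dimensionality of 567.
generator = FeedForwardGenerator(feature_dim=567)
waveform = generator(torch.randn(2, 567, 400))  # shape (2, 1, 48000)
```

Because every output sample is produced by a fixed stack of convolutions rather than by sequential sampling, generation parallelises trivially across time.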
Tables
- Table 1: Results from prior work, the ablation study and the proposed model. Mean opinion scores for natural speech, WaveNet and Parallel WaveNet are taken from van den Oord et al. (2018) and are not directly comparable due to dataset differences. For natural speech we present estimated FDSD – non-zero due to the bias of the estimator – and theoretical values of KDSD and cKDSD. cFDSD is unavailable; see Appendix B.2
- Table 2: Architecture of GAN-TTS's Generator. t denotes the temporal dimension, while ch denotes the number of channels. The rightmost three columns describe dimensions of the output of the corresponding layer
- Table 3: Downsample factors in discriminators for different initial stride values k
Related work
- 2.1 AUDIO GENERATION
Most neural models for audio generation are likelihood-based: they represent an explicit probability distribution and the likelihood of the observed data is maximised under this distribution. Autoregressive models achieve this by factorising the joint distribution into a product of conditional distributions (van den Oord et al, 2016; Mehri et al, 2017; Kalchbrenner et al, 2018; Arik et al, 2017). Another strategy is to use an invertible feed-forward neural network to model the joint density directly (Prenger et al, 2019; Kim et al, 2019). Alternatively, an invertible feed-forward model can be trained by distilling an autoregressive model using probability density distillation (van den Oord et al, 2018; Ping et al, 2019), which enables it to focus on particular modes. Such mode-seeking behaviour is often desirable in conditional generation settings: we want the generated speech signals to sound realistic and correspond to the given text, but we are not interested in modelling every possible variation that occurs in the data. This reduces model capacity requirements, because parts of the data distribution may be ignored. Note that adversarial models exhibit similar behaviour, but without the distillation and invertibility requirements.
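For concreteness, the autoregressive factorisation referred to above writes the joint density of a waveform x = (x_1, ..., x_T), given conditioning c (e.g. linguistic features), as a product of per-sample conditionals:

```latex
p(x \mid c) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, c)
```

Sampling from such a model is inherently sequential, one audio sample at a time, which is what motivates the parallel feed-forward alternatives (invertible flows, distilled models and adversarial generators) discussed above.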
Contributions
- Introduces GAN-TTS, a Generative Adversarial Network for Text-to-Speech
- Employs both subjective human evaluation, as well as novel quantitative metrics, which are found to be well correlated with MOS
- Shows that GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to the state-of-the-art models, and unlike autoregressive models, it is highly parallelisable thanks to an efficient feed-forward generator
- Explores raw waveform generation with GANs, and demonstrates that adversarially trained feed-forward generators are able to synthesise high-fidelity speech audio
- Introduces GAN-TTS, a Generative Adversarial Network for text-conditional high-fidelity speech synthesis
References
- Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In ICML, 2016.
- Sercan O Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep Voice: Real-time neural text-to-speech. In ICML, 2017.
- Mikołaj Binkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018.
- Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. In ICLR, 2016.
- Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
- Aidan Clark, Jeff Donahue, and Karen Simonyan. Efficient video generation on complex datasets. arXiv:1907.06571, 2019.
- Emily L Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NeurIPS, 2015.
- Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. In ICLR, 2019.
- Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. arXiv:1907.02544, 2019.
- Jeff Donahue, Philipp Krahenbuhl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.
- Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In ICLR, 2017a.
- Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2017b.
- Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. Neural audio synthesis of musical notes with WaveNet autoencoders. In ICML, 2017.
- Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. GANSynth: Adversarial neural audio synthesis. In ICLR, 2019.
- Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
- Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep Voice 2: Multi-speaker neural text-to-speech. In NeurIPS, 2017.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
- A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. Smola. A kernel two-sample test. JMLR, 2012.
- Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
- Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
- Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
- Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In ICML, 2018.
- Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
- Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
- Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Frechet audio distance: A metric for evaluating music enhancement algorithms. In Interspeech, 2019.
- Sungwon Kim, Sang-Gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. FloWaveNet: A generative flow for raw audio. In ICML, 2019.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, and Paulius Micikevicius. OpenSeq2Seq: Extensible toolkit for distributed and mixed precision training of sequence-to-sequence models. In NLP-OSS, 2018.
- Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. In NeurIPS, 2019.
- Jonathan Le Roux, Hirokazu Kameoka, Nobutaka Ono, and Shigeki Sagayama. Fast signal reconstruction from magnitude STFT spectrogram based on spectrogram consistency. In DAFx, 2010.
- Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV, 2016.
- Jae Hyun Lim and Jong Chul Ye. Geometric GAN. arXiv:1705.02894, 2017.
- Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. In ICLR, 2017.
- Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
- Paarth Neekhara, Chris Donahue, Miller Puckette, Shlomo Dubnov, and Julian McAuley. Expediting TTS synthesis with adversarial vocoding. In Interspeech, 2019.
- Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep Voice 3: 2000-speaker neural text-to-speech. In ICLR, 2018.
- Wei Ping, Kainan Peng, and Jitong Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. In ICLR, 2019.
- Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP, 2019.
- Masaki Saito and Shunta Saito. TGANv2: Efficient training of large models for video generation with multiple subsampling layers. arXiv:1811.09245, 2018.
- Y. Saito, S. Takamichi, and H. Saruwatari. Statistical parametric speech synthesis incorporating generative adversarial networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(1):84–96, Jan 2018.
- Andrew Saxe, James McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.
- Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In ICASSP, 2018.
- Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. In ICLR, 2017.
- Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
- Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, and Hirokazu Kameoka. Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 632–639. IEEE, 2018.
- Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
- Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. In ICML, 2018.
- Sean Vasquez and Mike Lewis. MelNet: A generative model for audio in the frequency domain. arXiv:1906.01083, 2019.
- Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. In Interspeech, 2017.
- Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Probability density distillation with generative adversarial networks for high-quality parallel waveform generation. arXiv:1904.04472, 2019.
- Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
- Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
- Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.
- Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.