HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

NeurIPS 2020

Abstract

Several recent studies on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this study, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.

Introduction
  • Voice is one of the most frequent and naturally used communication interfaces for humans.
  • Most neural speech synthesis models use a two-stage pipeline: 1) predicting a low-resolution intermediate representation such as mel-spectrograms (Shen et al., 2018; Ping et al., 2017; Li et al., 2019) or linguistic features (Oord et al., 2016) from text, and 2) synthesizing raw waveform audio from the intermediate representation (Oord et al., 2016; 2017; Prenger et al., 2019; Kumar et al., 2019).
  • The authors focus on designing a second-stage model that efficiently synthesizes high-fidelity waveforms from mel-spectrograms (a sketch of this intermediate representation follows below).
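
A minimal sketch of the mel-spectrogram intermediate representation referenced above, using librosa; the parameter values (22,050 Hz sample rate, 80 mel bands, 1024-point FFT, 256-sample hop) are illustrative assumptions, not the paper's exact configuration:

    # Computing the kind of mel-spectrogram a second-stage vocoder such as
    # HiFi-GAN inverts back into a raw waveform. Parameter values below are
    # illustrative, not the paper's exact setup.
    import numpy as np
    import librosa

    wav, sr = librosa.load("sample.wav", sr=22050)   # mono waveform at 22 kHz
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
    )
    log_mel = np.log(np.clip(mel, 1e-5, None))       # log-compressed mel frames

    # A first-stage model (e.g. Tacotron 2) predicts frames shaped like log_mel
    # from text; the second-stage model maps them back to raw audio.
    print(log_mel.shape)                             # (80 mel bands, n_frames)
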
Highlights
  • Voice is one of the most frequent and naturally used communication interfaces for humans
  • A phoneme duration can be longer than 100 ms, resulting in high correlation between more than 2200 adjacent samples in the raw waveform. This problem has been addressed in a previous study (Donahue et al., 2018) by increasing the receptive fields of the generator and discriminator. We focus on another crucial problem that has not yet been resolved: since speech audio consists of sinusoidal signals with various periods, the diverse periodic patterns underlying the audio data need to be identified (see the numeric sketch after this list)
  • We introduced HiFi-GAN, which can efficiently synthesize high-quality speech audio
  • We took inspiration from the observation that speech audio consists of patterns with various periods, applied this to the design of our networks, and verified through an ablation study that the proposed discriminator greatly influences the quality of speech synthesis
  • Our experiments show that the generators of various configurations can be trained with the same discriminators and learning mechanism, which indicates the possibility of flexibly selecting a generator configuration according to the target specifications without the need for a time-consuming hyper-parameter search for the discriminators
  • HiFi-GAN will be released as open source, and we envisage that our work will serve as a basis for future speech synthesis studies
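
The numeric claim in the periodicity highlight above can be checked directly, and the grouping of samples by a candidate period can be sketched in a few lines. The 22,050 Hz sample rate is assumed from the Methods section, and the reshaping is only an illustration of how period-p structure can be exposed, not the paper's discriminator implementation:

    # 1) A 100 ms phoneme spans more than 2200 samples at 22,050 Hz.
    # 2) Reshaping a waveform into (T // p, p) puts samples spaced p apart
    #    into the same column, exposing period-p structure.
    import numpy as np

    sr = 22050                              # assumed raw-audio sample rate
    print(int(0.100 * sr))                  # 2205 correlated samples per 100 ms

    def frame_by_period(wav: np.ndarray, p: int) -> np.ndarray:
        """Group a 1D waveform into rows of length p; column j then holds
        every p-th sample starting at offset j."""
        T = (len(wav) // p) * p             # drop the tail so the reshape is exact
        return wav[:T].reshape(-1, p)

    tone = np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr)   # toy 220 Hz tone, 1 s
    print(frame_by_period(tone, p=5).shape)                  # (4410, 5)
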
Methods
  • For a fair and reproducible comparison with other models, the authors used the LJSpeech dataset (Ito, 2017), on which many speech synthesis models are trained.
  • LJSpeech consists of 13,100 short audio clips of a single speaker with a total length of approximately 24 hours.
  • To evaluate the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers, the authors used the VCTK multi-speaker dataset (Veaux et al, 2017), which consists of approximately 44,200 short audio clips uttered by 109 native English speakers with various accents.
  • The audio format is 16-bit PCM with a sample rate of 44 kHz. The authors reduced the sample rate to 22 kHz, randomly selected nine speakers, and excluded all their audio clips from the training set (a sketch of this preparation follows after this list).
  • The authors trained MoL WaveNet, WaveGlow, and MelGAN with the same data settings; all the models were trained for 2.5M steps
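
A minimal sketch of the data preparation described above (resampling to 22 kHz and holding out nine speakers); the file layout, speaker-ID naming, and library choices are illustrative assumptions, not the authors' released preprocessing:

    # Resample VCTK-style clips to 22 kHz and exclude nine randomly chosen
    # speakers from the training set. The "VCTK/<speaker>/<clip>.wav" layout
    # is an assumption for illustration.
    import random
    from pathlib import Path

    import librosa
    import soundfile as sf

    SRC, DST, TARGET_SR = Path("VCTK"), Path("VCTK_22k"), 22050

    speakers = sorted(p.name for p in SRC.iterdir() if p.is_dir())
    held_out = set(random.sample(speakers, 9))         # nine unseen speakers

    for wav_path in SRC.rglob("*.wav"):
        speaker = wav_path.parent.name
        split = "test" if speaker in held_out else "train"
        wav, _ = librosa.load(wav_path, sr=TARGET_SR)   # resample while loading
        out = DST / split / speaker / wav_path.name
        out.parent.mkdir(parents=True, exist_ok=True)
        sf.write(str(out), wav, TARGET_SR, subtype="PCM_16")
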
Results
  • 4.1 Audio Quality and Synthesis Speed

    To evaluate the performance of the models in terms of both quality and speed, the authors performed the MOS test and the speed measurement (a sketch of the speed metric follows after this list).
  • In terms of synthesis speed, V1 is faster than WaveGlow and MoL WaveNet.
  • V2 demonstrates quality similar to human recordings with a MOS score of 4.23 while significantly reducing the memory requirement and inference time compared to V1.
  • It requires only 0.92M parameters.
  • Because V3 efficiently synthesizes speech on CPU, it is well suited for on-device applications
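
A minimal sketch of the speed metric referenced above: the number of raw samples generated per second (reported as n kHz in Table 1), divided by the 22 kHz playback rate, gives the factor relative to real time. The generator object and its call signature here are hypothetical placeholders, not the released HiFi-GAN API:

    # Measure samples generated per second and convert to a real-time factor.
    # `generator(mel)` is a hypothetical mel-to-waveform call, used only to
    # illustrate the metric.
    import time
    import torch

    SR = 22050  # playback sample rate of the synthesized audio (22 kHz)

    @torch.no_grad()
    def real_time_factor(generator, mel: torch.Tensor) -> float:
        start = time.perf_counter()
        audio = generator(mel)               # hypothetical: mel -> raw waveform
        elapsed = time.perf_counter() - start
        samples_per_sec = audio.numel() / elapsed
        return samples_per_sec / SR          # e.g. 10.0 == 10x faster than real time
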
Conclusion
  • The authors introduced HiFi-GAN, which can efficiently synthesize high quality speech audio.
  • The authors' small-footprint model demonstrated sample quality comparable to the best publicly available autoregressive counterpart, while producing samples an order of magnitude faster than real time on CPU.
  • This shows progress towards on-device natural speech synthesis, which requires low latency and a small memory footprint.
  • HiFi-GAN will be released as open source, and the authors envisage that the work will serve as a basis for future speech synthesis studies
Tables
  • Table 1: Comparison of MOS and synthesis speed. A speed of n kHz means the model generates n × 1000 raw audio samples per second; the numbers in parentheses give the speed relative to real time
  • Table 2: Ablation study results. Comparison of the effect of each component on synthesis quality
  • Table 3: Quality comparison of synthesized utterances for unseen speakers
  • Table 4: Quality comparison for end-to-end speech synthesis
  • Table 5: Hyper-parameters of the three generators V1, V2, and V3
Study subjects and analysis
native English speakers with various accents: 109
HiFi-GAN was compared against the best publicly available models: the popular mixture of logistics (MoL) WaveNet (Oord et al, 2017) implementation (Yamamoto, 2018) and the official implementation of WaveGlow (Valle, 2018b) and MelGAN (Kumar, 2019). We used the provided pretrained weights for all the models.

To evaluate the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers, we used the VCTK multi-speaker dataset (Veaux et al., 2017), which consists of approximately 44,200 short audio clips uttered by 109 native English speakers with various accents. The total length of the audio clips is approximately 44 hours.

speakers: 9
We reduced the sample rate to 22 kHz. We randomly selected nine speakers and excluded all their audio clips from the training set. We then trained MoL WaveNet, WaveGlow, and MelGAN with the same data settings; all the models were trained until 2.5M steps

samples: 24000
Most neural speech synthesis models use a two-stage pipeline: 1) predicting a low-resolution intermediate representation such as mel-spectrograms (Shen et al., 2018; Ping et al., 2017; Li et al., 2019) or linguistic features (Oord et al., 2016) from text, and 2) synthesizing raw waveform audio from the intermediate representation (Oord et al., 2016; 2017; Prenger et al., 2019; Kumar et al., 2019). The first stage models low-level representations of human speech from text, whereas the second-stage model synthesizes raw waveforms with up to 24,000 samples per second and up to 16-bit fidelity. In this study, we focus on designing a second-stage model that efficiently synthesizes high-fidelity waveforms from mel-spectrograms

adjacent samples: 2200
Identifying long-term dependencies is the key for modeling realistic speech audio. For example, a phoneme duration can be longer than 100 ms, resulting in high correlation between more than 2200 adjacent samples in the raw waveform. This problem has been addressed in the previous study (Donahue et al, 2018) by increasing receptive fields of the generator and discriminator

unseen speakers: 9
4.3 Generalization to Unseen Speakers. We used 50 randomly selected utterances of nine unseen speakers in the VCTK dataset that were excluded from the training set for the MOS test. Table 3 shows the experimental results for the mel-spectrogram inversion of the unseen speakers

Reference
  • Mikołaj Binkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, and Karen Simonyan. High fidelity speech synthesis with adversarial networks. arXiv preprint arXiv:1909.11646, 2019.
  • Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
  • Keith Ito. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
  • Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
  • Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron C. Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems 32, pages 14910–14921, 2019.
  • Rithesh Kumar. descriptinc/melgan-neurips. https://github.com/descriptinc/melgan-neurips, 2019.
  • Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
  • Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with Transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6706–6713, 2019.
  • Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
  • Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
  • Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
  • Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
  • Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654, 2017.
  • Wei Ping, Kainan Peng, and Jitong Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
  • Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.
  • Tim Salimans and Durk P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.
  • Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
  • Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070, 2019.
  • Rafael Valle. NVIDIA/tacotron2. https://github.com/NVIDIA/tacotron2, 2018a.
  • Rafael Valle. NVIDIA/waveglow. https://github.com/NVIDIA/waveglow, 2018b.
  • Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit. University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
  • Ryuichi Yamamoto. wavenet_vocoder. https://github.com/r9y9/wavenet_vocoder/, 2018.
  • Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6199–6203. IEEE, 2020.
  • Bohan Zhai, Tianren Gao, Flora Xue, Daniel Rothchild, Bichen Wu, Joseph E. Gonzalez, and Kurt Keutzer. SqueezeWave: Extremely lightweight vocoders for on-device speech synthesis. arXiv preprint arXiv:2001.05685, 2020.
Author
Jungil Kong
Jaehyeon Kim
Jaekyoung Bae