
Parallel WaveNet: Fast High-Fidelity Speech Synthesis.

International Conference on Machine Learning (ICML), 2018


Abstract

The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting.

Introduction
  • Recent successes of deep learning go beyond achieving state-of-the-art results in research benchmarks, and push the frontiers in some of the most challenging real world applications such as speech recognition (Hinton et al, 2012), image recognition (Krizhevsky et al, 2012; Szegedy et al, 2015), and machine translation (Wu et al, 2016).
  • Modelling raw audio signals, as WaveNet does, represents a particularly extreme form of autoregression, with up to 24,000 samples predicted per second
  • Operating at such a high temporal resolution is not problematic during network training, where the complete sequence of input samples is already available and—thanks to the convolutional structure of the network—can be processed in parallel.
  • During generation, however, each input sample must be drawn from the output distribution before it can be passed in as input at the next time step, making parallel processing impossible
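The sequential dependency described in these bullets can be sketched with a toy autoregressive sampler (a hypothetical stand-in for the network, not the paper's model): each output must be drawn before the next step's input exists, so the time loop cannot be parallelised.

```python
import random

def toy_autoregressive_sample(n_samples, seed=0):
    """Toy autoregressive generation: sample t depends on samples < t,
    so the loop is inherently sequential."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Stand-in "network": condition on the last few generated samples.
        context = samples[-3:]
        mean = sum(context) / len(context) if context else 0.0
        samples.append(0.9 * mean + 0.1 * rng.uniform(-1.0, 1.0))
    return samples

audio = toy_autoregressive_sample(100)
```

At 24,000 samples per second, even one second of audio means tens of thousands of strictly ordered steps, which is why a naive sampling loop cannot exploit parallel hardware.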
Highlights
  • Recent successes of deep learning go beyond achieving state-of-the-art results in research benchmarks, and push the frontiers in some of the most challenging real world applications such as speech recognition (Hinton et al, 2012), image recognition (Krizhevsky et al, 2012; Szegedy et al, 2015), and machine translation (Wu et al, 2016)
  • We present a new algorithm for distilling WaveNet into a feed-forward neural network which can synthesise high quality speech much more efficiently, and is deployed to millions of users
  • The goal of this paper is to marry the best features of both models: the efficient training of WaveNet and the efficient sampling of inverse autoregressive flow (IAF) networks
  • In this paper we have introduced a novel method for high-fidelity speech synthesis based on WaveNet using Probability Density Distillation
  • The proposed model achieved several orders of magnitude speed-up compared to the original WaveNet with no significant difference in quality
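A minimal sketch of why IAF sampling parallelises, with hypothetical shift/scale functions standing in for the student network: each output x_t depends only on the noise z and on earlier noise values, never on earlier outputs, so all time steps can be computed at once given z.

```python
def iaf_transform(z, shift_fn, scale_fn):
    """One inverse-autoregressive-flow step: x_t = z_t * s(z_<t) + m(z_<t).
    Every list element is independent of the others, so this comprehension
    could run in parallel across t."""
    return [z[t] * scale_fn(z[:t]) + shift_fn(z[:t]) for t in range(len(z))]

# Hypothetical stand-ins for the student network's outputs:
shift = lambda prev: 0.1 * sum(prev)
scale = lambda prev: 1.0 + 0.01 * len(prev)

x = iaf_transform([0.5, -0.5, 0.25], shift, scale)
```

Training such a flow by maximum likelihood would be sequential instead, which is why the paper trains it by distillation from a WaveNet teacher rather than directly on data.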
Methods
  • In all the experiments the authors used text-to-speech models that were conditioned on linguistic features (similar to (van den Oord et al, 2016a)), providing phonetic and duration information to the network.
  • It is important to note that the difference between the WaveNet baseline MOS of 4.41 and the previously reported result of 4.21 (van den Oord et al, 2016a) is due to the improvement in audio fidelity explained in Section 2.1: modelling a sample rate of 24kHz instead of 16kHz and a bit depth of 16-bit PCM instead of 8-bit μ-law
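The fidelity point about 8-bit μ-law versus 16-bit PCM can be illustrated with the standard μ-law companding formula (this sketch is ours, not code from the paper):

```python
import math

MU = 255  # companding constant for 8-bit μ-law

def mu_law_encode(x):
    """Compand a sample in [-1, 1] before coarse quantisation."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_decode(y):
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize(y, levels):
    """Round to the nearest of `levels` uniform steps spanning [-1, 1]."""
    step = 2.0 / (levels - 1)
    return round(y / step) * step

x = 0.1
err_8bit_mulaw = abs(mu_law_decode(quantize(mu_law_encode(x), 256)) - x)
err_16bit_pcm = abs(quantize(x, 65536) - x)
```

16-bit linear PCM has 65,536 levels against μ-law's 256, so its quantisation error is orders of magnitude smaller, which contributes to the MOS improvement noted above.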
Results
  • This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality.
  • The proposed model achieved several orders of magnitude speed-up compared to the original WaveNet with no significant difference in quality
Conclusion
  • In this paper the authors have introduced a novel method for high-fidelity speech synthesis based on WaveNet (van den Oord et al, 2016a) using Probability Density Distillation.
  • The proposed model achieved several orders of magnitude speed-up compared to the original WaveNet with no significant difference in quality.
  • The authors believe that the same method presented here can be used in many different domains to achieve similar speed improvements whilst maintaining output accuracy
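The core of Probability Density Distillation can be sketched as a Monte-Carlo KL estimate, with hypothetical Gaussian log-densities standing in for the student and teacher networks: samples are drawn from the student and scored under both models.

```python
import math

def distillation_kl_estimate(student_logp, teacher_logp, samples):
    """KL(student || teacher) estimated over samples drawn from the student:
    mean of log p_student(x) - log p_teacher(x)."""
    return sum(student_logp(x) - teacher_logp(x) for x in samples) / len(samples)

def gauss_logp(mu, sigma):
    # Hypothetical closed-form density standing in for a network's output.
    return lambda x: (-0.5 * ((x - mu) / sigma) ** 2
                      - math.log(sigma * math.sqrt(2.0 * math.pi)))

student = gauss_logp(0.0, 1.0)
teacher = gauss_logp(0.0, 1.0)
kl = distillation_kl_estimate(student, teacher, [-1.0, 0.0, 0.5])
```

When the student matches the teacher the estimate is zero; in training, the teacher stays frozen while gradient descent drives this quantity down for the student.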
Tables
  • Table 1: Comparison of WaveNet distillation with the autoregressive teacher WaveNet, unit-selection (concatenative), and previous results (1) from (van den Oord et al, 2016a). MOS stands for Mean Opinion Score
  • Table 2: Comparison of MOS scores on English and Japanese with multi-speaker distilled WaveNets. Note that some speakers sounded less appealing to listeners and consistently received lower MOS; nevertheless, the distilled parallel WaveNet always achieved significantly better results
  • Table 3: Performance with respect to different combinations of loss terms. We report preference comparison scores since the mean opinion scores tend to be very close and inconclusive. The last row (combination of KL + Power + Perceptual + Contrastive losses) is the default model used
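The loss combinations in Table 3 amount to a weighted sum of terms; a trivial sketch (both the function and its default weights are hypothetical placeholders, not values from the paper):

```python
def combined_loss(kl, power, perceptual, contrastive,
                  weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four distillation loss terms."""
    w_kl, w_pow, w_perc, w_con = weights
    return w_kl * kl + w_pow * power + w_perc * perceptual + w_con * contrastive
```

Zeroing individual weights recovers the ablated rows of the table; the full combination corresponds to the default model.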
Study subjects and analysis
Samples per second: 24,000
WaveNet is one of a family of autoregressive deep generative models that have been applied with great success to data as diverse as text (Mikolov et al, 2010), images (Larochelle & Murray, 2011; Theis & Bethge, 2015; van den Oord et al, 2016c;b), video (Kalchbrenner et al, 2016), handwriting (Graves, 2013) as well as human speech and music. Modelling raw audio signals, as WaveNet does, represents a particularly extreme form of autoregression, with up to 24,000 samples predicted per second. Operating at such a high temporal resolution is not problematic during network training, where the complete sequence of input samples is already available and—thanks to the convolutional structure of the network—can be processed in parallel

Samples per second: 16,000
A version of WaveNet that generates in real-time has been developed (Paine et al, 2016), but it required the use of a much smaller network, resulting in severely degraded quality. Raw audio data is typically very high-dimensional (e.g. 16,000 samples per second for 16kHz audio), and contains complex, hierarchical structures spanning many thousands of time steps, such as words in speech or melodies in music. Modelling such long-term dependencies with standard causal convolution layers would require a very deep network to ensure a sufficiently broad receptive field
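The receptive-field arithmetic behind that last observation can be sketched directly (kernel size 2, as in WaveNet's dilated causal convolutions; the layer counts here are illustrative):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked causal convolutions:
    1 + sum of (kernel_size - 1) * d over each layer's dilation d."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Ten layers with kernel size 2: doubling dilations vs. no dilation.
dilated = receptive_field(2, [2 ** i for i in range(10)])  # dilations 1..512
undilated = receptive_field(2, [1] * 10)
```

Doubling dilations grow the receptive field exponentially with depth, which is how WaveNet covers thousands of time steps without an impractically deep stack.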

References
  • Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
  • Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
  • Gatys, L. A., Ecker, A. S., and Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
  • Germain, M., Gregor, K., Murray, I., and Larochelle, H. MADE: Masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 881–889, 2015.
  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • Gu, J., Bradbury, J., Xiong, C., Li, V. O., and Socher, R. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.
  • Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., and Kavukcuoglu, K. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
  • Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma, D. P., Salimans, T., and Welling, M. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.
  • Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
  • Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 29–37, 2011.
  • Mikolov, T., Karafiát, M., Burget, L., Cernocky, J., and Khudanpur, S. Recurrent neural network based language model. In Interspeech, volume 2, pp. 3, 2010.
  • Paine, T. L., Khorrami, P., Chang, S., Zhang, Y., Ramachandran, P., Hasegawa-Johnson, M. A., and Huang, T. S. Fast WaveNet generation algorithm. CoRR, abs/1611.09482, 2016. URL http://arxiv.org/abs/1611.09482.
  • Papamakarios, G., Murray, I., and Pavlakou, T. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2335–2344, 2017.
  • Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  • Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
  • Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
  • Sønderby, C. K., Caballero, J., Theis, L., Shi, W., and Huszár, F. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
  • Theis, L. and Bethge, M. Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems, pp. 1927–1935, 2015.
  • van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
  • van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016b.
  • van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016c.
  • Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
Authors
Yazhe Li
Igor Babuschkin
George van den Driessche