WaveNet: A Generative Model for Raw Audio

SSW, 2016.

Abstract:

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be used as a discriminative model, returning promising results for phoneme recognition.

Introduction
  • This work explores raw audio generation techniques, inspired by recent advances in neural autoregressive generative models that model complex distributions such as images (van den Oord et al, 2016a;b) and text (Jozefowicz et al, 2016).
  • Modeling joint probabilities over pixels or words using neural architectures as products of conditional distributions yields state-of-the-art generation (the factorization is written out after this list).
  • These architectures are able to model distributions over thousands of random variables (e.g. 64×64 pixels as in PixelRNN (van den Oord et al, 2016a)).
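
The modelling approach referred to above factorizes the joint probability of a waveform x = {x_1, ..., x_T} into a product of per-sample conditionals, as in the paper:

    p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})

Each conditional is a categorical distribution over quantized sample values, produced by a softmax output layer.
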
Highlights
  • This work explores raw audio generation techniques, inspired by recent advances in neural autoregressive generative models that model complex distributions such as images (van den Oord et al, 2016a;b) and text (Jozefowicz et al, 2016)
  • The question this paper addresses is whether similar approaches can succeed in generating wideband raw audio waveforms, which are signals with very high temporal resolution, at least 16,000 samples per second.
  • In order to deal with the long-range temporal dependencies needed for raw audio generation, we develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields (a sketch of such a convolution stack follows this list).
  • In this paper we introduce a new generative model operating directly on the raw audio waveform.
  • This paper has presented WaveNet, a deep generative model of audio data that operates directly at the waveform level.
  • WaveNets showed very promising results when applied to music audio modelling and speech recognition.
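
As a concrete illustration of the dilated causal convolutions mentioned above, the sketch below stacks convolutions with the dilation pattern 1, 2, 4, ..., 512 described in the paper. It assumes PyTorch; the channel width is arbitrary, and the paper's gated activation units and residual/skip connections are replaced by a plain ReLU, so it is an illustration rather than the authors' implementation.

    # Minimal sketch of a dilated causal convolution stack (assumes PyTorch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv1d(nn.Module):
        """1-D convolution that never sees future samples."""
        def __init__(self, channels, kernel_size=2, dilation=1):
            super().__init__()
            self.left_pad = (kernel_size - 1) * dilation
            self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

        def forward(self, x):                    # x: (batch, channels, time)
            x = F.pad(x, (self.left_pad, 0))     # pad the past only, never the future
            return self.conv(x)

    class DilatedStack(nn.Module):
        """Dilation doubles each layer, so the receptive field grows exponentially with depth."""
        def __init__(self, channels=32, dilations=(1, 2, 4, 8, 16, 32, 64, 128, 256, 512)):
            super().__init__()
            self.layers = nn.ModuleList(CausalConv1d(channels, dilation=d) for d in dilations)

        def forward(self, x):
            for layer in self.layers:
                x = torch.relu(layer(x))         # stands in for the paper's gated activation unit
            return x

Passing a tensor of shape (1, 32, 16000) through DilatedStack() leaves the time dimension unchanged, while each output step depends on up to 1,024 past input steps.
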
Methods
  • To measure WaveNet’s audio modelling performance, the authors evaluate it on three different tasks: multispeaker speech generation, TTS, and music audio modelling.
  • For the first experiment the authors looked at free-form speech generation.
  • Because the model is not conditioned on text, it generates non-existent but human-language-like words in a smooth way, with realistic-sounding intonations (a free-running sampling loop is sketched after this list).
  • This is similar to generative models of language or images, where samples look realistic at first glance, but are clearly unnatural upon closer inspection.
  • The lack of long-range coherence is partly due to the limited size of the model’s receptive field, which means it can only remember the last 2–3 phonemes it produced.
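
The free-running generation described above is sequential: each new sample is drawn from the model's predictive distribution and fed back in as input. The loop below is a hypothetical sketch of that procedure (PyTorch, one-hot inputs over 256 μ-law classes, and an assumed model that maps a (batch, classes, time) tensor to per-step logits); it is not the authors' generation code.

    # Naive autoregressive sampling: draw the next sample from the softmax,
    # append it to the context window, repeat. Real decoders cache activations
    # rather than re-running the whole network at every step.
    import torch

    @torch.no_grad()
    def generate(model, n_samples, receptive_field, n_classes=256):
        context = torch.zeros(1, n_classes, receptive_field)
        context[:, n_classes // 2, :] = 1.0                    # start from mu-law "silence" (index 128)
        samples = []
        for _ in range(n_samples):
            logits = model(context)[:, :, -1]                  # logits for the next time step
            probs = torch.softmax(logits, dim=-1)
            idx = torch.multinomial(probs, num_samples=1).item()
            one_hot = torch.zeros(1, n_classes, 1)
            one_hot[0, idx, 0] = 1.0
            context = torch.cat([context[:, :, 1:], one_hot], dim=2)
            samples.append(idx)
        return samples                                         # quantized mu-law indices
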
Conclusion
  • This paper has presented WaveNet, a deep generative model of audio data that operates directly at the waveform level.
  • WaveNets are autoregressive and combine causal filters with dilated convolutions to allow their receptive fields to grow exponentially with depth, which is important to model the long-range temporal dependencies in audio signals (a worked receptive-field computation follows this list).
  • When applied to TTS, WaveNets produced samples that outperform the current best TTS systems in subjective naturalness.
  • WaveNets showed very promising results when applied to music audio modelling and speech recognition.
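
As a worked example of the exponential growth noted above: with kernel size 2, a layer with dilation d extends the receptive field by d samples, so a stack with dilations 1, 2, 4, ..., 512 covers 1,024 samples, and repeating that stack adds another 1,023 samples per repetition. The arithmetic below is illustrative and does not describe the exact configuration of any deployed model.

    # Receptive field of stacked dilated causal convolutions (kernel size 2).
    def receptive_field(dilations, kernel_size=2):
        return 1 + sum((kernel_size - 1) * d for d in dilations)

    one_stack = [2 ** i for i in range(10)]      # dilations 1, 2, 4, ..., 512
    print(receptive_field(one_stack))            # 1024 samples
    print(receptive_field(one_stack * 3))        # 3070 samples, roughly 0.19 s at 16 kHz
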
Tables
  • Table 1: Subjective 5-scale mean opinion scores of speech samples from LSTM-RNN-based statistical parametric, HMM-driven unit selection concatenative, and proposed WaveNet-based speech synthesizers, 8-bit μ-law encoded natural speech, and 16-bit linear pulse-code modulation (PCM) natural speech. WaveNet improved the previous state of the art significantly, reducing the gap between natural speech and the best previous model by more than 50% (μ-law companding is sketched after this list).
  • Table 2: Subjective preference scores of speech samples between LSTM-RNN-based statistical parametric (LSTM), HMM-driven unit selection concatenative (Concat), and proposed WaveNet-based speech synthesizers. Each row of the table gives the scores of a paired comparison test between two synthesizers. Scores of synthesizers that were significantly better than their competitors at the p < 0.01 level are shown in bold. Note that WaveNet (L) and WaveNet (L+F) denote WaveNet conditioned on linguistic features only and WaveNet conditioned on both linguistic features and F0 values, respectively.
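
For reference, the 8-bit μ-law encoding in Table 1 is the ITU-T G.711 companding transform f(x) = sign(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255, which the paper applies before quantizing each sample to 256 levels. The numpy sketch below is illustrative only, not the authors' preprocessing code.

    # mu-law companding to 8 bits and back; x is assumed to lie in [-1, 1].
    import numpy as np

    def mu_law_encode(x, mu=255):
        y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand to [-1, 1]
        return np.round((y + 1) / 2 * mu).astype(np.uint8)         # quantize to 0..255

    def mu_law_decode(q, mu=255):
        y = 2 * (q.astype(np.float64) / mu) - 1                    # back to [-1, 1]
        return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu       # expand

Reconstructing audio from these 256 levels is lossy, which is why Table 1 reports 8-bit μ-law encoded natural speech separately from 16-bit PCM natural speech.
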
References
  • Agiomyrgiannakis, Yannis. Vocaine the vocoder and applications in speech synthesis. In ICASSP, pp. 4230–4234, 2015.
  • Bishop, Christopher M. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.
  • Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015. URL http://arxiv.org/abs/1412.7062.
  • Chiba, Tsutomu and Kajiyama, Masato. The Vowel: Its Nature and Structure. Tokyo-Kaiseikan, 1942.
  • Dudley, Homer. Remaking speech. The Journal of the Acoustical Society of America, 11(2):169–177, 1939.
  • Dutilleux, Pierre. An implementation of the “algorithme à trous” to compute the wavelet transform. In Combes, Jean-Michel, Grossmann, Alexander, and Tchamitchian, Philippe (eds.), Wavelets: Time-Frequency Methods and Phase Space, pp. 298–304. Springer Berlin Heidelberg, 1989.
  • Fan, Yuchen, Qian, Yao, Xie, Feng-Long, and Soong, Frank K. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Interspeech, pp. 1964–1968, 2014.
  • Fant, Gunnar. Acoustic Theory of Speech Production. Mouton De Gruyter, 1970.
  • Garofolo, John S., Lamel, Lori F., Fisher, William M., Fiscus, Jonathon G., and Pallett, David S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report, 93, 1993.
  • Gonzalvo, Xavi, Tazari, Siamak, Chan, Chun-an, Becker, Markus, Gutkin, Alexander, and Silen, Hanna. Recent advances in Google real-time HMM-driven unit selection synthesizer. In Interspeech, 2016. URL http://research.google.com/pubs/pub45564.html.
  • He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.
  • Holschneider, Matthias, Kronland-Martinet, Richard, Morlet, Jean, and Tchamitchian, Philippe. A real-time algorithm for signal analysis with the help of the wavelet transform. In Combes, Jean-Michel, Grossmann, Alexander, and Tchamitchian, Philippe (eds.), Wavelets: Time-Frequency Methods and Phase Space, pp. 286–297. Springer Berlin Heidelberg, 1989.
  • Hoshen, Yedid, Weiss, Ron J., and Wilson, Kevin W. Speech acoustic modeling from raw multichannel waveforms. In ICASSP, pp. 4624–4628. IEEE, 2015.
  • Hunt, Andrew J. and Black, Alan W. Unit selection in a concatenative speech synthesis system using a large speech database. In ICASSP, pp. 373–376, 1996.
  • Imai, Satoshi and Furuichi, Chieko. Unbiased estimation of log spectrum. In EURASIP, pp. 203–206, 1988.
  • Itakura, Fumitada. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoustical Society of America, 57(S1):S35–S35, 1975.
  • Itakura, Fumitada and Saito, Shuzo. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53A:35–42, 1970.
  • ITU-T. Recommendation G.711. Pulse Code Modulation (PCM) of voice frequencies, 1988.
  • Jozefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016. URL http://arxiv.org/abs/1602.02410.
  • Juang, Biing-Hwang and Rabiner, Lawrence. Mixture autoregressive hidden Markov models for speech signals. IEEE Trans. Acoust. Speech Signal Process., pp. 1404–1413, 1985.
  • Kameoka, Hirokazu, Ohishi, Yasunori, Mochihashi, Daichi, and Le Roux, Jonathan. Speech analysis with multi-kernel linear prediction. In Spring Conference of ASJ, pp. 499–502, 2010. (in Japanese).
  • Karaali, Orhan, Corrigan, Gerald, Gerson, Ira, and Massey, Noel. Text-to-speech conversion with neural networks: A recurrent TDNN approach. In Eurospeech, pp. 561–564, 1997.
  • Kawahara, Hideki, Masuda-Katsuse, Ikuyo, and de Cheveigne, Alain. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commn., 27:187–207, 1999.
  • Kawahara, Hideki, Estill, Jo, and Fujimura, Osamu. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT. In MAVEBA, pp. 13–15, 2001.
  • Law, Edith and Von Ahn, Luis. Input-agreement: a new mechanism for collecting data using human computation games. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1197–1206. ACM, 2009.
  • Maia, Ranniery, Zen, Heiga, and Gales, Mark J. F. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In ISCA SSW7, pp. 88–93, 2010.
  • Morise, Masanori, Yokomori, Fumiya, and Ozawa, Kenji. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst., E99-D(7):1877–1884, 2016.
  • Moulines, Eric and Charpentier, Francis. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commn., 9:453–467, 1990.
  • Muthukumar, P. and Black, Alan W. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014.
  • Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814, 2010.
  • Nakamura, Kazuhiro, Hashimoto, Kei, Nankaku, Yoshihiko, and Tokuda, Keiichi. Integration of spectral feature extraction and modeling for HMM-based speech synthesis. IEICE Trans. Inf. Syst., E97-D(6):1438–1448, 2014.
  • Palaz, Dimitri, Collobert, Ronan, and Magimai-Doss, Mathew. Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. In Interspeech, pp. 1766–1770, 2013.
  • Peltonen, Sari, Gabbouj, Moncef, and Astola, Jaakko. Nonlinear filter design: methodologies and challenges. In IEEE ISPA, pp. 102–107, 2001.
  • Poritz, Alan B. Linear predictive hidden Markov models and the speech signal. In ICASSP, pp. 1291–1294, 1982.
  • Rabiner, Lawrence and Juang, Biing-Hwang. Fundamentals of Speech Recognition. Prentice-Hall, 1993.
  • Sagisaka, Yoshinori, Kaiki, Nobuyoshi, Iwahashi, Naoto, and Mimura, Katsuhiko. ATR ν-talk speech synthesis system. In ICSLP, pp. 483–486, 1992.
  • Sainath, Tara N., Weiss, Ron J., Senior, Andrew, Wilson, Kevin W., and Vinyals, Oriol. Learning the speech front-end with raw waveform CLDNNs. In Interspeech, pp. 1–5, 2015.
  • Takaki, Shinji and Yamagishi, Junichi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In ICASSP, pp. 5535–5539, 2016.
  • Takamichi, Shinnosuke, Toda, Tomoki, Black, Alan W., Neubig, Graham, Sakriani, Sakti, and Nakamura, Satoshi. Postfilters to modify the modulation spectrum for statistical parametric speech synthesis. IEEE/ACM Trans. Audio Speech Lang. Process., 24(4):755–767, 2016.
  • Theis, Lucas and Bethge, Matthias. Generative image modeling using spatial LSTMs. In NIPS, pp. 1927–1935, 2015.
  • Toda, Tomoki and Tokuda, Keiichi. A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Inf. Syst., E90-D(5):816–824, 2007.
  • Toda, Tomoki and Tokuda, Keiichi. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In ICASSP, pp. 3925–3928, 2008.
  • Tokuda, Keiichi. Speech synthesis as a statistical machine learning problem. http://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf, 2011. Invited talk given at ASRU.
  • Tokuda, Keiichi and Zen, Heiga. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In ICASSP, pp. 4215–4219, 2015.
  • Tokuda, Keiichi and Zen, Heiga. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In ICASSP, pp. 5640–5644, 2016.
  • Tuerk, Christine and Robinson, Tony. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pp. 1713–1716, 1993.
  • Tuske, Zoltan, Golik, Pavel, Schluter, Ralf, and Ney, Hermann. Acoustic modeling with deep neural networks using raw time signal for LVCSR. In Interspeech, pp. 890–894, 2014.
  • Uria, Benigno, Murray, Iain, Renals, Steve, Valentini-Botinhao, Cassia, and Bridle, John. Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE. In ICASSP, pp. 4465–4469, 2015.
  • van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016a.
  • van den Oord, Aaron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. CoRR, abs/1606.05328, 2016b. URL http://arxiv.org/abs/1606.05328.
  • Wu, Yi-Jian and Tokuda, Keiichi. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Interspeech, pp. 577–580, 2008.
  • Yamagishi, Junichi. English multi-speaker corpus for CSTR voice cloning toolkit, 2012. URL http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html.
  • Yoshimura, Takayoshi. Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems. PhD thesis, Nagoya Institute of Technology, 2002.
  • Yu, Fisher and Koltun, Vladlen. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. URL http://arxiv.org/abs/1511.07122.
  • Zen, Heiga. An example of context-dependent label format for HMM-based speech synthesis in English, 2006. URL http://hts.sp.nitech.ac.jp/?Download.