DeepSinger: Singing Voice Synthesis with Data Mined From the Web
KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event..., pp. 1979–1989 (2020)
- Singing voice synthesis (SVS) [1, 20, 23, 28], which generates singing voices from lyrics, has attracted a lot of attention in both the research and industrial communities in recent years.
- Previous works on SVS include lyrics-to-singing alignment [6, 9, 11], parametric synthesis [1, 18], acoustic modeling [23, 26, 28], and adversarial synthesis [5, 14, 20].
- Although these systems achieve reasonably good performance, they typically require 1) a large amount of high-quality singing recordings as training data, and 2) strict data alignments between lyrics and singing audio for accurate singing modeling, both of which incur considerable data labeling cost.
- SVS systems are mostly inspired by TTS and follow the basic components of TTS, such as text-to-audio alignment, parametric acoustic modeling, and vocoding.
- We introduce the background of DeepSinger, including text to speech (TTS), singing voice synthesis (SVS), text-to-audio alignment, which is the key component of an SVS system, as well as other works that leverage training data mined from the Web.
- We introduce DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch by leveraging singing training data mined from music websites.
- DeepSinger consists of several data mining and modeling steps: data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling, in order to address the challenges of mining data from the Web.
- Experiments on our mined dataset demonstrate that the alignment model of DeepSinger achieves high alignment accuracy and the singing model generates voices with high pitch accuracy and voice naturalness.
- The authors conduct experimental studies on Chinese to analyze some specific designs in DeepSinger, including the effectiveness of the reference encoder, the benefits of leveraging TTS data for auxiliary training, and the influence of multilingual training on voice quality.
- The authors analyze the effectiveness of the reference encoder in DeepSinger in handling noisy training data from three perspectives.
- The MOS results show that a clean voice can be synthesized given clean reference audio, while a noisy reference leads to a noisy synthesized voice; this indicates that the reference encoder learns the characteristics of the reference audio, verifying the analyses in Section 3.3.
- The authors first describe the experimental setup, report the accuracy of the alignment model in DeepSinger, and evaluate the synthesized voices both quantitatively in terms of pitch accuracy and qualitatively in terms of mean opinion score (MOS).
- The authors have proposed DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch by leveraging singing training data mined from music websites.
- The authors will leverage more advanced neural-based vocoders such as WaveNet, and jointly train the singing model and vocoder for better voice quality.
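The mining-and-modeling pipeline described above (crawling, separation, alignment, filtration, modeling) can be sketched as a chain of stages. This is a minimal illustrative sketch only: the function names, data shapes, and stub logic are assumptions for exposition, not the authors' actual implementation.

```python
# Illustrative sketch of DeepSinger's data mining stages.
# All names and data layouts here are hypothetical.

def crawl(song_ids):
    """Data crawling: fetch (audio, lyrics) pairs from music websites."""
    return [{"id": s, "audio": f"mix_{s}.wav", "lyrics": f"lyrics_{s}.txt"}
            for s in song_ids]

def separate(songs):
    """Singing/accompaniment separation (e.g. with a tool like Spleeter)."""
    for s in songs:
        s["vocal"] = s["audio"].replace("mix", "vocal")
    return songs

def align(songs):
    """Lyrics-to-singing alignment: attach per-sentence time boundaries
    and an alignment confidence score (placeholder values here)."""
    for s in songs:
        s["segments"] = [(0.0, 2.5, "sentence-1"), (2.5, 5.0, "sentence-2")]
        s["align_score"] = 1.0
    return songs

def filter_data(songs, min_score=0.5):
    """Data filtration: drop songs whose alignment score is too low."""
    return [s for s in songs if s["align_score"] >= min_score]

def build_training_set(song_ids):
    """Chain the mining stages to produce singing-model training data."""
    return filter_data(align(separate(crawl(song_ids))))

data = build_training_set(["001", "002"])
print(len(data), data[0]["vocal"])  # → 2 vocal_001.wav
```

The point of the chained structure is that each stage only adds or removes fields on the song records, so stages can be improved or swapped independently.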
- Table 1: The statistics of the Singing-Wild dataset
- Table 2: The accuracy of the alignment model on the three languages, in terms of the sentence-level metric percentage of correct segments (PCS) and the word/character-level metric average absolute error (ASE). For ASE, we use word level for English and character level for Chinese and Cantonese
- Table 3: The pitch accuracy of DeepSinger and the corresponding upper bound
- Table 4: The MOS of DeepSinger with 95% confidence intervals on the three languages
- Table 5: The MOS with 95% confidence intervals for different methods
- Table 6: The MOS with 95% confidence intervals. We generate singing with different (clean, normal, and noisy) reference audios of the same singer. Ref and Syn MOS represent the MOS of the reference audio and the synthesized audio, respectively
- Table 7: The preference scores of the singing model with reference encoder (our model, denoted as DeepSinger) and the singing model with speaker embedding (denoted as DeepSinger w/o RE) when training with clean and noisy singing data
- Table 8: The preference scores of DeepSinger and DeepSinger (only TTS). We choose one TTS audio as the reference audio to generate singing voices on the Chinese singing test set
- Table 9: Model settings of the singing model in DeepSinger
- Table 10: The similarity scores of DeepSinger with 95% confidence intervals
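Several of the tables above report MOS with 95% confidence intervals. A standard way to compute such an interval from per-utterance listener ratings is the normal-approximation formula mean ± 1.96·s/√n; the sketch below assumes that convention (the paper does not specify its exact CI procedure).

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% CI.

    ratings: listener scores on the usual 1-5 scale.
    Returns (mean, half_width), to be read as mean +/- half_width.
    """
    n = len(ratings)
    mean = sum(ratings) / n
    # Sample variance with Bessel's correction.
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean, half_width

scores = [4, 5, 4, 3, 4, 5, 4, 4, 3, 4]
m, hw = mos_with_ci(scores)
print(f"MOS = {m:.2f} +/- {hw:.2f}")  # → MOS = 4.00 +/- 0.41
```

For the small rater counts typical of MOS studies, a Student's t critical value instead of z = 1.96 gives a slightly wider, more conservative interval.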
- This work was supported in part by the National Key R&D Program of China (Grant No. 2018AAA0100603), Zhejiang Natural Science Foundation (LR19F020006), National Natural Science Foundation of China (Grant Nos. 61836002, U1611461, and 61751209), and the Fundamental Research Funds for the Central Universities (2020QNA5024).
- This work was also partially funded by Microsoft Research Asia.
- Merlijn Blaauw and Jordi Bonada. 2017. A neural parametric singing synthesizer modeling timbre and expression from natural songs. Applied Sciences 7, 12 (2017), 1313.
- Paul Boersma et al. 2002. Praat, a system for doing phonetics by computer. Glot International 5 (2002).
- Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. 2009. Clueweb09 data set.
- Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning. 129–136.
- Pritish Chandna, Merlijn Blaauw, Jordi Bonada, and Emilia Gomez. 2019. WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN. arXiv preprint arXiv:1903.10729 (2019).
- Yu-Ren Chien, Hsin-Min Wang, and Shyh-Kang Jeng. 2016. Alignment of lyrics with accompanied singing audio based on acoustic-phonetic vowel likelihood modeling. TASLP 24, 11 (2016), 1998–2008.
- Dong Wang, Xuewei Zhang, and Zhiyong Zhang. 2015. THCHS-30: A Free Chinese Speech Corpus. http://arxiv.org/abs/1512.01882
- Georgi Dzhambazov et al. 2017. Knowledge-based probabilistic modeling for tracking lyrics in music audio signals. Ph.D. Dissertation. Universitat Pompeu Fabra.
- Hiromasa Fujihara, Masataka Goto, Jun Ogata, and Hiroshi G Okuno. 2011. LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics. IJSTSP 5, 6 (2011), 1252–1261.
- Daniel Griffin and Jae Lim. 1984. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32, 2 (1984), 236–243.
- Chitralekha Gupta, Rong Tong, Haizhou Li, and Ye Wang. 2018. Semi-supervised Lyrics and Solo-singing Alignment.. In ISMIR. 600–607.
- Chitralekha Gupta, Emre Yılmaz, and Haizhou Li. 2019. Acoustic Modeling for Automatic Lyrics-to-Audio Alignment. arXiv preprint arXiv:1906.10369 (2019).
- Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussalam. 2019. Spleeter: A fast and state-of-the art music source separation tool with pre-trained models. In Proc. International Society for Music Information Retrieval Conference.
- Yukiya Hono, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. 2019. Singing Voice Synthesis Based on Generative Adversarial Networks. In ICASSP 2019. IEEE, 6955–6959.
- Andrew J Hunt and Alan W Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In ICASSP 1996, Vol. 1. IEEE, 373–376.
- Herbert Jaeger. 2002. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. Vol. 5. GMD-Forschungszentrum.
- Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435 (2018).
- Juntae Kim, Heejin Choi, Jinuk Park, Sangjin Kim, Jongjin Kim, and Minsoo Hahn. 2018. Korean Singing Voice Synthesis System based on an LSTM Recurrent Neural Network. In INTERSPEECH 2018. ISCA.
- Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- Juheon Lee, Hyeong-Seok Choi, Chang-Bin Jeon, Junghyun Koo, and Kyogu Lee. 2019. Adversarially Trained End-to-end Korean Singing Voice Synthesis System. arXiv preprint arXiv:1908.01919 (2019).
- Hao Li, Yongguo Kang, and Zhenyu Wang. 2018. EMPHASIS: An emotional phoneme-based acoustic model for speech synthesis system. arXiv preprint arXiv:1806.09276 (2018).
- Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, and Luc Van Gool. 2017. Webvision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862 (2017).
- Peiling Lu, Jie Wu, Jian Luan, Xu Tan, and Li Zhou. 2020. XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System. arXiv preprint arXiv:2006.06261 (2020).
- Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. 2017. Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi. In Interspeech. 498–502.
- Annamaria Mesaros and Tuomas Virtanen. 2008. Automatic alignment of music audio and lyrics. In Proceedings of the 11th Int. Conference on Digital Audio Effects (DAFx-08).
- Kazuhiro Nakamura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. 2019. Singing voice synthesis based on convolutional neural networks. arXiv preprint arXiv:1904.06868 (2019).
- Thi Hao Nguyen. 2018. A Study on Correlates of Acoustic Features to Emotional Singing Voice Synthesis. (2018).
- Masanari Nishimura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. 2016. Singing Voice Synthesis Based on Deep Neural Networks.. In Interspeech. 2478–2482.
- Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
- Wei Ping, Kainan Peng, and Jitong Chen. 2019. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. In ICLR.
- Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. 2010. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval 13, 4 (2010), 346–374.
- Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv (2020), arXiv–2006.
- Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: Fast, Robust and Controllable Text to Speech. arXiv preprint arXiv:1905.09263 (2019).
- Bidisha Sharma, Chitralekha Gupta, Haizhou Li, and Ye Wang. 2019. Automatic Lyrics-to-audio Alignment on Polyphonic Music Using Singing-adapted Acoustic Models. In ICASSP 2019. IEEE, 396–400.
- Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al. 2018. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In ICASSP 2018. IEEE, 4779–4783.
- Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara. 2018. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In ICASSP 2018. IEEE, 4784–4788.
- Marti Umbert, Jordi Bonada, Masataka Goto, Tomoyasu Nakano, and Johan Sundberg. 2015. Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges. IEEE Signal Processing Magazine 32, 6 (2015), 55–73.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
- Zhizheng Wu, Oliver Watts, and Simon King. 2016. Merlin: An Open Source Neural Network Speech Synthesis System.. In SSW. 202–207.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems. 5754–5764.
- Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai. 2019. Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling. (2019).
- Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. 2019. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882 (2019).
- Liqiang Zhang, Chengzhu Yu, Heng Lu, Chao Weng, Yusong Wu, Xiang Xie, Zijin Li, and Dong Yu. 2019. Learning Singing From Speech. arXiv preprint arXiv:1912.10128 (2019).