
DeepSinger: Singing Voice Synthesis with Data Mined From the Web

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, pp. 1979–1989 (2020)


Abstract

In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling.

Introduction
  • Singing voice synthesis (SVS) [1, 20, 23, 28], which generates singing voices from lyrics, has attracted much attention in both the research and industrial communities in recent years.
  • Previous works on SVS include lyrics-to-singing alignment [6, 9, 11], parametric synthesis [1, 18], acoustic modeling [23, 26, 28], and adversarial synthesis [5, 14, 20]
  • Although these systems achieve reasonably good performance, they typically require 1) a large amount of high-quality singing recordings as training data, and 2) strict alignments between lyrics and singing audio for accurate singing modeling, both of which incur considerable data-labeling cost.
  • SVS systems are mostly inspired by TTS and follow its basic components, such as text-to-audio alignment, parametric acoustic modeling, and a vocoder
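The last of those components, the vocoder, converts predicted spectrograms back into waveforms. As a concrete illustration of that step, here is a minimal Python sketch using the classic Griffin-Lim algorithm [10] via librosa; the STFT parameters and the synthetic test tone are illustrative assumptions, not the paper's settings.

# A minimal vocoder sketch: recover a waveform from a magnitude
# spectrogram with Griffin-Lim [10]. All parameters are illustrative.
import numpy as np
import librosa

N_FFT, HOP = 1024, 256

# A synthetic one-second A4 sine tone stands in for real singing audio.
sr = 22050
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Analysis: compute the STFT and keep only the magnitude (phase is lost).
mag = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))

# Synthesis: Griffin-Lim iteratively re-estimates a consistent phase and
# inverts the spectrogram back to a time-domain waveform.
y_hat = librosa.griffinlim(mag, n_iter=60, hop_length=HOP)
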
Highlights
  • Singing voice synthesis (SVS) [1, 20, 23, 28], which generates singing voices from lyrics, has attracted much attention in both the research and industrial communities in recent years
  • We introduce the background of DeepSinger, including text-to-speech (TTS), singing voice synthesis (SVS), text-to-audio alignment, which is a key component of SVS systems, as well as other works that leverage training data mined from the Web
  • We introduce DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch by leveraging singing training data mined from music websites
  • DeepSinger consists of several data mining and modeling steps, namely data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling (sketched below), in order to address the challenges of mining data from the Web
  • Experiments on our mined dataset demonstrate that the alignment model of DeepSinger achieves high alignment accuracy and the singing model generates voices with high pitch accuracy and voice naturalness
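To make the ordering of those five stages concrete, the Python skeleton below encodes the pipeline as described; every helper, type, and threshold is a hypothetical placeholder, since this summary states the stages but not their implementation.

# A high-level skeleton of the five-stage pipeline. All helpers are
# hypothetical placeholders mirroring the stated stage order only.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    audio_path: str      # sentence-level vocal clip
    lyrics: str          # aligned lyric text
    confidence: float    # alignment confidence, used for filtration

def crawl_song(url: str) -> Tuple[bytes, str]:      # 1. data crawling
    raise NotImplementedError

def separate_vocals(audio: bytes) -> bytes:         # 2. singing/accompaniment
    raise NotImplementedError                       #    separation (the paper
                                                    #    uses Spleeter [13])

def align_lyrics(vocals: bytes, lyrics: str) -> List[Segment]:
    raise NotImplementedError                       # 3. lyrics-to-singing alignment

def build_singing_dataset(song_urls: List[str],
                          min_confidence: float = 0.9) -> List[Segment]:
    """Run stages 1-4; stage 5 (singing modeling) trains on the result."""
    dataset: List[Segment] = []
    for url in song_urls:
        audio, lyrics = crawl_song(url)
        vocals = separate_vocals(audio)
        segments = align_lyrics(vocals, lyrics)
        # 4. data filtration: keep only confidently aligned segments
        #    (the 0.9 threshold is an illustrative assumption).
        dataset += [s for s in segments if s.confidence >= min_confidence]
    return dataset
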
Methods
  • The authors conduct experimental studies on Chinese data to analyze some specific designs in DeepSinger, including the effectiveness of the reference encoder, the benefits of leveraging TTS data for auxiliary training, and the influence of multilingual training on voice quality.
  • The authors analyze the effectiveness of the reference encoder in DeepSinger in handling noisy training data from three perspectives.
  • According to the MOS, clean reference audio yields clean synthesized voices while noisy reference audio yields noisy synthesized voices, which indicates that the reference encoder learns the characteristics of the reference audio, verifying the analyses in Section 3.3
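As a rough illustration of how such a reference encoder can work, the PyTorch sketch below compresses a reference mel-spectrogram into a fixed-size embedding that can then condition the singing decoder, so that timbre (and noise level) follow the reference audio. The architecture and layer sizes here are assumptions, not the configuration reported in Table 9.

# A minimal reference-encoder sketch in PyTorch. Sizes are illustrative.
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, embed_dim: int = 256):
        super().__init__()
        # Convolutions extract local spectral features from the reference.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
        )
        # A GRU summarizes the whole clip into one hidden state.
        self.gru = nn.GRU(256, embed_dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> embedding: (batch, embed_dim)
        h = self.conv(mel.transpose(1, 2)).transpose(1, 2)
        _, last = self.gru(h)
        return last.squeeze(0)

ref_mel = torch.randn(4, 400, 80)        # 4 reference clips, 400 frames each
embedding = ReferenceEncoder()(ref_mel)  # (4, 256), added to decoder inputs
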
Results
  • The authors first describe the experimental setup, report the accuracy of the alignment model in DeepSinger, and evaluate the synthesized voices both quantitatively, in terms of pitch accuracy, and qualitatively, in terms of mean opinion score (MOS).
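This summary does not spell out how pitch accuracy is computed, so the sketch below assumes one common formulation: the fraction of frames, voiced in both contours, whose synthesized F0 lies within 50 cents of the ground-truth F0. Both the tolerance and the toy contours are assumptions.

# Pitch accuracy under an assumed definition: fraction of voiced frames
# whose F0 deviates from the reference by less than a cent tolerance.
import numpy as np

def pitch_accuracy(f0_ref: np.ndarray, f0_syn: np.ndarray,
                   tolerance_cents: float = 50.0) -> float:
    voiced = (f0_ref > 0) & (f0_syn > 0)   # compare voiced frames only
    cents = 1200.0 * np.abs(np.log2(f0_syn[voiced] / f0_ref[voiced]))
    return float(np.mean(cents < tolerance_cents))

# Example: a synthesized contour that drifts randomly around the reference
# by up to 0.06 octaves (72 cents), so some frames fall outside 50 cents.
f0_ref = np.full(100, 220.0)               # 100 frames of A3
f0_syn = f0_ref * 2.0 ** np.random.uniform(-0.06, 0.06, 100)
print(f"pitch accuracy: {pitch_accuracy(f0_ref, f0_syn):.2%}")
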
Conclusion
  • The authors have proposed DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch by leveraging singing training data mined from music websites.
  • The authors will leverage more advanced neural vocoders such as WaveNet [29], and jointly train the singing model and the vocoder for better voice quality
Tables
  • Table 1: The statistics of the Singing-Wild dataset
  • Table 2: The accuracy of the alignment model on the three languages, in terms of the sentence-level metric, percentage of correct segments (PCS), and the word/character-level metric, average absolute error (ASE). For ASE, word level is used for English and character level for Chinese and Cantonese (a sketch of both metrics follows this list)
  • Table 3: The pitch accuracy of DeepSinger and the corresponding upper bound
  • Table 4: The MOS of DeepSinger with 95% confidence intervals on the three languages
  • Table 5: The MOS with 95% confidence intervals for different methods
  • Table 6: The MOS with 95% confidence intervals. We generate singing with different (clean, normal, and noisy) reference audio clips of the same singer. Ref and Syn MOS denote the MOS of the reference audio and the synthesized audio, respectively
  • Table 7: The preference scores of the singing model with the reference encoder (our model, denoted as DeepSinger) and the singing model with a speaker embedding (denoted as DeepSinger w/o RE) when training with clean and noisy singing data
  • Table 8: The preference scores of DeepSinger and DeepSinger (only TTS). We choose one TTS audio clip as the reference audio to generate singing voices on the Chinese singing test set
  • Table 9: Model settings of the singing model in DeepSinger
  • Table 10: The similarity scores of DeepSinger with 95% confidence intervals
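As noted above, here is a small sketch of Table 2's two alignment metrics under assumed definitions: a sentence-level segment counts as correct (PCS) when both of its predicted boundaries fall within a fixed tolerance of the ground truth, and ASE is the mean absolute offset of word/character boundary times. The 0.5-second tolerance and the toy boundaries are assumptions, not the paper's protocol.

# Assumed definitions of the Table 2 metrics, for illustration only.
import numpy as np

def pcs(pred: np.ndarray, true: np.ndarray, tol: float = 0.5) -> float:
    """Percentage of correct segments: both the start and the end of a
    predicted segment must lie within `tol` seconds of the ground truth."""
    correct = np.all(np.abs(pred - true) <= tol, axis=1)
    return float(np.mean(correct))

def ase(pred_bounds: np.ndarray, true_bounds: np.ndarray) -> float:
    """Average absolute error of word/character boundary times (seconds)."""
    return float(np.mean(np.abs(pred_bounds - true_bounds)))

# pred/true: (n_segments, 2) arrays of [start, end] times in seconds.
pred = np.array([[0.0, 2.1], [2.4, 4.0]])
true = np.array([[0.1, 2.0], [2.5, 4.2]])
print(f"PCS: {pcs(pred, true):.0%}, ASE: {ase(pred.ravel(), true.ravel()):.2f}s")
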
Funding
  • This work was supported in part by the National Key R&D Program of China (Grant No. 2018AAA0100603), the Zhejiang Natural Science Foundation (LR19F020006), the National Natural Science Foundation of China (Grant Nos. 61836002, U1611461, and 61751209), and the Fundamental Research Funds for the Central Universities (2020QNA5024)
  • This work was also partially funded by Microsoft Research Asia
Reference
  • Merlijn Blaauw and Jordi Bonada. 2017. A neural parametric singing synthesizer modeling timbre and expression from natural songs. Applied Sciences 7, 12 (2017), 1313.
  • Paul Boersma et al. 2002. Praat, a system for doing phonetics by computer. Glot International 5 (2002).
  • Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. 2009. ClueWeb09 data set.
  • Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning. 129–136.
  • Pritish Chandna, Merlijn Blaauw, Jordi Bonada, and Emilia Gomez. 2019. WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN. arXiv preprint arXiv:1903.10729 (2019).
  • Yu-Ren Chien, Hsin-Min Wang, and Shyh-Kang Jeng. 2016. Alignment of lyrics with accompanied singing audio based on acoustic-phonetic vowel likelihood modeling. TASLP 24, 11 (2016), 1998–2008.
  • Dong Wang, Xuewei Zhang, and Zhiyong Zhang. 2015. THCHS-30: A Free Chinese Speech Corpus. http://arxiv.org/abs/1512.01882
  • Georgi Dzhambazov et al. 2017. Knowledge-based probabilistic modeling for tracking lyrics in music audio signals. Ph.D. Dissertation. Universitat Pompeu Fabra.
  • Hiromasa Fujihara, Masataka Goto, Jun Ogata, and Hiroshi G Okuno. 2011. LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics. IJSTSP 5, 6 (2011), 1252–1261.
  • Daniel Griffin and Jae Lim. 1984. Signal estimation from modified short-time Fourier transform. ICASSP 32, 2 (1984), 236–243.
  • Chitralekha Gupta, Rong Tong, Haizhou Li, and Ye Wang. 2018. Semi-supervised Lyrics and Solo-singing Alignment. In ISMIR. 600–607.
  • Chitralekha Gupta, Emre Yılmaz, and Haizhou Li. 2019. Acoustic Modeling for Automatic Lyrics-to-Audio Alignment. arXiv preprint arXiv:1906.10369 (2019).
  • Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussalam. 2019. Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models. In Proc. International Society for Music Information Retrieval Conference.
  • Yukiya Hono, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. 2019. Singing Voice Synthesis Based on Generative Adversarial Networks. In ICASSP 2019. IEEE, 6955–6959.
  • Andrew J Hunt and Alan W Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In ICASSP 1996, Vol. 1. IEEE, 373–376.
  • Herbert Jaeger. 2002. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. Vol. 5. GMD-Forschungszentrum.
  • Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435 (2018).
  • Juntae Kim, Heejin Choi, Jinuk Park, Sangjin Kim, Jongjin Kim, and Minsoo Hahn. 2018. Korean Singing Voice Synthesis System based on an LSTM Recurrent Neural Network. In INTERSPEECH 2018. ISCA.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Juheon Lee, Hyeong-Seok Choi, Chang-Bin Jeon, Junghyun Koo, and Kyogu Lee. 2019. Adversarially Trained End-to-end Korean Singing Voice Synthesis System. arXiv preprint arXiv:1908.01919 (2019).
  • Hao Li, Yongguo Kang, and Zhenyu Wang. 2018. EMPHASIS: An emotional phoneme-based acoustic model for speech synthesis system. arXiv preprint arXiv:1806.09276 (2018).
  • Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, and Luc Van Gool. 2017. WebVision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862 (2017).
  • Peiling Lu, Jie Wu, Jian Luan, Xu Tan, and Li Zhou. 2020. XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System. arXiv preprint arXiv:2006.06261 (2020).
  • Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. 2017. Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi. In Interspeech. 498–502.
  • Annamaria Mesaros and Tuomas Virtanen. 2008. Automatic alignment of music audio and lyrics. In Proceedings of the 11th Int. Conference on Digital Audio Effects (DAFx-08).
  • Kazuhiro Nakamura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. 2019. Singing voice synthesis based on convolutional neural networks. arXiv preprint arXiv:1904.06868 (2019).
  • Thi Hao Nguyen. 2018. A Study on Correlates of Acoustic Features to Emotional Singing Voice Synthesis. (2018).
  • Masanari Nishimura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. 2016. Singing Voice Synthesis Based on Deep Neural Networks. In Interspeech. 2478–2482.
  • Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
  • Wei Ping, Kainan Peng, and Jitong Chen. 2019. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. In ICLR.
  • Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. 2010. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval 13, 4 (2010), 346–374.
  • Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv preprint (2020).
  • Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: Fast, Robust and Controllable Text to Speech. arXiv preprint arXiv:1905.09263 (2019).
  • Bidisha Sharma, Chitralekha Gupta, Haizhou Li, and Ye Wang. 2019. Automatic Lyrics-to-audio Alignment on Polyphonic Music Using Singing-adapted Acoustic Models. In ICASSP 2019. IEEE, 396–400.
  • Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP 2018. IEEE, 4779–4783.
  • Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara. 2018. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In ICASSP 2018. IEEE, 4784–4788.
  • Marti Umbert, Jordi Bonada, Masataka Goto, Tomoyasu Nakano, and Johan Sundberg. 2015. Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges. IEEE Signal Processing Magazine 32, 6 (2015), 55–73.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
  • Zhizheng Wu, Oliver Watts, and Simon King. 2016. Merlin: An Open Source Neural Network Speech Synthesis System. In SSW. 202–207.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems. 5754–5764.
  • Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai. 2019. Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling. (2019).
  • Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. 2019. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882 (2019).
  • Liqiang Zhang, Chengzhu Yu, Heng Lu, Chao Weng, Yusong Wu, Xiang Xie, Zijin Li, and Dong Yu. 2019. Learning Singing From Speech. arXiv preprint arXiv:1912.10128 (2019).