Summary
We have proposed SongMASS, an automatic song writing system for both lyric-to-melody and melody-to-lyric generation, which leverages masked sequence-to-sequence pre-training and an attention-based alignment constraint.
SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint
Automatic song writing aims to compose a song (lyric and/or melody) by machine, which is an interesting topic in both academia and industry. In automatic song writing, lyric-to-melody generation and melody-to-lyric generation are two important tasks, both of which usually suffer from the following challenges: 1) the paired lyric and melody data are limited, and 2) strict alignment between lyric and melody is required.
- Automatic song writing is an interesting and challenging task in both research and industry.
- Previous works (Bao et al. 2019; Li et al. 2020; Watanabe et al. 2018; Lee, Fang, and Ma 2019) on lyric-to-melody (L2M) and melody-to-lyric (M2L) generation have not considered the scenario of limited paired data, and rely only on greedy decisions for lyric-melody alignment, which cannot well address these challenges.
- The authors propose SongMASS, which uses masked sequence-to-sequence pre-training to leverage the unpaired lyric and melody data, and attention-based alignment constraints for global and precise lyric-melody alignment.
- We propose SongMASS, an automatic song writing system for L2M and M2L, which addresses the first challenge (limited paired data) with masked sequence-to-sequence pre-training and the second challenge (lyric-melody alignment) with an attention-based alignment constraint; a sketch of the masking scheme follows this list.
- The main results of the objective evaluation of lyric-to-melody and melody-to-lyric generation are shown in Table 1.
- We introduce the sentence-level and token-level alignment constraints, and a dynamic programming algorithm to obtain accurate alignments between lyric and melody
- Experimental results show that our proposed SongMASS greatly improves the quality of lyric-to-melody and melody-to-lyric generation compared with the baseline
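To make the pre-training objective concrete, below is a minimal sketch of MASS-style masked sequence-to-sequence data preparation (Song et al. 2019) applied to an unpaired lyric. The mask token, mask ratio, and whitespace tokenization are illustrative assumptions, not the paper's exact preprocessing.

```python
# Minimal sketch of MASS-style masked seq2seq data preparation.
# Assumptions: a "<mask>" token, a 50% contiguous span, whitespace tokens.
import random

MASK = "<mask>"

def make_mass_example(tokens, mask_ratio=0.5):
    """Mask a contiguous span of the source; the decoder reconstructs it.

    Returns (encoder_input, decoder_input, decoder_target):
      encoder_input  - source with the span replaced by <mask> tokens
      decoder_input  - the span shifted right for teacher forcing
      decoder_target - the original masked span
    """
    span_len = max(1, int(len(tokens) * mask_ratio))
    start = random.randrange(0, len(tokens) - span_len + 1)
    span = tokens[start:start + span_len]

    encoder_input = (tokens[:start] + [MASK] * span_len
                     + tokens[start + span_len:])
    decoder_input = [MASK] + span[:-1]  # first decoding step sees <mask>
    decoder_target = span
    return encoder_input, decoder_input, decoder_target

lyric = "shine on me and light my way".split()
enc_in, dec_in, dec_tgt = make_mass_example(lyric)
print(enc_in)   # e.g. ['shine', '<mask>', '<mask>', '<mask>', 'light', 'my', 'way']
print(dec_tgt)  # e.g. ['on', 'me', 'and']
```

The same procedure applies to melody token sequences, so both modalities can be pre-trained on unpaired data alone.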
- 3.1 System Overview
The overall architecture of SongMASS for L2M and M2L is shown in Figure 2; it adopts a Transformer-based encoder-decoder framework (Vaswani et al. 2017).
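As a rough illustration of this framework, and of the separate-encoder-decoder design analyzed below, the following PyTorch sketch builds one Transformer encoder and decoder per modality; the hyperparameters and placeholder inputs are assumptions, not the authors' configuration.

```python
# Sketch: separate Transformer encoder/decoder per modality, so
# lyric-to-lyric and melody-to-melody pre-training do not share weights,
# while fine-tuning can cross them (lyric encoder -> melody decoder).
import torch
import torch.nn as nn

d_model, nhead, num_layers = 512, 8, 6  # assumed hyperparameters

def make_encoder():
    layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

def make_decoder():
    layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=num_layers)

lyric_enc, lyric_dec = make_encoder(), make_decoder()
melody_enc, melody_dec = make_encoder(), make_decoder()

# Placeholder embedded sequences (batch=2); real inputs would come from
# token embeddings of lyric words and melody events.
lyric_emb = torch.randn(2, 16, d_model)
melody_emb = torch.randn(2, 20, d_model)

memory = lyric_enc(lyric_emb)         # encode the lyric
out = melody_dec(melody_emb, memory)  # L2M path: melody decoder attends to lyric
print(out.shape)                      # torch.Size([2, 20, 512])
```

A supervised loss on the paired data then encourages the lyric and melody sides to map into a shared latent space, which is the "supervised pre-training" design ablated in Table 1.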
- Pre-training method. The authors further investigate the effectiveness of each design in the pre-training method, including using a separate encoder-decoder for lyric-to-lyric and melody-to-melody pre-training, and using supervised pre-training to learn a shared latent space between lyric and melody.
- From Table 1, removing the separate encoder-decoder and removing the supervised loss both result in worse performance than SongMASS, which demonstrates the effectiveness of the two designs.
- Alignment strategy. The authors study the effectiveness of the sentence-level and token-level alignment constraints on the alignment accuracy between melodies and lyrics.
- The authors find that the alignment accuracy drops drastically without dynamic programming (DP), as shown in Table 3, demonstrating the importance of DP for accurate alignments; a sketch of such a DP follows below.
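For intuition, here is a minimal sketch of how a dynamic program can extract a monotonic token-level alignment from an attention matrix. It uses a DTW-style recursion (cf. Berndt and Clifford 1994); the authors' exact formulation may differ.

```python
# Sketch: DTW-style DP that finds a monotonic path through an attention
# matrix, maximizing total attention mass along the alignment.
import numpy as np

def dp_align(attn):
    """attn[t, s]: attention of decoder step t on source token s.
    Returns a monotonic path [(t, s), ...] from (0, 0) to (T-1, S-1)."""
    T, S = attn.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S, 2), dtype=int)
    score[0, 0] = attn[0, 0]
    for t in range(T):
        for s in range(S):
            if t == 0 and s == 0:
                continue
            # Allowed moves: advance target, advance source, or both.
            best, best_cell = -np.inf, None
            for i, j in ((t - 1, s), (t, s - 1), (t - 1, s - 1)):
                if i >= 0 and j >= 0 and score[i, j] > best:
                    best, best_cell = score[i, j], (i, j)
            if best_cell is not None:
                score[t, s] = best + attn[t, s]
                back[t, s] = best_cell
    # Trace the best path back from the final cell.
    path, cell = [], (T - 1, S - 1)
    while cell != (0, 0):
        path.append(cell)
        cell = (int(back[cell][0]), int(back[cell][1]))
    path.append((0, 0))
    return path[::-1]

attn = np.random.rand(5, 4)  # toy attention weights (5 target x 4 source)
print(dp_align(attn))
```

A greedy argmax per decoder step, by contrast, can yield non-monotonic and inconsistent alignments, which matches the accuracy drop reported when DP is removed.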
- 4.1 Experimental Setup
Dataset (unpaired lyric and melody). The authors use the "380,000+ lyrics from MetroLyrics" dataset as the unpaired lyrics for pre-training, which contains 362,237 songs.
- The subjective evaluation results are shown in Table 2, which show that the lyrics and melodies generated by SongMASS obtain better average scores on all subjective metrics.
- These results demonstrate the effectiveness of SongMASS in generating high-quality lyrics and melodies.
- As shown in Table 1, removing each component results in worse performance than SongMASS, demonstrating the contribution of pre-training and the alignment constraint.
- The authors have proposed SongMASS, an automatic song writing system for both lyric-to-melody and melody-to-lyric generation, which leverages masked sequence-to-sequence pre-training and an attention-based alignment constraint.
- Experimental results show that the proposed SongMASS greatly improves the quality of lyric-to-melody and melody-to-lyric generation compared with the baseline.
- The authors will investigate other sequence-to-sequence pre-training methods and more advanced alignment algorithms for lyric-to-melody and melody-to-lyric generation.
- Table 1: Results of lyric-to-melody and melody-to-lyric generation in objective evaluation
- Table 2: Subjective evaluation results. Average scores and standard deviations are shown for each measure
- Table 3: Analyses of the designs in alignment constraints
- This research was supported by the National Key Research and Development Program of China (No. 2019YFB1405802).
Study subjects and analysis
Participants with professional knowledge in music and singing: 5
We calculate the ratio of matching alignments among all source tokens and all songs in the test set to obtain the alignment accuracy.
Subjective evaluation. For subjective evaluation, we invite 5 participants with professional knowledge in music and singing as human annotators to evaluate 10 songs (338 pairs of generated lyric sentences and melody phrases) randomly selected from our test set. We require each annotator to answer some questions using a five-point scale, from 1 (Poor) to 5 (Perfect).
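As a small illustration of these two computations, the sketch below measures alignment accuracy as the fraction of source tokens aligned to the same target position as the reference, and summarizes five-point ratings by mean and standard deviation; the example data are made up.

```python
# Sketch: alignment accuracy and mean/std of subjective ratings.
import statistics

def alignment_accuracy(predicted, reference):
    """Fraction of source tokens aligned to the same target as the reference."""
    equal = sum(p == r for p, r in zip(predicted, reference))
    return equal / len(reference)

pred = [0, 1, 1, 2, 3]  # predicted target index per source token (toy data)
ref  = [0, 1, 2, 2, 3]  # ground-truth alignment
print(f"alignment accuracy: {alignment_accuracy(pred, ref):.2f}")  # 0.80

ratings = [4, 5, 3, 4, 4]  # one annotator's 1-5 scores on a question (toy data)
print(f"score: {statistics.mean(ratings):.2f} +/- {statistics.stdev(ratings):.2f}")
```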
- Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473.
- Bao, H.; Huang, S.; Wei, F.; Cui, L.; Wu, Y.; Tan, C.; Piao, S.; and Zhou, M. 2019. Neural Melody Composition from Lyrics. In NLPCC, volume 11838, 499–511.
- Berndt, D. J.; and Clifford, J. 1994. Using dynamic time warping to find patterns in time series. In KDD workshop, volume 10, 359–370.
- Choi, K.; Fazekas, G.; and Sandler, M. B. 2016. Text-based LSTM networks for Automatic Music Composition. CoRR abs/1604.05358.
- Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, 4171–4186.
- Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In Bengio, Y.; and LeCun, Y., eds., ICLR.
- Lee, H.-P.; Fang, J.-S.; and Ma, W.-Y. 2019. iComposer: An Automatic Songwriting System for Chinese Popular Music. In NAACL, 84–88.
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2019. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. CoRR abs/1910.13461.
- Li, P.; Zhang, H.; Liu, X.; and Shi, S. 2020. Rigid Formats Controlled Text Generation. In ACL, 742–751.
- Lu, X.; Wang, J.; Zhuang, B.; Wang, S.; and Xiao, J. 2019. A Syllable-Structured, Contextually-Based Conditionally Generation of Chinese Lyrics. In PRICAI, volume 11672, 257–265.
- Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP, 1412–1421.
- Malmi, E.; Takala, P.; Toivonen, H.; Raiko, T.; and Gionis, A. 2015. DopeLearning: A Computational Approach to Rap Lyrics Generation. CoRR abs/1505.04771.
- Needleman, S. B.; and Wunsch, C. D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443–453.
- Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.
- Raffel, C. 2016. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. PhD Thesis.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. CoRR abs/1910.10683.
- Ren, Y.; He, J.; Tan, X.; Qin, T.; Zhao, Z.; and Liu, T. 2020. PopMAG: Pop Music Accompaniment Generation. CoRR abs/2008.07703. URL https://arxiv.org/abs/2008.07703.
- Rush, A. M.; Chopra, S.; and Weston, J. 2015. A Neural Attention Model for Abstractive Sentence Summarization. In EMNLP, 379–389.
- Song, K.; Tan, X.; Qin, T.; Lu, J.; and Liu, T. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In ICML, volume 97, 5926–5936.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In NIPS, 5998–6008.
- Watanabe, K.; Matsubayashi, Y.; Fukayama, S.; Goto, M.; Inui, K.; and Nakano, T. 2018. A Melody-Conditioned Lyrics Language Model. In NAACL, 163–172.
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R. R.; and Le, Q. V. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NIPS, 5753–5763.
- Yu, Y.; and Canales, S. 2019. Conditional LSTM-GAN for Melody Generation from Lyrics. CoRR abs/1908.05551.
- Zhu, H.; Liu, Q.; Yuan, N. J.; Qin, C.; Li, J.; Zhang, K.; Zhou, G.; Wei, F.; Xu, Y.; and Chen, E. 2018. XiaoIce Band: A Melody and Arrangement Generation Framework for Pop Music. In KDD, 2837–2846.