A Transformer Based Pitch Sequence Autoencoder with MIDI Augmentation

Mingshuo Ding
Yinghao Ma

Abstract:

Algorithms based on deep learning have been widely put forward for automatic music generation. However, few objective approaches have been proposed to assess whether a melody was created by an automaton or by a human. The Conference on Sound and Music Technology (CSMT 2020) provides a great opportunity to cope with this problem. In this paper, ...

Introduction
  • Methods based on machine learning have been widely proposed for automatic music generation, especially since the significant progress in deep learning.
  • More and more melodies can be composed by deep-learning automatons, using the pitch and length of the notes in human music as primary inputs to mimic human composition [1,2,3,4].
  • Finding a relatively common and objective way to evaluate how melodies are produced across various musical styles can make different music tasks comparable.
  • The purpose of this study is to find an objective and effective method to generate an indicator of whether a melody is human-composed by analyzing AI-made melodies
Highlights
  • Methods based on machine learning have been widely proposed for automatic music generation, especially since the significant progress in deep learning
  • Random insertion and deletion run a high risk in this case, so we propose two data augmentation methods, tune transposition and MIDI sequence truncation (sketched after this list)
  • The probability p_i of the i-th masked note is predicted by the trained A Lite BERT (ALBERT), and the average probability over all notes is the probability that the piece was composed by AI
  • A brand-new method is provided in our research on tiny-dataset training, including data augmentation via music tune transposition and MIDI sequence truncation, and prevention of over-fitting
  • An approach based on a masked language model with ALBERT has been put forward to distinguish whether the composer of a music piece is human or not
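As a rough illustration of the two augmentation operations named above, the sketch below transposes a MIDI pitch sequence by a few semitones and truncates it to a random length. The shift range, minimum length, and function names are illustrative assumptions, not the paper's exact settings.

```python
import random

# Hypothetical helpers illustrating the two augmentations; all ranges are assumptions.
def transpose(pitches, shift):
    """Shift every pitch by `shift` semitones; return None if any note leaves MIDI range 0-127."""
    shifted = [p + shift for p in pitches]
    return shifted if all(0 <= p <= 127 for p in shifted) else None

def truncate(pitches, min_len=16):
    """Keep a random-length prefix of the sequence (minimum length is an assumption)."""
    if len(pitches) <= min_len:
        return list(pitches)
    return pitches[:random.randint(min_len, len(pitches))]

def augment(pitches, shifts=range(-6, 7)):
    """Produce transposed and truncated variants of one pitch sequence."""
    variants = [transpose(pitches, s) for s in shifts if s != 0]
    variants = [v for v in variants if v is not None]
    variants.append(truncate(pitches))
    return variants
```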
Methods
  • The pipeline is shown in Fig.2 as follows.
  • The training set undergoes a data preprocessing step and is expanded by data augmentation.
  • The details are available in the following subsections.
  • A Masked Language Model (MLM) task based on ALBERT is trained as an autoencoder on the expanded training set (a minimal configuration sketch follows this list).
  • The trained model is then used for evaluation.
  • See Section 4 for details of training and evaluation
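To make the model setup concrete, here is a minimal sketch of how an ALBERT masked language model over pitch tokens could be configured with the HuggingFace transformers library, which the paper cites. The vocabulary treats the 128 MIDI pitches as tokens plus a few special symbols; every layer size below is an assumption, not the reported configuration.

```python
from transformers import AlbertConfig, AlbertForMaskedLM

# All hyperparameters are illustrative assumptions, not the paper's settings.
# The 128 MIDI pitches are treated as tokens, plus a few special tokens
# such as [PAD], [MASK], [CLS], [SEP].
config = AlbertConfig(
    vocab_size=128 + 4,
    embedding_size=128,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=512,
    max_position_embeddings=512,
)
model = AlbertForMaskedLM(config)
print(model.num_parameters())  # small model suited to a tiny dataset
```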
Results
  • For a pitch sequence, each note will be masked successively.
  • The probability p_i of the i-th masked note is predicted by the trained ALBERT, and the average probability over all notes is taken as the probability that the piece was composed by AI.
  • With n notes in the pitch sequence, the probability of AI generation is p = (1/n) ∑_{i=1}^{n} p_i.
  • The probability that a piece was created by a human, which this task requires, is obtained as 1 − p (a minimal evaluation sketch follows this list)
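A minimal sketch of this evaluation loop, assuming a HuggingFace-style masked LM whose output exposes `.logits` and a known id `mask_id` for the [MASK] token; the function name is hypothetical.

```python
import torch

@torch.no_grad()
def ai_probability(model, token_ids, mask_id):
    """Mask each note in turn and average the model's probability of the true pitch.

    token_ids: 1-D LongTensor of pitch tokens for one melody.
    Returns p = (1/n) * sum_i p_i; the human probability is 1 - p.
    """
    model.eval()
    probs = []
    for i in range(len(token_ids)):
        masked = token_ids.clone()
        masked[i] = mask_id                                    # mask the i-th note
        logits = model(input_ids=masked.unsqueeze(0)).logits   # shape (1, n, vocab)
        p_i = torch.softmax(logits[0, i], dim=-1)[token_ids[i]]
        probs.append(p_i.item())
    return sum(probs) / len(probs)
```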
Conclusion
  • A brand-new method is provided in this research on tiny-dataset training, including data augmentation via music tune transposition and MIDI sequence truncation, and prevention of over-fitting.
  • Because of the good behavior of the pre-trained model, the authors believe it is worth more extensive application in MIDI sequence encoding tasks.
  • Due to limited computing resources, there is no experiment on whether the model can run well with a larger batch size or on a larger transformer, which is worth further attention
Training
  • Each time, about 15% of the elements in a pitch sequence are randomly masked, and the remaining unmasked elements are used to predict the masked ones (a minimal masking sketch follows)
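A minimal sketch of such 15% random masking, following the common PyTorch/HuggingFace MLM convention of marking unmasked positions with label -100. Whether the paper also applies BERT's random-replacement/keep-original rules is not stated, so that step is omitted here.

```python
import torch

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Randomly mask ~15% of the pitch tokens; the originals become MLM labels.

    Unmasked positions get label -100 so the cross-entropy loss ignores them
    (the usual PyTorch/HuggingFace convention).
    """
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~mask] = -100          # only masked positions contribute to the loss
    inputs = token_ids.clone()
    inputs[mask] = mask_id        # replace selected notes with the [MASK] token
    return inputs, labels
```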
References
  • Li Z, Li S. A comparison of melody created by artificial intelligence and human based on mathematical model[C]// Proceedings of the 7th Conference on Sound and Music Technology (CSMT). Springer, 2020: 121–130.
  • Liu C H, Ting C K. Computational intelligence in music composition: A survey[J]. IEEE Transactions on Emerging Topics in Computational Intelligence, 2016, 1(1): 2–15.
  • Dong H W, Hsiao W Y, Yang L C, et al. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment[J]. arXiv preprint arXiv:1709.06298, 2017.
  • Wu C L, Liu C H, Ting C K. A novel genetic algorithm considering measures and phrases for generating melody[C]// 2014 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2014: 2101–2107.
  • Ren I Y. Using Shannon entropy to evaluate automatic music generation systems: A case study of Bach's chorales[J]. ECE Department, University of Rochester, 2015.
  • Liang F T, Gotham M, Johnson M, et al. Automatic stylistic composition of Bach chorales with deep LSTM[C]// ISMIR, 2017: 449–456.
  • Chu H, Urtasun R, Fidler S. Song from PI: A musically plausible network for pop music generation[J]. arXiv preprint arXiv:1611.03477, 2016.
  • Huang A, Wu R. Deep learning for music[J]. arXiv preprint arXiv:1606.04930, 2016.
  • Unehara M, Onisawa T. Composition of music using human evaluation[C]// 10th IEEE International Conference on Fuzzy Systems (Cat. No. 01CH37297), volume 3. IEEE, 2001: 1203–1206.
  • Maeda Y, Kajihara Y. Automatic generation method of twelve tone row for musical composition used genetic algorithm[C]// 2009 IEEE International Conference on Fuzzy Systems. IEEE, 2009: 963–968.
  • Pollastri E, Simoncelli G. Classification of melodies by composer with hidden Markov models[C]// Proceedings First International Conference on WEB Delivering of Music (WEDELMUSIC 2001). IEEE, 2001: 88–95.
  • Ogihara M, Li T. N-gram chord profiles for composer style representation[C]// ISMIR, 2008: 671–676.
  • Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
  • Peters M E, Neumann M, Iyyer M, et al. Deep contextualized word representations[J]. arXiv preprint arXiv:1802.05365, 2018.
  • Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[M]. 2018.
  • Liu A T, Yang S W, Chi P H, et al. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders[C]// ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 6419–6423.
  • Jiang D, Lei X, Li W, et al. Improving transformer-based speech recognition using unsupervised pre-training[J]. arXiv preprint arXiv:1910.09932, 2019.
  • Ling S, Liu Y, Salazar J, et al. Deep contextualized acoustic representations for semi-supervised speech recognition[C]// ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 6429–6433.
  • Baskar M K, Watanabe S, Astudillo R, et al. Semi-supervised sequence-to-sequence ASR using unpaired speech and text[J]. arXiv preprint arXiv:1905.01152, 2019.
  • Schneider S, Baevski A, Collobert R, et al. wav2vec: Unsupervised pre-training for speech recognition[J]. arXiv preprint arXiv:1904.05862, 2019.
  • Lan Z, Chen M, Goodman S, et al. ALBERT: A lite BERT for self-supervised learning of language representations[J]. arXiv preprint arXiv:1909.11942, 2019.
  • Chi P H, Chung P H, Wu T H, et al. Audio ALBERT: A lite BERT for self-supervised learning of audio representation[J]. arXiv preprint arXiv:2005.08575, 2020.
  • Kim Y E, Chai W, Garcia R, et al. Analysis of a contour-based representation for melody[C]// ISMIR, 2000.
  • Wei J, Zou K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks[J]. arXiv preprint arXiv:1901.11196, 2019.
  • Lee J, Lee Y, Kim J, et al. Set Transformer: A framework for attention-based permutation-invariant neural networks[C]// International Conference on Machine Learning. PMLR, 2019: 3744–3753.
  • Ishida T, Yamane I, Sakai T, et al. Do we need zero training loss after achieving zero training error?[J]. arXiv preprint arXiv:2002.08709, 2020.
  • Raffel C, Ellis D P. Intuitive analysis, creation and manipulation of MIDI data with pretty_midi[C]// 15th International Society for Music Information Retrieval Conference, Late Breaking and Demo Papers, 2014: 84–93.
  • Paszke A, Gross S, Massa F, et al. PyTorch: An imperative style, high-performance deep learning library[C]// Advances in Neural Information Processing Systems, 2019: 8026–8037.
  • Wolf T, Debut L, Sanh V, et al. HuggingFace's Transformers: State-of-the-art natural language processing[J]. arXiv preprint arXiv:1910.03771, 2019.
  • Loshchilov I, Hutter F. Decoupled weight decay regularization[J]. arXiv preprint arXiv:1711.05101, 2017.