Transformer Based Grapheme-to-Phoneme Conversion

INTERSPEECH, pp. 2095–2099, 2019


Abstract

The attention mechanism is one of the most successful techniques in deep-learning-based Natural Language Processing (NLP). The transformer network architecture is based entirely on attention mechanisms, and it outperforms sequence-to-sequence models in neural machine translation without recurrent or convolutional layers. Grapheme-to-phon…
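To make the core mechanism concrete, the following is a minimal NumPy sketch of the scaled dot-product attention that the transformer architecture is built on. The function name, dimensions, and toy input are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values

# Toy self-attention over a grapheme sequence of length 5 embedded in 8 dims.
x = np.random.randn(5, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # (5, 8)
```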

Introduction
  • Many approaches have been proposed: the early solutions were rule-based [2], while in later works, joint sequence models for G2P conversion were introduced [3, 4].
  • The latter requires alignment between graphemes and phonemes, and it calculates a joint n-gram language model over sequences.
  • In [1], an end-to-end TTS system utilized an encoder-decoder model for the G2P task, with a multi-layer bidirectional GRU (Gated Recurrent Unit) encoder and a deep unidirectional GRU decoder (a minimal sketch of this kind of model follows below).
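The paper does not reproduce the exact configuration of [1]; the following is a hypothetical PyTorch sketch of such a bidirectional-GRU-encoder / unidirectional-GRU-decoder model for G2P. All class names, vocabulary sizes, and layer dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GRUSeq2SeqG2P(nn.Module):
    """Hypothetical GRU encoder-decoder for grapheme-to-phoneme conversion."""
    def __init__(self, n_graphemes=30, n_phonemes=45, d_model=128):
        super().__init__()
        self.src_emb = nn.Embedding(n_graphemes, d_model)
        self.tgt_emb = nn.Embedding(n_phonemes, d_model)
        # multi-layer bidirectional encoder over grapheme embeddings
        self.encoder = nn.GRU(d_model, d_model, num_layers=2,
                              bidirectional=True, batch_first=True)
        # project the concatenated forward/backward states to init the decoder
        self.bridge = nn.Linear(2 * d_model, d_model)
        # deep unidirectional decoder over phoneme embeddings
        self.decoder = nn.GRU(d_model, d_model, num_layers=2, batch_first=True)
        self.out = nn.Linear(d_model, n_phonemes)

    def forward(self, graphemes, phonemes):
        enc_out, h = self.encoder(self.src_emb(graphemes))
        # merge forward/backward final states of each encoder layer
        h = torch.tanh(self.bridge(torch.cat([h[0::2], h[1::2]], dim=-1)))
        dec_out, _ = self.decoder(self.tgt_emb(phonemes), h)
        return self.out(dec_out)  # per-step logits over the phoneme inventory

model = GRUSeq2SeqG2P()
logits = model(torch.randint(0, 30, (4, 12)), torch.randint(0, 45, (4, 10)))
print(logits.shape)  # torch.Size([4, 10, 45])
```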
Highlights
  • Grapheme-to-phoneme (G2P) conversion is an important component in TTS and automatic speech recognition (ASR) systems [1].
  • Encoder-decoder architectures were applied in various tasks, such as neural machine translation, speech recognition, and text-to-speech synthesis [1, 5, 6].
  • Word Error Rate (WER) is the percentage of words for which the predicted phoneme sequence does not exactly match any reference pronunciation; the number of word errors is divided by the total number of unique words in the reference.
  • We investigated a novel transformer architecture for the G2P task
  • We evaluated phoneme error rate (PER) and word error rate (WER), and the results of the proposed models are very competitive with previous state-of-the-art results.
Methods
  • The proposed Transformer 4x4 model was compared with the following approaches from the literature (see Tables 2 and 3):
  • Joint sequence model [3]
  • Joint maximum entropy (ME) n-gram model [4]
  • Encoder-decoder with global attention [7]
  • Encoder-decoder LSTM [18]
  • Encoder-decoder LSTM with attention (Model 1) [11]
  • End-to-end CNN (Model 4) [11]
  • Encoder CNN with residual connections, decoder Bi-LSTM (Model 5) [11]
  • Combination of Sequitur G2P, seq2seq-attention, and multitask learning [21]
  • Deep Bi-LSTM with many-to-many alignment [23]
Results
  • The authors use the following common evaluation metrics for G2P. Phoneme Error Rate (PER) is the Levenshtein distance between the predicted and reference phoneme sequences, divided by the number of phonemes in the reference pronunciation [20].
  • Word Error Rate (WER) is the percentage of words for which the predicted phoneme sequence does not exactly match any reference pronunciation; the number of word errors is divided by the total number of unique words in the reference. Both metrics are sketched in code below.
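A minimal Python sketch of the two metrics, assuming phoneme sequences are represented as lists of symbols and that each word may have several reference pronunciations (function and variable names are illustrative):

```python
def levenshtein(a, b):
    """Edit distance between two phoneme sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pa != pb)))   # substitution
        prev = cur
    return prev[-1]

def per(predictions, references):
    """Sum of edit distances divided by the number of reference phonemes."""
    errors = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total = sum(len(r) for r in references)
    return errors / total

def wer(predictions, references):
    """Fraction of words whose prediction matches no reference pronunciation."""
    errors = sum(1 for p, refs in zip(predictions, references) if p not in refs)
    return errors / len(references)

pred = [["HH", "AH", "L", "OW"]]
refs = [[["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"]]]  # alternative pronunciations
print(wer(pred, refs), per(pred, [r[0] for r in refs]))      # 0.0 0.0
```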
Conclusion
  • The authors investigated a novel transformer architecture for the G2P task. Transformer 3x3 (3 encoder and 3 decoder layers), Transformer 4x4 (4 encoder and 4 decoder layers), and Transformer 5x5 (5 encoder and 5 decoder layers) architectures were presented, with experiments on the CMUDict and NetTalk datasets (a minimal configuration sketch follows this list).
  • The authors intend to study the application of the proposed method in the field of end-to-end TTS synthesis.
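As a rough illustration of the NxN naming, the following sketch builds a symmetric encoder-decoder transformer with PyTorch's nn.Transformer. The model dimension, head count, and other hyperparameters are assumptions, not the values reported in Table 1.

```python
import torch
import torch.nn as nn

def build_g2p_transformer(n_layers=4, d_model=256, n_heads=4):
    # Same number of layers in encoder and decoder, e.g. 3x3, 4x4, or 5x5.
    return nn.Transformer(d_model=d_model, nhead=n_heads,
                          num_encoder_layers=n_layers,
                          num_decoder_layers=n_layers,
                          dim_feedforward=4 * d_model,
                          batch_first=True)

transformer = build_g2p_transformer(n_layers=4)  # "Transformer 4x4"
src = torch.randn(2, 12, 256)  # embedded grapheme sequence (batch, len, d_model)
tgt = torch.randn(2, 10, 256)  # embedded phoneme sequence, shifted right
out = transformer(src, tgt)
print(out.shape)  # torch.Size([2, 10, 256])
```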
Tables
  • Table 1: Training parameters
  • Table 2: Results on the CMUDict and NetTalk dataset
  • Table 3: Results on the CMUDict and NetTalk datasets
  • Table 4: Examples of errors predicted by Transformer 4x4 and [11]
Funding
  • The research presented in this paper has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications), by the BME-Artificial Intelligence FIKP grant of Ministry of Human Resources (BME FIKPMI/SC), by Doctoral Research Scholarship of Ministry of Human Resources (ÚNKP-18-4-BME-394) in the scope of New National Excellence Program, by János Bolyai Research Scholarship of the Hungarian Academy of Sciences, by the AI4EU project (No 825619), and the DANSPLAT project (Eureka 9944)
  • We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. We are grateful to Stan Chen for providing the dataset of NetTalk
References
  • [1] S. Ö. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, and M. Shoeybi, “Deep Voice: Real-Time Neural Text-to-Speech,” Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 195–204, 2017.
  • [2] A. W. Black, K. Lenzo, and V. Pagel, “Issues in Building General Letter to Sound Rules,” Proceedings of the 3rd ESCA Workshop on Speech Synthesis, pp. 77–80, 1998.
  • [3] M. Bisani and H. Ney, “Joint-Sequence Models for Grapheme-to-Phoneme Conversion,” Speech Communication, vol. 50, no. 5, pp. 434–451, 2008.
  • [4] L. Galescu and J. F. Allen, “Pronunciation of Proper Names with a Joint N-Gram Model for Bi-Directional Grapheme-to-Phoneme Conversion,” 7th International Conference on Spoken Language Processing, pp. 109–112, 2002.
  • [5] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” Advances in Neural Information Processing Systems (NIPS), pp. 3104–3112, 2014.
  • [6] L. Lu, X. Zhang, and S. Renals, “On Training the Recurrent Neural Network Encoder-Decoder for Large Vocabulary End-to-End Speech Recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5060–5064, 2016.
  • [7] S. Toshniwal and K. Livescu, “Jointly Learning to Align and Convert Graphemes to Phonemes with Neural Attention Models,” IEEE Spoken Language Technology Workshop (SLT), pp. 76–82, 2016.
  • [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'12), vol. 1, pp. 1097–1105, 2012.
  • [9] J. Gehring, M. Auli, D. Grangier, and Y. Dauphin, “A Convolutional Encoder Model for Neural Machine Translation,” Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 123–135, 2016.
  • [10] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards End-to-End Speech Synthesis,” INTERSPEECH 2017 – 18th Annual Conference of the International Speech Communication Association, pp. 4006–4010, 2017.
  • [11] S. Yolchuyeva, G. Németh, and B. Gyires-Tóth, “Grapheme-to-Phoneme Conversion with Convolutional Neural Networks,” Applied Sciences, vol. 9, no. 6, p. 1143, 2019.
  • [12] M. J. Chae, K. Park, L. Bang, S. Suh, L. Park, N. Kim, and J. Park, “Convolutional Sequence to Sequence Model with Non-Sequential Greedy Decoding for Grapheme to Phoneme Conversion,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2486–2490, 2018.
  • [13] J. Ni, Y. Shiga, and H. Kawai, “Multilingual Grapheme-to-Phoneme Conversion with Global Character Vectors,” INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, pp. 2823–2827, 2018.
  • [14] B. Peters, J. Dehdari, and J. van Genabith, “Massively Multilingual Neural Grapheme-to-Phoneme Conversion,” Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, pp. 19–26, 2017.
  • [15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” 31st Conference on Neural Information Processing Systems (NIPS 2017), pp. 6000–6010, 2017.
  • [16] S. M. Lakew, M. Cettolo, and M. Federico, “A Comparison of Transformer and Recurrent Neural Networks on Multilingual Neural Machine Translation,” Proceedings of the 27th International Conference on Computational Linguistics (COLING), pp. 641–652, 2018.
  • [17] X. Zhu, L. Li, J. Liu, H. Peng, and X. Niu, “Captioning Transformer with Stacked Attention Modules,” Applied Sciences, vol. 8, no. 5, p. 739, 2018.
  • [18] K. Yao and G. Zweig, “Sequence-to-Sequence Neural Net Models for Grapheme-to-Phoneme Conversion,” INTERSPEECH 2015 – 16th Annual Conference of the International Speech Communication Association, pp. 3330–3334, 2015.
  • [19] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations (ICLR), pp. 1–13, 2015.
  • [20] V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.
  • [21] B. Milde, C. Schmidt, and J. Köhler, “Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion,” INTERSPEECH 2017 – 18th Annual Conference of the International Speech Communication Association, pp. 2536–2540, 2017.
  • [22] S. F. Chen, “Conditional and Joint Models for Grapheme-to-Phoneme Conversion,” 8th European Conference on Speech Communication and Technology, pp. 2033–2036, 2003.
  • [23] A. E. Mousa and B. W. Schuller, “Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks for Grapheme-to-Phoneme Conversion Utilizing Complex Many-to-Many Alignments,” INTERSPEECH 2016 – 17th Annual Conference of the International Speech Communication Association, pp. 2836–2840, 2016.
  • [24] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” CoRR, abs/1607.06450, 2016.
  • [25] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
Authors
Sevinj Yolchuyeva
Bálint Gyires-Tóth