
Pre-Trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis

INTERSPEECH, 2019, pp. 4480-4484


Abstract

In this paper, we propose a novel method to improve the performance and robustness of the front-end text processing modules of Mandarin text-to-speech (TTS) synthesis. We use pre-trained text encoding models, such as the encoder of a transformer-based NMT model and BERT, to extract the latent semantic representations of words or characters…
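The abstract describes extracting latent character-level representations from a pre-trained text encoder and feeding them to lightweight front-end classifiers. The sketch below is an illustration only: it pulls per-character hidden states from a publicly available Chinese BERT via the Hugging Face `transformers` library. The checkpoint name, library, and tensor handling are assumptions, not the authors' actual implementation, and the paper's NMT-encoder variant is not shown.

```python
# Minimal sketch (assumption): use Hugging Face `transformers` with the public
# "bert-base-chinese" checkpoint to obtain per-character hidden vectors that a
# downstream front-end classifier could consume. The paper's own models and
# checkpoints may differ.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

sentence = "我们住在银行附近"  # example sentence; 行 in 银行 is a polyphone

# bert-base-chinese tokenizes Chinese text character by character, so each
# character maps to one token between [CLS] and [SEP].
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (batch, seq_len, 768); drop [CLS]/[SEP] to align with characters.
char_vectors = outputs.last_hidden_state[0, 1:-1]
print(char_vectors.shape)  # torch.Size([len(sentence), 768])
```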

Introduction
  • Great progress has been made in the field of text-to-speech (TTS). The speech synthesized by recently proposed end-to-end acoustic models (e.g. Tacotron [1], Transformer TTS [2]) and neural vocoders (e.g. WaveNet [3], WaveRNN [4], WaveGlow [5]) is almost indistinguishable from recorded human speech.
  • An RNN with gated cells, e.g. long short-term memory (LSTM) [19] or the gated recurrent unit (GRU) [20], is a sequential model that has been successfully applied to speech and NLP tasks.
  • These RNN-based models can be learned in an end-to-end manner with manually designed features.
  • RNNs usually work with pre-trained word-level or character-level embedding representations to improve the generalization of the learned models.
  • Another existing problem is that long-term context dependency [22] is usually required in TTS front-end tasks.
  • This dependency cannot be well captured by the previous methods.
Highlights
  • Great progress has been made in the field of text-to-speech (TTS)
  • A recurrent neural network (RNN) with gated cells, e.g. long short-term memory (LSTM) [19] or the gated recurrent unit (GRU) [20], is a sequential model that has been successfully applied to speech and natural language processing (NLP) tasks
  • In order to analyze the space of extracted text representations, we performed a t-SNE [36] analysis on the representation vectors of several polyphone characters on the training set
  • We presented an effective method to improve the performance of TTS front-end text processing using pre-trained text representations
  • One is BERT, learned in an unsupervised way, and the other is the encoder of a neural machine translation (NMT) model trained on a bilingual corpus
  • The experimental results on polyphone disambiguation and prosodic structure prediction show that the proposed methods significantly improved performance compared with conventional methods as well as other text-representation-based methods
Methods
  • The authors evaluated the proposed representation-based methods on a Mandarin dataset.
  • The training corpora for polyphone disambiguation and prosodic structure prediction were labeled by linguistic experts for the experiments.
  • For the experiments on polyphone disambiguation, the authors collected a dataset of 300,000 sentences.
  • One polyphone character was labeled in each sentence.
  • The authors collected sentences for 89 frequently used polyphone characters, covering a total of 202 character-pronunciation pairs in the corpus (a minimal data-loading and split sketch follows this list).
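To make the dataset description concrete, here is a small sketch of how such a corpus might be organized and split into the 240,000/30,000/30,000 train/validation/test partitions mentioned under "Study subjects and analysis" below. The TSV layout and field names are purely hypothetical; the paper does not specify its file format.

```python
# Hypothetical corpus layout: one sentence per line, with the position of the
# single labeled polyphone character and its gold pronunciation (pinyin).
# Format and field names are assumptions for illustration only.
import random
from dataclasses import dataclass

@dataclass
class PolyphoneExample:
    sentence: str    # raw Mandarin sentence
    char_index: int  # position of the labeled polyphone character
    pinyin: str      # gold pronunciation, e.g. "hang2" vs "xing2" for 行

def load_corpus(path: str) -> list[PolyphoneExample]:
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence, idx, pinyin = line.rstrip("\n").split("\t")
            examples.append(PolyphoneExample(sentence, int(idx), pinyin))
    return examples

def split_corpus(examples, seed=0):
    """Shuffle and split into 240k / 30k / 30k, as described in the paper."""
    rng = random.Random(seed)
    rng.shuffle(examples)
    return examples[:240000], examples[240000:270000], examples[270000:300000]
```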
Results
  • BERT is trained to predict randomly masked words/characters in given sentences.
  • In order to analyze the space of extracted text representations, the authors performed a t-SNE [36] analysis on the representation vectors of several polyphone characters on the training set.
  • One can see that there are clear patterns separating the different pronunciations of a character.
  • This makes it very easy for the shallow feed-forward neural network classifier to predict the correct pronunciation given the character representation (a classifier and t-SNE sketch follows this list).
  • Similar patterns were observed in the representation space of an NMT encoder
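The two analysis steps described above, a shallow feed-forward classifier over the extracted character vectors and a t-SNE projection of those vectors, can be sketched as follows. scikit-learn is used here as a stand-in; the actual classifier architecture, hyper-parameters, and plotting details are not given in the source and should be treated as assumptions.

```python
# Sketch only: train a shallow feed-forward classifier on pre-extracted
# character representations and visualize them with t-SNE, mirroring the
# analysis described above. Shapes, hyper-parameters, and data files are
# illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# X: (n_samples, 768) representation vectors of one polyphone character
#    extracted from the pre-trained encoder; y: pronunciation labels.
X = np.load("reps_xing_hang.npy")    # hypothetical file of vectors for 行
y = np.load("labels_xing_hang.npy")  # e.g. 0 = "xing2", 1 = "hang2"

# Shallow feed-forward network: one hidden layer is enough when the
# representation space already separates the pronunciations.
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))

# t-SNE projection of the same vectors; clear per-pronunciation clusters
# correspond to the patterns reported in the paper.
proj = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(proj[:, 0], proj[:, 1], c=y, s=4)
plt.title("t-SNE of representations for one polyphone character")
plt.savefig("tsne_polyphone.png", dpi=150)
```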
Conclusion
  • The authors presented an effective method to improve the performance of TTS front-end text processing using pre-trained text representations.
  • The experimental results on polyphone disambiguation and prosodic structure prediction show that the proposed methods significantly improved performance compared with conventional methods as well as other text-representation-based methods.
  • The authors will apply the pre-trained transformer text representations to other modules of the TTS front-end, including text normalization, speaking style prediction, etc.
Tables
  • Table 1: Accuracy rate (%) for different systems in Mandarin on the polyphone disambiguation task
        System:   ME     BLM    BERT   NMT    TB
        Accuracy: 91.34  94.50  96.80  96.18  96.94
  • Table 2: The results of F1 scores of different systems on the PW and PP tasks
Funding
  • In this paper, we propose a novel method to improve the performance and robustness of the front-end text processing modules of Mandarin text-to-speech (TTS) synthesis
  • Our experiments on the tasks of Mandarin polyphone disambiguation and prosodic structure prediction show that the proposed method can significantly improve performance
  • We get an absolute improvement of 0.013 and 0.027 in F1 score for prosodic word prediction and prosodic phrase prediction respectively, and an absolute improvement of 2.44% in polyphone disambiguation compared to previous methods (a boundary-F1 sketch follows this list)
  • Another existing problem is that long-term context dependency [22] is usually required in TTS front-end tasks, and this dependency cannot be well captured by the previous methods. To address these issues, we propose to use stronger text representation extractors to improve the performance of our front-end tasks
  • There is no significant difference among these three transformer-based methods, which means that the NMT encoder can achieve similar performance with a much smaller network structure
  • In this paper, we presented an effective method to improve the performance of TTS front-end text processing using pre-trained text representations
  • The experimental results on polyphone disambiguation and prosodic structure prediction show that the proposed methods significantly improved performance compared with conventional methods as well as other text-representation-based methods
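For readers unfamiliar with the metric behind those numbers, the sketch below computes precision, recall, and F1 over predicted prosodic-word (PW) or prosodic-phrase (PP) boundary positions. Scoring at boundary positions is a common convention for this task, but the exact protocol the authors used is an assumption here.

```python
# Boundary-level F1 sketch (assumed scoring convention, not necessarily the
# paper's exact protocol). Gold and predicted boundaries are sets of character
# indices after which a PW/PP break occurs.
def boundary_f1(gold: set[int], pred: set[int]) -> float:
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 4 gold breaks, 5 predicted, 3 correct -> P=0.6, R=0.75, F1≈0.667
print(boundary_f1({2, 5, 9, 12}, {2, 5, 7, 9, 14}))
```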
Study subjects and analysis
Samples per pronunciation: at least 500
There are 202 character-pronunciation pairs in the corpus, and at least 500 samples were collected for each pronunciation of each character. The dataset was split into three subsets of 240,000, 30,000 and 30,000 sentences for training, validation and test respectively.

Reference
  • [1] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
  • [2] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, “Close to human quality tts with transformer,” arXiv preprint arXiv:1809.08895, 2018.
  • [3] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
  • [4] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” arXiv preprint arXiv:1802.08435, 2018.
  • [5] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” arXiv preprint arXiv:1811.00002, 2018.
  • [6] P. Ebden and R. Sproat, “The kestrel tts text normalization system,” Natural Language Engineering, vol. 21, no. 3, pp. 333–353, 2015.
  • [7] R. Sproat and T. Emerson, “The first international chinese word segmentation bakeoff,” in Proceedings of the Second SIGHAN Workshop on Chinese Language Processing - Volume 17. Association for Computational Linguistics, 2003, pp. 133–143.
  • [8] L. Marquez and H. Rodríguez, “Part-of-speech tagging using decision trees,” in European Conference on Machine Learning. Springer, 1998, pp. 25–36.
  • [9] H. Zhang, J. Yu, W. Zhan, and S. Yu, “Disambiguation of chinese polyphonic characters,” in The First International Workshop on MultiMedia Annotation (MMA2001), vol. 1. Citeseer, 2001, pp. 30–1.
  • [10] Q. Shi, X. Ma, W. Zhu, W. Zhang, and L. Shen, “Statistic prosody structure prediction,” in Proceedings of 2002 IEEE Workshop on Speech Synthesis. IEEE, 2002, pp. 155–158.
  • [11] J.-F. Li, G.-p. Hu, and R. Wang, “Chinese prosody phrase break prediction based on maximum entropy model,” in Eighth International Conference on Spoken Language Processing, 2004.
  • [12] F. Liu, H. Jia, and J. Tao, “A maximum entropy based hierarchical model for automatic prosodic boundary labeling in mandarin,” in 2008 6th International Symposium on Chinese Spoken Language Processing. IEEE, 2008, pp. 1–4.
  • [13] F. Z. Liu and Y. Zhou, “Polyphone disambiguation based on maximum entropy model in mandarin grapheme-to-phoneme conversion,” in Key Engineering Materials, vol. 480. Trans Tech Publ, 2011, pp. 1043–1048.
  • [14] Y. Qian, Z. Wu, X. Ma, and F. Soong, “Automatic prosody prediction and detection with conditional random field (crf) models,” in 2010 7th International Symposium on Chinese Spoken Language Processing. IEEE, 2010, pp. 135–138.
  • [15] X. Shen and B. Xu, “A cart-based hierarchical stochastic model for prosodic phrasing in chinese,” in Proc. of ISCSLP00, 2000, pp. 105–108.
  • [16] C. Ding, L. Xie, J. Yan, W. Zhang, and Y. Liu, “Automatic prosody prediction for chinese speech synthesis using blstm-rnn and embedding features,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 98–102.
  • [17] C. Shan, L. Xie, and K. Yao, “A bi-directional lstm approach for polyphone disambiguation in mandarin chinese,” in 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2016, pp. 1–5.
  • [18] R. Sproat and N. Jaitly, “An rnn model of text normalization,” in INTERSPEECH, 2017, pp. 754–758.
  • [19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [20] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [21] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [22] T. Lin, B. G. Horne, P. Tino, and C. L. Giles, “Learning long-term dependencies in narx recurrent neural networks,” IEEE Transactions on Neural Networks, vol. 7, no. 6, pp. 1329–1338, 1996.
  • [23] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
  • [25] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [26] Y. Zheng, J. Tao, Z. Wen, and Y. Li, “Blstm-crf based end-to-end prosodic boundary prediction with context sensitive embeddings in a text-to-speech front-end,” Proc. Interspeech 2018, pp. 47–51, 2018.
  • [27] Z. Huang, W. Xu, and K. Yu, “Bidirectional lstm-crf models for sequence tagging,” arXiv preprint arXiv:1508.01991, 2015.
  • [28] Y. Huang, Z. Wu, R. Li, H. Meng, and L. Cai, “Multi-task learning for prosodic structure generation using blstm rnn with structured output layer,” in INTERSPEECH, 2017, pp. 779–783.
  • [29] F. M. H. G. W. Renhua, “Multi-level polyphone disambiguation for mandarin grapheme-phoneme conversion,” Computer Engineering and Applications, no. 2, p. 49, 2006.
  • [30] K. Yao and G. Zweig, “Sequence-to-sequence neural net models for grapheme-to-phoneme conversion,” arXiv preprint arXiv:1506.00196, 2015.
  • [31] X. Mao, Y. Dong, J. Han, D. Huang, and H. Wang, “Inequality maximum entropy classifier with character features for polyphone disambiguation in mandarin tts systems,” in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'07), vol. 4. IEEE, 2007, pp. IV–705.
  • [32] A. Rendel, R. Fernandez, R. Hoory, and B. Ramabhadran, “Using continuous lexical embeddings to improve symbolic-prosody prediction in a text-to-speech front-end,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5655–5659.
  • [33] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” arXiv preprint arXiv:1603.01360, 2016.
  • [34] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” arXiv preprint arXiv:1508.07909, 2015.
  • [35] M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power, “Semi-supervised sequence tagging with bidirectional language models,” arXiv preprint arXiv:1705.00108, 2017.
  • [36] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
Authors
Bing Yang
Jiaqi Zhong
Shan Liu