HiFi-GAN based Text-to-Speech Synthesis in Serbian

European Signal Processing Conference (EUSIPCO), 2022

Abstract
In this paper we present a deep neural network based text-to-speech system for the Serbian language, which converts generated acoustic features into a speech signal using the HiFi-GAN vocoder. The HiFi-GAN model was fine-tuned starting from an existing multi-speaker model trained on an English speech corpus. To overcome the problem of inadequate training data, we introduce a data generation technique based on a guided acoustic neural network, which attempts to minimize the mismatch between the data used in HiFi-GAN training and the data used at inference. The outputs of the acoustic network are intended to represent a trade-off between the original feature trajectories and the trajectories generated by the standard text-to-speech system. Subjective evaluation through listening tests shows that the proposed system produces speech whose quality significantly surpasses that of the best existing speech synthesis system for Serbian, and that its mean opinion score (MOS) is very close to the score given to natural speech.
Keywords
synthesis, Serbian, HiFi-GAN, text-to-speech