Utterance Copy in Formant-Based Speech Synthesizers Using LSTM Neural Networks

2019 8th Brazilian Conference on Intelligent Systems (BRACIS), 2019

Abstract
Utterance copy, also known as speech imitation, is the task of estimating the parameters of an input (target) speech signal in order to artificially reconstruct a signal with the same properties at the output. This is a difficult inverse problem, since the input-output relationship is often non-linear and a large number of parameters must be estimated and adjusted. This work describes the development of an application that uses a long short-term memory (LSTM) neural network to learn how to estimate the input parameters of the formant-based Klatt speech synthesizer. Formant-based synthesizers do not reach state-of-the-art performance in text-to-speech (TTS) applications, but they are an important tool for linguistic studies because of the high interpretability of their input parameters. The proposed system was compared to the WinSnoori baseline software on both artificially produced target utterances, generated by the DECtalk TTS system, and natural ones. Results show that our system outperforms the baseline for synthetic voices on the PESQ, SNR, RMSE, and LSD metrics. For natural voices, the experiments indicate the need for an architecture that does not depend on labeled data, such as reinforcement learning.
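To make the core idea concrete, below is a minimal sketch (not the authors' code) of an LSTM that regresses frame-level Klatt synthesizer control parameters from acoustic features of the target utterance. The PyTorch framework, the feature and parameter dimensions, and all names are illustrative assumptions; the actual system's feature set, parameter count, and training setup may differ.

    # Minimal sketch, assuming frame-level acoustic features as input and one
    # Klatt control-parameter vector per frame as output (dimensions are illustrative).
    import torch
    import torch.nn as nn

    class KlattParameterEstimator(nn.Module):
        def __init__(self, n_features=40, n_klatt_params=48, hidden_size=256, num_layers=2):
            super().__init__()
            # Bidirectional LSTM over the sequence of analysis frames.
            self.lstm = nn.LSTM(n_features, hidden_size, num_layers,
                                batch_first=True, bidirectional=True)
            # Per-frame regression head producing the Klatt parameter vector
            # (formant frequencies/bandwidths, amplitudes, voicing, F0, ...).
            self.head = nn.Linear(2 * hidden_size, n_klatt_params)

        def forward(self, features):
            # features: (batch, frames, n_features)
            out, _ = self.lstm(features)
            return self.head(out)  # (batch, frames, n_klatt_params)

    # Training would minimize a regression loss against reference parameter tracks,
    # which are available when the targets are synthetic (e.g., DECtalk output).
    model = KlattParameterEstimator()
    dummy = torch.randn(8, 200, 40)      # 8 utterances, 200 frames, 40-dim features
    pred = model(dummy)                  # (8, 200, 48)
    loss = nn.functional.mse_loss(pred, torch.randn_like(pred))
    loss.backward()

The estimated parameter tracks would then drive the Klatt synthesizer to reconstruct the utterance, and the reconstruction can be scored against the target with PESQ, SNR, RMSE, and LSD.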
Keywords
utterance copy, speech synthesis, long short-term memory, deep learning