Extracting Unit Embeddings Using Sequence-To-Sequence Acoustic Models For Unit Selection Speech Synthesis

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (2020)

Abstract

This paper presents a method that uses the intermediate representations between linguistic and acoustic features in a Tacotron model to derive the cost functions for unit selection speech synthesis. By extracting the outputs of the Tacotron encoder, each phone-sized candidate unit in the corpus is represented by a fixed-length unit vector. Similarly, each target unit to be synthesized is converted into a unit vector of the same dimension by encoding the input phone sequence. The normalized Euclidean distance between these two vectors is used to perform unit pre-selection and to calculate the target cost for unit selection. A separate DNN, which predicts the unit vector of each phone from those of the preceding phones, is then built to derive the concatenation cost function. Experimental results demonstrate that the unit vectors extracted from Tacotron contain both duration and acoustic information of phone units. Compared with our previous work, which learned unit vectors using a DNN and only acoustic features, the method proposed in this paper further improves the naturalness of unit selection speech synthesis in our experiments.
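
The pre-selection and target-cost step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the Euclidean distance is normalized, so the sketch assumes each dimension is scaled by a per-dimension standard deviation. The function names, the `std` parameter, and the toy 2-dimensional vectors are all hypothetical.

```python
import numpy as np

def normalized_euclidean(target, candidates, std):
    """Distance between one target unit vector and every candidate unit vector.

    Assumption: "normalized" means dividing each dimension by a
    per-dimension standard deviation; the abstract does not specify this.
    """
    diff = (candidates - target) / std
    return np.sqrt(np.sum(diff ** 2, axis=1))

def preselect(target, candidates, std, k):
    """Return the indices and distances of the k nearest candidates, nearest first."""
    dist = normalized_euclidean(target, candidates, std)
    order = np.argsort(dist)[:k]
    return order, dist[order]

# Toy data (made-up values): 4 candidate units with 2-dimensional unit vectors.
candidates = np.array([[0.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 2.0],
                       [3.0, 4.0]])
std = np.array([1.0, 2.0])       # hypothetical per-dimension scale
target = np.array([0.9, 0.2])    # hypothetical target unit vector

# Keep the 2 closest candidates; their distances serve as target costs.
indices, costs = preselect(target, candidates, std, k=2)
print(indices, costs)
```

In this sketch the same distance serves both purposes named in the abstract: ranking candidates for pre-selection and supplying the target cost of each surviving candidate.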

Keywords

speech synthesis, unit selection, hidden Markov model, deep neural network, Tacotron
