VITS: Quality Vs. Speed Analysis.

TSD (2023)

Abstract

In this paper, we analyze the performance of a modern end-to-end speech synthesis model called Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS). We build on the original VITS model and examine how different modifications to its architecture affect synthetic speech quality and computational complexity. Experiments were carried out with two Czech voices, one male and one female. To assess the quality of speech synthesized by the modified models, MUSHRA listening tests were performed. Computational complexity was measured in terms of synthesis speed relative to real time. Although the original VITS model is still preferred in terms of speech quality, we present a modification of the original structure that responds significantly faster while still providing acceptable output quality. Such a configuration can be used when system response latency is critical.
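
The abstract does not spell out how synthesis speed relative to real time is computed. One common reading, assumed here and not taken from the paper, is the duration of the generated audio divided by the wall-clock time needed to generate it, so that values above 1 mean faster than real time. A minimal sketch under that assumption follows; the `synthesize` callable, the function names, and the sample rate in the usage note are hypothetical placeholders and not part of any VITS implementation.

```python
import time
from typing import Callable, Sequence


def speed_over_real_time(num_samples: int, sample_rate: int,
                         synthesis_seconds: float) -> float:
    """Return audio duration divided by synthesis time.

    Assumed definition: a value of 4.0 means the audio was produced
    four times faster than it takes to play back.
    """
    if sample_rate <= 0 or synthesis_seconds <= 0:
        raise ValueError("sample_rate and synthesis_seconds must be positive")
    audio_seconds = num_samples / sample_rate
    return audio_seconds / synthesis_seconds


def measure_speed(synthesize: Callable[[str], Sequence[float]],
                  text: str, sample_rate: int) -> float:
    """Time one call to a hypothetical synthesize(text) -> waveform function."""
    start = time.perf_counter()
    waveform = synthesize(text)  # placeholder for any TTS model call
    elapsed = time.perf_counter() - start
    return speed_over_real_time(len(waveform), sample_rate, elapsed)
```

For example, under this definition 2 seconds of audio (44100 samples at an assumed 22050 Hz) generated in 0.5 seconds gives `speed_over_real_time(44100, 22050, 0.5) == 4.0`. In practice such a measurement would be averaged over many utterances on fixed hardware, since a single timing is noisy.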

Keywords

speed analysis, quality vs
