Grid-based approximation for voice conversion in low resource environments

EURASIP J. Audio, Speech and Music Processing (2016)

Abstract
The goal of voice conversion is to modify a source speaker’s speech so that it sounds as if it were spoken by a target speaker. Common conversion methods are based on Gaussian mixture modeling (GMM). They aim to statistically model the spectral structure of the source and target signals and require relatively large training sets (typically dozens of sentences) to avoid over-fitting. Moreover, they often produce muffled synthesized output, due to excessive smoothing of the spectral envelopes. Mobile applications are characterized by limited resources in terms of training data, memory footprint, and computational complexity. As technology advances, computational and memory requirements become less limiting; however, the amount of available training data remains a major challenge, as a typical mobile user is willing to record only a few sentences. In this paper, we propose the grid-based (GB) conversion method for such low-resource environments, which is successfully trained using very few sentences (5–10). The GB approach is based on sequential Bayesian tracking, by which the conversion process is expressed as a sequential estimation problem of tracking the target spectrum based on the observed source spectrum. The converted Mel frequency cepstrum coefficient (MFCC) vectors are sequentially evaluated using a weighted sum of the target training vectors, which serve as grid points. The training process consists of simple computations of Euclidean distances between the training vectors and is easily performed even with very small training sets. We use global variance (GV) enhancement to improve the perceived quality of the synthesized signals obtained by the proposed and the GMM-based methods. Using just 10 training sentences, our enhanced GB method yields converted sentences whose GV values are closer to those of the target while also achieving lower spectral distances, compared with the enhanced version of the GMM-based conversion method. Furthermore, subjective evaluations show that signals produced by the enhanced GB method are perceived as more similar to the target speaker than the enhanced GMM signals, at the expense of a small degradation in perceived quality.
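The idea of tracking the target spectrum on a grid of training vectors can be illustrated with a minimal sketch. It is not the paper's exact formulation: the Gaussian transition and likelihood kernels, the parameters `sigma_obs` and `sigma_trans`, the uniform prior, and the function names `sq_dists` and `grid_convert` are all assumptions made for illustration. The sketch assumes time-aligned source and target training MFCC vectors and returns each converted vector as a weighted sum of the target training vectors.

```python
import numpy as np


def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def grid_convert(src_train, tgt_train, src_test, sigma_obs=1.0, sigma_trans=1.0):
    """Illustrative grid-based tracking (kernel choices are assumptions).

    src_train, tgt_train: (N, D) time-aligned training MFCC vectors.
    src_test: (T, D) source MFCC vectors to convert.
    Returns (T, D) vectors, each a convex combination of tgt_train rows.
    """
    n = tgt_train.shape[0]

    # "Training": distances between target grid points define an assumed
    # Gaussian transition matrix; row i holds P(next = j | current = i).
    trans = np.exp(-sq_dists(tgt_train, tgt_train) / (2.0 * sigma_trans ** 2))
    trans /= trans.sum(axis=1, keepdims=True)

    weights = np.full(n, 1.0 / n)  # assumed uniform prior over grid points
    out = np.empty((src_test.shape[0], tgt_train.shape[1]))

    for t, x in enumerate(src_test):
        # Prediction step: propagate the grid weights through the transitions.
        pred = weights @ trans
        # Update step: assumed Gaussian likelihood of the observed source
        # vector, measured against the paired source training vectors.
        d = np.sum((src_train - x) ** 2, axis=1)
        lik = np.exp(-(d - d.min()) / (2.0 * sigma_obs ** 2))  # shifted for stability
        weights = pred * lik
        weights /= weights.sum()
        # Converted vector: weighted sum of the target grid points.
        out[t] = weights @ tgt_train

    return out
```

With a small `sigma_obs` the weights concentrate on the grid point whose paired source vector is nearest to the observation; larger values spread the weights and smooth the output, which is the trade-off that GV enhancement is meant to counteract.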
Keywords
Bayesian tracking, Global variance (GV), Mel cepstral distortion (MCD), Grid-based approximation, Spectral conversion