Frequency warping by linear transformation, and vocal tract inversion for speaker normalization in automatic speech recognition (2008)

Abstract
Vocal Tract Length Normalization (VTLN) for standard filterbank-based Mel Frequency Cepstral Coefficient (MFCC) features is usually implemented by warping the center frequencies of the Mel filterbank, with the warping factor estimated using the maximum likelihood score (MLS) criterion. A linear transform (LT) equivalent for frequency warping (FW) would enable more efficient MLS estimation. In this dissertation, we present a novel LT to perform FW for VTLN and model adaptation with standard MFCC features. Our formula for the transformation matrix is computationally simpler than previous LT approaches and requires no modification of the standard MFCC feature extraction scheme. In VTLN and Speaker Adaptive Modeling (SAM) experiments with the Resource Management (RM1) database, the performance of the new LT was comparable to that of regular VTLN by warping the Mel filterbank. This demonstrates that the approximations involved in the LT do not lead to any performance degradation. We also performed Speaker Adaptive Training (SAT) with the feature-space LT, denoted CLTFW. Global CLTFW SAT gave results comparable to SAM and VTLN. By estimating multiple CLTFW transforms using a regression tree, and including an additive bias, we obtained significantly improved results compared to VTLN as the amount of adaptation data increased.

In the second part of the dissertation, vocal tract (VT) inversion, which recovers the VT shape sequence from speech signals, is performed for vowels by cepstral analysis-by-synthesis, using chain-matrix calculation of VT acoustics and the Maeda articulatory model. The derivative of the VT chain matrix with respect to the area function was calculated in a novel, efficient manner and used in the BFGS quasi-Newton method to optimize a cost function that includes a distance measure between input and synthesized cepstral sequences, together with regularization and continuity terms. Inversion was evaluated on data from the University of Wisconsin X-ray microbeam (XRMB) database. Good agreement was achieved between inverted midsagittal VT outlines and measured XRMB tongue and lip pellet positions, with smooth optimized articulatory trajectories and an average relative error of less than 3% in the first three formants.
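
The idea in the first part of the abstract, that frequency warping can be applied as a linear transform on cepstral features and the warping factor chosen by a maximum likelihood search, can be illustrated with a minimal sketch. This is not the dissertation's transformation matrix formula. It assumes a generic construction (orthonormal DCT, linear interpolation along the filterbank channel axis, inverse DCT), a single diagonal Gaussian in place of an HMM acoustic model, and it omits the Jacobian term. All function names are hypothetical.

```python
import math


def dct_matrix(n_ceps, n_chan):
    """Orthonormal DCT-II rows mapping n_chan log filterbank outputs to n_ceps cepstra."""
    rows = []
    for k in range(n_ceps):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_chan)
        rows.append([scale * math.cos(math.pi * k * (m + 0.5) / n_chan)
                     for m in range(n_chan)])
    return rows


def warp_matrix(alpha, n_chan):
    """Assumed warp: output channel m reads the input at position alpha * m,
    using linear interpolation and clipping at the band edges."""
    w = [[0.0] * n_chan for _ in range(n_chan)]
    for m in range(n_chan):
        pos = min(max(alpha * m, 0.0), n_chan - 1.0)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n_chan - 1)
        frac = pos - lo
        w[m][lo] += 1.0 - frac
        w[m][hi] += frac
    return w


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def cepstral_warp_transform(alpha, n_ceps, n_chan):
    """A(alpha) = C W(alpha) C^T, so warped cepstra are A(alpha) times the cepstra.
    C^T acts as a pseudo-inverse because the rows of C are orthonormal."""
    c = dct_matrix(n_ceps, n_chan)
    return matmul(matmul(c, warp_matrix(alpha, n_chan)), transpose(c))


def log_likelihood(frames, mean, var):
    """Log-likelihood of the frames under one diagonal Gaussian (stand-in model)."""
    total = 0.0
    for x in frames:
        for xi, mu, v in zip(x, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
    return total


def estimate_warp_factor(frames, mean, var, alphas, n_chan):
    """Grid search: return the alpha whose transformed frames score highest.
    The Jacobian normalization term is omitted in this sketch."""
    best_alpha, best_score = None, None
    for alpha in alphas:
        a = cepstral_warp_transform(alpha, len(mean), n_chan)
        score = log_likelihood([matvec(a, x) for x in frames], mean, var)
        if best_score is None or score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```

Because each candidate warp is a small matrix applied to cepstra that have already been computed, the search does not need to rerun filterbank analysis for every warping factor, which is the efficiency argument the abstract makes for an LT equivalent of FW.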
Keywords
VT shape sequence, automatic speech recognition, speaker normalization, VT acoustic, linear transformation, frequency warping, feature space LT denoted, regular VTLN, new LT, VT chain matrix, novel LT, previous LT approach, vocal tract inversion, Mel filterbank