MOS Predictor for Synthetic Speech with I-Vector Inputs

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Abstract
Non-intrusive methods based on deep learning have received increasing attention for synthetic speech quality assessment, since they do not require reference signals. Meanwhile, the i-vector has been widely used in paralinguistic speech attribute recognition tasks such as speaker and emotion recognition, but few studies have applied it to speech quality estimation. In this paper, we propose a neural-network-based model that splices the deep features extracted by a convolutional neural network (CNN) with the i-vector along the time axis, and uses a Transformer encoder as the temporal sequence model. To evaluate the proposed method, we improve previous prediction models and conduct experiments on the Voice Conversion Challenge (VCC) 2018 and 2016 datasets. Results show that the i-vector carries information closely related to the quality of synthetic speech, and that the proposed models utilizing the i-vector and a Transformer encoder substantially improve the accuracy of MOSNet and MBNet at both the utterance level and the system level.
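The abstract describes splicing an i-vector onto CNN-extracted frame features along the time axis and feeding the result to a Transformer encoder that predicts a MOS score. The following is a minimal PyTorch sketch of that idea under stated assumptions: the class name `IVectorMOSNet`, all layer sizes, and the choice to project the i-vector into the model dimension and prepend it as one extra time step are illustrative guesses, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class IVectorMOSNet(nn.Module):
    """Hypothetical sketch of the described architecture: CNN frame
    features spliced with an i-vector token along the time axis, then
    a Transformer encoder and a frame-level MOS head. All dimensions
    are illustrative assumptions, not values from the paper."""

    def __init__(self, n_mels=80, ivec_dim=100, d_model=256):
        super().__init__()
        # CNN over the spectrogram; padding keeps time/frequency sizes
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.frame_proj = nn.Linear(32 * n_mels, d_model)
        self.ivec_proj = nn.Linear(ivec_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # frame-level MOS score

    def forward(self, spec, ivec):
        # spec: (B, T, n_mels) log-mel spectrogram; ivec: (B, ivec_dim)
        x = self.cnn(spec.unsqueeze(1))             # (B, 32, T, n_mels)
        B, C, T, F = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B, T, C * F)
        x = self.frame_proj(x)                      # (B, T, d_model)
        tok = self.ivec_proj(ivec).unsqueeze(1)     # (B, 1, d_model)
        x = torch.cat([tok, x], dim=1)              # splice on time axis
        h = self.encoder(x)
        frame_mos = self.head(h[:, 1:]).squeeze(-1) # (B, T) frame scores
        return frame_mos.mean(dim=1)                # utterance-level MOS
```

Averaging frame-level scores into an utterance-level score mirrors the MOSNet family of predictors, which the abstract names as baselines; the exact pooling and loss used in the paper may differ.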
Keywords
speech quality assessment,speech synthesis,i-vector,Transformer encoder