A small-footprint context-independent HMM-based synthesizer for Tamil

International Journal of Speech Technology (2015)

Abstract
A text-to-speech synthesis system produces intelligible and natural speech corresponding to any given text. Two main attributes of a synthesizer are the quality of the speech produced and the footprint size. In the current work, HMM-based speech synthesizers have been built and assessed using various kinds of phone-sized units, namely monophone, triphone, triphone with contextual features, pentaphone, and pentaphone with contextual features. It is observed that the quality of synthetic speech improves with the addition of contexts, from a mean opinion score (MOS) of 2.4 for a synthesizer that uses monophones to 3.98 for one that uses pentaphones with 48 additional contextual features (pentaphone+). However, the footprint size also increases from 269 to 1840 kB with the addition of contextual information. Therefore, depending on the desired application, a compromise has to be made on either the quality or the footprint size. Analysis reveals that although speech synthesized by a monophone-based system lacks naturalness, it is intelligible. The lack of naturalness is primarily due to discontinuities in the pitch contour. Therefore, an attempt is made to improve the quality of synthesized speech by smoothing the pitch contour, thereby retaining the small footprint size while approaching the quality of a synthesizer that uses contextual information. It is observed that smoothing the pitch contour at the word level yields the best quality, with an MOS of 3.4. Further, a preference test reveals that 71.25% of the sentences are similar in quality to the speech synthesized by a pentaphone+ HTS, while 5% are better.
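The abstract does not say which smoothing algorithm is applied to the pitch contour, so the following is only a minimal sketch of what word-level smoothing could look like, not the authors' method. It assumes a per-frame F0 track in Hz where 0.0 marks an unvoiced frame, word boundaries given as frame indices, and a simple moving average; the function name `smooth_f0_by_word`, the `word_bounds` format, and the `window` parameter are all hypothetical. Re-synthesizing the waveform from the modified contour (for example with TD-PSOLA, which the keywords mention) is outside the sketch.

```python
from typing import List, Sequence, Tuple


def smooth_f0_by_word(
    f0: Sequence[float],
    word_bounds: Sequence[Tuple[int, int]],
    window: int = 5,
) -> List[float]:
    """Moving-average smoothing of an F0 contour, applied word by word.

    f0          -- per-frame pitch in Hz; 0.0 marks an unvoiced frame (assumed)
    word_bounds -- hypothetical (start, end) frame indices, end exclusive
    window      -- odd number of frames in the averaging window (assumed)
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    out = list(f0)  # frames outside every word are copied unchanged
    for start, end in word_bounds:
        for i in range(start, end):
            if f0[i] <= 0.0:
                continue  # leave unvoiced frames untouched
            # Clip the window to the current word so that averaging
            # never mixes pitch values across a word boundary.
            lo = max(start, i - half)
            hi = min(end, i + half + 1)
            voiced = [v for v in f0[lo:hi] if v > 0.0]
            out[i] = sum(voiced) / len(voiced)
    return out


# Toy contour: one pitch jump inside the first word, then an unvoiced frame.
contour = [100.0, 100.0, 160.0, 100.0, 0.0, 120.0, 120.0]
words = [(0, 4), (4, 7)]
smoothed = smooth_f0_by_word(contour, words, window=3)
```

Clipping the window at word boundaries is what makes this "word-level" in the sketch: a jump inside a word is flattened, while pitch resets between words are preserved. Whether the paper treats boundaries this way is an assumption.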
Keywords
Phone-sized units, HMM-based speech synthesis, Monophone, Triphone, Pentaphone, TD-PSOLA