Pushing the Envelope - Aside: Beyond the Spectral Envelope as the Fundamental Representation for Speech Recognition

msra

Abstract
Introduction

State-of-the-art automatic speech recognition (ASR) systems continue to improve, and yet there remain many tasks for which the technology is inadequate. The core acoustic operation has essentially remained the same for decades: a single feature vector (derived from the power spectral envelope over a 20-30 ms window, stepped forward by ~10 ms per frame) is compared to a set of distributions derived from training data for an inventory of sub-word units (usually some variant of phones). While many systems also incorporate time derivatives [Furui, 1986] and/or projections from 5 or more frames to a lower dimension [Hunt & Lefebvre 1989, Haeb-Umbach et al. 1994], the fundamental character of the acoustic features has remained quite similar. We believe that this limited perspective is a key weakness in speech recognizers.

Under good conditions, human phone error rate for nonsense syllables has been estimated to be as low as 1.5% [Allen 1994], as compared with rates that are over an order of magnitude higher for the best automatic recognizers [… Renals 1994]. In this light, our best current recognizers appear half-deaf, only making up for this deficiency by incorporating strong domain constraints. To develop generally applicable and useful recognition techniques, we must overcome the limitations of current acoustic processing. Interestingly, even human phonetic categorization is poor for extremely short segments (e.g., <100 ms), suggesting that analysis of longer time regions is somehow essential to the task. This is supported by information-theoretic analysis showing discriminative dependence, conditional on the underlying phones, between features separated in time by up to several hundred milliseconds.

In mid 2002, we began working on a DARPA-sponsored project known as the "Novel Approaches" component of the Effective Affordable Reusable Speech-to-text (EARS) program. The fundamental goal of this multi-site effort was to "push" the spectral envelope away from its role as the sole source of acoustic information incorporated by the statistical models of modern speech recognition systems, particularly in the context of the conversational telephone speech recognition task. This ultimately would require both a revamping of the acoustical feature extraction and a fresh look at the incorporation of these features into statistical models representing speech. So far, much of our effort has gone towards the design of new features, and experimentation with their incorporation in a modern speech-to-text system. The new features have already provided significant improvements in such a system in the 2004 NIST evaluation of recognizers of …
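
For readers unfamiliar with the conventional front end that the abstract describes (one feature vector per short analysis window, stepped forward by roughly 10 ms, optionally augmented with time derivatives), the following is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' system: the function names, the 25 ms window, the Hamming taper, and the regression width of 2 are all hypothetical choices, and the raw log power spectrum stands in for a smoothed spectral-envelope representation such as cepstral features.

```python
import numpy as np

def frame_signal(x, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Slice a 1-D signal into overlapping frames (assumed 25 ms window, 10 ms step)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    n_frames = 1 + (len(x) - win) // hop  # assumes len(x) >= win
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def log_power_spectrum(frames):
    """Log power spectrum per frame; a stand-in for envelope-based features."""
    windowed = frames * np.hamming(frames.shape[1])
    spec = np.fft.rfft(windowed, axis=1)
    return np.log(np.abs(spec) ** 2 + 1e-10)  # small floor avoids log(0)

def deltas(feats, width=2):
    """Time derivatives estimated by linear regression over +/- `width` frames."""
    n = len(feats)
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(
        k * (padded[width + k: width + k + n] - padded[width - k: width - k + n])
        for k in range(1, width + 1)
    )
    den = 2 * sum(k * k for k in range(1, width + 1))
    return num / den

# Usage: one second of synthetic 8 kHz "telephone-band" noise (illustrative only).
signal = np.random.default_rng(0).standard_normal(8000)
static = log_power_spectrum(frame_signal(signal, 8000))
features = np.hstack([static, deltas(static)])  # static + delta per frame
```

Each row of `features` covers only a few tens of milliseconds of signal, which is the limitation the abstract argues against when it points to dependencies spanning several hundred milliseconds.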