V-Speech

GetMobile: Mobile Computing and Communications (2020)

Abstract
Smart glasses are often used in noisy public spaces or industrial settings. Voice commands and automatic speech recognition (ASR) are a natural user interface for such a form factor, but background noise and interfering speakers pose significant challenges. Typical signal-processing techniques are limited in performance and/or hardware resources. V-Speech is a novel solution that captures the voice signal with a vibration sensor located in the nasal pads of smart glasses. Although the signal-to-noise ratio (SNR) is much higher with vibration-sensor capture, it introduces a "nasal distortion" that must be corrected. The second part of our proposed solution is a voice transformation of the vibration signal using a neural network, producing an output that mimics the characteristics of a conventional microphone. We evaluated V-Speech in noise-free and very noisy conditions with 30 volunteer speakers uttering 145 phrases each, and validated its performance on ASR engines, with assessments of voice quality using the Perceptual Evaluation of Speech Quality (PESQ) metric and with subjective listeners rating intelligibility, naturalness, and overall quality. In extreme noise conditions, the results show a mean 50% improvement in Word Error Rate (WER), a 1.0-point gain on the 5.0-point PESQ scale, and speech regarded as intelligible, with naturalness rated fair to good. The output of V-Speech has low noise, sounds natural, and enables clear voice communication in challenging environments.
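The WER figure quoted in the abstract is the standard metric: the word-level edit distance (insertions, deletions, substitutions) between an ASR hypothesis and the reference transcript, normalized by the reference length. A minimal sketch of that computation follows; the function name and example phrases are illustrative, not from the paper, and the authors' evaluation used full ASR engines rather than this toy code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: edit distance between word sequences,
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference gives WER = 0.25.
print(wer("turn on the lights", "turn on the light"))  # → 0.25
```

A "mean improvement of 50% for WER" means the error rate under extreme noise roughly halved when the vibration-sensor signal (after neural voice transformation) replaced the conventional microphone input to the ASR engine.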