Unified Speculation, Detection, and Verification Keyword Spotting

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2022)

Abstract
Accurate and timely recognition of the trigger keyword is vital for a good customer experience on smart devices. In the traditional keyword spotting task, there is typically a trade-off between accuracy and latency: higher accuracy can be achieved by waiting for more context. In this paper, we propose a deep learning model that separates the keyword spotting task into three phases in order to further optimize both the accuracy and the latency of the overall system. These three phases are Speculation, Detection, and Verification. Speculation makes an early decision, which can be used to give a head start to downstream processes on the device, such as local speech recognition. Next, Detection mimics the traditional keyword trigger task and gives a more accurate decision by observing the full keyword context. Finally, Verification verifies the previous decision by observing even more audio after the keyword span. We propose a latency-aware max-pooling loss function that can train a unified model for these three tasks by tuning for different latency targets within the same model. In addition, we empirically show that the resulting unified model can accommodate these tasks with desirable performance and without requiring additional compute or memory resources.
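
To make the idea of one loss with several latency targets more concrete, here is a minimal sketch in plain Python. It is not the formulation from the paper, which the abstract does not spell out. It assumes that each task has its own per-frame keyword posterior, that the keyword end frame is known for positive utterances, and that each task pools the maximum posterior inside a small window centred at a task-specific offset from that end frame. The function names, the offsets in `TASK_OFFSETS`, and the `half_width` parameter are all hypothetical.

```python
import math

# Hypothetical latency targets, in frames relative to the keyword end.
# Negative = decide before the keyword ends, positive = decide after.
TASK_OFFSETS = {"speculation": -10, "detection": 0, "verification": 10}


def latency_aware_max_pool_loss(posteriors, is_keyword, keyword_end,
                                offset, half_width=2):
    """Illustrative max-pooling loss for one utterance and one task.

    posteriors  : list of per-frame keyword probabilities in (0, 1)
    is_keyword  : True if the utterance contains the keyword
    keyword_end : frame index where the keyword ends (assumed label)
    offset      : latency target in frames, relative to keyword_end
    half_width  : half-size of the pooling window (assumed parameter)
    """
    eps = 1e-7  # guards against log(0)
    if is_keyword:
        # Pool only inside the window placed at the latency target.
        centre = keyword_end + offset
        lo = max(0, centre - half_width)
        hi = min(len(posteriors), centre + half_width + 1)
        if lo >= hi:
            raise ValueError("latency window falls outside the utterance")
        best = max(posteriors[lo:hi])
        return -math.log(max(best, eps))
    # Negative utterance: penalise the most confident false-alarm frame.
    worst = max(posteriors)
    return -math.log(max(1.0 - worst, eps))


def unified_loss(task_posteriors, is_keyword, keyword_end):
    """Sum the per-task losses so a single model is trained on all three."""
    return sum(
        latency_aware_max_pool_loss(task_posteriors[task], is_keyword,
                                    keyword_end, offset)
        for task, offset in TASK_OFFSETS.items()
    )
```

Under these assumptions, moving a task's offset earlier forces its posterior to fire with less context, while a later offset lets it wait for more audio, which is one way to read the accuracy versus latency trade-off described above.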
Keywords
keyword spotting, accuracy latency tradeoff, convolutional recurrent neural network, max-pooling loss, multi-task learning