Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract
Voice activity detection (VAD) is the task of predicting which parts of an utterance contain speech versus background noise. It is an important first step in determining which samples to send to the decoder and when to close the microphone. The long short-term memory network (LSTM) is a popular architecture for sequential modeling of acoustic signals and has been used successfully in several VAD applications. However, LSTMs have been observed to suffer from state saturation when the utterance is long (e.g., in voice dictation tasks), which requires the LSTM state to be periodically reset. In this paper, we propose an alternative architecture that does not suffer from saturation because it models temporal variations through a stateless dilated convolutional neural network (CNN). The proposed architecture differs from conventional CNNs in three respects: it uses dilated causal convolution, gated activations, and residual connections. Results on a Google Voice Typing task show that the proposed architecture achieves a 14% relative improvement in false accepts (FA) at a false-reject (FR) rate of 1% over state-of-the-art LSTMs on the VAD task. We also include detailed experiments investigating the factors that distinguish the proposed architecture from conventional convolution.
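To make the three architectural ingredients concrete, the sketch below shows one way to combine them in PyTorch: a residual block built from dilated causal convolution with a gated (tanh × sigmoid) activation, stacked with exponentially growing dilation and followed by a per-frame speech/non-speech logit. This is a minimal illustration of the general technique, not the authors' exact model; the layer sizes, feature dimension, class names, and the 1×1 classifier head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """One block: dilated causal conv + gated activation + residual add."""
    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        # Left-pad so the convolution is causal: each output frame
        # depends only on current and past input frames.
        self.pad = (kernel_size - 1) * dilation
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        padded = nn.functional.pad(x, (self.pad, 0))
        # Gated activation: tanh "filter" modulated by sigmoid "gate".
        h = torch.tanh(self.filter_conv(padded)) * torch.sigmoid(self.gate_conv(padded))
        # Residual connection keeps gradients flowing through deep stacks.
        return x + h

class DilatedConvVAD(nn.Module):
    """Stateless stack of gated residual blocks with exponentially
    increasing dilation, then a frame-level speech/non-speech logit.
    Hyperparameters here are assumptions, not the paper's values."""
    def __init__(self, in_dim: int = 40, channels: int = 64, num_blocks: int = 6):
        super().__init__()
        self.input_proj = nn.Conv1d(in_dim, channels, kernel_size=1)
        self.blocks = nn.ModuleList(
            GatedResidualBlock(channels, kernel_size=3, dilation=2 ** i)
            for i in range(num_blocks)
        )
        self.classifier = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, in_dim, time), e.g. log-mel filterbank frames
        x = self.input_proj(features)
        for block in self.blocks:
            x = block(x)
        return self.classifier(x)  # (batch, 1, time) per-frame logits

# Usage: score 100 frames of 40-dim features for one utterance.
model = DilatedConvVAD()
logits = model(torch.randn(1, 40, 100))
speech_prob = torch.sigmoid(logits)  # per-frame speech probability
```

Because every block is a feed-forward convolution, there is no recurrent state to saturate or reset on long utterances, while the doubling dilations give the stack a receptive field that grows exponentially with depth.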
Keywords
CNN, voice activity detection, LSTM