Monaural speech enhancement using U-net fused with multi-head self-attention

FAN Junyi, YANG Jibin, ZHANG Xiongwei, ZHENG Changyan

Chinese Journal of Acoustics (2023)

Abstract

Under low signal-to-noise ratio (SNR) and burst noise conditions, existing deep learning models for speech enhancement perform unsatisfactorily. In contrast, humans can exploit the long-term correlation of speech to form an integrated perception of different speech signals. Describing the long-term dependencies of speech can therefore help improve enhancement performance under low SNR and burst noise conditions. Inspired by this, a time-domain end-to-end monaural speech enhancement model, TU-net, is proposed that fuses the multi-head self-attention mechanism with the U-net deep network. TU-net adopts the encoder-decoder (codec) layer structure of U-net to achieve multi-scale feature fusion, and introduces a dual-path Transformer module that uses multi-head self-attention to calculate the speech mask and better model long-term correlation. The model is trained with a weighted-sum loss function over the time, time-frequency, and perceptual domains. Simulation experiments show that, while keeping a relatively small number of network parameters, TU-net outperforms other similar monaural enhancement network models on several evaluation metrics, such as perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and SNR gain, under low SNR and burst noise conditions.
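
To illustrate why multi-head self-attention suits long-term correlation, the following is a minimal sketch of generic scaled dot-product multi-head self-attention over a sequence of frames. It is not the TU-net dual-path Transformer module, whose details the abstract does not give. The function names, the sizes (6 frames, 8 features, 2 heads), and the random projection matrices are all assumptions chosen for illustration, and the code uses only the Python standard library to stay self-contained.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x holds T frames of d_model features; returns (output, per-head weights)."""
    d_model = len(x[0])
    d_head = d_model // num_heads
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    head_outputs, head_weights = [], []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        qh = [row[cols] for row in q]
        kh = [row[cols] for row in k]
        vh = [row[cols] for row in v]
        # Each frame is scored against every other frame, so frames that are
        # far apart in time interact in a single step.
        scores = [
            [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_head) for kj in kh]
            for qi in qh
        ]
        weights = [softmax(r) for r in scores]
        head_outputs.append(matmul(weights, vh))
        head_weights.append(weights)
    # Concatenate the heads along the feature axis, then project back to d_model.
    concat = [sum((ho[t] for ho in head_outputs), []) for t in range(len(x))]
    return matmul(concat, w_o), head_weights


def rand_matrix(rows, cols):
    """Hypothetical stand-in for learned parameters or input features."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, d_model, num_heads = 6, 8, 2  # assumed toy sizes
x = rand_matrix(T, d_model)
w_q, w_k, w_v, w_o = (rand_matrix(d_model, d_model) for _ in range(4))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Here `out` has one `d_model`-dimensional vector per input frame, and each row of `attn[h]` is a probability distribution over all `T` frames. In a trained model the projection matrices would be learned rather than random.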
