Tagged-MRI to audio synthesis with a pairwise heterogeneous deep translator

The Journal of the Acoustical Society of America (2022)

Abstract
Identifying the underlying relationship between visible articulator movements in tagged-MRI and intelligible speech is a vital problem for better understanding speech production in health and disease. Due to their heterogeneous representations, however, direct mapping between the two modalities is challenging. We develop a deep learning framework that synthesizes a sequence of tagged-MRI data into its corresponding mel-spectrogram, which is then converted back into an audio waveform. Our network adopts a parallel encoder-decoder structure that takes as input a pair of tagged-MRI sequences. The 3D CNN-based encoders learn to extract features of the spatiotemporally varying motion, and the decoder learns to generate the corresponding spectrograms conditioned on the latent-space features. For a pair of sequences of the same utterance, we further encourage the latent-space features to be as close as possible by minimizing their Kullback-Leibler divergence. To demonstrate the performance of our framework, we used a leave-one-out evaluation strategy on a total of 63 tagged-MRI sequences from two utterances: 43 of "ageese" and 20 of "asouk." Our framework generated clear audio from tagged-MRI sequences unseen in training, which could potentially aid in better understanding speech production and improving treatment strategies for patients with speech-related disorders.
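The pairwise encoder-decoder scheme described in the abstract lends itself to a compact sketch. The PyTorch code below is a minimal illustration, not the authors' implementation: the module names (TaggedMRIEncoder, SpectrogramDecoder), layer sizes, spectrogram shape, the softmax-based KL formulation over the latent vectors, and the loss weight are all assumptions filled in for readability.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TaggedMRIEncoder(nn.Module):
    """3D CNN mapping a tagged-MRI sequence (B, 1, T, H, W) to a latent vector.
    Depth and channel counts are illustrative assumptions."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # pool over the (T, H, W) dimensions
        )
        self.fc = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        return self.fc(h)

class SpectrogramDecoder(nn.Module):
    """Maps a latent vector to a mel-spectrogram (B, n_mels, n_frames).
    Output shape is an assumed placeholder."""
    def __init__(self, latent_dim=256, n_mels=80, n_frames=64):
        super().__init__()
        self.n_mels, self.n_frames = n_mels, n_frames
        self.fc = nn.Linear(latent_dim, n_mels * n_frames)

    def forward(self, z):
        return self.fc(z).view(-1, self.n_mels, self.n_frames)

def pairwise_loss(mri_a, mri_b, mel_a, mel_b, encoder, decoder):
    """Reconstruction loss on both branches of the pair, plus a symmetric KL
    term pulling the two latent features of the same utterance together.
    Treating softmax(z) as the compared distribution is an assumption; the
    paper may instead use a VAE-style Gaussian KL."""
    z_a, z_b = encoder(mri_a), encoder(mri_b)
    recon = F.l1_loss(decoder(z_a), mel_a) + F.l1_loss(decoder(z_b), mel_b)
    p = F.log_softmax(z_a, dim=-1)
    q = F.log_softmax(z_b, dim=-1)
    kl = F.kl_div(p, q, reduction="batchmean", log_target=True) \
       + F.kl_div(q, p, reduction="batchmean", log_target=True)
    return recon + 0.1 * kl  # the KL weight is an assumed hyperparameter

The abstract does not specify how the predicted mel-spectrogram is inverted to a waveform; Griffin-Lim (e.g., librosa.feature.inverse.mel_to_audio) or a neural vocoder would be common choices for that final step.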