Deep speech-to-text models capture the neural basis of spontaneous speech in everyday conversations

bioRxiv (Cold Spring Harbor Laboratory)(2023)

引用 1|浏览26
暂无评分
摘要
Abstract Humans effortlessly use the continuous acoustics of speech to communicate rich linguistic meaning during everyday conversations. In this study, we leverage 100 hours (half a million words) of spontaneous open-ended conversations and concurrent high-quality neural activity recorded using electrocorticography (ECoG) to decipher the neural basis of real-world speech production and comprehension. Employing a deep multimodal speech-to-text model named Whisper, we develop encoding models capable of accurately predicting neural responses to both acoustic and semantic aspects of speech. Our encoding models achieved high accuracy in predicting neural responses in hundreds of thousands of words across many hours of left-out recordings. We uncover a distributed cortical hierarchy for speech and language processing, with sensory and motor regions encoding acoustic features of speech and higher-level language areas encoding syntactic and semantic information. Many electrodes—including those in both perceptual and motor areas—display mixed selectivity for both speech and linguistic features. Notably, our encoding model reveals a temporal progression from language-to-speech encoding before word onset during speech production and from speech-to-language encoding following word articulation during speech comprehension. This study offers a comprehensive account of the unfolding neural responses during fully natural, unbounded daily conversations. By leveraging a multimodal deep speech recognition model, we highlight the power of deep learning for unraveling the neural mechanisms of language processing in real-world contexts.
更多
查看译文
关键词
spontaneous speech-to-text,everyday conversations,neural basis
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要