Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach

Yu Li, Feng Xue, Lin Wu, Yincen Xie, Shujie Li

Multimedia Systems (2024)

Abstract
Lipreading refers to translating the lip motion of a speaker in a video into the corresponding text. Existing lipreading methods typically describe the lip motion using variations in visual appearance. However, relying on visual variations alone is prone to producing inaccurate text, since different words can share similar lip shapes. Moreover, visual features generalize poorly to unseen speakers, especially when training data is limited. In this paper, we leverage both the visual lip motion and facial landmarks, and propose an effective sentence-level end-to-end approach to lipreading. The facial landmarks are introduced to eliminate irrelevant visual features that are sensitive to the specific lip appearance of individual speakers, which enables the model to adapt to the different lip shapes of speakers and to generalize to unseen speakers. Specifically, the proposed framework consists of two branches corresponding to the visual features and the facial landmarks. The visual branch extracts high-level visual features from the lip movement, while the landmark branch learns both the spatial and the temporal patterns described by the landmarks. The feature embeddings from the two streams are fused for each frame to form its latent vector, which can be decoded into text. A sequence-to-sequence model takes the feature embeddings of all frames as input and decodes them to generate the text. The proposed method is shown to generalize well to unseen speakers on benchmark datasets.
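The abstract describes the pipeline concretely enough to sketch: a visual branch over lip crops, a landmark branch capturing spatial and temporal landmark patterns, per-frame fusion into latent vectors, and a sequence-to-sequence decoder producing text. The following is a minimal PyTorch sketch of that structure, assuming illustrative module names (VisualBranch, LandmarkBranch, TwoStreamLipreader), layer sizes, 68 facial landmarks, and a GRU encoder-decoder as a stand-in seq2seq model; none of these specifics are confirmed by the abstract.

```python
# A minimal sketch of the two-stream fusion idea from the abstract.
# All sizes, layer choices, and names are illustrative assumptions,
# not the authors' exact architecture.
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    """Extracts a per-frame visual feature from cropped lip regions."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, frames):                  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))      # (B*T, 64, 1, 1)
        return self.proj(x.flatten(1)).view(b, t, -1)  # (B, T, out_dim)

class LandmarkBranch(nn.Module):
    """Encodes the spatial layout of facial landmarks per frame; a small
    GRU over the frame sequence captures their temporal patterns."""
    def __init__(self, n_landmarks=68, out_dim=256):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Linear(n_landmarks * 2, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )
        self.temporal = nn.GRU(out_dim, out_dim, batch_first=True)

    def forward(self, landmarks):               # landmarks: (B, T, N, 2)
        b, t = landmarks.shape[:2]
        s = self.spatial(landmarks.view(b, t, -1))
        out, _ = self.temporal(s)               # (B, T, out_dim)
        return out

class TwoStreamLipreader(nn.Module):
    """Fuses the two per-frame embeddings into latent vectors and decodes
    them to text tokens with a GRU encoder-decoder (illustrative seq2seq)."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.visual = VisualBranch(dim)
        self.landmark = LandmarkBranch(out_dim=dim)
        self.fuse = nn.Linear(2 * dim, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, frames, landmarks, tokens):
        v = self.visual(frames)                     # (B, T, dim)
        l = self.landmark(landmarks)                # (B, T, dim)
        fused = self.fuse(torch.cat([v, l], -1))    # per-frame latent vectors
        _, h = self.encoder(fused)                  # summarize the sequence
        dec, _ = self.decoder(self.embed(tokens), h)
        return self.out(dec)                        # (B, L, vocab_size) logits

# Shape check with random data: 2 clips, 10 frames of 64x64 lip crops.
model = TwoStreamLipreader(vocab_size=40)
frames = torch.randn(2, 10, 3, 64, 64)
landmarks = torch.randn(2, 10, 68, 2)
tokens = torch.randint(0, 40, (2, 5))
print(model(frames, landmarks, tokens).shape)       # torch.Size([2, 5, 40])
```

The key design point the abstract emphasizes is the fusion step: because the landmark stream carries speaker-invariant geometry, concatenating it with the appearance features before decoding is what lets the model discount speaker-specific lip appearance.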
Keywords
Lipreading, Lip motion, Facial landmarks, Sentence-level translation