Unsupervised Co-Segmentation For Athlete Movements And Live Commentaries Using Crossmodal Temporal Proximity

2020 25th International Conference on Pattern Recognition (ICPR), 2020

Abstract
Audio-visual co-segmentation is the task of extracting segments and regions corresponding to specific events from unlabeled audio and video signals. Accomplishing it in an unsupervised way is particularly important, since manually labeling all the objects and events appearing in audio-visual signals for supervised learning is generally very difficult. Here, we propose to take advantage of the temporal proximity of corresponding audio and video entities included in the signals. For this purpose, we newly apply a guided attention scheme to this task to efficiently detect and exploit temporal co-occurrences of audio and video information. Experiments using real TV broadcasts of sumo wrestling, a sporting event, with live commentaries show that our model can automatically extract specific athlete movements and their spoken descriptions in an unsupervised manner.
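The core idea above is to bias crossmodal attention toward temporally close audio and video frames. A minimal illustrative sketch of such proximity-guided attention is shown below; the function name, the frame-embedding inputs, and the Gaussian form of the temporal prior are assumptions for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch: attention from audio frames to video frames,
# with scores biased by a Gaussian prior on the time difference
# (an assumed stand-in for the paper's guided attention scheme).
import numpy as np

def proximity_guided_attention(audio, video, audio_t, video_t, sigma=1.0):
    """audio: (Ta, d) audio frame embeddings; video: (Tv, d) video frame
    embeddings; audio_t: (Ta,) timestamps; video_t: (Tv,) timestamps.
    Returns a (Ta, Tv) row-stochastic attention matrix."""
    # Content similarity: scaled dot product between modalities.
    scores = audio @ video.T / np.sqrt(audio.shape[1])          # (Ta, Tv)
    # Temporal-proximity guide: penalize large time differences
    # (log of a Gaussian kernel added to the attention logits).
    dt = audio_t[:, None] - video_t[None, :]
    scores = scores - dt**2 / (2.0 * sigma**2)
    # Row-wise softmax over video frames (numerically stabilized).
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = proximity_guided_attention(
    rng.normal(size=(5, 8)), rng.normal(size=(7, 8)),
    np.linspace(0.0, 4.0, 5), np.linspace(0.0, 4.0, 7))
print(A.shape)  # each row is a distribution over the 7 video frames
```

With a small `sigma`, each audio frame attends almost exclusively to video frames near its own timestamp; as `sigma` grows, the prior weakens and attention is driven by content similarity alone.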
Keywords
unsupervised co-segmentation,live commentaries,crossmodal temporal proximity,audio-visual co-segmentation,unlabeled audio,video signals,audio-visual signals,supervised learning,video entities,guided attention scheme,temporal co-occurrences,video information,sport event,athlete movements,unsupervised manner,sumo wrestling,TV broadcasts