Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
CoRR (2023)
Abstract
We propose a novel benchmark for cross-view knowledge transfer in dense video
captioning, adapting models from web instructional videos with exocentric views
to an egocentric view. While dense video captioning (predicting time segments
and their captions) is primarily studied with exocentric videos (e.g.,
YouCook2), benchmarks with egocentric videos remain limited due to data
scarcity. Transferring knowledge from abundant exocentric web videos is a
practical way to overcome this limited video availability.
However, learning the correspondence between exocentric and egocentric views is
difficult due to their dynamic view changes. The web videos contain mixed views
focusing on either human body actions or close-up hand-object interactions,
while the egocentric view is constantly shifting as the camera wearer moves.
This necessitates an in-depth study of cross-view transfer under complex view
changes. In this work, we first create a real-life egocentric dataset (EgoYC2)
whose captions are shared with YouCook2, enabling transfer learning between
these datasets, assuming their ground truth is accessible. To bridge the view
gaps, we propose a view-invariant learning method using adversarial training in
both the pre-training and fine-tuning stages. While the pre-training is
designed to learn features invariant to the mixed views in the web videos,
the view-invariant fine-tuning further mitigates the view gaps between the two
datasets. We validate our proposed method by studying how effectively it
overcomes the view change problem and how efficiently it transfers knowledge to
the egocentric domain. Our benchmark extends the study of cross-view transfer
to the new task domain of dense video captioning and is intended to encourage
methodologies for describing egocentric videos in natural language.