Order-Constrained Representation Learning for Instructional Video Prediction

IEEE Transactions on Circuits and Systems for Video Technology (2022)

Abstract
In this paper, we propose a weakly-supervised approach called Order-Constrained Representation Learning (OCRL) that predicts future actions in instructional videos from an incomplete sequence of observed action steps. Most conventional methods predict actions from partially observed video frames and therefore mainly exploit low-level semantics such as motion consistency. Unlike performing a single action, completing a task in an instructional video usually requires several action steps over a longer period. Motivated by the fact that the order of action steps is key to learning task semantics, we develop a new form of contrastive loss, called StepNCE, which integrates the semantic information shared between step order and task semantics under a memory bank-based momentum-updating framework. Specifically, we learn video representations from trimmed clips whose step order has been rearranged, guided by the proposed task-consistency and order-consistency rules. The StepNCE loss is used to pre-train a video feature encoder, which is then fine-tuned for the instructional video prediction task. Our approach digs deeper into the sequential logic between action steps within a task, raising video understanding to a higher semantic level. We evaluate our method on five popular instructional video and action prediction datasets: COIN, CrossTask, UT-Interaction, BIT-Interaction, and ActivityNet v1.2. The results show that our approach improves upon conventional prediction methods.
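The abstract does not give the exact form of StepNCE. As a rough illustration of the ingredients it names, an InfoNCE-style contrastive objective with a memory bank of negatives and a momentum-updated key encoder, here is a minimal sketch. All function names (stepnce_style_loss, momentum_update) and the choice of treating embeddings that satisfy the task-consistency/order-consistency rules as positives are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming MoCo-style training with a memory bank of
# negatives; not the paper's actual StepNCE formulation.
import torch
import torch.nn.functional as F

def stepnce_style_loss(query, positives, memory_bank, temperature=0.07):
    """query: (D,) embedding of an order-rearranged clip;
    positives: (P, D) embeddings assumed consistent under the
    task-/order-consistency rules; memory_bank: (K, D) stored negatives."""
    q = F.normalize(query, dim=-1)
    pos = F.normalize(positives, dim=-1)
    neg = F.normalize(memory_bank, dim=-1)

    pos_logits = pos @ q / temperature          # (P,) similarity to positives
    neg_logits = neg @ q / temperature          # (K,) similarity to negatives
    # One InfoNCE term per positive, each contrasted against the whole bank;
    # the correct class (index 0) is the positive logit.
    logits = torch.cat(
        [pos_logits.unsqueeze(1), neg_logits.expand(pos.size(0), -1)], dim=1
    )
    labels = torch.zeros(pos.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # Key encoder slowly tracks the query encoder, as in MoCo-style training.
    for q_p, k_p in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_p.data.mul_(m).add_(q_p.data, alpha=1.0 - m)
```

In a training loop of this style, each step would enqueue the key encoder's outputs into the memory bank and call momentum_update after the optimizer step, so the bank stays consistent with a slowly evolving encoder.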
Keywords
Instructional video, video prediction, weakly-supervised learning, representation learning