ActBERT: Learning Global-Local Video-Text Representations

CVPR 2020

Cited by 429 | Views 458
Abstract
In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clue extraction from contextual information. This enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms the state of the art, demonstrating its superiority in video-text representation learning.
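To make the three-stream design of the ENT block more concrete, below is a minimal PyTorch sketch of one possible entangled transformer layer: each stream (global actions, regional objects, text tokens) first runs self-attention, then the text attends to an action+region context and the regions attend to an action+text context, so the global action cue mediates the cross-modal exchange. The module names, layer sizes, and exact attention wiring are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an "entangled" transformer block in the spirit of ActBERT's ENT.
# All names, dimensions, and the precise attention wiring are assumptions for illustration.
import torch
import torch.nn as nn


class EntangledBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        # Per-stream self-attention
        self.act_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.reg_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-modal attention: the global action features are mixed into the
        # keys/values that the other two streams attend over
        self.txt_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.reg_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, act, reg, txt):
        # act: (B, Na, D) global action features
        # reg: (B, Nr, D) local regional object features
        # txt: (B, Nt, D) word/token embeddings of the text description
        act = self.norm(act + self.act_self(act, act, act)[0])
        reg = self.norm(reg + self.reg_self(reg, reg, reg)[0])
        txt = self.norm(txt + self.txt_self(txt, txt, txt)[0])
        # Entangle: text attends to action+region context, regions attend to action+text
        vis_ctx = torch.cat([act, reg], dim=1)
        lng_ctx = torch.cat([act, txt], dim=1)
        txt = self.norm(txt + self.txt_cross(txt, vis_ctx, vis_ctx)[0])
        reg = self.norm(reg + self.reg_cross(reg, lng_ctx, lng_ctx)[0])
        return act, reg, txt


# Usage sketch with made-up sequence lengths
block = EntangledBlock()
a = torch.randn(2, 4, 768)   # action features
r = torch.randn(2, 10, 768)  # region features
t = torch.randn(2, 20, 768)  # text tokens
a, r, t = block(a, r, t)
```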
Keywords
ActBERT,global-local video-text representations,self-supervised learning,joint video-text representations,linguistic texts,local regional objects,visual clues,paired video sequences,text descriptions,detailed visual text relation modeling,linguistic descriptions,judicious clues extraction,contextual information,fine-grained objects,global human intention,downstream video,language tasks,text-video clip retrieval,video captioning,video question answering,action segmentation,action step localization,video-text representation learning,global action information,entangled transformer block,unlabeled data,ENT