Long-term Leap Attention, Short-term Periodic Shift for Video Classification

International Multimedia Conference (2022)

Abstract
A video transformer naturally incurs a heavier computation burden than a static vision transformer, since it processes a sequence T times longer under the current attention of quadratic complexity, O(T²N²). Existing works treat the temporal axis as a simple extension of the spatial axes, focusing on shortening the spatio-temporal sequence by either generic pooling or local windowing, without exploiting temporal redundancy. However, videos naturally contain redundant information between neighboring frames; we could therefore suppress attention on visually similar frames in a dilated manner. Based on this hypothesis, we propose LAPS, a long-term "Leap Attention" (LA) and short-term "Periodic Shift" (P-Shift) module for video transformers, with O(2TN²) complexity. Specifically, the LA groups long-term frames into pairs, then refactors each discrete pair via attention. The P-Shift exchanges features between temporal neighbors to counter the loss of short-term dynamics. By replacing a vanilla 2D attention with LAPS, we can adapt a static transformer into a video one with zero extra parameters and negligible computation overhead (~2.6%). Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS transformer achieves competitive accuracy, FLOPs, and parameter counts among CNN and transformer SOTAs. We open-source our project at: https://github.com/VideoNetworks/LAPS-transformer .
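To make the claimed cost concrete, below is a minimal PyTorch-style sketch of the idea: a periodic shift that exchanges a slice of channels with neighboring frames, followed by a "leap" attention computed jointly within dilated frame pairs. The tensor layout (B, T, N, C), the shift ratio, and the pairing rule (frame t with frame t + T/2) are illustrative assumptions, not the authors' released implementation (see their GitHub repository for that); the sketch only shows why the cost is roughly (T/2)·(2N)² = 2TN² rather than (TN)².

```python
# Illustrative sketch only. Assumptions: input is (B, T, N, C) -- batch,
# frames, tokens per frame, channels -- and each frame t is paired with
# its long-range partner t + T/2. Details differ from the official code.
import torch
import torch.nn as nn


def periodic_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """Exchange a small slice of channels with temporal neighbors.

    x: (B, T, N, C). 1/fold_div of the channels come from frame t-1,
    another 1/fold_div from frame t+1, and the rest stay in place.
    """
    B, T, N, C = x.shape
    fold = C // fold_div
    out = x.clone()
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                   # shift forward in time
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]   # shift backward in time
    return out


class LeapAttention(nn.Module):
    """Joint attention within dilated frame pairs (t, t + T/2).

    Each pair attends over 2N tokens, so the total cost is
    (T/2) * (2N)^2 = 2*T*N^2 instead of (T*N)^2 for full space-time attention.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, C = x.shape
        assert T % 2 == 0, "expects an even number of frames"
        x = periodic_shift(x)                          # inject short-term dynamics
        a, b = x[:, : T // 2], x[:, T // 2:]           # (B, T/2, N, C) each
        pair = torch.cat([a, b], dim=2)                # (B, T/2, 2N, C)
        pair = pair.reshape(B * (T // 2), 2 * N, C)
        out, _ = self.attn(pair, pair, pair)           # attention within each pair
        out = out.reshape(B, T // 2, 2, N, C)
        # re-assemble the frame axis: first halves then second halves of the pairs
        return torch.cat([out[:, :, 0], out[:, :, 1]], dim=1)


if __name__ == "__main__":
    x = torch.randn(2, 8, 196, 768)                    # 8 frames of 14x14 patch tokens
    y = LeapAttention(dim=768)(x)
    print(y.shape)                                     # torch.Size([2, 8, 196, 768])
```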
Keywords
video classification,attention,long-term,short-term