Taylor Videos for Action Recognition

Lei Wang, Xiuyuan Yuan, Tom Gedeon, Liang Zheng

ICML 2024

Australian National University | Curtin University

Abstract

Effectively extracting motions from video is a critical and long-standing problem for action recognition. This problem is very challenging because motions (i) do not have an explicit form, (ii) have various concepts such as displacement, velocity, and acceleration, and (iii) often contain noise caused by unstable pixels. Addressing these challenges, we propose the Taylor video, a new video format that highlights the dominant motions (e.g., a waving hand) in each of its frames, called Taylor frames. The Taylor video is named after the Taylor series, which approximates a function at a given point using important terms. In the scenario of videos, we define an implicit motion-extraction function which aims to extract motions from a video temporal block. In this block, using the frames, the difference frames, and higher-order difference frames, we perform Taylor expansion to approximate this function at the starting frame. We show that the summation of the higher-order terms in the Taylor series gives us dominant motion patterns, where static objects and small, unstable motions are removed. Experimentally, we show that Taylor videos are effective inputs to popular architectures including 2D CNNs, 3D CNNs, and transformers. When used individually, Taylor videos yield competitive action recognition accuracy compared to RGB videos and optical flow. When fused with RGB or optical flow videos, further accuracy improvement is achieved.
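
The expansion described in the abstract can be illustrated with a minimal sketch. This is one plausible reading, not the authors' implementation: it assumes grayscale frames, forward finite differences as the "difference frames", a unit time step, and a Taylor frame formed as the sum of the k-th order differences at the starting frame divided by k!. The function name `taylor_frame` and the parameter `num_terms` are hypothetical.

```python
import numpy as np
from math import factorial


def taylor_frame(block, num_terms=None):
    """Sum finite-difference Taylor terms at the first frame of a block.

    block: array of shape (T, H, W), a temporal block of grayscale frames.
    num_terms: number of difference orders to sum (assumed default: T - 1).
    Returns an (H, W) array. The zeroth-order term (the frame itself) is
    left out, so pixels that never change contribute nothing.
    """
    block = np.asarray(block, dtype=np.float64)
    max_order = block.shape[0] - 1
    if num_terms is None:
        num_terms = max_order
    if not 1 <= num_terms <= max_order:
        raise ValueError("num_terms must be between 1 and T - 1")

    frame = np.zeros_like(block[0])
    diff = block
    for k in range(1, num_terms + 1):
        # After k passes, diff[0] is the k-th order forward difference
        # evaluated at the starting frame.
        diff = np.diff(diff, axis=0)
        frame += diff[0] / factorial(k)
    return frame


# A block of identical frames has no motion, so its Taylor frame is zero.
static_block = np.ones((4, 2, 2))
print(taylor_frame(static_block))
```

Sliding such a block over a video with some stride would yield one frame per block, which is how a sequence of Taylor frames could be assembled under these assumptions.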

Key words

Action Recognition
