Taylor Videos for Action Recognition
arXiv (Cornell University) (2024)
Abstract
Effectively extracting motions from video is a critical and long-standing problem for action recognition. This problem is very challenging because motions (i) do not have an explicit form, (ii) have various concepts such as displacement, velocity, and acceleration, and (iii) often contain noise caused by unstable pixels. Addressing these challenges, we propose the Taylor video, a new video format that highlights the dominant motions (e.g., a waving hand) in each of its frames, named the Taylor frame. The Taylor video is named after the Taylor series, which approximates a function at a given point using important terms. In the scenario of videos, we define an implicit motion-extraction function which aims to extract motions from a video temporal block. In this block, using the frames, the difference frames, and higher-order difference frames, we perform Taylor expansion to approximate this function at the starting frame. We show that the summation of the higher-order terms in the Taylor series gives us dominant motion patterns, where static objects and small, unstable motions are removed. Experimentally, we show that Taylor videos are effective inputs to popular architectures including 2D CNNs, 3D CNNs, and transformers. When used individually, Taylor videos yield competitive action recognition accuracy compared to RGB videos and optical flow. When fused with RGB or optical flow videos, further accuracy improvement is achieved.
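The abstract does not give the exact formula, so the following is only a minimal sketch of how a single Taylor frame might be computed from one temporal block. It assumes that forward finite differences at the starting frame stand in for temporal derivatives, that the k-th term is weighted by 1/k!, and that the time step is 1. The function name `taylor_frame` and the parameters `num_terms` and `step` are hypothetical and do not come from the paper.

```python
import math

import numpy as np


def taylor_frame(block, num_terms=None, step=1.0):
    """Sum the order >= 1 Taylor terms of a temporal block at its first frame.

    block: array of shape (T, H, W) or (T, H, W, C), with T >= 2 frames.
    num_terms: highest difference order to include (assumed default: T - 1).
    step: assumed temporal step multiplying each term as step**k.
    """
    block = np.asarray(block, dtype=np.float64)
    if block.ndim < 2 or block.shape[0] < 2:
        raise ValueError("block needs at least two frames")
    max_order = block.shape[0] - 1
    if num_terms is None:
        num_terms = max_order
    if not 1 <= num_terms <= max_order:
        raise ValueError("num_terms must be between 1 and T - 1")

    diff = block
    frame = np.zeros_like(block[0])
    for k in range(1, num_terms + 1):
        # After k passes, diff[0] is the k-th order forward difference
        # evaluated at the starting frame of the block.
        diff = np.diff(diff, axis=0)
        frame += diff[0] * step**k / math.factorial(k)
    return frame


# A static block contains no motion, so every difference term vanishes.
static = np.ones((4, 2, 2))
static_frame = taylor_frame(static)

# Pixels that brighten quadratically over time (0, 1, 4, 9).
t = np.arange(4, dtype=np.float64).reshape(4, 1, 1)
moving_frame = taylor_frame(t**2 * np.ones((4, 2, 2)))
```

Under these assumptions, a static region yields an all-zero Taylor frame because the zeroth-order term (the starting frame itself) is left out of the sum, which is consistent with the abstract's claim that static objects are removed. Sliding the block along the video and stacking the resulting frames would give a Taylor video in this simplified sense.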
Key words
Action Recognition