T-Vlad: Temporal Vector Of Locally Aggregated Descriptor For Multiview Human Action Recognition

Pattern Recognition Letters (2021)

Cited by 10 | Viewed 24
Abstract
Robust view-invariant human action recognition (HAR) requires an effective representation of the temporal structure of multi-view videos. This study explores a view-invariant action representation based on convolutional features. Action representation over long video segments is computationally expensive, whereas features from short segments cover the temporal structure only locally. Previous methods rely on complex multi-stream deep convolutional feature maps extracted over short segments. To address this issue, a novel framework is proposed based on a temporal vector of locally aggregated descriptors (T-VLAD). T-VLAD encodes the long-term temporal structure of a video using single-stream convolutional features over short segments. A standard VLAD vector's size is a multiple of its feature codebook size (256 is normally recommended). VLAD is modified to incorporate the time-order information of segments, so the T-VLAD vector size is a multiple of its smaller time-order codebook size. Previous methods have not been extensively validated under view variation. Results are validated in a challenging setup, where one view is used for testing and the remaining views are used for training. State-of-the-art results are obtained on three fixed-camera multi-view datasets: IXMAS, MuHAVi, and MCAD. The proposed T-VLAD encoding also works equally well on UCF101, a dataset with dynamic backgrounds.
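To make the dimensionality argument concrete, the sketch below implements standard VLAD encoding in NumPy and contrasts it with a time-order variant. Only the `vlad` function follows the well-known formulation; `t_vlad_sketch`, including its temporal anchor placement, per-anchor pooling, and normalization, is an illustrative guess based solely on the abstract's description, not the paper's actual T-VLAD definition.

```python
import numpy as np

def vlad(descriptors, codebook):
    """Standard VLAD: sum of residuals to the nearest codeword.

    descriptors: (N, D) local features; codebook: (K, D) visual words.
    The output has K * D dimensions, i.e. a multiple of the codebook size K.
    """
    # Hard-assign each descriptor to its nearest codeword.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assign = dists.argmin(axis=1)

    K, D = codebook.shape
    v = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            v[k] = (members - codebook[k]).sum(axis=0)  # residual sum per codeword

    v = np.sign(v) * np.sqrt(np.abs(v))  # signed square-root (power) normalization
    n = np.linalg.norm(v)
    return (v / n).ravel() if n > 0 else v.ravel()


def t_vlad_sketch(segment_feats, num_time_anchors=4):
    """Speculative time-order aggregation inferred from the abstract only:
    segments are assigned to a small codebook of temporal anchors and their
    features are pooled per anchor, so the output size is a multiple of the
    smaller time-order codebook size rather than of a 256-word feature codebook.
    """
    n_seg, d = segment_feats.shape
    t = np.linspace(0.0, 1.0, n_seg)                   # normalized segment order
    anchors = np.linspace(0.0, 1.0, num_time_anchors)  # assumed anchor placement
    assign = np.abs(t[:, None] - anchors[None, :]).argmin(axis=1)

    v = np.zeros((num_time_anchors, d))
    for k in range(num_time_anchors):
        members = segment_feats[assign == k]
        if len(members):
            v[k] = members.sum(axis=0)                 # per-anchor pooling

    v = np.sign(v) * np.sqrt(np.abs(v))
    n = np.linalg.norm(v)
    return (v / n).ravel() if n > 0 else v.ravel()


# Illustrative shapes: 20 short segments with 512-D single-stream features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 512))
print(vlad(feats, rng.normal(size=(256, 512))).shape)  # (131072,) = 256 * 512
print(t_vlad_sketch(feats).shape)                      # (2048,)   =   4 * 512
```

With these (assumed) shapes, the contrast the abstract draws is visible directly: the standard VLAD output grows with the 256-word feature codebook, while the time-order encoding stays a multiple of the much smaller temporal codebook.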
Keywords
Human action recognition, Multi-view, View-invariant, Temporal action sequence, VLAD, 3D convolutional neural network features, IXMAS, MuHAVi, UCF101, Short segment features