Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates

IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 3007-3021, 2018.

DOI: https://doi.org/10.1109/TPAMI.2017.2771306

Abstract:

Skeleton-based human action recognition has attracted a lot of research attention during the past few years. Recent works attempted to utilize recurrent neural networks to model the temporal dependencies between the 3D positional configurations of human body joints for better analysis of human activities in the skeletal data.

Introduction
  • Human action recognition is a fast-developing research area due to its wide applications in intelligent surveillance, human-computer interaction, robotics, and so on.
  • Human activity analysis based on human skeletal data has attracted a lot of attention, and various methods for feature extraction and classifier learning have been developed for skeleton-based action recognition [1], [2], [3].
  • A hidden Markov model (HMM) is utilized by Xia et al. [4] to model the temporal dynamics over a histogram-based representation of joint positions for action recognition.
  • A skeleton-based dictionary learning method using geometry constraint and group sparsity is introduced in [9]
Highlights
  • Human action recognition is a fast-developing research area due to its wide applications in intelligent surveillance, human-computer interaction, robotics, and so on
  • Since the 3D positions of skeletal joints provided by depth sensors are not always accurate, we further introduce a new gating mechanism, the so-called “trust gate”, into our spatio-temporal long short-term memory (ST-LSTM) network to analyze the reliability of the input data at each spatio-temporal step (a hedged sketch of such a unit is given after this list)
  • One question that may arise here is whether the advantage of the “ST-LSTM (Tree)” model could be due only to the longer and redundant sequence of joints fed to the network, rather than to the proposed semantic relations between the joints
  • To further investigate the effect of simultaneously modeling dependencies in the spatial and temporal domains, in this experiment we replace our ST-LSTM with the original LSTM, which only models the temporal dynamics among frames without explicitly considering spatial dependencies
  • The second observation of this experiment is that adding our trust gate to the original LSTM also improves its performance, but the gain is smaller than the gain obtained when the trust gate is added to ST-LSTM
  • We have extended the recurrent neural network (RNN)-based action recognition method to both the spatial and temporal domains
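  A minimal NumPy sketch of one spatio-temporal step of such a trust-gated unit is given below. It only follows the description in this summary (a spatial context state, a temporal context state, and a gate that compares the input with a prediction made from the context); the weight names, the lam constant, and the exact way the trust gate enters the cell update are illustrative assumptions, not the paper's exact equations.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def st_lstm_step(x, h_s, c_s, h_t, c_t, W, lam=0.5):
        # x        : input feature of joint j at frame t
        # h_s, c_s : hidden / cell state of the spatial predecessor (joint j-1, frame t)
        # h_t, c_t : hidden / cell state of the temporal predecessor (joint j, frame t-1)
        # W        : dict of weight matrices (hypothetical names); lam controls gate sharpness
        z = np.concatenate([x, h_s, h_t])
        i   = sigmoid(W["i"]  @ z)   # input gate
        f_s = sigmoid(W["fs"] @ z)   # forget gate over the spatial context
        f_t = sigmoid(W["ft"] @ z)   # forget gate over the temporal context
        o   = sigmoid(W["o"]  @ z)   # output gate
        u   = np.tanh(W["u"]  @ z)   # candidate cell input

        # Trust gate: predict what the input "should" look like from the context,
        # compare with the (mapped) actual input, and push the gate towards 0 when
        # the mismatch is large, so an unreliable joint barely enters the memory.
        p     = np.tanh(W["p"] @ np.concatenate([h_s, h_t]))
        x_map = np.tanh(W["x"] @ x)
        tau   = np.exp(-lam * (x_map - p) ** 2)

        c = tau * i * u + f_s * c_s + f_t * c_t   # noisy input is down-weighted here
        h = o * np.tanh(c)
        return h, c

    # Tiny usage with random weights: 3-D joint coordinates, hidden size 8.
    rng = np.random.default_rng(0)
    D, H = 3, 8
    W = {k: rng.standard_normal((H, D + 2 * H)) for k in ["i", "fs", "ft", "o", "u"]}
    W["p"] = rng.standard_normal((H, 2 * H))
    W["x"] = rng.standard_normal((H, D))
    h, c = st_lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), np.zeros(H), np.zeros(H), W)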
Methods
  • TABLE 1 compares Lie Group [6], Cippitelli et al. [67], Dynamic Skeletons [62], FTP [68], Hierarchical RNN [30], Deep RNN [32], and Part-aware LSTM [32] with our ST-LSTM (Joint Chain), ST-LSTM (Tree), and ST-LSTM (Tree) + Trust Gate; all of these methods use geometric (skeleton-based) features.
    In TABLE 1, the deep RNN model concatenates the joint features at each frame and feeds them to the network to model the temporal kinetics, while ignoring the spatial dynamics (a toy sketch contrasting this with our spatio-temporal unrolling follows this list).
  • The authors observe that “ST-LSTM (Tree) + Trust Gate” significantly outperforms “ST-LSTM (Tree)” for most of the action classes, which demonstrates that the proposed trust gate can effectively improve the action recognition accuracy by learning the degree of reliability of the input data at each time step.
  • As illustrated in Figure 11(a), the magnitude of the trust gate’s output (the ℓ2 norm of its activations) is smaller when a noisy joint is fed than when the corresponding rectified joint is fed
  • This demonstrates how the network controls the impact of noisy input on its stored representation of the observed data
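  To make the contrast drawn above concrete, the toy sketch below shows the two unrolling patterns: a plain recurrence over frames that consumes the concatenated joints of each frame, versus a grid recurrence over (joint, frame) positions in which every unit receives both a spatial and a temporal context state. The cell here is a deliberately trivial stand-in, not the gated ST-LSTM unit; only the loop structure reflects the text.

    import numpy as np

    def toy_cell(x, h_spatial, h_temporal):
        # Stand-in for a recurrent unit: mixes the input with both context states.
        return np.tanh(x.mean() + 0.5 * h_spatial + 0.5 * h_temporal)

    def plain_lstm_over_frames(skeleton):             # skeleton: (T, J, 3)
        h = 0.0
        for t in range(skeleton.shape[0]):
            frame = skeleton[t].reshape(-1)           # concatenate all joints of frame t
            h = toy_cell(frame, 0.0, h)               # temporal recurrence only
        return h

    def st_lstm_over_joints_and_frames(skeleton):     # skeleton: (T, J, 3)
        T, J, _ = skeleton.shape
        H = np.zeros((T, J))                          # one hidden state per (frame, joint)
        for t in range(T):
            for j in range(J):
                h_spatial  = H[t, j - 1] if j > 0 else 0.0
                h_temporal = H[t - 1, j] if t > 0 else 0.0
                H[t, j] = toy_cell(skeleton[t, j], h_spatial, h_temporal)
        return H[-1, -1]

    demo = np.random.default_rng(1).standard_normal((10, 15, 3))   # 10 frames, 15 joints
    print(plain_lstm_over_frames(demo), st_lstm_over_joints_and_frames(demo))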
Results
  • Evaluation Datasets

    NTU RGB+D dataset [32] was captured with Kinect (v2). It is currently the largest publicly available dataset for depth-based action recognition, which contains more than 56,000 video sequences and 4 million video frames.
  • One question that may arise here is whether the advantage of the “ST-LSTM (Tree)” model could be due only to the longer and redundant sequence of joints fed to the network, rather than to the proposed semantic relations between the joints
  • To answer this question, the authors evaluate the effect of using a double-chain scheme to increase the number of spatial steps of the “ST-LSTM (Joint Chain)” model (see the ordering sketch after this list).
  • The results in TABLE 14 show the advantages of using the last-to-first link in improving the final action recognition performance
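  The sketch below illustrates how the different joint orderings discussed above could be built. The five-joint adjacency list is a toy example rather than the actual Kinect skeleton, and the traversal shown (record a joint every time it is entered or returned to, so that consecutive entries are always adjacent in the body graph) is one plausible reading of the tree-traversal idea, not the paper's exact ordering.

    def joint_chain(num_joints):
        return list(range(num_joints))                  # one pass over a fixed chain

    def double_chain(num_joints):
        order = list(range(num_joints))
        return order + order[::-1]                      # longer sequence, but still no tree structure

    def tree_traversal(adjacency, root=0):
        order = []
        def visit(j, parent):
            order.append(j)                             # record the joint on the way down
            for k in adjacency[j]:
                if k != parent:
                    visit(k, j)
                    order.append(j)                     # record it again on the way back up
        visit(root, None)
        return order

    # Toy 5-joint skeleton: joint 0 is the torso, 1-2 the arms, 3-4 the legs.
    toy_adjacency = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
    print(joint_chain(5))                  # [0, 1, 2, 3, 4]
    print(double_chain(5))                 # [0, 1, 2, 3, 4, 4, 3, 2, 1, 0]
    print(tree_traversal(toy_adjacency))   # [0, 1, 0, 2, 0, 3, 0, 4, 0]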
Conclusion
  • The authors have extended the RNN-based action recognition method to both spatial and temporal domains.
  • A skeleton tree traversal method based on the adjacency graph of body joints is proposed to better represent the structure of the input sequences and to improve the performance of the network by connecting the most related joints together in the input sequence.
  • A multi-modal feature fusion method is also proposed for the ST-LSTM framework (a hedged fusion sketch follows this list).
  • The experimental results have validated the contributions and demonstrated the effectiveness of the approach, which outperforms the existing state-of-the-art methods on seven challenging benchmark datasets
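  The following is a hypothetical sketch of what fusing two modalities inside a single recurrent unit can look like: each modality gets its own input gate and candidate, and both gated candidates write into the same memory cell, in contrast to simply concatenating the features at the input. The gate layout and weight names are assumptions for illustration; the fusion design actually used in the paper may differ.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fused_lstm_step(x_geo, x_vis, h_prev, c_prev, W):
        # x_geo: geometric (joint-based) features, x_vis: visual (appearance) features.
        # W is a dict of hypothetical weight matrices keyed by gate name.
        z_geo = np.concatenate([x_geo, h_prev])
        z_vis = np.concatenate([x_vis, h_prev])
        z_all = np.concatenate([x_geo, x_vis, h_prev])

        i_geo = sigmoid(W["i_geo"] @ z_geo)     # modality-specific input gates
        i_vis = sigmoid(W["i_vis"] @ z_vis)
        u_geo = np.tanh(W["u_geo"] @ z_geo)     # modality-specific candidates
        u_vis = np.tanh(W["u_vis"] @ z_vis)
        f = sigmoid(W["f"] @ z_all)             # shared forget gate
        o = sigmoid(W["o"] @ z_all)             # shared output gate

        c = f * c_prev + i_geo * u_geo + i_vis * u_vis   # both modalities write to one cell
        h = o * np.tanh(c)
        return h, c

    rng = np.random.default_rng(0)
    Dg, Dv, H = 3, 16, 8
    W = {"i_geo": rng.standard_normal((H, Dg + H)), "u_geo": rng.standard_normal((H, Dg + H)),
         "i_vis": rng.standard_normal((H, Dv + H)), "u_vis": rng.standard_normal((H, Dv + H)),
         "f": rng.standard_normal((H, Dg + Dv + H)), "o": rng.standard_normal((H, Dg + Dv + H))}
    h, c = fused_lstm_step(rng.standard_normal(Dg), rng.standard_normal(Dv), np.zeros(H), np.zeros(H), W)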
Tables
  • Table1: EXPERIMENTAL RESULTS ON THE NTU RGB+D DATASET
  • Table2: EVALUATION OF DIFFERENT FEATURE FUSION STRATEGIES ON THE NTU RGB+D DATASET. “GEOMETRIC + VISUAL (1)”
  • Table3: EXPERIMENTAL RESULTS ON THE UT-KINECT DATASET (LOOCV PROTOCOL [4])
  • Table4: RESULTS ON THE UT-KINECT DATASET (HALF-VS-HALF PROTOCOL [69])
  • Table5: EVALUATION OF OUR APPROACH FOR FEATURE FUSION ON THE UT-KINECT DATASET (LOOCV PROTOCOL [4]). “GEOMETRIC + VISUAL” INDICATES WE SIMPLY CONCATENATE THE TWO TYPES OF FEATURES AS THE INPUT. “GEOMETRIC VISUAL” MEANS WE USE
  • Table6: EXPERIMENTAL RESULTS ON THE SBU INTERACTION DATASET
  • Table7: EXPERIMENTAL RESULTS ON THE SYSU-3D DATASET
  • Table8: EVALUATION FOR SKELETON ROTATION ON THE SYSU-3D DATASET. The results show that it is beneficial to use the transformed skeletons as the input for action recognition.
  • Table9: EXPERIMENTAL RESULTS ON THE CHALEARN GESTURE DATASET
  • Table10: EXPERIMENTAL RESULTS ON THE MSR ACTION3D DATASET
  • Table11: EXPERIMENTAL RESULTS ON THE BERKELEY MHAD DATASET
  • Table12: PERFORMANCE COMPARISON OF DIFFERENT SPATIAL SEQUENCE MODELS
  • Table13: PERFORMANCE COMPARISON OF TEMPORAL AVERAGE, LSTM, AND OUR PROPOSED ST-LSTM
  • Table14: EVALUATION OF THE LAST-TO-FIRST LINK IN OUR PROPOSED NETWORK
Related work
  • Skeleton-based action recognition has been explored in different aspects during recent years [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47]. In this section, we limit our review to more recent approaches which use RNNs or LSTMs for human activity analysis.

    Du et al. [30] proposed a hierarchical RNN network by utilizing multiple bidirectional RNNs in a novel hierarchical fashion. The human skeletal structure was divided into five major joint groups, and each group was fed into a corresponding bidirectional RNN. The outputs of these RNNs were concatenated to represent the upper body and the lower body, and each representation was further fed into another RNN. By concatenating the outputs of these two RNNs, a global body representation was obtained, which was fed to the next RNN layer. Finally, a softmax classifier was used in [30] to perform action classification. A structural sketch of this hierarchy is given below.
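  A structural sketch of that hierarchy, using a toy stand-in for a bidirectional RNN, is shown below. The five-part grouping follows the description above, but the assignment of parts to the upper and lower body, the part names, and the toy encoder are illustrative assumptions only.

    import numpy as np

    def bi_rnn(seq):
        # Stand-in for a bidirectional RNN over frames: (T, D) -> (T, 2D) running means.
        fwd = np.cumsum(seq, axis=0) / np.arange(1, len(seq) + 1)[:, None]
        bwd = np.cumsum(seq[::-1], axis=0)[::-1] / np.arange(len(seq), 0, -1)[:, None]
        return np.concatenate([fwd, bwd], axis=1)

    def hierarchical_rnn(parts):
        # parts: dict of five (T, D) joint-group sequences (hypothetical keys).
        enc = {name: bi_rnn(seq) for name, seq in parts.items()}                     # level 1: per part
        upper = bi_rnn(np.concatenate([enc["left_arm"], enc["right_arm"], enc["torso"]], axis=1))
        lower = bi_rnn(np.concatenate([enc["left_leg"], enc["right_leg"]], axis=1))  # level 2
        body = bi_rnn(np.concatenate([upper, lower], axis=1))                        # level 3: whole body
        return body.mean(axis=0)    # pooled feature that a softmax classifier would consume

    T, D = 20, 9
    rng = np.random.default_rng(0)
    parts = {name: rng.standard_normal((T, D))
             for name in ["left_arm", "right_arm", "torso", "left_leg", "right_leg"]}
    feature = hierarchical_rnn(parts)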
Funding
  • This work was carried out at the Rapid-Rich Object Search (ROSE) Lab, Nanyang Technological University. The ROSE Lab is supported by the National Research Foundation, Singapore, under its IDM Strategic Research Programme.
Study subjects and analysis
challenging benchmark datasets: 7
Moreover, we introduce a novel multi-modal feature fusion strategy within the LSTM unit in this paper. The comprehensive experimental results on seven challenging benchmark datasets for human action recognition demonstrate the effectiveness of the proposed method.

persons: 7
TABLE 10 and TABLE 11 compare geometric-feature methods such as Histogram of 3D Joints [4], Joint Angles Similarities [8], SCs (Informative Joints) [75], Oriented Displacements [87], Lie Group [6], Space Time Pose [73], Lillo et al. [88], Hierarchical RNN [30], Ofli et al. [89], Vantigodi et al. [90], [91], Kapsouras et al. [92], and Co-occurrence LSTM [48] against our ST-LSTM (Tree) + Trust Gate. We adopt the experimental protocol of [30] on the Berkeley MHAD dataset: the 384 video sequences of the first seven persons are used for training, and the 275 sequences of the remaining five persons are held out for testing (a hedged split sketch follows this entry). The experimental results in TABLE 11 show that our method achieves a very high accuracy (100%) on this dataset.
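A hedged sketch of the cross-subject split described above (first seven subjects for training, the remaining subjects for testing) is given below. The sequence records and their "subject" field are hypothetical stand-ins; the actual MHAD loading code is not shown.

    def split_mhad(sequences, train_subjects=tuple(range(1, 8))):
        # sequences: list of dicts, each with a hypothetical "subject" field (1..12)
        train = [s for s in sequences if s["subject"] in train_subjects]
        test  = [s for s in sequences if s["subject"] not in train_subjects]
        return train, test

    # Toy check with fake records; on the real data the split sizes are 384 and 275.
    toy = [{"subject": k % 12 + 1} for k in range(24)]
    train, test = split_mhad(toy)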

benchmark datasets: 7
(3) The functionality of the ST-LSTM framework is further extended by adding the proposed “trust gate”. (4) A multi-modal feature fusion strategy within the ST-LSTM unit is introduced. (5) The proposed method achieves state-of-the-art performance on seven benchmark datasets. The remainder of this paper is organized as follows

benchmark datasets: 4
This paper is an extension of our preliminary conference version [52]. In [52], we validated the effectiveness of our model on four benchmark datasets. In this paper, we extensively evaluate our model on seven challenging datasets. Besides, we further propose an effective feature fusion strategy inside the ST-LSTM unit.

benchmark datasets: 7
Thus, the network has better ability to learn the action patterns in the skeleton sequence. The proposed method is evaluated and empirically analyzed on seven benchmark datasets for which the coordinates of skeletal joints are provided. These datasets are NTU RGB+D, UT-Kinect, SBU Interaction, SYSU-3D, ChaLearn Gesture, MSR Action3D, and Berkeley MHAD

persons: 40
SYSU-3D dataset [62] contains 480 sequences and was collected with Kinect. In this dataset, 12 different activities were performed by 40 persons. The 3D coordinates of 20 joints are provided in this dataset

subjects: 10
MSR Action3D dataset [64] is widely used for depth-based action recognition. It contains a total of 10 subjects and 20 actions. Each action was performed by each subject two or three times

subjects: 20
We follow the standard evaluation protocol in [62] on the SYSU-3D dataset. The samples from 20 subjects are used to train the model parameters, and the samples of the remaining 20 subjects are used for testing. We perform 30-fold cross validation and report the mean accuracy in TABLE 7

Reference
  • F. Zhu, L. Shao, J. Xie, and Y. Fang, “From handcrafted to learned representations for human action recognition: a survey,” Image and Vision Computing, 2016.
  • L. L. Presti and M. La Cascia, “3d skeleton-based human action classification: A survey,” Pattern Recognition, 2016.
  • F. Han, B. Reily, W. Hoff, and H. Zhang, “Space-time representation of people based on 3d skeletal data: a review,” arXiv, 2016.
  • L. Xia, C. Chen, and J. Aggarwal, “View invariant human action recognition using histograms of 3d joints,” in CVPRW, 2012.
  • X. Yang and Y. Tian, “Effective 3d action recognition using eigenjoints,” Journal of Visual Communication and Image Representation, 2014.
  • R. Vemulapalli, F. Arrate, and R. Chellappa, “Human action recognition by representing 3d skeletons as points in a lie group,” in CVPR, 2014.
  • G. Evangelidis, G. Singh, and R. Horaud, “Skeletal quads: Human action recognition using joint quadruples,” in ICPR, 2014.
  • E. Ohn-Bar and M. Trivedi, “Joint angles similarities and hog2 for action recognition,” in CVPRW, 2013.
  • J. Luo, W. Wang, and H. Qi, “Group sparsity and geometry constrained dictionary learning for action recognition from depth maps,” in ICCV, 2013.
  • A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in ICASSP, 2013.
  • I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NIPS, 2014.
  • T. Mikolov, S. Kombrink, L. Burget, J. H. Cernocky, and S. Khudanpur, “Extensions of recurrent neural network language model,” in ICASSP, 2011.
  • M. Sundermeyer, R. Schluter, and H. Ney, “Lstm neural networks for language modeling,” in INTERSPEECH, 2012.
  • G. Mesnil, X. He, L. Deng, and Y. Bengio, “Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding,” in INTERSPEECH, 2013.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in CVPR, 2015.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015.
  • N. Srivastava, E. Mansimov, and R. Salakhudinov, “Unsupervised learning of video representations using lstms,” in ICML, 2015.
  • B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, “A multistream bi-directional recurrent neural network for fine-grained action detection,” in CVPR, 2016.
  • A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-rnn: Deep learning on spatio-temporal graphs,” in CVPR, 2016.
  • A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in CVPR, 2016.
  • Z. Deng, A. Vahdat, H. Hu, and G. Mori, “Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition,” in CVPR, 2016.
  • M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, “A hierarchical deep temporal model for group activity recognition,” in CVPR, 2016.
  • S. Ma, L. Sigal, and S. Sclaroff, “Learning activity progression in lstms for activity detection and early detection,” in CVPR, 2016.
  • B. Ni, X. Yang, and S. Gao, “Progressively parsing interactional objects for fine grained action detection,” in CVPR, 2016.
  • Y. Li, C. Lan, J. Xing, W. Zeng, C. Yuan, and J. Liu, “Online human action detection using joint classification-regression recurrent neural networks,” arXiv, 2016.
  • J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in CVPR, 2015.
  • J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015.
  • Q. Li, Z. Qiu, T. Yao, T. Mei, Y. Rui, and J. Luo, “Action recognition by learning deep multi-granular spatio-temporal video representation,” in ICMR, 2016.
  • Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, “Modeling spatial-temporal clues in a hybrid deep learning framework for video classification,” in ACM MM, 2015.
  • Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in CVPR, 2015.
  • V. Veeriah, N. Zhuang, and G.-J. Qi, “Differential recurrent neural networks for action recognition,” in ICCV, 2015.
  • A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+d: A large scale dataset for 3d human activity analysis,” in CVPR, 2016.
  • M. Meng, H. Drira, M. Daoudi, and J. Boonaert, “Human-object interaction recognition by learning the distances between the object and the skeleton joints,” in FG, 2015.
  • J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Learning actionlet ensemble for 3d human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
  • A. Shahroudy, T. T. Ng, Q. Yang, and G. Wang, “Multimodal multipart learning for action recognition in depth videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
  • J. Wang and Y. Wu, “Learning maximum margin temporal warping for action recognition,” in ICCV, 2013.
  • R. Vemulapalli and R. Chellapa, “Rolling rotations for recognizing human actions from 3d skeletal data,” in CVPR, 2016.
  • H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian, “Real time action recognition using histograms of depth gradients and random decision forests,” in WACV, 2014.
  • A. Shahroudy, G. Wang, and T.-T. Ng, “Multi-modal feature fusion for action recognition in rgb-d sequences,” in ISCCSP, 2014.
  • H. Rahmani and A. Mian, “Learning a non-linear knowledge transfer model for cross-view action recognition,” in CVPR, 2015.
  • I. Lillo, A. Soto, and J. Carlos Niebles, “Discriminative hierarchical modeling of spatio-temporally composable human activities,” in CVPR, 2014.
  • H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, “Towards understanding action recognition,” in ICCV, 2013.
  • C. Chen, R. Jafari, and N. Kehtarnavaz, “Fusion of depth, skeleton, and inertial data for human action recognition,” in ICASSP, 2016.
  • Z. Liu, C. Zhang, and Y. Tian, “3d-based deep convolutional neural network for action recognition with depth sequences,” Image and Vision Computing, 2016.
  • X. Cai, W. Zhou, L. Wu, J. Luo, and H. Li, “Effective active skeleton representation for low latency human action recognition,” IEEE Transactions on Multimedia, 2016.
  • A. S. Al Alwani and Y. Chahir, “Spatiotemporal representation of 3d skeleton joints-based action recognition using modified spherical harmonics,” Pattern Recognition Letters, 2016.
  • L. Tao and R. Vidal, “Moving poselets: A discriminative and interpretable skeletal motion representation for action recognition,” in ICCVW, 2015.
  • W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, “Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks,” in AAAI, 2016.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, 2014.
  • F. G. Harvey and C. Pal, “Semi-supervised learning with encoderdecoder recurrent neural networks: Experiments with motion capture sequences,” arXiv, 2016.
  • B. Mahasseni and S. Todorovic, “Regularizing long short term memory with 3d human-skeleton sequences for action recognition,” in CVPR, 2016.
  • J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatio-temporal lstm with trust gates for 3d human action recognition,” in ECCV, 2016.
  • S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997.
  • B. Zou, S. Chen, C. Shi, and U. M. Providence, “Automatic reconstruction of 3d human motion pose from uncalibrated monocular video sequences based on markerless human motion tracking,” Pattern Recognition, 2009.
  • Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixtures-of-parts,” in CVPR, 2011.
  • N. Dalal, B. Triggs, and C. Schmid, “Human detection using oriented histograms of flow and appearance,” in ECCV, 2006.
  • H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, “Action recognition by dense trajectories,” in CVPR, 2011.
  • K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • G. Cheron, I. Laptev, and C. Schmid, “P-cnn: Pose-based cnn features for action recognition,” in ICCV, 2015.
  • A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012.
  • K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras, “Two-person interaction detection using body-pose features and multiple instance learning,” in CVPRW, 2012.
  • J.-F. Hu, W. Zheng, J. Lai, and J. Zhang, “Jointly learning heterogeneous features for rgb-d activity recognition,” in CVPR, 2015.
  • S. Escalera, J. Gonzalez, X. Baro, M. Reyes, O. Lopes, I. Guyon, V. Athitsos, and H. Escalante, “Multi-modal gesture recognition challenge 2013: Dataset and results,” in ICMI, 2013.
  • W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3d points,” in CVPRW, 2010.
  • F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy, “Berkeley mhad: A comprehensive multimodal human action database,” in WACV, 2013.
  • R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A matlab-like environment for machine learning,” in NIPS Workshop, 2011.
  • E. Cippitelli, E. Gambi, S. Spinsante, and F. Florez-Revuelta, “Evaluation of a skeleton-based method for human activity recognition on a large-scale rgb-d dataset,” in TechAAL, 2016.
  • H. Rahmani and A. Mian, “3d action recognition from novel viewpoints,” in CVPR, 2016.
  • Y. Zhu, W. Chen, and G. Guo, “Fusing spatiotemporal features and joints for 3d action recognition,” in CVPRW, 2013.
  • R. Anirudh, P. Turaga, J. Su, and A. Srivastava, “Elastic functional coding of human actions: from vector-fields to latent variables,” in CVPR, 2015.
  • R. Slama, H. Wannous, M. Daoudi, and A. Srivastava, “Accurate 3d action recognition using learning on the grassmann manifold,” Pattern Recognition, 2015.
  • S. Jetley and F. Cuzzolin, “3d activity recognition using motion history and binary shape templates,” in ACCVW, 2014.
  • M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. Del Bimbo, “Space-time pose representation for 3d human action recognition,” in ICIAP, 2013.
  • M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. Bimbo, “3-d human action recognition by shape analysis of motion trajectories on riemannian manifold,” IEEE Transactions on Cybernetics, 2015.
  • M. Jiang, J. Kong, G. Bebis, and H. Huo, “Informative joints based human action recognition using skeleton contexts,” Signal Processing: Image Communication, 2015.
  • A. Chrungoo, S. Manimaran, and B. Ravindran, “Activity recognition for natural human robot interaction,” in ICSR, 2014.
  • C. Wang, Y. Wang, and A. L. Yuille, “Mining 3d key-pose-motifs for action recognition,” in CVPR, 2016.
  • Y. Ji, G. Ye, and H. Cheng, “Interactive body part contrast mining for human interaction recognition,” in ICMEW, 2014.
  • W. Li, L. Wen, M. Choo Chuah, and S. Lyu, “Category-blind human action recognition: a practical recognition system,” in ICCV, 2015.
  • A. Savitzky and M. Golay, “Smoothing and differentiation of data by simplified least squares procedures,” Analytical chemistry, 1964.
  • J.-F. Hu, W.-S. Zheng, L. Ma, G. Wang, and J. Lai, “Real-time rgb-d activity prediction by soft regression,” in ECCV, 2016.
  • H. Wang, W. Wang, and L. Wang, “Hierarchical motion evolution for action recognition,” in ACPR, 2015.
  • B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars, “Modeling video evolution for action recognition,” in CVPR, 2015.
  • A. Yao, L. Van Gool, and P. Kohli, “Gesture recognition portfolios for personalization,” in CVPR, 2014.
  • J. Wu, J. Cheng, C. Zhao, and H. Lu, “Fusing multi-modal features for gesture recognition,” in ICMI, 2013.
  • T. Pfister, J. Charles, and A. Zisserman, “Domain-adaptive discriminative one-shot learning of gestures,” in ECCV, 2014.
  • M. A. Gowayyed, M. Torki, M. E. Hussein, and M. El-Saban, “Histogram of oriented displacements (hod): Describing trajectories of human joints for action recognition,” in IJCAI, 2013.
  • I. Lillo, J. Carlos Niebles, and A. Soto, “A hierarchical pose-based approach to complex action understanding using dictionaries of actionlets and motion poselets,” in CVPR, 2016.
  • F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy, “Sequence of the most informative joints (smij): A new representation for human skeletal action recognition,” Journal of Visual Communication and Image Representation, 2014.
  • S. Vantigodi and R. V. Babu, “Real-time human action recognition from motion capture data,” in NCVPRIPG, 2013.
  • S. Vantigodi and V. B. Radhakrishnan, “Action recognition from motion capture data using meta-cognitive rbf network classifier,” in ISSNIP, 2014.
  • I. Kapsouras and N. Nikolaidis, “Action recognition on motion capture data using a dynemes and forward differences representation,” Journal of Visual Communication and Image Representation, 2014.