Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos

IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1045-1058, 2018.

DOI: https://doi.org/10.1109/TPAMI.2017.2691321

Abstract:

Single modality action recognition on RGB or depth sequences has been extensively explored recently. It is generally accepted that each of these two modalities has different strengths and limitations for the task of action recognition. Therefore, analysis of the RGB+D videos can help us to better study the complementary properties of these two modalities […]

Introduction
  • Recent development of range sensors has had an indisputable impact on research and applications of machine vision.
  • Depth sequences provide an exclusive modality of information about the 3D structure of the scene, which suits the problem of activity analysis [41, 47, 65, 73, 74, 79, 83].
  • This complements the textural and appearance information from RGB.
  • The authors' goal in this work is to analyze the multimodal RGB+D signals, identifying the strengths of the respective modalities by teasing out their shared and modality-specific components, and to utilize these components to improve the classification of human actions; a minimal sketch of such a shared-specific factorization follows below.
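A minimal, illustrative sketch of one shared-specific factorization layer, included only to make the idea concrete. This is not the authors' implementation: the layer sizes, tanh nonlinearities, averaging of the two shared projections, and the agreement term are assumptions of this sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpecificLayer(nn.Module):
    """One factorization layer: RGB and depth features are encoded into a
    shared component plus two modality-specific components, and each input
    is reconstructed from [shared, its own specific code]."""
    def __init__(self, dim_rgb, dim_depth, dim_shared, dim_specific):
        super().__init__()
        self.enc_shared_rgb = nn.Linear(dim_rgb, dim_shared)
        self.enc_shared_depth = nn.Linear(dim_depth, dim_shared)
        self.enc_spec_rgb = nn.Linear(dim_rgb, dim_specific)
        self.enc_spec_depth = nn.Linear(dim_depth, dim_specific)
        self.dec_rgb = nn.Linear(dim_shared + dim_specific, dim_rgb)
        self.dec_depth = nn.Linear(dim_shared + dim_specific, dim_depth)

    def forward(self, x_rgb, x_depth):
        # Shared component: projections from the two modalities are pushed
        # to agree, then averaged into a single shared code.
        s_rgb = torch.tanh(self.enc_shared_rgb(x_rgb))
        s_depth = torch.tanh(self.enc_shared_depth(x_depth))
        shared = 0.5 * (s_rgb + s_depth)
        # Modality-specific components keep whatever the shared code misses.
        z_rgb = torch.tanh(self.enc_spec_rgb(x_rgb))
        z_depth = torch.tanh(self.enc_spec_depth(x_depth))
        # Reconstruction ties shared + specific codes back to each input.
        rec_rgb = self.dec_rgb(torch.cat([shared, z_rgb], dim=1))
        rec_depth = self.dec_depth(torch.cat([shared, z_depth], dim=1))
        loss = (F.mse_loss(rec_rgb, x_rgb)
                + F.mse_loss(rec_depth, x_depth)
                + F.mse_loss(s_rgb, s_depth))  # agreement on the shared part
        return shared, z_rgb, z_depth, loss

# Toy usage with random per-video feature vectors (dimensions are made up).
layer = SharedSpecificLayer(dim_rgb=128, dim_depth=96, dim_shared=32, dim_specific=32)
shared, z_rgb, z_depth, loss = layer(torch.randn(4, 128), torch.randn(4, 96))
```

Layers like this could, in principle, be stacked by feeding the concatenated codes of one layer as the input features of the next, which is the hierarchical aspect the bullet above refers to.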
Highlights
  • Recent development of range sensors has had an indisputable impact on research and applications of machine vision.
  • Depth sequences provide an exclusive modality of information about the 3D structure of the scene, which suits the problem of activity analysis [41, 47, 65, 73, 74, 79, 83].
  • Three different recognition scenarios are defined on the Online RGBD Action dataset.
  • The first 8 actors are used for training and the second 8 actors for testing.
  • This paper presents a new deep learning framework for a hierarchical shared-specific component factorization (DSSCA), to analyze RGB+D features of human action
  • Compared to the canonical correlation analysis (CCA) plus reconstruction independent component analysis (RICA) baseline (CCA-RICA), SSCA reduces the error rate by more than 40%, which is a notable improvement; a worked example of this figure follows after this list.
  • Experimental results on five RGB+D action recognition datasets show the strength of our deep shared-specific component analysis and the proposed structured sparsity learning machine, achieving state-of-the-art performance on all the reported benchmarks.
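To make the 40% figure above concrete: it is presumably a relative error reduction, and the numbers below are hypothetical, chosen only to illustrate the arithmetic rather than taken from the paper.

```latex
% Hypothetical error rates: e_CCA-RICA = 10.0%, e_SSCA = 5.8% (not from the paper).
\text{relative error reduction}
  = \frac{e_{\text{CCA-RICA}} - e_{\text{SSCA}}}{e_{\text{CCA-RICA}}}
  = \frac{10.0\% - 5.8\%}{10.0\%}
  = 42\% > 40\%
```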
Methods
  • [Recovered from a flattened comparison table (cf. Table 10), listing prior methods on the 3D Action Pairs dataset: HoDG-RDF [46], Bag-of-FLPs [78], HON4D [41], SSFF [55], ToSP [59], RGGP [34], Actionlet [73], SNV [81], BHIM [22], DCSF+Joint [79], MMTW [74], HOPC [45], Depth Fusion [90], MMMP [54], DL-GSGC [36], JOULE-SVM [20], and Range-Sample [35], compared against the proposed DSSCA-SSLM, whose stacked local+holistic configuration reaches 100.0% accuracy.]
  • Each pair of action classes has almost the same set of body motions but in a different temporal order.
  • Each action class is captured from 10 subjects, each performing it 3 times.
  • Overall, this dataset includes 360 RGB+D video samples.
  • Due to the large number of training video samples in the NTU RGB+D dataset, evaluation of the kernel-based methods was not tractable, and the authors only reported the results for baseline method 1 and the DSSCA-SSLM framework, as provided in Tables 11 and 12.
  • Experimental results on five RGB+D action recognition datasets show the strength of the deep shared-specific component analysis and the proposed structured sparsity learning machine, achieving state-of-the-art performance on all the reported benchmarks.
Results
  • Online RGBD Action dataset, scenario S3, stacked local+holistic network: 82.0% (kernel combination) and 83.8% (SSLM).
  • Each actor performs the actions twice; overall, this dataset includes 336 RGB+D video samples.
  • The first and second scenarios are cross-subject tests: the first 8 actors are used for training and the second 8 for testing (a minimal sketch of this split follows after this list).
  • The samples of the second scenario are the same as the first one, but the training and testing samples are swapped.
  • The third scenario is a cross-environment evaluation.
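A minimal sketch of the cross-subject splits described above. The sample representation, actor IDs, and field names are hypothetical placeholders, not the dataset's actual format.

```python
# Toy stand-in for the video list: one dict per clip with its actor ID.
all_samples = [{"actor": a, "clip": c} for a in range(1, 17) for c in range(2)]

def cross_subject_split(samples, train_actors):
    """Cross-subject protocol: clips from `train_actors` form the training
    set, clips from all remaining actors form the test set."""
    train = [s for s in samples if s["actor"] in train_actors]
    test = [s for s in samples if s["actor"] not in train_actors]
    return train, test

# Scenario 1: first 8 actors for training, second 8 for testing.
train_s1, test_s1 = cross_subject_split(all_samples, set(range(1, 9)))
# Scenario 2: the same samples with training and testing swapped.
train_s2, test_s2 = test_s1, train_s1
```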
Conclusion
  • This paper presents a new deep learning framework for a hierarchical shared-specific component factorization (DSSCA), to analyze RGB+D features of human action.
Tables
  • Table1: Comparison of the results of our methods with the baselines on the Online RGBD Action dataset. S1, S2, and S3 refer to the three different scenarios of this dataset. The first column shows the performance of descriptor concatenation on all RGB+D input features; the second column reports the accuracy of the kernel combination on the same set of features; the third column shows the result of our correlation-independence analysis, which employs a kernel combination for classification; and the last column reports the accuracy of the proposed structured sparsity learning machine.
  • Table2: Performance comparison for holistic network, local network, and stacked local+holistic (Figure 3) networks on the Online RGBD Action dataset. Reported are the results of our method using kernel combination and SSLM.
  • Table3: Comparison with a correlation network (without modality-specific components) on the Online RGBD Action dataset, local network, scenario 3. Without the Z components, the network is limited to the shared ones and acts similarly to CCA.
  • Table4: Performance comparison of the proposed DSSCA with the state-of-the-art results on the Online RGBD Action dataset. The same-environment setup is the average of the S1 and S2 scenarios, and the cross-environment setup is the S3 scenario.
  • Table5: Comparison of the results of our methods with the baselines in MSR-DailyActivity3D dataset
  • Table6: Performance comparison for holistic network, local network, and stacked local+holistic (Figure 3) networks on the MSR-DailyActivity3D dataset. Reported are the results of our method using kernel combination and SSLM.
  • Table7: Performance comparison of the proposed multimodal DSSCA with the state-of-the-art methods on MSR-DailyActivity dataset
  • Table8: Comparison of the results of our methods with the baselines in 3D Action Pairs dataset
  • Table9: Performance comparison for holistic network, local network, and stacked local+holistic (Figure 3) networks on 3D Action Pairs dataset. Reported are the results of our method using kernel combination and SSLM
  • Table10: Performance comparison of proposed multimodal correlation-independence analysis with the state-of-the-art methods on 3D Action Pairs dataset
  • Table11: Comparison of the result of our method with the baseline for the cross-subject evaluation criteria of NTU RGB+D dataset
  • Table12: Performance comparison for holistic network, local network, and stacked local+holistic (Figure 3) networks on the cross-subject evaluation criteria of the NTU RGB+D dataset. Reported are the results of our method using SSLM.
  • Table13: Performance comparison of proposed multimodal correlation-independence analysis with the state-of-the-art methods on the cross-subject evaluation criteria of NTU RGB+D dataset
  • Table14: Comparison of the results of our methods with the baselines on the RGBD-HuDaAct dataset. The first column shows the performance of descriptor concatenation on all RGB+D input features; the second column reports the accuracy of the kernel combination on the same set of features; the third column shows the result of our correlation-independence analysis, which employs a kernel combination for classification; and the last column reports the accuracy of the proposed structured sparsity learning machine.
  • Table15: Performance comparison for holistic network, local network, and stacked local+holistic networks on RGBD-HuDaAct dataset. Reported are the results of our method using kernel combination and SSLM
  • Table16: Performance comparison on the RGBD-HuDaAct dataset. Classes: exit the room, make a phone call, get up from bed, go to bed, sit down, mop floor, stand up, eat meal, put on jacket, drink water, enter room, take off jacket, and background activity. The standard evaluation on this dataset is defined on a leave-one-subject-out cross-validation setting; in our experiments we follow the evaluation setup described in [39].
  • Table17: Comparison between our method and baseline method 2 on single modality RGB and depth based input features, on all the datasets
  • Table18: Proportion of the weights assigned to the factorized components by the SSLM classifier for the Online RGBD, MSR-DailyActivity3D, and 3D Action Pairs datasets. Reported values are the ℓ2 norms of all the weights corresponding to each of the components, learned by SSLM on the stacked local+holistic networks.
Related work
  • There are other works that have applied deep networks to multimodal learning. The works in [38, 60] used DBMs to find a common-space representation for two input modalities and to predict one modality from the other. Andrew et al. [1] proposed a deep canonical correlation analysis network with two stacks of deep embeddings followed by a CCA on the top layer. Our method differs from these works in two major aspects. First, the previous works performed the multimodal analysis in just one layer of the deep network, whereas our proposed method performs the common-component analysis in every single layer. Second, we incorporate modality-specific components in each layer to retain all the informative features. The sketch below contrasts a plain CCA baseline with this per-layer shared+specific factorization.
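For contrast, a minimal sketch of a plain linear CCA baseline using scikit-learn; the feature dimensions and sample counts are made up, and the point is only that CCA keeps the correlated (shared) directions while discarding everything modality-specific.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
x_rgb = rng.standard_normal((500, 128))   # stand-in per-video RGB features
x_depth = rng.standard_normal((500, 96))  # stand-in per-video depth features

# Linear CCA finds paired projections of the two views that are maximally
# correlated; these correspond to the "shared" part only.
cca = CCA(n_components=20)
shared_rgb, shared_depth = cca.fit_transform(x_rgb, x_depth)

# Anything orthogonal to these directions is simply dropped, which is the
# information a per-layer shared + modality-specific factorization keeps.
```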
Funding
  • The ROSE Lab is supported by the National Research Foundation, Singapore, under its Interactive Digital Media (IDM) Strategic Research Programme
Study subjects and analysis
challenging benchmark datasets: 5
Further, based on the structure of the features, a structured sparsity learning machine is proposed which utilizes mixed norms to apply regularization within components and group selection between them for better classification performance (a minimal sketch of such a mixed-norm penalty follows below). Our experimental results show the effectiveness of our cross-modality feature analysis framework by achieving state-of-the-art accuracy for action classification on five challenging benchmark datasets.
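A minimal sketch of a mixed-norm (ℓ2,1-style) penalty over component groups, assuming the classifier weight vector is partitioned by factorized component (for example shared, RGB-specific, and depth-specific blocks). The grouping, dimensions, and exact penalty form are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

def group_mixed_norm(w, groups):
    """Sum over groups of the l2 norm of that group's weights: within a
    group the weights are shrunk jointly (l2), while across groups the sum
    behaves like an l1 penalty and can switch whole components off."""
    return sum(np.linalg.norm(w[idx]) for idx in groups)

# Toy example: a 12-dimensional weight vector split into three components.
w = np.array([0.5, -0.2, 0.1, 0.0,    # shared component
              0.0,  0.0, 0.0, 0.0,    # RGB-specific component (switched off)
              0.9, -0.7, 0.3, 0.2])   # depth-specific component
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
penalty = group_mixed_norm(w, groups)  # the l2,1 value used as a regularizer
```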


subjects: 10
Each pair of action classes has almost the same set of body motions but in a different temporal order. Each action class is captured from 10 subjects, each performing it 3 times. Overall, this dataset includes 360 RGB+D video samples.

subjects: 5
Overall, this dataset includes 360 RGB+D video samples; the first five subjects are kept for testing and the others are used for training. Table 10 compares the accuracies of the proposed framework with the state-of-the-art methods reported on this benchmark. Our method ties with two recent works (MMMP [54] and BHIM [22]) in saturating the benchmark by achieving a flawless 100% accuracy on this dataset.

challenging benchmark datasets: 5
Human activity recognition is one of the active fields in computer vision and has been explored extensively; we evaluate the proposed cross-modality feature analysis framework on five challenging benchmark datasets.

datasets: 40
DSSCA-SSLM refers to the proposed structured sparsity learning machine applied on the hierarchically factorized components described in Section 4. It is worth mentioning that there are more than 40 datasets specifically for 3D human action recognition. The survey of Zhang et al. [85] provides thorough coverage of the current datasets and discusses their characteristics in different aspects, as well as the best performing methods for each dataset.

subjects: 5
Each action is performed by 10 actors, twice by each actor. The standard evaluation on this dataset is defined in a cross-subject setting: the first five subjects are used for training and the others for testing. Results of the experiments on this benchmark are reported in Tables 5 and 6.

pairs: 6
The 3D Action Pairs dataset [41] is a less challenging RGB+D dataset for action recognition. This dataset provides 6 pairs of action classes: pick up a box/put down a box, lift a box/place a box, push a chair/pull a chair, wear a hat/take off a hat, put on a backpack/take off a backpack, and stick a poster/remove a poster. Each pair of classes has almost the same set of body motions but in a different temporal order.

References
  • [1] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In ICML, 2013.
  • [2] F. R. Bach and M. I. Jordan. A probabilistic interpretation of canonical correlation analysis. 2005.
  • [3] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, et al. Greedy layer-wise training of deep networks. In NIPS, 2007.
  • [4] M. Borga. Canonical correlation: a tutorial. Online tutorial, 2001.
  • [5] A. Bosch, A. Zisserman, and X. Munoz. Representing shape with a spatial pyramid kernel. In ACM CIVR, 2007.
  • [6] Z. Cai, L. Wang, X. Peng, and Y. Qiao. Multi-view super vector for action recognition. In CVPR, 2014.
  • [7] S. Chatzis. Infinite Markov-switching maximum entropy discrimination machines. In ICML, 2013.
  • [8] W. Ding, K. Liu, F. Cheng, and J. Zhang. Learning hierarchical spatio-temporal pattern for human activity prediction. JVCIP, 2016.
  • [9] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
  • [10] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
  • [11] G. Evangelidis, G. Singh, and R. Horaud. Skeletal quads: Human action recognition using joint quadruples. In ICPR, 2014.
  • [12] C. Gan, N. Wang, Y. Yang, D.-Y. Yeung, and A. G. Hauptmann. DevNet: A deep event network for multimedia event detection and evidence recounting. In CVPR, 2015.
  • [13] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, and G. Wang. Recent advances in convolutional neural networks. arXiv, 2015.
  • [14] F. Han, B. Reily, W. Hoff, and H. Zhang. Space-time representation of people based on 3D skeletal data: A review. arXiv, 2016.
  • [15] J. Han, L. Shao, D. Xu, and J. Shotton. Enhanced computer vision with Microsoft Kinect sensor: A review. IEEE Transactions on Cybernetics, 2013.
  • [16] D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 2004.
  • [17] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
  • [18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
  • [19] H. Hotelling. Relations between two sets of variates. Biometrika, 1936.
  • [20] J.-F. Hu, W.-S. Zheng, J. Lai, and J. Zhang. Jointly learning heterogeneous features for RGB-D activity recognition. In CVPR, 2015.
  • [21] Y. Jia, M. Salzmann, and T. Darrell. Factorized latent spaces with structured sparsity. In NIPS, 2010.
  • [22] Y. Kong and Y. Fu. Bilinear heterogeneous information machine for RGB-D action recognition. In CVPR, 2015.
  • [23] D. Kosmopoulos, P. Doliotis, V. Athitsos, and I. Maglogiannis. Fusion of color and depth video for human behavior recognition in an assistive environment. In Distributed, Ambient, and Pervasive Interactions, Lecture Notes in Computer Science, 2013.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [25] P. L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. IJNS, 2000.
  • [26] I. Laptev. On space-time interest points. IJCV, 2005.
  • [27] Q. Le, W. Zou, S. Yeung, and A. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
  • [28] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, 2011.
  • [29] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. In NIPS, 2008.
  • [30] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
  • [31] H. Liu, M. Yuan, and F. Sun. RGB-D action recognition using linear coding. Neurocomputing, 2015.
  • [32] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates.
  • [33] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal LSTM with trust gates for 3D human action recognition. In ECCV, 2016.
  • [34] L. Liu and L. Shao. Learning discriminative representations from RGB-D video data. In IJCAI, 2013.
  • [35] C. Lu, J. Jia, and C.-K. Tang. Range-sample depth feature for action recognition. In CVPR, 2014.
  • [36] J. Luo, W. Wang, and H. Qi. Group sparsity and geometry constrained dictionary learning for action recognition from depth maps. In ICCV, 2013.
  • [37] M. Meng, H. Drira, M. Daoudi, and J. Boonaert. Human-object interaction recognition by learning the distances between the object and the skeleton joints. In FG, 2015.
  • [38] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, 2011.
  • [39] B. Ni, G. Wang, and P. Moulin. RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In ICCV Workshops, 2011.
  • [40] E. Ohn-Bar and M. Trivedi. Joint angles similarities and HOG2 for action recognition. In CVPR Workshops, 2013.
  • [41] O. Oreifej and Z. Liu. HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In CVPR, 2013.
  • [42] X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. arXiv, abs/1405.4506, 2014.
  • [43] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
  • [44] H. Rahmani, A. Mahmood, D. Huynh, and A. Mian. Action classification with locality-constrained linear coding. In ICPR, 2014.
  • [45] H. Rahmani, A. Mahmood, D. Huynh, and A. Mian. Histogram of oriented principal components for cross-view action recognition. TPAMI, 2016.
  • [46] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian. Real time action recognition using histograms of depth gradients and random decision forests. In WACV, 2014.
  • [47] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian. HOPC: Histogram of oriented principal components of 3D pointclouds for action recognition. In ECCV, 2014.
  • [48] H. Rahmani and A. Mian. Learning a non-linear knowledge transfer model for cross-view action recognition. In CVPR, 2015.
  • [49] H. Rahmani and A. Mian. 3D action recognition from novel viewpoints. In CVPR, 2016.
  • [50] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS, 2007.
  • [51] M. Salzmann, C. H. Ek, R. Urtasun, and T. Darrell. Factorized orthogonal latent spaces. In AISTATS, 2010.
  • [52] M. Schmidt. minFunc, 2005.
  • [53] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. NTU RGB+D: A large scale dataset for 3D human activity analysis. In CVPR, 2016.
  • [54] A. Shahroudy, T.-T. Ng, Q. Yang, and G. Wang. Multimodal multipart learning for action recognition in depth videos. TPAMI, 2016.
  • [55] A. Shahroudy, G. Wang, and T.-T. Ng. Multi-modal feature fusion for action recognition in RGB-D sequences. In ISCCSP, 2014.
  • [56] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [57] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv, 2014.
  • [58] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
  • [59] Y. Song, S. Liu, and J. Tang. Describing trajectory of surface patch for human action recognition on RGB and depth videos. SPL, 2015.
  • [60] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. JMLR, 2014.
  • [61] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [62] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  • [63] J.-S. Tsai, Y.-P. Hsu, C. Liu, and L.-C. Fu. An efficient part-based approach to action recognition from RGB-D video with BoW-pyramid representation. In IROS, 2013.
  • [64] V. Veeriah, N. Zhuang, and G.-J. Qi. Differential recurrent neural networks for action recognition. In ICCV, 2015.
  • [65] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3D skeletons as points in a Lie group. In CVPR, 2014.
  • [66] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 2013.
  • [67] H. Wang, F. Nie, W. Cai, and H. Huang. Semi-supervised robust dictionary learning via efficient l2,0+-norms minimization. In ICCV, 2013.
  • [68] H. Wang, F. Nie, and H. Huang. Multi-view clustering and feature learning via structured sparsity. In ICML, 2013.
  • [69] H. Wang, F. Nie, and H. Huang. Robust and discriminative self-taught learning. In ICML, 2013.
  • [70] H. Wang, F. Nie, H. Huang, and C. Ding. Heterogeneous visual features fusion via sparse multimodal machine. In CVPR, 2013.
  • [71] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
  • [72] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In CVPR, 2012.
  • [73] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Learning actionlet ensemble for 3D human action recognition. TPAMI, 2014.
  • [74] J. Wang and Y. Wu. Learning maximum margin temporal warping for action recognition. In ICCV, 2013.
  • [75] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
  • [76] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
  • [77] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. Ogunbona. Action recognition from depth maps using deep convolutional neural networks. THMS, 2015.
  • [78] P. Wang, W. Li, P. Ogunbona, Z. Gao, and H. Zhang. Mining mid-level features for action recognition based on effective skeleton representation. In DICTA, 2014.
  • [79] L. Xia and J. Aggarwal. Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In CVPR, 2013.
  • [80] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
  • [81] X. Yang and Y. Tian. Super normal vector for activity recognition using depth sequences. In CVPR, 2014.
  • [82] X. Yang, C. Zhang, and Y. Tian. Recognizing actions using depth motion maps-based histograms of oriented gradients. In ACM MM, 2012.
  • [83] G. Yu, Z. Liu, and J. Yuan. Discriminative orderlet mining for real-time recognition of human-object interaction. In ACCV, 2014.
  • [84] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [85] J. Zhang, W. Li, P. O. Ogunbona, P. Wang, and C. Tang. RGBD-based action recognition datasets: A survey. arXiv, 2016.
  • [86] Z. Zhang. Microsoft Kinect sensor and its effect. IEEE MultiMedia, 2012.
  • [87] Y. Zhao, Z. Liu, L. Yang, and H. Cheng. Combing RGB and depth map features for human activity recognition. In APSIPA ASC, 2012.
  • [88] Z. Y. Zhao Runlin. Depth induced feature representation for 4D human activity recognition. Computer Modelling & New Technologies, 2014.
  • [89] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In AAAI, 2016.
  • [90] Y. Zhu, W. Chen, and G. Guo. Fusing multiple features for depth-based action recognition. ACM TIST, 2015.