Social Scene Understanding: End-to-End Multi-Person Action Localization and Collective Activity Recognition.

CVPR 2017

Abstract

We present a unified framework for understanding human social behaviors in raw image sequences. Our model jointly detects multiple individuals, infers their social actions, and estimates the collective actions with a single feed-forward pass through a neural network. We propose a single architecture that does not rely on external detection…

Introduction
  • Human social behavior can be characterized by “social actions” – individual acts that take into account the behavior of other individuals – and “collective actions” performed together by a group of people with a common objective.
  • Given a sequence of image frames, the method jointly locates and describes the social actions of each individual in a scene as well as the collective actions.
  • This perceived social scene representation can be used for sports analytics, understanding social behavior, surveillance, and social robot navigation.
  • Extracting features individually for each object discards a large amount of context and interaction information, which can be useful when reasoning about collective behaviors.
  • This point is important because the locations and actions of humans can be highly correlated.
  • The sequential approach does not scale well with the number of people in the scene, since it requires multiple runs for a single image.
Highlights
  • Human social behavior can be characterized by “social actions” – individual acts that take into account the behavior of other individuals – and “collective actions” performed together by a group of people with a common objective.
  • Recent methods for multi-person scene understanding take a sequential approach [21, 11, 31]: i) each person is detected in every given frame; ii) these detections are associated over time by a tracking algorithm; iii) a feature representation is extracted for each individual detection; and iv) these representations are joined via a structured model.
  • We propose a unified framework for social scene understanding by simultaneously solving three tasks in a single feed-forward pass through a neural network: multi-person detection, individual action recognition, and collective activity recognition.
  • We present a person-level matching Recurrent Neural Network (RNN) model to propagate information in the temporal domain without having access to the trajectories of individuals (a sketch of this matching scheme follows this list).
  • Our model achieves state-of-the-art results on challenging multi-person sequences, and outperforms existing approaches that rely on ground-truth annotations at test time.
  • The volleyball dataset consists of 55 volleyball games with 4830 labelled frames, where each player is annotated with a bounding box and one of 9 individual actions, and the whole scene is assigned one of 8 collective activity labels, which define which part of the game is happening.
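
As a concrete picture of the matching RNN above: detections in frame t are matched to detections in frame t-1 by nearest neighbour in a learned embedding space, and each matched detection inherits and updates a per-person recurrent state, so no explicit tracks are needed. The following numpy sketch illustrates the idea only; the vanilla tanh-RNN cell, the weight names, and the data layout are assumptions, not the paper's exact architecture.

```python
import numpy as np

def nn_match(prev_embed, curr_embed):
    """Index of each current detection's nearest previous detection
    in the matching embedding space (cf. the 'embed' strategy of Table 2)."""
    d = np.linalg.norm(curr_embed[:, None, :] - prev_embed[None, :, :], axis=-1)
    return d.argmin(axis=1)

def simple_rnn_step(h, x, W_h, W_x):
    """One vanilla tanh-RNN update; stands in for the paper's RNN cell."""
    return np.tanh(h @ W_h + x @ W_x)

def propagate(features, embeddings, W_h, W_x):
    """Propagate per-person state over T frames without explicit tracks.

    features[t]:   (n_t, d) per-detection features at frame t
    embeddings[t]: (n_t, e) per-detection matching embeddings
    """
    h = np.zeros_like(features[0])      # one hidden state per detection
    for t in range(len(features)):
        if t > 0:
            # re-index states so each current detection inherits the state
            # of its matched predecessor
            h = h[nn_match(embeddings[t - 1], embeddings[t])]
        h = simple_rnn_step(h, features[t], W_h, W_x)
    return h                            # temporally smoothed per-person features
```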
Methods
  • The authors' main goal is to construct comprehensive interpretations of social scenes from raw image sequences.
  • The authors first obtain a preliminary set of detection hypotheses, encoded as two dense maps $B^t \in \mathbb{R}^{|I| \times 4}$ and $P^t \in \mathbb{R}^{|I|}$ where, at each location $i \in I$, $B^t_i$ encodes the coordinates of a bounding box and $P^t_i$ is the probability that this bounding box represents a person (a decoding sketch follows this list).
  • Those detections are refined jointly by inference in a hybrid Markov Random Field (MRF).
  • One benefit of the detection method with respect to ReInspect is that the approach is not restricted to detection and can also be used for instance-level segmentation.
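
To make the dense-map encoding above concrete, here is a minimal sketch of how such maps could be decoded into a discrete set of detections. The map shapes follow the $B^t$ and $P^t$ definitions; the confidence threshold and the plain greedy NMS used here in place of the paper's hybrid MRF inference are illustrative assumptions.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def decode_dense_maps(B, P, conf_thresh=0.5, iou_thresh=0.3):
    """Turn dense proposal maps into a discrete set of detections.

    B: (|I|, 4) array; B[i] is the box regressed at grid location i.
    P: (|I|,)  array; P[i] is the probability that box i is a person.

    The paper refines the hypotheses jointly with a hybrid MRF; plain
    greedy NMS is substituted here purely for illustration.
    """
    keep = np.flatnonzero(P >= conf_thresh)   # drop low-confidence locations
    order = keep[np.argsort(-P[keep])]        # highest confidence first
    selected = []
    for i in order:
        if all(iou(B[i], B[j]) < iou_thresh for j in selected):
            selected.append(i)
    return B[selected], P[selected]
```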
Results
  • The authors report results on the task of multi-person scene understanding and compare them to multiple baselines (an accuracy-metric sketch follows this list).
  • The authors evaluate the framework on the recently introduced volleyball dataset [21], since it is the only publicly available dataset for multi-person activity recognition that is relatively large-scale and contains labels for people's locations as well as their collective and individual actions.
  • This dataset consists of 55 volleyball games with 4830 labelled frames, where each player is annotated with a bounding box and one of 9 individual actions, and the whole scene is assigned one of 8 collective activity labels, which define which part of the game is happening.
  • To get the ground-truth locations of people for the frames that lack annotations, the authors resort to the same appearance-based tracker as proposed by the authors of the dataset [21].
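
The reported metric is average accuracy for both label types (cf. Table 1 below), so evaluation reduces to two calls of one routine. A hedged sketch follows; whether the average is taken per class, as here, or over all samples is an assumption.

```python
import numpy as np

def average_accuracy(pred, gt, n_classes):
    """Per-class accuracy averaged over classes, so that frequent labels
    (e.g. 'standing') do not dominate the score."""
    accs = []
    for c in range(n_classes):
        mask = gt == c
        if mask.any():
            accs.append((pred[mask] == gt[mask]).mean())
    return float(np.mean(accs))

# The volleyball dataset has 8 collective activity labels and 9
# individual action labels:
# collective_acc = average_accuracy(pred_coll, gt_coll, n_classes=8)
# individual_acc = average_accuracy(pred_ind, gt_ind, n_classes=9)
```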
Conclusion
  • The authors have proposed a unified model for joint detection and activity recognition of multiple people.
  • The authors' approach does not require any external ground-truth detections or tracks, and demonstrates state-of-the-art performance on both multi-person scene understanding and detection datasets.
  • Future work will apply the proposed framework to explicitly capture and understand human interactions.
Tables
  • Table 1: Results on the volleyball dataset. We report average accuracy for collective activities and individual actions. For OURS-temporal, we report results with bounding-box (bbox) matching when using ground-truth bounding boxes (GT), and with embedding (embed) matching when using detections (MRF).
  • Table 2: Comparison of different matching strategies on the volleyball dataset. boxes corresponds to the nearest-neighbour (NN) match in the space of bounding-box coordinates, embed corresponds to the NN match in the embedding space e, and embed-soft is a soft matching in e (see the sketch after this list).
  • Table 3: Comparative results of detection schemes on the volleyball dataset. We report the average accuracy for collective and individual action recognition.
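
The three strategies in Table 2 differ only in the space where matching happens and in whether the match is hard or soft. Below is a sketch of that distinction; the softmax temperature in the soft variant is an assumed free parameter.

```python
import numpy as np

def match_hard(prev, curr):
    """Hard NN match: covers both 'boxes' (box-coordinate space) and
    'embed' (learned embedding space e); only the inputs differ."""
    d = np.linalg.norm(curr[:, None, :] - prev[None, :, :], axis=-1)
    return d.argmin(axis=1)                  # index of match per current item

def match_embed_soft(prev_e, curr_e, prev_h, temp=1.0):
    """'embed-soft': blend all previous states, weighted by a softmax
    over negative embedding distances (temp is an assumption)."""
    d = np.linalg.norm(curr_e[:, None, :] - prev_e[None, :, :], axis=-1)
    w = np.exp(-(d - d.min(axis=1, keepdims=True)) / temp)
    w /= w.sum(axis=1, keepdims=True)        # (n_curr, n_prev) soft assignment
    return w @ prev_h                        # soft-matched previous states
```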
Related work
  • The main focus of this work is creating a unified model that can simultaneously detect multiple individuals and recognize their individual social actions and collective behavior. In what follows, we give a short overview of the existing work on these tasks.

    Multi-object detection - There already exists a large body of research in the area of object detection. Most current methods rely either on a sliding-window approach [34, 45] or on an object-proposal mechanism [18, 33], followed by a CNN-based classifier. The vast majority of these state-of-the-art methods do not reason jointly about the presence of multiple objects, and rely on heuristic post-processing steps to get the final detections. A notable exception is the ReInspect [38] algorithm, which is specifically designed to handle multi-object scenarios by modeling the detection process in a sequential manner and employing a Hungarian loss to train the model end-to-end (sketched below). We approach this problem in a very different way, by doing probabilistic inference on top of a dense set of detection hypotheses, while also demonstrating state-of-the-art results on challenging crowded scenes. Another line of work that specifically focuses on joint multi-person detection [16, 44, 3] uses generative models; however, those methods require multiple views or depth maps and are not applicable in monocular settings.
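
ReInspect's Hungarian loss resolves the prediction-to-ground-truth assignment with bipartite matching before the loss is computed. A minimal sketch using scipy's assignment solver; the L1 box cost is an illustrative stand-in, not ReInspect's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_box_loss(pred_boxes, gt_boxes):
    """Match predictions to ground truth at minimum total cost, then
    sum the matched costs. pred_boxes: (n_pred, 4), gt_boxes: (n_gt, 4);
    recent scipy versions accept rectangular cost matrices."""
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return cost[rows, cols].sum()
```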
Funding
  • This work was supported in part by the Swiss National Science Foundation, Panasonic, Nissan (1188371-1-UDARQ), MURI (1186514-1-TBCJE), and ONR (1165419-10-TDAUZ).
References
  [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org, 2015.
  [2] M. R. Amer, P. Lei, and S. Todorovic. HiRF: Hierarchical random field for collective activity recognition in videos. In ECCV, pages 572–585, 2014.
  [3] T. Bagautdinov, F. Fleuret, and P. Fua. Probability occupancy maps for occluded depth images. In CVPR, pages 2829–2837, 2015.
  [4] P. Baqué, T. Bagautdinov, F. Fleuret, and P. Fua. Principled parallel mean-field inference for discrete random fields. In CVPR, 2016.
  [5] O. Barinova, V. Lempitsky, and P. Kohli. On detection of multiple object instances using Hough transforms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1773–1784, 2012.
  [6] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In ECCV, pages 215–230, 2012.
  [7] W. Choi and S. Savarese. Understanding collective activities of people from videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1242–1257, 2014.
  [8] W. Choi, K. Shahid, and S. Savarese. Learning context for collective activity recognition. In CVPR, pages 3273–3280, 2011.
  [9] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  [10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893, 2005.
  [11] Z. Deng, A. Vahdat, H. Hu, and G. Mori. Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. In CVPR, 2016.
  [12] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
  [13] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, pages 1110–1118, 2015.
  [14] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
  [15] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
  [16] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua. Multi-camera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):267–282, 2008.
  [17] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2188–2202, 2011.
  [18] R. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
  [19] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, pages 447–456, 2015.
  [20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  [21] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori. A hierarchical deep temporal model for group activity recognition. In CVPR, 2016.
  [22] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
  [23] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
  [24] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR, 2016.
  [25] S. Khamis, V. I. Morariu, and L. S. Davis. Combining per-frame and per-track cues for multi-person action recognition. In ECCV, pages 116–129, 2012.
  [26] S. Khamis, V. I. Morariu, and L. S. Davis. A flow model for joint action recognition and identity maintenance. In CVPR, pages 1218–1225, 2012.
  [27] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
  [29] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, pages 1–8, 2008.
  [30] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
  [31] V. Ramanathan, J. Huang, S. Abu-El-Haija, A. Gorban, K. Murphy, and L. Fei-Fei. Detecting events and key actors in multi-person videos. In CVPR, 2016.
  [32] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  [33] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  [34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
  [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  [36] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In CVPR, 2016.
  [37] S. Singh, C. Arora, and C. V. Jawahar. First person action recognition using deep learned descriptors. In CVPR, 2016.
  [38] R. Stewart, M. Andriluka, and A. Y. Ng. End-to-end people detection in crowded scenes. In CVPR, 2016.
  [39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
  [40] V. Veeriah, N. Zhuang, and G.-J. Qi. Differential recurrent neural networks for action recognition. In ICCV, pages 4041–4049, 2015.
  [41] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1):60–79, 2013.
  [42] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, pages 4305–4314, 2015.
  [43] D. Weinland, M. Özuysal, and P. Fua. Making action recognition robust to occlusions and viewpoint changes. In ECCV, 2010.
  [44] Z. Wu, A. Thangali, S. Sclaroff, and M. Betke. Coupling detection and data association for multiple object tracking. In CVPR, pages 1948–1955, 2012.
  [45] S. Zhang, R. Benenson, and B. Schiele. Filtered feature channels for pedestrian detection. In CVPR, pages 1751–1760, 2015.