Turbo Learning Framework for Human-Object Interactions Recognition and Human Pose Estimation

AAAI, 2019.


Abstract:

Human-object interactions (HOI) recognition and pose estimation are two closely related tasks. Human pose is an essential cue for recognizing actions and localizing the interacted objects. Meanwhile, human actions and the localizations of their interacted objects provide guidance for pose estimation. In this paper, we propose a turbo learning framework to perform HOI recognition and pose estimation simultaneously.

Introduction
Highlights
  • Human-object interactions (HOI) recognition (Gkioxari et al. 2017; Gupta, Kembhavi, and Davis 2009; Yao and Fei-Fei 2010; Chen and Grauman 2014) aims to detect and recognize triplets of the form <human, action, object> from a single image
  • Instead of considering only the human appearance feature, we argue that human pose estimation provides a detailed analysis of human structure and can supply a robust pose prior for HOI recognition
  • To exploit this flow of complementary information iteratively, we introduce a turbo learning framework, which can be unrolled over time steps into a sequence of pose-aware HOI recognition modules and HOI-guided pose estimation modules
  • Pose estimation with vs. without HOI recognition: to demonstrate that HOI recognition can guide the keypoint distribution, we evaluate a variant of our method that removes the HOI recognition branch, so that pose estimation relies only on the image features extracted by the RoIAlign layer
  • As these two tasks can provide guidance to each other, we introduce two novel modules, a pose-aware HOI recognition module and an HOI-guided pose estimation module, in which each task's features form part of the input to the other task (see the sketch after this list)
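
A minimal PyTorch-style sketch of this coupling. The class names mirror the paper's module terminology, but the MLP layers, the feature sizes (1024-d appearance, 256-d pose), and the 26-way action output (as in V-COCO) are illustrative assumptions, not the paper's actual architecture:

    import torch
    import torch.nn as nn

    class PoseAwareHOIRecognition(nn.Module):
        """Scores actions from appearance features concatenated with pose features."""
        def __init__(self, appearance_dim=1024, pose_dim=256, num_actions=26):
            super().__init__()
            self.classifier = nn.Sequential(
                nn.Linear(appearance_dim + pose_dim, 512),
                nn.ReLU(inplace=True),
                nn.Linear(512, num_actions),
            )

        def forward(self, appearance_feat, pose_feat):
            # The pose estimate serves as a structural prior for action recognition.
            return self.classifier(torch.cat([appearance_feat, pose_feat], dim=1))

    class HOIGuidedPoseEstimation(nn.Module):
        """Refines pose features using the current HOI scores as guidance."""
        def __init__(self, appearance_dim=1024, num_actions=26, pose_dim=256):
            super().__init__()
            self.refiner = nn.Sequential(
                nn.Linear(appearance_dim + num_actions, 512),
                nn.ReLU(inplace=True),
                nn.Linear(512, pose_dim),
            )

        def forward(self, appearance_feat, hoi_scores):
            # The recognized action constrains the plausible keypoint layout.
            return self.refiner(torch.cat([appearance_feat, hoi_scores], dim=1))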
Methods
  • An overview of the framework is illustrated in Fig 2(a).
  • After the stem network extracts the human appearance features, the pose-aware HOI recognition module predicts HOI from both the human appearance features and the human pose features.
  • The proposed HOI-guided pose estimation module then updates the human pose result based on the HOI recognition result.
  • The two modules form a closed loop to gradually improve the results of both pose estimation and HOI recognition.
  • The closed loop can be unrolled over time steps, as shown in Fig 2(b) and sketched below
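
Building on the two module sketches above, one way the closed loop of Fig 2(b) could be unrolled for a fixed number of time steps; the learned initial pose features and the default of three iterations are assumptions for illustration:

    import torch
    import torch.nn as nn

    class TurboLearning(nn.Module):
        """Unrolls the pose/HOI closed loop for a fixed number of time steps."""
        def __init__(self, hoi_module, pose_module, pose_dim=256, steps=3):
            super().__init__()
            self.hoi_module = hoi_module    # pose-aware HOI recognition
            self.pose_module = pose_module  # HOI-guided pose estimation
            self.steps = steps
            # Learned initial pose features for the first iteration (assumption).
            self.init_pose = nn.Parameter(torch.zeros(1, pose_dim))

        def forward(self, appearance_feat):
            pose_feat = self.init_pose.expand(appearance_feat.size(0), -1)
            for _ in range(self.steps):
                # Each iteration feeds one task's output into the other task.
                hoi_scores = self.hoi_module(appearance_feat, pose_feat)
                pose_feat = self.pose_module(appearance_feat, hoi_scores)
            return hoi_scores, pose_feat

    # Example: a batch of 8 hypothetical 1024-d RoIAlign features.
    model = TurboLearning(PoseAwareHOIRecognition(), HOIGuidedPoseEstimation())
    hoi_scores, pose_feat = model(torch.randn(8, 1024))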
Results
  • The proposed method achieves state-of-the-art performance on two public benchmarks, Verbs in COCO (V-COCO) (Gupta and Malik 2015) and HICO-DET (Chao et al. 2017)
  • As shown in Tables 1 and 2, it reaches state-of-the-art results on both HICO-DET and V-COCO; the true-positive rule both benchmarks use for scoring is sketched below
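
For context on what these tables measure: both benchmarks count a detected <human, action, object> triplet as a true positive only when the action class is correct and both predicted boxes overlap their ground truth with IoU of at least 0.5, then report mean average precision. A minimal sketch of that matching rule (the dict keys and helper names are illustrative):

    def iou(box_a, box_b):
        """Intersection over union for boxes in (x1, y1, x2, y2) format."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def is_true_positive(pred, gt, thresh=0.5):
        """A predicted triplet matches ground truth when the action agrees and
        both the human and object boxes pass the IoU threshold."""
        return (pred["action"] == gt["action"]
                and iou(pred["human_box"], gt["human_box"]) >= thresh
                and iou(pred["object_box"], gt["object_box"]) >= thresh)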
Conclusion
  • The authors propose a turbo learning method to perform both HOI recognition and pose estimation
  • As these two tasks can provide guidance to each other, the authors introduce two novel modules, a pose-aware HOI recognition module and an HOI-guided pose estimation module, in which each task's features form part of the input to the other task.
  • The proposed method has achieved state-of-the-art performance on the V-COCO and HICO-DET datasets
Summary
  • Most existing methods treat HOI recognition and pose estimation as separate tasks, which ignores the information reciprocity between the two tasks
Tables
  • Table 1: Results on the HICO-DET test set
  • Table 2: Results on the V-COCO test set
Funding
  • This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61332007 and 61621136008
References
  • Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Chao, Y.-W.; Wang, Z.; He, Y.; Wang, J.; and Deng, J. 2015. HICO: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, 1017–1025.
  • Chao, Y.-W.; Liu, Y.; Liu, X.; Zeng, H.; and Deng, J. 2017. Learning to detect human-object interactions. arXiv preprint arXiv:1702.05448.
  • Chen, C.-Y., and Grauman, K. 2014. Predicting the location of "interactees" in novel human-object interactions. In Proceedings of the Asian Conference on Computer Vision, 351–367. Springer.
  • Delaitre, V.; Laptev, I.; and Sivic, J. 2010. Recognizing human actions in still images: a study of bag-of-features and part-based representations. In Proceedings of the British Machine Vision Conference, 1–11.
  • Desai, C., and Ramanan, D. 2012. Detecting actions, poses, and objects with relational phraselets. In Proceedings of the European Conference on Computer Vision, 158–172. Springer.
  • Fang, H.; Xie, S.; Tai, Y.-W.; and Lu, C. 2017. RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision.
  • Gkioxari, G.; Girshick, R.; Dollár, P.; and He, K. 2017. Detecting and recognizing human-object interactions. arXiv preprint arXiv:1704.07333.
  • Gupta, S., and Malik, J. 2015. Visual semantic role labeling. arXiv preprint arXiv:1505.04474.
  • Gupta, A.; Kembhavi, A.; and Davis, L. S. 2009. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:1775–1789.
  • He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
  • He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2980–2988.
  • Herath, S.; Harandi, M.; and Porikli, F. 2017. Going deeper into action recognition: A survey. Image and Vision Computing, 60:4–21.
  • Hu, J.-F.; Zheng, W.-S.; Lai, J.; Gong, S.; and Xiang, T. 2013. Recognising human-object interaction via exemplar based modelling. In Proceedings of the IEEE International Conference on Computer Vision, 3144–3151.
  • Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, 740–755. Springer.
  • Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Luvizon, D. C.; Picard, D.; and Tabia, H. 2018. 2D/3D pose estimation and action recognition using multitask deep learning. arXiv preprint arXiv:1802.09232.
  • Luvizon, D. C.; Tabia, H.; and Picard, D. 2017. Human pose regression by combining indirect part detection and contextual information. arXiv preprint arXiv:1710.02322.
  • Maji, S.; Bourdev, L.; and Malik, J. 2011. Action recognition from a distributed representation of pose and appearance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3177–3184.
  • Mallya, A., and Lazebnik, S. 2016. Learning models for actions and person-object interactions with transfer to question answering. In Proceedings of the European Conference on Computer Vision, 414–428. Springer.
  • Newell, A.; Huang, Z.; and Deng, J. 2017. Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, 2277–2287.
  • Ning, G.; Zhang, Z.; and He, Z. 2017. Knowledge-guided deep fractal neural networks for human pose estimation. IEEE Transactions on Multimedia.
  • Pishchulin, L.; Andriluka, M.; and Schiele, B. 2014. Fine-grained activity recognition with holistic and pose based features. In Proceedings of the German Conference on Pattern Recognition, 678–689. Springer.
  • Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91–99.
  • Shen, L.; Yeung, S.; Hoffman, J.; Mori, G.; and Fei-Fei, L. 2018. Scaling human-object interaction recognition through zero-shot learning. In IEEE Winter Conference on Applications of Computer Vision, 1568–1576.
  • Wei, S.-E.; Ramakrishna, V.; Kanade, T.; and Sheikh, Y. 2016. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4724–4732.
  • Xiaohan Nie, B.; Xiong, C.; and Zhu, S.-C. 2015. Joint action recognition and pose estimation from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1293–1301.
  • Yao, B., and Fei-Fei, L. 2010. Modeling mutual context of object and human pose in human-object interaction activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 17–24.