HOI Analysis: Integrating and Decomposing Human-Object Interaction

NeurIPS 2020


Abstract

Human-Object Interaction (HOI) consists of human, object and implicit interaction/verb. Different from previous methods that directly map pixels to HOI semantics, we propose a novel perspective for HOI learning in an analytical manner. In analogy to Harmonic Analysis, whose goal is to study how to represent signals with the superposition of basic waves, we propose HOI Analysis: coherent HOI can be decomposed into isolated human and object, which can in turn be integrated into coherent HOI again, so that the implicit verb is represented in the transformation function space. An Integration-Decomposition Network (IDN) is introduced to implement these transformations and achieves state-of-the-art performance on HOI detection benchmarks.
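A compact way to state this paradigm, using the T_I(·)/T_D(·) notation that appears later in the Results section (the formulation below is our paraphrase of the text, not the authors' own equations): integration maps isolated human and object features to a coherent HOI representation, and decomposition inverts it,

$$
T_I(f_h, f_o) = f_{hoi}, \qquad T_D(f_{hoi}) = (f_h, f_o),
$$

so the implicit verb is carried not by any pixel region but by the learned transformation pair (T_I, T_D) in function space.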

Introduction
  • Human-Object Interaction (HOI) takes up most of human activities. As a composition, HOI consists of three parts: human, object, and the implicit interaction/verb.
  • The view of Gestalt psychology is usually summarized as one simple sentence: “The whole is more than the sum of its parts” [14].
  • This is in line with human perception.
  • E.g., certain brain regions, such as the posterior superior temporal sulcus, are responsible for integrating isolated human and object into coherent HOI [1].
Highlights
  • Human-Object Interaction (HOI) takes up most of human activities.
  • Except for the direct thinking that maps pixels to semantics, in this work we rethink HOI and explore two questions in a novel perspective (Fig. 1): First, as for the inner structure of HOI, how do isolated human and object compose HOI? Second, what is the relationship between two human-object pairs with the same HOI?
  • For V-COCO, we evaluate AP_role (24 actions with roles) on Scenario 1 (S1) and Scenario 2 (S2).
  • The Integration-Decomposition Network (IDN) is the first to achieve more than 20 mAP on all three Default sets without using additional information.
  • We propose a novel HOI learning paradigm named HOI Analysis, which is inspired by Harmonic Analysis
  • We propose a novel paradigm for Human-Object Interaction detection, which would promote human activity understanding
Methods
  • The authors can hardly depict which image region corresponds to the verb.
  • Instead of directly finding the interaction region and mapping it to semantics [4, 12, 27, 49, 19], the authors propose a novel learning paradigm, i.e., learning the verb representation via HOI Analysis (a minimal code sketch follows this list).
  • Representative baselines on HICO-DET (Default Full / Rare / Non-Rare mAP): InteractNet [13], with the COCO detector and a ResNet50-FPN backbone, reaches 9.94 / 7.16 / 10.77; GPNN [42], with the COCO detector and ResNet101, reaches 13.11 / 9.34 / 14.23.
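To make the paradigm concrete, here is a minimal PyTorch sketch of integration and decomposition as an autoencoder-style module. Everything here (module names, feature sizes, the MSE consistency loss) is an illustrative assumption, not the authors' released IDN code:

```python
# A minimal sketch of the integration/decomposition idea described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HOIAnalysisSketch(nn.Module):
    def __init__(self, feat_dim=2048, hoi_dim=1024, num_verbs=117):
        super().__init__()
        # T_I: integrate isolated human/object features into a coherent HOI feature.
        self.integrate = nn.Sequential(
            nn.Linear(2 * feat_dim, hoi_dim), nn.ReLU(),
            nn.Linear(hoi_dim, hoi_dim),
        )
        # T_D: decompose the HOI feature back into isolated human/object features.
        self.decompose = nn.Sequential(
            nn.Linear(hoi_dim, hoi_dim), nn.ReLU(),
            nn.Linear(hoi_dim, 2 * feat_dim),
        )
        # The verb is read off the integrated representation.
        self.classifier = nn.Linear(hoi_dim, num_verbs)

    def forward(self, f_h, f_o):
        f_hoi = self.integrate(torch.cat([f_h, f_o], dim=-1))      # T_I(f_h, f_o)
        f_h_rec, f_o_rec = self.decompose(f_hoi).chunk(2, dim=-1)  # T_D(f_hoi)
        return self.classifier(f_hoi), f_h_rec, f_o_rec

def consistency_loss(f_h, f_o, f_h_rec, f_o_rec):
    # Integration followed by decomposition should recover the isolated
    # features; this reconstruction term is what pushes the verb into the
    # transformation functions rather than into any particular pixel region.
    return F.mse_loss(f_h_rec, f_h) + F.mse_loss(f_o_rec, f_o)
```

Training would then balance verb classification on the integrated feature against this reconstruction consistency, which is the sense in which the verb lives in transformation function space.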
Results
  • With T_I(·) and T_D(·), IDN outperforms previous methods significantly and achieves 23.36 mAP on the Default Full set of HICO-DET [4] with the COCO detector.
  • The improvement on the Rare set shows that the dynamically learned interaction representation can greatly alleviate the data deficiency of rare HOIs. With the HICO-DET fine-tuned detector, IDN improves further, achieving more than 26 mAP and further confirming the effectiveness of the approach (a simplified sketch of the evaluation protocol follows this list).
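For readers unfamiliar with how these mAP numbers are produced, below is a simplified Python sketch of HICO-DET-style pair matching and average precision: a detection counts as a true positive only when both its human box and its object box overlap an unmatched ground-truth pair of the same HOI class with IoU of at least 0.5. Helper names and the AP computation are our simplification of the protocol, not the official evaluation code:

```python
# Simplified sketch of HICO-DET-style HOI detection evaluation.
import numpy as np

def iou(a, b):
    # Boxes as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def average_precision(dets, gts, thr=0.5):
    """dets: list of (score, human_box, object_box) for one HOI class;
    gts: list of (human_box, object_box) ground-truth pairs."""
    dets = sorted(dets, key=lambda d: -d[0])          # rank by confidence
    matched = [False] * len(gts)
    tp, fp = np.zeros(len(dets)), np.zeros(len(dets))
    for i, (_, hb, ob) in enumerate(dets):
        best, best_j = 0.0, -1
        for j, (gh, go) in enumerate(gts):
            if matched[j]:
                continue
            ov = min(iou(hb, gh), iou(ob, go))        # pair IoU: weaker of the two
            if ov > best:
                best, best_j = ov, j
        if best >= thr:
            tp[i], matched[best_j] = 1.0, True
        else:
            fp[i] = 1.0
    rec = np.cumsum(tp) / max(len(gts), 1)
    prec = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    # All-point interpolated AP: make precision monotone, then integrate.
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    ap, prev_r = 0.0, 0.0
    for p, r in zip(prec, rec):
        ap, prev_r = ap + p * (r - prev_r), r
    return ap
```

The reported mAP is then the mean of the per-class AP values over all HOI categories.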
Conclusion
  • The authors propose a novel HOI learning paradigm named HOI Analysis, which is inspired by Harmonic Analysis.
  • An Integration-Decomposition Network (IDN) is introduced to implement it.
  • With the integration and decomposition between the coherent HOI and isolated human and object, IDN can effectively learn the interaction representation in transformation function space and outperform the state-of-the-art on HOI detection with significant improvements.
  • The authors propose a novel paradigm for Human-Object Interaction detection, which would promote human activity understanding.
  • The authors will release the code and trained models to the community, as part of efforts to alleviate repeated training in future works.
Tables
  • Table 1: Results on HICO-DET [4]. "COCO" denotes the COCO pre-trained detector, "HICO-DET" means that the COCO detector is further fine-tuned on HICO-DET, and "GT" means ground-truth boxes. Superscript DRG or VCL indicates that the HICO-DET fine-tuned detector from DRG [11] or VCL [21] is used.
  • Table 2: Results on V-COCO [18].
  • Table 3: Ablation studies on HICO-DET [4].
Related work
  • Human-Object Interaction (HOI) detection [4, 18] is crucial for deeper scene understanding and can facilitate behavior and activity learning [15, 25, 46, 47, 37, 38, 44]. Recently, huge progress has been made in this field with the promotion of large-scale datasets [18, 4, 5, 15, 25] and deep learning. HOI has a long research history: most earlier methods [16, 17, 55, 54, 6, 7] adopted hand-crafted features, while with the renaissance of neural networks, recent works [8, 32, 27, 12, 45, 19, 49, 42, 4, 13, 41, 24, 28] leverage learned features in an end-to-end paradigm. HO-RCNN [4] utilized a multi-stream model to process human, object, and spatial patterns respectively, a design widely followed by subsequent works [12, 27, 49]. Differently, GPNN [42] adopted a graph model to address HOI learning for both images and videos.
Funding
  • Acknowledgments and Disclosure of Funding: This work is supported in part by the National Key R&D Program of China (No. 2017YFA0700800), the National Natural Science Foundation of China under Grant 61772332, Shanghai Qi Zhi Institute, and SHEITC (2018-RGZN-02046).
Study subjects and analysis
The decoder is structured symmetrically to the encoder. For HICO-DET [4], the AE is pre-trained for 4 epochs using SGD with a learning rate of 0.1 and momentum of 0.9, where each batch contains 45 positive and 360 negative pairs. The whole IDN (AE and transformation modules) is first trained without inter-pair transformation (IPT) for 20 epochs using SGD with a learning rate of 2e-2 and momentum of 0.9. Then IDN is fine-tuned with IPT for 30 epochs using SGD with a learning rate of 1e-3 and momentum of 0.9; each batch for the whole IDN contains 15 positive and 120 negative pairs. For V-COCO [18], the AE is first pre-trained for 60 epochs. A training-schedule sketch in code follows.
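As a rough illustration of that schedule, the sketch below wires the quoted hyper-parameters into a two-stage SGD loop. The `idn.loss(...)` interface, the `use_ipt` flag, and the data loaders are placeholders invented for illustration; only the epochs, learning rates, momentum, and batch compositions come from the text above:

```python
# Hypothetical two-stage training loop for the schedule quoted above.
import torch

def train_idn(idn, loader_no_ipt, loader_ipt):
    # Stage 1: train AE + transformation modules without inter-pair
    # transformation (IPT): 20 epochs, lr 2e-2, momentum 0.9; each batch is
    # assumed to hold 45 positive and 360 negative pairs.
    opt = torch.optim.SGD(idn.parameters(), lr=2e-2, momentum=0.9)
    for _ in range(20):
        for batch in loader_no_ipt:
            loss = idn.loss(batch, use_ipt=False)  # hypothetical loss API
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: fine-tune with IPT enabled: 30 epochs, lr 1e-3, momentum 0.9;
    # batches of 15 positive and 120 negative pairs.
    opt = torch.optim.SGD(idn.parameters(), lr=1e-3, momentum=0.9)
    for _ in range(30):
        for batch in loader_ipt:
            loss = idn.loss(batch, use_ipt=True)
            opt.zero_grad()
            loss.backward()
            opt.step()
```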

References
  • Christopher Baldassano, Diane M. Beck, and Li Fei-Fei. Human-object interactions are more than the sum of their parts. Cerebral Cortex, 2017.
  • Ankan Bansal, Sai Saketh Rambhatla, Abhinav Shrivastava, and Rama Chellappa. Detecting human-object interactions via functional generalization. In AAAI, 2020.
  • Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. Everybody dance now. In ICCV, 2019.
  • Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. Learning to detect human-object interactions. In WACV, 2018.
  • Yu-Wei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang, and Jia Deng. HICO: A benchmark for recognizing human-object interactions in images. In ICCV, 2015.
  • Vincent Delaitre, Josef Sivic, and Ivan Laptev. Learning person-object interactions for action recognition in still images. In NIPS, 2011.
  • Chaitanya Desai and Deva Ramanan. Detecting actions, poses, and objects with relational phraselets. In ECCV, 2012.
  • Hao-Shu Fang, Jinkun Cao, Yu-Wing Tai, and Cewu Lu. Pairwise body-part attention for recognizing human-object interactions. In ECCV, 2018.
  • Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.
  • Hao-Shu Fang, Yuanlu Xu, Wenguan Wang, Xiaobai Liu, and Song-Chun Zhu. Learning pose grammar to encode human body configuration for 3D pose estimation. In AAAI, 2018.
  • Chen Gao, Jiarui Xu, Yuliang Zou, and Jia-Bin Huang. DRG: Dual relation graph for human-object interaction detection. In ECCV, 2020.
  • Chen Gao, Yuliang Zou, and Jia-Bin Huang. iCAN: Instance-centric attention network for human-object interaction detection. In BMVC, 2018.
  • Georgia Gkioxari, Ross Girshick, Piotr Dollár, and Kaiming He. Detecting and recognizing human-object interactions. In CVPR, 2018.
  • E. B. Goldstein. Cognitive Psychology. Belmont, CA: Thomson Higher Education, 2008.
  • Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
  • Abhinav Gupta and Larry S. Davis. Objects in action: An approach for combining action understanding and object perception. In CVPR, 2007.
  • Abhinav Gupta, Aniruddha Kembhavi, and Larry S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. TPAMI, 2009.
  • Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.
  • Tanmay Gupta, Alexander Schwing, and Derek Hoiem. No-frills human-object interaction detection: Factorization, appearance and layout encodings, and training techniques. In ICCV, 2019.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Zhi Hou, Xiaojiang Peng, Yu Qiao, and Dacheng Tao. Visual compositional learning for human-object interaction detection. arXiv preprint arXiv:2007.12407, 2020.
  • Jiefeng Li, Can Wang, Wentao Liu, Chen Qian, and Cewu Lu. HMOR: Hierarchical multi-person ordinal relations for monocular multi-person 3D pose estimation. In ECCV, 2020.
  • Jiefeng Li, Can Wang, Hao Zhu, Yihuan Mao, Hao-Shu Fang, and Cewu Lu. CrowdPose: Efficient crowded scenes pose estimation and a new benchmark. In CVPR, 2019.
  • Yong-Lu Li, Xinpeng Liu, Han Lu, Shiyi Wang, Junqi Liu, Jiefeng Li, and Cewu Lu. Detailed 2D-3D joint representation for human-object interaction. In CVPR, 2020.
  • Yong-Lu Li, Liang Xu, Xinpeng Liu, Xijie Huang, Yue Xu, Shiyi Wang, Hao-Shu Fang, Ze Ma, Mingyang Chen, and Cewu Lu. PaStaNet: Toward human activity knowledge engine. In CVPR, 2020.
  • Yong-Lu Li, Yue Xu, Xiaohan Mao, and Cewu Lu. Symmetry and group in attribute-object compositions. In CVPR, 2020.
  • Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yanfeng Wang, and Cewu Lu. Transferable interactiveness knowledge for human-object interaction detection. In CVPR, 2019.
  • Yue Liao, Si Liu, Fei Wang, Yanjie Chen, and Jiashi Feng. PPDM: Parallel point detection and matching for real-time human-object interaction detection. In CVPR, 2020.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • Cewu Lu, Hao Su, Yong-Lu Li, Yongyi Lu, Li Yi, Chi-Keung Tang, and Leonidas J. Guibas. Beyond holistic object recognition: Enriching image understanding with part states. In CVPR, 2018.
  • Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.
  • Arun Mallya and Svetlana Lazebnik. Learning models for actions and person-object interactions with transfer to question answering. In ECCV, 2016.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Ishan Misra, Abhinav Gupta, and Martial Hebert. From red wine to red tomato: Composition with context. In CVPR, 2017.
  • Tushar Nagarajan and Kristen Grauman. Attributes as operators: Factorizing unseen attribute-object compositions. In ECCV, 2018.
  • Zhixiong Nan, Yang Liu, Nanning Zheng, and Song-Chun Zhu. Recognizing unseen attribute-object pair with generative model. In AAAI, 2019.
  • Bo Pang, Kaiwen Zha, Hanwen Cao, Jiajun Tang, Minghui Yu, and Cewu Lu. Complex sequential understanding through the awareness of spatial and temporal concepts. Nature Machine Intelligence, 2(5):245–253, 2020.
  • Bo Pang, Kaiwen Zha, Yifan Zhang, and Cewu Lu. Further understanding videos through adverbs: A new video task. In AAAI, 2020.
  • Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In CVPR, 2019.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
  • Julia Peyre, Ivan Laptev, Cordelia Schmid, and Josef Sivic. Detecting rare visual relations using analogies. In ICCV, 2019.
  • Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. Learning human-object interactions by graph parsing neural networks. In ECCV, 2018.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. FineGym: A hierarchical video dataset for fine-grained action understanding. In CVPR, 2020.
  • Liyue Shen, Serena Yeung, Judy Hoffman, Greg Mori, and Li Fei-Fei. Scaling human-object interaction recognition through zero-shot learning. In WACV, 2018.
  • Jianhua Sun, Qinhong Jiang, and Cewu Lu. Recursive social behavior graph for trajectory prediction. In CVPR, 2020.
  • Jiajun Tang, Jin Xia, Xinzhi Mu, Bo Pang, and Cewu Lu. Asynchronous interaction aggregation for action detection. In ECCV, 2020.
  • Oytun Ulutan, A S M Iftekhar, and B S Manjunath. VSGNet: Spatial attention network for detecting human object interactions using graph convolutions. In CVPR, 2020.
  • Bo Wan, Desen Zhou, Yongfei Liu, Rongjie Li, and Xuming He. Pose-aware multi-level feature network for human object interaction detection. In ICCV, 2019.
  • Tiancai Wang, Rao Muhammad Anwer, Muhammad Haris Khan, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao, and Jorma Laaksonen. Deep contextual attention for human-object interaction detection. In ICCV, 2019.
  • Tiancai Wang, Tong Yang, Martin Danelljan, Fahad Shahbaz Khan, Xiangyu Zhang, and Jian Sun. Learning human-object interaction detection using interaction points. In CVPR, 2020.
  • Xin Wang, Fisher Yu, Ruth Wang, Trevor Darrell, and Joseph E. Gonzalez. TAFE-Net: Task-aware feature embeddings for low shot learning. In CVPR, 2019.
  • Bingjie Xu, Yongkang Wong, Junnan Li, Qi Zhao, and Mohan S. Kankanhalli. Learning to detect human-object interactions with knowledge. In CVPR, 2019.
  • Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei. Human action recognition by learning bases of action attributes and parts. In ICCV, 2011.
  • Bangpeng Yao and Li Fei-Fei. Grouplet: A structured image representation for recognizing human and object interactions. In CVPR, 2010.
  • Penghao Zhou and Mingmin Chi. Relation parsing neural network for human-object interaction detection. In ICCV, 2019.
Authors
Xinpeng Liu
Xiaoqian Wu
Yizhuo Li