Human Object Interaction Detection via Multi-level Conditioned Network

ICMR '20: International Conference on Multimedia Retrieval, Dublin, Ireland, June 2020

Abstract
As one of the essential problems in scene understanding, human object interaction detection (HOID) aims to recognize fine-grained, object-specific human actions, which demands capabilities in both visual perception and reasoning. Existing methods based on convolutional neural networks (CNNs) utilize diverse visual features for HOID, but these features alone are insufficient for understanding complex human object interactions. To enhance the reasoning capability of CNNs, we propose a novel multi-level conditioned network that fuses extra spatial-semantic knowledge with visual features. Specifically, we construct a multi-branch CNN as the backbone for multi-level visual representation. We then encode extra knowledge, including human body structure and object context, as conditions that dynamically influence the feature extraction of the CNN through affine transformation and attention mechanisms. Finally, we fuse the modulated multimodal features to distinguish the interactions. The proposed method is evaluated on the two most frequently used benchmarks, HICO-DET and V-COCO. The experimental results show that our method is superior to state-of-the-art methods.
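The abstract names two conditioning operations, affine transformation and attention, without giving their exact form. The following is a minimal NumPy sketch of one common way such conditioning can be written, not the authors' implementation. All names here (`affine_condition`, `attention_condition`, the weight matrices, and the tensor sizes) are hypothetical, and the linear projections and softmax spatial attention are assumptions chosen for illustration.

```python
import numpy as np


def affine_condition(features, condition, w_gamma, w_beta):
    """Scale and shift each feature channel with parameters predicted
    from the condition vector (assumed form of affine conditioning)."""
    gamma = condition @ w_gamma  # per-channel scale, shape (C,)
    beta = condition @ w_beta    # per-channel shift, shape (C,)
    return features * gamma[:, None, None] + beta[:, None, None]


def attention_condition(features, condition, w_query):
    """Reweight spatial locations by a softmax over the similarity
    between a condition-derived query and each location's features."""
    channels = features.shape[0]
    query = condition @ w_query                               # shape (C,)
    scores = np.tensordot(query, features, axes=([0], [0]))   # shape (H, W)
    scores = scores / np.sqrt(channels)
    weights = np.exp(scores - scores.max())                   # stable softmax
    weights = weights / weights.sum()
    return features * weights[None, :, :], weights


# Toy sizes, hypothetical: C channels, H x W feature map, D-dim condition.
rng = np.random.default_rng(0)
C, H, W, D = 4, 3, 3, 5
features = rng.normal(size=(C, H, W))   # stand-in for CNN visual features
condition = rng.normal(size=D)          # stand-in for encoded extra knowledge
w_gamma = rng.normal(size=(D, C))
w_beta = rng.normal(size=(D, C))
w_query = rng.normal(size=(D, C))

modulated = affine_condition(features, condition, w_gamma, w_beta)
attended, weights = attention_condition(modulated, condition, w_query)
fused = attended.sum(axis=(1, 2))       # attention-pooled feature, shape (C,)
```

In this sketch the condition vector stands in for the encoded body-structure or object-context knowledge, and the pooled vector `fused` stands in for a modulated feature that a later fusion stage could consume.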