HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
arXiv (2024)
Abstract
Large Vision Language Models (VLMs) are now the de facto state-of-the-art for
a number of tasks including visual question answering, recognising objects, and
spatial referral. In this work, we propose the HOI-Ref task for egocentric
images that aims to understand interactions between hands and objects using
VLMs. To enable HOI-Ref, we curate the HOI-QA dataset that consists of 3.9M
question-answer pairs for training and evaluating VLMs. HOI-QA includes
questions relating to locating hands, objects, and critically their
interactions (e.g. referring to the object being manipulated by the hand). We
train the first VLM for HOI-Ref on this dataset and call it VLM4HOI. Our
results demonstrate that VLMs trained for referral on third-person images fail
to recognise and refer to hands and objects in egocentric images. When fine-tuned
on our egocentric HOI-QA dataset, performance improves by 27.9% for referring
hands and objects, and by 26.7% for referring interacting objects.
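As an illustration of the referral format the abstract describes, the sketch below constructs one hypothetical interaction-referral QA pair and scores a predicted bounding box against it with an IoU check. The field names, image path, coordinate convention, and the 0.5 acceptance threshold are assumptions made for this example, not details from the HOI-QA release.

```python
# Hypothetical HOI-QA style referral pair and a box-accuracy check.
# The record fields and the [x1, y1, x2, y2] pixel-box convention are
# assumptions for illustration, not the dataset's released schema.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# One interaction-referral question: given the hand, locate the object
# it is manipulating, answering with a bounding box.
sample = {
    "image": "kitchen_frame_000123.jpg",  # hypothetical egocentric frame
    "question": "What is the object being manipulated by the right hand?",
    "answer_box": [412, 230, 540, 355],   # ground-truth object box
}

predicted_box = [405, 224, 548, 360]      # a model's output box

# A common convention is to count a prediction as correct at IoU >= 0.5.
score = iou(predicted_box, sample["answer_box"])
print(f"IoU = {score:.3f}, correct = {score >= 0.5}")
```

Scoring referral answers by box overlap rather than exact string match is the usual choice for this kind of task, since the answer is a spatial region rather than free text.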