Task-Oriented Multi-Modal Question Answering For Collaborative Applications

2020 IEEE International Conference on Image Processing (ICIP), 2020

Abstract
Cobots that work in human workspaces and adapt to human needs must understand and respond to human inquiries and instructions. In this paper, we propose a new question answering (QA) task and dataset for human-robot collaboration in task-oriented operation, i.e., task-oriented collaborative QA (TC-QA). Unlike conventional video QA, which answers questions about what happened in video clips constrained by scripts and subtitles, TC-QA aims to establish common ground for task-oriented operation through question answering. We propose an open-ended (OE) answer format comprising a text reply, an image with annotations of the related objects, and a video marking the operation duration to guide operation execution. Designed for grounding, the TC-QA dataset comprises query videos and questions that seek acknowledgement, correction, attention to task-related objects, and information on objects or operations. Because real-world tasks are highly variable and training samples are limited, we propose and evaluate a baseline method based on a hybrid approach: deep learning methods for object detection, hand detection, and gesture recognition, combined with symbolic reasoning that grounds questions on these observations to produce the answer. Our experiments show that the hybrid method is effective for the TC-QA task.
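The hybrid pipeline described above, where symbolic reasoning grounds a question on the outputs of learned perception modules, can be illustrated with a minimal sketch. All names here (`Detection`, `ground_question`, the answer dictionary layout) are hypothetical and not taken from the paper; the sketch only shows the general idea of matching question terms against detector output to assemble an open-ended answer.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One perception result, e.g. from an object or hand detector."""
    label: str
    box: tuple  # bounding box (x, y, w, h) in pixels

def ground_question(question: str, detections: list) -> dict:
    """Symbolically ground a question on perception output.

    Matches object labels from the detector against words in the
    question and returns an open-ended answer: a text reply plus the
    bounding boxes of the related objects (which a real system would
    render as an annotated image).
    """
    q = question.lower()
    mentioned = [d for d in detections if d.label in q]
    if mentioned:
        reply = "Yes, I see: " + ", ".join(d.label for d in mentioned)
    else:
        reply = "I cannot find the object you mentioned."
    return {"text": reply, "objects": [d.box for d in mentioned]}
```

A usage example: given detections for a screwdriver and a bolt, the question "Where is the screwdriver?" grounds to the screwdriver's box, while a question about an unseen object yields a corrective text reply with no annotations.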
Keywords
question answering, multi-modal grounding, human-robot collaboration, hybrid system, corpora