ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

CVPR (2020)

Citations: 594 | Views: 420
Abstract
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. Long composition rollouts with non-reversible state changes are among the phenomena we include to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker." and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model designed for recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
Keywords
natural language directives, interactive visual environments, research benchmarks, non-reversible state changes, compositional tasks, household tasks, egocentric vision, natural language instructions, Realistic Environments, Action Learning, grounded instructions, visual language understanding models, action space, ALFRED tasks, coffee maker, low-level language instructions