Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation
arXiv (2024)
Abstract
Vision-and-Language Navigation in Continuous Environments (VLN-CE) is one of
the most intuitive yet challenging embodied AI tasks. Agents are tasked to
navigate towards a target goal by executing a set of low-level actions,
following a series of natural language instructions. All VLN-CE methods in the
literature assume that language instructions are exact. However, in practice,
instructions given by humans can contain errors when describing a spatial
environment due to inaccurate memory or confusion. Current VLN-CE benchmarks do
not address this scenario, making the state-of-the-art methods in VLN-CE
fragile in the presence of erroneous instructions from human users. For the
first time, we propose a novel benchmark dataset that introduces various types
of instruction errors considering potential human causes. This benchmark
provides valuable insight into the robustness of VLN systems in continuous
environments. We observe a noticeable performance drop (up to -25
Rate when evaluating the state-of-the-art VLN-CE methods on our benchmark.
Moreover, we formally define the task of Instruction Error Detection and
Localization, and establish an evaluation protocol on top of our benchmark
dataset. We also propose an effective method, based on a cross-modal
transformer architecture, that achieves the best performance in error detection
and localization, compared to baselines. Surprisingly, our proposed method has
revealed errors in the validation set of the two commonly used datasets for
VLN-CE, i.e., R2R-CE and RxR-CE, demonstrating the utility of our technique in
other tasks. Code and dataset will be made available upon acceptance at
https://intelligolabs.github.io/R2RIE-CE