Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
arXiv (2024)
Abstract
Large language models (LLMs) have exhibited impressive performance in
language comprehension and various reasoning tasks. However, their abilities in
spatial reasoning, a crucial aspect of human cognition, remain relatively
unexplored. Humans possess a remarkable ability to create mental images of
unseen objects and actions through a process known as the Mind's Eye,
enabling the imagination of the unseen world. Inspired by this cognitive
capacity, we propose Visualization-of-Thought (VoT) prompting. VoT
aims to elicit spatial reasoning of LLMs by visualizing their reasoning traces,
thereby guiding subsequent reasoning steps. We employed VoT for multi-hop
spatial reasoning tasks, including natural language navigation, visual
navigation, and visual tiling in 2D grid worlds. Experimental results
demonstrated that VoT significantly enhances the spatial reasoning abilities of
LLMs. Notably, VoT outperformed existing multimodal large language models
(MLLMs) in these tasks. While VoT works surprisingly well on LLMs, the ability
to generate mental images to facilitate spatial reasoning resembles
the mind's eye process, suggesting its potential viability in MLLMs.
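The core idea of VoT, as the abstract describes it, is to interleave each reasoning step with a rendered "mental image" of the intermediate state, which then guides the next step. A minimal sketch of that pattern for a 2D grid-world navigation task is shown below; the grid rendering, move names, and function names are illustrative assumptions, not the paper's actual prompt template.

```python
# Illustrative sketch of a VoT-style reasoning trace for 2D grid navigation.
# The rendering format and move vocabulary are assumptions for this example,
# not the paper's verbatim prompting template.

# Unit vectors for each move; y grows downward, as in typical grid indexing.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def render_grid(width, height, pos):
    """Render the agent's position in a 2D grid as an ASCII 'mental image'."""
    rows = []
    for y in range(height):
        rows.append(" ".join("A" if (x, y) == pos else "." for x in range(width)))
    return "\n".join(rows)

def vot_trace(width, height, start, moves):
    """Build a reasoning trace that visualizes the state after every step,
    mirroring VoT's idea of drawing intermediate states to guide reasoning."""
    pos = start
    steps = [f"Start at {pos}:\n{render_grid(width, height, pos)}"]
    for m in moves:
        dx, dy = MOVES[m]
        pos = (pos[0] + dx, pos[1] + dy)
        steps.append(f"Move {m} -> {pos}:\n{render_grid(width, height, pos)}")
    return "\n\n".join(steps), pos

trace, final = vot_trace(3, 3, (0, 0), ["right", "down"])
print(trace)  # each step is followed by an ASCII picture of the grid
```

In an actual VoT prompt, the model itself would be asked to emit such a visualization after each reasoning step, rather than a program computing it; this sketch only illustrates the interleaved state-then-step structure.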