WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
CoRR(2024)
摘要
The advancement of large language models (LLMs) leads to a new era marked by
the development of autonomous applications in the real world, which drives
innovation in the creation of advanced web-based agents. Existing web agents
typically only handle one input modality and are evaluated only in simplified
web simulators or static web snapshots, greatly limiting their applicability in
real-world scenarios. To bridge this gap, we introduce WebVoyager, an
innovative Large Multimodal Model (LMM) powered web agent that can complete
user instructions end-to-end by interacting with real-world websites. Moreover,
we propose a new evaluation protocol for web agents to address the challenges
of automatic evaluation of open-ended web agent tasks, leveraging the robust
multimodal comprehension capabilities of GPT-4V. We create a new benchmark by
gathering real-world tasks from 15 widely used websites to evaluate our agents.
We show that WebVoyager achieves a 55.7
surpassing the performance of both GPT-4 (All Tools) and the WebVoyager
(text-only) setups, underscoring the exceptional capability of WebVoyager in
practical applications. We found that our proposed automatic evaluation
achieves 85.3
development of web agents in a real-world setting.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要