m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
arXiv (2024)
Abstract
Real-world multi-modal problems are rarely solved by a single machine
learning model, and often require multi-step computational plans that involve
stitching together several models. Tool-augmented LLMs hold tremendous promise for
automating the generation of such computational plans. However, the lack of
standardized benchmarks for evaluating LLMs as planners for multi-step
multi-modal tasks has prevented a systematic study of planner design decisions.
Should LLMs generate a full plan in a single shot or step-by-step? Should they
invoke tools directly with Python code or through structured data formats like
JSON? Does feedback improve planning? To answer these questions and more, we
introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks
involving 33 tools that include multi-modal models, (free) public APIs, and
image processing modules. For each of these task queries, we provide
automatically generated plans using this realistic toolset. We further provide
a high-quality subset of 1,565 task plans that are human-verified and correctly
executable. With m&m's, we evaluate 6 popular LLMs with 2 planning strategies
(multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3
types of feedback (parsing/verification/execution). Finally, we summarize
takeaways from our extensive experiments. Our dataset and code are available on
HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and Github
(https://github.com/RAIVNLab/mnms).
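Since the dataset is hosted on the HuggingFace Hub at the repo id given above, it should be loadable with the standard `datasets` library. This is a minimal sketch; the available splits and column names are not specified in the abstract, so the snippet simply inspects whatever the repo provides.

```python
from datasets import load_dataset

# Load the m&m's benchmark from the HuggingFace Hub.
# The repo id "zixianma/mnms" comes from the link above; the actual
# splits and fields may differ from what this sketch assumes.
ds = load_dataset("zixianma/mnms")
print(ds)  # inspect available splits and columns
```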
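To make the two plan formats concrete, here is a minimal sketch of how the same two-step task (caption an image, then summarize the caption) might be expressed as a structured JSON plan versus directly executable Python code. The tool names, argument schema, and `<node-0>` output-reference syntax are illustrative assumptions, not necessarily the exact interface used in m&m's.

```python
# Format 1: a structured (JSON-like) plan, where each node names a tool
# and references the output of an earlier node. Hypothetical schema.
json_plan = [
    {"id": 0, "name": "image_captioning", "args": {"image": "input.jpg"}},
    {"id": 1, "name": "text_summarization", "args": {"text": "<node-0>.text"}},
]

# Format 2: the same plan written as Python code that invokes the tools
# directly, passing each step's output to the next.
def code_plan(image_captioning, text_summarization):
    caption = image_captioning(image="input.jpg")
    summary = text_summarization(text=caption)
    return summary

if __name__ == "__main__":
    # Stub tools so the sketch runs end to end.
    result = code_plan(
        image_captioning=lambda image: f"a caption for {image}",
        text_summarization=lambda text: text[:20],
    )
    print(result)
```

The trade-off the paper probes is visible even in this toy: the JSON form is easy to parse and verify mechanically, while the code form expresses data flow natively and can be executed without a separate interpreter for the plan schema.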