MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
CoRR (2023)
Abstract
Large language models (LLMs) have garnered significant attention due to their
impressive natural language processing (NLP) capabilities. Recently, many
studies have focused on the tool utilization ability of LLMs. They primarily
investigated how LLMs effectively collaborate with given specific tools.
However, in scenarios where LLMs serve as intelligent agents, as seen in
applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate
decision-making processes that involve deciding whether to employ a tool and
selecting the most suitable tool(s) from a collection of available tools to
fulfill user requests. Therefore, in this paper, we introduce MetaTool, a
benchmark designed to evaluate whether LLMs have tool usage awareness and can
correctly choose tools. Specifically, we create a dataset called ToolE within
the benchmark. This dataset contains various types of user queries in the form
of prompts that trigger LLMs to use tools, including both single-tool and
multi-tool scenarios. Subsequently, we set the tasks for both tool usage
awareness and tool selection. We define four subtasks from different
perspectives in tool selection, including tool selection with similar choices,
tool selection in specific scenarios, tool selection with possible reliability
issues, and multi-tool selection. We conduct experiments involving eight
popular LLMs and find that the majority of them still struggle to effectively
select tools, highlighting the existing gaps between LLMs and genuine
intelligent agents. However, our error analysis indicates that there is
still significant room for improvement. Finally, we conclude with insights for
tool developers: we strongly recommend that tool developers choose an
appropriate rewrite model for generating new descriptions based on the
downstream LLM to which the tool will be applied. Our code is available on
GitHub at https://github.com/HowieHwong/MetaTool.
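
The two task families described in the abstract, tool usage awareness and tool selection, can be illustrated with a minimal scoring sketch. This is not the MetaTool code or the ToolE format: the `Example` record, its field names, the two scoring functions, and the toy queries are all hypothetical, and plain accuracy is assumed here only for illustration. The model is represented by ordinary callables so the sketch runs without any LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Example:
    """Hypothetical record; not the actual ToolE schema."""
    query: str
    needs_tool: bool            # gold label for tool usage awareness
    gold_tool: Optional[str]    # gold tool name, or None when no tool is needed
    candidates: List[str]       # tools offered to the model for selection


def awareness_accuracy(examples: List[Example],
                       decide: Callable[[str], bool]) -> float:
    """Fraction of queries where the model's use/don't-use decision matches the label."""
    correct = sum(decide(ex.query) == ex.needs_tool for ex in examples)
    return correct / len(examples)


def selection_accuracy(examples: List[Example],
                       choose: Callable[[str, List[str]], str]) -> float:
    """Fraction of tool-requiring queries where the model picks the gold tool."""
    scored = [ex for ex in examples if ex.gold_tool is not None]
    correct = sum(choose(ex.query, ex.candidates) == ex.gold_tool for ex in scored)
    return correct / len(scored)


# Toy data, invented for this sketch.
tools = ["weather", "calculator"]
examples = [
    Example("What's the weather in Paris tomorrow?", True, "weather", tools),
    Example("Tell me a joke.", False, None, tools),
    Example("What is 17 * 23?", True, "calculator", tools),
]

# Stand-ins for a model: always use a tool, and always pick the first candidate.
always_use_tool = lambda query: True
pick_first = lambda query, candidates: candidates[0]

print(awareness_accuracy(examples, always_use_tool))
print(selection_accuracy(examples, pick_first))
```

In practice the two callables would wrap prompts to the LLM under evaluation. The four selection subtasks named in the abstract would differ mainly in how the candidate list is built, and multi-tool selection would need a set-valued gold label rather than a single tool name.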