Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios
Annual Meeting of the Association for Computational Linguistics (2024)
Abstract
The recent trend of using Large Language Models (LLMs) as intelligent agents in real-world applications underscores the necessity for comprehensive evaluations of their capabilities, particularly in complex scenarios involving planning, creating, and using tools. However, existing benchmarks typically focus on simple synthesized queries that do not reflect real-world complexity, thereby offering limited perspectives in evaluating tool utilization. To address this issue, we present UltraTool, a novel benchmark designed to improve and evaluate LLMs' ability in tool utilization within real-world scenarios. UltraTool focuses on the entire process of using tools - from planning and creating to applying them in complex tasks. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage and simplifies the task solving by mapping out the intermediate steps. Thus, unlike previous work, it eliminates the restriction of pre-defined toolset during planning. Through extensive experiments on various LLMs, we offer novel insights into the evaluation of capabilities of LLMs in tool utilization, thereby contributing a fresh perspective to this rapidly evolving field. The benchmark is publicly available at https://github.com/JoeYing1019/UltraTool.