AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents
arXiv (2024)
Abstract
The advances made by Large Language Models (LLMs) have led to the pursuit of
LLM agents that can solve intricate, multi-step reasoning tasks. As with any
research pursuit, benchmarking and evaluation are key cornerstones of
efficient and reliable progress. However, existing benchmarks are often narrow
and simply compute overall task success. To address these issues, we propose
AgentQuest – a framework where (i) both benchmarks and metrics are modular and
easily extensible through well-documented and easy-to-use APIs; (ii) we offer
two new evaluation metrics that can reliably track LLM agent progress while
solving a task. We exemplify the utility of the metrics on two use cases
wherein we identify common failure points and refine the agent architecture to
obtain a significant performance increase. Together with the research
community, we hope to extend AgentQuest further, and we therefore make it
available at https://github.com/nec-research/agentquest.
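To make the "modular and easily extensible" claim more concrete, below is a minimal sketch of what a pluggable benchmark/metric interface of this kind could look like. All names here (BenchmarkDriver, Metric, Observation, reset/step/update/value, and the example RepetitionMetric) are illustrative assumptions, not the actual AgentQuest API; consult the repository for the real interfaces and metric definitions.

```python
# Hypothetical sketch of a modular benchmark/metric design, not the AgentQuest API.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Observation:
    """What the environment returns after an agent action."""
    output: str
    done: bool = False
    success: bool = False


class BenchmarkDriver(ABC):
    """A pluggable benchmark: wraps one task instance the agent interacts with."""

    @abstractmethod
    def reset(self) -> Observation:
        """Start (or restart) the task and return the initial observation."""

    @abstractmethod
    def step(self, action: str) -> Observation:
        """Apply one agent action and return the resulting observation."""


class Metric(ABC):
    """A pluggable metric computed over the sequence of agent actions."""

    @abstractmethod
    def update(self, action: str, obs: Observation) -> None: ...

    @abstractmethod
    def value(self) -> float: ...


class RepetitionMetric(Metric):
    """Example metric: fraction of actions that repeat an earlier action,
    a simple proxy for an agent stuck in a loop."""

    def __init__(self) -> None:
        self.actions: list[str] = []

    def update(self, action: str, obs: Observation) -> None:
        self.actions.append(action)

    def value(self) -> float:
        if not self.actions:
            return 0.0
        repeats = len(self.actions) - len(set(self.actions))
        return repeats / len(self.actions)
```

Under such a design, adding a new benchmark means implementing one driver class, and adding a new metric means implementing one update/value pair, which is the kind of extensibility the abstract describes.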