AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability
CoRR(2024)
摘要
This paper introduces AQA-Bench, a novel benchmark to assess the sequential
reasoning capabilities of large language models (LLMs) in algorithmic contexts,
such as depth-first search (DFS). The key feature of our evaluation benchmark
lies in its interactive evaluation protocol – for example, in DFS, the
availability of each node's connected edge is contingent upon the model's
traversal to that node, thereby necessitating the LLM's ability to effectively
remember visited nodes and strategize subsequent moves. We comprehensively
build AQA-Bench with three different algorithms, namely binary search,
depth-first search, and breadth-first search, and to evaluate the sequential
reasoning ability of 12 different LLMs. Our investigations reveal several
interesting findings: (1) Closed-source models like GPT-4 and Gemini generally
show strong sequential reasoning ability, significantly outperforming
open-source LLMs. (2) Naively providing interactive examples may inadvertently
hurt few-shot performance. (3) A very limited number of predecessor steps
following the optimal policy can substantially boost small models' performance.
(4) The scaling correlation between performance and model size is not always
significant, sometimes even showcasing an inverse trend. We hope our study can
catalyze future work on advancing the understanding and enhancement of LLMs'
capabilities in sequential reasoning. The code is available at
https://github.com/UCSC-VLAA/AQA-Bench.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要