LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles
International Conference on Computational Linguistics (2023)
Abstract
With the continuous evolution and refinement of LLMs, they are endowed with
impressive logical reasoning or vertical thinking capabilities. But can they
think out of the box? Do they possess proficient lateral thinking abilities?
Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation
benchmark, LatEval, which assesses a model's lateral thinking within an
interactive framework. In our benchmark, we challenge LLMs on two aspects: the
quality of the questions the model poses and the model's capability to integrate
information for problem-solving. We find that nearly all LLMs struggle to
employ lateral thinking during interactions. For example, even the most
advanced model, GPT-4, exhibits an advantage to some extent, yet still
maintains a noticeable gap compared to humans. This evaluation benchmark
provides LLMs with a highly challenging and distinctive task that is crucial to
an effective AI assistant.