Large Language Models as Automated Aligners for benchmarking Vision-Language Models
ICLR 2024(2023)
Abstract
With the advancements in Large Language Models (LLMs), Vision-Language Models
(VLMs) have reached a new level of sophistication, showing notable competence
in executing intricate cognition and reasoning tasks. However, existing
evaluation benchmarks, primarily relying on rigid, hand-crafted datasets to
measure task-specific performance, face significant limitations in assessing
the alignment of these increasingly anthropomorphic models with human
intelligence. In this work, we address these limitations with Auto-Bench, which
explores LLMs as proficient aligners that measure the alignment
between VLMs and human intelligence and values through automatic data curation
and assessment. Specifically, for data curation, Auto-Bench utilizes LLMs
(e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning
triplets via prompting on visual symbolic representations (e.g., captions,
object locations, instance relationships, etc.). The curated data closely
matches human intent, owing to the extensive world knowledge embedded in LLMs.
Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered
question-answer-reasoning triplets have been curated, covering 4 primary
abilities and 16 sub-abilities. We then use LLMs such as GPT-3.5 as
judges, implementing quantitative and qualitative automated
assessments to facilitate a comprehensive evaluation of VLMs. Our validation
results reveal that LLMs are proficient in both evaluation data curation and
model assessment, achieving an average agreement rate of 85%. We envision
Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating
increasingly sophisticated VLMs.
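
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The `Triplet` class, the prompt layout in `build_curation_prompt`, and the `exact_match_judge` stand-in are all hypothetical names introduced here for illustration; the paper uses LLMs (GPT-4 for curation, GPT-3.5 as judge) where this sketch uses plain string handling. The agreement rate is assumed to be the fraction of items on which the judge's verdict matches a human verdict.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Triplet:
    """One curated item: a question, its reference answer, and the reasoning."""
    question: str
    answer: str
    reasoning: str


def build_curation_prompt(captions: Sequence[str], objects: Sequence[str]) -> str:
    """Flatten visual symbolic representations into a text prompt.

    The layout is a hypothetical example, not the prompt used in the paper.
    """
    lines = ["Captions:"]
    lines += [f"- {caption}" for caption in captions]
    lines.append("Objects:")
    lines += [f"- {obj}" for obj in objects]
    lines.append("Write a question, an answer, and the reasoning behind the answer.")
    return "\n".join(lines)


def exact_match_judge(triplet: Triplet, prediction: str) -> bool:
    """Stand-in for an LLM judge: accept a prediction that matches the reference."""
    return prediction.strip().lower() == triplet.answer.strip().lower()


def agreement_rate(judge_verdicts: Sequence[bool], human_verdicts: Sequence[bool]) -> float:
    """Fraction of items on which the judge and the human give the same verdict."""
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not judge_verdicts:
        raise ValueError("at least one verdict is required")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


if __name__ == "__main__":
    item = Triplet(
        question="How many dogs are in the image?",
        answer="Two",
        reasoning="The caption mentions two dogs playing in a park.",
    )
    print(build_curation_prompt(["Two dogs play in a park."], ["dog", "dog", "tree"]))
    print(exact_match_judge(item, " two "))
    print(agreement_rate([True, False, True, True], [True, True, True, True]))
```

In a real setup, the judge would be an LLM call that compares a VLM's response against the reference answer and reasoning, and `agreement_rate` would be computed on a human-annotated validation subset.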
Keywords
LLMs, VLMs, Benchmark