Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
CoRR(2024)
摘要
Human forecasting accuracy in practice relies on the 'wisdom of the crowd'
effect, in which predictions about future events are significantly improved by
aggregating across a crowd of individual forecasters. Past work on the
forecasting ability of large language models (LLMs) suggests that frontier
LLMs, as individual forecasters, underperform compared to the gold standard of
a human crowd forecasting tournament aggregate. In Study 1, we expand this
research by using an LLM ensemble approach consisting of a crowd of twelve
LLMs. We compare the aggregated LLM predictions on 31 binary questions to that
of a crowd of 925 human forecasters from a three-month forecasting tournament.
Our preregistered main analysis shows that the LLM crowd outperforms a simple
no-information benchmark and is not statistically different from the human
crowd. In exploratory analyses, we find that these two approaches are
equivalent with respect to medium-effect-size equivalence bounds. We also
observe an acquiescence effect, with mean model predictions being significantly
above 50
Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2)
can be improved by drawing on human cognitive output. We find that both models'
forecasting accuracy benefits from exposure to the median human prediction as
information, improving accuracy by between 17
less accurate predictions than simply averaging human and machine forecasts.
Our results suggest that LLMs can achieve forecasting accuracy rivaling that of
human crowd forecasting tournaments: via the simple, practically applicable
method of forecast aggregation. This replicates the 'wisdom of the crowd'
effect for LLMs, and opens up their use for a variety of applications
throughout society.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要