Predict the Next Word: Humans exhibit uncertainty in this task and language models _____
CoRR (2024)
Abstract
Language models (LMs) are statistical models trained to assign probability to
human-generated text. As such, it is reasonable to ask how well they
approximate the linguistic variability exhibited by humans. This form of
statistical assessment is difficult to perform at the passage level, for it
requires acceptability judgements (i.e., human evaluation) or a robust
automated proxy (which is non-trivial). At the word level, however, given some
context, samples from an LM can be assessed via exact matching against a
prerecorded dataset of alternative single-word continuations of the available
context. We exploit this fact and evaluate the LM's ability to reproduce
variability that humans (in particular, a population of English speakers)
exhibit in the 'next word prediction' task. This can be seen as assessing a
form of calibration, which, in the context of text classification, Baan et al.
(2022) termed calibration to human uncertainty. We assess GPT2, BLOOM and
ChatGPT and find that they exhibit fairly low calibration to human uncertainty.
We also verify that expected calibration error (ECE) fails to reflect this
and, as such, advise the community against relying on it in this setting.
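To make the word-level protocol concrete, here is a minimal sketch of how samples from an LM can be compared against a prerecorded set of single-word human continuations of the same context via exact matching. The context, the word counts, and the use of total variation distance as the comparison measure are illustrative assumptions for this sketch, not details taken from the paper.

```python
from collections import Counter

def empirical_distribution(samples):
    """Turn a list of single-word continuations into a probability distribution."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def total_variation_distance(p, q):
    """TVD between two distributions, summed over the union of their supports."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support)

# Hypothetical data: 10 human annotators and 10 LM samples, each providing
# a single-word continuation of the same context, e.g. "The cat sat on the ..."
human_continuations = ["mat"] * 6 + ["floor"] * 3 + ["sofa"]
lm_samples = ["mat"] * 9 + ["rug"]

p_human = empirical_distribution(human_continuations)
p_lm = empirical_distribution(lm_samples)

# A low TVD would suggest the LM reproduces human variability on this context;
# here the LM concentrates mass on "mat", so the distance is substantial.
print(f"TVD(human, LM) = {total_variation_distance(p_human, p_lm):.3f}")
```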
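For contrast, the sketch below computes standard expected calibration error. The toy numbers are hypothetical; they illustrate how a model can look well calibrated under ECE (its confidence tracks its exact-match accuracy) while still collapsing the distribution of human continuations onto a single word, which is one way ECE can fail to reflect calibration to human uncertainty.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then average the absolute
    gap between mean confidence and accuracy, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: every prediction is the single word "mat" at confidence 0.9,
# and 9 of 10 predictions exactly match the human continuation.
conf = [0.9] * 10
hits = [1] * 9 + [0]  # 90% exact-match accuracy at 0.9 confidence

# ECE is zero, yet the model places no mass on "floor" or "sofa".
print(f"ECE = {expected_calibration_error(conf, hits):.3f}")
```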