Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
CoRR (2024)
Abstract
Tokenization, the division of input text into input tokens, is an often
overlooked aspect of the large language model (LLM) pipeline and could be the
source of useful or harmful inductive biases. Historically, LLMs have relied on
byte pair encoding, without care to specific input domains. With the increased
use of LLMs for reasoning, various number-specific tokenization schemes have
been adopted, with popular models like LLaMa and PaLM opting for single-digit
tokenization while GPT-3.5 and GPT-4 have separate tokens for every 1-, 2-, and
3-digit number. In this work, we study the effect this choice has on numerical
reasoning through the use of arithmetic tasks. We consider left-to-right and
right-to-left tokenization for GPT-3.5 and -4, finding that right-to-left
tokenization (enforced by comma-separating numbers at inference time) leads to
largely improved performance. Furthermore, we find that model errors when using
standard left-to-right tokenization follow stereotyped error patterns,
suggesting that model computations are systematic rather than approximate. We
show that the model is able to convert between tokenizations easily, thus
allowing chain-of-thought-inspired approaches to recover performance on
left-to-right tokenized inputs. We also find the gap between tokenization
directions decreases when models are scaled, possibly indicating that larger
models are better able to override this tokenization-dependent inductive bias.
In summary, our work performs the first study of how number tokenization
choices lead to differences in model performance on arithmetic tasks,
accompanied by a thorough analysis of error patterns. We hope this work
inspires practitioners to more carefully ablate number tokenization-related
choices when working towards general models of numerical reasoning.
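To make the inference-time trick concrete, below is a minimal sketch (not code from the paper) of how one might comma-separate numbers in a prompt so that a left-to-right BPE tokenizer is forced into right-to-left groups of up to three digits; the function name and example values are illustrative assumptions.

```python
# Sketch: force right-to-left digit grouping by inserting commas every three
# digits from the right, so each 3-digit chunk is tokenized as its own unit.
def comma_separate(n: int) -> str:
    # Python's "," format spec groups digits from the right,
    # e.g. 1234567 -> "1,234,567".
    return f"{n:,}"

if __name__ == "__main__":
    # Hypothetical prompt formatting for an addition query.
    a, b = 9876543, 4321
    prompt = f"What is {comma_separate(a)} + {comma_separate(b)}?"
    print(prompt)  # -> What is 9,876,543 + 4,321?
```

Under this formatting, the commas act as delimiters that align token boundaries with place value, which is the mechanism the paper credits for the improved arithmetic performance of right-to-left tokenization.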