Toward a Theory of Tokenization in LLMs
arXiv (2024)
Abstract
While there has been a large body of research attempting to circumvent
tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the
current consensus is that it is a necessary initial step for designing
state-of-the-art performant language models. In this paper, we investigate
tokenization from a theoretical point of view by studying the behavior of
transformers on simple data generating processes. When trained on data drawn
from certain simple k^th-order Markov processes for k > 1,
transformers exhibit a surprising phenomenon: in the absence of tokenization,
they empirically fail to learn the right distribution and predict characters
according to a unigram model (Makkuva et al., 2024). With the addition of
tokenization, however, we empirically observe that transformers break through
this barrier and are able to model the probabilities of sequences drawn from
the source near-optimally, achieving small cross-entropy loss. With this
observation as a starting point, we study the end-to-end cross-entropy loss
achieved by transformers with and without tokenization. With the appropriate
tokenization, we show that even the simplest unigram models (over tokens)
learnt by transformers are able to model the probability of sequences drawn
from k^th-order Markov sources near optimally. Our analysis provides
a justification for the use of tokenization in practice through studying the
behavior of transformers on Markovian data.
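The core claim — that a unigram model over suitably chosen tokens can approach the entropy rate of a k-th order Markov source, while a character-level unigram model cannot — can be illustrated numerically. The sketch below is not from the paper; it uses a hypothetical order-2 binary Markov source, a hand-picked token dictionary with greedy longest-match tokenization (a stand-in for a learned tokenizer such as BPE), and compares empirical cross-entropy in bits per character.

```python
import random, math
from collections import Counter

random.seed(0)

# Hypothetical order-2 Markov source over {0,1}:
# P[(a, b)] = probability that the next symbol is 1 given previous two (a, b).
P = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.8, (1, 1): 0.1}

def sample(n):
    s = [random.randint(0, 1), random.randint(0, 1)]
    for _ in range(n - 2):
        s.append(1 if random.random() < P[(s[-2], s[-1])] else 0)
    return s

seq = sample(200_000)

# (a) Character-level unigram model: ignores all context.
c = Counter(seq)
char_unigram_ce = -sum(c[x] / len(seq) * math.log2(c[x] / len(seq)) for x in c)

# (b) Greedy longest-match tokenization with a small fixed dictionary
# (lengths 1..3), then a unigram model over the resulting token stream.
vocab = {(0,), (1,), (0, 0), (1, 1), (0, 1), (1, 0), (0, 0, 0), (1, 1, 1)}
tokens, i = [], 0
while i < len(seq):
    for L in (3, 2, 1):  # all length-1 tokens are in vocab, so this terminates
        t = tuple(seq[i:i + L])
        if t in vocab:
            tokens.append(t)
            i += L
            break

tc = Counter(tokens)
ntok = len(tokens)
# Bits per *character* (not per token) under the token-level unigram model.
tok_unigram_ce = -sum(n * math.log2(n / ntok) for n in tc.values()) / len(seq)

# Empirical estimate of the source entropy rate in bits/char:
# average conditional entropy, weighted by observed context frequencies.
ctx = Counter(zip(seq, seq[1:]))
H = 0.0
for (a, b), n in ctx.items():
    p1 = P[(a, b)]
    H += (n / (len(seq) - 1)) * -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

print(f"source entropy rate ~ {H:.3f} bits/char")
print(f"char-level unigram  ~ {char_unigram_ce:.3f} bits/char")
print(f"token-level unigram ~ {tok_unigram_ce:.3f} bits/char")
```

On this toy source the character-level unigram model is stuck near 1 bit/char (the marginal is uniform by symmetry), while the unigram model over multi-character tokens achieves a cross-entropy much closer to the source entropy rate — a small-scale analogue of the paper's observation.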