Two Counterexamples to Tokenization and the Noiseless Channel
CoRR (2024)
Abstract
In Tokenization and the Noiseless Channel, Rényi efficiency is suggested as an
intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer
which leads to the highest Rényi efficiency of the unigram distribution
should be chosen. The Rényi efficiency is thus treated as a predictor of
downstream performance (e.g., predicting BLEU for a machine translation task),
without the expensive step of training multiple models with different
tokenizers. Although useful, the predictive power of this metric is not
perfect, and the authors note there are additional qualities of a good
tokenization scheme that Rényi efficiency alone cannot capture.
We describe two variants of BPE tokenization which can arbitrarily increase
Rényi efficiency while decreasing the downstream model performance. These
counterexamples expose cases where Rényi efficiency fails as an intrinsic
tokenization metric and thus give insight for building more accurate
predictors.
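As a concrete illustration of the metric under discussion, the sketch below computes the Rényi efficiency of a token sequence's unigram distribution: the Rényi entropy of order α, normalized by the maximum attainable entropy log|V| over the observed vocabulary V. This is an assumption-laden reading of the abstract, not the cited paper's own implementation; the choice α = 2.5 follows the value reported as most predictive in Tokenization and the Noiseless Channel.

```python
import math
from collections import Counter

def renyi_efficiency(tokens, alpha=2.5):
    """Rényi efficiency of the unigram distribution of `tokens`.

    A sketch based on the abstract, not the paper's official code:
    H_alpha(p) / log|V|, where H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha)
    and V is the set of observed token types.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    # Rényi entropy of order alpha (alpha != 1; alpha -> 1 recovers Shannon entropy)
    h_alpha = math.log(sum(p ** alpha for p in probs)) / (1 - alpha)
    # Normalize by the maximum entropy over |V| token types
    return h_alpha / math.log(len(counts))
```

A uniform unigram distribution attains efficiency 1.0, while a highly skewed one scores lower, which is the sense in which a "more efficient" tokenizer spreads probability mass more evenly over its vocabulary.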