Rho-1: Not All Tokens Are What You Need

Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen

CoRR (2024)

Abstract
Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our initial analysis delves into the token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher excess loss. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% on 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretraining on 80B general tokens, Rho-1 achieves a 6.8% average enhancement across 15 diverse tasks, increasing both the efficiency and the performance of language model pre-training.
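The selection step described in the abstract (score each token with a reference model, then train only on tokens with higher excess loss) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the function name `selective_lm_loss`, the `keep_ratio` parameter, and the top-k selection rule are hypothetical, and the per-token losses are passed in as plain lists rather than computed by real models. Excess loss is taken to be the training model's per-token loss minus the reference model's per-token loss.

```python
import math


def selective_lm_loss(train_losses, ref_losses, keep_ratio=0.6):
    """Average the training loss over the tokens with the highest excess loss.

    train_losses: per-token losses of the model being trained.
    ref_losses:   per-token losses of the reference model on the same tokens.
    keep_ratio:   fraction of tokens to keep (hypothetical value; the
                  abstract does not state one).
    Returns (loss, selected_indices).
    """
    if len(train_losses) != len(ref_losses):
        raise ValueError("both models must score the same tokens")
    if not train_losses:
        raise ValueError("need at least one token")

    # Excess loss: how much worse the training model is than the reference.
    excess = [t - r for t, r in zip(train_losses, ref_losses)]

    # Keep the top keep_ratio fraction of tokens, at least one.
    k = max(1, math.ceil(keep_ratio * len(excess)))
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    selected = sorted(ranked[:k])

    # Only the selected tokens contribute to the loss; the rest are masked out.
    loss = sum(train_losses[i] for i in selected) / k
    return loss, selected


if __name__ == "__main__":
    train = [2.0, 0.5, 3.0, 1.0]
    ref = [0.5, 0.4, 2.9, 0.2]
    print(selective_lm_loss(train, ref, keep_ratio=0.5))
```

In this toy input, the third token has the largest raw loss (3.0) but is skipped, because the reference model finds it almost equally hard (2.9). That is the intuition behind ranking by excess loss rather than by raw loss: tokens that even the reference model cannot predict are treated as less useful training signal.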