QuRating: Selecting High-Quality Data for Training Language Models
CoRR (2024)
Abstract
Selecting high-quality pre-training data is important for creating capable
language models, but existing methods rely on simple heuristics. We introduce
QuRating, a method for selecting pre-training data that captures the abstract
qualities of texts which humans intuitively perceive. In this paper, we
investigate four qualities - writing style, required expertise, facts & trivia,
and educational value. We find that LLMs are able to discern these qualities
and observe that they are better at making pairwise judgments of texts than at
rating the quality of a text directly. We train a QuRater model to learn scalar
ratings from pairwise judgments, and use it to annotate a 260B-token training corpus
with quality ratings for each of the four criteria. In our experiments, we
select 30B tokens according to the different quality ratings and train
1.3B-parameter language models on the selected data. We find that it is
important to balance quality and diversity, as selecting only the highest-rated
documents leads to poor results. When we sample using quality ratings as logits
over documents, our models achieve lower perplexity and stronger in-context
learning performance than baselines. Beyond data selection, we use the quality
ratings to construct a training curriculum which improves performance without
changing the training dataset. We extensively analyze the quality ratings and
discuss their characteristics, biases, and wider implications.
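
The step from pairwise LLM judgments to scalar ratings is a Bradley-Terry-style setup: the probability that text A beats text B is modeled as a sigmoid of the difference of their scalar ratings. Below is a minimal sketch of that objective, assuming a generic embedding-based rater; the paper's actual QuRater is a fine-tuned language model, and the names here (Rater, bradley_terry_loss) are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class Rater(nn.Module):
    """Maps a document embedding to a scalar quality rating."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(emb).squeeze(-1)  # shape: (batch,)

def bradley_terry_loss(r_a, r_b, pref_a):
    # pref_a in [0, 1]: judged probability that text A is better.
    # P(A > B) is modeled as sigmoid(r_a - r_b).
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, pref_a)

# Toy training step on random embeddings (placeholders for real features).
rater = Rater()
opt = torch.optim.Adam(rater.parameters(), lr=1e-4)
emb_a, emb_b = torch.randn(32, 768), torch.randn(32, 768)
pref_a = torch.rand(32)  # soft pairwise judgments from the LLM
opt.zero_grad()
loss = bradley_terry_loss(rater(emb_a), rater(emb_b), pref_a)
loss.backward()
opt.step()
```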
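"Sampling using quality ratings as logits" amounts to drawing documents from a softmax over ratings, with a temperature that trades quality against diversity. A minimal sketch using the Gumbel top-k trick for sampling without replacement follows; drawing a fixed count k of documents stands in for the paper's 30B-token budget, and sample_documents is a hypothetical helper, not the authors' code.

```python
import numpy as np

def sample_documents(ratings: np.ndarray, k: int, tau: float = 2.0,
                     seed: int = 0) -> np.ndarray:
    """Sample k document indices without replacement from
    softmax(ratings / tau) via the Gumbel top-k trick."""
    rng = np.random.default_rng(seed)
    keys = ratings / tau + rng.gumbel(size=ratings.shape)
    return np.argsort(-keys)[:k]

ratings = np.random.default_rng(1).normal(size=1_000_000)
chosen = sample_documents(ratings, k=100_000)
```

High tau approaches uniform sampling (maximal diversity); low tau approaches deterministic top-k selection, which the abstract reports leads to poor results.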
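The curriculum result can be read as reordering rather than filtering: the training set is unchanged, and only the order of presentation follows the ratings. A toy sketch of one plausible ordering (highest-rated first) is below; the paper's exact schedule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.normal(size=10_000)  # placeholder per-document ratings
curriculum = np.argsort(-ratings)  # a training order, not a subset
```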