Text Quality-Based Pruning for Efficient Training of Language Models
arXiv (2024)
Abstract
In recent times, training Language Models (LMs) has relied on computationally
heavy training over massive datasets, which makes the training process
extremely laborious. In this paper we propose a novel method for numerically
evaluating text quality in large unlabelled NLP datasets in a model-agnostic
manner, assigning each text instance a "quality score".
Using this text quality metric, we establish a framework to
identify and eliminate low-quality text instances, leading to improved training
efficiency for LMs. Experimental results over multiple models and
datasets demonstrate the efficacy of this approach, showcasing substantial
gains in training effectiveness and highlighting the potential for
resource-efficient LM training.
For example, we observe an absolute accuracy improvement of 0.9%
over 14 downstream evaluation tasks for multiple LM models while using 40%
lesser data and training 42% faster, and a 0.8% absolute accuracy improvement
while training 21% faster on a second pretraining dataset.
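The abstract does not specify how the quality score is computed. As illustration only, below is a minimal sketch assuming an inverse-perplexity proxy under a small reference model ("gpt2" is a placeholder, not the paper's scorer) and a simple rank-and-keep pruning step; it is not the paper's actual algorithm.

```python
# Hypothetical sketch of quality-score pruning, not the paper's method.
# Assumption: quality is approximated by inverse perplexity under a reference LM.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder reference model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


@torch.no_grad()
def quality_score(text: str) -> float:
    """Higher score = lower perplexity under the reference model."""
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=512).input_ids
    # Passing labels=input_ids makes the model return the mean token cross-entropy.
    loss = model(ids, labels=ids).loss.item()
    return math.exp(-loss)  # inverse perplexity, in (0, 1]


def prune_corpus(corpus: list[str], keep_fraction: float = 0.6) -> list[str]:
    """Keep the top `keep_fraction` of texts by score.

    keep_fraction=0.6 mirrors the abstract's "40% lesser data"; the paper's
    actual pruning threshold may differ.
    """
    ranked = sorted(corpus, key=quality_score, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]


if __name__ == "__main__":
    corpus = [
        "The mitochondrion is the powerhouse of the cell.",
        "asdf qwer zxcv 1234 !!!! buy now click here",
    ]
    print(prune_corpus(corpus, keep_fraction=0.5))
```

In practice, any scoring model could be substituted here, which is consistent with the abstract's claim that the method is model agnostic.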