DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling
arXiv (2024)
Abstract
Traditional language models operate autoregressively, i.e., they predict one
token at a time. The rapid explosion in model sizes has resulted in high inference
times. In this work, we propose DynaMo, a suite of multi-token prediction
language models that reduce net inference times. Our models
dynamically predict multiple tokens based on their confidence in the
predicted joint probability distribution. We propose a lightweight technique to
train these models, leveraging the weights of traditional autoregressive
counterparts. Moreover, we propose novel ways to enhance the estimated joint
probability to improve text generation quality, namely co-occurrence weighted
masking and adaptive thresholding. We also propose systematic qualitative and
quantitative methods to rigorously test the quality of generated text for
non-autoregressive generation. One of the models in our suite, DynaMo-7.3B-T3,
achieves same-quality generated text as the baseline (Pythia-6.9B) while
achieving a 2.57× speed-up with only 5.87% and 2.67% parameter and
training time overheads, respectively.
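To make the acceptance rule concrete, here is a minimal Python sketch of confidence-gated multi-token decoding under stated assumptions: the function name `dynamic_multi_token_sample`, the greedy per-head picks, and the fixed threshold schedule are all illustrative, not the paper's exact formulation, and the co-occurrence weighted masking and adaptive thresholding that refine the joint estimate are not shown.

```python
import numpy as np

def dynamic_multi_token_sample(head_probs, thresholds=(0.0, 0.4, 0.35)):
    """Accept tokens from successive prediction heads while the estimated
    joint probability of the accepted prefix stays above a per-length
    threshold. Hypothetical sketch; thresholds are illustrative constants."""
    accepted = []
    joint = 1.0  # running estimate of the joint probability of accepted tokens
    for probs, tau in zip(head_probs, thresholds):
        token = int(np.argmax(probs))  # greedy pick from this head
        joint *= float(probs[token])   # update the joint-probability estimate
        if joint < tau:
            break                      # not confident enough: stop accepting
        accepted.append(token)
    return accepted

# Toy example: three heads over a 4-token vocabulary. The third head is too
# uncertain, so only two tokens are emitted in this decoding step.
heads = [
    np.array([0.05, 0.80, 0.10, 0.05]),
    np.array([0.10, 0.15, 0.70, 0.05]),
    np.array([0.25, 0.25, 0.30, 0.20]),
]
print(dynamic_multi_token_sample(heads))  # -> [1, 2]
```

Setting the first threshold to zero guarantees that at least one token is always emitted, so a decoding step never does worse than a standard autoregressive step; the number of tokens accepted then varies dynamically with the model's confidence.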