Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference
CoRR (2023)
Abstract
The computational cost of transformer models makes them inefficient in
low-latency or low-power applications. While techniques such as quantization or
linear attention can reduce the computational load, they may incur a reduction
in accuracy. In addition, globally reducing the cost for all inputs may be
sub-optimal. We observe that for each layer, the full width of the layer may be
needed only for a small subset of tokens inside a batch and that the
"effective" width needed to process a token can vary from layer to layer.
Motivated by this observation, we introduce the Adaptive Computation Module
(ACM), a generic module that dynamically adapts its computational load to match
the estimated difficulty of the input on a per-token basis. An ACM consists of
a sequence of learners that progressively refine the output of their preceding
counterparts. An additional gating mechanism determines the optimal number of
learners to execute for each token. We also describe a distillation technique
to replace any pre-trained model with an "ACMized" variant. The distillation
phase is designed to be highly parallelizable across layers while remaining
simple to plug into existing networks. Our evaluation of transformer models
in computer vision and speech recognition demonstrates that substituting layers
with ACMs significantly reduces inference costs without degrading the
downstream accuracy across a wide range of user-defined budgets.
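The core idea of the abstract — a sequence of learners that progressively refine an output, with a gate choosing how many to execute per token — can be illustrated with a minimal sketch. This is a toy illustration only, not the paper's implementation: the learner functions, the `gate` heuristic, and the additive refinement scheme are all assumptions chosen for clarity.

```python
from typing import Callable, List

def acm_forward(token: float,
                learners: List[Callable[[float], float]],
                gate: Callable[[float], int]) -> float:
    """Toy sketch of an Adaptive Computation Module (ACM).

    The gate estimates how many learners this token needs; each
    executed learner adds a refinement to the running output.
    """
    n = gate(token)            # number of learners to run (0..len(learners))
    out = 0.0
    for learner in learners[:n]:
        out += learner(token)  # progressive refinement of the output
    return out

# Hypothetical learners: each adds a progressively smaller correction.
learners: List[Callable[[float], float]] = [
    lambda x: x,          # coarse approximation
    lambda x: 0.1 * x,    # first refinement
    lambda x: 0.01 * x,   # finer refinement
]

# Hypothetical gate: "harder" (larger-magnitude) tokens get more learners.
def gate(x: float) -> int:
    return min(len(learners), 1 + int(abs(x)))

print(acm_forward(0.5, learners, gate))  # easy token: 1 learner runs
print(acm_forward(2.5, learners, gate))  # harder token: all 3 learners run
```

In the real model the gate is learned jointly with the learners, so the compute spent per token adapts to estimated input difficulty rather than a fixed heuristic like the magnitude check above.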