Cached Transformers: Improving Transformers with Differentiable Memory Cache
CoRR (2023)
Abstract
This work introduces a new Transformer model called Cached Transformer, which
uses Gated Recurrent Cached (GRC) attention to extend the self-attention
mechanism with a differentiable memory cache of tokens. GRC attention enables
attending to both past and current tokens, increasing the receptive field of
attention and allowing long-range dependencies to be explored. By utilizing a
recurrent gating unit to continuously update the cache, our model achieves
significant advancements in six language and vision tasks, including
language modeling, machine translation, ListOps, image classification, object
detection, and instance segmentation. Furthermore, our approach surpasses
previous memory-based techniques in tasks such as language modeling and
generalizes to a broader range of situations.
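The abstract describes the mechanism only at a high level: a memory cache of tokens updated by a recurrent gate, with attention applied over both the cache and the current tokens. Below is a minimal PyTorch sketch of that idea, not the authors' formulation; the class name GRCAttentionSketch, the mean-pooled token summary, the cache length, the gating projection, and the learned blending scalar are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class GRCAttentionSketch(nn.Module):
    """Toy sketch (assumed design, not the paper's exact method):
    self-attention over current tokens plus cross-attention over a
    recurrently gated memory cache."""

    def __init__(self, dim: int, num_heads: int, cache_len: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cache_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate deciding how much of the incoming tokens enters the cache.
        self.gate_proj = nn.Linear(2 * dim, dim)
        # Learnable scalar balancing self- vs. cache-attention outputs.
        self.balance = nn.Parameter(torch.zeros(1))
        self.cache_len = cache_len
        self.cache = None  # lazily initialized per batch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        if self.cache is None or self.cache.shape[0] != b:
            self.cache = x.new_zeros(b, self.cache_len, d)

        # Gated recurrent cache update: interpolate the old cache with a
        # summary of the incoming tokens (mean-pooled, broadcast to cache_len).
        summary = x.mean(dim=1, keepdim=True).expand(-1, self.cache_len, -1)
        gate = torch.sigmoid(self.gate_proj(torch.cat([self.cache, summary], dim=-1)))
        new_cache = (1 - gate) * self.cache + gate * summary  # differentiable this step

        # Attend to the current tokens and, separately, to the updated cache.
        out_self, _ = self.self_attn(x, x, x)
        out_mem, _ = self.cache_attn(x, new_cache, new_cache)

        # Store the cache detached so gradients do not flow across steps
        # (a simplification; how far gradients propagate is a design choice).
        self.cache = new_cache.detach()

        # Blend the two attention streams with a learned sigmoid gate.
        lam = torch.sigmoid(self.balance)
        return lam * out_self + (1 - lam) * out_mem


# Usage: tokens from two consecutive steps; the second call sees the cache.
layer = GRCAttentionSketch(dim=64, num_heads=4)
out1 = layer(torch.randn(2, 10, 64))
out2 = layer(torch.randn(2, 10, 64))
```

The sketch illustrates the two properties the abstract emphasizes: the cache update is differentiable within a step, and each token attends both to its current context and to the persistent cache, enlarging the effective receptive field.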