Setting the Record Straight on Transformer Oversmoothing
CoRR (2024)
Abstract
Transformer-based models have recently become wildly successful across a
diverse set of domains. At the same time, recent work has shown that
Transformers are inherently low-pass filters that gradually oversmooth the
inputs, reducing the expressivity of their representations. A natural question
is: How can Transformers achieve these successes given this shortcoming? In
this work we show that in fact Transformers are not inherently low-pass
filters. Instead, whether Transformers oversmooth or not depends on the
eigenspectrum of their update equations. Our analysis extends prior work in
oversmoothing and in the closely-related phenomenon of rank collapse. We show
that many successful Transformer models have attention and weight matrices
satisfying conditions that avoid oversmoothing. Based on this analysis, we
derive a simple way to parameterize the weights of the Transformer update
equations that allows control over their spectrum, ensuring that oversmoothing
does not occur. Compared to a recent solution for oversmoothing, our approach
improves generalization, even when training with more layers, fewer
datapoints, and corrupted data.
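
The abstract's central claim is that whether a Transformer smooths or sharpens depends on the eigenspectrum of its update equations. As a rough numerical illustration only (not the paper's construction), the sketch below linearizes a residual attention layer as X <- X + A X W with a fixed row-stochastic attention matrix A and the toy choice W = w*I; the depth, the per-layer rescaling, and the smoothness metric are all illustrative assumptions. Flipping the sign of w flips which eigendirections of the update dominate, and so whether stacked layers collapse all tokens toward a common vector.

```python
# Hypothetical sketch, assuming a linearized residual update X <- X + A X W
# with fixed row-stochastic A and W = w * I. Not the paper's exact setup.
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 16  # number of tokens, feature dimension

# Row-stochastic "attention" matrix: softmax over random scores, row by row.
scores = rng.normal(size=(n, n))
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

def smoothness_ratio(X):
    """Fraction of X's energy outside the rank-1 'all tokens equal' subspace.
    A value near 0 means the tokens have been oversmoothed into one vector."""
    residual = X - X.mean(axis=0, keepdims=True)
    return np.linalg.norm(residual) / np.linalg.norm(X)

for w in (+0.5, -0.5):           # sign of W's eigenvalues is the knob
    W = w * np.eye(d)
    X = rng.normal(size=(n, d))
    for _ in range(64):          # 64 stacked (linearized) layers
        X = X + A @ X @ W        # residual attention update
        X /= np.linalg.norm(X)   # rescale to avoid overflow/underflow
    print(f"w = {w:+.1f}: smoothness ratio after 64 layers = "
          f"{smoothness_ratio(X):.3f}")

# Why this works: since A is row-stochastic, A @ ones = ones, so the
# "all tokens identical" subspace is invariant, and each eigenpair of the
# update grows like |1 + w * lambda_i(A)| per layer. With w > 0 the smooth
# direction (lambda = 1) grows fastest, so the ratio decays toward 0
# (oversmoothing); with w < 0 the non-smooth directions dominate and token
# differences survive.
```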