Catformer: Designing Stable Transformers Via Sensitivity Analysis

ICML 2021

Cited by 9
Abstract
Transformer architectures are widely used, but training them is non-trivial, requiring custom learning rate schedules, scaling terms, residual connections, careful placement of submodules such as normalization, and so on. In this paper, we improve upon recent analysis of Transformers and formalize a notion of sensitivity to capture the difficulty of training. Sensitivity characterizes how the variance of activation and gradient norms changes in expectation when parameters are randomly perturbed. We analyze the sensitivity of previous Transformer architectures and design a new architecture, the Catformer, which replaces residual connections or RNN-based gating mechanisms with concatenation. We prove that Catformers are less sensitive than other Transformer variants and demonstrate that this leads to more stable training. On DMLab30, a suite of high-dimensional reinforcement learning tasks, Catformer outperforms other Transformers, including Gated Transformer-XL (the state-of-the-art architecture designed to address stability), by 13%.
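The abstract's two key ideas, replacing residual connections with concatenation and measuring sensitivity by randomly perturbing parameters, can be sketched in a few lines of PyTorch. The block and probe below are illustrative assumptions (layer sizes, Gaussian perturbation scale, the down-projection used to keep the width fixed), not the paper's reference implementation or its formal definition of sensitivity.

```python
import copy

import torch
import torch.nn as nn


class ConcatBlock(nn.Module):
    """Sketch of a concatenation-based Transformer sublayer.

    Instead of adding the attention output to its input (a residual
    connection), the input and output are concatenated along the feature
    dimension and projected back down. The down-projection is an
    assumption made here to keep the example small; the paper analyzes
    how width grows with depth under concatenation.
    """

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        # Concatenation in place of the usual residual addition x + attn_out.
        return self.proj(torch.cat([x, attn_out], dim=-1))


def sensitivity_probe(model: nn.Module, x: torch.Tensor,
                      sigma: float = 0.01, n_samples: int = 16) -> float:
    """Rough empirical probe of the sensitivity idea in the abstract:
    perturb parameters with Gaussian noise and report the mean relative
    change in the output activation norm. The paper's formal definition
    differs; this is only an illustration."""
    with torch.no_grad():
        baseline = model(x).norm().item()
        changes = []
        for _ in range(n_samples):
            noisy = copy.deepcopy(model)
            for p in noisy.parameters():
                p.add_(sigma * torch.randn_like(p))
            changes.append(abs(noisy(x).norm().item() - baseline) / baseline)
    return sum(changes) / len(changes)


if __name__ == "__main__":
    block = ConcatBlock(d_model=64)
    tokens = torch.randn(2, 10, 64)  # (batch, sequence length, features)
    print(block(tokens).shape)       # torch.Size([2, 10, 64])
    print(sensitivity_probe(block, tokens))
```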
Keywords
stable transformers, sensitivity analysis