An Adaptive Learning Method for Solving the Extreme Learning Rate Problem of Transformer

Jianbang Ding, Xuancheng Ren, Ruixuan Luo

NLPCC (1) (2023)

Abstract
Transformer, a neural sequence model based entirely on attention, has achieved great success in natural language processing and has become the de facto default model for many NLP tasks. Despite its prevalence, the attention-based structure poses challenges for widely-used adaptive optimization methods, e.g., Adam, which often fail to converge when applied alone. In this work, we show that adaptive optimization methods can produce extremely large learning rates that destabilize training. We further propose AdaMod, which smooths out extremely large learning rates with adaptive and momental upper bounds on a per-parameter basis, instead of the uniform scaling of the warmup scheme. We empirically demonstrate that AdaMod improves learning stability and brings significant gains to the performance of Transformers and CNNs. Moreover, empirical results verify its effectiveness and robustness across different applications.
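
To make the momental-bound idea concrete, below is a minimal NumPy sketch of one AdaMod-style parameter update, reconstructed from the description in the abstract. The function name adamod_step, the state-dictionary layout, and the smoothing coefficient beta3 are illustrative assumptions, not the authors' reference implementation.

    import numpy as np

    def adamod_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                    beta3=0.999, eps=1e-8):
        # One AdaMod-style update (sketch based on the abstract; beta3
        # and the state layout are assumptions).
        state["t"] += 1
        t = state["t"]
        # Adam's exponential moving averages of the gradient and its square
        state["m"] = beta1 * state["m"] + (1 - beta1) * grad
        state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2
        m_hat = state["m"] / (1 - beta1**t)  # bias-corrected first moment
        v_hat = state["v"] / (1 - beta2**t)  # bias-corrected second moment
        # Adam's per-parameter step size, which can blow up when v_hat is tiny
        eta = lr / (np.sqrt(v_hat) + eps)
        # Momental upper bound: exponential moving average of past step sizes
        state["s"] = beta3 * state["s"] + (1 - beta3) * eta
        # Clip extreme learning rates element-wise against the bound
        eta = np.minimum(eta, state["s"])
        return theta - eta * m_hat

    # Usage: initialize state = {"t": 0, "m": 0.0, "v": 0.0, "s": 0.0}
    # and call theta = adamod_step(theta, grad, state) each iteration.

Because the bound s starts at zero, early step sizes are clamped toward small values, which plays a role similar to warmup without requiring a hand-tuned schedule.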
Keywords
extreme learning rate problem, adaptive learning method