Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling

2021 18th International SoC Design Conference (ISOCC), 2021

Abstract
Recently, the necessity of multiple attention heads in the transformer architecture has been questioned [1]. Removing less important heads from a large network is a promising strategy for reducing computation cost and parameters. However, pruning attention heads in multi-head attention does not evenly reduce the overall load, because the feedforward modules are not affected. In this study, we apply attent...
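To illustrate why head pruning only shrinks the attention sub-layer, the PyTorch sketch below masks selected heads inside one attention layer while any feedforward sub-layer (not shown) keeps its full size. The module, the `head_mask` buffer, and the keep/drop decision are illustrative assumptions, not the paper's sensitivity-based layer-wise criterion.

```python
import torch
import torch.nn as nn

class PrunableMultiHeadAttention(nn.Module):
    """Self-attention whose heads can be disabled per layer (illustrative sketch)."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # 1 = keep head, 0 = pruned head; set independently for each layer
        self.register_buffer("head_mask", torch.ones(num_heads))

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v                                  # (B, heads, T, d_head)
        # Zero out pruned heads; a deployed model would instead drop the
        # corresponding projection rows/columns to save real compute.
        ctx = ctx * self.head_mask.view(1, -1, 1, 1)
        return self.out(ctx.transpose(1, 2).reshape(B, T, -1))

# Example: prune half the heads in this layer; the FFN block is untouched.
mha = PrunableMultiHeadAttention()
mha.head_mask[4:] = 0.0
y = mha(torch.randn(2, 16, 512))
```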
Keywords
Training, Costs, Sensitivity, Computational modeling, Computer architecture, Transformers, Computational efficiency