A Mixture of h-1 Heads is Better than h Heads

ACL 2020, pp. 6566-6577.

Abstract:

Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead "reallocate" them: the model learns to activate different heads ...
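For intuition about the idea described in the abstract (letting the model activate different heads on different inputs rather than pruning them), here is a minimal PyTorch sketch of input-dependent head gating. The gate design, the mean-pooling, the scaling by the number of heads, and all names below are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn


class GatedMultiHeadAttention(nn.Module):
    """Self-attention with an input-dependent gate over heads.

    Illustrative sketch only: the gate (mean-pooled input -> linear ->
    softmax over heads) is a hypothetical design, not the paper's method.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Hypothetical gate: one score per head, computed from the pooled input.
        self.gate = nn.Linear(d_model, n_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq_len, d_head).
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        # Standard scaled dot-product attention, computed per head.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v  # (batch, heads, seq_len, d_head)

        # Input-dependent head weights: one softmax weight per head per example,
        # so different inputs can emphasize (or effectively switch off) different heads.
        head_weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)  # (batch, heads)
        heads = heads * head_weights[:, :, None, None] * self.n_heads

        out = heads.transpose(1, 2).reshape(b, t, -1)
        return self.out(out)


if __name__ == "__main__":
    layer = GatedMultiHeadAttention(d_model=64, n_heads=8)
    x = torch.randn(2, 10, 64)
    print(layer(x).shape)  # torch.Size([2, 10, 64])
```

With a uniform gate this reduces to ordinary multi-head attention; a sharply peaked gate concentrates the output on a subset of heads, which is one simple way to realize per-input head "reallocation".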