A Mixture of h - 1 Heads is Better than h Heads
In Proceedings of ACL, pp. 6566–6577, 2020.
Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead “reallocate” them—the model learns to activate different heads …
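To make the "reallocation" idea concrete, here is a minimal sketch of multi-head self-attention in which head outputs are mixed by learned gate probabilities instead of being concatenated uniformly. This is an illustrative assumption, not the paper's exact formulation; `gate_logits` and all shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_multihead_attention(X, Wq, Wk, Wv, gate_logits):
    """Self-attention where each head's output is weighted by a
    learned gate, so the model can emphasize or suppress heads.
    X: (n, d); Wq, Wk, Wv: (h, d, d_k); gate_logits: (h,)."""
    h, d, d_k = Wq.shape
    gates = softmax(gate_logits)                      # mixture weights over heads
    out = np.zeros((X.shape[0], d_k))
    for i in range(h):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]
        A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # attention weights per query
        out += gates[i] * (A @ V)                     # gate-weighted head output
    return out, gates
```

A gate of zero recovers head pruning; training the gates lets the model redistribute capacity across heads instead of simply removing them.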