A Mixture of h-1 Heads is Better than h Heads
ACL 2020.
Abstract:
Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead "reallocate" them -- the model learns to activate different heads on different inputs...
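The core idea in the abstract, learning to activate different heads depending on the input, can be illustrated with a small gating sketch. This is a minimal, hypothetical example and not the paper's actual model or training procedure; the module name GatedMultiHeadAttention, the mean-pooled gate network, and all dimensions are assumptions made for illustration.

# Minimal sketch (assumption, not the authors' implementation): input-dependent
# gating over attention heads, so different inputs emphasize different heads.
import torch
import torch.nn as nn


class GatedMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Gate network (illustrative): scores each head from a pooled input summary.
        self.gate = nn.Linear(d_model, num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq_len, d_head).
        q, k, v = (
            z.view(b, t, self.num_heads, self.d_head).transpose(1, 2)
            for z in (q, k, v)
        )
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v  # (batch, heads, seq_len, d_head)

        # Input-dependent head weights: mean-pool the sequence, score each head,
        # and softmax so the model reweights heads differently per input.
        gate = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)  # (batch, heads)
        heads = heads * gate[:, :, None, None]

        heads = heads.transpose(1, 2).contiguous().view(b, t, d)
        return self.out(heads)


if __name__ == "__main__":
    layer = GatedMultiHeadAttention(d_model=64, num_heads=8)
    x = torch.randn(2, 10, 64)
    print(layer(x).shape)  # torch.Size([2, 10, 64])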