BaSFormer: A Balanced Sparsity Regularized Attention Network for Transformer

IEEE/ACM Transactions on Audio, Speech, and Language Processing (2024)

Abstract
Attention networks often make decisions relying on only a few tokens, even when those tokens are not truly indicative of the underlying meaning or intention of the full context. This can lead to over-fitting in transformers and hinder their ability to generalize. Attention regularization and sparsity-based methods have been used to overcome this issue. However, these methods cannot guarantee that all tokens have sufficient receptive fields for global information inference, so the impact of individual biases cannot be effectively reduced. As a result, these approaches generalize only slightly better from the training data to new data. To address these limitations, we propose a balanced sparsity (BaS) regularized attention network built on top of the transformer, called BaSFormer. BaS regularization introduces a K-regular graph constraint on self-attention connections and replaces SoftMax with SparseMax in the attention transformation. In BaS-regularized self-attention, SparseMax assigns zero attention scores to low-scoring connections, highlighting influential and meaningful contexts. The K-regular graph constraint ensures that all tokens have an equal-sized receptive field for aggregating information, which allows global tokens to participate in the feature update of each layer and reduces the impact of individual biases. Because no continuous loss can be used for the K-regular graph regularization, we propose an exponential extremum loss with an augmented Lagrangian function. Experimental results show that BaSFormer improves debiasing effectiveness compared with the newest LLMs, such as GPT-3.5, GPT-4 and LLaMA. In addition, BaSFormer achieves new state-of-the-art (SOTA) results in text generation tasks. Interestingly, this work also shows that BaSFormer learns hierarchical linguistic dependencies in gradient attributions, which improves interpretability and adversarial robustness.
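The central mechanism described above is swapping SoftMax for SparseMax so that low-scoring attention connections receive exactly zero weight. As a rough illustration only, the minimal PyTorch sketch below implements the standard SparseMax projection (Martins & Astudillo, 2016) inside plain scaled dot-product attention; the function names (`sparsemax`, `bas_attention`) and the wiring are assumptions for illustration, and the sketch does not include the paper's K-regular graph constraint or its exponential extremum loss.

```python
import torch

def sparsemax(scores, dim=-1):
    """SparseMax (Martins & Astudillo, 2016): Euclidean projection of the score
    vector onto the probability simplex. Low-scoring entries get exactly zero,
    unlike SoftMax, which always assigns nonzero probability everywhere."""
    z, _ = torch.sort(scores, dim=dim, descending=True)
    cumsum = z.cumsum(dim=dim)
    k = torch.arange(1, scores.size(dim) + 1,
                     device=scores.device, dtype=scores.dtype)
    shape = [1] * scores.dim()
    shape[dim] = -1
    k = k.view(shape)                                   # broadcast 1..n along `dim`
    support = (1 + k * z) > cumsum                      # entries kept in the support
    k_z = support.sum(dim=dim, keepdim=True).clamp(min=1)
    tau = (cumsum.gather(dim, k_z - 1) - 1) / k_z.to(scores.dtype)
    return torch.clamp(scores - tau, min=0)             # sparse probability vector

def bas_attention(q, k, v):
    """Scaled dot-product attention with SparseMax in place of SoftMax
    (illustrative sketch; not the authors' released implementation)."""
    scale = q.size(-1) ** -0.5
    attn = sparsemax(q @ k.transpose(-2, -1) * scale)   # zero weight on weak connections
    return attn @ v
```

In this sketch the attention matrix is exactly sparse per row, which is the property the BaS regularizer builds on; the paper additionally constrains every token's attention support to the same size K.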
Keywords
Over-fitting, Transformers, Attention regularization, Receptive field, Balanced sparsity, Generalization