Extreme Miscalibration and the Illusion of Adversarial Robustness
CoRR (2024)
Abstract
Deep learning-based Natural Language Processing (NLP) models are vulnerable
to adversarial attacks, where small perturbations can cause a model to
misclassify. Adversarial Training (AT) is often used to increase model
robustness. However, we have discovered an intriguing phenomenon: deliberately
or accidentally miscalibrating models masks gradients in a way that interferes
with adversarial attack search methods, giving rise to an apparent increase in
robustness. We show that this observed gain in robustness is an illusion of
robustness (IOR), and demonstrate how an adversary can perform various forms of
test-time temperature calibration to nullify the aforementioned interference
and allow the adversarial attack to find adversarial examples. Hence, we urge
the NLP community to incorporate test-time temperature scaling into their
robustness evaluations to ensure that any observed gains are genuine. Finally,
we show how the temperature can be scaled during training to improve
genuine robustness.
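A minimal sketch (not the authors' code) of the test-time temperature scaling the abstract describes: an adversary wraps the victim classifier and divides its logits by a temperature before attacking it, which softens saturated softmax outputs and restores the gradient signal that miscalibration masks. The `TemperatureScaled` wrapper, the model, and the temperature value of 10 are illustrative assumptions.

```python
import torch.nn as nn


class TemperatureScaled(nn.Module):
    """Wraps a classifier and divides its logits by a fixed temperature T.

    An extremely miscalibrated (overconfident) model produces saturated
    softmax outputs with near-zero gradients, which interferes with
    gradient-based adversarial attack search. Choosing T > 1 softens the
    output distribution and restores a usable gradient signal.
    """

    def __init__(self, model: nn.Module, temperature: float = 10.0):
        super().__init__()
        self.model = model
        self.temperature = temperature

    def forward(self, *args, **kwargs):
        logits = self.model(*args, **kwargs)
        return logits / self.temperature


# Hypothetical usage: run the attack against the rescaled model instead of
# the raw (apparently robust) one.
# scaled_model = TemperatureScaled(victim_model, temperature=10.0)
# adv_examples = run_attack(scaled_model, inputs, labels)
```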