Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images
CoRR(2024)
摘要
Large vision-language models (VLMs) such as GPT-4 have achieved exceptional
performance across various multi-modal tasks. However, the deployment of VLMs
necessitates substantial energy consumption and computational resources. Once
attackers maliciously induce high energy consumption and latency time
(energy-latency cost) during inference of VLMs, it will exhaust computational
resources. In this paper, we explore this attack surface about availability of
VLMs and aim to induce high energy-latency cost during inference of VLMs. We
find that high energy-latency cost during inference of VLMs can be manipulated
by maximizing the length of generated sequences. To this end, we propose
verbose images, with the goal of crafting an imperceptible perturbation to
induce VLMs to generate long sentences during inference. Concretely, we design
three loss objectives. First, a loss is proposed to delay the occurrence of
end-of-sequence (EOS) token, where EOS token is a signal for VLMs to stop
generating further tokens. Moreover, an uncertainty loss and a token diversity
loss are proposed to increase the uncertainty over each generated token and the
diversity among all tokens of the whole generated sequence, respectively, which
can break output dependency at token-level and sequence-level. Furthermore, a
temporal weight adjustment algorithm is proposed, which can effectively balance
these losses. Extensive experiments demonstrate that our verbose images can
increase the length of generated sequences by 7.87 times and 8.56 times
compared to original images on MS-COCO and ImageNet datasets, which presents
potential challenges for various applications. Our code is available at
https://github.com/KuofengGao/Verbose_Images.
更多查看译文
关键词
energy-latency manipulation,large vision-language model,verbose images
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要