Attention Prompt Tuning: Parameter-efficient Adaptation of Pre-trained Models for Spatiotemporal Modeling.
2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG)
Abstract
In this paper, we introduce Attention Prompt Tuning (APT) - a computationally efficient variant of prompt tuning for video-based applications such as action recognition. Prompt tuning approaches involve injecting a set of learnable prompts along with data tokens during fine-tuning while keeping the backbone frozen. This approach greatly reduces the number of learnable parameters compared to full tuning. For image-based downstream tasks, normally a couple of learnable prompts achieve results close to those of full tuning. However, videos, which contain more complex spatiotemporal information, require hundreds of tunable prompts to achieve reasonably good results. This reduces the parameter efficiency observed in images and significantly increases latency and the number of floating-point operations (FLOPs) during inference. To tackle these issues, we directly inject the prompts into the keys and values of the non-local attention mechanism within the transformer block. Additionally, we introduce a novel prompt reparameterization technique to make APT more robust against hyperparameter selection. The proposed APT approach greatly reduces the number of FLOPs and latency while achieving a significant performance boost over the existing parameter-efficient tuning methods on UCF101, HMDB51, and SSv2 datasets for action recognition. The code and pre-trained models are available at https://github.com/wgcban/apt
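To make the key/value injection idea concrete, the following is a minimal PyTorch sketch of prompts concatenated to the keys and values inside a frozen attention block, as the abstract describes. The class and parameter names (AttentionWithKVPrompts, k_prompt, v_prompt, num_prompts) are illustrative assumptions, not the authors' released implementation; the actual APT code at the linked repository may differ.

# Hypothetical sketch: learnable prompts are prepended only to the keys and
# values of a (frozen) multi-head attention block, so data-token queries can
# attend to them without lengthening the output sequence.
import torch
import torch.nn as nn

class AttentionWithKVPrompts(nn.Module):
    def __init__(self, dim, num_heads, num_prompts):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # pre-trained, kept frozen
        self.proj = nn.Linear(dim, dim)      # pre-trained, kept frozen
        # Only these prompt parameters are tuned.
        self.k_prompt = nn.Parameter(torch.zeros(num_prompts, dim))
        self.v_prompt = nn.Parameter(torch.zeros(num_prompts, dim))

    def forward(self, x):                    # x: (B, N, dim)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Prepend prompts to keys and values; queries stay untouched, so the
        # number of output tokens (and downstream compute) does not grow.
        k = torch.cat([self.k_prompt.expand(B, -1, -1), k], dim=1)
        v = torch.cat([self.v_prompt.expand(B, -1, -1), v], dim=1)

        def split(t):  # (B, L, C) -> (B, heads, L, head_dim)
            return t.view(B, t.shape[1], self.num_heads, self.head_dim).transpose(1, 2)

        attn = (split(q) @ split(k).transpose(-2, -1)) * self.head_dim ** -0.5
        out = (attn.softmax(dim=-1) @ split(v)).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

In such a setup only k_prompt and v_prompt would receive gradients while the pre-trained qkv and proj weights stay frozen; because the queries are not extended, the extra cost is confined to the attention score computation rather than every subsequent layer, which is consistent with the FLOP and latency savings the abstract claims.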
Key words
Action Recognition, Computational Efficiency, Attention Mechanism, Floating-point Operations, Tuning Method, Transformer Block, Action Recognition Datasets, Image Processing, Learning Rate, Convolutional Neural Network, Language Processing, Appended, Tuning Parameter, Improvement In Accuracy, Recurrent Neural Network, Data Augmentation, Weight Decay, Multilayer Perceptron, Downstream Applications, Linear Probe, Input Tokens, Top-1 Accuracy, Multilayer Perceptron Layer, Top-5 Accuracy, Vision Transformer, Pre-trained Weights, Transformer Layers, Embedding Dimension, Network-based Methods, Image Classification