FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition
CoRR(2024)
摘要
In this paper, we introduce FROSTER, an effective framework for
open-vocabulary action recognition. The CLIP model has achieved remarkable
success in a range of image-based tasks, benefiting from its strong
generalization capability stemming from pretaining on massive image-text pairs.
However, applying CLIP directly to the open-vocabulary action recognition task
is challenging due to the absence of temporal information in CLIP's
pretraining. Further, fine-tuning CLIP on action recognition datasets may lead
to overfitting and hinder its generalizability, resulting in unsatisfactory
results when dealing with unseen actions.
To address these issues, FROSTER employs a residual feature distillation
approach to ensure that CLIP retains its generalization capability while
effectively adapting to the action recognition task. Specifically, the residual
feature distillation treats the frozen CLIP model as a teacher to maintain the
generalizability exhibited by the original CLIP and supervises the feature
learning for the extraction of video-specific features to bridge the gap
between images and videos. Meanwhile, it uses a residual sub-network for
feature distillation to reach a balance between the two distinct objectives of
learning generalizable and video-specific features.
We extensively evaluate FROSTER on open-vocabulary action recognition
benchmarks under both base-to-novel and cross-dataset settings. FROSTER
consistently achieves state-of-the-art performance on all datasets across the
board. Project page: https://visual-ai.github.io/froster.
更多查看译文
关键词
Open-Vocabulary Action Recognition,Action Recognition
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要