
Distilling Vision-Language Models on Millions of Videos

CVPR 2024

Abstract
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video-language model is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
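The dual-encoder model mentioned above is trained contrastively on (video, auto-generated caption) pairs. The abstract does not give training details, but the standard objective for such dual encoders is a symmetric InfoNCE loss over a batch of paired embeddings; the sketch below illustrates that generic objective in NumPy (the function name, temperature value, and embedding dimensions are illustrative assumptions, not taken from the paper).

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    A generic sketch of the contrastive objective commonly used for
    dual-encoder training; hyperparameters here are illustrative only.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature        # (B, B) similarity matrix
    labels = np.arange(len(logits))       # true pairs lie on the diagonal

    def xent(l):
        # Row-wise cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
loss = info_nce_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```

Each video embedding is pushed toward its own caption and away from the other captions in the batch, and symmetrically for captions; this is what lets the auto-generated captions act as textual supervision for retrieval.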