InternVL: Scaling Up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Abstract
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks, including visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval; it can also be linked with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
Keywords
multi-modal, vision foundation model, vision-language model
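The zero-shot image classification mentioned in the abstract follows the standard contrastive (CLIP-style) recipe: embed the image and a set of natural-language class prompts with paired encoders, then pick the prompt with the highest similarity. Below is a minimal sketch of that recipe using the public openai/clip-vit-base-patch32 checkpoint as a stand-in; the model name and API here are for the stand-in, not InternVL itself (consult the linked repo for the actual InternVL checkpoints and usage).

```python
# Minimal zero-shot image classification sketch (CLIP-style contrastive
# scoring). The checkpoint used here is a small public stand-in, not
# InternVL; the recipe (image vs. text-prompt similarity) is the same.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate class names phrased as natural-language prompts.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by the learned
# temperature; softmax turns them into a distribution over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

InternVL's contribution, per the abstract, is scaling this kind of recipe up: a 6-billion-parameter vision encoder progressively aligned with an LLM on web-scale image-text data.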