WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

arXiv (2021)

Abstract
Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, assuming that a strong semantic correlation exists between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project `WenLan' led by our team. Specifically, under the weak-correlation assumption over image-text pairs, we propose a two-tower pre-training model within the cross-modal contrastive learning (CMCL) framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method, MoCo, into the cross-modal scenario. By maintaining a large queue-based dictionary, our CMCL can incorporate far more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our CMCL model. Extensive experiments demonstrate that the pre-trained CMCL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
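The queue-based dictionary described above follows the MoCo recipe: each image embedding is contrasted against its paired text embedding (the positive) and a large ring buffer of text embeddings from previous batches (the negatives), so the number of negatives is decoupled from the GPU batch size. The following is a minimal NumPy sketch of this idea; all function names, shapes, and the ring-buffer update are illustrative assumptions, not WenLan's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cmcl_loss(img_emb, txt_emb, queue, temperature=0.07):
    """MoCo-style cross-modal InfoNCE (hypothetical sketch, not WenLan's code).

    img_emb: (B, D) outputs of the image tower (queries)
    txt_emb: (B, D) outputs of the text tower (positive keys)
    queue:   (K, D) text embeddings from earlier batches (negative keys)
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    neg = l2_normalize(queue)
    l_pos = np.sum(img * txt, axis=1, keepdims=True)   # (B, 1) paired similarity
    l_neg = img @ neg.T                                # (B, K) queue similarities
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    # The positive is always at index 0, so this is cross-entropy against label 0.
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()

def update_queue(queue, new_keys, ptr):
    """Enqueue the newest text embeddings into a fixed-size ring buffer,
    overwriting the oldest entries; returns the advanced write pointer."""
    k, cap = new_keys.shape[0], queue.shape[0]
    idx = (ptr + np.arange(k)) % cap
    queue[idx] = new_keys
    return (ptr + k) % cap
```

The symmetric direction (text as query, image embeddings in the queue) would be computed the same way; keeping the queue on one tower's outputs is what lets the effective number of negatives grow far beyond the per-step batch.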
Keywords
bridging vision, language, large-scale, multi-modal, pre-training