Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, Yejin Choi

arXiv (Cornell University) (2023)

Abstract

In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format enables not only few-shot learning via interleaving independent supervised (image, text) examples, but also more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images and text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4 (mmc4), an augmentation of the popular text-only c4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. mmc4 spans everyday topics like cooking, travel, and technology. A manual inspection of a random sample of documents shows that a vast majority (90%) of images are topically relevant, and that linear assignment frequently selects individual sentences specifically well-aligned with each image (78%). After filtering NSFW images, ads, etc., the corpus contains 103M documents containing 585M images interleaved with 43B English tokens.
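
The abstract states only that images are placed into text by linear assignment over CLIP features. As a rough illustration of that idea, and not the authors' implementation, the following sketch assumes precomputed image and sentence embeddings, uses cosine similarity as the matching score, and assigns each image to a distinct sentence. The function name `assign_images_to_sentences` and the toy vectors are hypothetical; real CLIP embeddings would replace them.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_images_to_sentences(image_emb, sentence_emb):
    """Match each image to a distinct sentence, maximizing total cosine similarity.

    image_emb: (n_images, d) array; sentence_emb: (n_sentences, d) array.
    Both are assumed to be precomputed embeddings (e.g. from CLIP).
    Returns a dict {image_index: sentence_index}.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    sentence_emb = np.asarray(sentence_emb, dtype=float)
    # L2-normalize rows so that the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sentence_emb = sentence_emb / np.linalg.norm(sentence_emb, axis=1, keepdims=True)
    similarity = image_emb @ sentence_emb.T
    # Solve the (possibly rectangular) assignment problem on the similarity matrix.
    rows, cols = linear_sum_assignment(similarity, maximize=True)
    return {int(i): int(j) for i, j in zip(rows, cols)}


# Toy 2-D vectors standing in for real embeddings (an assumption for illustration).
images = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
assignment = assign_images_to_sentences(images, sentences)
print(assignment)
```

In this sketch, if a document has more images than sentences, some images are left unassigned; how such cases are handled in mmc4 is not described in the abstract.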

Keywords

multimodal c4, corpus, text, images, billion-scale
