Quality and Relevance Metrics for Selection of Multimodal Pretraining Data.

CVPR Workshops (2020)

Abstract
Self-supervised pretraining has become a strong force in both language and vision tasks. Current efforts to improve pretraining focus on improving network architectures or defining new tasks for extracting representations from the data. We focus on a third axis, the data itself, to quantify and measure how the source and quality of data affect the learned representations. As pretraining datasets grow ever larger, the cost of pretraining will continue to increase. This issue is especially acute for visuolinguistic data, where the cost of storing and processing images and video rises quickly. We therefore examine four visuolinguistic datasets (three preexisting datasets and one collected by us) for their utility as pretraining datasets. We define metrics for dataset quality and relevance, propose a method for subsampling large corpora to select the data most relevant to a set of downstream multimodal vision-and-language tasks of interest, and show that this method improves performance across the board on all downstream tasks.
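The abstract does not spell out how the relevance metric is computed, so the following is only a minimal sketch of the general idea of relevance-based subsampling: score each pretraining caption by its textual similarity to examples from the downstream tasks and keep the top-scoring fraction of the corpus. The function name `subsample_by_relevance`, the `keep_fraction` parameter, and the use of TF-IDF cosine similarity are illustrative assumptions, not the paper's actual metric.

```python
# Hypothetical sketch of relevance-based corpus subsampling (not the paper's exact method).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def subsample_by_relevance(pretrain_captions, downstream_texts, keep_fraction=0.25):
    """Return indices of the pretraining examples most relevant to the downstream tasks."""
    # Fit a shared TF-IDF vocabulary over both the pretraining captions and the task text.
    vectorizer = TfidfVectorizer().fit(pretrain_captions + downstream_texts)
    pretrain_vecs = vectorizer.transform(pretrain_captions)
    task_vecs = vectorizer.transform(downstream_texts)
    # Relevance of a caption = its maximum similarity to any downstream-task example.
    relevance = cosine_similarity(pretrain_vecs, task_vecs).max(axis=1)
    # Keep only the highest-scoring fraction of the corpus.
    k = max(1, int(keep_fraction * len(pretrain_captions)))
    return np.argsort(relevance)[::-1][:k]

# Toy usage: captions about sports and cycling rank above unrelated finance text.
corpus = ["a dog catches a frisbee", "stock prices rose today", "a man rides a bike"]
tasks = ["what sport is the dog playing?", "what is the person riding?"]
print(subsample_by_relevance(corpus, tasks, keep_fraction=0.5))
```

In practice, the scoring function could just as well use pretrained multimodal embeddings over both the images and the text; the point of the sketch is only the select-by-relevance-then-pretrain workflow.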
Keywords
video data, dataset quality, downstream multimodal vision, language tasks, downstream tasks, multimodal pretraining data, self-supervised pretraining, vision tasks, network architecture, learned representations, image processing, visuolinguistic datasets