MIMIC: Masked Image Modeling with Image Correspondences
arXiv (2023)
Abstract
Dense pixel-specific representation learning at scale has been bottlenecked
by the unavailability of large-scale multi-view datasets. Current methods
for building effective pretraining datasets heavily rely on annotated 3D
meshes, point clouds, and camera parameters from simulated environments,
preventing them from building datasets from real-world data sources where such
metadata is lacking. We propose a pretraining dataset-curation approach that
does not require any additional annotations. Our method allows us to generate
multi-view datasets from both real-world videos and simulated environments at
scale. Specifically, we experiment with two scales: MIMIC-1M with 1.3M and
MIMIC-3M with 3.1M multi-view image pairs. We train multiple models with
different masked image modeling objectives to showcase the following findings:
Representations trained on our automatically generated MIMIC-3M outperform
those learned from expensive crowdsourced datasets (ImageNet-1K) and those
learned from synthetic environments (MULTIVIEW-HABITAT) on two dense geometric
tasks: depth estimation on NYUv2 (1.7
Taskonomy (2.05
outperform MULTIVIEW-HABITAT, on semantic segmentation on ADE20K (3.89
estimation on MSCOCO (9.4
object-centric expensive ImageNet-1K. We outperform even when the
representations are frozen, and when downstream training data is limited to
few-shot. The larger dataset (MIMIC-3M) significantly improves performance, which
is promising since our curation method can arbitrarily scale to produce even
larger datasets. MIMIC code, dataset, and pretrained models are open-sourced at
https://github.com/RAIVNLab/MIMIC.