Data-efficient Event Camera Pre-training via Disentangled Masked Modeling
CoRR (2024)
Abstract
In this paper, we present a new data-efficient voxel-based self-supervised
learning method for event cameras. Our pre-training overcomes the limitations
of previous methods, which either sacrifice temporal information by converting
event sequences into 2D images for utilizing pre-trained image models or
directly employ paired image data for knowledge distillation to enhance the
learning of event streams. In order to make our pre-training data-efficient, we
first design a semantic-uniform masking method to address the learning
imbalance caused by the varying reconstruction difficulties of different
regions in non-uniform data when using random masking. In addition, we ease the
traditional hybrid masked modeling process by explicitly decomposing it into
two branches, namely local spatio-temporal reconstruction and global semantic
reconstruction, which encourage the encoder to capture local correlations and
global semantics, respectively. This decomposition allows our self-supervised
learning method to converge faster with minimal pre-training data. Compared to
previous approaches, our self-supervised learning method does not rely on
paired RGB images, yet enables simultaneous exploration of spatial and temporal
cues at multiple scales. It exhibits excellent generalization performance and
demonstrates significant improvements across various tasks with fewer
parameters and lower computational costs.
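The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is a hedged interpretation, not the paper's actual algorithm: `stratified_mask` reads "semantic-uniform masking" as masking the same fraction of patches within each event-density bin (so easy sparse regions and hard dense regions are masked at equal rates), and `disentangled_loss` reads the two-branch decomposition as a masked local reconstruction term plus a global semantic-matching term. All function names, shapes, and the binning scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stratified_mask(patch_scores, mask_ratio=0.6, n_bins=4):
    """Mask the same fraction of patches inside each difficulty bin.

    patch_scores: per-patch event counts (1-D); denser patches are
    typically harder to reconstruct. Returns a boolean mask
    (True = patch is masked out).  Hypothetical sketch, not the
    paper's exact masking rule.
    """
    order = np.argsort(patch_scores)               # sort patches by density
    mask = np.zeros(len(patch_scores), dtype=bool)
    for bin_ids in np.array_split(order, n_bins):  # equal-size density bins
        k = int(round(len(bin_ids) * mask_ratio))
        mask[rng.choice(bin_ids, size=k, replace=False)] = True
    return mask

def disentangled_loss(pred_local, target_local, pred_global, target_global, mask):
    """Sum of a local masked-reconstruction term and a global semantic term."""
    # Local branch: MSE on the masked voxel patches only.
    local = np.mean((pred_local[mask] - target_local[mask]) ** 2)
    # Global branch: cosine distance to a pooled semantic target.
    cos = pred_global @ target_global / (
        np.linalg.norm(pred_global) * np.linalg.norm(target_global) + 1e-8)
    return local + (1.0 - cos)

# Toy data: 64 patches, 8-dim patch features, a 16-dim global feature.
scores = rng.poisson(lam=rng.uniform(0.5, 20.0, size=64))
mask = stratified_mask(scores, mask_ratio=0.5)
pred_local, target_local = rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
pred_global, target_global = rng.normal(size=16), rng.normal(size=16)
loss = disentangled_loss(pred_local, target_local, pred_global, target_global, mask)
```

The stratified mask keeps the overall mask ratio unchanged while equalizing it across density bins, which is one way to counter the learning imbalance the abstract attributes to random masking on non-uniform event data.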