TempBEV: Improving Learned BEV Encoders with Combined Image and BEV Space Temporal Aggregation
arXiv (2024)
Abstract
Autonomous driving requires an accurate representation of the environment. A
strategy toward high accuracy is to fuse data from several sensors. Learned
Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual
sensors into one joint latent space. For cost-efficient camera-only systems,
this provides an effective mechanism to fuse data from multiple cameras with
different views. Accuracy can further be improved by aggregating sensor
information over time. This is especially important in monocular camera systems
to account for the lack of explicit depth and velocity measurements. The
effectiveness of a BEV encoder therefore depends crucially on the operators
used to aggregate temporal information and on the chosen latent representation
spaces. We analyze BEV encoders proposed in the literature and compare their
effectiveness, quantifying the effects of aggregation operators and latent
representations. While most existing approaches aggregate temporal information
either in image or in BEV latent space, our analyses and performance
comparisons suggest that these latent representations exhibit complementary
strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which
integrates aggregated temporal information from both latent spaces. We treat
consecutive image frames as stereo through time and leverage methods from
optical flow estimation for temporal stereo encoding. Empirical evaluation on
the NuScenes dataset shows a significant improvement by TempBEV over the
baseline for 3D object detection and BEV segmentation. An ablation study
reveals a strong synergy of joint temporal aggregation in the image and BEV
latent spaces.
These results indicate the overall effectiveness of our approach and make a
strong case for aggregating temporal information in both image and BEV latent
spaces.
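
As a rough illustration of the core idea, the sketch below combines temporally
aggregated features from the image latent space and the BEV latent space before
a final fusion step. All module names, channel sizes, the view transform, and
the fusion scheme are illustrative assumptions made for this sketch; the
paper's actual TempBEV architecture may differ.

# A minimal sketch of joint temporal aggregation in image and BEV latent
# spaces. Module names, channel sizes, and the fusion scheme are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointTemporalAggregation(nn.Module):
    """Fuse temporally aggregated image-space and BEV-space features."""

    def __init__(self, img_ch: int = 64, bev_ch: int = 64, t: int = 2):
        super().__init__()
        # Image-space branch: stack consecutive frame features
        # channel-wise ("stereo through time") and encode them, loosely
        # analogous to the paired-frame inputs of optical flow networks.
        self.img_temporal = nn.Conv2d(img_ch * t, img_ch, 3, padding=1)
        # BEV-space branch: aggregate ego-motion-aligned BEV features
        # over the same time window.
        self.bev_temporal = nn.Conv2d(bev_ch * t, bev_ch, 3, padding=1)
        # Fuse both temporally aggregated representations in BEV space.
        self.fuse = nn.Conv2d(img_ch + bev_ch, bev_ch, 1)

    def forward(self, img_feats, bev_feats, view_transform):
        # img_feats: (B, T, C, H, W) perspective-view features.
        # bev_feats: (B, T, C, Hb, Wb) BEV features, assumed ego-aligned.
        # view_transform: hypothetical image-to-BEV lifting function.
        b, t, c, h, w = img_feats.shape
        img_agg = self.img_temporal(img_feats.reshape(b, t * c, h, w))
        img_bev = view_transform(img_agg)
        b, t, c, hb, wb = bev_feats.shape
        bev_agg = self.bev_temporal(bev_feats.reshape(b, t * c, hb, wb))
        return self.fuse(torch.cat([img_bev, bev_agg], dim=1))


if __name__ == "__main__":
    model = JointTemporalAggregation()
    img = torch.randn(1, 2, 64, 32, 88)
    bev = torch.randn(1, 2, 64, 50, 50)
    # Toy stand-in for a learned image-to-BEV view transform.
    lift = lambda x: F.adaptive_avg_pool2d(x, (50, 50))
    print(model(img, bev, lift).shape)  # torch.Size([1, 64, 50, 50])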