VFMM3D: Releasing the Potential of Image by Vision Foundation Model for Monocular 3D Object Detection
arXiv (2024)
Abstract
Due to its cost-effectiveness and widespread availability, monocular 3D
object detection, which relies solely on a single camera during inference,
holds significant importance across various applications, including autonomous
driving and robotics. Nevertheless, directly predicting the coordinates of
objects in 3D space from monocular images poses challenges. Therefore, an
effective solution involves transforming monocular images into LiDAR-like
representations and employing a LiDAR-based 3D object detector to predict the
3D coordinates of objects. The key step in this method is accurately converting
the monocular image into a reliable point cloud form. In this paper, we present
VFMM3D, an innovative approach that leverages the capabilities of Vision
Foundation Models (VFMs) to accurately transform single-view images into LiDAR
point cloud representations. VFMM3D utilizes the Segment Anything Model (SAM)
and Depth Anything Model (DAM) to generate high-quality pseudo-LiDAR data
enriched with foreground information. Specifically, the Depth Anything
Model (DAM) is employed to generate dense depth maps. Subsequently, the Segment
Anything Model (SAM) is utilized to differentiate foreground and background
regions by predicting instance masks. These predicted instance masks and depth
maps are then combined and projected into 3D space to generate pseudo-LiDAR
points. Finally, any point cloud-based 3D object detector can be used to
predict the 3D coordinates of objects. Comprehensive experiments are conducted
on the challenging KITTI 3D object detection dataset. Our VFMM3D establishes
new state-of-the-art performance. Additionally, experimental results
demonstrate the generality of VFMM3D, showcasing its seamless integration into
various LiDAR-based 3D object detectors.
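
To make the key conversion step concrete, the sketch below back-projects a dense
depth map, together with a foreground mask, into pseudo-LiDAR points using a
standard pinhole camera model. This is an illustrative sketch under assumed
names and parameters (the function `depth_to_pseudo_lidar`, the intrinsics
`fx, fy, cx, cy`, and the foreground-flag channel are not from the paper); it is
not the authors' implementation, which would additionally transform the points
from camera to LiDAR coordinates using calibration extrinsics.

```python
# Minimal sketch (not the authors' code): back-project a dense depth map and a
# foreground mask into pseudo-LiDAR points with a pinhole camera model.
import numpy as np

def depth_to_pseudo_lidar(depth, mask, fx, fy, cx, cy):
    """Convert an (H, W) depth map into an (N, 4) pseudo-LiDAR point array.

    depth : (H, W) metric depth, e.g. predicted by a monocular depth model.
    mask  : (H, W) boolean foreground mask, e.g. the union of instance masks.
    fx, fy, cx, cy : pinhole camera intrinsics.
    The 4th channel flags whether a point lies inside a foreground mask, so a
    downstream point-cloud detector can exploit the segmentation prior.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates

    z = depth
    x = (u - cx) * z / fx  # back-project along the camera x-axis
    y = (v - cy) * z / fy  # back-project along the camera y-axis

    points = np.stack([x, y, z, mask.astype(np.float32)], axis=-1)
    points = points.reshape(-1, 4)
    return points[points[:, 2] > 0]  # keep only points with valid positive depth


if __name__ == "__main__":
    # Toy usage with random depth and a dummy mask; real inputs would come from
    # a monocular depth model and an instance segmentation model.
    depth = np.random.uniform(1.0, 50.0, size=(375, 1242)).astype(np.float32)
    mask = np.zeros((375, 1242), dtype=bool)
    mask[100:200, 400:600] = True  # pretend this region is a detected object
    pts = depth_to_pseudo_lidar(depth, mask, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
    print(pts.shape)  # (N, 4): x, y, z, foreground flag
```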