SVT-SDE: Spatiotemporal Vision Transformers-Based Self-Supervised Depth Estimation in Stereoscopic Surgical Videos

IEEE Transactions on Medical Robotics and Bionics (2023)

Abstract
Dense depth estimation plays a crucial role in developing context-aware computer-assisted intervention systems. However, it is a challenging task due to low image quality and the highly dynamic surgical environment. The task is further complicated by the difficulty of acquiring per-pixel ground truth depth data in a surgical setting. Recent works on self-supervised depth estimation use image reconstruction (i.e., warped images) as the supervisory signal, which eliminates the need for ground truth depth annotations but also causes over-smoothed depth predictions. Additionally, most existing depth estimation methods are built upon static laparoscopic images, ignoring rich temporal information. To address these challenges, we propose a novel spatiotemporal vision transformers-based self-supervised depth estimation method, referred to as SVT-SDE. Unlike previous works, SVT-SDE features a novel spatiotemporal vision transformers (SVT) architecture, which can learn complementary visual and temporal information from the input stereoscopic video clips. We further introduce a high-frequency-based supervisory signal, which helps preserve fine-grained details in the estimated depth. Results from experiments conducted on two publicly available datasets demonstrate the superior performance of SVT-SDE over state-of-the-art self-supervised depth estimation methods.
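The self-supervised setup described above can be illustrated with a minimal sketch: a photometric loss compares the target frame against the image reconstructed by warping the other stereo view using the predicted depth, and a high-frequency term compares Laplacian responses to counteract over-smoothing. This is a hypothetical NumPy illustration of the general idea, not the paper's exact formulation (the warping step and loss weights are omitted/assumed):

```python
import numpy as np

def photometric_loss(target, warped):
    """Mean absolute error between the target frame and the image
    reconstructed by warping the source view with predicted depth."""
    return np.abs(target - warped).mean()

def high_frequency_loss(target, warped):
    """Compare Laplacian (high-frequency) responses of the two images,
    encouraging fine-grained detail to survive in the reconstruction.
    Hypothetical form; the paper's exact term may differ."""
    def laplacian(img):
        # 4-neighbour discrete Laplacian with edge padding.
        p = np.pad(img, 1, mode="edge")
        return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * img
    return np.abs(laplacian(target) - laplacian(warped)).mean()

# A perfect reconstruction drives both terms to zero.
target = np.random.rand(8, 8)
total = photometric_loss(target, target) + high_frequency_loss(target, target)
```

In practice both terms would be computed on the warped stereo pair and summed with weighting coefficients; the sketch only shows the shape of the supervisory signal.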
Keywords
Estimation, Image reconstruction, Videos, Surgery, Spatiotemporal phenomena, Feature extraction, Cameras, Depth estimation, surgical videos, spatiotemporal vision transformers, unsupervised