Depth and ego-motion estimation are crucial tasks for laparoscopic navigation and robotic-assisted surgery. Most current self-supervised methods warp one frame onto an adjacent frame using the estimated depth and camera pose, and use the photometric loss between the warped and original frames as the training signal. However, these methods face major challenges from non-Lambertian reflective regions and the textureless surfaces of organs, leading to significant performance degradation and scale ambiguity in monocular depth estimation. In this paper, we introduce a network that predicts depth and ego-motion using spatial-temporal consistency constraints. Spatial consistency is derived from the left and right views of stereo laparoscopic image pairs, while temporal consistency comes from consecutive frames. To better capture semantic information in surgical scenes, we employ the Swin Transformer as the encoder and decoder for depth estimation, owing to its strong semantic segmentation capability. To address illumination variance and scale ambiguity, we incorporate a SIFT loss term to eliminate oversaturated regions in laparoscopic images. Our method is evaluated on the SCARED dataset and achieves strong results.
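For readers unfamiliar with this self-supervised setup, below is a minimal PyTorch sketch of the warping-based photometric supervision described above. The function names (`warp_to_target`, `photometric_loss`), tensor shapes, and the bare L1 term are illustrative assumptions for exposition, not the exact implementation used in this repo.

```python
import torch
import torch.nn.functional as F


def warp_to_target(src_img, tgt_depth, T_tgt_to_src, K, K_inv):
    """Synthesize the target view by sampling the source frame at pixel
    locations obtained from the predicted depth and relative pose.

    src_img:      (B, 3, H, W) adjacent (source) frame
    tgt_depth:    (B, 1, H, W) predicted depth of the target frame
    T_tgt_to_src: (B, 4, 4) predicted relative camera pose
    K, K_inv:     (B, 3, 3) camera intrinsics and their inverse
    """
    B, _, H, W = src_img.shape
    device = src_img.device

    # Homogeneous pixel grid, shape (B, 3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device), torch.arange(W, device=device), indexing="ij"
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1)

    # Back-project to 3D, transform into the source camera, re-project
    cam = tgt_depth.view(B, 1, -1) * (K_inv @ pix)                     # (B, 3, H*W)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    src_cam = (T_tgt_to_src @ cam_h)[:, :3, :]                         # (B, 3, H*W)
    uv = K @ src_cam
    uv = uv[:, :2, :] / (uv[:, 2:3, :] + 1e-7)

    # Normalize sampling grid to [-1, 1] for grid_sample
    u = uv[:, 0].view(B, H, W) / (W - 1) * 2 - 1
    v = uv[:, 1].view(B, H, W) / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1)                                 # (B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)


def photometric_loss(tgt_img, warped_img):
    # Bare L1 photometric difference; full objectives usually also add SSIM
    # and edge-aware depth smoothness terms.
    return (tgt_img - warped_img).abs().mean()
```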
| Method | Year | Abs Rel | Sq Rel | RMSE | RMSE log | δ |
|---|---|---|---|---|---|---|
| Fang et al. | 2020 | 0.078 | 0.794 | 6.794 | 0.109 | 0.946 |
| Endo-SfM | 2021 | 0.062 | 0.606 | 5.726 | 0.093 | 0.957 |
| AF-SfMLearner | 2022 | 0.063 | 0.538 | 5.597 | 0.089 | 0.974 |
| Yang et al. | 2024 | 0.062 | 0.558 | 5.585 | 0.090 | 0.962 |
| Ours | - | 0.057 | 0.436 | 4.972 | 0.081 | 0.972 |
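The table reports the standard monocular depth metrics. As a reference, here is a minimal NumPy sketch of how these measures are typically computed (assuming `gt` and `pred` are arrays of valid depths); this mirrors common practice rather than the exact code in `evaluate_depth.py`.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard depth metrics: Abs Rel, Sq Rel, RMSE, RMSE log, and δ (< 1.25)."""
    thresh = np.maximum(gt / pred, pred / gt)
    delta = (thresh < 1.25).mean()

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, delta

# For scale-ambiguous monocular predictions, median scaling is commonly applied first:
# pred *= np.median(gt) / np.median(pred)
```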
Install required dependencies with pip:
```bash
pip install -r requirements.txt
```
Download the pretrained model from depth_anything_vitb14, create a folder named `pretrained_model` in this repo, and place the downloaded weights in it.
Please follow AF-SfMLearner to prepare the SCARED dataset.
Train the model end to end:
```bash
CUDA_VISIBLE_DEVICES=0 python train_end_to_end.py --data_path <your_data_path> --log_dir './logs'
```
Export ground truth depth and pose before evaluation:
```bash
python export_gt_depth.py --data_path PATH_TO_YOUR_DATA --split endovis
python export_gt_pose.py --data_path PATH_TO_YOUR_DATA --split endovis --sequence YOUR_SEQUENCE
```
To evaluate depth estimation:
```bash
python evaluate_depth.py --data_path PATH_TO_YOUR_DATA --load_weights_folder PATH_TO_YOUR_MODEL --eval_mono
```
To evaluate pose estimation:
```bash
python evaluate_pose.py --data_path PATH_TO_YOUR_DATA --load_weights_folder PATH_TO_YOUR_MODEL --eval_mono
```
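Pose evaluation for monocular methods is commonly reported as absolute trajectory error (ATE) over short snippets, with a single scale factor fitted because monocular predictions are scale-ambiguous. A rough NumPy sketch of that computation is below; the function name `compute_ate` and the snippet-based alignment are assumptions and may differ from what `evaluate_pose.py` actually does.

```python
import numpy as np

def compute_ate(gt_xyz, pred_xyz):
    """ATE between ground-truth and predicted camera positions, each of shape (N, 3),
    after translating to a common origin and fitting one global scale factor."""
    pred = pred_xyz - pred_xyz[0] + gt_xyz[0]            # align starting positions
    scale = np.sum(gt_xyz * pred) / np.sum(pred ** 2)    # least-squares scale fit
    return np.sqrt(np.mean(np.sum((gt_xyz - scale * pred) ** 2, axis=-1)))
```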
To generate depth maps:
```bash
python generate_pred.py
```
To generate a point cloud map:
```bash
python generate_pred_nocolor.py
cd depth2pointcloud
python generate_depthmap.py
python generate_pc_rgb.py
```
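As a reference for what the depth-to-point-cloud step does, here is a minimal NumPy sketch that back-projects a depth map into 3D points using the camera intrinsics and optionally attaches RGB colour. The function and argument names are illustrative, not the actual interface of `generate_pc_rgb.py`.

```python
import numpy as np

def depth_to_pointcloud(depth, K, rgb=None):
    """Back-project a depth map (H, W) into an (N, 3) point cloud with intrinsics K (3, 3).
    If an aligned RGB image (H, W, 3) is given, returns (N, 6) with colours appended."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))     # pixel coordinates
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]        # X = (u - cx) * Z / fx
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]        # Y = (v - cy) * Z / fy
    points = np.stack([x, y, z], axis=-1)
    if rgb is not None:
        points = np.concatenate([points, rgb.reshape(-1, 3)], axis=-1)
    return points
```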
Our code builds on the implementations of AF-SfMLearner and Depth-Anything. We thank the authors for their excellent work.