
SPSVO: a self-supervised surgical perception stereo visual odometer for endoscopy

Published online by Cambridge University Press:  29 September 2023

Junjie Zhao
Affiliation:
College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, China
Yang Luo*
Affiliation:
College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, China
Qimin Li
Affiliation:
College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, China
Natalie Baddour
Affiliation:
Department of Mechanical Engineering, University of Ottawa, Ottawa, ON, Canada
Md Sulayman Hossen
Affiliation:
College of Civil Engineering, Chongqing University, Chongqing, China
Corresponding author: Yang Luo; Email: yluo688@cqu.edu.cn

Abstract

Accurate tracking and reconstruction of surgical scenes are critical enabling technologies for autonomous robotic surgery. In endoscopic examinations, computer vision assists in many tasks, such as diagnosis and scene reconstruction. Estimating camera motion and reconstructing the scene from intra-abdominal images is challenging due to the irregular illumination and weak texture of endoscopic images. Current surgical 3D perception algorithms for camera and object pose estimation rely on geometric information (e.g., points, lines, and surfaces) extracted from optical images. Unfortunately, standard hand-crafted local features for pose estimation usually perform poorly in laparoscopic environments. In this paper, a novel self-supervised Surgical Perception Stereo Visual Odometer (SPSVO) framework is proposed to accurately estimate endoscope pose and better assist surgeons in locating and diagnosing lesions. The proposed SPSVO system combines a self-learning feature extraction method with a self-supervised matching procedure to overcome the adverse effects of irregular illumination in endoscopic images. The SPSVO pipeline comprises image pre-processing, feature extraction, stereo matching, feature tracking, keyframe selection, and pose graph optimization. The SPSVO can simultaneously associate the appearance of extracted feature points with textural information for fast and accurate feature tracking, and a nonlinear pose graph optimization method is adopted in the backend. The effectiveness of the proposed SPSVO framework is demonstrated on a public endoscopic dataset, with the root mean square error of trajectory tracking reaching 0.278 to 0.690 mm and a processing time of 71 ms per frame.
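To illustrate the nonlinear pose-graph optimization used in the backend described above, the following is a minimal, self-contained Gauss-Newton sketch on a one-dimensional toy trajectory. This is not the authors' implementation: the graph, the odometry and loop-closure measurements, and the unit information weights are invented purely for illustration.

```python
import numpy as np

def optimize_pose_graph(n_poses, edges, n_iters=5):
    """Gauss-Newton optimization of a 1-D pose graph.

    edges: list of (i, j, z), each encoding a relative measurement
           x[j] - x[i] = z (odometry or loop closure), unit weight.
    Pose 0 is anchored at zero with a strong prior to fix the gauge.
    """
    x = np.zeros(n_poses)
    for _ in range(n_iters):
        H = np.zeros((n_poses, n_poses))  # approximate Hessian J^T J
        b = np.zeros(n_poses)             # gradient J^T r
        # Gauge prior: residual (x[0] - 0) with a large weight.
        H[0, 0] += 1e6
        b[0] += 1e6 * x[0]
        for i, j, z in edges:
            r = (x[j] - x[i]) - z  # residual; Jacobian is -1 at i, +1 at j
            H[i, i] += 1.0; H[j, j] += 1.0
            H[i, j] -= 1.0; H[j, i] -= 1.0
            b[i] -= r
            b[j] += r
        x += np.linalg.solve(H, -b)  # Gauss-Newton step
    return x

# Three odometry steps of 1.0 each, plus a loop closure saying the
# total displacement from pose 0 to pose 3 is only 2.7: the optimizer
# distributes the 0.3 discrepancy evenly across the chain.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 3, 2.7)]
poses = optimize_pose_graph(4, edges)
print(poses)  # approximately [0.0, 0.925, 1.85, 2.775]
```

In the 1-D case the problem is linear, so a single iteration already reaches the least-squares optimum; for real SE(3) poses the residuals are nonlinear in the state and the same relinearize-and-solve loop is iterated, which is the role the nonlinear backend plays in such a system.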

Type
Research Article
Copyright
© Chongqing University, 2023. Published by Cambridge University Press

