Background: Multiple types of surgical cameras are used in modern surgical practice and provide a rich visual signal that is used by surgeons to visualize the clinical site and make clinical decisions. This signal can also be used by artificial intelligence (AI) methods to provide support in identifying instruments, structures, or activities both in real-time during procedures and postoperatively for analytics and understanding of surgical processes. Summary: In this paper, we provide a succinct perspective on the use of AI and especially computer vision to power solutions for the surgical operating room (OR). The synergy between data availability and technical advances in computational power and AI methodology has led to rapid developments in the field and promising advances. Key Messages: With the increasing availability of surgical video sources and the convergence of technologies around video storage, processing, and understanding, we believe clinical solutions and products leveraging vision are going to become an important component of modern surgical capabilities. However, both technical and clinical challenges remain to be overcome before vision-based approaches can be used efficiently in the clinic.
Surgery has progressively shifted towards the minimally invasive surgery (MIS) paradigm. This means that today most operating rooms (ORs) are equipped with digital cameras that visualize the surgical site. The video generated by surgical cameras is a form of digital measurement and observation of the patient anatomy at the surgical site. It contains information about the appearance, shape, motion, and function of the anatomy and instrumentation within it. Once recorded over the duration of a procedure it also embeds information about the surgical process, actions performed, instruments used, possible hazards or complications, and even risk. While such information can be inferred by expert observers, this is not practical for providing assistance in routine clinical use, and automated techniques are necessary to effectively utilize the data for driving improvements in practice [1, 2].
Currently, the majority of surgical video is either not recorded or stored only temporarily on the stack accompanying the surgical camera before being discarded. Some video is used in case presentations at clinical meetings or society conferences or for educational purposes, and on an individual level surgeons may choose to record their case history. Storage has an associated cost, so it is sensible to retain only relevant and clinically useful information; however, the routine discarding of video is largely due to the lack of tools that can synthesize surgical video into meaningful information, either about the process or about physiological information contained in the video observations. For certain diagnostic procedures, e.g., endoscopic gastroenterology, storing images from the procedure in the patient medical record to document observed lesions is becoming standard practice, but this is largely not done for surgical video.
In addition to surgical cameras, it is nowadays also common for other OR cameras to be present. These can be used to monitor activity throughout the OR and not just at the surgical site. As such, opportunities exist to capture this signal and provide an understanding of the entire room and the activities or events that occur within it. This can potentially be used to optimize team performance or to monitor room-level events and thereby improve the surgical process. To effectively make use of video data from the OR, it is necessary to build algorithms for video analysis and understanding. In this paper, we provide a short review of the state of the art in artificial intelligence (AI), and especially computer vision, for the analysis of surgical data and outline some of the concepts and directions for future development and practical translation into the clinic.
The field of computer vision is a sub-branch of AI focused on building algorithms and methods for understanding information captured in images and video. To make vision problems tractable, computational methods typically focus on sub-components of the human visual system, e.g., object detection or identification (classification), motion extraction, or spatial understanding (Fig. 1). Developing these building blocks in the context of surgery and surgeons’ vision can lead to exciting possibilities for utilizing surgical video.
Computer vision has seen major improvements in the last 2 decades driven by breakthroughs in computing, digital cameras, mathematical modelling, and most recently deep learning techniques. While previous systems required human intervention in the design and modelling of image features that capture different objects in a certain domain, in deep learning the most discriminant features are learned autonomously from extensive amounts of annotated data. The increasing access to high volumes of digitally recorded image-guided surgeries is sparking a significant interest in translating deep learning to intraoperative imaging. Annotated surgical video datasets in a wide variety of domains are being made publicly available for training and validating new algorithms in the form of competition challenges, resulting in rapid progress towards reliable automatic interpretation of surgical data.
Surgical Process Understanding
Surgical procedures can be decomposed into a number of sequential tasks (e.g., dissection, suturing, and anastomosis) typically called procedural phases or steps [6-8]. Recognizing and temporally localizing these tasks allows for surgical process modelling and workflow analysis. This further facilitates the current trend in MIS practice towards establishing standardized protocols for surgical workflow and guidelines for task execution, describing optimal tool positioning with respect to the anatomy, setting performance benchmarks, and ensuring operational safety and complication-free, cost-effective procedures [10, 11]. The ability to objectively quantify surgical performance could impact many aspects of the user and patient experience, such as reduced mental/physical load, increased safety, and more efficient training and planning [12, 13]. Intraoperative video is the main sensory cue for surgical operators and provides a wealth of information about the workflow and quality of the procedure. Applying computer vision in the OR for workflow and skills analysis extends beyond the interventional video: operational characteristics can be extracted from tracking and motion analysis of clinical staff using wall-mounted cameras and embedded sensors and from tracking eye movements and estimating gaze patterns in MIS. As in similar data science problems, learning-based AI has the potential to transform surgical workflow analysis and skills assessment and represents the focus of this section.
Surgical Phase Recognition
Surgical video has been used for segmenting surgical procedures into phases, and the development of AI methods for workflow analysis (or phase recognition), facilitated by publicly available annotated datasets [7, 8], has dramatically accelerated progress in recognizing and temporally localizing surgical tasks in different MIS procedures. The EndoNet architecture introduced a convolutional neural network for workflow analysis in laparoscopic MIS, specifically laparoscopic cholecystectomy, with the ability to recognize the 7 surgical phases of the procedure with over 80% accuracy. More complex AI models (SV-RCNet, Endo3D) have increased the accuracy to almost 90% [17, 18]. One of the main requirements for learning techniques is data and annotations, which are still limited in the surgical context. In robotic-assisted MIS procedures, instrument kinematics can be used in conjunction with the video to add explicit information on instrument motion. The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) is a dataset of synchronized video and robot kinematics data from benchtop simulations of 3 fundamental surgical tasks (suturing, knot tying, and needle passing). The JIGSAWS dataset has extended annotations at a sub-task level. AI techniques learn patterns and temporal interconnections of the sub-task sequences from combinations of robot kinematics and surgical video and detect and temporally localize each sub-task [19-24]. Recently, AI models for activity recognition have been developed and tested on annotated datasets from real cases of robotic-assisted radical prostatectomy and ocular microsurgery [18-20]. Future work in this area should focus on investigating the ability of AI methods for surgical workflow analysis to generalize, with rigorous validation on multicenter annotated datasets of real procedures.
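As an illustrative sketch of the temporal localization problem discussed above (not drawn from EndoNet or the other cited systems), the snippet below applies majority-vote smoothing to a stream of per-frame phase predictions, a common post-processing step that suppresses isolated misclassified frames; the phase labels and window size are hypothetical.

```python
from collections import Counter

def smooth_phases(frame_preds, window=5):
    """Majority-vote smoothing of noisy per-frame phase predictions.

    Isolated misclassified frames are replaced by the locally dominant
    phase, enforcing the temporal coherence expected of surgical phases.
    """
    half = window // 2
    smoothed = []
    for i in range(len(frame_preds)):
        lo, hi = max(0, i - half), min(len(frame_preds), i + half + 1)
        smoothed.append(Counter(frame_preds[lo:hi]).most_common(1)[0][0])
    return smoothed

# A hypothetical prediction stream: phase 0 with one spurious frame of
# phase 3, followed by a transition to phase 1.
preds = [0, 0, 0, 3, 0, 0, 0, 1, 1, 1, 1, 1]
print(smooth_phases(preds))  # → [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

In deployed systems this role is typically played by learned temporal models (e.g., recurrent networks, as in SV-RCNet), but the sketch conveys why temporal context improves over frame-wise classification alone.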
Surgical Technical Skill Assessment
Automated surgical skill assessment attempts to provide an objective estimation of a surgeon’s performance and quality of execution. AI models analyze the surgical video and learn high-level features to discriminate between different performance and experience levels during the execution of surgical tasks. In studies on robotic surgical skills using the JIGSAWS dataset, such systems can estimate manually assigned OSATS-based scores with more than 95% accuracy [26-28]. In interventional ultrasound (US) imaging, AI methods can automatically measure the operator’s skills by evaluating the quality of the captured US images with respect to their medical content [29, 30].
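Classical motion-based metrics such as instrument path length and economy of motion are commonly used inputs or baselines for the learned skill models above. The sketch below computes two such metrics from a toy 3-D instrument-tip trajectory; the trajectory values and the `motion_economy` formulation are illustrative assumptions, not the scoring used in the cited JIGSAWS studies.

```python
import math

def path_length(traj):
    """Total distance travelled by an instrument tip over 3-D positions."""
    return sum(math.dist(a, b) for a, b in zip(traj, traj[1:]))

def motion_economy(traj, ideal_length):
    """Ratio of the shortest possible path to the executed path.

    1.0 corresponds to a perfectly direct movement; lower values
    indicate hesitant or inefficient motion.
    """
    return ideal_length / path_length(traj)

# A hypothetical hesitant trajectory from (0,0,0) to (2,0,0) via a detour.
traj = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
print(path_length(traj), motion_economy(traj, ideal_length=2.0))
```

Learned video-based models go beyond such hand-crafted statistics, but metrics like these remain useful as interpretable baselines and sanity checks.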
Automatically detecting structures of interest in digital images is a well-established field in computer vision. Applied in real time to surgical video, it can provide assistance in visualizing clinical targets and sensitive areas, optimizing the procedure and increasing its safety.
Lesion Detection in Endoscopy
AI vision systems for computer-aided detection (CADe) can provide assistance during diagnostic interventions by automatically highlighting lesions and abnormalities that could otherwise be missed. CADe systems were first introduced in radiology, with existing US Food and Drug Administration (FDA)- and European Economic Area (EEA)-approved systems for mammography and chest computed tomography (CT). In the interventional context there has been particular interest in developing CADe systems for gastrointestinal endoscopy. Colonoscopy has received the most attention to date, and prototype CADe systems for polyp detection report accuracies as high as 97.8% using magnified narrow band imaging. Similar systems have also been developed for endocytoscopy, capsule endoscopy, and conventional white light colonoscopes. Research on CADe systems is also targeting esophageal cancer and early neoplasia in Barrett’s esophagus.
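Performance figures such as those above are typically computed by matching predicted bounding boxes to annotated lesions. As a minimal sketch of this evaluation (the box coordinates, 0.5 overlap threshold, and `lesion_sensitivity` helper are illustrative assumptions, not the protocol of the cited studies):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def lesion_sensitivity(gt_boxes, pred_boxes, thresh=0.5):
    """Fraction of annotated lesions matched by at least one detection."""
    hits = sum(any(iou(g, p) >= thresh for p in pred_boxes)
               for g in gt_boxes)
    return hits / len(gt_boxes)

# Two annotated polyps, one of which the detector found exactly.
print(lesion_sensitivity([(0, 0, 2, 2), (5, 5, 7, 7)], [(0, 0, 2, 2)]))
```

Sensitivity to missed lesions is the clinically critical quantity for CADe, which is why per-lesion matching rather than raw pixel accuracy is the standard reporting style.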
Detection and highlighting of anatomical regions during surgery may provide assisted guidance and avoid accidental damage to critical structures, such as vessels and nerves. While significant research on critical anatomy representation focuses on registration and display of data from preoperative scans (see Image Fusion and Image-Guided Surgery), more recent approaches detect these structures directly from intraoperative images and video. In robotic prostatectomy, the subtle pulsation of vessels can be detected and magnified to an extent that is perceivable by a surgeon. In laparoscopic cholecystectomy, automated video retrieval can help in assessing whether a critical view of safety was achieved, with potential risk reduction and safer removal of the gallbladder. Additionally, the detection and classification of anatomy enables the automatic generation of standardized procedure reports for quality control assessment and clinical training.
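The vessel-pulsation magnification mentioned above rests on a simple idea: amplify small temporal deviations of pixel intensity around their local mean. The toy 1-D sketch below shows only this core idea; real Eulerian-style systems combine it with spatial filtering and a tuned temporal bandpass, and the signal, `alpha`, and `window` values here are illustrative assumptions.

```python
def magnify(signal, alpha=10.0, window=5):
    """Amplify temporal deviations of a pixel intensity from its local
    mean, making subtle periodic changes (e.g., vessel pulsation) visible.
    """
    half = window // 2
    out = []
    for i, v in enumerate(signal):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        mean = sum(signal[lo:hi]) / (hi - lo)
        # Keep the local mean, but scale up the deviation from it.
        out.append(mean + alpha * (v - mean))
    return out

# A barely visible intensity blip becomes an obvious peak.
print(magnify([0.0, 0.0, 1.0, 0.0, 0.0]))
```

A constant signal passes through unchanged, which is the desired behavior: only temporal variation, not the baseline appearance, is exaggerated.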
Surgical Instrument Detection
Automatic detection and localization of surgical instruments, when integrated with anatomy detection, can also contribute to accurate positioning and ensure critical structures are not damaged. In robotic MIS (RMIS), this information has the added benefit of enabling progress toward active guidance, e.g., during needle suturing. For this reason, research on surgical instrument detection has largely targeted the articulated tools used in RMIS. With RMIS there are also interesting possibilities in using robotic systems to generate data for training AI models, bypassing the need for expensive manual labelling. However, instrument detection has also received some attention in nonrobotic procedures including colorectal, pelvic, spine, and retinal surgery. In such cases, vision-based instrument analysis may assist in building systems that report analytics about instrument usage, or instrument motion and activity for surgical technical skill analysis or verification.
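Instrument detection and segmentation methods are usually benchmarked with overlap scores between predicted and annotated masks. A minimal sketch of the standard Dice coefficient on flattened binary masks (the tiny masks below are illustrative, not from any cited benchmark):

```python
def dice(pred, gt):
    """Dice overlap between two flattened binary masks (1 = instrument
    pixel). Returns 1.0 for two empty masks by convention."""
    inter = sum(p & g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2 * inter / total if total else 1.0

# A prediction covering 2 of 3 annotated instrument pixels, no false
# positives outside the annotation.
print(dice([1, 1, 1, 0], [1, 1, 0, 0]))
```

Dice is preferred over plain pixel accuracy here because instruments occupy a small fraction of the frame, so a trivial all-background prediction would otherwise score deceptively well.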
Vision-based methods for localization and mapping of the environment using the surgical camera have advanced rapidly in recent years. This is crucial in both diagnostic and surgical procedures because it may enable more complete diagnosis or fusion of pre- and intraoperative information to enhance clinical decision making.
Enhancing Navigation in Endoscopy
To provide physicians with navigation assistance in MIS, systems must be able to locate the position of the endoscope within the explored organs while simultaneously inferring their shape. Simultaneous localization and mapping in endoscopy is, however, a challenging problem. The ability of deep learning approaches to learn characteristic data features has proven to outperform hand-crafted feature detectors and descriptors in laparoscopy, colonoscopy, and sinus endoscopy. These approaches have also demonstrated promising results for registration and mosaicking in fetoscopy with the aim of augmenting the fetoscope field of view. Nonetheless, endoscopic image registration remains an open problem due to the complex topological and photometric properties of organs, which produce significant appearance variations and complex specular reflections. Deep learning-based simultaneous localization and mapping approaches rely on the ability of neural networks to learn a depth map from a single image, overcoming the need for image registration. It has recently been shown that these approaches are able to infer dense and detailed depth maps in colonoscopy. By fusing consecutive depth maps and simultaneously estimating the endoscope motion using geometric constraints, it has been demonstrated that long-range colon sections can be reconstructed. A similar approach has also been successfully applied to 3-D reconstruction of the sinus anatomy from endoscopic video, proposed as an alternative to CT scans (expensive procedures that use ionizing radiation) for longitudinal monitoring of patients after nasal obstruction surgery. However, critical limitations, such as navigation within deformable environments, still need to be overcome.
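Reconstructing long colon sections from per-frame motion estimates relies on chaining relative camera motions into a global trajectory. The sketch below shows this composition in 2-D for readability (real systems work with full 6-degree-of-freedom poses); the square-loop motion sequence is a hypothetical example.

```python
import math

def compose(pose, motion):
    """Chain a relative camera motion (dtheta, dx, dy, expressed in the
    current camera frame) onto a global 2-D pose (theta, x, y)."""
    th, x, y = pose
    dth, dx, dy = motion
    # Rotate the local translation into the global frame, then accumulate.
    gx = x + dx * math.cos(th) - dy * math.sin(th)
    gy = y + dx * math.sin(th) + dy * math.cos(th)
    return (th + dth, gx, gy)

# Four 90-degree turns with one unit of forward motion each form a
# closed loop: the estimated pose returns to the starting position.
pose = (0.0, 0.0, 0.0)
for motion in [(math.pi / 2, 1.0, 0.0)] * 4:
    pose = compose(pose, motion)
print(pose)
```

Because each relative estimate carries error, chained poses drift over long sequences, which is why the cited approaches additionally enforce geometric constraints when fusing consecutive depth maps.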
Navigation in Robotic Surgery
Surgical robots such as the da Vinci Surgical System generally use stereo endoscopes, which have a significant advantage over monocular endoscopes in their ability to capture 3-D measurements. Estimating a dense depth map from a pair of stereo images generally consists of estimating a dense disparity map defining the apparent pixel motion between the 2 images. Most stereo registration approaches rely on geometric methods. It has, however, been shown that deep learning (DL)-based approaches can be successfully applied to partial nephrectomy, outperforming state-of-the-art stereo reconstruction methods. Surgical tool segmentation and localization contribute to safe tool-tissue interaction and are essential to visually guided manipulation tasks. Recent DL approaches demonstrate significant improvements over hand-crafted tool tracking methods, offering a high degree of flexibility, accuracy, and reliability.
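Once a disparity map is available, converting it to metric depth is a one-line triangulation for a rectified stereo pair. The sketch below uses illustrative numbers (a focal length of 1,000 px and a 5 mm baseline, loosely in the range of stereo endoscopes, are assumptions, not da Vinci specifications):

```python
def disparity_to_depth(disparity_px, focal_px, baseline_mm):
    """Triangulate metric depth for a rectified stereo pair: Z = f * B / d.

    disparity_px: apparent pixel shift of a point between the two views.
    focal_px:     focal length expressed in pixels.
    baseline_mm:  distance between the two camera centres.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_mm / disparity_px

# A 10 px disparity under the assumed calibration maps to 500 mm depth.
print(disparity_to_depth(10.0, 1000.0, 5.0))
```

The inverse relation between disparity and depth explains why stereo endoscopes with their short baselines are most accurate at close range, where disparities are large.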
Image Fusion and Image-Guided Surgery
A key concept in enhancing surgical navigation has been the idea of fusing multiple preoperative and intraoperative imaging modalities in an augmented reality (AR) view of the surgical site. Vision-based AR systems generally involve mapping and localization of the environment, in addition to blocks that align preoperative 3-D data models to that reconstruction and then display the fused information to the surgeon. The majority of surgical AR systems have been founded on geometric vision algorithms, but deep learning methods are emerging, e.g., for US-to-CT registration in spine surgery or for efficient deformable registration in laparoscopic liver surgery. Despite methodological advances, significant open problems persist in surgical AR, such as adding contextual information to the visualization (e.g., identifying anatomical structures and critical surgical areas and detecting surgical phases and complications), ensuring robust localization despite occlusions, and displaying relevant information to different stakeholders in the OR. Work is advancing to address these challenges: an evaluation of state-of-the-art learning-based methods for visual human pose estimation in the OR has recently been reported, alongside a review dedicated to face detection in the OR and methods that estimate both surgical phases and remaining surgery duration, which can be used to adapt the information displayed at different times.
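The alignment block at the heart of such AR pipelines is, in its simplest rigid form, a closed-form least-squares fit between corresponding landmarks on the preoperative model and the intraoperative reconstruction. The sketch below solves the 2-D case with the classical closed form (the 3-D analogue uses an SVD); the landmark coordinates are hypothetical, and real systems must additionally handle deformation and outliers.

```python
import math

def rigid_register_2d(src, dst):
    """Closed-form least-squares 2-D rigid alignment: find theta and t
    such that rotating src by theta and translating by t best matches
    dst, given corresponding landmark points."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n
    S = C = 0.0
    for (ax, ay), (bx, by) in zip(src, dst):
        ax, ay, bx, by = ax - sx, ay - sy, bx - dx, by - dy
        S += ax * bx + ay * by    # "cosine" accumulator
        C += ax * by - ay * bx    # "sine" (cross-product) accumulator
    theta = math.atan2(C, S)
    c, s = math.cos(theta), math.sin(theta)
    t = (dx - (c * sx - s * sy), dy - (s * sx + c * sy))
    return theta, t

# Hypothetical preoperative landmarks and their intraoperative positions
# after a 90-degree rotation and a (2, 3) translation.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]
theta, t = rigid_register_2d(src, dst)
```

Centring both point sets first decouples rotation from translation, which is what makes the closed-form solution possible and is the same decomposition used by 3-D methods.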
In this paper, we have provided a succinct review of the broad possibilities for using computer vision in the surgical OR. With the increasing availability of surgical video sources and the convergence of technologies around video storage, processing, and understanding, we believe clinical solutions and products leveraging vision are going to become an important component of modern surgical capabilities. However, both technical and clinical challenges remain, and we outline them below.
Priorities for technical research and development are:
Availability of datasets with labels and ground truth: despite efforts from challenges, the quality and availability of large-scale surgical datasets remains a bottleneck. Efforts are needed to address this and to catalyze progress similar to that observed in the wider vision and AI communities.
Technical development in unsupervised methods: approaches that do not require labelled sensor data (ground truth) are needed to bypass the need for large-scale datasets or to adapt to new domains (e.g., adapting methods developed for nonmedical data to medical imaging). Furthermore, even if the data gap is bridged, the domain of surgical problems and its axes of variation (patient, disease, etc.) is huge, and solutions need to be adaptive in order to scale.
Challenges for clinical deployment are:
Technical challenges in infrastructure: computing facilities in the OR, access to cloud computing using limited bandwidth, and latency of delivering solutions are all practical problems that require engineering resources beyond the core AI development.
Regulatory requirements around solutions: various levels of regulation are needed for integrating medical devices and software within the OR. Because of their complexity, assessing the limitations and capabilities of AI-based solutions is difficult, particularly for problems in which human supervision cannot be used to validate their precision (e.g., simultaneous localization and mapping).
User interface design: it is critical to ensure that only relevant information is provided to surgical teams and that, for advanced AI-based solutions, direct communication between practitioners and the surgical platform can be established. Integrating contextual information (e.g., surgical phase recognition and practitioner identification) is a major challenge for developing efficient user interfaces.
Finally, this short and succinct review has focused on research directions that are in active development. Due to limitations of space, we have not discussed opportunities around using computer vision with different imaging systems or spectral imaging, despite the opportunities for AI systems to resolve ill-posed inverse problems in that domain. Additionally, we have not covered in detail work in vision for the entire OR, but this is a very active area of development with exciting potential for wider team workflow understanding.
Conflict of Interest Statement
D. Stoyanov is part of Digital Surgery Ltd. and a shareholder in Odin Vision Ltd.
The other authors have no conflicts of interest to declare.
This work was supported by the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS) at the University College London (203145Z/16/Z), EPSRC (EP/P012841/1, EP/P027938/1, and EP/R004080/1), and the H2020 FET (GA 863146). D. Stoyanov is supported by a Royal Academy of Engineering Chair in Emerging Technologies (CiET1819\2\36) and an EPSRC Early Career Research Fellowship (EP/P012841/1).
E. Mazomenos wrote the section dedicated to surgical video analysis. F. Vasconcelos wrote the section dedicated to CAD and F. Chadebecq the section dedicated to computer-assisted interventions. D. Stoyanov wrote the Introduction and Discussion and was responsible for the organization of this paper.