Introduction: The accurate distinction between benign and malignant biliary strictures (BSs) poses a significant challenge. Despite the use of bile duct biopsies and brush cytology via endoscopic retrograde cholangiopancreaticography (ERCP), the results remain suboptimal. Single-operator cholangioscopy can enhance the diagnostic yield in BS, but its limited availability and high costs are substantial barriers. Convolutional neural network-based systems may improve the diagnostic process and enhance reproducibility. Therefore, we assessed the feasibility of using deep learning to differentiate BS using fluoroscopy images during ERCP. Methods: We conducted a retrospective review of adult patients (n = 251) from three university centers in Germany (Leipzig, Dresden, Halle) who underwent ERCP. We developed and evaluated a deep learning-based model using fluoroscopy images. The performance of the classifier was evaluated by measuring the area under the receiver operating characteristic curve (AUROC), and we utilized saliency map analyses to understand the decision-making process of the model. Results: In cross-validation experiments, malignant BSs were detected with a mean AUROC of 0.89 ± 0.03. The test set of the Leipzig cohort demonstrated an AUROC of 0.90. In two independent external validation cohorts (Dresden, Halle), the deep learning-based classifier achieved an AUROC of 0.72 and 0.76, respectively. The artificial intelligence model’s predictions identified plausible characteristics within the fluoroscopy images. Conclusion: By using a deep learning model, we were able to discriminate malignant BS from benign biliary conditions. The application of artificial intelligence enhances the diagnostic yield of malignant BS and should be validated in a prospective design.

Biliary strictures (BSs) represent a heterogeneous group of diseases and often pose a diagnostic challenge for clinicians [1]. BSs can be benign or malignant in origin; malignant BSs are commonly caused by conditions such as cholangiocarcinoma (CCA) and pancreatic ductal adenocarcinoma (PDAC). These strictures can be visualized using fluoroscopy during endoscopic retrograde cholangiopancreatography (ERCP).

However, the distinction between benign and malignant BS often requires a complex diagnostic approach including bile duct forceps biopsies, brush cytology, and/or single-operator cholangioscopy (SOC). In a 10-year review of the literature, brush cytology showed an overall sensitivity of only 41.6% with a negative predictive value of 58.0% [2]. In another systematic review, the pooled sensitivity of forceps biopsy was also low (48.1%) [3]. Owing to this limited diagnostic accuracy, some biopsies fail to provide a conclusive result and are categorized as indeterminate lesions. Approximately 20% of cases remain indeterminate, meaning that it is unclear whether the lesion is benign or malignant [4]. This diagnostic challenge can lead to undertreatment (missing necessary surgery of malignant BS in the early stage) or overtreatment (unnecessary surgery of benign BS), both of which can result in severe consequences for patients. Previous studies showed that up to 24% of patients referred for surgery due to suspected malignant BSs were unexpectedly found to have a benign process on definitive histologic evaluation [5, 6]. SOC increases the diagnostic accuracy but is not broadly available and is costly [7]. One reason is that SOC is a highly specialized examination that requires extensive experience and expertise. Furthermore, SOC devices are disposable and are discarded after a single use.

Thus, new diagnostic technologies are urgently needed to provide greater accuracy in the noninvasive or minimally invasive diagnosis of malignant BS. In the last 5 years, deep learning (DL) technologies have become the method of choice to analyze complex medical imaging data [8], such as radiology [9], endoscopy [10], or histopathology images [11]. DL belongs to the field of artificial intelligence (AI) and when it is applied to image data, it mostly relies on convolutional neural networks (CNNs), a class of artificial neural networks [12]. CNNs have repeatedly been shown to reach human expert performance in many tasks, given that they are trained on large clinical datasets [13‒15].

Human observers often find it challenging to distinguish between malignant and benign BSs when analyzing fluoroscopy images from ERCP examinations, which contributes to the high rate of indeterminate classifications. However, different types of BS are known to have different morphologies. The presence of a long, irregular BS with shelf-like morphology, or of strictures in both the bile duct and pancreatic duct accompanied by proximal dilation, is the primary indicator of a potentially malignant condition [16]. We therefore performed a retrospective, multicenter, noninterventional study to develop and evaluate a DL classifier for the automated differentiation of malignant versus benign BS based on ERCP fluoroscopy images.

This study was conducted and reported in accordance with the internationally recognized Standards for Reporting Diagnostic Accuracy (STARD) guidelines [17]. It also adheres to the Minimum information about clinical artificial intelligence modeling (MI-CLAIM) standards for conducting and reporting medical image analysis research [18].

Patient and Experimental Design

In this study, we screened a total patient cohort of 1,887 patients who received an ERCP at the University of Leipzig Medical Center from 2018 to 2022. Figure 1 demonstrates the selection process of the patients. In order to identify potentially eligible patients with malignant BS, we conducted a systematic review utilizing the keyword “malignant stenosis” within the digital endoscopy documentation system (ViewPoint 5, GE-Healthcare, Chicago, IL, USA). Only histologically confirmed malignant BSs, either by biopsy or definitive evaluation in resected specimens, were finally included in the analysis. BSs with uncertain diagnosis or inconclusive findings were excluded from the study.

Fig. 1.

Flowchart of patient selection. A systematic review was conducted to find eligible patients with malignant BS. For every identified patient with suspected malignant BS, 2 patients with benign conditions who received an ERCP within 3 days by the same endoscopist were selected. Patients with no eligible images, no histology, and no follow-up were excluded. Finally, 202 patients with 456 ERCPs and 1,007 fluoroscopy images could be included in the study. These patients were split into training, validation, and test cohorts. BS, biliary stricture; ERCP, endoscopic retrograde cholangiopancreatography.


For every identified patient with suspected malignant BS, we selected 2 patients with benign conditions who received an ERCP within 3 days by the same endoscopist. The classification as benign was based on the endoscopist’s diagnosis and/or histological evaluation. In addition, patients with benign BS were required to have at least 6 months of follow-up from the initial examination to be eligible for this study. Two experienced endoscopists (K.V.T. and M.H.) conducted a detailed evaluation of the fluoroscopy images with regard to completeness, representativeness, and eligibility. Eligible fluoroscopy images of the biliary tract had to clearly depict the hilum as well as the right and left hepatic ducts. Furthermore, any stenosis present had to be accurately represented in the images. If the selected patients had undergone ERCP in the past and their fluoroscopy images were confirmed as eligible, these images were also included.

To additionally assess the robustness and generalizability of our model, we performed an external validation. For this purpose, we collected data from two university hospitals in Dresden and Halle (Department of Gastroenterology each) to obtain 2 independent patient cohorts. We then applied our model, which had been trained on the Leipzig training cohort, to these cohorts.

Acquisition and Processing of Images

The Philips MultiDiagnost Eleva (Philips Healthcare, Amsterdam, Netherlands) was used to acquire grayscale images which were saved in the Digital Imaging and Communications in Medicine (DICOM) format. Anonymous DICOM-formatted ERCP data were retrieved to produce grayscale images of 1,024 × 1,024 pixels. To ensure accurate analysis, images were subjected to a comprehensive preprocessing procedure which included resizing and normalization. All images were resized to the optimal size of 224 × 224 pixels for DenseNet.

The methodology for image processing was carried out as follows: first, we minimized the influence of disruptive elements in the periphery of the images. Therefore, we enlarged each image and removed the borders prior to training and validation. Second, we used Contrast Limited Adaptive Histogram Equalization (CLAHE) [19] to enhance image contrast by redistributing intensity values of pixels (online suppl. Fig. 1; for all online suppl. material, see https://doi.org/10.1159/000543049). It divides the input image into small rectangular tiles and computes their histograms. The histograms of these tiles are then equalized, resulting in an image with enhanced contrast (Figure 2b). As a result, the obtained images had better dynamic range and avoided overly bright or overly dark regions [20].
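As a rough illustration of the contrast step, the following minimal sketch applies plain global histogram equalization to a tiny grayscale image in pure Python. It is a simplification and not CLAHE itself: CLAHE additionally processes the image tile by tile, clips each tile histogram to limit noise amplification, and interpolates between neighboring tiles. The function name `equalize` and the 8-bit intensity range are assumptions made for this example and are not taken from the study code.

```python
def equalize(image, levels=256):
    """Global histogram equalization of a 2D list of integer gray values.

    Simplified stand-in for CLAHE: no tiling, no clip limit, no interpolation.
    """
    pixels = [value for row in image for value in row]

    # Histogram of gray values.
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1

    # Cumulative distribution function (CDF) of the histogram.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:
        # Constant image: nothing to stretch.
        return [row[:] for row in image]

    # Map each gray value through the normalized CDF onto the full range.
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[value] - cdf_min) * scale) for value in row]
            for row in image]


# Hypothetical low-contrast 2 x 2 patch with values packed around 100.
low_contrast = [[100, 101], [102, 103]]
stretched = equalize(low_contrast)
```

In this toy patch, four neighboring gray values are spread over the full 0 to 255 range, which is the effect panel b of Figure 2 describes for the cumulative histogram.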

Fig. 2.

Schematic workflow of the model. a Visualization of the bile ducts with ERCP and generation of ERCP images from routine practice. b Contrast adaptation with histogram equalization. After adaptation of the histogram, pixel intensities exhibit a balanced distribution, so that each intensity value is equally probable and the cumulative histogram is even across the full range of values. c Workflow of preprocessing and DL algorithms. d Distribution of the validation cohorts. ERCP, endoscopic retrograde cholangiopancreatography; CDF, cumulative distribution function; CLAHE, contrast limited adaptive histogram equalization.


DL Workflow

For all DL experiments, we used Medical Open Network for Artificial Intelligence (MONAI) as our preferred framework. MONAI provides the necessary tools for creating advanced models to enhance medical imaging. This open-source platform enables collaborative research and clinical efforts between multiple parties, ensuring a fast development process [21].

Our CNN architecture incorporates DenseNet121 for an optimized network structure. DenseNet is a CNN architecture that was developed in 2017. It has become popular for medical image analysis because it improves upon the accuracy of other architectures such as ResNet and Inception. DenseNet utilizes the concept of feature reuse, where each layer in the network can access features from all previous layers. This allows for a more efficient use of parameters and helps reduce overfitting [22].

To avoid overfitting, we monitored the loss curves of both the training and validation sets during the training procedure. We implemented an early stopping mechanism that terminated training if no improvement in the loss was detected within 50 epochs. Online supplementary Figure 2 presents an example run of our DL experiment, in which early stopping was triggered at epoch 206 because the loss on the validation set had stopped improving. Online supplementary Figure 2c shows that continuing the training beyond this point would lead to a deterioration of performance, indicative of overfitting. In addition, we used data augmentation to mitigate possible effects of limited data and to improve the generalization of the model. We applied techniques such as rotation, zooming, contrast adjustment, and brightness alteration. Online supplementary Figure 3 illustrates a few of the outcomes following data augmentation.
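The early stopping rule can be sketched as a patience counter on the validation loss. The class below is a minimal illustration only: the name `EarlyStopping`, its interface, and reading the 50-epoch criterion as a patience of 50 epochs without improvement are assumptions, not a description of the code used in the study. The toy run uses a patience of 3 to keep the example short.

```python
class EarlyStopping:
    """Stop training once the validation loss has not improved for
    `patience` consecutive epochs (hypothetical helper, not study code)."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.epochs_without_improvement = 0

    def step(self, epoch, val_loss):
        """Record one epoch and return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Toy validation losses: four improving epochs followed by a plateau.
losses = [1.0, 0.8, 0.7, 0.6] + [0.65] * 10
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
```

In practice, the weights saved at `best_epoch` rather than those of the final epoch would typically be kept for evaluation.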

To evaluate our model, we conducted five training trials with varying random seeds, in which the training, validation, and test cohorts were randomly assigned. We generated the area under the receiver operating characteristic curve (AUROC) for each run and calculated the mean AUROC from the five training runs. Furthermore, we used three-fold cross-validation and reported the AUROC. In addition, we used the model of the best fold for external validation. To prevent any potential bias in our analysis, we ensured that no data from the same patient were included in both the training and test datasets in any of our experiments.
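Keeping all images of one patient on the same side of a split can be sketched as follows. This is a pure-Python stand-in written for illustration; the function name `split_by_patient`, the patient identifiers, and the test fraction are hypothetical.

```python
import random


def split_by_patient(patient_ids, test_fraction=0.2, seed=0):
    """Split image indices so that no patient appears in both sets.

    patient_ids holds one entry per image, naming the patient it belongs to.
    """
    unique_patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_patients)

    # Assign whole patients, not single images, to the test set.
    n_test = max(1, round(len(unique_patients) * test_fraction))
    test_patients = set(unique_patients[:n_test])

    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


# Eight hypothetical images from five patients.
image_patients = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]
train_idx, test_idx = split_by_patient(image_patients, test_fraction=0.4, seed=42)
train_patients = {image_patients[i] for i in train_idx}
held_out_patients = {image_patients[i] for i in test_idx}
```

Because the split is drawn over patients, the number of test images varies with the seed, whereas the number of test patients is fixed by `test_fraction`.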

Statistical Analysis and Metrics

Statistical analyses were conducted using SPSS version 26.0.0 (released in 2019 by IBM Corporation, Armonk, NY, USA). The data were presented as counts with corresponding percentages for categorical variables, and the median with range was used for continuous variables.

For cross-validation, the cohort was divided into three subsets through a randomization process (using the sklearn.model_selection.GroupShuffleSplit function with n_splits = 3). Following the training phase, all models were evaluated on a test dataset. Performance was assessed per image using multiple metrics: accuracy, recall, precision, F1-score, and AUROC. Our primary evaluation metric was AUROC. We calculated the mean AUROC and its standard error across all folds to evaluate our model in cross-validation and external validation. Quantitative analyses were performed with Scikit-learn version 1.1.3, and results were plotted with Matplotlib version 3.6.2.
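The primary metric can be illustrated with a short pure-Python sketch. AUROC is computed here through its rank interpretation, that is, the probability that a randomly chosen malignant image receives a higher score than a randomly chosen benign image, with ties counted as one half. The helper names, the toy labels and scores, and the three fold values are assumptions for illustration; the study itself used Scikit-learn.

```python
from math import sqrt
from statistics import mean, stdev


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


def mean_and_standard_error(values):
    """Mean and standard error (sample SD divided by sqrt(n)) of fold results."""
    return mean(values), stdev(values) / sqrt(len(values))


# Toy example: 1 = malignant, 0 = benign; scores are predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
example_auroc = auroc(labels, scores)

# Hypothetical AUROC values from three folds.
fold_mean, fold_se = mean_and_standard_error([0.90, 0.80, 0.85])
```

The pairwise loop is quadratic in the number of images and is meant only to make the definition explicit; library implementations rely on sorting instead.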

To gain a better comprehension of our CNN, we used gradient-weighted class activation mapping (Grad-CAM) for creating class-specific heatmaps from any given input image. Grad-CAM uses the gradients of a given class to generate a coarse localization map highlighting the areas of an image that CNNs are using to make their prediction [23].
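The core Grad-CAM computation can be written in a few lines. The sketch below assumes that the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been extracted, which in a real pipeline is done by the DL framework. The function name `grad_cam`, the nested-list data layout, and the toy values are assumptions for illustration.

```python
def grad_cam(activations, gradients):
    """Coarse class-specific heatmap from K feature maps and their gradients.

    activations, gradients: lists of K maps, each an H x W nested list.
    Returns an H x W heatmap scaled to [0, 1].
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: gradients averaged over all spatial positions.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]

    # Weighted sum of feature maps followed by ReLU, which keeps only
    # regions that speak in favor of the class.
    cam = [[max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)]
           for i in range(height)]

    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Two toy 2 x 2 feature maps: the first supports the class (positive
# gradients), the second counts against it (negative gradients).
acts = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 2.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],
         [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
```

The resulting map has the low resolution of the feature maps and is upsampled to the input image size before being overlaid as a heatmap.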

Patients Characteristics

During the selection process illustrated in Figure 1, a total of 202 patients from the Leipzig cohort were identified for training and validation. These patients had undergone 456 ERCP exams, producing 1,007 fluoroscopy images that were utilized in training a DL-based classifier for this study. Out of these, 173 individuals (911 images) were utilized for training and validation, while the remaining 29 patients (94 images) were reserved as the test cohort.

Table 1 summarizes the characteristics of the patients in the training and validation group, the test group, and the external validation groups. In the training and validation group, 71.1% (n = 123) of patients had a benign disease and 28.9% (n = 50) were found to have malignant BS. Of the patients with benign disease, the majority (51.2%, n = 63) had bile duct stones. In 16.3% (n = 20), a bile duct stone or stenosis was suspected on endoscopic ultrasound (EUS) or magnetic resonance cholangiopancreaticography (MRCP), but ERCP revealed a normal bile duct system. A benign BS (inflammatory, PSC, SSC, anastomotic, ischemic type biliary lesions) was found in 32.5% (n = 40). CCA was the most common cause of malignant strictures, followed by hilar metastases, PDAC, and gallbladder cancer. Hepatocellular carcinoma and neuroendocrine tumors occurred less frequently.

Table 1.

Baseline characteristics of the patients in all cohorts

Characteristic          Leipzig, training      Leipzig, test    Dresden        Halle
                        and validation
Total patients          173                    29               17             32
Total images            911                    94               31             36
Age, median (range)     66 (23–100)            63.5 (18–95)     69 (34–89)     72 (43–86)
Male gender             102 (59.0%)            20 (69.0%)       7 (41.2%)      20 (62.5%)
Benign                  123 (71.1%)            16 (55.2%)       11 (64.7%)     27 (84.4%)
  Normal                20 (16.3%)             7 (43.8%)        3 (27.3%)      0 (0.0%)
  Stone                 63 (51.2%)             5 (31.3%)        4 (36.4%)      20 (74.1%)
  Inflammatory          13 (10.6%)             1 (6.3%)         3 (27.3%)      6 (22.2%)
  Anastomotic           13 (10.6%)             2 (12.5%)        0 (0.0%)       0 (0.0%)
  ITBL                  7 (5.7%)               0 (0.0%)         0 (0.0%)       0 (0.0%)
  PSC                   4 (3.3%)               1 (6.3%)         0 (0.0%)       1 (3.7%)
  SSC                   3 (2.4%)               0 (0.0%)         1 (9.1%)       0 (0.0%)
Malignant               50 (28.9%)             13 (44.8%)       6 (35.3%)      5 (15.6%)
  CCA                   28 (56.0%)             8 (61.5%)        1 (16.7%)      3 (60.0%)
  Metastasis            8 (16.0%)              4 (30.8%)        2 (33.3%)      0 (0.0%)
  PDAC                  7 (14.0%)              1 (7.7%)         3 (50.0%)      2 (40.0%)
  Gallbladder cancer    5 (10.0%)              0 (0.0%)         0 (0.0%)       0 (0.0%)
  HCC                   1 (2.0%)               0 (0.0%)         0 (0.0%)       0 (0.0%)
  NET                   1 (2.0%)               0 (0.0%)         0 (0.0%)       0 (0.0%)

CCA, cholangiocarcinoma; HCC, hepatocellular carcinoma; ITBL, ischemic type biliary lesions; NET, neuroendocrine tumor; PDAC, pancreatic ductal adenocarcinoma; PSC, primary sclerosing cholangitis; SSC, secondary sclerosing cholangitis.

In the test group, 55.2% (n = 16) of the patients had benign disease and 44.8% (n = 13) had malignant BS. Of the patients with benign disease, ERCP showed a normal bile duct system in 43.8% (n = 7), although stones or stenoses had been suspected on EUS or MRCP; 31.3% (n = 5) had stones and 25.0% (n = 4) showed benign strictures (inflammation, PSC, anastomotic). In the test cohort, the majority of malignant strictures were CCA (61.5%, n = 8) and metastases (30.8%, n = 4); 1 patient (7.7%) had PDAC.

External validation data were collected from two other hospitals in Dresden, Germany, and Halle, Germany, to validate the model. The Dresden cohort consisted of 17 patients with 31 fluoroscopy images, with 64.7% (n = 11) of patients being diagnosed with a benign disease and 35.3% (n = 6) with malignant BS. In the Halle cohort, 32 patients with 36 fluoroscopy images were included, and the majority with 84.4% (n = 27) were diagnosed with a benign disease, while 15.6% (n = 5) had malignant BS. Malignant BS in the Dresden cohort were PDAC (50%, n = 3), metastasis (33.3%, n = 2), and CCA (16.7%, n = 1). In the Halle cohort, the most common type of malignant BS was CCA (60%, n = 3), followed by PDAC (40%, n = 2).

Model Performance

Using three-fold cross-validation with different random seeds (Fig. 3a), we trained a DL classifier to detect malignant strictures in fluoroscopy images. The experimental procedure involved training each split five times (Fig. 3b). Split one exhibited a mean AUROC of 0.93 ± 0.00, split two demonstrated a mean AUROC of 0.87 ± 0.01, and split three showed a mean AUROC of 0.86 ± 0.02. Across all 15 iterations, the mean AUROC was calculated as 0.89 ± 0.03, with the best-performing model achieving an AUROC of 0.94.

Fig. 3.

Evaluation of the performance of the model by cross-validation and external validation. a ROC of cross-validation in the Leipzig cohort. b ROC of every split in the Leipzig cohort. c Validation on the test set of the Leipzig cohort and external validation (Dresden and Halle). d Confusion matrices of the different cohorts. ROC, receiver operating characteristic.


We assessed the precision, recall, F1-score, accuracy and AUROC (Fig. 3c) with the best-performing model. Confusion matrices were also created to display the true positive, true negative, false positive, and false negative cases (Fig. 3d). In the Leipzig cohort, the precision was calculated to be 0.86, while the recall was 0.79, resulting in an F1-score of 0.82 and an accuracy of 0.86. The AUROC for the test set was 0.90.

To further validate the robustness of our model, we employed two independent cohorts from University Medical Centers of Dresden and Halle, as external validation. In the Dresden cohort study, our calculations yielded a precision value of 0.75 and a recall value of 0.55 with an F1-score of 0.63. The accuracy of the model was 0.86, and the AUROC was 0.72. In the Halle cohort study, the calculated precision and recall were 0.80 and 0.44, respectively. These computations converged to an F1-score of 0.57. The accuracy was computed as 0.83, and the AUROC achieved was 0.76.
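The relation between the reported metrics can be checked with a few lines of Python. The sketch below defines precision, recall, and F1-score from confusion-matrix counts and then recomputes the F1-scores from the precision and recall values reported above for the Leipzig test set and the two external cohorts. The function names and the toy counts are assumptions for illustration; the per-cohort confusion-matrix counts themselves are not reproduced here.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy counts: 3 true positives, 1 false positive, 1 false negative.
toy_precision, toy_recall, toy_f1 = precision_recall_f1(tp=3, fp=1, fn=1)

# Precision and recall as reported in the text for each cohort.
reported = {
    "Leipzig test": (0.86, 0.79),
    "Dresden": (0.75, 0.55),
    "Halle": (0.80, 0.44),
}
recomputed_f1 = {name: round(f1_from(p, r), 2) for name, (p, r) in reported.items()}
```

The recomputed values (0.82, 0.63, and 0.57) agree with the F1-scores given in the text.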

Explainability of the Model

In our study, we employed saliency maps as a means to identify factors and ascertain the degree to which our predictions are explicable. This enabled us to better comprehend the underlying mechanisms that drive our results and provide a clearer rationale for the outcomes observed. The saliency maps presented in Figure 4 provide insight into how our prediction model works.

Fig. 4.

Saliency maps: visual heatmap of the regions of an image that are most important for a deep neural network’s prediction. a Normal bile duct system. b ITBL. c, d CCA. CCA, cholangiocarcinoma; ITBL, ischemic-type biliary lesions.


Our analysis of the saliency maps showed that the DL model primarily focused on the bile duct system and not on the background or artefacts (such as endoscope, black margins, or bright regions resulting from air). Furthermore, we found that in cases of malignant BS, the regions of interest were accurately identified by the model. Specifically, the stenosis and dilated bile duct were correctly detected. However, when no BS was present, the saliency maps marked a portion of the normal bile duct system. This means that detecting malignant BS requires clear image acquisition (e.g., complete visualization of the hilar region). The model’s performance will be influenced by the quality and completeness of the imaging data. Therefore, careful attention to image acquisition protocols and quality control is needed. These findings suggest that the model is able to effectively identify and prioritize the relevant characteristics in the input image, which could have important implications for the accurate detection and diagnosis of bile duct abnormalities.

Figure 5 presents the saliency maps derived from the external cohorts. Figure 5a displays an accurate prediction of malignant stenosis (case of CCA) in the Dresden cohort. However, Figure 5b and d highlighted instances of false negatives associated with pancreatic carcinoma within both Dresden and Halle cohorts. Furthermore, it should be noted that inferior image quality can induce false positive outcomes, as exemplified by Figure 5c.

Fig. 5.

Saliency maps of the external cohorts. a CCA in the Dresden cohort (true positive). b, d Cases of pancreatic carcinoma in both cohorts (false negative). c False positive case in the Halle cohort. CCA, cholangiocarcinoma.


Our study presents important evidence of the efficacy of a CNN model utilizing “simple” ERCP fluoroscopy images to detect malignant BS. Our classifier surpassed the sensitivity of traditional sampling techniques such as forceps biopsy and brush cytology, proving to be a more effective means of detection. This is the first time such results have been achieved, emphasizing the immense potential of CNN models in the field of BS diagnosis [3]. Although the cross-validation showed good performance, we observed a decrease in performance during external validation. This drop can be attributed to the model being trained on data collected from only one center, which limits its applicability to other centers. Factors such as differences in imaging protocols, patient populations, and center-specific practices may have contributed to overfitting. Notably, the quality of bile duct visualization via ERCP varies across centers (Fig. 5c). This variability is influenced by differences in contrast injection techniques and by operator-dependent differences in image quality. To mitigate overfitting, we implemented an early stopping mechanism that terminated training when no improvement in the loss curves was detected (online suppl. Fig. 2). Additionally, malignant BS cases in our training cohort were predominantly CCA, which may explain the reduced performance for other types of malignant BS, such as pancreatic cancer (Fig. 5b, d), which were more prevalent in the external cohorts. However, the focus of our study was on CCA, as these lesions are more often a challenge for the definitive diagnosis when faced with indeterminate BS. In comparison, the histological diagnosis of other malignant BS, such as PDAC, hepatocellular carcinoma, and metastases, can often be confirmed through other techniques, such as ultrasound- [24] or EUS-guided biopsy [25, 26].

CCA, the most common type of malignant BS, often remains asymptomatic in the early stages [27]. Diagnosing CCA strictures is often difficult because both benign and other malignant lesions may mimic them [28]. In addition, robust prediction models to discriminate malignant from benign BS are lacking, and both conditions share similar clinical signs such as jaundice, pruritus, dark urine, clay-colored stool, and abdominal discomfort. Therefore, using AI to discriminate between benign and malignant BS offers an opportunity that could significantly impact diagnostic approaches. Recent research has demonstrated the potential of AI in gastroenterology, and in particular in gastrointestinal endoscopy. Landmark studies have demonstrated its capacity for improved detection of cancer and precancerous lesions. In addition, AI can be useful to localize bleeding sites, polyps, or infections in the gastrointestinal tract [10, 29]. Moreover, CNNs have been applied to the analysis of X-ray images for the diagnosis of lung diseases [30‒32], as well as in coronary angiography [33‒36].

Several studies have been conducted to develop predictive models that could help increase the accuracy in the detection of malignant BS. In this regard, SOC has an emerging role [37, 38], as ERCP-guided biopsies and brush cytology remain insufficient for the diagnosis of malignant BS [3]. Cholangioscopy-guided biopsies have been found to be a reliable and more accurate method for the discrimination of indeterminate BS compared to standard ERCP techniques [39, 40]. Recent studies by Marya et al. [37] and Saraiva et al. [38] showed that combining cholangioscopy with a CNN can achieve an accuracy beyond 0.9. Thus, cholangioscopy offers a powerful diagnostic tool that is, however, limited by the time-consuming examination, the need for adequate training, high costs, and restricted availability outside expert centers [7, 41]. Moreover, single-use devices, such as those used for SOC, should be avoided to reduce the environmental footprint [42, 43]. Furthermore, MRCP offers a high-quality assessment of BSs, but it is notably cost-intensive and resources for its application are limited. Additionally, if an ERCP with stent implantation has been conducted, the quality of MRCP may be compromised. Although MRCP remains a critical tool for improving diagnostic outcomes in BSs, the integration of an AI-assisted ERCP procedure could be pivotal in identifying patients who would benefit most from more comprehensive diagnostic approaches.

Saliency maps were utilized to highlight regions of interest in the images that the model relies on for decision-making. These saliency maps were reviewed by expert endoscopists to ensure their accuracy and relevance. This review helps confirm that the saliency maps correctly represent the model’s focus and provide meaningful insights into its decision-making process. These saliency maps show that the model depends on high-quality image acquisition; otherwise, its performance can be impacted by the quality and completeness of the imaging data.

Our study has some limitations. First, the DL model was trained on data obtained from a single center and thus cannot automatically be expanded to other centers and/or settings. Future multicenter studies are necessary to adequately train and evaluate this model, analyze additional features, and assess its clinical utility. By enhancing the diversity of training datasets, future CNNs may differentiate between various subtypes of benign BS and malignant BS. Second, the study has a retrospective design, which introduces the potential for a selection bias. To minimize this possible bias, we defined clear inclusion and exclusion criteria. Additionally, data from external centers were included for validation. However, external validation sets were relatively small and homogeneous, which may affect the model's ability to generalize across diverse populations.

The number of patients and images included in our study was determined by the availability of high-quality, diagnostically relevant data. Nevertheless, we aimed to include as many images as possible to enhance the model’s training and validation. The focus was on ensuring the data’s quality and relevance to achieve reliable results. We conducted a per-image analysis, as multiple images were available for many patients, and our focus was on image-level diagnostic accuracy. This approach allowed us to assess the model’s performance across individual images, which is crucial in real-world clinical settings where each image can provide critical diagnostic information. Given that images can vary significantly and not every image may clearly indicate the correct diagnosis, analyzing each image independently was essential.

While our study focuses on the development and validation of the AI model, its practical implementation in a clinical setting requires careful consideration. Future steps should include integrating the AI system with existing ERCP or fluoroscopy platforms to ensure seamless interoperability with clinical workflows. The primary goal should be to develop user-friendly interfaces and real-time processing capabilities that assist in the diagnosis of malignant BS. One potential implementation is to integrate the model into the real-time imaging systems used during ERCP. As fluoroscopy images are acquired, the model would analyze them on the fly, identify potential strictures, and provide immediate feedback on whether they are likely malignant or benign. Such real-time detection could make diagnostic procedures more efficient and enable clinicians to make informed decisions more promptly.

In conclusion, the preliminary findings from our DL model are promising, indicating its potential to enhance the accuracy of malignant BS detection through ERCP fluoroscopy imaging. The data points toward the possible creation of highly efficient and precise diagnostic support systems leveraging this technology. Additionally, AI-assisted assessments may facilitate earlier interventions, offering patients access to curative treatments sooner and reducing the likelihood of unnecessary surgeries for indeterminate BS. However, further validation and refinement of the model are necessary before definitive conclusions can be drawn about its efficacy in clinical practice.

The final study protocol was reviewed and approved by the Ethics Committee of the Medical Faculty of the University of Leipzig (494/22-ek) in accordance with the Declaration of Helsinki, the “Medical Association’s Professional Code of Conduct,” and the principles of the ICH-GCP guidelines (issued in June 1996, ISO14155 from 2012). Given the retrospective nature of this study and the use of anonymized data, written informed consent from participants was not required, in accordance with local/national guidelines.

Jakob Nikolas Kather has provided consulting services for Owkin, France; Panakeia, UK; and DoMore Diagnostic. Marcus Hollenbach has received honoraria from FUJIFILM for lectures and expert panel participation.

J.N.K. is supported by the German Federal Ministry of Health (DEEP LIVER, ZMVI1-2520DAT111), the Max-Eder-Programme of the German Cancer Aid (grant #70113864), the German Federal Ministry of Education and Research (PEARL, 01KD2104C), and the German Academic Exchange Service (SECAI, 57616814).

K.V.T. and M.H. designed the study, analyzed and interpreted the data, and wrote and submitted the manuscript. J.N.K., A.H., G.P.V., and O.L.S. contributed to the analyses. K.V.T., M.H., G.P.V., O.L.S., J.G., J.R., S.K., P.M., J.F., T.K., J.H., A.H., and J.N.K. contributed to the experimental design and the interpretation of the results, and revised and approved the manuscript.

Additional Information

Kien Vu Trung and Marcus Hollenbach share first authorship. Albrecht Hoffmeister and Jakob Nikolas Kather share last authorship.

Our Ethics Committee permits the use of the image data for scientific purposes, but we are not authorized to make them publicly available because of privacy concerns and ethical considerations. However, interested parties may contact the study’s principal investigators if they wish to validate other AI models using the dataset. Further inquiries can be directed to the corresponding author. Our source code, along with an example dataset, detailed instructions, and troubleshooting support, is available under an open-source license at https://github.com/Kien283/ERCP.

1. Restrepo DJ, Moreau C, Edelson CV, Dev A, Saligram S, Sayana H, et al. Improving diagnostic yield in indeterminate biliary strictures. Clin Liver Dis. 2022;26(1):69-80.
2. Burnett AS, Calvert TJ, Chokshi RJ. Sensitivity of endoscopic retrograde cholangiopancreatography standard cytology: 10-y review of the literature. J Surg Res. 2013;184(1):304-11.
3. Navaneethan U, Njei B, Lourdusamy V, Konjeti R, Vargo JJ, Parsi MA. Comparative effectiveness of biliary brush cytology and intraductal biopsy for detection of malignant biliary strictures: a systematic review and meta-analysis. Gastrointest Endosc. 2015;81(1):168-76.
4. Bowlus CL, Olson KA, Gershwin ME. Evaluation of indeterminate biliary strictures. Nat Rev Gastroenterol Hepatol. 2017;14(12):749.
5. Clayton RAE, Clarke DL, Currie EJ, Madhavan KK, Parks RW, Garden OJ. Incidence of benign pathology in patients undergoing hepatic resection for suspected malignancy. Surgeon. 2003;1:32-8.
6. Gerhards MF, Vos P, van Gulik TM, Rauws EA, Bosma A, Gouma DJ. Incidence of benign lesions in patients resected for suspicious hilar obstruction. Br J Surg. 2001;88(1):48-51.
7. Chin MW, Byrne MF. Update of cholangioscopy and biliary strictures. World J Gastroenterol. 2011;17(34):3864-9.
8. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60-88.
9. McBee MP, Awan OA, Colucci AT, Ghobadi CW, Kadom N, Kansagra AP, et al. Deep learning in radiology. Acad Radiol. 2018;25(11):1472-80.
10. Chahal D, Byrne MF. A primer on artificial intelligence and its application to endoscopy. Gastrointest Endosc. 2020;92(4):813-20.e4.
11. Srinidhi CL, Ciga O, Martel AL. Deep neural network models for computational histopathology: a survey. Med Image Anal. 2021;67:101813.
12. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data. 2021;8(1):53.
13. Scheppach MW, Rauber D, Stallhofer J, Muzalyova A, Otten V, Manzeneder C, et al. Detection of duodenal villous atrophy on endoscopic images using a deep learning algorithm. Gastrointest Endosc. 2023;97(5):911-6.
14. Urban G, Tripathi P, Alkayali T, Mittal M, Jalali F, Karnes W, et al. Deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy. Gastroenterology. 2018;155(4):1069-78.e8.
15. Repici A, Spadaccini M, Antonelli G, Correale L, Maselli R, Galtieri PA, et al. Artificial intelligence and colonoscopy experience: lessons from two randomised trials. Gut. 2022;71(4):757-65.
16. Singh A, Gelrud A, Agarwal B. Biliary strictures: diagnostic considerations and approach. Gastroenterol Rep. 2015;3(1):22-31.
17. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.
18. Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020;26(9):1320-4.
19. Pizer SM, Amburn EP, Austin JD, Cromartie R, Geselowitz A, Greer T, et al. Adaptive histogram equalization and its variations. Computer Vis Graphics, Image Process. 1987;39(3):355-68.
20. Reza AM. Realization of the Contrast Limited Adaptive Histogram Equalization (CLAHE) for real-time image enhancement. J VLSI Signal Process Syst Signal Image Video Technol. 2004;38(1):35-44.
21. The MONAI consortium. Project MONAI. 2020. Available from:
22. Huang G, Liu Z, Pleiss G, Maaten L, Weinberger KQ. Convolutional networks with dense connectivity. IEEE Trans Pattern Anal Mach Intell. 2022;44(12):8704-16.
23. Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE international conference on computer vision; 2017. p. 618-26.
24. Buscarini L, Fornari F, Bolondi L, Colombo P, Livraghi T, Magnolfi F, et al. Ultrasound-guided fine-needle biopsy of focal liver lesions: techniques, diagnostic accuracy and complications: a retrospective study on 2091 biopsies. J Hepatol. 1990;11(3):344-8.
25. Eloubeidi MA, Jhala D, Chhieng DC, Chen VK, Eltoum I, Vickers S, et al. Yield of endoscopic ultrasound-guided fine-needle aspiration biopsy in patients with suspected pancreatic carcinoma. Cancer. 2003;99(5):285-92.
26. Fisher L, Segarajasingam DS, Stewart C, Deboer WB, Yusoff IF. Endoscopic ultrasound guided fine needle aspiration of solid pancreatic lesions: performance and outcomes. J Gastroenterol Hepatol. 2009;24(1):90-6.
27. Banales JM, Marin JJG, Lamarca A, Rodrigues PM, Khan SA, Roberts LR, et al. Cholangiocarcinoma 2020: the next horizon in mechanisms and management. Nat Rev Gastroenterol Hepatol. 2020;17(9):557-88.
28. Marušić M, Paić M, Knobloch M, Vodanović M. Klatskin-mimicking lesions. Diagnostics. 2021;11:1944.
29. Okagawa Y, Abe S, Yamada M, Oda I, Saito Y. Artificial intelligence in endoscopy. Dig Dis Sci. 2022;67(5):1553-72.
30. Kundu R, Das R, Geem ZW, Han GT, Sarkar R. Pneumonia detection in chest X-ray images using an ensemble of deep learning models. PLoS One. 2021;16(9):e0256630.
31. Wang G, Liu X, Shen J, Wang C, Li Z, Ye L, et al. Author Correction: a deep-learning pipeline for the diagnosis and discrimination of viral, non-viral and COVID-19 pneumonia from chest X-ray images. Nat Biomed Eng. 2021;5(8):943.
32. Malhotra P, Gupta S, Koundal D, Zaguia A, Kaur M, Lee HN. Deep learning-based computer-aided pneumothorax detection using chest X-ray images. Sensors. 2022;22(6):2278.
33. Moon JH, Lee DY, Cha WC, Chung MJ, Lee KS, Cho BH, et al. Automatic stenosis recognition from coronary angiography using neural networks. Comput Methods Programs Biomed. 2021;198:105819.
34. Yabushita H, Goto S, Nakamura S, Oka H, Nakayama M, Goto S. Development of novel artificial intelligence to detect the presence of clinically meaningful coronary atherosclerotic stenosis in major branch from coronary angiography video. J Atheroscler Thromb. 2021;28(8):835-43.
35. Zhao Q, Li C, Chu M, Gutiérrez-Chico JL, Tu S. Angiography-based coronary flow reserve: the feasibility of automatic computation by artificial intelligence. Cardiol J. 2023;30(3):369-78.
36. Du T, Xie L, Zhang H, Liu X, Wang X, Chen D, et al. Training and validation of a deep learning architecture for the automatic analysis of coronary angiography. EuroIntervention. 2021;17(1):32-40.
37. Marya NB, Powers PD, Petersen BT, Law R, Storm A, Abusaleh RR, et al. Identification of patients with malignant biliary strictures using a cholangioscopy-based deep learning artificial intelligence (with video). Gastrointest Endosc. 2023;97(2):268-78.e1.
38. Saraiva MM, Ribeiro T, Ferreira JPS, Boas FV, Afonso J, Santos AL, et al. Artificial intelligence for automatic diagnosis of biliary stricture malignancy status in single-operator cholangioscopy: a pilot study. Gastrointest Endosc. 2022;95(2):339-48.
39. Gerges C, Beyna T, Tang RSY, Bahin F, Lau JYW, van Geenen E, et al. Digital single-operator peroral cholangioscopy-guided biopsy sampling versus ERCP-guided brushing for indeterminate biliary strictures: a prospective, randomized, multicenter trial (with video). Gastrointest Endosc. 2020;91(5):1105-13.
40. Jang S, Stevens T, Kou L, Vargo JJ, Parsi MA. Efficacy of digital single-operator cholangioscopy and factors affecting its accuracy in the evaluation of indeterminate biliary stricture. Gastrointest Endosc. 2020;91(2):385-93.e1.
41. Tringali A, Lemmers A, Meves V, Terheggen G, Pohl J, Manfredi G, et al. Intraductal biliopancreatic imaging: European Society of Gastrointestinal Endoscopy (ESGE) technology review. Endoscopy. 2015;47(8):739-53.
42. Rodríguez de Santiago E, Dinis-Ribeiro M, Pohl H, Agrawal D, Arvanitakis M, Baddeley R, et al. Reducing the environmental footprint of gastrointestinal endoscopy: European society of gastrointestinal endoscopy (ESGE) and European society of Gastroenterology and Endoscopy Nurses and Associates (ESGENA) position statement. Endoscopy. 2022;54(8):797-826.
43. Baddeley R, Aabakken L, Veitch A, Hayee B. Green endoscopy: counting the carbon cost of our practice. Gastroenterology. 2022;162(6):1556-60.