Abstract
Introduction: Assessing the malignancy of focal liver lesions (FLLs) is an important yet challenging aspect of routine patient care. Contrast-enhanced ultrasound (CEUS) has proved to be a highly reliable tool but is very dependent on the examiner’s expertise. The emergence of artificial intelligence has opened doors to algorithms that could potentially aid in the diagnostic process. In this study, we evaluate the performance of a weakly supervised deep learning model in classifying FLLs as malignant or benign. Methods: Our retrospective feasibility study was based on a cohort of patients from a tertiary care hospital in Germany undergoing routine CEUS examination to evaluate malignancy of FLLs. We trained a weakly supervised attention-based multiple instance learning algorithm during 5-fold cross-validation to distinguish malignant from benign liver tumors, without using any manual annotations, only case labels. We combined the models of the on-average best-performing cross-validation cycle and tested this combined model on a held-out test set. We evaluated its performance using standard performance metrics and developed explainability methods to gain insight into the model’s decisions. Results: We enrolled 370 patients, comprising a total of 955,938 images extracted from CEUS videos or manually captured during the examination. Our combined model was able to identify malignant lesions with a mean area under the receiver operating characteristic curve of 0.844 in the cross-validation experiment and 0.94 (95% CI: 0.89–0.99) on the held-out test set. The accuracy, sensitivity, specificity, and F1-Score of the combined model in finding malignant lesions in the held-out test set were 80.0%, 81.8%, 84.6%, and 0.81, respectively. Our exploratory analysis using visual explainability methods revealed that the model appears to prioritize information that is also highly relevant to expert clinicians in this task. Conclusion: Weakly supervised deep learning can classify malignancy in CEUS examinations of FLLs and thus might one day be able to assist doctors’ decision-making in clinical routine.
Introduction
The enhanced accessibility of medical imaging modalities in recent decades has led to a rise in incidentally detected focal liver lesions (FLLs). Most of these lesions are of benign origin, such as hemangiomas, focal nodular hyperplasias (FNH), or simple liver cysts, and usually do not require any treatment [1]. Yet some lesions are malignant tumors, such as hepatocellular carcinoma (HCC), whose incidence is rising, or metastases of non-hepatic tumors, which occur in about 5% of patients with primary cancers [2, 3]. The problem medical doctors face is classifying these lesions and distinguishing malignant from benign while weighing the risks and benefits of further diagnostic work-up. While biopsies remain the gold standard for the final diagnosis, they are invasive and their risks may not outweigh the benefits. A low-risk diagnostic alternative is contrast-enhanced ultrasound (CEUS), which yields high specificity and sensitivity when performed by well-trained medical experts [4, 5].
Multiple studies have shown the potential of artificial intelligence (AI) for the detection, diagnosis, and prognostication of cancer [6‒9]. Classical machine learning techniques like support vector machines, random forests, and gradient boosting have been applied to medical tasks in radiology over the past decade, e.g., in the analysis of chest radiographs [10]. Increasing computing power has facilitated the use of neural networks, which have likewise been used to detect and diagnose FLLs in B-mode and CEUS examinations [11‒13]. Such studies show the potential of AI methods in the field of ultrasound. Nevertheless, most studies have used expert-picked images and annotation of regions of interest within the chosen frames, which is time-consuming for experts and expensive for projects.
While traditional machine learning techniques (see above) require manual feature engineering, deep learning (DL) models have the capability to automatically learn and extract features from raw data. Attention-based multiple instance learning (aMIL), a weakly supervised DL technique first developed by Ilse et al. [14], has shown great promise when applied to classification tasks within the medical field [15‒17]. With these models, there is no need for complicated and time-consuming preprocessing. Therefore, it is not necessary to hand-pick the best diagnostic images or to annotate a region of interest that the algorithm should focus on. aMIL allows us to feed the algorithm a vast amount of data, and the model automatically quantifies the importance of each image. This is a groundbreaking innovation for the analysis of ultrasound images because the main bottleneck for such studies is usually the manual annotation process; aMIL therefore enables the analysis of much larger patient cohorts. Additionally, through reverse engineering, the relevance of each image can be traced back and compared to a human interpretation of the examination.
aMIL has mainly been applied to the field of computational histopathology where digitized histopathology slides are cut into smaller images in order to allow processing. In our study, we reinterpreted this methodology and applied it to CEUS videos of liver examinations, where the linking information between two images is not spatial, but temporal. The objective of our study was to train, evaluate, and explain a weakly supervised aMIL algorithm, requiring a minimum of preprocessing and human annotation, to study its potential for the classification of malignancy in liver lesions in CEUS examinations. To our knowledge, this is the first application of aMIL to CEUS videos.
Materials and Methods
Patient Selection and Data Collection
We report our study adhering to the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [18]. In this retrospective feasibility study, we collected videos and images of CEUS examinations from patients with liver lesions at the University Hospital of Düsseldorf, a provider of tertiary patient care. To maximize the sample size, a priori sample size calculations were not performed. All patients undergoing CEUS examination of the liver between 1 January 2019 and 31 July 2023 were screened for eligibility. Digital medical records were reviewed to decide on the inclusion of patients based on image quality and an initial diagnosis of HCC, metastasis, hemangioma, or FNH (Fig. 1). Data collection took place from 1 August 2021 to 30 November 2023. In total, 466 patients examined between 2019 and 2023 were included for further work-up. In this stage, 80 patients were excluded (Fig. 1). The remaining 386 patients were screened for gold-standard diagnosis. Final diagnosis was confirmed by histopathology, magnetic resonance imaging, computed tomography, or a follow-up examination after 12 months. For all other cases (n = 119), imaging experts from external institutions (Y.D., U.M., and J.B.) assessed the pseudo-anonymized videos blinded to the diagnoses. In these cases, the final diagnosis was assumed if all experts made the same decision. Cases were excluded if at least one expert had a different opinion (n = 16). For included patients, we collected all of the available videos and images of each patient's CEUS examination. Subsequently, we grouped FLLs of HCC and metastasis into the class “malignant” and lesions of hemangioma and FNH into “benign.”
Flowchart showing the flow of patients through the screening processes and train-test split. “Other” reasons for exclusion were: one patient changed his name, artificially creating an additional entry, and one uncertain diagnosis.
CEUS Examination
All CEUS examinations were performed using a Canon Aplio i800 ultrasound device (Canon Medical Systems Corporation, Otawara, Japan). In each patient, we performed B-mode and CEUS imaging. The applied contrast agent was sulfur hexafluoride (SonoVue, Bracco Imaging, Italy) at a dose of 0.8–1.6 mL i.v., followed by 10 mL of 0.9% saline solution. During each phase of the examination, videos and images were recorded. During the arterial phase, we recorded a continuous video; during the other phases, we recorded short videos of approximately 5 s duration. Phases were defined according to the EFSUMB guidelines based on the first frame of the video (arterial: 0–45 s post injection (p.i.), portal venous: 45–120 s p.i., late: 121–300 s p.i.) [19]. Especially when HCC was considered a relevant differential diagnosis, a “very late” phase was added (>300 s p.i.). The number of examinations per patient was not limited; in most patients, more than one cycle was performed. With only a few exceptions, all examinations were performed by a single physician with more than 15 years of experience in CEUS (M.K.).
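For orientation, these phase windows translate into a simple lookup by time post injection. The sketch below is illustrative only; the function name and the handling of boundary values are our own choices and not part of the clinical protocol.

```python
# Illustrative sketch: map the time post injection (p.i.) of a video's first frame
# to the EFSUMB-based phase labels used in this study. Edge handling is assumed.
def ceus_phase(seconds_post_injection: float) -> str:
    if seconds_post_injection <= 45:
        return "arterial"          # 0-45 s p.i.
    elif seconds_post_injection <= 120:
        return "portal venous"     # 45-120 s p.i.
    elif seconds_post_injection <= 300:
        return "late"              # 121-300 s p.i.
    return "very late"             # >300 s p.i., added when HCC is a relevant differential


assert ceus_phase(30) == "arterial" and ceus_phase(150) == "late"
```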
Sample Preprocessing and Experimental Design
We extracted every frame from each video (initially in “.mov” format) using the openly available OpenCV library (https://opencv.org/), so that each patient’s examination was represented as a bag of images formed by the frames extracted from the video(s) and the additional images taken during the examination (“.png” format; see Fig. 2a). We excluded frames and images that did not show CEUS. However, we included frames showing CEUS and standard B-mode ultrasound in a split-view image. We performed an 80/20 random split at the patient level, stratified by lesion type, dividing the cohort into a training set of 298 and a held-out test set of 72 patients (Fig. 1). Next, we calculated the mean pixel values and their standard deviations from all images of our training cohort to use these values for normalization and as a hyperparameter within our following grid search.
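The following Python sketch illustrates how such per-patient bags can be assembled with OpenCV and how channel-wise normalization statistics can be computed. The folder layout, function names, and the in-memory accumulation of pixel statistics are our own simplifications for illustration, not the study’s pipeline.

```python
# Minimal sketch, not the authors' pipeline: build a per-patient "bag" of images
# from CEUS video frames and standalone .png images, and compute channel statistics.
import cv2
import numpy as np
from pathlib import Path


def video_to_frames(video_path: str) -> list[np.ndarray]:
    """Return every frame of a video as an RGB numpy array."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def build_bag(patient_dir: str) -> list[np.ndarray]:
    """Combine all video frames and standalone images into one bag per patient."""
    folder = Path(patient_dir)          # hypothetical layout: one folder per patient
    bag = []
    for video in sorted(folder.glob("*.mov")):
        bag.extend(video_to_frames(str(video)))
    for image in sorted(folder.glob("*.png")):
        bag.append(cv2.cvtColor(cv2.imread(str(image)), cv2.COLOR_BGR2RGB))
    return bag


def channel_stats(bags: list[list[np.ndarray]]) -> tuple[np.ndarray, np.ndarray]:
    """Per-channel mean and std over the training cohort (naive in-memory version)."""
    pixels = np.concatenate([img.reshape(-1, 3) for bag in bags for img in bag]) / 255.0
    return pixels.mean(axis=0), pixels.std(axis=0)
```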
Outline of the study procedures. a Examinations with contrast-enhanced ultrasound (CEUS) were performed in patients with focal liver lesions (FLLs) to evaluate malignancy. Each patient’s videos taken during the examination were cut into frames and combined with the standalone images, giving a bag of images for each patient. A pretrained convolutional neural network (ResNet-50 pretrained on ImageNET) was used to obtain one-dimensional feature vectors as a distilled representation of each image inside a bag. These bags, together with the known status of malignancy for each patient’s examination, were then used to train and test our weakly supervised attention-based multiple instance learning (aMIL) network. b In the first step, our total patient (pat.) cohort of 370 was subdivided into a training and a held-out test set. Five cycles of 5-fold cross-validation were performed within the training set for 72 different combinations of hyperparameters, summing up to 360 cross-validation experiments. The cross-validation with the highest average area under the receiver operating characteristic curve (AUROC) was then chosen, and all five models together constitute our combined model. This ensemble was then tested on the held-out test set, giving a final prediction score for the evaluation of malignancy (benign vs. malignant) for each patient as the average over all internal models’ scores.
In a first step, we carried out a hyperparameter grid search within our training set. This consisted of five cycles of 5-fold cross-validation, with each cycle being a different random, class-stratified split of the training cohort. We optimized the following parameters during the grid search: color normalization of the images entering the feature extractor, bag size, batch size, and learning rate (shown in online suppl. Table 1; for all online suppl. material, see https://doi.org/10.1159/000545098). Of the resulting 1,800 models, we chose the best-performing hyperparameter set by comparing the average area under the receiver operating characteristic curve (AUROC) within each 5-fold cross-validation. The chosen five models were then deployed on the held-out test set and together constituted our combined model (shown in Fig. 2b). Similar to the averaging technique in ensemble learning, our combined model averaged the prediction probabilities of its sub-models and used this prediction score as the final basis for decision-making.
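A minimal sketch of this ensembling step is given below. It assumes each fold model maps a bag of 2,048-dimensional feature vectors to a malignancy probability; this interface, the function names, and the default threshold handling are assumptions for illustration rather than the authors’ code.

```python
# Minimal sketch of the "combined model": the five models of the best
# cross-validation cycle score the same bag and their probabilities are averaged.
import torch


@torch.no_grad()
def combined_prediction(fold_models, bag_features: torch.Tensor) -> float:
    """Average the malignancy probabilities of the five cross-validation models."""
    scores = []
    for model in fold_models:
        model.eval()
        scores.append(float(model(bag_features)))   # bag_features: (n, 2048)
    return sum(scores) / len(scores)


def classify(score: float, threshold: float = 0.46) -> str:
    """Scores at or below the threshold are classified as benign, above as malignant."""
    return "malignant" if score > threshold else "benign"
```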
Deep Learning Workflow
To process these bags of images, we built on a publicly available DL pipeline which was originally designed to be used with histopathological imaging data (https://github.com/KatherLab/marugoto) [15]. We made several modifications in order to adapt its functionality to sonography data (code available at https://github.com/DeepLiverLab/marusono).
The extracted frames were normalized, either with dataset-specific statistics or using the ImageNET-based standard transform, resized, and then input to a feature extraction network [20]. We also automatically cropped the images in order to remove superfluous information added by the ultrasound device. We used a ResNet-50 (convolutional neural network) pretrained on ImageNET as the feature extractor in this work to distill the information on each patient down from n × 224 × 224 × 3 to n × 2,048 numbers (n is the number of frames and images for the given patient; 224 × 224 × 3 is the image size accepted by ResNet-50). These bags of n × 2,048 numbers were then used as input to the aMIL model (see Fig. 2a). Training started from a randomized state, used the Adam optimizer, and proceeded for up to 100 epochs with an early stopping mechanism that ended training if the validation loss did not improve for 10 epochs.
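The sketch below illustrates this general workflow: ImageNET-pretrained ResNet-50 features followed by an attention-based MIL head in the spirit of Ilse et al. [14]. It uses PyTorch/torchvision with simplified layer sizes and is not the authors’ implementation.

```python
# Minimal sketch of feature extraction + attention-based MIL pooling (assumed design).
import torch
import torch.nn as nn
from torchvision import models

# Feature extractor: ResNet-50 without its classification head -> 2048-d vectors.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
feature_extractor = nn.Sequential(*list(resnet.children())[:-1])  # output (n, 2048, 1, 1)
feature_extractor.eval()


class AttentionMIL(nn.Module):
    """Attention-based MIL head: scores a whole bag of n x 2048 feature vectors."""

    def __init__(self, in_dim: int = 2048, hidden: int = 256):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(in_dim, 1)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (n, 2048); attention weights sum to 1 over the images of the bag
        a = torch.softmax(self.attention(bag), dim=0)   # (n, 1)
        pooled = (a * bag).sum(dim=0)                   # (2048,) bag-level representation
        return torch.sigmoid(self.classifier(pooled))   # malignancy probability


with torch.no_grad():
    images = torch.rand(16, 3, 224, 224)                # dummy bag of 16 frames
    feats = feature_extractor(images).flatten(1)        # (16, 2048)
    prob_malignant = AttentionMIL()(feats)
```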
Performance Evaluation, Visualization, Explainability, and Hardware
We primarily evaluated the performance of our models using the AUROC. For the chosen combined model, we further calculated accuracy, sensitivity, specificity, F1-Score, and average precision using the optimal threshold for the prediction score, deduced from the cross-validation results. We also explored the effect of different decision thresholds of our binary classifier, chosen to increase sensitivity, on these metrics. The threshold is the decision cut-off for the final prediction score of our model: a decimal number between 0 and 1, with all prediction scores at or below the threshold yielding a final classification as benign and all above as malignant. We plotted the receiver operating characteristic (ROC) curves for each fold within the best cross-validation cycle. For the performance evaluation of the combined model on the held-out test set, we plotted the ROC curve together with its 95% confidence interval (95% CI), obtained through bootstrapping, and the precision-recall curve (https://github.com/KatherLab/wanshi-utils).
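A sketch of this evaluation is shown below, assuming per-patient prediction scores and binary labels (1 = malignant) as NumPy arrays. The percentile bootstrap for the AUROC confidence interval is one common choice and may differ in detail from the referenced utilities.

```python
# Minimal sketch of the evaluation metrics and a bootstrapped AUROC confidence interval.
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             recall_score, roc_auc_score)


def evaluate(y_true: np.ndarray, scores: np.ndarray, threshold: float = 0.46) -> dict:
    y_pred = (scores > threshold).astype(int)            # > threshold -> malignant
    return {
        "auroc": roc_auc_score(y_true, scores),
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),      # recall of the malignant class
        "specificity": recall_score(y_true, y_pred, pos_label=0),
        "f1": f1_score(y_true, y_pred),
        "average_precision": average_precision_score(y_true, scores),
    }


def bootstrap_auroc_ci(y_true, scores, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap 95% CI for the AUROC (one possible implementation)."""
    rng = np.random.default_rng(seed)
    aucs = []
    n = len(y_true)
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:               # resample must contain both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    return np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```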
In an attempt to gain insight into our model’s decision-making, we chose the 5 patients with the highest bag label scores of each lesion type (HCC, metastasis, FNH, hemangioma) from the test set. Subsequently, we calculated the prediction score and attention together with their weighted score (prediction score × attention) for each image. We did this by forwarding every feature vector (representing an image) for each of these 20 patients through our combined model. We then plotted the calculated values sorted by relative examination time. For example, for a patient’s bag containing frames of two CEUS examinations plus additional images taken during each phase of the examination, the order would be the following: arterial phase images, arterial phase frames from exam one, arterial phase frames from exam two, portal venous phase images, portal venous phase frames from exam one, portal venous phase frames from exam two, late phase images, late phase frames from exam one, late phase frames from exam two.
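Conceptually, these per-image values can be read out as in the following sketch, which assumes an aMIL head with separate attention and classifier sub-modules (as in the earlier sketch). The interface and the shift of the score range to −0.5…0.5 are our own illustrative choices, not the authors’ API.

```python
# Minimal sketch of the per-image explainability values: attention, prediction
# score, and their product (weighted score) for every image in one patient's bag.
import torch


@torch.no_grad()
def per_image_relevance(model, bag: torch.Tensor):
    """Return attention, prediction score, and weighted score per image of a bag."""
    attention = torch.softmax(model.attention(bag), dim=0).squeeze(1)   # (n,), 0..1
    # shift sigmoid output so 0 means "unsure", -0.5 "definitely benign",
    # +0.5 "definitely malignant"
    score = torch.sigmoid(model.classifier(bag)).squeeze(1) - 0.5       # (n,)
    weighted = attention * score                                         # relevance per image
    return attention, score, weighted
```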
Next, we extracted the 30 images with the highest (for malignant cases) or lowest (for benign cases) weighted scores. These pictures together with the plots were then reviewed and analyzed by an expert sonographer (M.K.). In the same fashion, we examined the misclassified cases, where we extracted the 30 images with the highest (for cases misclassified as malignant) and lowest (for cases misclassified as benign) weighted scores.
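Selecting these most relevant images amounts to taking the images with the most extreme weighted scores, for example as sketched below (reusing the hypothetical `per_image_relevance` output from above).

```python
# Minimal sketch: indices of the k most relevant images of a bag by weighted score
# (highest for predicted-malignant cases, lowest for predicted-benign cases).
import torch


def top_images(weighted: torch.Tensor, predicted_malignant: bool, k: int = 30):
    k = min(k, weighted.numel())
    return torch.topk(weighted, k, largest=predicted_malignant).indices
```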
In all experiments, we used the Python version and packages required by the pipelines described above. All experiments were conducted on a Dell Precision 7920 workstation running Fedora 36 with an Intel Xeon Gold 5217 × 16 processor, 128 GB RAM, and an NVIDIA Quadro RTX 8000 with 48 GB VRAM.
Results
Patient Characteristics
After applying the inclusion and exclusion criteria, we included 370 patients, whom we then split into training and test sets (see Fig. 1). Within our training cohort, 162 patients were diagnosed with a benign and 136 with a malignant lesion, while in our test cohort 39 patients had a benign and 33 a malignant liver tumor. We retrieved a total of 955,938 images for all our patients, with a median of 1,944 (interquartile range 2,182) images per patient (see Table 1).
Characteristics of the dataset
| Parameter | All patients | Training set | Test set |
| --- | --- | --- | --- |
| Patients, n | 370 | 298 | 72 |
| Images, n (median, IQR) | 955,938 (1,944, 2,182) | 767,798 (1,938, 2,079) | 188,140 (1,975, 2,596) |
| Benign tumors, % | 54.3 | 54.4 | 54 |
| Malignant tumors, % | 45.7 | 45.6 | 46 |
IQR, interquartile range.
Weakly Supervised Models Recognize Malignancy in CEUS Examinations
We performed 5-fold cross-validation to train our weakly supervised aMIL model using features of images extracted from videos of CEUS examinations. The cross-validation cycle with the highest average AUROC of 0.844 is shown in Figure 3a and used the following hyperparameter settings: color normalization within the dataset, bag size 512, batch size 128, and learning rate 0.001. The combined model was then assembled from these sub-models and deployed on the test set. From this cross-validation cycle, an optimal threshold of 0.46 was calculated to maximize the prediction accuracy. For this threshold, the overall accuracy, sensitivity, specificity, and F1-Score (for malignancy) in the cross-validation were 77.9%, 77.2%, 78.4%, and 0.76, respectively. When exploring the effects of changing the threshold, we found that a threshold of 0.33 resulted in values of 74.8%, 89.7%, 62.3%, and 0.77, respectively (see online suppl. Table 2).
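Such an accuracy-maximizing threshold can be obtained, for example, by scanning the pooled cross-validation prediction scores; the sketch below illustrates this idea (variable and function names are hypothetical and this is not the authors’ code).

```python
# Minimal sketch: choose the decision threshold that maximizes overall accuracy
# on pooled cross-validation predictions (labels: 1 = malignant).
import numpy as np


def optimal_accuracy_threshold(y_true: np.ndarray, scores: np.ndarray) -> float:
    candidates = np.unique(scores)
    accuracies = [((scores > t).astype(int) == y_true).mean() for t in candidates]
    return float(candidates[int(np.argmax(accuracies))])
```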
Weakly supervised attention-based multiple instance learning is able to diagnose malignant liver lesions. Performance measures of the trained model predicting the malignant class are depicted, showing receiver operating characteristic (ROC) curves with their area under the curve (AUC) for (a) the best cycle of 5-fold cross-validation within training and (b) the combined model on the held-out test set with 95% confidence interval. The red dashed lines show an AUC of 0.5, equivalent to random guessing. c Precision-recall curve, visualizing the trade-off between sensitivity (recall) and positive predictive value (precision) for different thresholds of the prediction score when deploying the combined model on the test set, together with its average precision (AP). The black dashed line (chance level) shows the accuracy the model would achieve if it constantly predicted benign. d Associated confusion matrix for the deployment on the test set (n = 72).
Weakly Supervised Model Generalizes to Held-Out Patient Cohort
We deployed the combined model on the held-out test set; it averages the sub-model prediction scores at the patient level to produce the final classification score (as shown in Fig. 2b). In this way, we achieved an AUROC of 0.94 (95% CI: 0.89–0.99) (shown in Fig. 3b). The average precision was 0.94 (shown in Fig. 3c). The accuracy, sensitivity, specificity, and F1-Score of the combined model in finding malignant lesions, with the optimal threshold of 0.46, were 80.0%, 81.8%, 84.6%, and 0.81, respectively. The corresponding confusion matrix is shown in Figure 3d. Our findings reveal the very good performance of the final classification model and show the potential of our weakly supervised DL approach for CEUS examinations of the liver when evaluating FLLs’ malignancy. We also applied the thresholds explored during cross-validation to the test set (see online suppl. Table 2) and found, as before, that a threshold of 0.33 increases the sensitivity (to >95%). Interestingly, this also resulted in a higher accuracy and F1-Score than the “optimal” threshold: the accuracy, sensitivity, specificity, and F1-Score were 84.5%, 96.9%, 71.7%, and 0.842, respectively. This demonstrates how our model could be further tuned toward higher sensitivity without sacrificing accuracy.
Explainability Allows Insight into Weakly Supervised Model’s Decisions
In order to understand our model’s decision-making, we developed an explainability approach to plot the final model’s attention and scoring together with its weighted score for each image (shown in Fig. 4, 5). In these plots, attention varied between 0 and 1 for low and high attention, respectively; the (weighted) score varied between −0.5 and 0.5 for “definitely benign” and “definitely malignant” diagnoses, with 0 representing “unsure.” This allowed us to highlight the images with the highest individual weighted scores and thus the highest relevance for the model’s decision. Additionally, we show the most relevant images and link them to their position in the plot. In Figures 4 and 5, we showcase 4 correctly classified patients with different liver lesions which yielded the highest patient-level prediction scores for their corresponding class. As can be seen in Figure 4a (hemangioma), the model gave high attention to the arterial phase. The corresponding images were captured at time points, and exhibited features, that the expert sonographer considered indicative of a benign lesion. In Figure 4b (FNH), we found that the model gave most relevance to parts of the portal venous and late phases where the B-mode image was shown adjacent to the CEUS image. Additionally, we found a high weighted score on a so-called flash image, which is frequently acquired when FNH is suspected and thus is a relevant, yet artificial, indicator of a benign case. In the analysis of misclassified cases, we found that such flashes appeared among the most relevant frames in 3 out of 6 cases classified as benign despite being malignant lesions. Among the cases misclassified as benign, 2 were metastases and 4 were cases of HCC. For the exemplified case of metastasis shown in online supplementary Figure 1a, no distinct pattern in the model’s attention toward specific examination phases was identified. However, we noted a highly relevant frame that could be mistaken for an arterial phase image of a hemangioma. Nonetheless, an expert rationale for the model’s decision could not be determined for every frame deemed highly relevant by the model.
Attempting to uncover the combined model’s basis of decision-making for benign cases. The explainability method is shown for benign lesions of patients with (a) hemangioma and (b) focal nodular hyperplasia (FNH) from the test set with the highest patient-level prediction scores for the correct class. The weighted score (red line), the multiplication of attention (blue line) and raw prediction score (light blue vertical bar), is plotted for each image in the patient’s examination as calculated by the model. Within the plots attention varies between 0 and 1 for low and high attention, respectively; the (weighted) score varies between −0.5 and 0.5 for “definitely benign” and “definitely malignant” diagnoses, with 0 representing “unsure.” The calculated results are sorted on the x-axis in the following order: arterial phase (art), portal venous (pv), late phase; standalone images are placed before those extracted from videos. Gray dashed lines indicate changes from one video to another. Adjacent: the four most important images as decided by the model, numbers indicate chronologically labeled location on the plot axis. For both cases, the network is scoring nearly every image as benign, evidenced by the amount of light blue bars, forming an area between 0 and −0.5. The images which the model defined as most likely to be benign (i.e., highest negative weighted scores) show relevant diagnostic features of the respective focal liver lesion.
Attempting to uncover the combined model’s basis of decision-making for malignant cases. The explainability method is shown for malignant lesions of patients with (a) metastasis and (b) hepatocellular carcinoma (HCC) from the test set with the highest patient-level prediction scores for the correct class. The weighted score (red line), the multiplication of attention (blue line) and raw prediction score (light blue vertical bar), is plotted for each image in the patient’s examination as calculated by the model. Within the plots attention varies between 0 and 1 for low and high attention, respectively; the (weighted) score varies between −0.5 and 0.5 for “definitely benign” and “definitely malignant” diagnoses, with 0 representing “unsure.” The calculated results are sorted on the x-axis in the following order: arterial phase (art), portal venous (pv), late phase; standalone images are placed before those extracted from videos. Gray dashed lines indicate changes from one video to another. Adjacent: the four most important images as decided by the model, numbers indicate chronologically labeled location on the plot axis. In (a) the model focuses on arterial and late phases with the highest scoring images clearly depicting multiple metastatic lesions. In (b) the model only rated a few pictures from the arterial phase highly while mainly focussing on images from the very late phase where B-mode information was available and at times when the contrast agent had exited the liver.
In Figure 5a (metastasis), we observed overall higher weighted scores during the arterial and late phases with the highest scoring images clearly depicting multiple metastatic lesions. In Figure 5b (HCC), the model only rated a few pictures from the arterial phase highly. The model mainly focussed on images from the very late phase with the weighted score being close to 0 but positive for most of the patient data. For this case of HCC, the model most likely based its decision on images where B-mode information was also available and at times when the contrast agent had exited the liver (see Fig. 5b). Analyzing cases misclassified as malignant, we found that the majority (5 out of 6) were true hemangiomas. This misclassification may partly result from artifacts introduced by our minimally supervised approach and preselection process. As illustrated in online supplementary Figure 1b, the model’s weighted scores leaned correctly toward benign (with high prediction scores and moderate attention scores) during the arterial phase. However, in the late phase, the weighted scores inverted toward malignant (with moderate prediction scores and higher attention scores), likely due to confusion caused by an artifact generated when the ultrasound probe was removed, but the video continued recording.
Discussion
In recent years, developments in the field of AI have revolutionized medical imaging, with FDA-approved applications in radiology, endoscopy, and histopathology [21‒23]. CEUS is a widely available and noninvasive diagnostic tool for the diagnosis of FLLs. Nevertheless, many years of training are needed to become an expert in this field, resulting in notable interobserver variability [24, 25]. AI could potentially help less experienced examiners during their training as well as experts in their decision-making.
The present feasibility study explored – to our knowledge – for the first time the potential of a weakly supervised aMIL algorithm for the evaluation of malignancy of FLLs examined during CEUS. Its great advantage when compared to formerly applied AI methodology is the low level of human data preprocessing needed, which reduces costs and time during training. Our approach showed good performance during cross-validation in the training set and achieved an AUROC of 0.94 and an average precision of 0.94 on the test set. This performance is similar to previous studies using different approaches. Hu et al. [12] used a modified ResNet network for classification and attained AUROC values of up to 0.934. Multiple studies applied classical machine learning algorithms, e.g., support vector machines, random forests, and k-nearest neighbors, yielding equally good performances [26‒28]. A relevant difference between our approach and those of the previously cited articles is that they relied on time- and cost-intensive hand-picking of relevant images and annotation of regions of interest to achieve such performance. Our approach learned where to focus, both in time and within images, removing the need for expert input in data curation and making it much easier to train such a model on large datasets. Nevertheless, in our explainability analyses we found that our model was apparently confused by artifacts introduced into the cohort by this minimal-supervision approach. While we believe that these artifacts could become irrelevant in a sufficiently large training cohort, our approach needed only a relatively small training set to reach very good AUROC values compared to recent studies using thousands of patients’ examinations [29, 30]. The accuracy (using the optimal threshold of 0.46) of 80% is good, but would leave around 2 out of 10 patients falsely classified, considering a sensitivity for malignant lesions of 81%. Nevertheless, when studying different classifier thresholds to maximize the sensitivity to malignancy, we found that our model misclassified just 1 out of 33 malignant FLLs, yielding a sensitivity of 96.9% with an increase in accuracy to 84.5% for both classes. For comparison, a recent study compared the accuracy of humans evaluating malignancy with that of their model. They found that the two human experts achieved accuracies of 83.0% and 90.1%, while the nonexperts achieved 79.7%, 67.0%, and 79.7%. Their so-called 9-input model, trained on 163 patients, gave an accuracy of 85.2% on their test set of 18 patients [25]. We believe that our model, evaluated on a considerably larger test cohort, compares favorably with their model and four of their five human readers. This underlines the hypothesis that AI support systems could help nonexperts in their decision-making in the future.
Despite yielding high performance, this proof-of-concept study has some relevant limitations. First, as the test set came from the same cohort as the training data, there is a significant chance of overfitting to the training data. Nevertheless, nearly all previously published studies applying AI to CEUS examinations of the liver suffer from this shortcoming [31]. This might be due to the difficulty of obtaining datasets that require high levels of expert knowledge. In future research, the robustness of the model could be improved by a multicentric study protocol. Second, all data were derived from examinations carried out by one physician who consistently used the same ultrasound device. While we do not see a disadvantage in this for the proof of concept, i.e., classifying lesions without annotating regions of interest, we acknowledge that the model’s generalization capacity might, therefore, be limited. It is possible that our model has learned to focus on characteristics which are specific to the given machine and examiner, such as the timing and number of flashes for certain diagnostic entities. These obstacles could also be overcome in the future by using multicenter datasets. Third, DL algorithms are by nature so-called black boxes due to the near impossibility of deciphering the mathematical equations linking input to output. In order to reduce this deficiency, we developed a new way to visualize the decision basis of our model. Our plots of attention and score contributions provide strong evidence that the model recognized important aspects of the patients’ videos. Our automated cropping of superfluous information from the images may also have introduced some distortion or even removed information from the images. Alternative methods will be investigated in future work.
Conclusion
For the first time, we have shown that a weakly supervised aMIL algorithm can be successfully applied to CEUS examinations of FLLs. As this method obviates the need for exhaustive human preprocessing of data, our proof of concept paves the way for model development on large patient cohorts, which has previously been prohibitively expensive in the CEUS domain.
Our internal validation of the model evaluating malignancy in FLLs yielded promising results, evidenced by an excellent AUROC and precision. Additionally, it showed robust accuracy and made seemingly reasonable decisions, as revealed by our explainability method. To advance this work, future studies should test the proposed approach, together with the resulting models, on more heterogeneous and diverse datasets to overcome the limitations identified above. We hope to inspire AI-based studies prospectively validating models in a multicenter setting and thereby creating broadly available, secure, and reliable AI-driven healthcare applications.
Acknowledgments
Parts of the figures were generated using and modifying artwork from Servier Medical Art (https://smart.servier.com/) which is licensed under Creative Commons Attribution 4.0 (https://creativecommons.org/licenses/by/4.0/).
Statement of Ethics
This study was carried out in accordance with the Declaration of Helsinki. We received confirmation from the Ethics Committee of the Heinrich Heine University Düsseldorf on 19 July 2021 and 3 November 2023 (Application numbers 2021-1443, 2021-1443_1). Written informed consent from participants was not required for the study presented in this article in accordance with local guidelines.
Conflict of Interest Statement
J.N.K. declares consulting services for Owkin, France, DoMore Diagnostics, Norway, Panakeia, UK, Scailyte, Switzerland, Cancilico, Germany, Mindpeak, Germany, MultiplexDx, Slovakia, and Histofy, UK; furthermore, he holds shares in StratifAI GmbH, Germany, has received a research grant by GSK, and has received honoraria by AstraZeneca, Bayer, Eisai, Janssen, MSD, BMS, Roche, Pfizer and Fresenius. T.L. declares consulting fees from AstraZeneca, BMS, EISAI, Incyte, MSD, Roche, HepaRegeniX and honorary talks and travel support from Abbvie and Gilead. M.K. declares honorary talks and travel support from Bracco Imaging and Canon Medical, furthermore he received a research grant by Bracco Imaging. M.G. has received honoraria for lectures sponsored by Techniker Krankenkasse (TK). No other conflicts of interest are declared by any of the authors.
Funding Sources
J.N.K. and T.L. are supported by the German Federal Ministry of Health (DEEP LIVER, ZMVI1-2520DAT111) and the German Cancer Aid (DECADE, 70115166). J.N.K. is supported by the German Federal Ministry of Education and Research (PEARL, 01KD2104C; CAMINO, 01EO2101; SWAG, 01KD2215A; TRANSFORM LIVER, 031L0312A; TANGERINE, 01KT2302 through ERA-NET Transcan), the German Academic Exchange Service (SECAI, 57616814), the German Federal Joint Committee (TransplantKI, 01VSF21048), the European Union’s Horizon Europe and innovation programme (ODELIA, 101057091; GENIAL, 101096312), the European Research Council (ERC; NADIR, 101114631), and the National Institute for Health and Care Research (NIHR, NIHR213331) Leeds Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. This work was funded by the European Union. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them. T.L. is supported by the European Research Council (ERC; Consolidator Grant No. 771083) and the German Federal Ministry of Education and Research (TRANSFORM LIVER, 031L0312A). The laboratory of T.L. was further funded by the German Cancer Aid (Deutsche Krebshilfe – 110043), the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 403224013, 279874820, 461704932, 440603844, and support from the Medical Faculty of the Heinrich Heine University. J.A.B. is supported by the MODS project funded from the programme “Profilbildung 2020” [Grant No. PROFILNRW-2020–107-A], an initiative of the Ministry of Culture and Science of the State of North Rhine-Westphalia. A.O. was supported by a PhD bursary of the Deutsche Gesellschaft für Gastroenterologie, Verdauungs- und Stoffwechselkrankheiten (DGVS) and by the Studienstiftung des Deutschen Volkes eV through his regular scholarship. The funders had no role in the design, data collection, data analysis, and reporting of this study.
Author Contributions
CRediT author statement: conceptualization: J.A.B., J.N.K., M.K., T.L., A.O., and T.P.S.; methodology: J.A.B., M.G., J.N.K., M.K., A.O., O.L.S., T.P.S., and M.T.; software: J.A.B., A.O., T.P.S., and M.T.; analysis, visualization, and writing – original draft: J.A.B., M.K., A.O., and T.P.S.; investigation: J.B., Y.D., M.K., U.M., and A.O.; resources and supervision: J.N.K., M.K., T.L., and T.P.S.; data curation: M.K. and A.O.; writing – review and editing: all authors; project administration and funding acquisition: J.N.K. and T.L.
Additional Information
Jakob Nikolas Kather, Tobias Paul Seraphin, and Michael Kallenbach contributed equally to this work.
Data Availability Statement
The data supporting the findings of this study are not publicly available due to reasons of sensitivity but are available from the corresponding authors T.P.S. and T.L. upon reasonable request.