Introduction: In this retrospective cohort study, we aimed to evaluate the performance, and gain insight into the behavior, of an artificial intelligence (AI) algorithm in detecting retinal fluid in spectral-domain OCT volume scans from a large cohort of patients with neovascular age-related macular degeneration (AMD) and diabetic macular edema (DME). Methods: A total of 3,981 OCT volumes from 374 patients with AMD and 11,501 OCT volumes from 811 patients with DME were acquired with a Heidelberg Spectralis OCT device (Heidelberg Engineering Inc., Heidelberg, Germany) between 2013 and 2021. Each OCT volume was annotated for the presence or absence of intraretinal fluid (IRF) and subretinal fluid (SRF) by masked reading center graders (ground truth). We evaluated the performance of a previously published AI algorithm for detecting IRF and SRF separately, and of a combined fluid detector (IRF and/or SRF), on the same OCT volumes. We analyzed the sources of disagreement between annotation and prediction and their relationship to central retinal thickness. We computed the mean areas under the curves (AUC) and under the precision-recall curves (AP), accuracy, sensitivity, specificity, and precision. Results: The AUC for IRF was 0.92 and 0.98, for SRF 0.98 and 0.99, in the AMD and DME cohort, respectively. The AP for IRF was 0.89 and 1.00, for SRF 0.97 and 0.93, in the AMD and DME cohort, respectively. The accuracy, specificity, and sensitivity for IRF were 0.87, 0.88, 0.84, and 0.93, 0.95, 0.93, and for SRF 0.93, 0.93, 0.93, and 0.95, 0.95, 0.95 in the AMD and DME cohort, respectively. For detecting any fluid, the AUC was 0.95 and 0.98, and the accuracy, specificity, and sensitivity were 0.89, 0.93, and 0.90 and 0.95, 0.88, and 0.93, in the AMD and DME cohort, respectively. False positives occurred in the presence of retinal shadow artifacts and strong retinal deformation. False negatives were due to small hyporeflective areas in combination with poor image quality. The combined detector correctly predicted more OCT volumes than the single detectors for IRF and SRF: 89.0% versus 81.6% in the AMD cohort and 93.1% versus 88.6% in the DME cohort. Discussion/Conclusion: The AI-based fluid detector achieves high performance for retinal fluid detection in a very large dataset dedicated to AMD and DME. Combining single detectors provides better fluid detection accuracy than considering the single detectors separately. The observed independence of the single detectors ensures that the detectors learned features particular to IRF and SRF.

The advent of high-resolution optical coherence tomography (OCT) has led to the identification of biomarkers in diabetic retinopathy with diabetic macular edema (DME) and neovascular age-related macular degeneration (AMD), delivering a vast amount of morphological information that could be used to individualize treatment decisions [1‒3]. In neovascular AMD, subretinal fluid (SRF) is associated with a better visual outcome and a lower rate of transformation into geographic atrophy, while intraretinal fluid (IRF) is associated with a poorer visual outcome and increasing rates of macular atrophy [4, 5]. In DME, the size and location of the macular cystoid spaces are predictive of functional outcomes, with large cysts (>200 μm) located in the outer nuclear layer being associated with a poorer visual prognosis [6, 7].

However, the human resources and expertise needed to assess these images are strained by the increasing number of patients requiring OCT examinations for optimal management of macular diseases [5, 8]. In addition, even in the pivotal clinical trials, discrepant qualitative assessment of fluid between ophthalmologists and certified reading center graders has been reported, leading to a significant number of missed treatments [9, 10].

Machine learning has already shown immense, cost-effective potential to assist ophthalmologists in clinical routine, for example, by detecting biological markers in retinal OCT scans and predicting the treatment demand for a given patient [11‒14]. Most of these works evaluated their proposed methods on controlled data, usually a single dataset obtained from the same center, a single clinical trial, or OCT scans with a single disease, selected manually by a specialist [11‒13, 15‒17]. However, this promising technology requires a more rigorous assessment of its performance and robustness, including evaluation on larger datasets and an understanding of the discrepancies between algorithmic predictions and human annotations.

The aim of this study was to evaluate such a method more thoroughly and to gain a clearer view of its practical capability. To this end, our evaluation relied on a large set of OCTs and two groups of pathologies and was intended to verify that the detectors of IRF and SRF learned features specific to IRF and SRF, respectively. Along with reporting the general performance, we analyzed the performance according to central subfield thickness (CST) and identified the most important cases where the algorithm fails. The assessed method is a CE-marked, deep-learning-based biomarker classifier capable of identifying the presence of SRF and IRF in OCT volumes; we compared its predictions with reading center grader annotations in a systematic way.

Study Cohort and Evaluation Data

The evaluation was performed using completely anonymized OCT volumes from two cohorts of patients, one with exudative AMD and one with DME. We refer to them as the AMD and DME cohorts. All patients were initially treatment naïve, received anti-VEGF treatment, and were treated and monitored between October 2013 and June 2021.

We extracted anonymized OCT volumes of these patients from the database of the Ophthalmology Department at the University Hospital of Bern, Bern, Switzerland. The volumes had been acquired with a Spectralis SD-OCT imaging system (Heidelberg Engineering Inc., Heidelberg, Germany), and grader annotations were available for all of them. Our study focuses on OCT volumes acquired with a 49 B-scan protocol and a resolution of 496 × 512 pixels per B-scan, corresponding to a volume of 5.90 mm × 5.75 mm × 1.92 mm centered on the fovea. For each horizontal scan, 9 B-scans were averaged. We did not exclude patients on the basis of poor image quality.

In the AMD cohort, we collected 3,981 OCT volume scans corresponding to 374 patients (748 eyes) for SRF and 3,618 OCT volume scans from 371 patients (742 eyes) for IRF. In the DME cohort, we collected 11,501 OCT volume scans from 809 patients (1,618 eyes) for SRF and 11,499 OCT volume scans related to 811 patients (1,620 eyes) for IRF. The discrepancies in the number of OCT volume scans for IRF and SRF are explained by the fact that only the annotated ones were taken into consideration, ungradable OCT volume scans being excluded.

Annotation Protocol – Reading Center Assessments

A central reading center (Bern Photographic Reading Center, Inselspital, University Hospital Bern, Department of Ophthalmology, Bern, Switzerland) performed an independent, masked review of all OCT volume scans. For each of the abovementioned cohorts, a grading protocol and grading form were used, containing, for instance, the definitions of morphological changes/biomarkers. Each OCT volume scan was graded independently by two certified graders without access to each other’s grading. In case of disagreement between the two graders’ annotations, a third evaluation by a senior grader was performed. Details of the annotation protocols and biomarker definitions are available as online supplementary material (see www.karger.com/doi/10.1159/000527345 for all online suppl. material). Figure 1 reports the distribution of present and absent biomarkers in both cohorts, as annotated by the graders of the central reading center.

Fig. 1.

Distribution of intraretinal fluid (IRF) and subretinal fluid (SRF) in the age-related macular degeneration (AMD) and diabetic macular edema (DME) cohorts, as annotated by the experts of the Bern Photographic Reading Center (BPRC). The “n” value indicates the total number of annotated (i.e., present/absent) OCT volume scans for IRF and SRF in both cohorts. In the DME cohort, we included 11,499 and 11,501 annotated OCT volume scans for IRF and SRF, respectively. In the AMD cohort, we included 3,618 and 3,981 annotated OCT volume scans for IRF and SRF, respectively.

Automatic System for Biomarker Detection

The biomarker detector we evaluated in this study is the CE-marked Discovery® OCT Biomarker Detector (RetinAI AG, Switzerland), described by Kurmann et al. [12]. We used the trained model from Kurmann et al. [12] but evaluated it on a different dataset, described earlier. This model outputs the presence probability of each of a set of 10 biomarkers at the B-scan level. These include SRF, IRF, hyperreflective foci, drusen, reticular pseudodrusen, epiretinal membrane, geographic atrophy, outer retinal atrophy, fibrovascular pigment epithelial detachment, and healthy B-scans (defined by the absence of all the listed biomarkers). The detector relies on a convolutional neural network (CNN) architecture and was trained in a supervised way on B-scans for which the presence of these 10 biomarkers was manually annotated. A total of 23,030 annotated B-scans were used to train this architecture, and 1,029 B-scans were used for the performance evaluation reported by the authors. Training pathologies included early, intermediate, and late AMD, as well as diabetic retinopathy but without DME. It is important to note that none of the B-scans used for model development, i.e., training and evaluation, are included in the present study. We refer the readers to the paper for the description of these biomarkers, the annotation protocol, and the distribution of each biomarker in the training and testing set [12]. While Kurmann et al. [12] evaluated the detection of 10 biomarkers at the B-scan level, we evaluated the detection of IRF and SRF at the volume level. This is because the 15,482 OCT volumes used in our work are annotated at the volume level for IRF and SRF, as explained in the previous section. As the biomarker detector gives the presence probability at the B-scan level, we used the maximum presence probability of each biomarker across the whole volume as the final predicted presence probability for the OCT volume.
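The volume-level aggregation described above can be sketched in a few lines. This is only an illustration of the max rule stated in the text, not the vendor's implementation; the function name `volume_probability` and the example probabilities are hypothetical.

```python
from typing import Sequence


def volume_probability(bscan_probs: Sequence[float]) -> float:
    """Volume-level presence probability: the maximum over all B-scan probabilities."""
    if not bscan_probs:
        raise ValueError("an OCT volume must contain at least one B-scan")
    return max(bscan_probs)


# Hypothetical IRF probabilities for one 49-B-scan volume in which
# fluid is confidently detected on a single B-scan only.
irf_probs = [0.02] * 48 + [0.97]
print(volume_probability(irf_probs))  # 0.97
```

With this rule, one confident B-scan is enough to flag the whole volume, which matches a volume-level annotation of "fluid present anywhere", but it also means that a single spurious B-scan prediction can turn the volume into a false positive.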

Evaluation

The evaluation was conducted by comparing the predictions of the biomarker detection model with the annotations performed by the BPRC, independently in both cohorts. First, the performance of the model was measured using the area under the receiver operating characteristic (ROC) curve (AUC) with 95% confidence intervals (CIs), sensitivities, and specificities. ROC curves were computed separately for IRF and SRF. Second, we computed the precision-recall (PR) curves, which are useful when the two classes of a binary model are imbalanced.
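For readers who want to see what the two summary measures compute, the following is a minimal pure-Python sketch. It is not the code used in the study (which relied on scikit-learn); the function names and the toy data are hypothetical, and the confidence intervals are omitted because the text does not state how they were derived.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP as the sum over thresholds of (recall_k - recall_{k-1}) * precision_k."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    ap, tp, prev_recall, i = 0.0, 0, 0.0, 0
    while i < len(ranked):
        j = i
        # Group tied scores so that each distinct threshold is evaluated once.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            j += 1
        recall, precision = tp / n_pos, tp / j
        ap += (recall - prev_recall) * precision
        prev_recall, i = recall, j
    return ap


# Toy example: 1 = fluid annotated as present, 0 = absent.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.99, 0.90, 0.85, 0.60, 0.40, 0.10]
print(round(roc_auc(labels, scores), 4))            # 0.8889
print(round(average_precision(labels, scores), 4))  # 0.9167
```

Unlike the AUC, the AP never uses the count of true negatives, which is why it is the more informative measure when one class dominates.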

Third, we performed an analysis of the sources of disagreement between annotation and prediction. For this purpose, the false-positive (FP) and false-negative (FN) rates of each cohort and of each biomarker were calculated. The operating point of the predictive model was set to maximize sensitivity and specificity for each biomarker in both cohorts. Among the FP and FN samples, we also identified the main categories of the underlying errors. Some representative cases with strong disagreements between the annotation and the prediction were manually inspected and reported here. This aimed to identify the current challenges and evaluate whether these are inherent to the collected data, the acquisition protocol, or the evaluated system. Fourth, the retinal thickness of the FP and FN samples was compared to that of true negative (TN) and true positive (TP).
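One common way to choose an operating point that jointly maximizes sensitivity and specificity is Youden's J statistic (sensitivity + specificity − 1). The text does not name the exact criterion, so the sketch below is an assumption; `best_threshold` and the toy data are hypothetical, and a volume is called positive when its score is at or above the threshold.

```python
from typing import Sequence, Tuple


def best_threshold(labels: Sequence[int], scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) maximizing sens + spec - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1  # Youden's J (assumed criterion)
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]


# Toy example: 1 = fluid annotated as present, 0 = absent.
labels = [0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.40, 0.60, 0.85, 0.90, 0.99]
print(best_threshold(labels, scores))  # (0.6, 1.0, 0.75)
```

Because the threshold is tuned per cohort and per biomarker on the evaluation data itself, the resulting sensitivity and specificity describe the best achievable trade-off on that data rather than a pre-specified operating point.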

Lastly, we analyzed the relationship between the detectors of IRF and SRF and the performance of a joint detector of IRF and SRF, referred to as the fluid detector. We first aimed to identify any covariate between the two single detectors, i.e., another variable/factor that affects the detection of fluid and that appears only when IRF and SRF are considered jointly. Thus, we compared the final predictions, i.e., after setting an operating point, of the single detectors of IRF and SRF with those of the fluid detector. The operating point was set to maximize sensitivity and specificity. The output probability of the joint detector was set to the maximum of the output probabilities of the IRF and SRF detectors.
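The joint fluid detector defined above can be written down directly. The sketch below follows the max rule from the text for the prediction and the "IRF and/or SRF" definition for the reference label; the function names and example values are hypothetical.

```python
def fluid_probability(p_irf: float, p_srf: float) -> float:
    """Joint detector output: the larger of the IRF and SRF probabilities."""
    return max(p_irf, p_srf)


def fluid_annotation(irf_present: bool, srf_present: bool) -> bool:
    """Reference label for 'any fluid': IRF and/or SRF annotated as present."""
    return irf_present or srf_present


# Hypothetical volume: weak IRF evidence, strong SRF evidence, only SRF annotated.
print(fluid_probability(0.12, 0.91))   # 0.91
print(fluid_annotation(False, True))   # True
```

One consequence of this design is that an error on the fluid type does not count against the joint detector: a volume with SRF that is mistaken for IRF is still a correct "fluid present" call.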

We used Python 3 and its open-source package scikit-learn v0.20.3 for the ROC and PR implementation and the AUC/area under the precision-recall curve (AP) calculations. Statistical software was used to conduct data analysis (Prism 9.2.0; GraphPad Software, Inc., La Jolla, CA, USA, and IBM SPSS Statistics, version 21; SPSS Inc, Chicago, IL, USA). To compare retinal thickness values among the subgroups, we performed Welch’s unequal variance t test, as the groups differ in size and variance.
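As an illustration of why Welch's test suits groups of unequal size and variance, the sketch below computes the Welch t statistic and the Welch–Satterthwaite degrees of freedom using only the standard library. It is not the analysis code of the study (which used Prism and SPSS); `welch_t` and the sample values are hypothetical, and the p-value is omitted because the standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical values for two groups of different size and spread.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10, 12]
t, df = welch_t(group_a, group_b)
print(round(t, 3), round(df, 2))  # -2.376 6.97
```

In practice, a library routine such as SciPy's `ttest_ind` with `equal_var=False` performs the same test and also returns the p-value.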

Performance of IRF and SRF Detection

AMD Cohort

Predicted and annotated IRF and SRF presence correlated strongly in the AMD cohort. The AUC per OCT volume scan for IRF detection was 0.92 (CI = 0.91–0.93, n = 3,618), and the AUC for SRF detection per OCT volume scan was 0.98 (CI = 0.98, n = 3,981). These results are presented in Figure 2a and demonstrate the robust capability of the deep learning algorithm to accurately predict the presence of IRF and SRF among a significant number of OCT volume scans. The computed PR curves with the calculated APs are presented in Figure 2b and showed an AP per OCT volume scan of 0.97 (CI = 0.96–0.98) and 0.89 (CI = 0.87–0.90) for SRF and IRF, respectively.

Fig. 2.

Receiver operating characteristic (ROC) and precision-recall (PR) curves on detection performance of intraretinal fluid (IRF) and subretinal fluid (SRF). First row: Age-related macular degeneration (AMD). Second row: Diabetic macular edema (DME). a, c ROC curves. b, d PR curves. The area under the curve (AUC) and area under the precision-recall curve (AP) with the confidence intervals as measures of general performance are specified in parentheses.

DME Cohort

Predicted and annotated IRF and SRF also showed a high level of correlation in the DME cohort. The AUC for IRF (n = 11,499) and SRF (n = 11,501) per OCT volume scan was 0.98 (CI = 0.98–0.99) and 0.99 (CI = 0.98–0.99), respectively. These results are displayed in Figure 2c. The PR curves are presented in Figure 2d and supported our observations, with an AP per OCT volume scan for IRF of nearly 1.00. A modest decrease was found in the calculated AP for SRF, with an AP per OCT volume scan of 0.93 (CI = 0.91–0.94).

Annotation and Prediction Disagreement, Analysis of Representative Cases

Figure 3 summarizes the total number of FP, FN, TN, and TP samples, the ground truth being the grader annotation. The threshold value used to assign each OCT volume scan to the categories was calculated with the aim of maximizing sensitivity and specificity for each cohort and each biomarker. The threshold for IRF was 0.946 and 0.985 and for SRF 0.95 and 0.68 in the AMD and DME cohorts, respectively. For instance, an OCT volume scan in the AMD cohort with IRF annotated as present by the grader and with a prediction probability of 0.94 was classified as FN.
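The assignment of a volume to one of the four categories can be sketched as follows. The thresholds are the ones quoted in the text; the function name `outcome`, the dictionary layout, and the choice of "at or above the threshold means predicted present" are assumptions, since the text does not say how ties are handled.

```python
# Operating points quoted in the text, keyed by (cohort, biomarker).
THRESHOLDS = {
    ("AMD", "IRF"): 0.946,
    ("DME", "IRF"): 0.985,
    ("AMD", "SRF"): 0.95,
    ("DME", "SRF"): 0.68,
}


def outcome(annotated_present: bool, probability: float, threshold: float) -> str:
    """Classify one volume as TP, FN, FP, or TN against the grader annotation."""
    predicted_present = probability >= threshold  # tie handling is an assumption
    if annotated_present:
        return "TP" if predicted_present else "FN"
    return "FP" if predicted_present else "TN"


# The example from the text: AMD cohort, IRF annotated as present, probability 0.94.
print(outcome(True, 0.94, THRESHOLDS[("AMD", "IRF")]))  # FN
```

The example shows how high some of these operating points are: a predicted probability of 0.94 still counts as "absent" for IRF in the AMD cohort.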

Fig. 3.

Confusion matrix on detection performance (prediction, x-axis) of intraretinal fluid (IRF) and subretinal fluid (SRF), the samples divided into four categories based on the grader’s annotations (ground truth, y-axis); true negative (TN), false positive (FP), false negative (FN), and true positive (TP). First row: age-related macular degeneration (AMD). Second row: diabetic macular edema (DME). a, c IRF. b, d SRF. The threshold was set to maximize sensitivity and specificity for each group and each biomarker. The horizontal bottom bar gives the color code from the minimum (purple) to the maximum (yellow) sample numbers for the four categories.

To understand the sources of disagreement, 15% of FN and FP samples of each biomarker were randomly selected and manually analyzed. Figure 4 illustrates eight representative sources of discrepancies. In the AMD cohort, cases 1 and 3 present FP results for IRF and SRF, respectively. Cases 2 and 4 display FN results for IRF and SRF, respectively. Regarding the DME cohort, cases 5 and 7 show FP samples and cases 6 and 8 FN examples for IRF and SRF, respectively.

Fig. 4.

Eight OCT volumes with discrepancies between the annotation and prediction of intraretinal fluid (IRF) and subretinal fluid (SRF). The study cohort, evaluated fluid, grader annotation, prediction of the evaluated system with presence probability in parenthesis, and source of discrepancy are displayed above the OCT scan image. For each case, in the bottom left of the OCT scan image is the image index reported (out of 49 B-scans in a volume). The location of the error source is framed in red with a zoomed view. The computed presence probability for both detectors per OCT B-scan image over the whole OCT volume scan (total 49 images) is presented underneath the OCT scan image. The red box corresponds to the image index of the displayed OCT images. The presence probability is illustrated with a color code, ranging from purple (0%, absent) to yellow (100%, present).

Relation to CST of the Retina

We compared the mean CST of the FN OCT volume scans with that of the TPs, and the mean CST of the FPs with that of the TNs; the results are displayed in Table 1. CST was defined as the average retinal thickness in the central 1 mm ETDRS region, where the retina is defined as the space between the internal limiting membrane (ILM) and Bruch’s membrane.
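The CST definition above can be illustrated with a small sketch that averages a retinal thickness map inside the central 1 mm circle (radius 0.5 mm) around the fovea. This is not the measurement pipeline of the study: the function name, the thickness-map layout (rows as B-scans, columns as A-scans, values as ILM-to-Bruch's-membrane thickness in μm), the pixel spacings, and the toy values are all hypothetical.

```python
from typing import Sequence, Tuple


def central_subfield_thickness(
    thickness_um: Sequence[Sequence[float]],
    dx_mm: float,
    dy_mm: float,
    fovea_xy_mm: Tuple[float, float],
    radius_mm: float = 0.5,
) -> float:
    """Mean thickness of all map points within radius_mm of the fovea."""
    fx, fy = fovea_xy_mm
    inside = [
        t
        for row_idx, row in enumerate(thickness_um)
        for col_idx, t in enumerate(row)
        if (col_idx * dx_mm - fx) ** 2 + (row_idx * dy_mm - fy) ** 2 <= radius_mm ** 2
    ]
    if not inside:
        raise ValueError("no thickness samples fall inside the central subfield")
    return sum(inside) / len(inside)


# Toy 3 x 3 thickness map with 0.4 mm spacing, fovea at the center point.
toy_map = [
    [300, 310, 300],
    [320, 400, 320],
    [300, 310, 300],
]
print(central_subfield_thickness(toy_map, 0.4, 0.4, (0.4, 0.4)))  # 332.0
```

In the toy map, the four corner points lie about 0.57 mm from the fovea and are therefore excluded from the average.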

Table 1.

Relation between CST and classification


Independence of the Single Detectors and Analysis of a Combined Fluid Detector

Table 2 reports the number of OCTs correctly predicted (TPs and TNs) by the single detectors of IRF and SRF, by the combination of the two single detectors, and by the combined fluid detector, together with the accuracy, specificity, sensitivity, and precision. The proportion of OCT volume scans annotated with fluid (IRF and/or SRF) was 54.64% in the AMD cohort and 89.05% in the DME cohort.
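The four reported metrics follow directly from the confusion counts. The sketch below uses the standard definitions; the function name and the counts are hypothetical and are not taken from Table 2.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, and precision from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # correctly predicted / all
        "sensitivity": tp / (tp + fn),                # recall on annotated-present
        "specificity": tn / (tn + fp),                # recall on annotated-absent
        "precision": tp / (tp + fp),                  # correct among predicted-present
    }


# Hypothetical counts for illustration only.
metrics = detection_metrics(tp=80, fp=10, fn=20, tn=90)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.85, 'sensitivity': 0.8, 'specificity': 0.9, 'precision': 0.889}
```

The "number of OCTs correctly predicted" in the text corresponds to the numerator of the accuracy, i.e., TP + TN.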

Table 2.

Single fluid detector versus combined fluid detector


We note, in general, that there were very high percentages of correctly predicted samples across the different detectors, especially in the DME cohort. To check the independence of the two single detectors, we considered the product of the probabilities of correct detection by the IRF and SRF detectors (rows 1 and 2 in Table 2) and compared this product with the probability of correct detection of IRF and SRF simultaneously. We used the detector accuracy as the probability of correct detection. In the AMD cohort, we obtained a probability of 0.8684 × 0.9384 = 0.8149 of correct detection when considering the two single detectors separately and a probability of 0.8159 of correct detection by both single detectors simultaneously. Such a small difference supported the independence of the two single detectors.
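The independence check above is a comparison between a product of two accuracies and a jointly observed accuracy. The short sketch below reproduces the arithmetic with the values quoted in the text; the function name `independence_gap` is hypothetical.

```python
from typing import Tuple


def independence_gap(acc_a: float, acc_b: float, acc_joint: float) -> Tuple[float, float]:
    """Return (accuracy expected under independence, observed minus expected)."""
    expected = acc_a * acc_b
    return expected, acc_joint - expected


# Values quoted in the text: IRF accuracy, SRF accuracy, and the
# joint probability of correct detection.
expected, gap = independence_gap(0.8684, 0.9384, 0.8159)
print(f"{expected:.4f} {gap:.4f}")  # 0.8149 0.0010
```

If the two detectors tended to fail on the same volumes, the observed joint accuracy would sit clearly above the product; a gap of about 0.001 is what one would expect when their errors are largely unrelated.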

We report very good performance for fluid detection (IRF and/or SRF), with an AUC of 0.95 and 0.98 for AMD and DME, respectively, and an average precision of 0.97 and 1.00 for AMD and DME, respectively. In Table 2, we observe that the fluid detector presented a higher accuracy than the combination of the two single detectors. A more detailed analysis of the performance of the single detectors in comparison with the combined detector is presented in the online supplementary material.

IRF and SRF Detection

The presence or absence of pathological fluid in exudative retinal diseases is of great importance for evaluating disease activity. Our experimental results demonstrate that the evaluated deep learning algorithm can predict the presence of both major features of exudative disease activity, IRF and SRF, on a large dataset of OCT volumes of patients with AMD and DME with a very high degree of accuracy. In addition, since the testing data differ from the training data, this study constitutes an external validation of an artificial intelligence (AI)-based algorithm. These excellent results are supported by the computed AP (Fig. 2), which does not consider the TN class. However, we noticed a slight decrease in the AP of IRF (0.89) in the AMD cohort and of SRF (0.93) in the DME cohort as compared to their respective AUC. The accuracy, sensitivity, specificity, and precision (Table 2) for IRF were higher in the DME cohort (0.93, 0.93, 0.95, and 0.99) than in the AMD cohort (0.87, 0.84, 0.88, and 0.76). However, this trend was reversed for SRF, with sensitivity and precision of 0.93 and 0.88 in the AMD cohort versus 0.87 and 0.83 in the DME cohort. The accuracy and specificity for SRF in the AMD cohort remained slightly lower than in the DME cohort (0.95 and 0.93 vs. 0.95 and 0.95, respectively). The distribution of biomarker presence may explain these discrepancies, as the positive class was smaller for IRF in the AMD cohort and for SRF in the DME cohort than for SRF in the AMD cohort and IRF in the DME cohort, respectively (Fig. 1). An alternative explanation for the slightly lower performance for IRF in the AMD cohort is the relatively high FP rate of 8.18% and FN rate of 4.98% (Fig. 3a), suggesting a greater challenge for the evaluated system in discriminating IRF from degenerative small cystoid fluid or retinal shadow artifacts. 
A diffuse noncystoid fluid accumulation with less clear borders could also be considered as an additional factor, as this kind of fluid is more difficult to distinguish in OCT volume scans of low quality. The slight improvement in performance in the DME cohort may also lie in the intrinsic characteristics of retinal lesion architecture, as AMD lesions tend to have more pigment epithelial detachments, which sometimes heavily distort the image, complicate focusing in the context of a thicker retina, and expand into areas that might be prone to IRF [7, 18].

Our results are also in line with the results of recent works on automated detection and/or quantification of retinal fluid, which demonstrated high accuracy in detecting retinal fluid such as IRF and SRF [19, 20], detecting any retinal fluid [21‒23], or giving a binary yes-no classification of the presence of an exudative disease [24‒26]. In our experiment, we focused on the accuracy of IRF and SRF detection as well as the ability of the evaluated algorithm to discriminate between these two features, since IRF and SRF are associated with variable visual outcomes [4, 7, 27]. The method developed by Schlegl et al. [19] showed excellent results on a testing set containing 1,200 OCT volume scans from 400 patients with AMD, DME, and retinal vein occlusion, of which 65 and 100 contained IRF and 69 and 11 contained SRF in AMD and DME, respectively; these scans were acquired with a Spectralis device, as in the present study. However, a decrease in the sensitivity for SRF in eyes with DME was noticed by the authors and explained by the fact that the algorithm was solely trained on AMD and retinal vein occlusion cases and that SRF is scarce in DME patients. Lu et al. [20] obtained similarly promising results concerning SRF in DME when testing their method on an unseen dataset containing 750 B-scan images. Our results suggest that these challenges can be overcome with a larger sample size and a sufficient number of positive cases, and show that training an algorithm on a set with different biomarker distributions among multiple exudative diseases does not necessarily reduce the detection performance. Furthermore, testing a deep learning approach on a set with entire OCT volume scans (in contrast to individual B-scans) may be beneficial with a view to a potential generalized clinical application, allowing the training of an algorithm to come as close as possible to routine conditions.

Discrepancies between Algorithmic Predictions and Human Annotations

In recent years, an increasing number of research groups have focused their work on automated detection and segmentation of retinal fluid, comparing their results with those of expert human graders [28, 29]. In this vein, Kurmann et al. [12] provided some additional insights into the biomarker detector for which we propose a more thorough evaluation in this paper. Here, we focused on a qualitative and quantitative analysis of cases with disagreement to unveil the limitations of the system predictions.

Explanation of the Major Sources of Errors

IRF was mistakenly predicted when retinal shadow artifacts caused by anterior hyperreflective opacities were present (Fig. 4, case 1), as well as in cases of strong retinal alteration with poor acquisition quality and/or more segmentation failures, where the low contrast resulted in a darkening of the neurosensory retina, which in turn was falsely interpreted as fluid (case 5). The mean CST of the FP volume scans regarding IRF in the AMD cohort was higher than that of the TNs (p < 0.0001), which supports our previous finding. However, these observations were not applicable to the FP samples for IRF in the DME cohort, where no significant difference from the TNs was detected (p = 0.559). In addition, the FP rate of IRF in DME (0.52%) was lower than in AMD (8.18%). This could be explained by the very small sample size of the FP class for IRF in the DME cohort, as nearly 90% of the DME dataset was annotated with IRF present (Fig. 1), and by the fact that low acquisition quality represents the principal factor leading to FP results in this cohort (Fig. 4, case 5).

The vast majority of missed IRF was attributed to the presence of small hyporeflective areas in combination with poor image quality and the resulting lower contrast (cases 2 and 6). This observation is supported by the comparison of the CST of each positive and negative class. Regarding IRF in both cohorts, the mean CST of the FN samples was significantly smaller than that of the TP volume scans (AMD: p = 0.0048; DME: p < 0.0001). A previous study comparing the performance of an AI algorithm with that of retinal specialists [22] showed that the estimated fluid volume in the FN cases was significantly lower than that of the TPs. The clinical implication of this observation is questionable, as the presence of peripheral small degenerative cystoid fluid does not always lead to a modification of the therapy management in DME and as degenerative cystoid fluid does not necessarily respond to anti-VEGF therapy [27]. Therefore, the ability to detect small amounts of fluid does not necessarily provide an advantage to an automated method, but underlines that an algorithm can be considered as a second grader in the scope of clinical trials.

The algorithm also tended to overestimate the presence of SRF, even if the homogeneous hyporeflective area was less than 100 μm in horizontal extent (case 3). This situation occurred mostly in cases of strong retinal deformation with retinal shadowing (case 7), but did not result in a significant difference in CST between the FP and TN samples in the AMD cohort (p = 0.112). In the DME cohort, SRF was falsely predicted as present in cases with strong retinal deformation (case 7), which resulted in a slightly thicker CST in the FP subgroup compared to the TNs (p = 0.0088).

Concerning SRF in both cohorts, the mean CST value of the FNs did not differ significantly from that of the TPs (AMD: p = 0.2526; DME: p = 0.185). This could be explained either by the relatively small sample size or by extended retinal alterations within the FN samples, causing shadow artifacts and/or low-quality acquisition (cases 4 and 8). Further, a marked difference was observed between the FN rates of SRF in the AMD (2.54%) and DME cohorts (0.39%), which could be explained by the biomarker distribution, given that only 8.5% of the DME set was annotated with SRF versus 34.9% of the AMD set (Fig. 1).

This nonexhaustive analysis of cases provides an interesting insight into the characteristics of the volume scans where the algorithm encountered difficulties, suggesting that clinically relevant areas in an OCT scan are considered critical in the classification task of a deep learning algorithm, which is consistent with previous studies [22, 28]. We must add that the standard deviation values were relatively high, indicating a certain dispersion around the mean, which could negatively influence our findings. Furthermore, the relatively small sample size of the false classes emphasizes the robustness of the evaluated system but limits the interpretation of Table 1.

Independence of the IRF and SRF Detectors and Analysis of a Combined Fluid Detector

We found, overall, a high diagnostic accuracy, specificity, sensitivity, and precision in detecting fluid on a large set of OCT volume scans of patients with DME and AMD (Table 2). These metrics are slightly higher than those reported by a recently published prospective study [22]. The observed independence of the single detectors is a positive characteristic, which ensures in some ways that the detectors of IRF and SRF learned features particular to IRF and SRF, respectively. The higher accuracy of the fluid detector compared to the two single detectors suggests that combining single detectors in the case of fluid provides better fluid detection accuracy than considering the single detectors separately. These observations support the development of AI-based assistance for clinicians with the intention of minimizing missed cases of fluid and thus improving the visual outcome. A combined fluid detector cannot discriminate the fluid type but would be relevant in such a context. Another interesting application would be in clinical trial settings with large numbers of OCT volume scans requiring grading, where such a detector could be used as a supportive tool to reduce the burden on expert graders by not sending for annotation the OCTs in which no fluid is detected. This would allow a better allocation of the human graders’ workload and could improve the performance of reading centers. Lastly, such an algorithm could be implemented in clinical routine as a supportive tool to screen patients with active exudative macular disease, especially in settings where ophthalmologists are not readily available.

Limitations and Future Works

A major limitation of our study is its retrospective design, with the resulting lack of data and possible selection bias, as the OCT volume scans were completely anonymized. Second, our data were obtained from devices from a single manufacturer, which severely limits the generalizability of the evaluated system. Third, our dataset comprised OCT volume scans with a single disease per OCT. The performance of the tested algorithm could be lower in a real-world setting, especially in the presence of two or more distinct pathologies. Fourth, as we considered the graders’ annotations as ground truth, human errors may have occurred and led to incorrect results. However, the excellent correlation between annotations and predictions in the present study and the low probability of three expert graders being mistaken somewhat mitigate this limitation. Fifth, the evaluated system has difficulty discriminating IRF from intraretinal degenerative cysts, which could be addressed by collecting more examples of this biomarker to improve the capability to distinguish cysts. However, we obtained an overall small FP rate, indicating an acceptable global performance. To generalize the application to clinical routine, further evaluation using real-life data is needed. A continuation of this work could focus on the quantification of fluid and on testing and validating the evaluated AI-based algorithm on different OCT devices and other biomarkers. We were able to illustrate the ability of such an algorithm to learn features specific to IRF and SRF, and its robustness across a wide range of CST values, i.e., in different pathomorphological stages of DME and AMD. This study offers insights into the limitations of an automated method that appears to approach human-level performance in grading difficult OCT volume scans. Further work is needed to understand to what extent an algorithm carries over bias, especially human bias, and how this affects its performance. Similarly, future work could include saliency maps combined with different grading methods in order to improve the interpretability of an automated method.

This study includes human subjects. The study was performed in accordance with International Conference on Harmonization Good Clinical Practice guidelines and the tenets of the Declaration of Helsinki. The study protocol was reviewed and approved by the Ethics Committee of the Canton of Bern (Kantonale Ethikkommission Bern), Switzerland, approval number 2021-01280, and written informed consent to participate in the study was obtained from participants.

The authors have no conflicts of interest to declare regarding this study. We refer readers to the ICMJE Form for Disclosure of Potential Conflicts of Interest of each author.

There was no financial support to be declared for this study.

Mathias Gallardo and Oussama Habra contributed equally to research design, data acquisition, research execution, analysis, interpretation, and manuscript preparation. Till Meyer zu Westram contributed to data acquisition, research execution, and data analysis and interpretation. Sandro De Zanet contributed to data acquisition and research execution. Damian Jaggi contributed to data acquisition and interpretation. Martin Zinkernagel and Sebastian Wolf contributed to research design, data analysis, and interpretation. Raphael Sznitman contributed to research design, data analysis, interpretation, and manuscript preparation.

The data in this study were obtained from the Bern Photographic Reading Center (BPRC), where restrictions may apply. The dataset may be requested from the Bern Photographic Reading Center (BPRC), bprc@insel.ch, Universitätsklinik für Augenheilkunde, Inselspital, Freiburgstrasse, 3010 Bern.

1. Yu S, Rückert R, Munk MR. Treat-and-extend regimens with anti-vascular endothelial growth factor agents in age-related macular degeneration. Expert Rev Ophthalmol. 2019;14(6):287–307.
2. Wong DT, Berger AR, Bourgault S, Chen J, Colleaux K, Cruess AF, et al. Imaging biomarkers and their impact on therapeutic decision-making in the management of neovascular age-related macular degeneration. Ophthalmologica. 2021;244(4):265–80.
3. Flores R, Carneiro Â, Vieira M, Tenreiro S, Seabra MC. Age-related macular degeneration: pathophysiology, management, and future perspectives. Ophthalmologica. 2021;244(6):495–511.
4. Jaffe GJ, Ying GS, Toth CA, Daniel E, Grunwald JE, Martin DF, et al. Macular morphology and visual acuity in year five of the comparison of age-related macular degeneration treatments trials. Ophthalmology. 2019;126(2):252–60.
5. Schmidt-Erfurth U, Waldstein SM. A paradigm shift in imaging biomarkers in neovascular age-related macular degeneration. Prog Retin Eye Res. 2016;50:1–24.
6. Markan A, Agarwal A, Arora A, Bazgain K, Rana V, Gupta V. Novel imaging biomarkers in diabetic retinopathy and diabetic macular edema. Ther Adv Ophthalmol. 2020;12:2515841420950513.
7. Zur D, Iglicki M, Busch C, Invernizzi A, Mariussi M, Loewenstein A, et al. OCT biomarkers as functional outcome predictors in diabetic macular edema treated with dexamethasone implant. Ophthalmology. 2018;125(2):267–75.
8. Bourne RRA, Flaxman SR, Braithwaite T, Cicinelli MV, Das A, Jonas JB, et al. Magnitude, temporal trends, and projections of the global prevalence of blindness and distance and near vision impairment: a systematic review and meta-analysis. Lancet Glob Health. 2017;5(9):e888–97.
9. Martin DF, Maguire MG, Fine SL, Ying GS, Jaffe GJ, Grunwald JE, et al. Ranibizumab and bevacizumab for treatment of neovascular age-related macular degeneration: two-year results. Ophthalmology. 2012;119(7):1388–98.
10. Monés J, Singh RP, Bandello F, Souied E, Liu X, Gale R. Undertreatment of neovascular age-related macular degeneration after 10 years of anti-vascular endothelial growth factor therapy in the real world: the need for a change of mindset. Ophthalmologica. 2020;243(1):1–8.
11. De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342–50.
12. Kurmann T, Yu S, Márquez-Neila P, Ebneter A, Zinkernagel M, Munk MR, et al. Expert-level automated biomarker identification in optical coherence tomography scans. Sci Rep. 2019;9(1):13605.
13. Gallardo M, Munk MR, Kurmann T, De Zanet S, Mosinska A, Karagoz IK, et al. Machine learning can predict anti-VEGF treatment demand in a treat-and-extend regimen for patients with neovascular AMD, DME, and RVO associated macular edema. Ophthalmol Retina. 2021;5(7):604–24.
14. Rêgo S, Dutra-Medeiros M, Soares F, Monteiro-Soares M. Screening for diabetic retinopathy using an automated diagnostic system based on deep learning: diagnostic accuracy assessment. Ophthalmologica. 2021;244(3):250–7.
15. Moraes G, Fu DJ, Wilson M, Khalid H, Wagner SK, Korot E, et al. Quantitative analysis of OCT for neovascular age-related macular degeneration using deep learning. Ophthalmology. 2021;128(5):693–705.
16. Schmidt-Erfurth U, Bogunovic H, Sadeghipour A, Schlegl T, Langs G, Gerendas BS, et al. Machine learning to analyze the prognostic value of current imaging biomarkers in neovascular age-related macular degeneration. Ophthalmol Retina. 2018;2(1):24–30.
17. Lee CS, Baughman DM, Lee AY. Deep learning is effective for classifying normal versus age-related macular degeneration OCT images. Ophthalmol Retina. 2017;1(4):322–7.
18. Mrejen S, Sarraf D, Mukkamala SK, Freund KB. Multimodal imaging of pigment epithelial detachment: a guide to evaluation. Retina. 2013;33(9):1735–62.
19. Schlegl T, Waldstein SM, Bogunovic H, Endstraßer F, Sadeghipour A, Philip AM, et al. Fully automated detection and quantification of macular fluid in OCT using deep learning. Ophthalmology. 2018;125(4):549–58.
20. Lu D, Heisler M, Lee S, Ding GW, Navajas E, Sarunic MV, et al. Deep-learning based multiclass retinal fluid segmentation and detection in optical coherence tomography images using a fully convolutional neural network. Med Image Anal. 2019;54:100–10.
21. Chakravarthy U, Goldenberg D, Young G, Havilio M, Rafaeli O, Benyamini G, et al. Automated identification of lesion activity in neovascular age-related macular degeneration. Ophthalmology. 2016;123(8):1731–6.
22. Keenan TDL, Clemons TE, Domalpally A, Elman MJ, Havilio M, Agrón E, et al. Retinal specialist versus artificial intelligence detection of retinal fluid from OCT: age-related eye disease study 2: 10-year follow-on study. Ophthalmology. 2021;128(1):100–9.
23. Mantel I, Mosinska A, Bergin C, Polito MS, Guidotti J, Apostolopoulos S, et al. Automated quantification of pathological fluids in neovascular age-related macular degeneration, and its repeatability using deep learning. Transl Vis Sci Technol. 2021;10(4):17.
24. Sidibé D, Sankar S, Lemaître G, Rastgoo M, Massich J, Cheung CY, et al. An anomaly detection approach for the identification of DME patients using spectral domain optical coherence tomography images. Computer Methods Programs Biomed. 2017;139:109–17.
25. Rim TH, Lee AY, Ting DS, Teo K, Betzler BK, Teo ZL, et al. Detection of features associated with neovascular age-related macular degeneration in ethnically distinct data sets by an optical coherence tomography: trained deep learning algorithm. Br J Ophthalmol. 2021;105(8):1133–9.
26. Li F, Chen H, Liu Z, Zhang X, Wu Z. Fully automated detection of retinal disorders by image-based deep learning. Graefes Arch Clin Exp Ophthalmol. 2019;257(3):495–505.
27. Sophie R, Lu N, Campochiaro PA. Predictors of functional and anatomic outcomes in patients with diabetic macular edema treated with ranibizumab. Ophthalmology. 2015;122(7):1395–401.
28. Kermany DS, Goldbaum M, Cai W, Valentim CCS, Liang H, Baxter SL, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–31.e9.
29. Wu J, Philip A-M, Podkowinski D, Gerendas BS, Langs G, Simader C, et al. Multivendor spectral-domain optical coherence tomography dataset, observer annotation performance evaluation, and standardized evaluation framework for intraretinal cystoid fluid segmentation. J Ophthalmol. 2016;2016:3898750.