Abstract
Background: An increasing number of diagnostic tests and biomarkers have been validated over the last decades, and this will remain a prominent field of research because of the need for personalized medicine. Strict evaluation is needed whenever we aim to validate a potential diagnostic tool, and the first requirement a new testing procedure must fulfill is diagnostic accuracy. Summary: Measures of diagnostic accuracy tell us about the ability of a test to discriminate between, and/or predict, disease and health. This discriminative and predictive potential can be quantified by measures such as sensitivity and specificity, predictive values, likelihood ratios, the area under the receiver operating characteristic curve, overall accuracy and the diagnostic odds ratio. Some measures serve discriminative purposes, while others serve as predictive tools. Measures of diagnostic accuracy vary in the way they depend on the prevalence, spectrum and definition of the disease. In general, measures of diagnostic accuracy are extremely sensitive to the design of the study: studies not meeting strict methodological standards usually over- or underestimate the indicators of test performance and limit the applicability of their results. Key Messages: The testing procedure should be verified on a reasonable population, including people with mild and severe disease, thus providing a comparable spectrum. Sensitivity and specificity are not predictive measures. Predictive values depend on disease prevalence, and their conclusions can be transposed to other settings only for studies based on a suitable population (e.g. screening studies). Likelihood ratios should be an optimal choice for reporting diagnostic accuracy. Diagnostic accuracy measures must be reported with their confidence intervals. Paired measures (sensitivity and specificity, predictive values or likelihood ratios) must always be reported for clinically meaningful thresholds. How much discriminative or predictive power we need depends on the clinical diagnostic pathway and on the misclassification (false positive/negative) costs.
Introduction
An increasing number of diagnostic tests and biomarkers [1] have become available during the last decades, and the need for personalized medicine will strengthen this trend in the future. Consequently, any potential new testing procedure needs careful evaluation in order to limit negative consequences for both health and medical care expenditures [2].
Evaluating the diagnostic accuracy of any diagnostic procedure or test is not a trivial task. In general, it amounts to answering several questions. Will the test be used in a clinical or screening setting? In which part of the clinical pathway will it be placed? Does the test have the ability to discriminate between health and disease? How well does it do that job? How much discriminative ability do we need for our clinical purposes?
In the following pages we will try to answer these questions. We will give an overview of diagnostic accuracy measures, with their definitions, purposes, advantages and weak points. We will then apply some of these measures to a real example from the medical literature, an evaluation of velocity criteria applied to transcranial Doppler (TCD) signals for the detection of stenosis of the middle cerebral artery [3]. We will end the paper with a list of take-home messages.
Diagnostic Accuracy Measures
Overview
The discriminative ability of a test can be quantified by several measures of diagnostic accuracy:
- sensitivity and specificity;
- positive and negative predictive values (PPV, NPV);
- positive and negative likelihood ratios (LR+, LR-);
- the area under the receiver operating characteristic (ROC) curve (AUC);
- the diagnostic odds ratio (DOR);
- the overall diagnostic accuracy.
While these measures are often reported interchangeably in the literature, they have specific features and fit specific research questions. These measures are related to two main categories of issues:
- classification of people between those who are and those who are not diseased (discrimination);
- estimation of the posttest probability of a disease (prediction).
While discrimination purposes are mainly of concern in health policy decisions, predictive measures are most useful for predicting the probability of a disease in an individual once the test result is known. Thus, these measures of diagnostic accuracy cannot be used interchangeably. Some measures largely depend on disease prevalence, and all of them are sensitive to the spectrum of the disease in the population studied [4]. It is therefore of great importance to know how to interpret them, as well as when and under what circumstances to use them.
When we conduct a test, we have a cutoff value indicating whether an individual can be classified as positive (above/below the cutoff) or negative (below/above the cutoff), and a gold standard (or reference method) which will tell us whether the same individual is ill or healthy. Therefore, the cutoff divides the population of examined subjects with and without disease into 4 subgroups, which can be displayed in a 2 × 2 table:
- true positive (TP) = subjects with the disease with the value of a parameter of interest above/below the cutoff;
- false positive (FP) = subjects without the disease with the value of a parameter of interest above/below the cutoff;
- true negative (TN) = subjects without the disease with the value of a parameter of interest below/above the cutoff;
- false negative (FN) = subjects with the disease with the value of a parameter of interest below/above the cutoff (table 1).
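As an illustration, the 4 counts can be obtained from raw data in a few lines of Python; the measurements, disease labels and cutoff below are hypothetical, not taken from any study:

```python
# Hypothetical data: a measured parameter per subject plus the gold-standard
# disease status; here the test is positive when the value is above the cutoff.
test_values = [95, 70, 110, 60, 85, 120]
diseased    = [True, False, True, False, True, False]
cutoff = 90

tp = sum(v > cutoff and d for v, d in zip(test_values, diseased))       # diseased, test positive
fp = sum(v > cutoff and not d for v, d in zip(test_values, diseased))   # healthy, test positive
tn = sum(v <= cutoff and not d for v, d in zip(test_values, diseased))  # healthy, test negative
fn = sum(v <= cutoff and d for v, d in zip(test_values, diseased))      # diseased, test negative
print(tp, fp, tn, fn)  # -> 2 1 2 1
```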
Sensitivity and Specificity
Sensitivity [5], generally expressed as a percentage, is the proportion of TP among all subjects with the disease: TP/(TP + FN). Sensitivity estimates the probability of getting a positive test result in subjects with the disease; hence, it relates to the ability of a test to recognize the ill. Specificity [5], on the other hand, is the proportion of subjects without the disease who have a negative test result: TN/(TN + FP). In other words, specificity estimates the probability of getting a negative test result in a healthy subject; therefore, it relates to the ability of a diagnostic procedure to recognize the healthy.
Neither sensitivity nor specificity depends on disease prevalence, meaning that results from one study can be transferred to another setting with a different prevalence of the disease in the population. Nonetheless, sensitivity and specificity may largely depend on the spectrum of the disease; in fact, both benefit from evaluating patients with more severe disease. Sensitivity and specificity are good indices of a test's discriminative ability; in clinical practice, however, a more common line of reasoning is to ask how good the test is at predicting illness or health: How confident can we be about the disease status of a subject if the test has yielded a positive result? What is the probability that the person is healthy if the test is negative? Addressing these questions requires predictive values.
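Continuing the sketch, sensitivity and specificity follow directly from the 4 counts (the counts below are again hypothetical):

```python
# Sensitivity and specificity from hypothetical 2x2 counts.
tp, fn, tn, fp = 10, 2, 76, 11

sensitivity = tp / (tp + fn)  # probability of a positive test in the diseased
specificity = tn / (tn + fp)  # probability of a negative test in the healthy
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```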
Predictive Values
The PPV is the probability of being ill for a subject with a positive test result; it is the proportion of subjects with the disease among all subjects with a positive result: TP/(TP + FP) [6]. The NPV is the probability of not having the disease for a subject with a negative test result; it is the proportion of subjects without the disease among all subjects with a negative result: TN/(TN + FN) [6].
Unlike sensitivity and specificity, both the PPV and the NPV depend on disease prevalence in the evaluated population. Therefore, predictive values from one study should not be transferred to another setting with a different prevalence of the disease. The PPV increases and the NPV decreases as disease prevalence increases. Both PPV and NPV increase when we evaluate patients with more severe disease.
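A small sketch can make this prevalence dependence concrete: via Bayes' theorem, predictive values can be derived from sensitivity, specificity and prevalence (the sensitivity and specificity values below are illustrative):

```python
# Predictive values from sensitivity, specificity and prevalence (Bayes' theorem).
def predictive_values(sens, spec, prev):
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# Same test, three different prevalences: PPV rises, NPV falls with prevalence.
for prev in (0.05, 0.12, 0.50):
    ppv, npv = predictive_values(0.83, 0.87, prev)
    print(f"prevalence={prev:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```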
Likelihood Ratios
LR are useful measures of diagnostic accuracy but are often disregarded, even though they have several properties that make them particularly powerful from a clinical perspective. They are defined as the ratio of the probability of an expected test result in subjects with the disease to the probability of the same result in subjects without the disease [7].
LR+ tells us how many times more likely a positive test result is to occur in subjects with the disease than in those without it. The higher LR+ is above 1, the stronger the evidence for the presence of the disease; if LR+ equals 1, the test cannot distinguish the ill from the healthy. For a test with only 2 outcomes, LR+ can simply be calculated according to the following formula: LR+ = sensitivity/(1 - specificity).
LR- represents the ratio of the probability that a negative result will occur in subjects with the disease to the probability that the same result will occur in subjects without the disease. Therefore, LR- tells us how much less likely the negative test result is to occur in a subject with the disease than in a healthy subject. LR- is usually less than 1, because it is less likely that a negative test result occurs in subjects with than in subjects without the disease. For a test with only 2 outcomes, LR- is calculated according to the following formula: LR- = (1 - sensitivity)/specificity.
The lower the LR-, the stronger the evidence for the absence of the disease. Since both sensitivity and specificity are used to calculate the LR, neither LR+ nor LR- depends on disease prevalence; like all the other measures, however, LR depend on the disease spectrum of the population under study. LR are directly linked to posttest probabilities. Here, we concentrate on the posttest probability of having the disease. The pretest probability is the probability of having the disease before the test is done; if no other clinical characteristics of the subject are available, we can take the disease prevalence as the pretest probability. Given a pretest probability p, we calculate the pretest odds as a1 = p/(1 - p) and the posttest odds as a2 = a1 × LR; converting back from odds to probability, the posttest probability is P = a2/(1 + a2).
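These formulas translate directly into code; a minimal sketch, with illustrative sensitivity and specificity values:

```python
# Pretest -> posttest probability via likelihood ratios, following the
# odds formulas in the text.
def posttest_probability(pretest_p, lr):
    pretest_odds = pretest_p / (1 - pretest_p)  # a1 = p / (1 - p)
    posttest_odds = pretest_odds * lr           # a2 = a1 * LR
    return posttest_odds / (1 + posttest_odds)  # P = a2 / (1 + a2)

lr_pos = 0.83 / (1 - 0.87)  # LR+ = sensitivity / (1 - specificity)
lr_neg = (1 - 0.83) / 0.87  # LR- = (1 - sensitivity) / specificity
print(posttest_probability(0.12, lr_pos))  # probability of disease after a positive test
print(posttest_probability(0.12, lr_neg)) # probability of disease after a negative test
```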
ROC Curve
Diagnostic measures clearly depend on the cutoff used: there is a pair of sensitivity and specificity values for every single cutoff. To construct an ROC curve, we plot these pairs of values in an ROC space with 1 - specificity on the x-axis and sensitivity on the y-axis (fig. 1) [8]. The shape of an ROC curve and the AUC help us estimate the discriminative power of a test: the closer the curve is to the upper-left corner and the larger the AUC, the better the test is at discriminating between disease and health. The AUC can take any value between 0 and 1 and is a good global indicator of the test's performance. A perfect diagnostic test has an AUC of 1.0, whereas a nondiscriminating test has an AUC of 0.5 (as if one were tossing a coin).
Fig. 1. ROC curve. From bottom-left to upper-right corner, we observe increasing sensitivity and decreasing specificity at lowering thresholds.
The AUC is a global measure of diagnostic accuracy. It tells us nothing about individual parameters, such as sensitivity and specificity, which refer to specific cutoffs: a higher AUC reflects better performance averaged over all available cutoffs, not necessarily higher sensitivity and specificity at any particular one. Indeed, of 2 tests with identical or similar AUC, one can have significantly higher sensitivity, whereas the other can have significantly higher specificity. For comparing the AUCs of two ROC curves, we use statistical tests which evaluate the statistical significance of the estimated difference at a previously defined level of statistical significance.
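For illustration only, an empirical ROC curve and its AUC can be computed from simulated data with scikit-learn (assumed available); these are not data from any cited study:

```python
# Simulated scores for 50 diseased and 50 healthy subjects; empirical ROC and AUC.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
labels = np.r_[np.ones(50), np.zeros(50)]                     # 1 = diseased
scores = np.r_[rng.normal(1.0, 1.0, 50), rng.normal(0.0, 1.0, 50)]

fpr, tpr, thresholds = roc_curve(labels, scores)  # one (1 - specificity, sensitivity) pair per cutoff
print("AUC =", round(roc_auc_score(labels, scores), 3))
```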
Overall Diagnostic Accuracy
Another global measure is the so-called 'diagnostic accuracy', expressed as the proportion of correctly classified subjects (TP + TN) among all subjects (TP + TN + FP + FN). Diagnostic accuracy is affected by disease prevalence: it is the prevalence-weighted average of sensitivity and specificity, i.e. accuracy = sensitivity × prevalence + specificity × (1 - prevalence). With the same sensitivity and specificity, the diagnostic accuracy of a test whose specificity exceeds its sensitivity therefore increases as the disease prevalence decreases (and vice versa).
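A short sketch of this weighted-average relationship, with illustrative sensitivity and specificity values, shows how the same test yields different overall accuracies at different prevalences:

```python
# Overall accuracy as a prevalence-weighted average of sensitivity and specificity.
sens, spec = 0.83, 0.87
for prev in (0.05, 0.12, 0.50):
    accuracy = sens * prev + spec * (1 - prev)
    print(f"prevalence={prev:.2f}  accuracy={accuracy:.3f}")
```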
Diagnostic Odds Ratio
DOR is also a global measure of diagnostic accuracy, used for general estimation of the discriminative power of diagnostic procedures and also for comparison of diagnostic accuracies between 2 or more diagnostic tests. The DOR of a test is the ratio of the odds of positivity in subjects with the disease to the odds in subjects without the disease [9]. It is calculated according to the following formula: DOR = LR+/LR- = (TP/FN)/(FP/TN).
The DOR depends significantly on the sensitivity and specificity of a test: a test with high sensitivity and specificity and low rates of FP and FN has a high DOR, and for a given sensitivity, the DOR increases with increasing specificity. The DOR does not depend on disease prevalence; however, it depends on the criteria used to define the disease and on the spectrum of pathological conditions in the population examined.
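The equivalence between the two formulas can be checked numerically; the counts below are placeholders, not study data:

```python
# DOR from the 2x2 counts and, equivalently, from the two likelihood ratios.
tp, fn, fp, tn = 10, 2, 11, 76
dor = (tp / fn) / (fp / tn)                    # = (TP * TN) / (FP * FN)
lr_pos = (tp / (tp + fn)) / (fp / (fp + tn))   # sensitivity / (1 - specificity)
lr_neg = (fn / (tp + fn)) / (tn / (fp + tn))   # (1 - sensitivity) / specificity
assert abs(dor - lr_pos / lr_neg) < 1e-9       # DOR = LR+ / LR-
print(round(dor, 2))
```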
Example
With the help of a study published by Rorick et al. [10], we will now apply the diagnostic measures discussed in the previous section. The purpose of the study was to evaluate the use of mean velocity (MV) criteria applied to TCD signals in the detection of stenosis of the middle cerebral artery; the reference test was an angiographic examination. When the positivity of TCD was fixed at an MV cutoff of >90 cm/s, the results shown in table 2 were observed (the 2 × 2 table was reconstructed from information available in the study). The resulting diagnostic accuracy measures are reported in table 3.
If a test aims at diagnosing a subject, then it is crucial to determine the posttest probability of the disease if the test is positive. If we assume that the pretest probability equals the prevalence of the disease (p = 12/99 = 0.12), the pretest odds are a1 = 0.12/(1 - 0.12) = 0.14 and the posttest odds a2 = a1 × LR+ = 0.14 × 9.32 = 1.29; the posttest probability of the disease is then P = a2/(1 + a2) = 1.29/(1 + 1.29) = 0.56.
Not surprisingly, the posttest probability equals the PPV; the reason is that we took the study's disease prevalence as the pretest probability. Let us suppose instead that the pretest probability is higher (0.25). Repeating our calculations, the pretest odds are a1 = 0.25/(1 - 0.25) = 0.33 and the posttest odds a2 = a1 × LR+ = 0.33 × 9.32 = 3.11, so the posttest probability of the disease is P = a2/(1 + a2) = 3.11/(1 + 3.11) = 0.76.
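The two calculations above can be reproduced with a small helper function, using only the numbers reported in the text:

```python
# Reproducing the worked example: pretest probability 0.12 (the study prevalence)
# and LR+ = 9.32, then a higher pretest probability of 0.25.
def posttest(p, lr):
    odds = (p / (1 - p)) * lr
    return odds / (1 + odds)

print(round(posttest(12 / 99, 9.32), 2))  # 0.56, matching the PPV
print(round(posttest(0.25, 9.32), 2))     # 0.76
```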
The study also reported results for a cutoff of >80 cm/s: sensitivity increased to 83%, while specificity decreased to 87%. We can plot the results in the ROC space (fig. 2); the lower threshold (MV cutoff >80 cm/s) lies above and to the right of the higher threshold (MV cutoff >90 cm/s). If we could vary the cutoff along a continuum, we would see an ROC curve. The study was later reanalyzed in a review by Navarro et al. [11], which confirmed a good performance of TCD against angiography.
Fig. 2. Visual display of test results evaluated at an MV cutoff of >80 cm/s and an MV cutoff of >90 cm/s in an ROC space.
Key Messages
Population
The testing procedure should be verified on a reasonable population, one that includes subjects with both mild and severe disease, so as to provide a spectrum comparable to that of the intended clinical setting. Disease prevalence affects predictive values, but the disease spectrum has an impact on all diagnostic accuracy measures.
Sensitivity and Specificity, PPV and NPV or LR?
For clinicians and health professionals, the key issue is to know how well a diagnostic procedure can predict a disease. Sensitivity and specificity are not predictive measures; they only describe how a disease predicts particular test results. Predictive values inform us about the probability of having the disease given a positive test result (PPV), or the probability of being healthy given a negative test result (NPV). Unfortunately, these probabilities largely depend on disease prevalence, and their meaning can rarely be transferred beyond the study (except when the study is based on a suitably random sample, e.g. population screening studies). While often disregarded, LR should be an optimal choice for reporting diagnostic accuracy: they share with sensitivity and specificity the feature of being independent of disease prevalence, yet they can be used to calculate the probability of disease for any pretest probability.
Multiple Thresholds: Paired Measures or ROC
Diagnostic accuracy can be presented at a specific threshold by using paired results, such as sensitivity and specificity, or, alternatively, predictive values or LR. Other methods summarize accuracy across all the available test thresholds. In general, it is better to present both summary and paired results, because apparently good discriminative power at a particular threshold can occur merely by chance. A graphical presentation, in particular an ROC plot, can then be highly informative. At the same time, concentrating only on the AUC when comparing 2 tests can be misleading, because this measure averages both clinically meaningful and irrelevant thresholds into a single number. Thus, we always have to report paired measures for clinically relevant thresholds.
Variability
As always, it is crucial to report measures of variability/uncertainty (e.g. 95% confidence intervals) alongside diagnostic accuracy results.
How Much Discriminative or Predictive Power?
In general, the answer depends on the stage of the clinical diagnostic pathway at which the test will be placed, and on the misclassification (FP/FN) costs. For instance, if we need a triage test and the cost of an FP is of no importance, we should focus on sensitivity, NPV or LR-.