Abstract
Background/Aims: Many cognitive screening instruments (CSI) are available to clinicians to assess cognitive function. The optimal method for comparing the diagnostic utility of such tests is uncertain. The effect size (Cohen's d), calculated as the difference of the means of two groups divided by the weighted pooled standard deviation of these groups, may permit such comparisons. Methods: Datasets from five pragmatic diagnostic accuracy studies, which examined the Mini-Mental State Examination (MMSE), the Mini-Mental Parkinson (MMP), the Six-Item Cognitive Impairment Test (6CIT), the Montreal Cognitive Assessment (MoCA), the Test Your Memory test (TYM), and the Addenbrooke's Cognitive Examination-Revised (ACE-R), were analysed to calculate the effect size (Cohen's d) for the diagnosis of dementia versus no dementia and for the diagnosis of mild cognitive impairment versus no dementia (subjective memory impairment). Results: The effect sizes for dementia versus no dementia diagnosis were large for all six CSI examined (range 1.59-1.87). For the diagnosis of mild cognitive impairment versus no dementia, the effect sizes ranged from medium to large (range 0.48-1.45), with MoCA having the largest effect size. Conclusion: The calculation of the effect size (Cohen's d) in diagnostic accuracy studies is straightforward. The routine incorporation of effect size calculations into diagnostic accuracy studies merits consideration in order to facilitate the comparison of the relative value of CSI.
Introduction
The utility of cognitive screening instruments (CSI) for the diagnosis of dementia and lesser degrees of cognitive impairment may be indicated by a number of summary parameters, of which the most familiar are probably sensitivity and specificity. Predictive values, likelihood ratios, clinical utility indexes, agreement between tests (kappa statistic), and the area under the receiver operating characteristic curve (AUC ROC) may also be used as parameters of diagnostic utility [1], but all these measures have potential shortcomings. For example, sensitivity and specificity may be difficult to apply to individual patients, predictive values are influenced by the prevalence of the disease in the population being tested, and AUC ROC combines the test accuracy over a range of thresholds, some of which may be clinically relevant and others clinically nonsensical.
Another metric that may be used to demonstrate utility is the effect size. The effect size may be denoted by a variety of summary indices, of which Cohen's d is probably the most commonly used in the medical literature [2]. This parameter is calculated as the difference of the means of two groups divided by the weighted pooled standard deviation of these groups (fig. 1). Cohen [3] suggested that effect sizes of 0.2-0.3 were small, 0.5 medium, and ≥0.8 large.
One example of the potential utility of this approach in clinical studies of cognitive impairment was demonstrated by Brønnick [4], who compared standardized effect sizes (Cohen's d) of cognitive functions between groups of patients diagnosed with Parkinson's disease, Parkinson's disease dementia, Alzheimer's disease, and normal controls to identify differences in group mean values. This study showed larger effect sizes for tests of memory in Alzheimer's disease and of executive and visuospatial function in Parkinson's disease dementia, indicating greater impairments in these domains. Hence, testing of these selected cognitive functions may be of particular utility for the differential diagnosis of these conditions.
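For orientation, the verbal definition above can be written out as follows. This is the commonly used form of Cohen's d for two independent groups; the weighting of the pooled standard deviation by degrees of freedom (n - 1) is an assumption here, since fig. 1 is not reproduced in this text, and the symbols (group means, standard deviations s, and group sizes n) are introduced only for illustration.

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^{2} + (n_2 - 1)\,s_2^{2}}{n_1 + n_2 - 2}}
```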
The aim of the study presented here was to calculate effect sizes (Cohen's d) for several CSI from the datasets of a number of pragmatic prospective diagnostic accuracy studies undertaken in dedicated secondary care memory clinics. The CSI examined were the Mini-Mental State Examination (MMSE) [5], the Mini-Mental Parkinson (MMP) [6], the Six-Item Cognitive Impairment Test (6CIT) [7], the Montreal Cognitive Assessment (MoCA) [8], the Test Your Memory (TYM) test [9], and the Addenbrooke's Cognitive Examination-Revised (ACE-R) [10]. The calculation of the effect size was undertaken for both the diagnosis of dementia versus no dementia and of mild cognitive impairment versus no dementia.
Materials and Methods
Data from five previous pragmatic diagnostic accuracy studies [11,12,13,14,15], which examined six different CSI [5,6,7,8,9,10], namely the MMSE [11], MMP [11], 6CIT [12], MoCA [13], TYM [14], and ACE-R [15], were reanalysed. Study details (setting, sample size, dementia prevalence, sex ratio, and age range) are shown in table 1.
In each of these studies, the criterion diagnosis was established by the judgment of an experienced clinician based on widely accepted clinical diagnostic criteria. The mean test scores for demented and non-demented groups as well as for mild cognitive impairment and non-demented groups, along with their standard deviations, were applied to the Cohen's d formula (fig. 1) to calculate effect sizes. Because these were clinic-based pragmatic studies, there was no normal control group, the non-demented cases consisting of patients with at least subjective memory impairment as well as patients with mild cognitive impairment insufficient to mandate a dementia diagnosis.
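As a minimal sketch of how group means and standard deviations might be applied to the formula, the following Python functions compute Cohen's d from summary statistics. The function names, the degrees-of-freedom weighting of the pooled standard deviation, and the numerical values are illustrative assumptions; they are not taken from the studies analysed here.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled SD of two groups, weighted by degrees of freedom (n - 1).

    This weighting is an assumed, commonly used convention.
    """
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Difference of the two group means divided by the pooled SD."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


# Hypothetical values for illustration only (not data from the cited studies):
# group 1 = non-demented, group 2 = demented, on a test where higher is better.
d = cohens_d(mean1=26.0, sd1=3.0, n1=100, mean2=20.0, sd2=5.0, n2=50)
print(round(d, 2))
```

The sign of d depends only on the order in which the groups are entered, so for a negatively scored instrument the absolute value may be the more convenient quantity to report.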
Results
The calculation of Cohen's d comparing patients with and without dementia suggested large but similar effect sizes for all CSI examined (table 2, left-hand column), using the classification suggested by Cohen [2,3]. These values suggested a consistent difference in test scores between demented and non-demented individuals.
The calculation of Cohen's d comparing patients with mild cognitive impairment and no dementia (subjective memory impairment) suggested smaller effect sizes for all CSI examined than in the dementia versus no dementia distinction (table 2, right-hand column), using the classification suggested by Cohen [2,3]. However, one effect size, namely that for the MoCA, was clearly larger than all the others. These values suggested a consistent difference in test scores between individuals with mild cognitive impairment and non-demented individuals, but with the MoCA performing best.
For comparative purposes, other parameters of diagnostic utility derived from the sampled diagnostic accuracy studies [11,12,13,14,15] are shown, namely sensitivity, specificity, predictive values, likelihood ratios, and AUC ROC (table 3).
Discussion
In this study, data from five pragmatic diagnostic accuracy studies of CSI [11,12,13,14,15] were re-analysed to calculate effect sizes (Cohen's d) for the diagnosis of dementia versus no dementia and for the diagnosis of mild cognitive impairment versus no dementia. These were observational studies, which examined unselected patient groups with cognitive complaints of unknown aetiology. Such pragmatic diagnostic accuracy studies [16] differ from experimental studies in which patient groups are selected by known diagnostic categories, often with a normal control group, and then have the test or intervention applied (essentially case-control studies). Since pragmatic diagnostic accuracy studies more closely reflect the idioms of clinical practice in terms of the typical spectrum of patients seen, it may be argued that their results are more broadly generalizable [1], although the groupings are more heterogeneous than in experimental studies.
Making comparisons between studies is, of course, problematic, notwithstanding the consistency of study protocols and authorship in the studies examined here. One potential shortcoming of the current analysis was the different settings and sample characteristics for each of the five studies (table 1). Inevitably, the case-mix in clinics run by neurologists and by old age psychiatrists will differ in terms of both patient age and dementia prevalence. One outcome of this heterogeneity was insufficient data to compare effect sizes by patient age.
Overall, the calculations suggested a large effect size for all CSI examined for the diagnosis of dementia versus no dementia (table 2, left-hand column). Since the purpose of these instruments is to screen for the diagnosis of dementia, this finding is perhaps not surprising.
Effect size calculations for the differential diagnosis of mild cognitive impairment versus no dementia (subjective memory impairment) gave lower values for all CSI (table 2, right-hand column), as might be anticipated for this more challenging clinical distinction. These calculations suggested that the MoCA performed best, as might be expected since this instrument was specifically designed to detect mild cognitive impairment [8]. This distinction between the performance of different CSI may be of clinical importance when disease-modifying drugs for cognitive impairment are developed.
How do effect sizes relate to other diagnostic utility measures (table 3)? As a global measure, encompassed in a single outcome number compared to criterion values [3], effect sizes are perhaps akin to AUC ROC, which likewise differed little between the various CSI examined in this study (table 3). The effect size thus gives an overall index of test diagnostic utility, but this metric may conceal differences between tests in terms of, for example, their relative sensitivity (range 0.45-0.82) and specificity (range 0.79-0.98), although these parameters depend on the chosen test cutoff. Thus, effect sizes may need to be used in conjunction with other summary test metrics to decide which test might be most appropriate for the given clinical situation, for example, whether the clinician seeks high sensitivity or specificity.
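The kinship between effect size and AUC ROC can be illustrated under one simplifying assumption that is not made in the studies analysed here: if test scores in both groups were normally distributed with equal variance, the AUC implied by a given Cohen's d would be Φ(d/√2), where Φ is the standard normal cumulative distribution function. The sketch below uses a hypothetical function name, and the empirical AUC values in table 3 need not match it.

```python
import math


def auc_from_d(d):
    """AUC implied by Cohen's d, ASSUMING both groups' scores are normal
    with equal variance: AUC = Phi(d / sqrt(2)).
    """
    z = d / math.sqrt(2)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


# Lower and upper ends of the dementia versus no dementia range reported above.
for d in (1.59, 1.87):
    print(d, round(auc_from_d(d), 2))
```

Under this assumption, a d of zero corresponds to an AUC of 0.5 (chance), and the mapping flattens as d grows, which is one way to see why tests with large effect sizes may differ little in AUC.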
The calculated effect sizes for the CSI examined in this study permit some comparison between instruments, although other summary measures and analyses are also available for this purpose. Meta-analysis is perhaps the most favoured approach for its methodological rigor, but the outcomes are dependent on the stringency of study inclusion and exclusion criteria, which may sometimes give results that are not anticipated (e.g. comparability of the MMSE and ACE/ACE-R, due to a surprisingly high sensitivity of the MMSE in included studies [17]). The kappa statistic can be used to measure agreement between tests, ranging from perfect agreement (kappa = 1) to agreement due to chance alone (kappa = 0) [18]. The AUC ROC is a measure of diagnostic accuracy but has been criticised for combining test accuracy over a range of thresholds, some of which may be clinically relevant and others clinically nonsensical [19]. Weighted comparison may be used to indicate the net benefit of one test with respect to another and permits the calculation of the equivalent increase in the number of true positive patients identified per 1,000 patients tested [19]. One such weighted comparison study suggested the superiority of the ACE-R and MoCA over the MMSE and the inferiority of TYM to MMSE [20].
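To make the two kappa reference points concrete, the following is a minimal sketch of Cohen's kappa for two tests that each classify the same patients as positive or negative. The function name, argument names, and counts are hypothetical and serve only as illustration.

```python
def cohens_kappa(both_pos, only_a_pos, only_b_pos, both_neg):
    """Cohen's kappa for two binary tests applied to the same patients.

    Arguments are the four cell counts of the 2x2 agreement table.
    Undefined (division by zero) if chance agreement equals 1.
    """
    n = both_pos + only_a_pos + only_b_pos + both_neg
    p_observed = (both_pos + both_neg) / n
    a_pos = (both_pos + only_a_pos) / n  # proportion positive on test A
    b_pos = (both_pos + only_b_pos) / n  # proportion positive on test B
    p_chance = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical counts: perfect agreement, then agreement no better than chance.
print(cohens_kappa(50, 0, 0, 50))
print(cohens_kappa(25, 25, 25, 25))
```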
As shown in this study, the calculation of effect size (Cohen's d) for CSI is straightforward. No previous analyses of the comparative diagnostic utility of CSI using this method have been identified. This study suggests that there is a case for the routine incorporation of effect sizes as a measure of diagnostic test performance into diagnostic accuracy studies, although it should be noted that there are effect size formulae other than Cohen's d [21]. Although this is not an explicit recommendation of the STARDdem guidelines (Standards for the Reporting of Diagnostic Accuracy studies specific to diagnostic accuracy studies in dementia; www.starddem.org), summary measures such as Cohen's d and weighted comparison might be considered in future iterations of these guidelines.
Disclosure Statement
The author has no conflicts of interest to declare.