Abstract
Background/Aims: Automated, volumetrically defined atrophy in the left anterior cingulate (LAC) and anterior temporal regions (LAT) on MRI can be used to distinguish most patients with frontotemporal dementia (FTD) from controls. FTD and Alzheimer’s disease (AD) can differ in the degree of anterior temporal atrophy. We explored whether clinicians can visually detect this atrophy pattern and whether they can use it to classify the 2 groups of dementia patients with the same accuracy. Methods: Four neurologists rated atrophy in the LAC and LAT regions on MRI slices from 21 FTD, 21 controls, and 14 AD participants. Inter-rater reliability and diagnostic accuracy were assessed. Results: All 4 raters agreed on the presence of clinically significant atrophy, and their atrophy scoring correlated with the volumes, but without translation into high inter-rater diagnostic agreement. Conclusions: Volumetric analyses are difficult to translate into routine clinical practice.
Introduction
Although frontotemporal abnormalities on structural [1,2,3,4,5,6] and functional imaging [7,8,9,10,11] have been reported in frontotemporal dementia (FTD), consensus groups deemed such findings supportive and not necessary for clinical diagnosis [12,13]. These consensus groups were convened prior to volumetric studies replicating robust differences between FTD, healthy aging, and Alzheimer’s disease (AD). In a prior study [14], logistic regression modeling showed that the combination of atrophy in the left anterior cingulate (LAC) and the left anterior temporal pole (LAT) can distinguish FTD from controls with age-associated frontal atrophy with 90% accuracy. The application of these findings to routine clinical practice has not been explored. Although clinical diagnosis can be enhanced using visual ratings of coronal views of the LAT with a high correlation of the visual ratings to automated volumetric measurements [15,16], not all clinical MRI protocols provide coronal sections.
In this study, we explored the ability of clinical neurologists experienced in dementia diagnosis and neuroimaging research to detect selective regional atrophy on ‘real life’ MR images. MRI studies obtained for volumetric studies often contain more data and use 3D T1-weighted images, which are not routinely acquired in clinical MRI scan protocols. Our study aim was to determine how applicable the findings from our prior study are to frontline clinical neurologists, especially those who may have Internet access (i.e. picture archive and communication systems) to patients’ MRI scans.
The first study question was whether our raters could use atrophy scores to distinguish FTD from control MRIs with acceptable inter-rater reliability. We also wanted to test whether coronal views (not included in most clinical MRI brain protocols) are necessary for accurate identification of atrophy in the LAT.
Healthy middle-aged individuals can frequently be distinguished from FTD patients on the basis of a good informant's history alone. We therefore assigned an additional exercise for our raters: to use the LAC and LAT atrophy pattern to solve the more common clinical dilemma of differentiating FTD from AD [17].
Materials and Methods
Participants
We used MRI data from 21 participants diagnosed with FTD according to criteria published from a consensus meeting [13], 14 participants with AD per NINCDS-ADRDA criteria [18] matched on the basis of duration of illness to the FTD group, and 21 age- and gender-matched controls. All participants were recruited from memory clinics at Sunnybrook Health Sciences Centre (S.E.B.) and Baycrest (T.W.C. and M.F.). Ethics committees at both institutions approved the protocols. The inclusion and exclusion criteria for the study and description of sample have already been detailed in the prior study using high-resolution volumetric data [14].
In brief, all participants underwent a standardized MRI of the brain between 1995 and 2004 on the same 1.5-T MRI scanner (General Electric Signa version 8.4M4), using the dementia protocol at the Sunnybrook Health Sciences Centre. Scans were typically performed within 3 months of clinical diagnosis of FTD or AD, with the exception of 4 cases of FTD who had their first MRI for research purposes 1–2 years after diagnosis. Findings of atrophy did not bias inclusion or exclusion for this study, although MRI results contributed to the clinical diagnoses made for study inclusion. Autopsy data were used to exclude potential participants from the study.
Autopsies confirmed FTD diagnoses in 10 participants from this group; the other 11 are still living or did not give consent for autopsy. Comparing the autopsy-confirmed group against the remainder of FTD participants revealed a higher level of education (mean 15.6 vs. 11.5 years, t = 2.36, p = 0.03), higher proportion of left-handers (8/10, χ2 p = 0.004), and smaller mean LAT volume (6.8 vs. 9.64 cm3, t = –3.02, p = 0.007) but not LAC volume for the autopsied group, and no differences in sex distribution (χ2 p = 0.63), mean age at time of MRI (63.9 vs. 64 years, t = –0.29, p = 0.98), mean duration of illness (4.3 vs. 4.0 years, t = 0.3, p = 0.77) or mean MMSE score (17.7 vs. 23.6, t = –1.45, p = 0.16).
The control group consisted of 21 healthy community volunteers without complaints of mood or behavioral disturbance, not taking psychotropic medications, and with no history suggestive of neurological or psychiatric conditions. They were matched to each of the 21 FTD participants by gender and age within 5 years.
The FTD sample for this study included 20 of the 30 FTD participants described in our prior study [14] plus 1 additional participant, and the same 18 controls, plus 3 additional control participants.
Imaging Analysis
T1-weighted (1.2-mm slice thickness) and proton density/T2-weighted (3-mm slice thickness) imaging parameters were selected to provide optimal intensity separation for the in-house previously published segmentation and parcellation program, SABRE [19,20].
The LAC and LAT volumes of grey matter were normalized by dividing each volumetric component by the same individual’s total supratentorial intracranial volume, in order to accommodate for normal variation in head size. The measure was then multiplied by the mean total supratentorial intracranial volume for the control sample (1,186.19, n = 21) to yield a sample-normalized volume in cm3 for each volume of interest (VOI) [14].
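The head-size normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the SABRE implementation: the function name and the example participant are hypothetical, and only the control-sample mean of 1,186.19 comes from the text (its unit is assumed here to be cm3).

```python
# Control-sample mean supratentorial intracranial volume reported in the text
# (n = 21). ASSUMPTION: the unit is cm^3.
CONTROL_MEAN_ST_ICV = 1186.19


def normalize_voi_volume(voi_volume, subject_st_icv, reference_icv=CONTROL_MEAN_ST_ICV):
    """Scale a regional grey-matter volume to a reference head size.

    voi_volume and subject_st_icv must be in the same units (e.g. cm^3).
    """
    if subject_st_icv <= 0:
        raise ValueError("intracranial volume must be positive")
    # Divide by the individual's intracranial volume, then rescale by the
    # control-sample mean to get back to a volume in cm^3.
    return voi_volume / subject_st_icv * reference_icv


# Hypothetical participant: 9.0 cm^3 LAT volume, 1250.0 cm^3 intracranial volume.
normalized_lat = normalize_voi_volume(9.0, 1250.0)
```

A participant whose head is larger than the control mean, as in this made-up case, ends up with a normalized volume smaller than the raw one.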
Visual Rating
A research neuroradiologist expert in dementia imaging (F.G.) used Analyze™-formatted MRI data aligned to the anterior commissure and posterior commissure plane to select 5 slices that simulated the clinical format for each of the 56 participants and highlighted the LAC and LAT VOIs, based on our prior volumetric method that identified those VOIs as particularly discriminating between FTD and controls [21]. Slice simulations included: (1) the T1-weighted axial slice just superior to the last view of insular cortex on either side of the brain (LAC), (2) the sagittal T1-weighted slices with the most complete view of the left optic nerve on either side of the brain (LAC), (3) the most inferior T1-weighted axial slice showing the interpeduncular cistern (LAT), (4) the sagittal T1-weighted slice at the lateral extreme of the left hippocampus (LAT), and (5) a coronal T1-weighted slice at the level of the limen insulae, where insular cortex becomes continuous with frontal cortex (LAC and LAT; fig. 1). The volume averaging on the slices viewed from the 3D T1 MRI generated for this exercise differed from routine clinical scans (which usually have thicker slices of approx. 5–10 mm). However, slice thickness should not affect accuracy of the atrophy rating, as the atrophy of the LAC and LAT was determined on 3 sections from different orientations. The standardization procedure for this study’s slices should have enhanced reproducibility.
Sample template of the 5 slices shown to raters to score atrophy in the regions of interest of the LAC [axial (a) and mid-sagittal (b)] and the LAT [coronal (c), axial (d) and sagittal (e)].
To review all participants in only 2 sessions, all 4 visual raters viewed the timed slideshow simultaneously, although this did not duplicate a naturalistic setting. Each rater independently completed 1 rating sheet per participant. The raters were all neurologists involved in clinical research; none was certified in neuroradiology. Two of the 5 images were displayed at a time, sequentially, in the same order (LAC then LAT) from participant to participant in a PowerPoint slide presentation programmed to show each slide (image pair) for 10 s. This amount of time was more than the raters needed to score atrophy and was set at this length to decrease the confound of any inter-rater differences that could be attributed to processing speed. Participants were presented in an order randomized with respect to diagnosis (i.e. FTD, AD, control), so that the 4 visual raters could make their ratings of the timed slideshow presentation without bias from diagnosis. During the rating sessions, raters understood that approximately one third of the scans were from control participants, but were blinded to all clinical information about the other participants, including age.
Each rating sheet showed standard images for each atrophy rating, and scores ranged from 0, indicating no atrophy, to 4, the most severe atrophy, similar to the methods of Davies et al. [16]. The standard images were derived from a single control and a single participant with FTD, neither of whom was part of the study sample, so that the raters would not be able to simply match some images from the slideshow to the rating sheet. To assess intra-rater reliability of atrophy scoring, the raters re-reviewed a subsample of 18 participants randomly chosen from the initial group of 56 during a second scoring session.
Raters selected a diagnosis immediately after viewing each 5-image set during the first rating session and could not return to any set once it was completed.
Debriefing after the first session revealed a need to clarify the diagnostic criteria as raters had differed on use of non-LAC and non-LAT regions to make these decisions.
Raters were instructed that the main criterion for an FTD diagnosis was presence of atrophy most marked in the LAC and/or LAT regions. Hippocampal and parietal atrophy equivalent to that seen in the LAC and/or LAT would lead a rater towards an AD diagnosis.
No demographic information on participants was provided on the slides presented, to emphasize the weight to be placed on regional atrophy, as done with SABRE in the previous study [14]. In the second session, raters stated they would have used participant ages to inform their diagnostic impressions if those data had been available.
Statistical Analysis
We subjected the raters’ 0–4 atrophy scores to Spearman’s rho correlations with the corresponding LAC and LAT volumes yielded by SABRE, to give a measure of raters’ accuracy apart from choosing a diagnosis.
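To make the correlation step concrete, the following is a minimal pure-Python sketch of Spearman's rho, computed as the Pearson correlation of the ranks with tied values sharing their average rank. The analyses in this study were run in SPSS; the function names and the data below are illustrative assumptions, not the study's code or measurements.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up data: one rater's 0-4 atrophy scores and normalized volumes (cm^3).
scores = [0, 1, 1, 3, 4, 2]
volumes = [11.2, 10.1, 9.8, 7.5, 5.9, 8.8]
rho = spearman_rho(scores, volumes)
```

Because higher atrophy scores should accompany smaller volumes, an accurate rater is expected to produce a negative rho, as in this made-up example.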
Inter-Rater Agreement: Analysis of Visual Ratings
We ran Kendall’s W for weighted κ statistics to test inter-rater agreement for the 0–4 atrophy scores from each of the 4 visual raters.
We also captured a simple presence versus absence of atrophy response for LAC and LAT. We subjected these responses to Cochran’s Q statistical analysis, in the event this measure had a different reliability from the atrophy scoring. The Q statistic for agreement yields larger p values (i.e. p > 0.05) with increasing inter-rater agreement [22].
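For readers unfamiliar with Cochran's Q, the statistic can be sketched as follows. This is an illustrative pure-Python version, not the SPSS procedure used in the study; the function names and the example ratings are hypothetical. The p value helper assumes the standard chi-square reference distribution with k − 1 degrees of freedom and is written only for the 4-rater case (3 degrees of freedom), where a closed form exists.

```python
import math


def cochrans_q(responses):
    """Cochran's Q for binary ratings.

    responses: one row per participant, one 0/1 entry per rater
    (1 = atrophy present, 0 = atrophy absent).
    """
    k = len(responses[0])  # number of raters
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("every participant needs a rating from the same 2+ raters")
    col_totals = [sum(row[j] for row in responses) for j in range(k)]
    row_totals = [sum(row) for row in responses]
    grand_total = sum(row_totals)
    denominator = k * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        # All raters agreed on every participant; Q is 0/0 here and this
        # sketch returns 0.0 by convention.
        return 0.0
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    return numerator / denominator


def chi2_sf_3df(q):
    """Upper-tail chi-square probability for exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(q / 2)) + math.sqrt(2 * q / math.pi) * math.exp(-q / 2)


# Made-up presence/absence responses: 5 participants rated by 4 raters.
example = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
]
q_example = cochrans_q(example)
```

Under the 3-degrees-of-freedom assumption, `chi2_sf_3df(5.769)` evaluates to approximately 0.123, which agrees with the p value reported for the LAC in the Results.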
To determine the added impact of coronal views for designation of clinically significant atrophy in the LAT, we conducted a binary logistic regression for each rater’s score set. The three planes of view for LAT (axial, sagittal and coronal) were entered in that order for a binary logistic regression, with the atrophy response (yes/no) as the dependent variable. The lack of contribution of the coronal atrophy score to the model for atrophy would indicate that the rater would have labeled the LAT as atrophied regardless of whether the coronal slices were available to view along with standard axial and sagittal slices.
Overall Diagnostic Accuracy
The clinical consensus-based diagnoses used to select the sample were considered the gold standard for comparison. We calculated the percentage accuracy by dividing the tally of the unanimously correctly identified FTD and AD participants by the total number of patients with FTD or AD (n = 35). Because our original study reported significant VOIs for FTD versus control, we performed a χ2 test for any significant difference between the accuracies of FTD versus control and FTD versus AD diagnoses.
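The unanimous-accuracy calculation described above can be sketched as follows. This is a minimal illustration under assumed data structures (one list of rater labels per participant); the function name and the example ratings are hypothetical and do not reproduce the study data.

```python
def unanimous_accuracy(rater_diagnoses, reference_diagnoses):
    """Fraction of participants whom every rater labeled with the reference diagnosis.

    rater_diagnoses: one list per participant holding each rater's label.
    reference_diagnoses: the clinical consensus label for each participant.
    """
    if len(rater_diagnoses) != len(reference_diagnoses) or not reference_diagnoses:
        raise ValueError("need one reference diagnosis per participant")
    unanimous_correct = sum(
        all(label == truth for label in labels)
        for labels, truth in zip(rater_diagnoses, reference_diagnoses)
    )
    return unanimous_correct / len(reference_diagnoses)


# Made-up ratings for 4 participants and 4 raters.
ratings = [
    ["FTD", "FTD", "FTD", "FTD"],  # unanimous and correct
    ["FTD", "AD", "FTD", "FTD"],   # one dissenting rater
    ["AD", "AD", "AD", "AD"],      # unanimous and correct
    ["AD", "AD", "AD", "AD"],      # unanimous but incorrect (reference is FTD)
]
reference = ["FTD", "FTD", "AD", "FTD"]
accuracy = unanimous_accuracy(ratings, reference)
```

Applied to the unanimous tallies reported in the Results (9 of 21 FTD and 5 of 14 AD participants), the same division gives 14/35, or 40%.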
SPSS software version 16.0 (for Windows) was used for all statistical analyses.
Results
Although 18 participants with AD met the inclusion criteria by being duration-of-illness-matched to the FTD group and had undergone MRI of the brain, motion or dental implant artifacts rendered the images unreadable by the volumetrics software in 4 participants. The AD group therefore had 14 members in this study. The participants with AD, while selected for duration-of-illness matches to the FTD participants, as opposed to age at the time of the MRI, happened to have early-onset ages. Demographic data for the groups appear in table 1.
The FTD and AD groups were in a mild stage of illness near the time of MRI scanning, based on Mattis Dementia Rating Scale scores close to 118. Mean MMSE scores for the 2 dementia groups did not differ (p = 0.324).
Figure 2 depicts the volumes of the LAC and LAT for each of the 3 groups.
Boxplots of median values of LAC and LAT volumes for each of the 3 participant groups show that participants with FTD generally had the smallest volumes, but outliers within the control group had lower volumes than participants with dementia.
Accuracy of Visual Rating
Atrophy scores (0–4) for LAC in the sagittal view, LAT in the coronal view, and LAT in the axial view by all 4 raters correlated inversely, as expected, with actual volumes: more severe atrophy scores corresponded to smaller volumes (rho ranging among individual raters from –0.61 to –0.42, all p values <0.01).
Inter-Rater Agreement
Inter-rater agreement for the 0–4 severity of atrophy was high for both LAC and LAT, yet inconsistent across axial, sagittal, and coronal views. Scoring was more uniform among raters for atrophy in the LAC on axial views (Kendall’s W = 0.062, p = 0.016) than on sagittal views (larger Kendall’s W = 0.070, p = 0.008). This was true regardless of whether the raters focused on cortical atrophy or ventricular enlargement. Agreement between raters on LAT atrophy scoring was high for both the coronal and the axial views, but it was significantly reduced when raters were asked to score ventricular enlargement (Kendall’s W = 0.204, p < 0.00001) in any of these orientations.
Inter-rater agreement was strong for the presence versus absence of atrophy in LAC (Q = 5.769, p = 0.123), as observed with scoring for this VOI. Raters' atrophy responses did not match closely for the LAT VOI (much higher Q of 32.124, p = 4.866 × 10⁻⁷). This would have contributed to discrepancies among raters in distinguishing FTD from controls, because the combined presence of LAC and LAT atrophy was used to make an FTD designation.
Binary logistic regressions for the LAT atrophy responses of each rater, using the axial, then sagittal, then coronal view atrophy scores as predictors, showed that no single view on its own produced a model that predicted the atrophy response. We temper our interpretation of this finding, because raters could review all slice ratings for a participant for about 1 min before the next set of images appeared. This could potentially allow the coronal views to influence scoring of atrophy on the axial and sagittal slices. Figure 3 shows rater-by-rater scores on axial and coronal views plotted against actual volumes.
Scatter plots of atrophy scores from each rater against the actual volumes indicate that neither axial (a) nor coronal (b) views prevented disagreement about whether the VOI is atrophied. Note markers corresponding to approximately 4 cm3 volumes, not all of which were considered to be even moderately atrophied by some raters. Control n = 21, FTD n = 21, AD n = 14. Scoring key: 0 = normal; 1–2 = mild deepening of sulci, could be normal for participants aged >60 years; 3 = moderate atrophy beyond that consistent with age of participant; 4 = severe atrophy, e.g. ‘paper-thin’ cortex.
Accuracy of Differential Diagnosis
Accuracy for categorizing MRI scans as control, FTD, or AD was low in this study. Unanimous rater agreement for the correct diagnoses occurred in fewer than half of each diagnostic group: 3/21 controls, 9/21 FTD, and 5/14 AD. Six out of 10 (60%) autopsy-confirmed FTD participants and a slightly larger proportion of non-autopsied FTD participants (7 out of 11, 64%) received an accurate diagnosis from 3 or more of the 4 raters. The raters were able to correctly distinguish between FTD and AD scans with an average of 63.1% accuracy.
Unanimous agreement for incorrect diagnoses by all 4 raters occurred in 2 cases (1 control mistaken for AD and 1 FTD mistaken for AD). FTD scans were mistaken as ‘AD’ by 3 of the 4 raters in 2 of the 21 cases. These 2 had autopsy-confirmed diagnoses of FTD, 1 of whom had motor neuron disease and died only 1 year into illness. Their mean LAC and LAT volumes landed right between the remainder of the FTD sample (n = 19) and the AD group despite this short duration of illness (mean LAC volume 25.4 ± 3.2, 23.2 ± 2.6, and 26.5 ± 2.5 cm3, and LAT 9 ± 2.4, 8.2 ± 2.6, and 10.8 ± 1.5 cm3, respectively). Post hoc contrasts showed LAC and LAT to be smaller in accurately diagnosed FTD (n = 19) than in AD: LAC (F = 5.49, p = 0.009; post hoc FTD vs. AD contrast p = 0.002) and LAT (F = 6.58, p = 0.004; post hoc FTD vs. AD contrast p = 0.001). Table 2 shows that there were no significant differences between the 2 mistaken FTD cases, the rest of the FTD group, and the AD group in demographics or functional/behavioral assessments, but this conclusion is limited by the unavailability of Dementia Rating Scale, Cornell Depression Scale, and BEHAVE-AD scores for half of the larger FTD group.
Demographic, cognitive, functional, and behavioral inventory scores for autopsy-confirmed FTD mistaken for AD by 3 raters, the rest of the FTD sample, and the AD sample

Only 1 AD scan was mistaken for FTD by more than 2 raters.
The average individual rater accuracy for controls versus FTD was 59.5%. Rater 1 was better at differentiating FTD versus control (χ2 = 12.1, p < 0.001), while raters 2–4 had higher accuracy making FTD versus AD discriminations than FTD versus controls (rater 2, χ2 = 9.5, p < 0.01; rater 3, χ2 = 10.7, p < 0.01; rater 4, χ2 = 6.1, p < 0.05). Despite that, rater 4’s accuracy for FTD versus AD was lower than the rest of the group (23.8% for FTD, 64.3% for AD; p < 0.02).
Discussion
The 4 raters showed strong inter-rater agreement for recognizing and scoring atrophy. In the present study, a determination of moderate or worse atrophy of the LAT could have come from the axial and sagittal views alone, but inter-rater agreement was recorded best on coronal views of the LAT.
Agreement on the presence of atrophy did not translate to agreement or accuracy for classifying the 2 types of dementia. Our experienced clinicians could not differentiate controls from FTD at the 90% accuracy achieved by semi-automated volumetry [21]. This has been a challenge for several investigators [24,25,26,27]. This difficulty echoes the conclusion of our prior paper showing that many frontotemporal regions are mildly atrophied with normal aging and may overlap with FTD-related mild atrophy [14]. Others have identified the insula as a volumetric discriminator for FTD versus controls [16,28], and the exclusion of insular VOIs from the SABRE protocol may have accounted for some of our inaccuracy in making the FTD distinction.
This study bears many other possible confounds. First, the control participants shown in figure 2 who had low LAC or LAT volumes would have looked like FTD. The likely larger confound was that 14 AD MRI scans were interspersed with the 42 FTD and control scans: 76.8% of our errors were FTD and control scans misattributed to AD. The boxplots in figure 2 show the overlap in LAC and LAT atrophy between AD and FTD, even if the mean AD volumes were still larger than for FTD. Of note, 3D MRI protocols can be done in a few minutes and have been recommended, for example, in the consensus guidelines for Vascular Cognitive Impairment [29].
Our methods did not entirely replicate a clinical situation: the slices prepared for review were in 16-bit Analyze format, whereas routine MRIs are presented to clinicians in 32-bit. Visual raters saw only the 5 selected MRI slices in sequence and never other brain images for a participant. Other investigators have commented on the importance of more naturalistic approaches [30]. Clinicians can also go back to see views as needed to update their interpretations. In this study, at least 1 rater changed criteria for scoring the atrophy once he had gone through the slides for the first 30 participants.
Bias in favor of volumetric analysis over visual rating may exist in our study due to combining the behavior and language presentation FTD subgroups, but participants for this study had shown great enough similarity between LAC and LAT critical cutoff values for accurate classification in the prior study [14] that we felt justified in taking this step. Kipps et al. [31] found that 75% of behavioral variant FTD patients had marked frontal and/or temporal atrophy which could be detected by raters on visual inspection, but 100% of their semantic dementia cases had focal LAT atrophy. Otherwise, 47% of patients with specifically behavioral variant FTD could not be discriminated from controls when raters examined frontal and temporal areas on MRI scans. Other limitations of the present study may include a relatively small sample size and the use of regional, as opposed to lobar, foci [32]. Studies using other brain regions for visual rating have been more successful [16,33].
With the pertinent history usually available in a clinical setting, structural imaging may be more helpful for further characterizing the severity of the illness or anticipating further behavioral and cognitive changes for caregivers than it is for predicting the etiology of the dementia. In the absence of historical information, however, structural imaging may help to rule out other neurological disorders, such as neoplasm, stroke, infection, and cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy.
That visual rating procedures are not reliable even among experts is a frequent concern [33,34,35,36,37]. Neuroradiologists primed to the diagnosis of behavioral variant FTD versus semantic dementia versus AD show higher accuracy of diagnostic categorization and inter-rater agreement than the neurologists in this and a previous study [27], although sensitivity remains low at approx. 60% [33]. Difficulties with regional neuroimaging assessments are not unique to the MRI modality. Comparisons of perfusion on SPECT imaging for differentiating FTD from AD yield similar results [38].
Conclusion
It is difficult to test whether volumetric analyses can be translated into routine clinical practice, particularly when the goal is to differentiate between FTD and early-onset AD. Structural neuroimaging, in conjunction with a complete history and neuropsychological testing, can help distinguish between FTD and normal aging, but routine clinical MRIs from participants with FTD, early-onset AD, and controls with age-related atrophy may appear very similar to clinicians.
Acknowledgments
NIA grant F32 AG022802 (T.W.C.); the Saul A. Silverman Family Foundation, as part of a Canada International Scientific Exchange Program project (M.F.); University of Toronto Dean’s Fund for New Faculty (457494 to T.W.C.); an endowment to the Sam and Ida Ross Memory Clinic (T.W.C., M.F.); Canadian Institutes of Health Research MT – 12853, MRC-GR-14974, James S. McDonnell Foundation 21002032 (D.T.S.); Canadian Institutes of Health Research 13129 (S.E.B.).
D.T.S. is the Reva James Leeds Chair in Neuroscience and Research Leadership and S.E.B. holds the Brill Chair in Neurology, Department of Medicine, University of Toronto and Sunnybrook Health Sciences Centre.