Objective: To evaluate a new approach for creating a composite measure of cognitive function, we calibrated a measure of general cognitive performance from existing neuropsychological batteries. Methods: We applied our approach in an epidemiological study and scaled the composite to a nationally representative sample of older adults. Criterion validity was evaluated against standard clinical diagnoses. Convergent validity was evaluated against the Mini-Mental State Examination (MMSE). Results: The general cognitive performance factor was scaled to have a mean of 50 and standard deviation of 10 in a nationally representative sample of older adults. A cutoff point of approximately 45, corresponding to an MMSE of 23/24, optimally discriminated participants with and without dementia (sensitivity = 0.94, specificity = 0.90, area under the curve = 0.97). The general cognitive performance factor was internally consistent (Cronbach's α = 0.91) and provided reliable measurement across a wide range of cognitive functioning. It demonstrated minimal floor and ceiling effects, which is an improvement over most individual cognitive tests. Conclusions: The cognitive composite is a highly reliable measure, with minimal floor and ceiling effects. We calibrated it using a nationally representative sample of adults over the age of 70 in the USA and established diagnostically relevant cutoff points. Our methods can be used to harmonize neuropsychological test results across diverse settings and studies.
Despite the central role of cognitive function in the daily lives of older adults, there is no widely used, standardized method of assessing overall or global cognitive function across a wide range of functioning. Over 500 neuropsychological tests exist for clinical and research purposes. This tremendous diversity complicates the comparison and synthesis of findings about cognitive functioning across a broad range of performance in multiple samples in which different neuropsychological batteries were administered. Although test batteries are used to examine domain-specific cognitive function, summary measures provide global indices of function. Brief global cognitive tests such as the Mini-Mental State Examination (MMSE) and others are limited by prominent ceiling effects and skewed score distributions [3,4,5,6], and thus evaluate a limited range of cognitive function. These measurement properties substantially hamper the capacity to measure longitudinal change, since score ranges are limited and a 1-point change has different meanings across the range of values [3,7]. Summary measures of general cognition, if properly calibrated, may be more sensitive to impairments across a broader range of cognitive function and more sensitive to changes over time.
Approaches to creating summary cognitive measures have been limited and controversial. One approach involves standardizing scores of component cognitive tests, which are then averaged into a composite [9,10]. Although widely used, this approach is limited because standardizing variables does not address skewed response distributions, does not allow differential weighting of tests, and ultimately does not facilitate comparisons of findings across studies. An alternative approach uses latent variable methods to summarize tests. The tests are weighted and combined in a composite measure that has more favorable measurement properties, including minimal floor and ceiling effects, measurement precision over a wide range of function and interval-level properties that make the composite optimal for studying longitudinal change [8,11].
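The standardize-and-average approach described above can be sketched in a few lines. This is a minimal illustration on a hypothetical score matrix, not code from any study:

```python
import numpy as np

def zscore_composite(tests):
    """Naive composite: z-standardize each test within the sample,
    then average the z-scores across tests for each person."""
    tests = np.asarray(tests, dtype=float)
    z = (tests - tests.mean(axis=0)) / tests.std(axis=0, ddof=1)
    return z.mean(axis=1)
```

Note that the resulting composite inherits any skewness in the component tests and weights every test equally, which is exactly the limitation motivating the latent variable approach.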
In a previous study, Jones et al. used confirmatory factor analysis to develop a general cognitive performance factor from a neuropsychological battery. This measure was shown to be unidimensional and internally consistent (Cronbach's α = 0.82). The factor was defined by high loadings on 6 of the 10 component tests. The cognitive factor was designed to be sensitive to a wide range of cognitive function from high functioning to scores indicative of dementia. Scores were normally distributed and reliable (reliability index >0.90). With these properties, the general cognitive performance factor provides a robust approach to assess cognitive function over time.
Despite these clear advantages, an important limitation of the general cognitive performance factor is that its scores are not yet clinically interpretable or generalizable across studies. To address this limitation, the aims of the present study were as follows: (1) to calibrate a general cognitive performance factor to a nationally representative sample of adults over the age of 70 in the USA, (2) to validate the general cognitive performance factor against reference standard clinical diagnoses, (3) to examine convergent validity of the cognitive performance factor score and (4) to identify clinically meaningful cutoff points for the cognitive factor score. Our overall goal was to create a clinically meaningful measurement tool and, importantly, to demonstrate an approach that is generalizable to different neuropsychological test batteries on a broader scale.
Materials and Methods
Participants were drawn from the Successful Aging after Elective Surgery (SAGES) Study and the Aging, Demographics and Memory Study (ADAMS), a substudy of the Health and Retirement Study. SAGES is a prospective cohort study of long-term cognitive and functional outcomes of hospitalization in elective surgery patients. After recruitment, a neuropsychological battery was administered to participants just before surgery and at regular follow-up intervals for up to 3 years. Because data collection was ongoing at the time of this study, we used preoperative data for the first 300 patients enrolled. Eligible participants were at least 70 years of age, English speaking, and scheduled for elective surgery at one of two academic teaching hospitals in Boston, Mass., USA. Exclusion criteria included evidence of dementia or delirium at baseline. Study procedures were approved by the Institutional Review Board at the Beth Israel Deaconess Medical Center.
ADAMS is a nationally representative sample of 856 older adults in the USA interviewed in 2002-2004. Its parent study, the Health and Retirement Study, is a longitudinal survey of over 20,000 community-living retired persons. ADAMS, which began as a population-based study of dementia, initially identified a stratified random sample of 1,770 participants; 687 refused and 227 died before they were interviewed, yielding 856 participants. Participants with probable dementia and minorities were oversampled. We used survey weights to account for the complex survey design and to make estimates representative of adults over the age of 70 in the USA. ADAMS was approved by Institutional Review Boards at the University of Michigan and Duke University.
Neuropsychological Test Batteries
In SAGES, a neuropsychological test battery was administered during an in-person evaluation that consisted of 11 tests of memory, attention, language and executive function. We used 10 tests from the ADAMS battery, of which 7 were in common with SAGES (table 1). As explained in more detail in the statistical analysis, our modeling approach allows cognitive ability to be estimated based on responses to any subset of cognitive tests.
Clinical diagnoses, grouped as normal cognitive function, cognitive impairment-no dementia (CIND) and all-cause dementia, were assigned in ADAMS by an expert clinical consensus panel [14,17]. Diagnoses were determined after a review of data collected during in-home participant assessments, which included neuropsychological and functional assessments from participants and proxies. A diagnosis of dementia was based on the Diagnostic and Statistical Manual of Mental Disorders III-R and IV [18,19] and for the present study included probable and possible Alzheimer's disease (AD), probable and possible vascular dementia, dementia associated with other conditions (Parkinson's disease, normal pressure hydrocephalus, frontal lobe dementia, severe head trauma, alcoholic dementia, Lewy body dementia) and dementia of undetermined etiology. CIND was defined as functional impairment that did not meet criteria for all-cause dementia or as below average performance on any test. CIND included participants with mild cognitive impairment or cognitive impairment due to other causes (e.g. vascular disease, depression, psychiatric disorder, mental retardation, alcohol abuse, stroke, other neurological condition).
The MMSE is a brief 30-point cognitive screening instrument used to assess global mental status. It is widely used in clinical and epidemiological settings. Its validity as a screening test for all-cause dementia in clinical populations has been previously established [20,21]. George et al. recommended cutoff points of 23/24 for moderate cognitive impairment and 17/18 for severe cognitive impairment. These cutoff points have been widely applied in clinical and research settings [21,22,23,24,25]. An MMSE cutoff of 9/10 has also been used to indicate severe impairment. Although the MMSE is not a preferred test for the identification of mild cognitive impairment, cutoff points of 26/27 [23,27,28] and 28/29 have been used for that purpose. Although the MMSE has poor measurement properties and ceiling effects, we used it as a standard for comparison in this study because it remains widely used and its scores and cutoffs are well recognized.
Statistical Analysis
We used descriptive statistics to characterize the SAGES and ADAMS samples. Analyses were subsequently conducted in four steps: (1) scoring the general cognitive performance factor in SAGES and calibrating it to ADAMS using item response theory (IRT) methods, (2) assessing criterion validity of the general cognitive performance factor using reference standard clinical ratings in ADAMS, (3) assessing convergent validity of the general cognitive performance factor with the MMSE and (4) identifying cutoff points on the general cognitive performance factor corresponding to published MMSE cutoff points.
Scoring the General Cognitive Performance Factor in SAGES and Calibrating to ADAMS
We calculated the internal consistency of the cognitive tests in SAGES using Cronbach's α. The Cronbach's α statistic has a theoretical range between 0 and 1, with 1 indicating high internal consistency. A generally accepted reliability for analysis of between-person group differences is ≥0.80 and for within-person change is ≥0.90. Next, we calculated the general cognitive performance factor in SAGES from a categorical variable factor analysis of the cognitive tests. The general cognitive performance factor score was scaled to have a mean of 50 and standard deviation (SD) of 10 in the US population over the age of 70. The factor analysis is consistent with the IRT graded response model and facilitated precise estimation of reliability across the range of performance [30,31,32,33]. In IRT, reliability is conceptualized as the complement of the squared standard error of measurement (SEM). The SEM is estimated from the test information function, which varies over the range of ability. We described the precision of the measure over the range of general cognitive performance using the SEM calculated from the IRT measurement model [35,36].
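For concreteness, Cronbach's α and the IRT notion of reliability as the complement of the squared SEM can be sketched as follows. This is a simplified illustration on a hypothetical score matrix; the study's actual estimates came from the fitted IRT model, and the `scale_sd` argument is our assumption that the SEM is expressed on the composite's reporting metric (SD = 10):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_persons, n_items) matrix of test scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each test
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the sum score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def irt_reliability(sem, scale_sd=10.0):
    """Reliability as the complement of the squared SEM, with the SEM
    expressed on the same metric as the score (here SD = 10)."""
    return 1.0 - (sem / scale_sd) ** 2
```

For example, an SEM of 3 points on the mean-50, SD-10 metric corresponds to a reliability of 0.91, which is why reliability varies with the SEM across the ability range.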
To scale the general cognitive performance factor in SAGES to the nationally representative ADAMS sample, we took advantage of tests in common between the studies to calibrate scores in SAGES to ADAMS. We categorized cognitive tests into up to 10 discrete equal-width categories to avoid model convergence problems caused by outlying values and to place all tests on a similar scale (see online suppl. table 2; for all online suppl. material, see www.karger.com/doi/10.1159/000357647). Tests in common between the studies were categorized based on the sample distribution in ADAMS. We assigned model parameters for anchor tests in the SAGES factor analysis based on corresponding estimates from the ADAMS-only model that used population-based survey weighting. This procedure allowed us to scale the general cognitive performance factor to reflect the general population of US adults aged 70 and older. Importantly, because the IRT model handles missing data under the assumption that cognitive performance data are missing at random conditional on variables in the model, general cognitive performance can be calculated for each participant based on responses to any subset of cognitive tests, as long as not all test scores are missing. The factor score is the expected a posteriori (EAP) estimate of the latent trait, i.e. the mean of the posterior probability distribution. The posterior probability distribution refers to the conditional density of the latent cognitive performance trait given the observed cognitive test scores. Because the factor score is computed on the basis of all available information in a participant's response pattern, it can be computed regardless of missing tests.
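The recoding of raw test scores into up to 10 equal-width ordinal categories can be sketched as below. This is a simplified version that derives bin edges from the observed minimum and maximum; in the study, tests in common were categorized using the ADAMS sample distribution:

```python
import numpy as np

def to_equal_width_bins(scores, n_bins=10):
    """Recode a continuous test score into equal-width ordinal
    categories 0 .. n_bins-1, leaving missing values (NaN) missing."""
    scores = np.asarray(scores, dtype=float)
    ok = ~np.isnan(scores)
    lo, hi = scores[ok].min(), scores[ok].max()
    edges = np.linspace(lo, hi, n_bins + 1)
    out = np.full(scores.shape, np.nan)
    # digitize against the interior edges yields categories 0..n_bins-1
    out[ok] = np.digitize(scores[ok], edges[1:-1])
    return out
```

Coarsening scores this way pulls outlying raw values into the extreme categories, which is the convergence rationale described above.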
We conducted diagnostic procedures using Monte Carlo simulation by generating 100,001 hypothetical observations with the MMSE and all SAGES and ADAMS cognitive measures. Simulated cognitive test distributions matched those of our empirical samples. This simulation allowed us to rigorously compare SAGES and ADAMS scores to the overall general cognitive performance factor using correlations and Bland-Altman plots to examine systematic differences between the measures.
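A Bland-Altman comparison rests on the distribution of paired score differences; a minimal sketch (illustrative only, on hypothetical paired scores) is:

```python
import numpy as np

def bland_altman_stats(a, b):
    """Mean difference (bias) and 95% limits of agreement between
    two measurements of the same quantity on the same units."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diff = a - b
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)
```

A bias near zero with narrow limits of agreement, and no trend of the differences across the score range, is the pattern indicating the two scores measure the construct equivalently.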
Criterion Validity of the General Cognitive Performance Factor
To evaluate criterion validity for distinguishing dementia and CIND, we used logistic regression. We report overall areas under the curve (AUC) for the general cognitive performance factor and diagnostic characteristics for the score that maximized sensitivity and specificity.
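The cutoff maximizing sensitivity and specificity is conventionally found with Youden's J statistic (sensitivity + specificity - 1). The sketch below is a simplified direct threshold search rather than the study's logistic regression pipeline, and assumes lower scores indicate impairment:

```python
import numpy as np

def best_cutoff(scores, disease):
    """Return (cutoff, Youden's J) maximizing sensitivity + specificity - 1,
    where a 'positive' screen is a score below the cutoff."""
    scores = np.asarray(scores, dtype=float)
    disease = np.asarray(disease, dtype=bool)
    best_c, best_j = None, -np.inf
    for c in np.unique(scores):
        pos = scores < c
        sens = (pos & disease).sum() / disease.sum()
        spec = (~pos & ~disease).sum() / (~disease).sum()
        j = sens + spec - 1.0
        if j > best_j:
            best_c, best_j = c, j
    return best_c, best_j
```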
Convergent Validity of the General Cognitive Performance Factor
We correlated the general cognitive performance factor with MMSE using Pearson's correlation coefficients.
Linking the General Cognitive Performance Factor and MMSE
The MMSE is a widely used screening test for global cognitive status. Because of its widespread use, many clinicians and researchers are familiar with its scores. Thus, the MMSE provides a set of readily recognizable cutoff points which we utilized as guide posts for comparison with the general cognitive performance measure. To produce a 'crosswalk' between the general cognitive performance factor and MMSE, we used equipercentile linking to match scores. This step allowed the direct comparison of general cognitive performance factor scores that correspond to MMSE scores. Equipercentile linking identifies scores on 2 measures (MMSE and general cognitive performance factor) with the same percentile rank, and assigns general cognitive performance a value from the reference test, MMSE, at that percentile. This approach is appropriate when 2 tests are on different scales, and is most useful when the distribution of the reference test is not normally distributed.
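A minimal sketch of equipercentile linking follows. This is a bare-bones illustration on hypothetical score vectors; production linking procedures typically presmooth the score distributions first:

```python
import numpy as np

def equipercentile_link(ref_scores, new_scores, value):
    """Map `value` on the new measure to the reference-measure score
    with the same percentile rank."""
    new_sorted = np.sort(np.asarray(new_scores, dtype=float))
    ref_scores = np.asarray(ref_scores, dtype=float)
    # percentile rank of `value` within the new measure's distribution
    pr = np.searchsorted(new_sorted, value, side="right") / len(new_sorted)
    # reference-measure score at that same percentile
    return np.quantile(ref_scores, pr)
```

Because the mapping runs through percentile ranks rather than raw units, it tolerates the MMSE's skewed, ceiling-limited distribution.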
Results
In SAGES (n = 300), most participants were female (56%), white (95%), on average 77 years old (range 70-92), and had at least a college education (70%; table 1); few had dementia (n = 5, 1.7%) or CIND (n = 19, 6.3%). By comparison, in ADAMS (n = 856), which is representative of persons over the age of 70, participants were mostly female (61%), white (89%), on average 79 years of age (range 70-110), and 37% had at least a college education. Relative to SAGES, the ADAMS sample was older (p < 0.001), more ethnically diverse (p = 0.001), less highly educated (p < 0.001), and had higher levels of cognitive impairment (n = 308, 13.7% had dementia and n = 241, 22.0% had CIND).
Derivation of the General Cognitive Performance Factor in SAGES and Calibration to ADAMS
By design, the general cognitive performance factor in ADAMS had a mean of 50 and SD of 10. The general cognitive performance factor in SAGES was 0.9 SD above the national average, reflecting SAGES participants' higher educational attainment and younger average age. The cognitive tests in ADAMS were internally consistent (Cronbach's α = 0.91). The reliability of the general cognitive performance factor, derived based on the SEM, was above 90% for scores between 40 and 70, which included 84% of the ADAMS sample. Correlations between the general cognitive performance factor and items from SAGES were above 0.80, with the exception of HVLT delayed recall (r = 0.65). The tests represent multiple domains including memory, executive function, language and attention, suggesting the factor represents general cognitive performance and is not dominated by a particular cognitive domain. Figure 1 demonstrates that general cognitive performance factor scores in SAGES and ADAMS were normally distributed; on average, SAGES participants had higher levels of cognitive function.
Using simulated data, the correlation between the study-specific general cognitive performance factor for SAGES and ADAMS was above 0.97. Bland-Altman plots further revealed no systematic bias across the range of general cognitive performance scores, suggesting the general cognitive performance factor was not measured differently across the two studies (see online suppl. fig. 1).
Criterion Validity of the General Cognitive Performance Factor
Figure 2 shows receiver operating curves for distinguishing dementia and CIND in ADAMS. The general cognitive performance factor score that best discriminated dementia participants from cognitively normal participants was less than 44.8 (sensitivity = 0.94, specificity = 0.90; fig. 2, right panel). This cutoff point correctly classified 94% of the sample. The AUC was 0.97. The general cognitive performance factor score that best discriminated CIND participants from cognitively normal participants was less than 49.5 (sensitivity = 0.80, specificity = 0.76; fig. 2, left panel). This cutoff point correctly classified 79% of the sample (AUC = 0.84).
The AUC for the general cognitive performance factor was significantly greater than the AUC for each constituent test (online suppl. table 1). The only exception was for immediate word recall, which was superior for predicting dementia.
Convergent Validity of the General Cognitive Performance Factor
The correlation between the general cognitive performance factor and the MMSE was 0.91 in ADAMS (p < 0.001), indicating strong evidence of convergent validity.
Crosswalk between General Cognitive Performance Factor and MMSE
The equipercentile linking procedure is illustrated in figure 3. Scores for the general cognitive performance factor and MMSE were matched based on percentile ranks. For example, a score of 24 on the MMSE has the same percentile rank as a score of 45 on the general cognitive performance factor.
After equipercentile linking the general cognitive performance factor with the MMSE, the score corresponding to an MMSE cutoff point of 23/24 was <45.2, or 0.48 SD below the national average (table 2). This cutoff point was nearly identical to the score that best discriminated participants with and without dementia (fig. 2). General cognitive performance factor scores of <50.9 and <40.7 corresponded to MMSE scores of 26/27 and 17/18, respectively. Table 2 provides sensitivity, specificity, positive and negative predictive values, and likelihood ratio positive and negative statistics for predicting dementia and CIND for general cognitive performance factor scores corresponding to MMSE scores of 29, 27, 24, 18 and 10. The general cognitive performance factor cutoff point of 45.2 correctly classified 90% of persons with dementia (sensitivity) and 93% of persons without dementia (specificity). This cutoff point is moderately strong for confirming dementia (positive predictive value = 73%, likelihood ratio positive = 12.8) and has an excellent negative predictive value (98%).
We constructed a crosswalk to show scores on the general cognitive performance factor and corresponding scores on the MMSE (fig. 4). The irregular intervals between successive MMSE scores and the limited observable range of the MMSE evident in this figure underscore the broader range and better interval scaling properties of the general cognitive performance factor.
Discussion
We developed a unidimensional factor of cognitive performance in the SAGES study, scaled it to national norms for adults over 70 years of age living in the USA, and evaluated its criterion and convergent validity. When validated against reference standard diagnoses of dementia, a score of approximately 45 had a sensitivity of 94% and specificity of 90% (AUC = 0.97), indicating outstanding performance. Convergent validity with the MMSE was excellent (correlation = 0.91, p < 0.001). The cognitive tests comprising the general cognitive performance factor were internally consistent (Cronbach's α = 0.91). The factor is highly precise across most of the score range (reliability >0.90). To enhance the clinical relevance of the scores, we provided a crosswalk to widely used MMSE cutoff points. Notably, the score of 45 was both the optimal cutoff point for dementia and the score corresponding to an MMSE of 23/24.
General cognitive performance factors have previously been shown to have minimal floor and ceiling effects, measurement precision over a wide range of cognitive function, and interval scaling properties that make them ideally suited measures of change [12,31]. We replicated these findings, and identified meaningful cutoff points to further enhance the potential utility of the measure. Strengths of this study include calibration of the cognitive composite to a large, nationally representative sample of older adults in which rigorous reference standard clinical diagnoses were available. With this sample, we were able to evaluate criterion and convergent validity of the general cognitive performance factor for detecting cognitive impairment and demonstrate its favorable test performance characteristics.
Several caveats merit attention. First, the general cognitive performance factor is not intended to diagnose dementia or mild cognitive impairment. It simply provides a refined cognitive measure which, like any cognitive measure, represents only one piece of the necessary armamentarium for establishing a clinical diagnosis. Second, only 7 tests were in common between ADAMS and SAGES for calibrating the cognitive composites, although Bland-Altman plots confirmed the composites were similar between studies. Although we are convinced that the general cognitive performance factors developed in the two samples were equivalent, further research is needed to determine the minimum sample size, the number of cognitive tests available in common between studies, and the degree of correlation among tests needed to estimate a reliable composite in a new sample. Existing research suggests that at least 5 anchor items in an IRT analysis such as ours are enough to produce reasonably accurate parameter estimates; one previous study used a single item in common to calibrate different scales. Third, some criterion contamination is potentially present in our evaluation of criterion validity because the general cognitive performance factor in ADAMS is a unique combination of shared variability among test items that were available to clinicians when assigning clinical diagnoses. However, comparison of AUCs (online suppl. table 1) for the general factor and individual cognitive tests revealed that the former performed better than most of its constituent parts. Fourth, positive and negative predictive values in our study depend on base rates and may differ in other samples. Fifth, while reliability of the general cognitive performance factor was excellent across a range of performance that included 84% of the ADAMS population, results suggest it is less reliable at more impaired levels. Future calibration efforts should consider cognitive tests that are sensitive to impairment in populations with more severe degrees of dementia.
A final caveat is that our approach is not intended to replace examination and interpretation of individual neuropsychological tests. Such examination remains an important approach to examine domain-specific cognitive changes to assist with clinical diagnosis and to understand pathophysiological correlates of various cognitive disorders.
An important implication of the present work is the potential to derive the general cognitive performance factor in other samples whose neuropsychological batteries overlap with the battery used in ADAMS. Extrapolation of these methods holds the potential for harmonizing batteries to enhance comparability and even to synthesize results across studies through integrative data analysis. This method would thus address a substantial limitation to combining existing studies that use disparate neuropsychological batteries. Harmonizing samples with different research designs and demographic characteristics provides opportunities to make findings more generalizable. Without such a common metric, or at least 1 test in common across studies, integrative analysis must resort to comparing multiple data points from normative tests that potentially measure diverse cognitive domains.
The need for uniform measures of cognitive function, derived using rigorous psychometric methods, has been recognized by national groups. Uniform, psychometrically sound measures are a central focus of the NIH PROMIS and Toolbox initiatives. Our study is consistent with these goals. The innovative approach demonstrated here used psychometric methods to generate a unidimensional general cognitive performance composite with excellent performance characteristics that can be used to measure cognitive change over time and across studies. We established clinically meaningful, population-based cutoff points. This measure, and the methods used to create it, holds substantial promise for advancing work to evaluate the progression of cognitive functioning over time. Perhaps most importantly, our methods can facilitate future strategies to integrate cognitive test results across epidemiological and clinical studies of cognitive aging.
Conclusions
We created a composite factor for general cognitive performance from widely used tests of neuropsychological functioning using psychometrically sophisticated methods. We used publicly available neuropsychological performance data from the ADAMS to calibrate general cognitive performance to a nationally representative sample of adults aged 70 and older in the USA. This calibration enabled us to describe cognitive functioning in our study on a nationally representative scale. The general cognitive performance factor was internally consistent, provided reliable measurement across a wide range of cognitive functioning, and demonstrated minimal floor and ceiling effects. It also demonstrated criterion validity: a cutoff point of approximately 45, corresponding to an MMSE of 23/24, optimally discriminated participants with and without dementia (sensitivity = 0.94, specificity = 0.90, AUC = 0.97). Our approach has broad applicability and usefulness for directly comparing cognitive performance in new and existing studies when overlapping items with the ADAMS neuropsychological battery are present. These methods can facilitate interpretation and synthesis of findings in existing and future research studies.
This work was supported by a grant from the National Institute on Aging (P01AG031720 to Dr. Inouye). Dr. Gross was supported by a National Institutes of Health Translational Research in Aging Postdoctoral Fellowship (T32AG023480) and by a grant from the NIA (R03AG045494). Dr. Inouye holds the Milton and Shirley F. Levy Family Chair in Alzheimer's Disease. Dr. Fong was supported by a National Institute on Aging Career Development Award (K23AG031320). The contents do not necessarily represent the views of the funding entities. Dr. Gross had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
None of the authors report any conflicts of interest or received compensation for this work.