Abstract
Introduction: Accurate estimation of dementia prevalence is essential for making effective public and social care policy to support individuals and families suffering from the disease. The purpose of this paper is to estimate the prevalence of dementia in India using a semi-supervised machine learning approach based on a large nationally representative sample. Methods: The study sample comprised adults aged 60 years or older in wave 1 (2017–2019) of the Longitudinal Aging Study in India (LASI). A subsample in LASI received extensive cognitive assessment and clinical consensus ratings and therefore has diagnoses of dementia. A semi-supervised machine learning model was developed to predict the status of dementia for LASI participants without diagnoses. After obtaining the predictions, sampling weights and age standardization to the World Health Organization (WHO) standard population were applied to generate the estimate for prevalence of dementia in India. Results: The prevalence of dementia for those aged 60 years and older in India was 8.44% (95% CI: 7.89%–9.01%). The age-standardized prevalence was estimated to be 8.94% (95% CI: 8.36%–9.55%). The prevalence of dementia was greater for those who were older, were female, received no education, and lived in rural areas. Discussion: The prevalence of dementia in India may be higher than prior estimates derived from local studies. These prevalence estimates provide the information necessary for long-term planning of public and social care policy. The semi-supervised machine learning approach adopted in this paper may also be useful for other large population aging studies that have a similar data structure.
Introduction
Dementia is a neurodegenerative disease with a wide range of impacts on individuals, families, and society [1, 2]. Accurate estimation of dementia prevalence is essential for making effective public and social care policy to support individuals and families suffering from the disease [3]. Prior research conducted in India showed large variations in estimates for dementia prevalence [4]. A large systematic review of evidence from 1990 to 2016 estimated that 2.93 million out of the 1.36 billion total population in India had dementia, whereas the same study estimated that 4.02 million out of the 322.87 million total population in the USA had dementia [5]. The estimate for the USA was based on large epidemiological studies; however, the estimate for India was synthesized from small local studies.
Dementia ascertainment in a large epidemiological study is challenging and costly. Since there is no single definitive diagnostic test for dementia, large epidemiological studies often rely on extensive cognitive assessment and clinical evaluation to ascertain dementia [6, 7]. Clinical consensus diagnosis, a process designed to reach consensus diagnoses/labels of dementia among multiple clinical experts, is considered the gold standard as it can surpass the accuracy of individual expert evaluation [8]. Nevertheless, ascertaining dementia for every subject in a large epidemiological study is impractical due to the resources needed to execute the necessary in-depth assessments of cognition and clinical status [9]. As a result, existing large epidemiological studies like the seminal Health and Retirement Study (HRS) in the USA and its sister global studies only selected a small subsample for dementia ascertainment [6, 7, 9].
The purpose of this paper is to estimate the prevalence of dementia in India based on a semi-supervised machine learning analysis using samples from the Longitudinal Aging Study in India (LASI) [10]. LASI is one of the 25 HRS family studies and the first and only nationally representative aging study in India [11, 12, 13]. Semi-supervised learning is a machine learning approach that aims to make predictions for a large sample of unlabeled observations based on a smaller sample of labeled observations [14]. In the LASI, a small subset of subjects participated in the LASI-Diagnostic Assessment of Dementia (LASI-DAD) study to receive in-depth assessment of cognition and clinical consensus diagnosis of dementia [15]. Therefore, the estimate for dementia in this paper was based on the smaller number of observations with dementia diagnoses/labels in the LASI-DAD study and the predictions of dementia from the semi-supervised learning model for the larger sample of unlabeled observations in the LASI.
Materials and Methods
Study Sample
The sample of this study comprises adults 60 years or older in the LASI (N = 31,477) [11]. The first wave of LASI (2017–2019) enrolled a nationally representative sample of more than 72,000 individuals aged 45 and older [10], the largest among all HRS family studies. The LASI wave-1 survey achieved an overall household response rate of 96% and an individual response rate of 87% [10].
The LASI-DAD study drew a subsample aged 60 and older (N = 4,096) to conduct in-depth assessments of cognition and dementia and informant interviews of their cognitive and health conditions [15]. About 60% of LASI-DAD participants received clinical consensus diagnoses of their dementia status [16]. A machine learning model has been developed and validated to expand the clinical consensus diagnosis of dementia to all LASI-DAD participants [17].
Dementia Assessment and Diagnosis
The LASI-DAD study built a web-based approach [15] that assigned at least three clinicians for each individual to evaluate the six subdomains in the clinical dementia rating (CDR) scale [18]: (1) memory, (2) orientation, (3) judgment and problem solving, (4) community affairs, (5) home and hobbies, and (6) personal care. The CDR algorithm automatically generated a global rating ranging from 0 (normal) to 3 (severe dementia). For cases where individual global ratings differed, an automatic email was sent to the assigned reviewers, giving them a chance to review the case, read other clinicians' comments, and update their ratings, if desired [15]. This second round of review resulted in consensus for additional cases. For cases where consensus was not reached after the second round of review, a group of clinicians discussed the case through a virtual consensus meeting to determine the global CDR rating. An individual was classified as having dementia if the consensus CDR rating was ≥1.
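The decision rule in this paragraph can be illustrated with a minimal Python sketch. It covers only the agreement check and the CDR ≥ 1 cutoff described above; the function names are hypothetical, and the sketch does not implement the CDR scoring algorithm that derives a global rating from the six subdomains.

```python
from typing import Optional, Sequence

# From the text: an individual is classified as having dementia
# if the consensus global CDR rating is >= 1.
DEMENTIA_CDR_THRESHOLD = 1.0


def consensus_cdr(ratings: Sequence[float]) -> Optional[float]:
    """Return the shared global CDR if all clinicians agree, else None.

    None signals that the case needs a further review round or a
    consensus meeting (hypothetical helper, for illustration only).
    """
    if len(ratings) < 3:
        raise ValueError("each case needs ratings from at least three clinicians")
    first = ratings[0]
    return first if all(r == first for r in ratings) else None


def has_dementia(consensus_rating: float) -> bool:
    """Apply the CDR >= 1 cutoff to a consensus rating."""
    return consensus_rating >= DEMENTIA_CDR_THRESHOLD
```

For example, three ratings of 1 yield a consensus of 1 and a dementia classification, whereas ratings of 0.5, 1, and 1 return None and would go back to the reviewers.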
As mentioned above, about 60% of individuals in the LASI-DAD sample received clinical consensus diagnoses of dementia, while the remaining 40% did not progress through the diagnostic process. A machine learning model was developed to predict the dementia status for the portion of the LASI-DAD sample without clinical consensus diagnoses. The model was trained using a 2-step procedure in which a number of machine learning models, including stochastic gradient boosting, support vector machine, elastic net, multivariate adaptive regression splines, and multilayer perceptron, were first fitted based on a training dataset to generate predicted risk scores of dementia. Then, a clinically informed majority-voting process was trained to output the final classification of dementia by combining multiple weak classifications derived from the predicted risk scores. The final model was selected by comparing predictive accuracy on a test set. It was a stochastic gradient boosting model with an area under the receiver operating characteristic (ROC) curve of 0.94, indicating that the model has an almost perfect discriminative ability according to thresholds suggested in prior research [19]. Details of the machine learning model have been described elsewhere [17]. To maximize the amount of labeled data for analysis in this paper, both the clinical consensus diagnoses and the predicted diagnoses in the LASI-DAD study were used as the labeled data to develop the semi-supervised machine learning model.
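For readers unfamiliar with the area under the ROC curve, the following Python sketch shows one standard way to compute it: as the probability that a randomly chosen case with dementia receives a higher predicted risk score than a randomly chosen case without dementia, with ties counted as one half. The function name and inputs are illustrative assumptions and are not taken from the original analysis.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for dementia, 0 otherwise; scores: predicted risk scores.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count a win when the positive case scores higher, half a win for ties.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.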
Predictors
The following predictors were used to develop the semi-supervised machine learning model: (1) basic demographic variables including age, education levels, and marital status; (2) functional difficulties measured by the Activities of Daily Living (ADL) scale [20], the Instrumental Activities of Daily Living (IADL) scale [21], and other functional limitations in mobility, large muscle, and gross and fine motor activities [22]; (3) depression measured by the 10-item Center for Epidemiologic Studies Depression (CESD) scale [23]; (4) cognitive functioning assessed by the write-a-sentence task, date and location naming, the ten-word recall task, the serial seven tasks, object naming, reading command, drawing object, verbal fluency, backward counting, and doing a specified computation, and by the Informant Questionnaire on Cognitive Decline in the Elderly (IQCODE) for participants unable to undergo cognitive functioning assessment [10, 11]; (5) health behaviors (physical activity, drinking, and smoking) and existing diagnoses of health problems, including visual and hearing impairment, stroke, diabetes, high blood pressure, and heart problems; and (6) other predictors including height, weight, and indoor pollution. As is common practice in machine learning analysis, all missingness was coded as an impossible value for continuous predictors or a distinct category for categorical predictors.
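The missing-data coding described above can be sketched in Python as follows. The sentinel value, the category label, and the variable names are hypothetical choices made for illustration; the source does not state which values were used.

```python
from typing import Dict, List, Union

Value = Union[float, str, None]

# Hypothetical codes: any value outside the plausible range of every
# continuous predictor, and any label not already used as a category.
CONTINUOUS_MISSING = -999.0
CATEGORICAL_MISSING = "missing"


def encode_missing(
    record: Dict[str, Value],
    continuous: List[str],
    categorical: List[str],
) -> Dict[str, Value]:
    """Return a copy of the record with missing values replaced by codes."""
    out = dict(record)
    for name in continuous:
        if out.get(name) is None:
            out[name] = CONTINUOUS_MISSING
    for name in categorical:
        if out.get(name) is None:
            out[name] = CATEGORICAL_MISSING
    return out
```

This kind of coding suits tree-based learners such as gradient boosting, which can split the sentinel off from valid values; it would distort models that treat the predictor as a linear term.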
The Semi-Supervised Learning Approach
The predictions of dementia status for unlabeled observations in the LASI were obtained from a semi-supervised learning model developed using the self-training approach [14]. In self-training, an initial predictive model is trained with a reduced set of labeled data. The model is then iteratively retrained with its own most confident predictions on the unlabeled data. Self-training is a simple and effective approach and has been applied to several semi-supervised learning tasks with biomedical data [24, 25, 26, 27].
In this study, the initial predictive model was trained with a random selection of 70% of the labeled data from the LASI-DAD study using a stochastic gradient boosting method, the same machine learning method that has been shown in our previous study to generate accurate predictions for LASI-DAD participants [17]. The initial model was then iteratively retrained using the self-training function in the R “ssc” package [28]. The remaining 30% of the labeled data were reserved as a test set for assessing the predictive accuracy of the semi-supervised learning model.
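The self-training loop can be illustrated with a short, self-contained Python sketch. This is not the analysis code and does not reproduce the behavior of the R “ssc” package: a toy nearest-centroid learner stands in for stochastic gradient boosting so that the example runs with the standard library alone, and the confidence threshold and iteration cap are assumed values.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def fit_centroids(X: List[Point], y: List[int]) -> Dict[int, List[float]]:
    """Toy base learner: one centroid per class (stand-in for boosting)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_proba(model: Dict[int, List[float]], x: Point) -> Tuple[int, float]:
    """Return (predicted label, confidence) from distances to centroids."""
    scores = {lab: math.exp(-math.dist(x, c)) for lab, c in model.items()}
    label = max(scores, key=scores.get)
    return label, scores[label] / sum(scores.values())


def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_iter=10):
    """Retrain on the model's own confident predictions until none remain."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    model = fit_centroids(X_lab, y_lab)
    for _ in range(max_iter):
        if not pool:
            break
        scored = [(x, *predict_proba(model, x)) for x in pool]
        confident = [(x, lab) for x, lab, p in scored if p >= threshold]
        if not confident:
            break  # nothing left that the model is sure about
        for x, lab in confident:
            X_lab.append(x)  # pseudo-labeled points join the training set
            y_lab.append(lab)
        pool = [x for x, lab, p in scored if p < threshold]
        model = fit_centroids(X_lab, y_lab)
    return model
```

In the study's setting, the labeled inputs would correspond to the 70% training portion of the LASI-DAD labels and the unlabeled pool to the remaining LASI participants, with the held-out 30% used only for evaluation.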
Estimating the Prevalence of Dementia
After obtaining the predictions, sampling weights and age standardization were applied to generate the estimate for prevalence of dementia in India. Details of the sampling method and weight construction in LASI have been described elsewhere [10, 11]. The age-standardized prevalence was calculated as a weighted average of the age-specific prevalence, where the weights were the proportions of persons in the corresponding age groups of the WHO standard population [29].
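The two calculations in this paragraph can be sketched in Python under simplifying assumptions. The function names are hypothetical, and the numbers in the usage note are made up for illustration; they are neither LASI estimates nor WHO standard population values. The standard weights are renormalized over the age groups present, since the analysis covers only ages 60 and older.

```python
from typing import Dict, List, Tuple


def weighted_prevalence(cases: List[Tuple[int, float]]) -> float:
    """Survey-weighted prevalence.

    cases: (dementia indicator 0/1, sampling weight) per participant.
    """
    total_weight = sum(w for _, w in cases)
    return sum(d * w for d, w in cases) / total_weight


def age_standardized(
    age_specific: Dict[str, float],
    standard_pop: Dict[str, float],
) -> float:
    """Weighted average of age-specific prevalence (direct standardization).

    standard_pop holds the standard population share or count per age
    group; shares are renormalized over the groups in age_specific.
    """
    total = sum(standard_pop[g] for g in age_specific)
    return sum(age_specific[g] * standard_pop[g] for g in age_specific) / total
```

With illustrative age-specific prevalences of 0.04, 0.10, and 0.20 for three age groups and standard weights of 0.6, 0.3, and 0.1, the standardized prevalence is 0.04 × 0.6 + 0.10 × 0.3 + 0.20 × 0.1 = 0.074.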
We first estimated the age-standardized and age-specific prevalence of dementia in India. Then, we estimated age-standardized prevalence by gender, education, urbanicity, and state to provide detailed estimates for subgroups. Confidence intervals (CIs) for all estimates were obtained using the Rao-Scott χ2 method [30]. The analysis was implemented by using the R “survey” package [31].
Results
Study Sample Characteristics
The sample of this study includes LASI participants aged 60 and older (N = 31,477). Among them, a total of 4,096 participants were involved in the LASI-DAD study and therefore have labeled dementia status. The remaining 27,381 participants had no dementia label. Characteristics of the study sample are shown in Table 1.
Semi-Supervised Learning Model
National Prevalence of Dementia in India
This paper estimated the prevalence of dementia for those aged 60 years and older in India to be 8.44% (95% CI: 7.89%–9.01%), suggesting that 10.08 million older adults in India have dementia. The age-standardized prevalence was estimated to be 8.94% (95% CI: 8.36%–9.55%). Estimates for age-specific prevalence are shown in Figure 2 and Table 2. With increasing age, there were greater proportions of people suffering from dementia in India.
Age-Standardized Prevalence of Dementia by Gender, Education, and Urbanicity
Estimates for age-standardized prevalence of dementia by gender, education, and urbanicity in India are shown in Table 3. Females were estimated to have a higher prevalence of dementia than males (12.29% vs. 5.37%). Prevalence declined with education: 12.23% among people with no school education, 4.72% among those with less than lower secondary education, 2.88% among those with upper secondary education, and 1.65% among those with tertiary education. People living in rural areas had a higher prevalence of dementia than people living in urban areas (10.19% vs. 6.07%).
Age-Standardized Prevalence of Dementia by State and Union Territory
Estimates for age-standardized prevalence of dementia by state and union territory are shown in Table 4. Significant geographic variation was observed, ranging from 1.82% (95% CI: 0.25%–3.38%) in Chandigarh, a union territory that serves as the capital of Punjab and Haryana, to 13.57% (95% CI: 11.19%–15.95%) in West Bengal. Seventeen states were estimated to have a prevalence higher than the national age-standardized prevalence (i.e., 8.94%), and ten states had a prevalence greater than 10.00%.
Discussion
This study has produced estimates for prevalence of dementia in India using a semi-supervised machine learning approach based on a nationally representative sample from India. Existing studies for dementia prevalence in India were based on local samples and generated results with large variations [4]. For example, Vas et al. [32] conducted a relatively large study with about 25,000 participants from Mumbai and estimated the prevalence to be 2.31% for people 65 years or older. Banerjee et al. [33] estimated the prevalence to be 1.27% based on a sample of 17,584 participants 50 years or older from the city of Kolkata. At the other end of the range, Rao et al. [34] found the prevalence to be 10% for adults 60 years or older based on a sample of 3,033 from a rural area in South India. Synthesis of those local estimates suggested that the prevalence rate of dementia in India is lower than in major Western countries like the USA and the UK [5].
However, the estimate in this study suggests that dementia prevalence in India may be higher than prior estimates and is more in line with estimates from other large epidemiological studies. Using the HRS sample, Langa et al. [35] estimated that the prevalence of dementia in the USA for adults 65 years or older was 11.6% in 2000 and 8.8% in 2012. Estimates generated from the English Longitudinal Study of Ageing, an HRS sister study in the UK, suggested the prevalence of dementia in adults aged 60–89 to be 9.0% in the country [36]. Estimates from the Survey of Health, Ageing, and Retirement in Europe suggested that the prevalence of dementia for adults 65 years or older ranged from 6.86% in Czechia to 9.06% in Spain [37]. The same study estimated that the two largest European countries, i.e., France and Germany, had a prevalence of dementia of 8.91% and 8.54%, respectively [37].
Results in this study revealed a significant geographic variation in dementia prevalence after age standardization. Factors that may contribute to this geographic variation include the varying education level and urban-rural composition of study samples by state and union territory. For example, in Jammu & Kashmir, a state with an estimated dementia prevalence rate of 12.94%, 63.17% of the study sample had received no school education and 72.60% lived in rural areas. In contrast, in Chandigarh, a union territory with an estimated dementia prevalence rate of 1.82%, 24.85% of the study sample had received no school education and 1.27% lived in rural areas.
To the best of our knowledge, this paper is the first study that uses a semi-supervised machine learning approach to estimate the prevalence of dementia based on a nationally representative sample. This approach has the advantage of combining the large LASI sample with the valuable but smaller number of dementia ratings from the LASI-DAD to generate national prevalence estimates. Existing research on utilizing machine learning techniques for estimating dementia prevalence is limited. Langavant et al. [38] applied an unsupervised learning approach to estimating the prevalence of dementia using data from ten HRS family studies including the LASI. Unlike the semi-supervised learning approach in this paper, the unsupervised learning approach did not rely on any diagnosis of dementia; instead, it attempted to identify dementia based on clusters of individuals in the data. The study generated a prevalence estimate of 7.5% for adults 50 years or older in India [38].
This paper has several strengths as well as some limitations. This paper uses a large nationally representative sample from India. The status of dementia for a subsample is ascertained by clinical consensus diagnoses, which are considered the gold standard for ascertaining dementia in large epidemiological studies. Machine learning approaches are used to make predictions for unlabeled observations based on the labeled observations from clinical consensus diagnoses. This paper is limited by excluding individuals younger than 60 years old from the analysis. This was due to the design of LASI-DAD, which selected only individuals aged 60 and older for clinical consensus evaluation for dementia. Furthermore, the paper is limited by the fact that both the clinical consensus diagnoses and the predictions from a prior machine learning model are used to generate the labeled data for developing the semi-supervised learning model. On the one hand, this approach increases the amount of labeled data. On the other hand, it introduces errors because the predictions are not perfect. We also acknowledge that machine learning is a relatively new approach to estimating dementia prevalence and may generate results that differ from estimates from other statistical methods. Based on the same LASI and LASI-DAD data, a regression-based multiple imputation analysis estimated the dementia prevalence to be 7.4% for adults 60 years or older [39], which is about 1 percentage point lower than this paper’s estimate but still higher than prior estimates synthesized from local studies. The reasons for the difference may involve different choices in predictors and statistical models and are worthy of further investigation.
In conclusion, findings from this paper suggest that the prevalence of dementia in India may be higher than prior estimates derived from local studies. These prevalence estimates provide the information necessary for long-term planning of public and social care policy to support individuals and families suffering from dementia. The estimates also provide a basis for assessing the impact of treatment advances as they become available [40]. The semi-supervised machine learning approach adopted in this paper may also be useful for other large epidemiological aging studies that have a similar data structure (i.e., a large population-based sample with a smaller subset labeled with dementia status based on extensive cognitive assessment and clinical evaluation).
Statement of Ethics
This study protocol was reviewed and approved by the Institutional Review Board, University of Southern California, approval number UP-15-00684, the Indian Council of Medical Research, approval number 54/01/Indo-foreign/Ger/16-NCD-II, and the Ethics Committee, All India Institute of Medical Sciences, New Delhi, approval number IEC-284/06.05.2016, RP-33/2016. Written informed consent was obtained from each participating respondent and informant in the form of a signature or thumbprint.
Conflict of Interest Statement
The authors have no conflicts of interest to declare.
Funding Sources
This work and LASI-DAD data collection were funded by the US National Institute on Aging, National Institutes of Health [R01AG051125 and U01AG065958]. Data collection of LASI was supported by the US National Institute on Aging, National Institutes of Health [R01AG042778], and the Ministry of Health and Family Welfare, Government of India (T22011/02/2015-NCD). Data processing of LASI was supported by the National Institute on Aging, National Institutes of Health [R01AG030153].
Author Contributions
HJ: conceptualization, methodology, data curation and analysis, and writing – original draft and editing; EC: conceptualization, funding acquisition, and writing – review and editing; KML, ABD, and JL: conceptualization, funding acquisition, data collection, and writing – review and editing.
Data Availability Statement
The LASI and LASI-DAD data are publicly available (LASI website: www.iipsindia.ac.in/lasi; LASI-DAD website: www.lasi-dad.org). Machine learning predictions of dementia status for LASI and LASI-DAD participants without consensus clinical diagnosis are available upon request.