Abstract
Background: Alzheimer’s disease (AD) is a chronic neurodegenerative disease. This study aimed to identify potential diagnostic biomarkers for AD. Methods: AD samples and healthy control samples were collected from 2 datasets in the GEO database, and differentially expressed genes (DEGs) were identified using the limma package in R. GO and KEGG pathway enrichment analyses of the DEGs were conducted with the clusterProfiler package in R. The PPI network was constructed and hub genes were predicted using the STRING database and Cytoscape. A logistic regression model was then constructed to predict the sample type. Results: Bioinformatic analysis of the GEO datasets revealed 2,063 and 108 DEGs in the GSE5281 and GSE4226 datasets, respectively, of which 15 DEGs overlapped. GO and KEGG enrichment analysis revealed terms associated with neurodevelopment. We then built a logistic regression model based on the hub genes from the PPI network and reduced the model to 3 genes (ALDOA, ENC1, and NFKBIA). The area under the curve values for the training set GSE5281 and the testing set GSE4226 were 0.9647 and 0.7857, respectively, which supported the efficacy of this model. Conclusion: The comprehensive bioinformatic analysis of gene expression in AD patients and the logistic regression model built in our study may provide promising research value for diagnostic methods of AD.
Introduction
Alzheimer’s disease (AD) is a chronic neurodegenerative disease associated with disorder of the central nervous system and usually occurs in the presenium and senium [1, 2]. The clinical characteristics of AD include dysmnesia, aphasia, agnosia, and executive dysfunction, as well as personality and behavior changes [3, 4]. It is now well recognized that pathophysiological changes can start much earlier than the clinical manifestations of AD and that the disease spans a continuum from clinically asymptomatic to severely impaired [5]. Consequently, AD cannot be defined based on its clinical features alone, and researchers have made considerable efforts to recognize AD based on both clinical and biomarker findings.
Research on AD over the past decades has greatly increased our understanding of the disease and simultaneously emphasized its complexity [6, 7]. Many biological processes have been proposed to be related to AD, such as oxidative stress, neuroinflammation, mitochondrial dysfunction, autophagy, and gut microbiota [8‒12]. Studies on molecular mechanisms have indicated the participation and critical role of amyloid-β (Aβ) and Tau in the genesis of AD [2, 13, 14]. However, the pathogenesis of AD remains unclear.
At present, the diagnosis of AD relies mostly on imaging technology, assessment of cognitive level, and several fluid biomarkers including Aβ [15‒18]. Yet, it has become increasingly apparent that AD is a more complicated disease associated with a large regulatory network [19]. This makes it urgent to explore more accurate diagnostic and therapeutic targets for AD.
In the past decade, the rapid development of microarray and high-throughput sequencing technologies has provided an efficient and widely used means of decoding disease-related genetic and epigenetic regulation and has yielded abundant evidence for the diagnosis and treatment of multiple diseases [20, 21]. Several studies have reported potential biomarkers of AD through bioinformatic analysis. For example, whole-exome and whole-genome sequencing studies have identified TREM2 and TREML2 as risk factors for AD [22, 23]. A next-generation sequencing study identified 12 miRNAs in blood samples that were associated with AD development [24].
In the current study, we adopted integrated methods of bioinformatics and machine learning to analyze mRNA data from brain tissue samples and blood samples of AD patients collected from the Gene Expression Omnibus (GEO) database, screened diagnostic targets at the gene level, and built a diagnostic logistic regression model for AD. Our work may provide supportive evidence for early diagnosis and therapy of AD.
Materials and Methods
Data Collection
We downloaded 2 mRNA expression datasets, namely, GSE5281 and GSE4226, from the GEO (https://www.ncbi.nlm.nih.gov/geo/) database. The GSE5281 dataset contained 161 brain tissue samples, including 87 samples from patients with AD and 74 from healthy donors, and was profiled on the Affymetrix Human Genome U133 Plus 2.0 Array platform. The GSE4226 dataset contained 28 peripheral blood samples, including 14 from AD patients and 14 from healthy donors, and was profiled on the NIA MGC, Mammalian Genome Collection Array platform.
Differentially Expressed Genes
The raw data downloaded from the GEO database were normalized using the robust multiarray average method and log2-transformed, followed by identification of differentially expressed genes (DEGs). We identified DEGs by using the limma package [25] in R software (version 3.5.2). Genes with |log2FC| >1 and FDR ≤0.05 were considered DEGs.
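The DEG criterion can be illustrated with a minimal Python sketch. It is not the limma workflow used in this study; it only shows how a fold-change and FDR filter of this kind operates. Two assumptions are made: the FDR is computed with the Benjamini-Hochberg procedure, and the fold-change cutoff is applied to the absolute log2FC so that both upregulated and downregulated genes are retained. The gene labels and values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, log2fc, p_values, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Keep genes with |log2FC| > fc_cutoff and FDR <= fdr_cutoff."""
    fdr = benjamini_hochberg(p_values)
    return [
        gene
        for gene, fc, q in zip(genes, log2fc, fdr)
        if abs(fc) > fc_cutoff and q <= fdr_cutoff
    ]


# Hypothetical toy input (not data from GSE5281 or GSE4226).
toy_genes = ["geneA", "geneB", "geneC", "geneD"]
toy_log2fc = [2.0, -1.5, 0.5, -0.2]
toy_p = [0.01, 0.04, 0.03, 0.005]
toy_degs = select_degs(toy_genes, toy_log2fc, toy_p)
```

In this toy input, geneC and geneD pass the FDR cutoff but are excluded because their fold changes are too small, which is why both thresholds are needed.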
Functional Enrichment Analysis
The clusterProfiler package [26] was used to conduct Gene Ontology (GO) analysis, which included biological process, molecular function, and cellular component terms, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis. A p value <0.05 was used as the significance threshold.
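Over-representation analysis of this kind is commonly based on a hypergeometric test. The following Python sketch is illustrative only and does not reproduce the clusterProfiler implementation. It computes the probability of observing at least a given number of DEGs in a term by chance. The function name, parameter names, and counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_term, n_degs, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: number of genes in the background set
    n_in_term:    background genes annotated to the term
    n_degs:       number of DEGs drawn from the background
    n_overlap:    DEGs annotated to the term
    """
    upper = min(n_in_term, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 background genes, 4 in the term, 3 DEGs, 2 overlapping.
toy_p_value = enrichment_p_value(10, 4, 3, 2)
```

A term would be reported as enriched when this probability falls below the chosen threshold. In practice, the p values of all tested terms are usually corrected for multiple testing as well.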
PPI Network of DEGs
We used the online database Search Tool for the Retrieval of Interacting Genes (STRING) (version 11.0) (https://string-db.org/) [27] to predict and analyze functional connections and potential protein-protein interactions (PPI) among the overlapping genes and retained the interaction pairs with a confidence score ≥0.4. The retained PPI were visualized using Cytoscape software (version 3.7.2) [28]. The cytoHubba plug-in of Cytoscape was used to screen key genes in the PPI network based on the MCC algorithm, and the MCODE plug-in was used for modularization analysis.
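The edge-filtering step can be sketched in Python as follows. This is a simplified illustration and not the STRING or cytoHubba procedure. In particular, nodes are ranked here by degree (number of retained interaction partners) as a simple stand-in for the MCC score, which is a different measure. The node labels and confidence scores are hypothetical.

```python
from collections import defaultdict


def filter_edges(scored_edges, min_score=0.4):
    """Keep interaction pairs whose confidence score is >= min_score."""
    return [(a, b) for a, b, score in scored_edges if score >= min_score]


def rank_by_degree(edges):
    """Rank nodes by number of partners (descending), ties broken by name."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scored interactions (not STRING output).
toy_scored_edges = [
    ("G1", "G2", 0.90),
    ("G1", "G3", 0.70),
    ("G2", "G3", 0.50),
    ("G3", "G4", 0.45),
    ("G4", "G5", 0.20),
]
toy_edges = filter_edges(toy_scored_edges)
toy_ranking = rank_by_degree(toy_edges)
```

Nodes whose interactions all fall below the cutoff, such as G5 in this toy input, drop out of the network entirely, which is how a network built from a gene list can contain fewer nodes than the list itself.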
Logistic Regression Model
We constructed a logistic regression model with the GSE5281 dataset as the training set and GSE4226 as an independent testing set. The samples were classified into 2 types, the control group and the AD group. DEG expression values were included as continuous predictor variables, and the sample type was regarded as the categorical response variable to establish the logistic regression model by using the glm function in R software. The variables selected by the stepwise regression method were used to reconstruct the model, and the p value of each predictor variable was calculated. The variables with p value ≤0.05 were chosen for model reconstruction and subsequent analysis.
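To make the model structure concrete, the following Python sketch fits a logistic regression with continuous predictors and a binary response. It is illustrative only. It uses plain gradient descent, whereas glm in R uses iteratively reweighted least squares, and it omits the stepwise selection and p value calculation. The function names, learning rate, iteration count, and toy data are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, x):
    """Predicted probability of the positive class; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit intercept and coefficients by gradient descent on the mean log-loss."""
    n, p = len(X), len(X[0])
    weights = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(weights, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


# Hypothetical one-gene toy data: expression value and sample type (1 = AD, 0 = control).
toy_X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
toy_y = [0, 0, 1, 0, 1, 1]
toy_weights = fit_logistic(toy_X, toy_y)
```

A model fitted on the training set would then be applied unchanged to the testing set by calling predict_proba with the same weights, so that the testing samples play no part in estimating the coefficients.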
Results
Differentially Expressed Genes
The raw data from the GEO datasets were standardized using R software, and the results showed little bias between samples in the 2 datasets, indicating that the intensity differences caused by experimental techniques had been eliminated. The standardized data of GSE5281 and GSE4226 are shown in Figure 1a, b, respectively. A total of 2,063 DEGs comprising 854 upregulated genes and 1,209 downregulated genes were obtained based on microarray analysis of GSE5281 (Fig. 1c), and the difference between AD cases and normal cases was statistically significant (Fig. 1d). In the GSE4226 dataset, 108 DEGs (Fig. 1e), including 57 upregulated and 51 downregulated genes, were identified and showed a significant difference between the disease and control groups (Fig. 1f). Fifteen overlapping DEGs between the 2 datasets were identified, namely, NLN, THRA, TPI1, PFKM, NFKBIA, TPM3, ATPIF1, ENC1, EPDR1, ACTG1, ACTB, ALDOA, XRCC6, CITED1, and PKM.
Functional Enrichment Analysis of DEGs
To explore the potential biological function of the DEGs identified above, we conducted GO analysis and KEGG pathway enrichment analysis. GO analysis indicated that the DEGs from GSE5281 were notably enriched in GO terms related to regulation of neuron projection development, and the DEGs from GSE4226 were notably enriched in GO terms such as SRP-dependent cotranslational protein targeting to the membrane. The detailed GO terms including biological process terms, cellular component terms, and molecular function terms enriched from GSE5281 and GSE4226 are listed in online suppl. Tables 1 and 2 (see www.karger.com/doi/10.1159/000518717 for all online suppl. material), among which the top 10 most significantly enriched GO terms from each dataset are displayed in Figure 2a, b, respectively.
In the case of KEGG pathway enrichment analysis, the DEGs collected from GSE5281 were enriched in pathways such as Parkinson disease (online suppl. Table 3), and the 20 most significant pathways are shown in Figure 2c. On the other hand, the enriched KEGG pathways of 108 DEGs from the GSE4226 dataset are listed in online suppl. Table 4, and the top 20 pathways including ribosome are presented in Figure 2d.
PPI Network Construction
We constructed a PPI network based on the 15 overlapping DEGs from the 2 datasets using the STRING database, screened gene-gene interactions with a minimum required interaction score >0.4, and visualized the network with Cytoscape software. As shown in Figure 3a, the PPI network consisted of 10 nodes and 19 edges, indicating 10 hub genes and 19 interaction pairs among these genes. The constructed network included a subnetwork composed of 5 genes (TPI1, ALDOA, PKM, ACTG1, and ACTB) and 10 pairwise connections. Next, we analyzed the topological structure of the PPI network using Cytoscape software, scored the hub genes, and listed the 5 hub genes with the highest scores (Fig. 3b), which were consistent with the subnetwork.
Logistic Regression Model
We next constructed a logistic regression model based on the 10 genes selected from the PPI network and adopted the GSE5281 dataset as the training set. To establish a model possessing strong explanatory power with fewer variables, we screened out 6 hub genes, namely, TPI1, ALDOA, ENC1, TPM3, THRA, and NFKBIA, from the 10 PPI interacting genes via the stepwise regression method. Subsequently, we adopted these 6 genes to reconstruct a logistic regression model (model 1).
The odds ratio (OR) value of the model indicated the correlation between the variables and the incidence of AD, with values above 1 indicating a positive correlation and values below 1 indicating a negative correlation. We identified the 3 most critical genes (ALDOA [downregulated], ENC1 [downregulated], and NFKBIA [upregulated]), which contributed the most to the model, as their p values were <0.05 in the logistic analysis. Hence, we constructed logistic regression model 2 from these 3 genes, which contained no influential points that would affect the accuracy of the model (Fig. 4a). The χ2 test indicated no significant difference between model 1 and model 2, indicating that model 2 retained explanatory power with fewer variables. Moreover, the area under the curve (AUC) values of the GSE5281 training set and GSE4226 testing set were 0.9647 and 0.7857, respectively (Fig. 4b). These data demonstrated that the logistic regression model established based on ALDOA, ENC1, and NFKBIA could be regarded as reliable for AD prediction.
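The two quantities reported here can be illustrated with a short Python sketch. It is not the code used in this study. The OR of a predictor is the exponential of its logistic regression coefficient, and the AUC equals the proportion of AD-control sample pairs in which the AD sample receives the higher predicted score, with ties counted as one half. The function names, labels, and scores are hypothetical.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a predictor of a logistic model."""
    return math.exp(coefficient)


def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = AD, 0 = control) and predicted probabilities.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

Under this pairwise reading, a testing set of 14 AD and 14 control samples yields 196 pairs, and an AUC of 0.7857 is numerically consistent with 154 of those pairs being ranked correctly (154/196 ≈ 0.7857).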
Discussion
Over the past century, research on AD has produced increasingly effective diagnostic methods, including brain imaging techniques, assessment of cognitive level, and fluid biomarkers [15, 16]. Yet, the specific mechanisms of AD development are still unclear. Furthermore, the overlap with symptoms of other neuropathological diseases makes accurate early clinical diagnosis of AD nearly impossible. Hence, the identification of critical diagnostic and prognostic biomarkers for AD remains urgent. In the present study, via integrated bioinformatic analysis of AD-related data in the GEO database, we constructed a reliable logistic regression model based on 3 genes (ALDOA, ENC1, and NFKBIA), which could separate the AD samples from the healthy samples.
The rapid development of bioinformatics is providing powerful evidence for the diagnosis of diseases, including AD. In our study, we performed bioinformatic analysis on brain tissue samples and blood samples collected from 2 GEO datasets, GSE5281 and GSE4226, to identify early critical factors associated with AD. We first obtained DEGs in these datasets and performed GO and KEGG pathway enrichment. GO enrichment disclosed critical biological processes, cellular components, and molecular functions associated with the DEGs in the GSE5281 and GSE4226 datasets. It is notable that the most enriched terms include the regulation of neuron projection development, vesicle localization, establishment of vesicle localization, cotranslational protein targeting to membrane, and protein targeting to endoplasmic reticulum (ER). These results indicated that the alteration of neurofunction [29], protein localization [30], and ER signal transduction [30] possibly participates in the development of AD. The enriched terms related to neurofunction are consistent with the nature of AD as a neurodegenerative disease. In addition, it was reported that Tau, an important regulator protein in AD, could bind to vesicles and contribute to presynaptic dysfunction [31]. Besides, the misfolding of proteins causing ER stress in neurons is one common feature of AD [32, 33]. Hence, in-depth studies on neurofunction, vesicle localization, and ER stress may be promising areas for the diagnosis and treatment of AD.
We next constructed a PPI network based on the 15 overlapping genes of the 2 datasets and subsequently obtained 10 hub genes. Stepwise regression then screened out 6 of the 10 hub genes, namely, TPM3, THRA, NFKBIA, ENC1, TPI1, and ALDOA. The logistic regression model based on the 6 genes further identified ALDOA, ENC1, and NFKBIA as the genes that contributed the most. The activated NF-κB pathway and the upstream regulator NFKBIA were previously revealed by bioinformatic analysis as a potential biomarker in late-onset AD [34]. Ectoderm neural cortex-1 (ENC1) was first identified as an early and highly specific marker of neural induction in vertebrates and showed extensive colocalization with the actin cytoskeleton [35]. Recent studies confirmed the role of ENC1 in the dissociation of cognition and neuropathology in the aging population, which is related to the pathogenesis of AD [36]. Moreover, ENC1 was also reported to be associated with ER stress, which is consistent with our GO and KEGG analysis [37]. As for ALDOA, the fructose-bisphosphate aldolase A, numerous studies have indicated its role in cytoskeletal development and in cancer; however, ALDOA-related research in AD is limited. A recent study revealed that ALDOA, together with several other genes, could serve as a potential biomarker for AD [38]. Moreover, ALDOA was reported to be probably involved in the immune-related phenomena in AD [39]. Besides, a previous bioinformatic study on an alcohol drinking-exacerbated AD model indicated that ALDOA is related to AD genesis [40]. These previous reports supported our results of ALDOA, ENC1, and NFKBIA as hub genes in the AD regulatory network. We then used these genes to establish a more effective logistic regression model, which exhibited favorable outcomes when tested on the GEO datasets GSE5281 and GSE4226, with area under the curve values of 0.9647 and 0.7857, respectively, supporting the applicability of this model for the identification of AD.
Conclusion
To summarize, our comprehensive analysis on AD data from GEO datasets revealed ALDOA, ENC1, and NFKBIA as the potential diagnostic biomarkers. The constructed logistic regression model based on these genes could effectively distinguish AD patients from healthy people. Our findings may shed light on the early diagnosis and treatment for AD.
Statement of Ethics
Statement of ethics is not applicable.
Conflict of Interest Statement
The authors of this work have nothing to disclose.
Funding Sources
Our research was supported by the National Natural Science Foundation of China (No. 81973600).
Author Contributions
Huimin Wang and Yanqiu Zhang contributed to conceptualization, software, data curation, and writing – original draft. Chengyao Zheng, Songqi Yang, and Xiuju Chen contributed to validation and writing – original draft. Heng Wang contributed to writing – reviewing and editing and project administration.
Data Availability Statement
All data generated or analyzed during this study are included in this article and its online suppl. material files. Further enquiries can be directed to the corresponding author.
References
Additional information
Huimin Wang and Yanqiu Zhang contributed equally to this work.