Introduction: Metabolomics could offer novel prognostic biomarkers and elucidate mechanisms of diabetic kidney disease (DKD) progression. Via metabolomic analysis of urine samples from 995 CRIC participants with diabetes and state-of-the-art statistical modeling, we aimed to identify metabolites prognostic to DKD progression. Methods: Urine samples (N = 995) were assayed for relative metabolite abundance by untargeted flow-injection mass spectrometry, and stringent statistical criteria were used to eliminate noisy compounds, resulting in 698 annotated metabolite ions. Utilizing the 698 metabolites’ ion abundance along with clinical data (demographics, blood pressure, HbA1c, eGFR, and albuminuria), we developed univariate and multivariate models for the eGFR slope using penalized (lasso) and random forest models. Final models were tested on time-to-ESKD (end-stage kidney disease) via cross-validated C-statistics. We also conducted pathway enrichment analysis and a targeted analysis of a subset of metabolites. Results: Six eGFR slope models selected 9–30 variables. In the adjusted ESKD model with highest C-statistic, valine (or betaine) and 3-(4-methyl-3-pentenyl)thiophene were associated (p < 0.05) with 44% and 65% higher hazard of ESKD per doubling of metabolite abundance, respectively. Also, 13 (of 15) prognostic amino acids, including valine and betaine, were confirmed in the targeted analysis. Enrichment analysis revealed pathways implicated in kidney and cardiometabolic disease. Conclusions: Using the diverse CRIC sample, a high-throughput untargeted assay, followed by targeted analysis, and rigorous statistical analysis to reduce false discovery, we identified several novel metabolites implicated in DKD progression. If replicated in independent cohorts, our findings could inform risk stratification and treatment strategies for patients with DKD.
In the USA, 25% of diabetic patients have chronic kidney disease, a major precursor to kidney failure . Nearly half of the patients with kidney failure have diabetes . Thus, it is important to identify DKD progression risk factors, so that kidney disease can be detected and treated early. While clinical factors, such as microalbuminuria, are generally prognostic for DKD, there is still large variability in DKD risk for patients with similar risk profiles [3-5], and furthermore, clinical risk factors do not offer molecular insights into disease pathology. Thus, there is a need to explore new biomarkers which may add prognostic power and more importantly identify biological mechanisms of DKD .
Metabolites, low-weight intermediates and end products of cellular pathways, could uncover physiological or pathological changes in chronic diseases . Recent research emphasizes the importance of and need for further research on the human urine metabolome in kidney disease [8, 9].
Previous urine metabolome in DKD studies has largely been cross-sectional and aimed at identifying metabolites distinguishing DKD from healthy or suitable comparison controls [10-12]. Studies examining metabolomics for DKD progression contained several limitations: evaluating specific compounds, serum/plasma only assays, restriction to type I diabetes or specific subpopulations, small sample size, or short follow-up [13-17]. Highlighting recent work, the branched-chain amino acid pathway in type 1 diabetes  and lipogenesis in type 2 diabetes among American Indians  have been linked to DKD progression. Herein, we aimed to expand on prior works by using an extensive panel of urine metabolites in a large, diverse longitudinal cohort of 995 patients with DKD and had a median 8 years of follow-up. Previously , we evaluated 13 a priori metabolites and their association with DKD progression. Here, using a high-throughput untargeted platform, we examined relative abundance of 698 annotated metabolite ions and implemented machine learning and rigorous statistical approaches to identify markers of DKD progression, rate of annual eGFR change, and ESKD. Further, we conducted a targeted metabolomics analysis on a subset of 15 amino acids. By implementing a robust multifaceted statistical approach, two metabolomic assays, and two clinically important outcomes, we aimed to identify potentially multiple sets of novel features prognostic for DKD progression and/or elucidate biological pathways of DKD.
We used a metabolomics substudy of the Chronic Renal Insufficiency Cohort (CRIC). The parent CRIC study recruited (from 2003 on) a racially diverse group aged 21–74 years, ∼50% diabetic, and with a broad range of kidney function . Informed consent was obtained from participants; protocols were approved by the IRBs and Scientific and Data Coordinating Center (approval # 807882). The current study analyzed the urine metabolome at study entry (baseline) of 995 randomly selected CRIC participants with diabetes across CKD stages 3a, 3b, and 4, eGFR 45–60, 30–45, and 20–30 mL/min/1.73 m2, respectively.
We evaluated two outcomes: annual rate of eGFR change (eGFR slope) and time-to-ESKD. The eGFR slope was estimated via mixed models using serial eGFR measures as described previously . Time-to-ESKD was the time from entry to the CRIC study to incident kidney failure with need for renal replacement therapy or kidney transplantation; drop-out or death before kidney failure was considered censoring events.
Overview of Analytic Steps (Fig. 1)
Details of the assay, sample processing, and feature extraction are given in online supplementary materials (for all online suppl. material, see www.karger.com/doi/10.1159/000521940). We implemented stringent filtering to exclude metabolite ions with high noise or low biological variability. We then tested single metabolite associations with DKD outcomes, correcting for multiple comparisons, with and without clinical variables adjustment. Next, we developed multivariate metabolite models, using statistical and machine learning methods to identify marker signatures associated with eGFR slope, and further tested these models on time-to-ESKD. A unique aspect of our approach is that we did not train models on the ESKD outcome; thus, ESKD results serve as “internal replication” for a long-term clinical outcome. Finally, we conducted pathway enrichment analyses to investigate biological underpinnings of selected metabolites and conducted a targeted analysis of a subset of metabolites.
Filtering Metabolite Ions
Leveraging technical replicates, we used QC and CRIC samples to eliminate ions with poor reliability in the untargeted analysis (online suppl. materials). Of 1,899 annotated metabolite ions, the 698 which passed filtering criteria are the final metabolite ion set for all subsequent analyses. A single ion could annotate multiple metabolites; we will clarify these resulting ambiguities as appropriate.
Associations of Single Metabolite with Outcomes
We tested associations between each of the 698 log2-transformed metabolite ions and outcomes, with and without adjustment for clinical variables: age, gender, race, smoking, baseline BMI, mean arterial pressure, HbA1c, eGFR, and albuminuria . For the eGFR slope, we calculated Pearson correlations and fit linear models; for ESKD, we fit Cox models. We used the Benjamini-Hochberg false discovery rate to correct for multiple comparisons .
Multivariate Models for the eGFR Slope
Using eGFR slopes as outcome, penalized regression (via lasso) and machine learning (via random forest) models were developed to elicit multivariate prognostic metabolomic signatures. The lasso reduces overfitting by imposing a penalty (λ) . We considered two λ values chosen by 10-fold cross-validation: λ. min, the value yielding the lowest prediction error; and λ.1se, the value within one SD of lowest prediction error. Four lasso models were fit; each included 698 ions and 9 clinical variables as covariates. Two models forced all 9 clinical variables to be included, utilizing either λ.min or λ.1se. The other two models did not force the clinical variables to be included. For random forests, we used percent increase of mean squared error to order variable importance ; for comparability, we selected the same number of variables as the corresponding lasso models (without forcing clinical variables). Thus, we fit six multiple metabolite ion models: four lasso and two random forest models, each of which selected an optimal predictor set. As a final sensitivity analysis, we also fit four elastic net models , which can select groups of correlated features and may better mimic biological pathways.
Internal Replication on ESKD Outcome
To evaluate the models on time-to-ESKD, we fit six Cox models, in which predictors were variables selected in the six eGFR slope outcome analyses. We used likelihood ratio tests to compare each Cox model to its corresponding model of only clinical variables. To quantify model performance, we used 5-fold cross validation repeated 100 times to estimate mean and 95% CI of the C-statistic. Predictors used in these six Cox models were the predictors selected in the corresponding eGFR slope models; no tuning or variable selection was used in the Cox models. We intentionally avoided further training on ESKD to assess if predictors of the eGFR slope were also predictive of long-term outcomes (i.e., ESKD).
Definitions of well-known biological pathways by their respective involved metabolites were obtained from HMDB . We considered 743 pathways and performed a hypergeometric test for each pathway definition via
where M = # of measured metabolite ions, K = # of measured compounds in the pathway definition, N = # of “hits,” i.e., selected/prognostic metabolite ions, and x = # of “hits” that mapped to the pathway. The p values were corrected for multiple testing using the Benjamini-Hochberg method.
Using a targeted assay, we validated selection and annotation of 15 metabolites detected by untargeted analysis, which were prognostic in the ESKD analysis. Again, we fit Cox models, adjusted for 9 clinical variables. To compare untargeted and targeted analyses, we present standardized hazard ratios (HRs).
At entry (Table 1), participants (n = 995) were mean 59.9 years, 56% male, 44% white, and 42% black; on average (mean [SD]), they were obese (BMI 34.2 [7.9]), had poor diabetes control (HbA1c 7.6 [1.5]%), and had moderate-to-poor kidney function (eGFR 40.6 [11.2] mL/min/1.73 m2). The eGFR decline (slope) averaged 1.8 (SD = 1.9) mL/min/1.73 m2/year; 36% (N = 360) had ESKD during the 10-year study (range: 2–10 years). We excluded subjects with missing clinical variables (<2%).
Associations between Single Metabolites and Disease Outcomes
Of 698 ions, 89 were significantly correlated with eGFR slopes without adjustment for clinical variables (false discovery rate corrected p < 0.05). In adjusted analyses, 6 ions remained significant with β-coefficients from −0.45 to 0.3 (online suppl. Table S1). Also, 123 (unadjusted) and 99 (adjusted) ions were significantly associated with ESKD in Cox regression. After adjustment, HRs ranged from 1.12 to 1.84 (online suppl. Table S2). The prognostic ion set contained several amino acids (valine, isoleucine, and tryptophan) and other compounds, e.g., hydroxybutanoic acid (online suppl. Table S2). Only one ion, annotated as 3-(4-methyl-3-pentenyl)thiophene (Ion Index 1098, online suppl. Tables S1, S2), was associated with both eGFR slope and ESKD in adjusted models with coefficients (95% CI) of −0.44 (−0.68, −0.21) for the eGFR slope and HR (95% CI) 1.84 (1.45, 2.32), indicating that higher abundance of this compound might be associated with worse DKD progression. Complete results are in online supplementary Tables S1, S2. Henceforth, the 99 ESKD-associated metabolites will be denoted the 99-ESKD-associated set.
Multivariable Prognostic Metabolites for eGFR Slope Outcome
Each of the lasso or random forest models (online suppl. Table S3) selected 9–30 variables resulting in 49 (out of 698) ions across the 6 prognostic models (online suppl. Table S4), denoting the 49-eGFR-associated set. Baseline albuminuria, blood pressure, and HbA1c were selected in all 6 models, and unsurprisingly, higher levels of these clinical markers were associated with steeper eGFR decline; race was also selected in all 6 models; 3,4-dicaffeoyl-1,5-quinolactone was selected in all models except model 1 (clinical only model). Nine ions, annotated as 3,4-dicaffeoyl-1,5-quinolactone, butynal, 3-(4-methyl-3-pentenyl)thiophene, C10:3, zalcitibine, asparaginyl-hydroxyproline, valine (or betaine), argynil-glutamine, and pentose (or di-isopropyl disulfide), were selected in at least 3 models; an additional 13 ions were selected in at least 2 models (online suppl. Table S4). Elastic net models had similar C-statistics (data not shown), so the more parsimonious lasso models were retained.
The final selected features from each of the 6 eGFR slope models (online suppl. Table S3) were tested on the ESKD outcome in Cox models. The likelihood ratio test p values were < 0.0001 when each of the 6 models was compared to the corresponding model with only clinical features, i.e., adding metabolite ions improved model fit significantly. Several ESKD models had similar median C-statistics (online suppl. Fig. S1), ranging from 0.82 to 0.85. The best (model 2, online suppl. Table S3) with 29 variables had a cross-validated median (95% CI) C-statistic of 0.85 (0.85, 0.86). This model selected 20 metabolite ions via lasso; 14 were significantly associated with the eGFR slope as evidenced by their bootstrap 95% CIs which excluded 0 (Table 2). Of greater interest are the five ions significantly (5% level) associated with time-to-ESKD. Higher abundance of valine (or betaine) and 3-(4-methyl-3-pentenyl)thiophene was each associated with increased risk of ESKD (HR 1.44 and 1.65, respectively); higher asparaginyl-hydroxyproline abundance was associated with lower ESKD risk (HR = 0.7). Two other significantly associated compounds were pipazethate and aminophylline. Importantly, since our lasso model was not trained on ESKD outcome, we expect the Cox model results to be valid and not influenced by model selection. We also ran sensitivity analysis including ACE inhibitor or ARB use in Cox regression for model 2. There were negligible changes in the Cox model coefficients for the 29 variables in model 2; the HR (95% CI) for ACE inhibitor or ARB use was 1.22 (0.92, 1.63, p = 0.17), and this was not statistically significant at 5% significance level.
We conducted pathway enrichment based on the 49-eGFR-associated (online suppl. Tables S3, S4), the 99-ESKD-associated (online suppl. Table S2), and the combined 49- and 99-sets (=131 ions). Thus, in equation (1), M = 698; N = 49 or 99 or 131; K varied by pathway; and x = # of ions in a pathway and also in the prognostic 49-, 99-, or 131-set.
Pathways (online suppl. Fig. S2; Table S5) consistently and significantly enriched across both eGFR slope and ESKD models were enzyme deficiencies, acidurias and acidemias, and those related to amino acid metabolism. Of interest, the valine-leucine-isoleucine degradation pathway involved in insulin resistance, cardiometabolic risk, cardiomyopathy, and CKD [26, 27]; the 2-aminoadipic 2-oxoadipic aciduria and lysine degradation pathways implicated in diabetes and kidney disease [28, 29]; and the tryptophan pathway [30-32] was enriched in the 99-ESKD and combined 131-ion sets (online suppl. Fig. S2; Table S5).
Validation by Targeted CE-MS Approach
In the targeted validation analysis, 13 of 15 metabolites were significantly associated with ESKD (Fig. 2). HR (95% CI) in the untargeted and targeted analysis was comparable.
Conclusions and Discussion
High-throughput metabolomics offers great potential, yet the scale and measurement errors raise analytic and inferential challenges. To address these challenges, we implemented two steps: (1) identifying a sparse prognostic set of metabolites after adjusting for known clinical factors and (2) using pathway analysis to elicit biological meaning. First, using an untargeted assay, we tested single metabolite associations with eGFR slope and ESKD. After adjusting for clinical variables, 6 and 99 (online suppl. Tables S1, S2) metabolite ions were found to exhibit non-null associations with eGFR slope and ESKD, respectively. An interesting finding is that for the eGFR slope, adjusting for clinical factors greatly reduced the number of non-null associations (from 89 to 6 ions), whereas for ESKD, the number of prognostic metabolites was relatively similar with and without adjustment (123 unadjusted to 99 adjusted), suggesting that clinical factors may be more critical for proximal (e.g., eGFR slope) versus long-term (e.g., ESKD) outcomes. Of note, differences in prognostic factors for these outcomes have also been previously reported .
As a next step, we conducted a quantitative targeted assay of 15 amino acid metabolism metabolites from the 99 (untargeted) ions and validated the prognostic value of 13 (of 15). Amino acid metabolism is known to be associated with DKD in cross-sectional analysis , and in a few prospective studies [7, 11, 34], but these studies have generally had small sample sizes (<100) and/or evaluated specific subgroups (e.g., type I diabetes). Thus, our finding of the prognostic value of urine amino acids in the large CRIC study even after adjustment for clinical factors adds to the literature and suggests that additional amino aciduria, independent of albuminuria, may be a risk factor for DKD progression.
In multivariable analysis, we fit 6 multiple metabolite models (adjusted for clinical factors) using robust penalized regression and machine learning. These modeling approaches are ideally suited to BigData with a large number of possibly correlated predictors. We identified sets of 6–25 prognostic metabolite ions (online suppl. Tables 2, S3). Nine unique ions were selected in ≥3 models. Two of these ions, asparginyl-hydroxyproline, implicated in iron deficiency and Crohn’s disease , and arginyl glutamine, a dipeptide shown to be helpful for treating hypoxia-induced small intestine injury , are likely novel findings in DKD research. In addition, we identified valine, a glucogenic branched chain amino acid, and C10:3, an acyl-carnitine whose blood levels were associated with heart failure and maternal and newborn metabolic health [37, 38]. The rest were food or drug derivatives [39, 40], e.g., 3-(4-methyl-3-pentenyl)thiophene found in alcoholic beverages. For ESKD, the 6 models had similar C-statistics (range 0.82–0.85), comparable to the 0.84 C-statistic of the clinical variables-only model, i.e., adding metabolites did not improve model discrimination. The fact that multiple models had similar discrimination is not surprising given the complexity of DKD. It is likely that there are many biological pathways in DKD, and thus potentially multiple “equally” prognostic models encompassing metabolite subsets from each of these pathways. Importantly, in the multiple adjusted Cox model with highest C-statistic, we identified several interesting prognostic (p < 0.05) metabolite ions. Valine (or betaine) and 3-(4-methyl-3-pentenyl)thiophene were associated with 44% and 65% higher hazard of ESKD, respectively, per doubling of ion abundance. Importantly, the Cox model analysis, albeit on the same cohort, did not involve any training or variable selection, and hence HRs are likely less biased.
To systematically leverage the biological content of prognostic metabolite ions, we conducted enrichment analysis, using pathway, disease, and function definitions from the HMDB. We focused on metabolic pathway enrichment, since pathway definitions are less prone to curation anomalies, more finite, and well defined compared to disease and function definitions. We identified branched-chain amino acid pathways, namely, valine-leucine-isoleucine degradation, isovaleric acidemia/aciduria, propionic acidemia, and methylmalonic acidemia, all implicated in CKD [18, 26-29]. The tryptophan pathway was previously investigated in DKD primarily in serum/plasma samples [30-32] though findings were mixed. Our study found that higher levels of tryptophan were associated with higher ESKD risk in both untargeted and targeted analyses, and the tryptophan pathway was significantly enriched in our set of ESKD-related metabolites. To our knowledge, these findings are novel, and further investigation of urine tryptophan is warranted.
Our study fills several gaps in current DKD research. Previous urine metabolomics studies were cross-sectional, had small sample sizes, examined few single compounds, or have been restricted to subpopulations. In this work, we addressed these shortcomings. Our CRIC DKD sample is one of the largest in the USA, with comprehensive clinical data and long-term follow-up with annual assessments of kidney function. We conducted untargeted high-throughput metabolomics which identified 1,899 annotated ions, used stringent filtering to reduce to a reliable subset of 698 metabolite ions, and used rigorous paradigms to build cross-validated models to select metabolite ions prognostic for eGFR decline. The selected ions were then tested for association with ESKD without further training. While it could be argued that prognostic predictor sets may vary between the eGFR slope and ESKD outcomes , our approach of replicating findings on the ESKD outcome, rather than fitting separate ESKD models, is more statistically stringent. Of note, models were not trained on the ESKD outcome, which should reduce overfitting. Also, identifying predictors of short-term outcomes which also foretell long-term outcomes could be more impactful in the clinic for planning long-term disease management. However, we acknowledge that focusing feature selection solely on the ESKD outcome could be of value, if results could be validated, and we aim to pursue this line in future work in independent cohorts. Finally, we used a targeted CE-MS assay to validate and quantify a subset of “hits” identified in our untargeted analysis, illustrating the value of an untargeted high-throughput metabolomics approach despite the sample complexity of human urine.
There are limitations to our work. First, although we adjusted for known clinical factors, residual confounding and measurement error could still impact our findings. Results may also vary by clinical subgroups; we leave such investigation to future studies. Second, while our enrichment analysis revealed potential DKD-related pathways, our selected metabolites were also enriched for deficiencies associated with other diseases. Thus, the systematic pathway enrichment may not be specific to DKD, a limitation of pathway analyses when a given pathway could overlap multiple diseases. Finally, while we followed strict protocols for biospecimen storage, processing, and quality control, the age of our archived samples might also introduce errors. Also, while we validated a subset of metabolites with existing targeted approaches, it is necessary to develop quantitative assays to compare with our untargeted findings . We are developing targeted assays for other prognostic markers for future validation. In addition, confirming (or refuting) our findings in independent cohorts using different assays will enhance generalizability.
In summary, we conducted large-scale untargeted and targeted metabolomics analysis using a diverse cohort and identified several metabolites implicated in DKD. We used robust statistical models incorporating rigorous techniques to reduce overfitting: cross-validation for feature selection, replicating results on a long-term clinical outcome, and validation using a targeted assay. While we highlighted and validated a few prognostic metabolites and related pathways, the most important aspect of our work is that we identified 131 potentially prognostic metabolites in our untargeted analysis, several of which are not addressed before. We hope that providing this list will spur further investigation of these promising metabolites and enhance understanding of DKD, eventually leading to better treatments and disease management.
We thank Lisa E. Wesby, MS, for assistance as the CRIC project manager in managing and facilitating the manuscript among all the authors. We thank the remaining CRIC co-investigators for providing the current data set and their continued efforts with this important study: Lawrence J. Appel, MD, MPH; Jing Chen, MD, MMSc, MSc; Alan S. Go, MD; James P. Lash, MD; Robert G. Nelson, MD, PhD, MS; Mahboob Rahman, MD; Vallabh O Shah, PhD, MS; Raymond R. Townsend, MD; Mark L. Unruh, MD, MS.
Statement of Ethics
This study protocol was reviewed and approved by the IRBs and Scientific and Data Coordinating Center (approval # 807882). Informed consent was obtained from participants.
Conflict of Interest Statement
Uwe Sauer is a Cofounder of Metabolic Concepts Ltd., Boston, MA, USA. Kumar Sharma has a startup company, SygnaMap, that analyzes data generated from mass spec imaging of tissue sections for understanding solid organ diseases. Dr. Feldman reports consulting for Kyowa Hakko Kirin Co., Ltd., serving as the Editor-in-Chief of the American Journal of Kidney Diseases (AJKD), honoraria for speaking from InMed Physicians, and consulting with DLA Piper LLP. All other authors have nothing to disclose.
Funding for the CRIC study was obtained under a cooperative agreement from the National Institute of Diabetes and Digestive and Kidney Diseases (U01DK060990, U01DK060984, U01DK061022, U01DK061021, U01DK061028, U01DK060980, U01DK060963, U01DK060902, and U24DK060990). In addition, this work was supported in part by the Perelman School of Medicine at the University of Pennsylvania Clinical and Translational Science Award NIH/NCATS UL1TR000003, Johns Hopkins University UL1 TR-000424, University of Maryland GCRC M01 RR-16500, Clinical and Translational Science Collaborative of Cleveland, UL1TR000439 from the National Center for Advancing Translational Sciences (NCATS) component of the National Institutes of Health and NIH roadmap for Medical Research, Michigan Institute for Clinical and Health Research (MICHR) UL1TR000433, University of Illinois at Chicago CTSA UL1RR029879, Tulane COBRE for Clinical and Translational Research in Cardiometabolic Diseases P20 GM109036, Kaiser Permanente NIH/NCRR UCSF-CTSI UL1 RR-024131, Department of Internal Medicine, University of New Mexico School of Medicine Albuquerque, NM R01DK119199. Mr. Kwan, Dr. Fuhrer, Ms. Zhang, and Drs. Darshi, Montemayor, Sharma, and Natarajan were partially supported by NIDDK 1R01DK110541-01A1. Ms. Zhang and Drs. Darshi, Montemayor, Natarajan, and Sharma were also partially supported by DP3DK094352. Dr. Afshinnia is funded by K08-DK106523 and R03-DK121941. Dr. Anderson is supported by P20GM109036, U01DK060963, R01DK104730, and R01DK107566. Mr. Kwan was also supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1650112. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
J.Z., L.N., K.S., T.F., D.M., H.P.Y, M.D., and B.K. conceptualized the study design. T.F., D.M., J.Z., and B.K. contributed to data curation. T.F., D.M., U.S., and K.S. contributed to data acquisition. J.Z. conducted study analyses. J.Z. and L.N. drafted the manuscript. L.N. and K.S. contributed to funding acquisition and provided oversight to the study analyses. All authors contributed to data interpretation, critically reviewed and edited the manuscript for important intellectual content, and approved the final version. J.Z. is the guarantor of this work, as such, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Data Availability Statement
The data that support the findings of this study are available on request from the CRIC study at cristudy.org upon reasonable request.
Sharma and Natarajan are co-senior authors.