Introduction: Penalized regression models can be used to identify and rank risk factors for poor quality of life or other outcomes. They often assume linear covariate associations, but the true associations may be nonlinear. There is no standard, automated method for determining optimal functional forms (shapes of relationships) between predictors and the outcome in high-dimensional data settings.

Methods: We propose a novel algorithm, ridge regression for functional form identification of continuous predictors (RIPR), that models each continuous covariate with linear, quadratic, quartile, and cubic spline basis components in a ridge regression model to capture potential nonlinear relationships between continuous predictors and outcomes. We used a simulation study to test the performance of RIPR compared to standard and spline ridge regression models. Then, we applied RIPR to identify top predictors of Patient-Reported Outcomes Measurement Information System (PROMIS) adult global mental and physical health scores using demographic and clinical characteristics among N = 107 glomerular disease patients enrolled in the Nephrotic Syndrome Study Network (NEPTUNE).

Results: RIPR resulted in better predictive accuracy than the standard and spline ridge regression methods in 56–80% of simulation repetitions under a variety of data characteristics. When applied to PROMIS scores in NEPTUNE, RIPR resulted in the lowest error for predicting physical scores, and the second-lowest error for mental scores. Further, RIPR identified hemoglobin quartiles as an important predictor of physical health that was missed by the other models.

Conclusion: The RIPR algorithm can capture nonlinear functional forms of predictors that are missed by standard ridge regression models. The top predictors of PROMIS scores vary greatly across methods. RIPR should be considered alongside other machine learning models in the prediction of patient-reported outcomes and other continuous outcomes.

Patients with glomerular disease have heterogeneous disease activity, progression, and quality of life. Factors associated with higher quality of life for adults with glomerular disease include complete proteinuria remission, reduction in symptom burden, shorter disease duration, a lack of edema, and better socioeconomic status [1‒5]. While previous studies focused on a limited set of pre-specified predictors of interest, the investigation of a more comprehensive list of potential predictors could uncover novel risk factors for poor quality of life.

With a large number of potential predictors and a relatively small sample size – e.g., when analyzing a rare disease cohort and/or after stratifications by pediatric and adult subpopulations – data are high-dimensional and require advanced statistical methods for analysis. Machine learning algorithms, such as penalized regression models, are designed specifically for high-dimensional data analysis and can allow for consideration of many factors simultaneously. To optimize the predictive ability of penalized regression models, it is crucial to identify the most likely functional relationships (e.g., linear, U-shaped, or other nonlinear forms) between predictors and the outcome. However, manually selecting the functional form for each predictor may not be feasible for large numbers of predictors. Several multivariable regression techniques have been developed to automatically assess which predictors should be fit with linear or nonlinear components [6‒9]. However, all of these methods allow only one functional form per predictor, and when nonlinear functional relationships are modeled, only spline terms are considered. Therefore, if the true functional forms are nonlinear but not well represented by splines, or are combinations of multiple nonlinear functional forms, then these methods will fail to capture the true associations between the predictors and the outcome. We aim to accurately predict clinical outcomes and identify novel risk factors that may have nonlinear associations with outcomes.

To address the shortcomings of existing methods, we propose a ridge regression for functional form identification of continuous predictors (RIPR) algorithm. For each continuous predictor, the RIPR method considers linear, quadratic, quartile, and cubic spline functional form representations. We first tested the performance of the RIPR algorithm using a Monte Carlo simulation study. We then applied the RIPR method to data from the Nephrotic Syndrome Study Network (NEPTUNE) to identify novel risk factors and their functional relationships with poor health-related quality of life (HRQOL) among glomerular disease patients.

Study Sample

The Nephrotic Syndrome Study Network (NEPTUNE) is a multisite observational cohort study of patients with glomerular diseases [10]. Participants were enrolled at the time of their first clinically indicated biopsy and are followed every 4–6 months for at least 36 months. Data were collected on participant demographics, family medical history, clinical symptoms, and laboratory values. For this cross-sectional study, we included adult NEPTUNE participants who were at least 18 years old at study enrollment and had baseline Patient-Reported Outcomes Measurement Information System (PROMIS) adult global mental and physical health scores at enrollment. This study was considered exempt from Institutional Review Board (IRB) review by the University of Pennsylvania IRB (IRB Protocol #849205).

Independent Variables and Outcomes

All demographic and clinical characteristics from the NEPTUNE dataset that had fewer than 55% missing values (blank, unknown, or unanswered) were included as independent variables. The 55% threshold was chosen based on the distribution of missingness across predictors, which gradually increased from 0 to 51% before sharply jumping to 83%. A complete list of these variables and their descriptive statistics pre- and post-imputation can be found in the online supplementary materials section 1 (for all online suppl. material, see www.karger.com/doi/10.1159/000528847). We also conducted sensitivity analyses that only included predictors with fewer than 30% missing values. For any remaining missing values within these predictors, we applied a nonparametric imputation method, missForest, that can handle both continuous and categorical variables with nonlinear functional relationships with outcomes [11‒13]. Because we model multiple functional forms per continuous covariate in the RIPR and spline ridge regression methods, the relationships between predictors and the outcome are likely very complex and not easily described by a single parametric model. As a random forest-based approach, missForest can effectively handle complex dependencies including nonlinear and interactive effects, does not depend on stringent parametric assumptions or tuning parameters, and is well suited to the high-dimensional settings that motivated the development of RIPR.
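The missingness screen described above can be illustrated with a minimal sketch. The variable names and values are hypothetical toy data, not NEPTUNE data, and the set of tokens treated as missing is an assumption based on the "blank, unknown, or unanswered" wording.

```python
def missing_fraction(values):
    """Fraction of entries counted as missing (None, blank, 'unknown', 'unanswered')."""
    missing_tokens = {"", "unknown", "unanswered"}  # assumed encoding of missing answers
    n_missing = sum(
        1 for v in values
        if v is None or (isinstance(v, str) and v.strip().lower() in missing_tokens)
    )
    return n_missing / len(values)


def select_predictors(columns, threshold=0.55):
    """Keep columns whose missing fraction is strictly below the threshold."""
    return [name for name, values in columns.items()
            if missing_fraction(values) < threshold]


# Toy data with hypothetical variable names (10 participants each).
columns = {
    "hemoglobin": [13.1, None, 12.4, 14.0, 11.8, 13.3, None, 12.9, 13.6, 12.2],
    "calcium": [9.1, 8.8, None, None, None, 9.4, None, None, 9.0, 8.7],
    "rare_lab": [None, None, None, "unknown", None, 1.2, None, None, "", None],
}
kept = select_predictors(columns)          # main analysis threshold
kept_strict = select_predictors(columns, threshold=0.30)  # sensitivity analysis threshold
print(kept, kept_strict)
```

Lowering the threshold to 0.30 mirrors the sensitivity analysis: fewer predictors pass the screen, so less imputation is needed afterwards.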

The patient-reported outcomes (PROs) for assessing HRQOL for this study were adult global mental and physical health scores from the PROMIS questionnaire [14]. For both adult global mental and physical health scores, a higher score indicates better quality of life, whereas a lower score indicates worse quality of life.

Statistical Analyses

RIPR Algorithm

To predict PROMIS physical and mental scores, we applied our proposed RIPR algorithm, which consists of two steps. In the first step, we created a new set of predictors that contains linear, quadratic, quartile, and cubic spline representations of each continuous covariate, alongside the original categorical covariates in our dataset. In the second step, we fit a ridge regression model [15‒17] to this new set of predictors using 10-fold cross-validation. Because each continuous covariate is modeled with multiple functional forms, many of the resulting terms will be highly correlated, which can cause L1-penalization-based methods such as the LASSO and elastic net to select inconsistently among correlated predictors that are useful for predicting the outcome. Ridge regression also ensures that no coefficients in the model are zeroed out, allowing for visualization and interpretation of functional forms represented by multiple terms. For complete descriptions of how the new set of predictors is created and the motivation behind the design of the RIPR algorithm, see online supplementary materials sections 2 and 3, respectively.
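The two steps can be sketched in Python with NumPy as follows. This is an illustration under stated assumptions, not the authors' R implementation: the spline is a truncated power basis with knots at the sample quartiles, the lowest quartile is the reference category, columns are scaled once on the full data for brevity, and the penalty grid is arbitrary. The function names (`expand_continuous`, `ridge_fit`, `cv_mse`) are hypothetical.

```python
import numpy as np


def expand_continuous(x, knots=None):
    """Linear, quadratic, quartile-indicator, and cubic spline columns for one predictor."""
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.quantile(x, [0.25, 0.5, 0.75])
    if knots is None:
        knots = (q1, q2, q3)  # assumed knot placement
    # Quartile membership as indicators; the lowest quartile is the reference.
    quartiles = np.column_stack(
        [(x > q1) & (x <= q2), (x > q2) & (x <= q3), x > q3]
    ).astype(float)
    # Truncated power cubic spline terms: x^3 plus (x - k)^3_+ for each knot.
    spline = np.column_stack([x ** 3] + [np.clip(x - k, 0, None) ** 3 for k in knots])
    return np.column_stack([x, x ** 2, quartiles, spline])


def standardize(X):
    """Scale columns to mean 0 and SD 1 so the ridge penalty treats them alike."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd[sd == 0] = 1.0
    return (X - mu) / sd


def ridge_fit(X, y, lam):
    """Closed-form ridge fit with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return y_mean - x_mean @ beta, beta


def cv_mse(X, y, lam, k=10, seed=0):
    """Mean squared prediction error averaged over k cross-validation folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        intercept, beta = ridge_fit(X[train_idx], y[train_idx], lam)
        pred = intercept + X[test_idx] @ beta
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))


# Toy example: a U-shaped association that a linear term alone cannot capture.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=100)
y = x ** 2 + rng.normal(0, 0.5, size=100)

lambdas = [0.01, 0.1, 1.0, 10.0]  # arbitrary illustrative grid
X_linear = standardize(x.reshape(-1, 1))
X_expanded = standardize(expand_continuous(x))
mse_linear = min(cv_mse(X_linear, y, lam) for lam in lambdas)
mse_expanded = min(cv_mse(X_expanded, y, lam) for lam in lambdas)
print(f"linear-only CV MSE: {mse_linear:.2f}, expanded CV MSE: {mse_expanded:.2f}")
```

In a full analysis the expansion would be applied to every continuous covariate and the resulting columns combined with the categorical covariates before fitting.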

Simulation Study

To evaluate the performance of the RIPR algorithm, we conducted a simulation study to allow specification of the true functional forms of associations between predictors and outcomes and evaluation of the method’s performance under different data characteristics [18]. We compared the RIPR algorithm to two other potential modelling strategies: ridge regression using the original set of predictors (regular ridge regression) and ridge regression using a cubic spline representation for each continuous predictor (spline ridge regression, see online supplementary materials section 2 for a complete description of methods for creating the updated set of predictors). These comparisons allowed us to evaluate whether the use of multiple functional forms was beneficial compared to assuming linear associations or only allowing spline terms.

We simulated 100 subjects and 100 predictors and simulated the outcome vector using a multiple linear regression model with standard normally distributed predictors and error terms. We considered equal proportions of truly linear, quadratic, exponential, and inverse tangent functional relationships (see Fig. 1 for examples) and scenarios in which the proportion for one functional form was upweighted (half of associations) and the remaining proportions were equal (see online suppl. materials section 4 for the distributions of functional forms for each simulation setting). These functional forms were chosen to be representative of the most common functional relationships with continuous outcomes. Predictor coefficients were randomly simulated as either −5 or 5 for nonlinear terms and −1 or 1 for linear terms. We also varied the proportion of predictors that were uninformative in generating the outcome (i.e., coefficients set to zero), which we call the sparsity coefficient index, over {0, 0.2, 0.4, 0.6, 0.8, 1}. We used an 80/20 training/testing split and averaged the results over 1,000 simulation replicates.
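A single simulation replicate in the equal-proportions setting might be generated as in the sketch below. How functional forms are assigned to predictors and how the uninformative predictors are chosen are assumptions made for illustration (random assignment in both cases); the supplementary materials describe the actual settings.

```python
import math
import random

# The four true functional forms considered in the simulation.
FORMS = {
    "linear": lambda v: v,
    "quadratic": lambda v: v ** 2,
    "exponential": math.exp,
    "inverse_tangent": math.atan,
}


def simulate(n=100, p=100, sparsity=0.2, seed=0):
    """Simulate predictors, true forms, coefficients, and outcome for one replicate."""
    rng = random.Random(seed)
    # Standard normal predictors.
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    # Equal proportions of the four forms, assigned in rotation and then shuffled.
    names = list(FORMS)
    forms = [names[j % len(names)] for j in range(p)]
    rng.shuffle(forms)
    # Coefficient magnitude 1 for linear terms and 5 for nonlinear terms; random sign.
    beta = [rng.choice([-1, 1]) * (1 if f == "linear" else 5) for f in forms]
    # Zero out a random subset of predictors (assumed selection mechanism).
    for j in rng.sample(range(p), round(sparsity * p)):
        beta[j] = 0
    # Outcome: sum of transformed predictors times coefficients, plus N(0, 1) error.
    y = [
        sum(b * FORMS[f](x) for b, f, x in zip(beta, forms, row)) + rng.gauss(0, 1)
        for row in X
    ]
    return X, forms, beta, y


X, forms, beta, y = simulate()

# 80/20 training/testing split by shuffled row index.
idx = list(range(len(y)))
random.Random(1).shuffle(idx)
train_idx, test_idx = idx[:80], idx[80:]
print(len(train_idx), len(test_idx), sum(b == 0 for b in beta))
```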

Fig. 1.

Plots of simulated linear, quadratic, exponential, and inverse tangent functional relationships for 100 observations. The predictor values were simulated from a random uniform distribution with minimum and maximum values of −10 and 10, respectively, and the same predictor values were used for each functional association plot. The outcome vector for each functional relationship plot was computed by first applying the appropriate transformation (from left to right, starting with the top left plot: identity, square, exponential, or inverse tangent) to the predictor values and then adding random Gaussian noise with mean zero and standard deviations of 2.5, 7, 1,000, or 0.1, respectively. For the exponential functional association plot, only the 50 observations corresponding to positive predictor values were retained and the outcome vector was scaled by 500.

To compare the performance of the regular ridge, RIPR, and spline ridge regression methods for each of the simulation settings, we calculated the mean squared error (MSE) to quantify each model’s predictive accuracy, with lower MSE indicating better accuracy. We calculated the proportion of simulation repetitions for which the RIPR algorithm provided a lower MSE than the other two methods. Additionally, we determined the average percent reduction in MSE relative to the other two models among the simulation replicates in which the RIPR algorithm provided a lower MSE.
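The two summary measures can be computed per comparison method as in the following sketch. The MSE values are made up for illustration, and treating each comparison method separately is an assumption about how the summaries were formed.

```python
def compare(mse_new, mse_ref):
    """Return (win rate, average % MSE reduction among wins) for paired replicates."""
    pairs = list(zip(mse_new, mse_ref))
    wins = [(new, ref) for new, ref in pairs if new < ref]
    win_rate = len(wins) / len(pairs)
    # Percent reduction is taken relative to the reference model's MSE.
    avg_reduction = (
        sum(100 * (ref - new) / ref for new, ref in wins) / len(wins) if wins else 0.0
    )
    return win_rate, avg_reduction


# Hypothetical test-set MSEs from four simulation replicates.
mse_ripr = [9.0, 8.0, 12.0, 9.5]
mse_regular = [10.0, 10.0, 11.0, 10.0]
win_rate, avg_reduction = compare(mse_ripr, mse_regular)
print(f"win rate: {win_rate:.2f}, average reduction among wins: {avg_reduction:.1f}%")
```

Because the average reduction is conditional on a win, it should be read together with the win rate rather than as an overall effect size.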

NEPTUNE Data Analysis

We compared the performance of the RIPR, regular ridge, and spline ridge algorithms for predicting PROMIS mental and physical scores using the NEPTUNE dataset. MSE was calculated for each of the three methods and each of the two outcomes. Because the NEPTUNE dataset has a limited sample size and no external validation dataset, we used bootstrapping for internal validation [19]. Top predictors were identified based on highest magnitudes of the standardized regression coefficients from each model fit. A complete description of the procedure for choosing the top predictors can be found in online supplementary materials section 5. For the spline ridge regression, a predictor was included in the top 20 predictors list if any of the spline basis coefficients for that predictor were among the twenty coefficients with the greatest magnitude. Similarly, for the RIPR algorithm, a predictor was included in the top 20 predictors list if any of the basis coefficients for any functional form (linear, quadratic, quartiles, and spline) for that predictor were among the twenty coefficients with the greatest magnitude. If multiple basis components for either the spline ridge regression or RIPR algorithms (including across functional forms for the latter algorithm) were in the top 20 predictors list, then the predictor was only counted once in the top 20 predictors list. All statistical analyses were conducted using R software, version 4.1.2 (R Development Core Team, Vienna, Austria), and our implementation of the RIPR algorithm is available on GitHub: https://github.com/jeremysrubin/RIPR.
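The rule for counting a predictor only once when several of its basis coefficients rank highly can be sketched as below. The predictor names and coefficient values are hypothetical, and the sketch covers only the grouping step, not the full selection procedure in the supplementary materials.

```python
def top_predictors(coefs, k=20):
    """Group the k largest-magnitude coefficients by predictor.

    coefs: list of (predictor, functional_form, standardized_coefficient);
    functional_form is None for categorical predictors.
    Returns (predictor, [forms]) pairs ordered by each predictor's largest coefficient.
    """
    ranked = sorted(coefs, key=lambda t: abs(t[2]), reverse=True)[:k]
    grouped = {}  # dicts keep insertion order, so ranking order is preserved
    for predictor, form, _ in ranked:
        forms = grouped.setdefault(predictor, [])
        if form is not None:
            forms.append(form)
    return list(grouped.items())


# Hypothetical standardized coefficients from an expanded ridge model.
coefs = [
    ("hemoglobin", "quartile", -2.1),
    ("income", None, 1.8),
    ("wbc", "spline", 1.5),
    ("wbc", "quartile", -1.2),
    ("egfr", "linear", 0.3),
    ("wbc", "linear", 0.1),
]
top = top_predictors(coefs, k=4)
print(top)
```

Here the four largest coefficients collapse to three predictors, because two of them belong to the same variable.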

Simulation Study Results

Detailed results of the simulation study are available in online supplementary materials section 4. For simulation settings with sparsity ranging from 0 to 0.4, the RIPR algorithm produced a lower MSE than the regular ridge regression and spline ridge regression in the majority of simulation repetitions. The average reduction in MSE using the RIPR algorithm in comparison to the regular ridge regression or spline ridge regression was in the range 6–11%. When sparsity was greater than 0.4, the RIPR algorithm reduced the MSE in fewer than half of simulation repetitions, and in the repetitions where it did reduce the MSE at this higher sparsity, the average reductions were small.

For simulation settings in which we varied the proportions of true functional forms, the RIPR algorithm was robust to the presence of increased linear, quadratic, and inverse tangent functional forms, as the percentage of simulations in which the RIPR method performs better than the other two methods was over 50%, and the average percent reduction in MSE provided by the RIPR method for the simulation runs in which RIPR resulted in a lower MSE remained similar across these functional form distributions. However, when there was an increased proportion of exponential functional forms, the RIPR algorithm less frequently resulted in a lower MSE than the regular and spline ridge regression methods.

NEPTUNE Data Analysis

Participant characteristics of the NEPTUNE study sample (before imputation, N = 107) are provided in Table 1. The mean age of a participant in our sample was 46 years, and the median kidney disease duration was 4 months. Additionally, the average eGFR and UPCR at baseline were 72.07 mL/min/1.73 m² and 3.76 g/g, respectively. Most participants in the sample were male (57%). The average BMI of participants in our sample was 28.61. More than half of the participants were White/Caucasian (57%). Distributions of variables after imputation were similar to those before imputation (online suppl. materials section 1).

Table 1.

Study sample characteristics

| Variable | Mean/Median (SD/IQR) or count (%) |
|---|---|
| Patient age, years | 46 (15.11) |
| Kidney disease duration, months^a | 4 (1–12) |
| eGFR at baseline, mL/min/1.73 m² | 72.07 (33.76) |
| UPCR at baseline, g/g^a | 3.76 (1.64–7.53) |
| RAAS block medication use | 54 (50%) |
| LDL cholesterol^a | 156.5 (121.3–244.0) |
| Any edema present | 52 (48%) |
| BMI | 28.61 (5.40) |
| Male | 62 (57%) |
| Cohort | |
|   MN | 36 (33%) |
|   MCD | 20 (18%) |
|   FSGS | 8 (7%) |
|   Other | 28 (26%) |
| Annual household income | |
|   USD 0–19,999 | 26 (24%) |
|   USD 20,000–39,999 | 14 (13%) |
|   USD 40,000–59,999 | 14 (13%) |
|   USD 60,000–79,999 | 10 (9%) |
|   USD 80,000–99,999 | 7 (6%) |
|   USD 100,000+ | 18 (16%) |
| Education | |
|   High school or below | 53 (49%) |
|   Some college (including 2-year associates degree/certificate) | 32 (29%) |
|   Graduate school | 19 (17%) |
| Patient race | |
|   Asian/Asian American | 13 (12%) |
|   Black/African American | 28 (26%) |
|   Multi-racial | 2 (2%) |
|   White/Caucasian | 62 (57%) |

^a Indicates median and IQR were used due to skewed distribution. Number of participants is 107. Descriptive statistics were calculated prior to missing data imputation.

Top Predictors of PROMIS Mental Scores

The MSEs of the RIPR, regular ridge, and spline ridge regression algorithms for the PROMIS mental scores were 101.224, 110.968, and 95.631, respectively (Table 2). For the RIPR algorithm, the top predictors of the PROMIS mental scores were: calcium (spline, quartile), number of times seen provider for well/check visits during past 6 months (spline, quartile), WBC (spline, quartile), annual household income, number of times received ER care during past 6 months (spline, quadratic), hypertension at baseline, race of father, race of mother, age of biological father (spline), participant disease cohort, blood urea nitrogen (quartile), total number of days hospitalized during past 6 months (quartile, linear, quadratic, spline), platelets (quartile), serum albumin (quartile, spline), and CO2 (quartile). The parentheses following each continuous variable in this top predictors list for the RIPR algorithm indicate the functional forms which had corresponding top ranking coefficients. Ten of the top 15 predictors were continuous and two of these (platelets and CO2) were not identified as top predictors by either of the other models (Table 3 and online suppl. materials section 6). Six of 10 continuous top predictors were represented by multiple functional forms and another three were best characterized in quartiles.

Table 2.

NEPTUNE data analysis: estimated MSE of prediction models

| Outcome | RIPR | Regular ridge regression | Spline ridge regression |
|---|---|---|---|
| PROMIS mental scores | 101.224 | 110.968 | 95.631 |
| PROMIS physical scores | 47.493 | 73.668 | 56.113 |

Internally validated performance measured by MSE for RIPR, regular ridge regression, and spline ridge regression models using 107 NEPTUNE participants. Lower MSE indicates better predictive accuracy. The number of bootstrap samples was 100.

Table 3.

Top Predictors of PROMIS Scores

a) Top predictors of PROMIS mental scores

| Ranking | Predictor (RIPR) | Predictor (regular ridge regression) | Predictor (spline ridge regression) |
|---|---|---|---|
| 1 | Calcium (spline, quartile) | Annual household income | Calcium |
| 2 | Number of times seen provider for well/check visits during past 6 months (spline, quartile) | Hypertension at baseline | Number of times seen provider for well/check visits during past 6 months |
| 3 | WBC (spline, quartile) | Race of father | WBC |
| 4 | Annual household income | Race of mother | Annual household income |
| 5 | Number of times received ER care during past 6 months (spline, quadratic) | Participant disease cohort | Number of times received ER care during past 6 months |
| 6 | Hypertension at baseline | Total number of days hospitalized during past 6 months | Hypertension at baseline |
| 7 | Race of father | Passive smoker | Race of father |
| 8 | Race of mother | Chloride | Race of mother |
| 9 | Age of biological father (spline) | Number of times received ER care during past 6 months | Age of biological father |
| 10 | Participant disease cohort | Any edema present | Participant disease cohort |
| 11 | Blood urea nitrogen (quartile) | Blood urea nitrogen | Serum albumin |
| 12 | Total number of days hospitalized during past 6 months (quartile, linear, quadratic, spline) | Immediately after birth, spent time in NICU or ICU | |
| 13 | Platelets (quartile) | | |
| 14 | Serum albumin (quartile, spline) | | |
| 15 | CO2 (quartile) | | |

b) Top predictors of PROMIS physical scores

| Ranking | Predictor (RIPR) | Predictor (regular ridge regression) | Predictor (spline ridge regression) |
|---|---|---|---|
| 1 | Hemoglobin (quartile) | Annual household income | Annual household income |
| 2 | Annual household income | Chloride | WBC |
| 3 | WBC (spline, quartile) | Alcohol use | Alcohol use |
| 4 | eGFR at baseline (quartile) | Shortness of breath | Blood urea nitrogen |
| 5 | Alcohol use | Number of times received ER care during the past 6 months | Shortness of breath |
| 6 | Age of biological father (spline, quartile) | Nausea and/or vomiting | Participant age |
| 7 | Blood urea nitrogen (spline, quadratic) | Any edema present | Age of biological father |
| 8 | Shortness of breath | Total number of days hospitalized during past 6 months | Chloride |
| 9 | | Hypertension at baseline | Number of times seen provider for well/check visits during the past 6 months |
| 10 | | eGFR at baseline | Any edema present |
| 11 | | Blood urea nitrogen | Number of times received ER care during the past 6 months |
| 12 | | Received one or more blood transfusions | |
| 13 | | Serum albumin | |
| 14 | | Cholesterol | |
| 15 | | Number of times seen provider for well/check visits during the past 6 months | |
| 16 | | Participant disease cohort | |
| 17 | | Race of father | |
| 18 | | Drug allergies | |
| 19 | | Swelling | |
| 20 | | Drug use other than pipe or cigar | |

Top predictors for RIPR, regular ridge regression, and spline ridge regression for both (a) PROMIS mental scores and (b) PROMIS physical scores from analyses of 107 participants from the NEPTUNE dataset. For the RIPR algorithm, if coefficients for different functional forms of the same predictor were among the largest in magnitude, then we retained all corresponding functional forms (listed in parentheses) and ranked the predictor based on the highest coefficient.

For all algorithms, we started by considering the top 20 predictors and removed one lowest ranking predictor at a time until MSE of the model fit decreased drastically. For the RIPR algorithm, all top ranking functional forms for a predictor were evaluated at once to check for changes in MSE.

Using the spline ridge regression method, which had the lowest MSE, the predictors with the greatest coefficient magnitudes for the PROMIS mental scores were: calcium, number of times seen provider for well/check visits during past 6 months, WBC, annual household income, number of times received ER care during past 6 months, hypertension at baseline, race of father, race of mother, age of biological father, participant disease cohort, and serum albumin. Every top predictor for the spline ridge regression method for PROMIS mental scores appeared in the corresponding list for the RIPR method. However, the lower MSE from the spline ridge regression model indicates that the more complex functional relationships found by RIPR may not be necessary for optimal prediction. Using the regular ridge regression method, the majority of top predictors were categorical variables, whereas allowing nonlinear functional forms using the other two methods replaced many categorical predictors with continuous predictors. In sensitivity analyses that reduced the missingness threshold, the RIPR method had the lowest MSE, suggesting that the more complex functional forms may be necessary with the smaller set of predictors (online suppl. materials section 8).

Top Predictors of PROMIS Physical Scores

The MSEs of the RIPR, regular ridge regression, and spline ridge regression algorithms for predicting PROMIS physical scores were 47.493, 73.668, and 56.113, respectively (Table 2). Using the RIPR algorithm, the top predictors of PROMIS physical scores were: hemoglobin (quartile), annual household income, WBC (spline, quartile), eGFR at baseline (quartile), alcohol use, age of biological father (spline, quartile), blood urea nitrogen (spline, quadratic), and shortness of breath. Plots illustrating the estimated functional forms of these continuous predictors are provided in online supplementary materials section 7. WBC, age of biological father, and blood urea nitrogen all exhibited some U-shaped associations with PROMIS physical scores. Further, hemoglobin and eGFR displayed step function-like relationships with PROMIS physical scores. Five of the top eight predictors were continuous; as listed in Table 3b, four of these had top-ranking quartile coefficients, three had top-ranking spline coefficients, and one had a top-ranking quadratic coefficient (Table 3b and online suppl. materials section 6). Further, three of these predictors were best represented by multiple functional forms. One predictor (hemoglobin) was not identified as a top predictor by either of the other two methods.

The regular ridge regression model had the highest MSE (worst accuracy) for predicting PROMIS physical scores, and 12 of its 20 top predictors were categorical variables. Many of these top predictors did not overlap with those of the spline ridge regression or RIPR models. Using the spline ridge regression, seven of the top 11 predictors were continuous. In sensitivity analyses that lowered the missingness threshold, the RIPR algorithm similarly had the lowest MSE.

We have shown evidence that the predictive ability of these penalized regression models can change when modelling nonlinear functional forms. The inclusion of these nonlinear functional forms of continuous predictors in a ridge regression model may provide the greatest and most consistent reductions in MSE when most predictors are truly informative of the outcome (low sparsity). This trend may suggest that as the number of informative predictors increases, it is more likely for complex dependencies to exist between these predictors and the outcome. Furthermore, the predictive ability of models that regress on nonlinear functional forms appears to remain relatively stable under changes in the distribution of true, underlying functional relationships between the continuous predictors and the outcome. The RIPR algorithm had the best estimated performance for predicting PROMIS physical scores, but only the second-best estimated performance for PROMIS mental scores, compared to standard ridge regression or ridge regression with spline terms only. This may have arisen from certain continuous clinical characteristics that have complex, nonlinear associations with physical – rather than mental – health. We observed that multiple top predictors of the PROMIS physical scores when using RIPR were laboratory measurements such as hemoglobin, WBC, eGFR at baseline, and blood urea nitrogen. Because RIPR models each of these continuous covariates with multiple functional forms, the improved predictive ability of RIPR for PROMIS physical scores may suggest that the relationships between these factors and physical PROs are better represented by combinations of several functional forms, unlike previous literature which only considered linear associations [1‒5]. In contrast, the better performance of spline ridge regression for predicting PROMIS mental scores and the overlap between top predictors using spline ridge regression and RIPR suggest that spline terms were sufficient for capturing the nonlinear associations between predictors and PROMIS mental scores.

Not only can the predictive ability of these machine learning models change in the presence of nonlinear functional forms of continuous variables, but the top predictors of these outcomes can change as well. In the NEPTUNE data analysis results, there was volatility in the list of top predictors depending on whether nonlinear functional forms were considered in the regression model. When only allowing for linear functional forms, many categorical independent variables were among the top predictors. Once we allowed for nonlinear functional forms of continuous predictors through the RIPR and spline ridge regression algorithms, the majority of the most important predictors were continuous. Further, for the RIPR algorithm, some of the top ranking continuous predictors had multiple functional forms, and all had at least one nonlinear functional form. Having multiple functional forms indicates that the functional relationships with the outcome are likely more complex than the functional relationships which would be captured by using only one functional form. Therefore, if we ignore nonlinear functional forms of continuous predictors, we can miss important predictors of outcomes. Because the list of top predictors across models can vary, it may be advisable to pick the top predictors based on the model fit with the best predictive accuracy.

Through better identification of the top predictors and the shapes of their functional forms, we may discover unusual or unexpected patterns that provide important information about biological mechanisms or risk assessment. For instance, based on the estimated association between eGFR and PROMIS physical scores in online supplementary materials section 7, it appears that participants with mild chronic kidney disease had the highest physical health scores. This relationship may be worth further exploration to determine whether there are underlying reasons for this association, such as whether these participants have increased provider access or are on medications that contribute to improved physical HRQOL.

Although the RIPR method did not always produce a lower MSE than the regular and spline ridge regression algorithms, the simulation study and NEPTUNE data analysis illustrate that the RIPR algorithm can have better predictive ability under many data characteristics and may better capture the true functional relationship between a continuous predictor and an outcome. Our recommended strategy to predict PROs is therefore to consider a wide variety of models, including regular ridge regression, spline ridge regression, and the RIPR algorithm, alongside random forest and other machine learning methods, to assess which models provide the best fit for a particular data application. If multiple models have similar predictive ability, then choosing the simplest of these models can allow for easier interpretation of the functional relationships between the predictors and the outcome.
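The model comparison strategy described above can be sketched as a cross-validated MSE comparison across candidate design matrices. The example below is a simplified illustration under stated assumptions, not the analysis pipeline used in this study: the data are simulated, the ridge penalty `lam` is fixed at 1.0 rather than tuned, the spline knots are computed once on the full sample, and the helper names (`ridge_fit`, `ridge_predict`, `cv_mse`) are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit on standardized columns; intercept is unpenalized."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    Z = (X - mu) / sd
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ (y - y.mean()))
    return mu, sd, beta, y.mean()

def ridge_predict(model, X):
    mu, sd, beta, intercept = model
    return intercept + ((X - mu) / sd) @ beta

def cv_mse(X, y, lam, k=5, seed=0):
    """Mean squared prediction error averaged over k cross-validation folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        model = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[fold] - ridge_predict(model, X[fold])) ** 2))
    return float(np.mean(errors))

# Simulated data (hypothetical): the outcome depends on x through a U shape.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = x ** 2 + rng.normal(scale=0.3, size=200)

# Candidate design matrices: linear only, linear + quadratic, and a cubic
# spline in truncated power form with knots at the sample quartiles.
knots = np.quantile(x, [0.25, 0.5, 0.75])
candidates = {
    "linear": x[:, None],
    "linear + quadratic": np.column_stack([x, x ** 2]),
    "cubic spline": np.column_stack(
        [x, x ** 2, x ** 3] + [np.clip(x - k, 0, None) ** 3 for k in knots]
    ),
}
scores = {name: cv_mse(X, y, lam=1.0) for name, X in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "-> lowest cross-validated MSE:", best)
```

When several candidates have similar cross-validated MSE, the simpler design matrix would usually be preferred for interpretability, in line with the recommendation above.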

While our study demonstrates the potential of the RIPR algorithm to identify functional forms of continuous predictors and improve prediction of PROs, there are several important limitations to note. First, we had a limited sample size and set of predictors in the NEPTUNE dataset, making it difficult to generalize these findings. Other domains of predictors, such as gene expression and pathology data, were not available for most patients with PRO data. Because of our limited sample size for the NEPTUNE data analysis, future work should be directed toward validating these results in external datasets, especially those that include healthy controls for comparison. Adapting RIPR to handle longitudinal data is another potential extension of this work. Second, we considered four different functional forms of continuous predictors, but other nonlinear functional relationships such as fractional polynomials [20‒22] may better capture the true, underlying functional forms of continuous predictors from certain domains. Further, for the functional forms of continuous predictors that we incorporated in the RIPR algorithm, we could further optimize the method over parameters such as the number of basis components used for the spline terms or the number of bins used for categorization (e.g., deciles instead of quartiles). Third, we demonstrated in online supplementary materials section 8 that the predictive ability of these models may be sensitive to different missingness thresholds used for inclusion/exclusion of predictors. Characterizing the impact of different imputation strategies and different levels of missingness on model performance would be an interesting area for future research.
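For reference, the fractional polynomial family mentioned above as an alternative functional form can be written as follows for a positive covariate x, following the standard definition of Royston and Altman [21]. The notation is chosen for this illustration and is not used elsewhere in this paper.

```latex
\mathrm{FP}_m(x) = \beta_0 + \sum_{j=1}^{m} \beta_j \, x^{(p_j)},
\qquad
x^{(p)} =
\begin{cases}
x^{p}, & p \neq 0,\\
\ln x, & p = 0,
\end{cases}
\qquad
p_j \in \{-2, -1, -0.5, 0, 0.5, 1, 2, 3\}.
```

When a power is repeated (p_j = p_{j-1}), the repeated term is conventionally multiplied by ln x. Low-degree fractional polynomials (m = 1 or 2) would add only one or two columns per covariate, compared with the larger set of columns generated by quartile and spline terms.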

Despite these potential limitations, we believe that this work is meaningful in multiple respects. To our knowledge, this is one of the first studies to offer a simple algorithm to explicitly model multiple, nonlinear functional forms of continuous predictors for the prediction of PROs and other continuous outcomes. We also demonstrated meaningful changes in the list of independent variables and functional forms of continuous predictors that were most important for predicting PROs when nonlinear functional forms were considered. Further, we illustrated that the predictive ability of machine learning models can change when we include nonlinear functional forms of continuous independent variables.

In this study, we propose a novel algorithm, RIPR, for predicting outcomes through a ridge regression model that incorporates multiple, nonlinear functional forms of the continuous predictors. The functional forms identified by RIPR motivate future research toward understanding the mechanisms behind nonlinear relationships between predictors and outcomes, both for risk assessment and for targeting any modifiable risk factors. For instance, future studies using RIPR can help providers identify patients at high risk of low QOL and/or motivate research that elucidates mechanisms for nonlinear associations. The RIPR method can provide improved prediction of an outcome relative to alternative ridge regression approaches for a variety of data attributes. Additionally, by considering multiple functional forms per continuous covariate, the RIPR algorithm allows for automatic estimation of functionally complex relationships between continuous predictors and an outcome in data settings with many predictors but relatively few subjects.

The Nephrotic Syndrome Study Network (NEPTUNE) is part of the Rare Diseases Clinical Research Network (RDCRN), which is funded by the National Institutes of Health (NIH) and led by the National Center for Advancing Translational Sciences (NCATS) through its Division of Rare Diseases Research Innovation (DRDRI). NEPTUNE is funded under grant number U54DK083912 as a collaboration between NCATS and the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). Additional funding and/or programmatic support is provided by the University of Michigan, NephCure Kidney International, and the Halpin Foundation. RDCRN consortia are supported by the RDCRN Data Management and Coordinating Center (DMCC) and funded by NCATS and the National Institute of Neurological Disorders and Stroke (NINDS) under U2CTR002818.

Ethical approval was not required for this study in accordance with local/national guidelines. Written informed consent from participants was not required in accordance with local/national guidelines.

The authors have no conflicts of interest to declare.

There was no funding.

Jeremy Rubin, Laura Mariani, Abigail Smith, and Jarcy Zee: data analysis, interpretation, manuscript draft, rewriting, and conceptualization.

The data in this study were obtained from the Nephrotic Syndrome Study Network (NEPTUNE) where data sharing requires ancillary study approval and a data use agreement. The dataset may be requested from the NEPTUNE Data Analysis and Coordinating Center (DACC), https://www.neptune-study.org/. Further inquiries can be directed to the corresponding author.

1. Troost JP, Waldo A, Carlozzi NE, Murphy S, Modersitzki F, Trachtman H. The longitudinal relationship between patient-reported outcomes and clinical characteristics among patients with focal segmental glomerulosclerosis in the Nephrotic Syndrome Study Network. Clin Kidney J. 2020;13(4):597–606.
2. Troost JP, Gipson DS, Carlozzi NE, Reeve BB, Nachman PH, Gbadegesin R. Using PROMIS to create clinically meaningful profiles of nephrotic syndrome patients. Health Psychol. 2019;38(5):410–21.
3. Canetta PA, Troost JP, Mahoney S, Kogon AJ, Carlozzi N, Bartosh SM. Health-related quality of life in glomerular disease. Kidney Int. 2019;95(5):1209–24.
4. Murphy SL, Mahan JD, Troost JP, Srivastava T, Kogon AJ, Cai Y. Longitudinal changes in health-related quality of life in primary glomerular disease: results from the CureGN study. Kidney Int Rep. 2020;5(10):1679–89.
5. Krissberg JR, Helmuth ME, Almaani S, Cai Y, Cattran D, Chatterjee D. Racial-ethnic differences in health-related quality of life among adults and children with glomerular disease. Glomerular Dis. 2021;1(3):105–17.
6. Huang J, Horowitz JL, Wei F. Variable selection in nonparametric additive models. Ann Stat. 2010;38(4):2282–313.
7. Huang J, Wei F, Ma S. Semiparametric regression pursuit. Stat Sin. 2012;22(4):1403–26.
8. Xie H, Huang J. SCAD-penalized regression in high-dimensional partially linear models. Ann Stat. 2009;37(2):673–96.
9. Zhang HH, Cheng G, Liu Y. Linear or nonlinear? Automatic structure discovery for partially linear models. J Am Stat Assoc. 2011;106(495):1099–112.
10. Gadegbeku CA, Gipson DS, Holzman LB, Ojo AO, Song PXK, Barisoni L. Design of the Nephrotic Syndrome Study Network (NEPTUNE) to evaluate primary glomerular nephropathy by a multidisciplinary approach. Kidney Int. 2013;83(4):749–56.
11. Breiman L. Random forests. Machine Learn. 2001;45(1):5–32.
12. Biau G, Scornet E. A random forest guided tour. TEST. 2016;25(2):197–227.
13. Stekhoven DJ, Bühlmann PJ. MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28(1):112–8.
14. PROMIS. Northwestern University. 2022. Available from: https://www.healthmeasures.net/explore-measurement-systems/promis.
15. Hellton KH, Hjort NL. Fridge: focused fine-tuning of ridge regression for personalized predictions. Stat Med. 2018;37(8):1290–303.
16. McDonald GC. Ridge regression. WIREs Comp Stat. 2009;1(1):93–100.
17. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12(1):55–67.
18. Zee J, Mansfield S, Mariani LH, Gillespie BW. Using all longitudinal data to define time to specified percentages of estimated GFR decline: a simulation study. Am J Kidney Dis. 2019;73(1):82–9.
19. Steyerberg EW, Harrell FE, Borsboom GJJM, Eijkemans MJC, Vergouwe Y, Habbema JDF. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001;54(8):774–81.
20. Dupont WD. Review of multivariable model-building: a pragmatic approach to regression analysis based on fractional polynomials for modeling continuous variables, by Royston and Sauerbrei. Stata J. 2010;10(2):297–302.
21. Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Appl Stat. 1994;43(3):429–67.
22. Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J R Stat Soc Ser A. 1999;162(1):71–94.