Introduction: This study assessed the external validity of all published first trimester prediction models for the risk of preeclampsia (PE) based on routinely collected maternal predictors. Moreover, the potential utility of the best-performing models in clinical practice was evaluated. Material and Methods: Ten prediction models were systematically selected from the literature. We performed a multicenter prospective cohort study in the Netherlands between July 1, 2013, and December 31, 2015. Eligible pregnant women completed a web-based questionnaire before 16 weeks’ gestation. The outcome PE was established using postpartum questionnaires and medical records. Predictive performance of each model was assessed by means of discrimination (c-statistic) and a calibration plot. Clinical usefulness was evaluated by means of decision curve analysis and by calculating the potential impact at different risk thresholds. Results: The validation cohort contained 2,614 women of whom 76 developed PE (2.9%). Five models showed moderate discriminative performance with c-statistics ranging from 0.73 to 0.77. Adequate calibration was obtained after refitting. The best models were clinically useful over a small range of predicted probabilities. Discussion: Five of the ten included first trimester prediction models for PE showed moderate predictive performance. The best models may provide more benefit compared to risk selection as used in current guidelines.
Preeclampsia (PE) is still a leading cause of maternal and neonatal morbidity and mortality worldwide. Although management of PE has improved, the only available curative treatment is delivery of the placenta. Further reduction of the burden of disease may be achieved through prevention [1, 2].
Calcium supplementation and low-dose aspirin are well-researched interventions that reduce the risk of developing PE. A daily intake of at least 1 g of calcium lowers the risk of PE in women with a low dietary calcium intake and women at increased risk of hypertensive disorders. Low-dose aspirin safely reduces the occurrence of PE by 10–43% among high-risk women, and also reduces related adverse outcomes (e.g., preterm birth, small for gestational age) [4-10]. A recent meta-analysis, including the large Aspirin for Evidence-Based Preeclampsia Prevention (ASPRE) trial, demonstrated that aspirin mainly reduces the risk of preterm PE with a reported risk reduction of 38–67% [11, 12].
Efficiency of preventive strategies may improve if they are specifically aimed at women who are most likely to benefit from them. Prediction models may help identify women at increased risk of PE early in pregnancy, a crucial period in which preventive interventions are most effective.
A considerable number of first trimester prediction models for PE have been published, containing predictors ranging from maternal risk factors, blood pressure measurement, to more specialized tests such as uterine artery Doppler ultrasound and biomarkers (e.g., pregnancy-associated plasma protein-A, placental growth factor). Although specialized tests have been reported to improve predictive performance, a drawback is that these tests are not routinely performed or readily available in general antenatal settings, and generate substantial additional costs [13, 14].
Only a minority of the models have been validated in independent populations. Prediction models usually perform less well in populations other than the ones in which they were developed. External validation of models is necessary in order to gain insight into their robustness across populations and domains [15, 16].
In this paper, we report the results of an independent validation study of all published first trimester prediction models based on routinely collected maternal characteristics for the risk of PE, in a cohort of 2,614 Dutch pregnant women. In addition, we evaluated the clinical usefulness of the best-performing models by means of decision curve analysis and by estimating clinical impact at different risk thresholds and under different management scenarios, and compared it with the performance of current international recommendations.
Material and Methods
Selection Prediction Models
We searched PubMed to identify first trimester prediction models for development of PE. The search was first performed in April 2013, before finalizing study questionnaires, and then updated until April 2017. The search strategy retrieved 1,361 citations (see online suppl. File 1; for all online suppl. material, see www.karger.com/doi/10.1159/000490385). L.M. screened all citations and, together with L.S., assessed eligibility of full text articles. In case of disagreement, a third reviewer (H.S.) was available.
Fifty articles described prediction models fulfilling the eligibility criteria, which were: (1) article presented the development of a prediction model or an update of a previously developed model, (2) endpoint of the model was the risk of developing PE, (3) model contained multiple predictors, (4) model was based on weighted risk predictors, (5) predictors were routinely collected in Dutch obstetric practice (maternal characteristics, anthropometric measures, or blood pressure measurements), and (6) predictors were available before 16+0 weeks of gestation. Citation lists of relevant articles were checked and yielded 9 additional eligible articles. L.M. contacted authors if the model intercept, estimates or definition of predictors were not provided in the article. If regression coefficients were nevertheless not obtained, the model was excluded from the validation analyses. We excluded 36 articles for the following reasons: algorithm not available (n = 19), model already published in another article (n = 17). The 23 included articles described 33 prediction models, 10 of which were aimed at predicting overall PE, 12 models at early PE, 6 models at late PE, 2 models at severe PE, and 4 models at predicting gestational age at delivery with PE. We only validated the models predicting any PE as the number of women with early onset PE (i.e., delivery < 34 weeks of gestational age) was too low in our validation cohort (n = 5). None of the models were used in daily care during the study period.
The characteristics of included models are summarized in Table 1 (see Supplementary Table 1 for a comprehensive overview) [17-26]. The models were published between 2005 and 2015 in the UK, Iran, Canada, and Australia by 6 different research groups. Three models were developed for nulliparous women [18, 21, 23]. References of excluded studies are described in online Supplementary File 1.
We performed a multicenter prospective cohort study (Expect Study I) in the south-eastern part of the Netherlands (province of Limburg). Pregnant women less than 16 weeks of pregnancy and aged 18 years or older were recruited between July 1, 2013 and January 1, 2015 in 36 midwifery practices (primary care) and 6 hospitals (secondary and tertiary care). Follow-up took place until December 31, 2015. We excluded pregnancies that terminated before 24 weeks of gestation or for which the outcome was not available. Eligible pregnant women were asked to complete a web-based questionnaire before 16 weeks of gestation (pregnancy questionnaire) and at 6 weeks after the due date (postpartum questionnaire). The web-based questionnaires could be accessed through the study website by use of a personal login code provided with the study information. Paper-and-pencil questionnaires were made available upon request. Automatic reminders were sent out in case of incompleteness or nonresponse. Medical records and letters of discharge about the pregnancy and birth were requested from caregivers.
The Medical Ethical Committee of the Maastricht University Medical Center evaluated the study protocol and declared that no ethical approval was necessary (MEC 13-4-053). All participating women gave informed consent through the Internet.
Assessment of Predictors
Predictors were assessed by means of the pregnancy questionnaire. Systolic and diastolic blood pressure (mm Hg) and heart rate (beats per minute) were measured before 16 weeks of gestation by the gynecologist or midwife. The results of these measurements were subsequently self-reported in the pregnancy questionnaire. The predictors were defined as described in the original articles. For the predictor woman’s birth weight, a proxy variable was created because we measured this variable in categories rather than as a continuous variable. Since no measurements were available for crown-rump length at blood pressure measurement, we created a proxy variable based on gestational age. A detailed description of the definition and assessment of the predictors is provided in online Supplementary Table 2.
Assessment of PE
The Dutch national guideline defines PE as systolic blood pressure ≥140 mm Hg and/or diastolic blood pressure ≥90 mm Hg (Korotkoff V) after 20 weeks gestation, measured twice in a previously normotensive woman, accompanied by proteinuria (at least 300 mg protein in a 24-h urine collection) [29, 30]. The outcome was obtained from a combination of the medical record and postpartum questionnaire. We checked whether the medical record stated a diagnosis of PE or a combination of the field codes proteinuria and highest diastolic blood pressure measurement ≥90 mm Hg. Women were asked to report any occurrence of PE in the postpartum questionnaire. PE was defined as present when the diagnosis was confirmed by both sources. In the absence of the postpartum questionnaire (n = 435), the medical record was used as a gold standard and vice versa (n = 16). In case of a discrepancy between the postpartum questionnaire and medical record, we contacted the obstetric caregiver for final decision.
There is no generally accepted rule for the required sample size for external validation studies of prediction models. Vergouwe et al. state that at least 100 events and 100 nonevents are necessary in order to be able to detect relevant differences between model performance in the derivation set and the validation set. Expecting that around 4% of the pregnancies would be affected by PE and accounting for incomplete records (10%), recruitment of 2,750 women was necessary.
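The recruitment target can be retraced with a short calculation. The sketch below is only one reading of the arithmetic: it assumes that the allowance for incomplete records was applied as a simple 10% inflation of the number of women needed to expect 100 events, which reproduces the reported figure. The function name and its parameters are illustrative and do not come from the study protocol.

```python
def required_recruitment(min_events, incidence, incomplete_fraction):
    """Number of women to recruit so that about `min_events` events are expected.

    Assumption: incomplete records are handled by multiplying the base
    number by (1 + incomplete_fraction).
    """
    # Women needed to expect `min_events` events at the assumed incidence.
    base = min_events / incidence
    # Inflate for the expected share of incomplete records.
    return round(base * (1 + incomplete_fraction))


# 100 events, 4% expected incidence of PE, 10% incomplete records.
target = required_recruitment(min_events=100, incidence=0.04, incomplete_fraction=0.10)
print(target)
```

With 100 nonevents easily reached at this incidence, the number of events is the binding constraint in this calculation.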
Baseline characteristics were analyzed in the validation cohort and presented as means ± standard deviation for continuous variables and an absolute value with percentage for categorical variables. We imputed the missing values by stochastic regression imputation with predictive mean matching as the imputation model. We evaluated the relatedness of the predictors and outcome between the validation cohort and the included original cohorts (case-mix).
For each subject in our validation cohort, we computed probabilities of developing PE on the basis of all included models. The algorithms of the included models are presented in online Supplementary Table 3. Predictive performance of the models was assessed by means of discrimination and calibration. Discrimination is the models’ ability to distinguish between women who will develop PE and those who will not. We quantified discrimination by calculating the c-statistic (area under the receiver operating characteristic [ROC] curve, or AUROC) with 95% confidence interval. Calibration is the degree of agreement between the predicted probabilities and observed outcomes, and was examined by constructing calibration plots. Perfect predictions should be on the 45° line with an intercept of 0 (calibration in the large) and a slope of 1. Calibration in the large compares the mean predicted risk with the observed proportion of PE cases, and indicates whether predictions are systematically too low or too high. Calibration slope refers to the average strength of the predictor effects. At validation, the slope is often smaller than one, reflecting overfitting of the model (low predictions too low and high predictions too high). We divided the women into 10 groups of roughly equal size with similar predicted risk. In case of miscalibration, we recalibrated the prediction model by adjusting the intercept and slope using the linear predictor as the only covariable. This recalibration method has no effect on the discriminative performance of the model as the spread and ranking of the predictions remain the same.
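The two performance measures and the recalibration step can be illustrated with a minimal sketch in plain Python. This is not the code used in the study, which relied on the R packages rms and pROC; the function names, the Newton-Raphson fitting routine, and the toy data are all assumptions made for illustration. Note also that the intercept fitted here is estimated jointly with the slope, whereas calibration in the large is usually estimated with the slope fixed at 1.

```python
import math


def c_statistic(probs, outcomes):
    """Fraction of (case, non-case) pairs in which the case has the higher
    predicted risk; tied pairs count as one half."""
    cases = [p for p, y in zip(probs, outcomes) if y == 1]
    controls = [p for p, y in zip(probs, outcomes) if y == 0]
    score = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                score += 1.0
            elif pc == pn:
                score += 0.5
    return score / (len(cases) * len(controls))


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_recalibration(probs, outcomes, iterations=25):
    """Fit logit(P(y = 1)) = a + b * LP by Newton-Raphson, where LP is the
    logit of the original prediction (the linear predictor)."""
    lp = [logit(p) for p in probs]
    a, b = 0.0, 1.0  # start from the original, unadjusted model
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(lp, outcomes):
            mu = expit(a + b * x)
            w = mu * (1.0 - mu)
            g0 += y - mu          # score for the intercept
            g1 += (y - mu) * x    # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def recalibrate(probs, a, b):
    """Apply the fitted intercept and slope to the original predictions."""
    return [expit(a + b * logit(p)) for p in probs]


# Hypothetical toy data: original predicted risks and observed outcomes.
probs = [0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.40, 0.60]
outcomes = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

a, b = fit_recalibration(probs, outcomes)
updated = recalibrate(probs, a, b)
print(c_statistic(probs, outcomes), c_statistic(updated, outcomes))
```

Because the transformation a + b × LP is monotone for a positive slope, the ranking of the women is unchanged and the c-statistic before and after recalibration is identical, while the mean recalibrated risk matches the observed proportion of cases.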
We used the validation cohort with our own defined eligibility criteria for the analyses of each model. However, the original models were developed in different pregnant populations with their own in- and exclusion criteria. We additionally assessed the discriminative performance for each original study according to their specific eligibility criteria. Furthermore, a subgroup analysis was performed among nulliparous women, as previous PE was a strong predictor in most models.
Lastly, we determined the potential impact of the (recalibrated) models in clinical practice. Decision curve analysis was carried out for the best performing models as it provides a first impression of the clinical usefulness of the models. Decision curve analysis assesses net benefit (net proportion of true positives) of the models over a range of threshold probabilities compared to the scenarios of classifying all and no women as high risk of developing PE. For the model with the highest net benefit, we evaluated the potential clinical impact of the model at different risk thresholds and different management strategies. We calculated incidence reduction of PE and number needed to prevent one case of PE using theoretical preventive interventions with relative risk reductions of 10, 20, and 40%. We also compared the model performance to current guidelines, the NICE and ACOG criteria [35-37]. As the ACOG criteria do not specifically define the number of moderate risk factors to identify women at high risk for PE, we defined ≥2 moderate risk factors as high risk. Low socioeconomic status was defined as education below the level of tertiary education.
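Net benefit at a threshold probability pt is commonly computed as the proportion of true positives minus the proportion of false positives weighted by the odds pt/(1 − pt). The sketch below applies this standard formula to a small hypothetical data set; it is not the DecisionCurve code used in the study, and the function names and data are illustrative assumptions.

```python
def net_benefit(probs, outcomes, threshold):
    """Net benefit of classifying women with predicted risk >= threshold as high risk."""
    n = len(outcomes)
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of classifying every woman as high risk."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Classifying no woman as high risk has a net benefit of 0 by definition.

# Hypothetical toy data: predicted risks and observed outcomes.
probs = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.08, 0.10, 0.20]
outcomes = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

threshold = 0.06
nb_model = net_benefit(probs, outcomes, threshold)
nb_all = net_benefit_treat_all(outcomes, threshold)
print(nb_model, nb_all)
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both the treat-all value and zero; repeating the calculation over a grid of thresholds yields the decision curve.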
All statistical analyses were conducted with IBM SPSS statistics version 23 (SPSS, Chicago, IL, USA) and R version 3.2.3, packages rms, pROC, and DecisionCurve.
We included a total of 2,614 women in the validation cohort (Fig. 1). Table 2 shows the characteristics of the validation cohort in the observed data. For most predictors, the percentage of missing data was less than 1%. However, woman’s own birth weight, systolic blood pressure, and diastolic blood pressure were missing in 12.2, 9.9, and 10.3% of women, respectively. An overview of the number of missing values per predictor and the characteristics of both the complete cases and the validation cohort after imputation is provided in online Supplementary Table 4. The pregnancy was complicated by PE in 76 women (2.9%).
We also described the characteristics of the development samples and compared these with the current validation sample (see online suppl. Table 5). The occurrence of the outcome was considerably higher in the development cohorts of Direkvand-Moghadam et al. and Seed et al. The case-mix of Direkvand-Moghadam et al., Seed et al., and North et al. differed substantially from that of our validation cohort. In contrast to most development cohorts, our validation cohort had a low incidence of non-Caucasian ethnicity. Kenny et al. and Audibert et al. reported no predictor characteristics, and Syngelaki et al. did not describe the number of cases.
The discriminative performance of the models is presented in Table 3. The AUROC decreased in five out of ten models to below or equal to 0.60. The model of Syngelaki et al. showed the best discriminative ability (AUROC 0.77). The models of Macdonald-Wallis et al. and Kenny et al. remained most stable after external validation (0.04 decline in AUROC). The ROC curves are available in online Supplementary Figure 1. The discriminative performances did not change much when validating the models in the cohort according to their own eligibility criteria.
Calibration plots of the models are provided in Figure 2. The distribution of the predicted probabilities was closely clustered around the mean for most models. Models consistently overestimated the risk of PE, except for the two models of Poon et al. and the model of Plasencia et al. All models were overfitted (i.e., the calibration slope < 1). The model of Poon et al. was best calibrated to our population (slope 0.92). The model of Syngelaki et al. had a slope of 1.1. Recalibration of the models showed closer fitting to the ideal calibration line (Fig. 3).
Secondary analysis, in which the models were only applied to nulliparous women (n = 1,326), showed that the discriminative performance decreased considerably for all models with AUROC ranging from 0.51 to 0.61 (see online suppl. Fig. 2). The models specifically developed for nulliparous women (Kenny et al., North et al., and Audibert et al.) did not perform better than the other models. The calibration plots are available in online Supplementary Figure 3. The model of Syngelaki et al. had a slope of 0.59.
Figure 4 shows the net benefit of the five best performing models over a range of risk thresholds. The models were clinically useful, compared with classifying all or no women as high-risk, over a small range of probability thresholds (Macdonald-Wallis et al. 0.50–5.07%, Syngelaki et al. 0.52–5.98%, Poon et al. 0.52–6.08%, Poon et al. 0.52–6.44%, Plasencia et al. 0.68–6.31%). Although the curves differed only slightly, the model of Syngelaki et al. had the highest net benefit for almost the complete range of clinically useful threshold probabilities.
The prognostic measures of the model of Syngelaki et al. and the potential effect of the interventions on PE incidence reduction for six different risk thresholds are presented in Table 4. A low risk threshold ensures that almost no cases will be missed (high sensitivity), but many women will be treated unnecessarily. Conversely, a high risk threshold yields a high specificity and thus a low proportion of false positives. However, the effect of the intervention will then be smaller since more women prone to develop PE will be considered as low risk. The number needed to prevent one case of PE depends on the efficacy of the intervention, but less so on the risk threshold.
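The dependence of these impact measures on the threshold and on the efficacy of the intervention can be made explicit with a small calculation. The sketch below is an illustration under stated assumptions, not the calculation behind Table 4: it assumes that every woman classified as high risk receives the intervention, that the relative risk reduction applies equally to all treated women, and that the number needed to prevent one case is counted in treated women. The sensitivity and the proportion classified as high risk are hypothetical values; only the incidence (76 of 2,614) is taken from the cohort.

```python
def clinical_impact(incidence, sensitivity, flagged_fraction, rrr):
    """Potential impact of treating all women classified as high risk.

    incidence        -- baseline risk of PE in the whole population
    sensitivity      -- share of future PE cases classified as high risk
    flagged_fraction -- share of all women classified as high risk
    rrr              -- relative risk reduction of the intervention
    """
    # Only cases among treated (high-risk) women can be prevented.
    prevented_per_woman = incidence * sensitivity * rrr
    return {
        "relative_incidence_reduction": sensitivity * rrr,
        "incidence_after_intervention": incidence - prevented_per_woman,
        "number_needed_to_treat": flagged_fraction / prevented_per_woman,
    }


incidence = 76 / 2614  # observed in the validation cohort

# Hypothetical threshold: 70% sensitivity with 30% of women classified as high risk.
results = {
    rrr: clinical_impact(incidence, sensitivity=0.70, flagged_fraction=0.30, rrr=rrr)
    for rrr in (0.10, 0.20, 0.40)
}
for rrr, impact in results.items():
    print(rrr, impact)
```

In this formulation the number needed to treat equals 1/(positive predictive value × relative risk reduction), so it is inversely proportional to the efficacy of the intervention and changes with the threshold only through the positive predictive value.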
We compared the model of Syngelaki et al. with current guidelines. When applying the NICE criteria to our population, 8.5% of the women were considered as high risk with a sensitivity of 23.7% and a specificity of 91.9%. The ACOG criteria indicated 43.6% of our women as high-risk with a sensitivity of 68.4% and specificity of 57.1%. The model of Syngelaki et al. had a specificity of 90.1 and 70.1% at a sensitivity of 23.7 and 68.4%, respectively.
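A comparison of this kind requires choosing the model threshold at which the sensitivity matches that of the guideline criteria and then reading off the specificity. The following sketch shows one possible way to do this; the function name, the rule of taking the highest threshold that reaches the target sensitivity, and the toy data are assumptions for illustration and are not taken from the study.

```python
def specificity_at_sensitivity(probs, outcomes, target_sensitivity):
    """Return (threshold, sensitivity, specificity) for the highest threshold
    at which the sensitivity reaches the target."""
    n_cases = sum(outcomes)
    n_controls = len(outcomes) - n_cases
    # Walk down from the highest predicted risk until enough cases are captured.
    for threshold in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
        if tp / n_cases >= target_sensitivity:
            tn = sum(1 for p, y in zip(probs, outcomes) if p < threshold and y == 0)
            return threshold, tp / n_cases, tn / n_controls
    raise ValueError("target sensitivity cannot be reached")


# Hypothetical toy data: predicted risks and observed outcomes.
probs = [0.01, 0.02, 0.03, 0.04, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30]
outcomes = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

threshold, sensitivity, specificity = specificity_at_sensitivity(probs, outcomes, 0.75)
print(threshold, sensitivity, specificity)
```

In a small sample the achieved sensitivity can exceed the target because the sensitivity changes in steps; in a cohort of the present size the match would be closer.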
We assessed the generalizability of ten first trimester prediction models based on routinely collected maternal characteristics for the risk of PE in a Dutch population. Five models (Macdonald-Wallis et al., Syngelaki et al., Poon et al., Poon et al., Plasencia et al.) showed moderate discriminative performance (AUROCs ranging from 0.73 to 0.77). The model of Poon et al. showed the best calibration. This model underestimated the risk of developing PE over the entire range, but showed the best fit of predictor effects (slope 0.92). Refitting of the models improved calibration.
Validation is an important step in predictive modelling since the predictive performance is usually too optimistic in development samples. Internal validation was only performed for three of the included models (Kenny et al., North et al. and Seed et al.). The model of Kenny et al. stayed most stable in our external validation cohort. External validation studies of prediction models for PE are scarce [13, 14, 39, 40]. None have simultaneously considered all first trimester prediction models based on routinely collected maternal characteristics for the risk of PE. Macdonald-Wallis et al. estimated the external validity of their own developed model using another cohort from the same country. Independent external validation was only performed for the model of Plasencia et al. and Poon et al. [40-42]. These external validation studies, particularly those validating the model of Plasencia’s group [41, 42], were hampered by a (very) low number of cases.
Our results confirm the importance of performing an external validation study. The predictive performance decreased for all included models. A previous pregnancy complicated by PE is a strong risk factor and the risk for multiparous women with no previous PE reduces substantially for a subsequent pregnancy [43-45]. The subanalysis among nulliparous women showed that the performance decreased considerably, even for the models specifically developed for nulliparous women. This shows the difficulty of predicting PE among low risk nulliparous women based on exclusively routinely collected maternal characteristics. Although there appears to be scope for further improvement of the models, for now predictive performance of models based on only routinely collected maternal predictors measured up to 16 weeks of gestation is unlikely to improve sufficiently by adding other predictors as most studies already took into account the well-described maternal risk factors. Very strong risk factors such as antiphospholipid syndrome and chronic hypertension are absent in most models. This is probably due to the low prevalence of some of these factors in the general pregnant population. Another possible factor is that pregnant women with essential hypertension may already receive high-risk treatment, reducing PE risk. Most prediction models are developed to assess the risk in a low-risk population. In case of the presence of very strong risk factors, women should also be considered as high risk.
Performance parameters are important characteristics of a prediction model but do not equal the usefulness of the model in clinical practice. Moreover, a prediction model can only lead to benefit if there is an effective follow-up strategy. Decision curve analysis indicated that the five best-performing models were clinically useful compared to treating all women or none at all as high risk over a small range of probability thresholds. No model had a considerably higher net benefit than others. Based on predictive performance, mainly among nulliparous women, we prefer the model of Syngelaki et al., which contains a wide range of known risk factors for PE (ethnicity, BMI, age, preexisting chronic hypertension, preexisting diabetes mellitus, smoking status, conception method, family history of PE, previous PE, and parity). This model also has good face validity for the end user.
Decision curve analysis gives only a first impression of clinical usefulness. Determining the best risk threshold is a subsequent challenge that has had surprisingly little attention in the prediction literature. A prediction tool should correctly identify individuals at high risk of developing disease, while avoiding unnecessary interventions for individuals at low risk. There is a trade-off between sensitivity and specificity. In the case of PE, a high sensitivity is preferred due to the high burden of disease. The best performing models still had a considerable number of false positives at a high sensitivity. The choice of a cut-off is therefore also dependent on the available intervention. Apart from the efficacy of the intervention in reducing PE, tolerance, side effects, and costs are also very important aspects to consider. The currently available preventive interventions, low-dose aspirin and calcium supplementation, are safe, have no major side effects, are not burdensome and are relatively inexpensive [3-12].
Several current guidelines provide a list of risk factors to indicate women at high risk of developing PE. A limitation of some published guidelines is that they can be interpreted differently by caregivers [36, 46]. Furthermore, criteria lists do not take into account the strength of the different risk factors in relation to PE. Our best model performed better than the ACOG criteria and similarly to the NICE criteria at the same sensitivity.
At the moment, the validated models with an acceptable performance are simple, readily available, and cheap to implement. An extensive decision analysis can provide insight about the expected impact of the model compared to usual care by combining test characteristics with evidence on consequences of the outcome, effects and burden of the further management, and costs. In case the model is worth considering for use in clinical practice, an empirical impact study can determine whether the model indeed leads to better care. Not only outcomes are important but also a lot of other aspects like applicability of the model, acceptability of the intervention, risk counselling, satisfaction, and real costs.
Strengths and Limitations
The main strengths of our study were the design and data collection. We performed a multicenter prospective study in which nearly all pregnant women less than 16 weeks of gestation were eligible. This design ensures a largely representative and unselected population. The use of online questionnaires, an efficient data collection tool in women of reproductive age, improved data quality and reduced the amount of missing data through the incorporation of validation checks. Our dataset contained less than 1% missing data for almost all predictors, and these were imputed in order to prevent biased results and a loss of statistical precision. Outcome assessment based on both the postpartum questionnaires and medical records ensured the most accurate and complete information available in a large cohort.
We selected all existing first trimester prediction models based on routinely collected maternal characteristics for the risk of PE. The paper of Syngelaki et al., included on the basis of citation lists, was not selected from our systematic search as the title and abstract did not clearly indicate that it addressed prediction models for a wide range of adverse pregnancy outcomes. Syngelaki’s model was also not described in existing systematic reviews about prediction models for PE [13, 14, 39, 49]. A limitation in the selection process is that we had to exclude a substantial proportion of prediction models as the algorithm was not available, even after contacting the authors. Moreover, the dataset was too small to validate algorithms predicting early PE. Another point to be mentioned is that we used our own eligibility criteria for a fair comparison of all models. Analysis of the performance measures of the models according to their specific eligibility criteria, however, did not show better performance. Furthermore, we had to create a proxy variable for the predictors woman’s own birth weight and crown-rump length. These proxy variables could have led to an underestimation of the predictive performance of the models concerned. The variable blood pressure could have been subject to measurement error as this variable was self-reported and measured according to routine clinical practice. Despite the availability of standardized protocols for blood pressure measurement, the procedure is often not strictly followed in real practice. A last limitation to be mentioned is the number of cases in our validation cohort. No generally accepted rule is available for the sample size of external validation studies. However, studies suggest that a minimum of 100 events, or preferably more, are needed for accurate and precise estimates of model performance [31, 51]. The study of Collins et al. showed that the precision of the c-statistic and calibration slope did not differ considerably between 75 and 100 cases.
Five of the ten included first trimester prediction models based on routinely collected maternal characteristics for the risk of PE showed moderate predictive performance after external validation. The models are expected to perform better than currently available guidelines in the separation of high- and low-risk pregnancies. Further research should focus on the determination of an acceptable risk threshold and the impact of implementing a chosen model in clinical practice.
We thank all women who participated in the Expect Study I. The Expect Study I could not have been established without the contribution of participating departments of obstetrics and gynecology of hospitals, midwifery practices, and maternity care centers in the Province of Limburg: Zuyderland Medical Center Heerlen and Sittard-Geleen, Maastricht University Medical Center, Laurentius Hospital Roermond, Sint Jans Gasthuis Weert, VieCuri Medical Center, midwifery practice Roermond, midwifery practice Nederweert, midwifery practice Weert, midwifery practice Lenie & Chantal, midwifery practice Loes Wijnhoven, midwifery practice De Roerstreek, midwifery practice Bollebuik, midwifery practice Westenberg, midwifery practice Cranendonck, midwifery practice Becca, midwifery practice Born, midwifery practice Geleen, midwifery practice Grevenbicht, midwifery practice Sittard, midwifery practice Sittard-Oost, midwifery practice Stein, midwifery practice Astrea, midwifery practice Horst & Maasdorpen, midwifery practice Reuver-Tegelen, midwifery practice Janneke van Hal, midwifery practice Raijer & Sup, midwifery practice Venlo-Blerick, midwifery practice Venray, midwifery practice Schoffelen-Van Vleuten, midwifery practice Maastricht, midwifery practice Meerssen, midwifery practice Naomi Satijn, midwifery practice Vita, midwifery practice Het Verloskundig Huis, midwifery practice Valkenburg, midwifery practice Parkstad, midwifery practice Lief, midwifery practice Bevalt Beter, midwifery practice ‘t Bolleke, midwifery practice Natuurlijk bij Jeanny, midwifery practice La Vie, GroenekruisDomicura, Cicogna, ZiN kraamzorg, and maternity center Echt.
The authors declare no conflicts of interest.
The Expect Study was funded by The Netherlands Organization for Health Research and Development, Pregnancy and Childbirth program (ZonMw grant 209020007).