TY - JOUR
T1 - External validation of prognostic models for critically ill patients required substantial sample sizes
AU - Peek, N.
AU - Arts, D. G. T.
AU - Bosman, R. J.
AU - van der Voort, P. H. J.
AU - de Keizer, N. F.
PY - 2007
Y1 - 2007
N2 - OBJECTIVE: To investigate the behavior of predictive performance measures that are commonly used in external validation of prognostic models for outcome at intensive care units (ICUs). STUDY DESIGN AND SETTING: Four prognostic models (the Simplified Acute Physiology Score II, the Acute Physiology and Chronic Health Evaluation II, and the Mortality Probability Models II) were evaluated in the Dutch National Intensive Care Evaluation registry database. For each model, discrimination (AUC), accuracy (Brier score), and two calibration measures were assessed on data from 41,239 ICU admissions. This validation procedure was repeated with smaller subsamples randomly drawn from the database, and the results were compared with those obtained on the entire data set. RESULTS: Differences in performance between the models were small. The AUC and Brier score showed large variation with small samples. Standard errors of AUC values were accurate, but the power to detect differences in performance was low. Calibration tests were extremely sensitive to sample size. Direct comparison of performance, without statistical analysis, was unreliable with either measure. CONCLUSION: Substantial sample sizes are required for performance assessment and model comparison in external validation. Calibration statistics and significance tests should not be used in these settings. Instead, a simple customization method to repair lack-of-fit problems is recommended.
U2 - https://doi.org/10.1016/j.jclinepi.2006.08.011
DO - 10.1016/j.jclinepi.2006.08.011
M3 - Article
C2 - 17419960
SN - 0895-4356
VL - 60
SP - 491
EP - 501
JO - Journal of Clinical Epidemiology
JF - Journal of Clinical Epidemiology
IS - 5
ER -