External validation: a simulation study to compare cross-validation versus holdout or external testing to assess the performance of clinical prediction models using PET data from DLBCL patients

Jakoba J. Eertink; Martijn W. Heymans; Gerben J. C. Zwezerijnen; Josée M. Zijlstra; Henrica C. W. de Vet; Ronald Boellaard

doi:https://doi.org/10.1186/s13550-022-00931-w

Research output: Contribution to journal › Article › Academic › peer-review

13 Citations (Scopus)

Abstract

Aim: Clinical prediction models need to be validated. In this study, we used simulated data to compare various internal and external validation approaches.
Methods: Data of 500 patients were simulated using the distributions of metabolic tumor volume, standardized uptake value, the maximal distance between the largest lesion and another lesion, WHO performance status and age of 296 diffuse large B cell lymphoma patients. These data were used to predict progression after 2 years based on an existing logistic regression model. Using the simulated data, we applied cross-validation, bootstrapping and holdout (n = 100). We simulated new external datasets (n = 100, n = 200, n = 500) and, in addition, (1) simulated stage-specific external datasets, (2) varied the cut-off for high-risk patients, (3) varied the false positive and false negative rates and (4) simulated a dataset with EARL2 characteristics. All internal and external simulations were repeated 100 times. Model performance was expressed as the cross-validated area under the curve (CV-AUC ± SD) and calibration slope.
Results: Cross-validation (0.71 ± 0.06) and holdout (0.70 ± 0.07) resulted in comparable model performance, but the estimate was more uncertain with a holdout set. Bootstrapping resulted in a CV-AUC of 0.67 ± 0.02. The calibration slope was comparable for these internal validation approaches. Increasing the size of the test set resulted in more precise CV-AUC estimates and a smaller SD for the calibration slope. For test datasets with different stages, the CV-AUC increased as Ann Arbor stage increased. As expected, changing the cut-off for high risk and the false positive and false negative rates influenced the model performance, which is clearly shown by the low calibration slope. The EARL2 dataset resulted in similar model performance and precision, but the calibration slope indicated overfitting.
Conclusion: For small datasets, it is not advisable to use a holdout or a very small external dataset with similar characteristics. A single small test dataset suffers from large uncertainty. Therefore, repeated CV using the full training dataset is preferred instead. Our simulations also demonstrated that it is important to consider the impact of differences in patient population between training and test data, which may call for adjustment or stratification of relevant variables.
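
The effect of test-set size on precision described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation: the single standard-normal predictor, the intercept of -1.0 and the slope of 1.0 are hypothetical stand-ins for the PET-based logistic model, and all function names are invented for this example. It draws 100 test sets at each size (n = 100, 200, 500), scores them with a fixed model, and reports the mean and SD of the AUC.

```python
import math
import random
import statistics


def simulate_patients(n, rng, intercept=-1.0, slope=1.0):
    """Return (linear_predictor, outcome) pairs for n hypothetical patients.

    The coefficients are illustrative assumptions, not the published model.
    """
    data = []
    for _ in range(n):
        lp = intercept + slope * rng.gauss(0.0, 1.0)  # fixed model's linear predictor
        p = 1.0 / (1.0 + math.exp(-lp))               # true event probability
        y = 1 if rng.random() < p else 0              # simulated binary outcome
        data.append((lp, y))
    return data


def auc(data):
    """AUC as the fraction of event/non-event pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in data if y == 1]
    neg = [s for s, y in data if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auc_spread(n, repeats, seed):
    """Mean and SD of the AUC over `repeats` independent test sets of size n."""
    rng = random.Random(seed)
    aucs = [auc(simulate_patients(n, rng)) for _ in range(repeats)]
    return statistics.mean(aucs), statistics.stdev(aucs)


# Same number of repeats as in the abstract (100); the seed is arbitrary.
results = {n: auc_spread(n, repeats=100, seed=2022) for n in (100, 200, 500)}
for n, (mean_auc, sd_auc) in results.items():
    print(f"n = {n:3d}: AUC = {mean_auc:.2f} ± {sd_auc:.2f}")
```

Under these assumptions the mean AUC should stay roughly constant across sizes while its SD shrinks as n grows, which is the sense in which a single small test set gives an uncertain performance estimate.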

Original language	English
Article number	58
Journal	EJNMMI Research
Volume	12
Issue number	1
DOIs	https://doi.org/10.1186/s13550-022-00931-w
Publication status	Published - 2022

Keywords

CV-AUC
External validation
Internal validation
Model performance

Cite this

Eertink, J. J., Heymans, M. W., Zwezerijnen, G. J. C., Zijlstra, J. M., de Vet, H. C. W., & Boellaard, R. (2022). External validation: a simulation study to compare cross-validation versus holdout or external testing to assess the performance of clinical prediction models using PET data from DLBCL patients. EJNMMI Research, 12(1), Article 58. https://doi.org/10.1186/s13550-022-00931-w

@article{83d3a3b8f61d4362a75a7e7d617fe294,
  title = "External validation: a simulation study to compare cross-validation versus holdout or external testing to assess the performance of clinical prediction models using PET data from DLBCL patients",
  keywords = "CV-AUC, External validation, Internal validation, Model performance",
  author = "Eertink, {Jakoba J.} and Heymans, {Martijn W.} and Zwezerijnen, {Gerben J. C.} and Zijlstra, {Jos{\'e}e M.} and {de Vet}, {Henrica C. W.} and Boellaard, Ronald",
  note = "Funding Information: This work was financially supported by the Dutch Cancer Society (# VU 2018-11648). Publisher Copyright: {\textcopyright} 2022, The Author(s).",
  year = "2022",
  doi = "10.1186/s13550-022-00931-w",
  language = "English",
  volume = "12",
  journal = "EJNMMI Research",
  issn = "2191-219X",
  publisher = "Springer Berlin",
  number = "1",
}

External validation: a simulation study to compare cross-validation versus holdout or external testing to assess the performance of clinical prediction models using PET data from DLBCL patients. / Eertink, Jakoba J.; Heymans, Martijn W.; Zwezerijnen, Gerben J. C. et al.
In: EJNMMI Research, Vol. 12, No. 1, 58, 2022.

TY  - JOUR
T1  - External validation
T2  - a simulation study to compare cross-validation versus holdout or external testing to assess the performance of clinical prediction models using PET data from DLBCL patients
AU  - Eertink, Jakoba J.
AU  - Heymans, Martijn W.
AU  - Zwezerijnen, Gerben J. C.
AU  - Zijlstra, Josée M.
AU  - de Vet, Henrica C. W.
AU  - Boellaard, Ronald
PY  - 2022
Y1  - 2022
KW  - CV-AUC
KW  - External validation
KW  - Internal validation
KW  - Model performance
UR  - http://www.scopus.com/inward/record.url?scp=85138269415&partnerID=8YFLogxK
U2  - https://doi.org/10.1186/s13550-022-00931-w
DO  - https://doi.org/10.1186/s13550-022-00931-w
M3  - Article
C2  - 36089634
SN  - 2191-219X
VL  - 12
JO  - EJNMMI Research
JF  - EJNMMI Research
IS  - 1
M1  - 58
ER  -
