TY - JOUR
T1 - Estimation of predictive performance in high-dimensional data settings using learning curves
AU - Goedhart, Jeroen M.
AU - Klausch, Thomas
AU - van de Wiel, Mark A.
N1 - Funding Information: The authors gratefully acknowledge the financial support by Stichting Hanarth Fonds. Publisher Copyright: © 2022 The Author(s)
PY - 2022
Y1 - 2022
N2 - In high-dimensional prediction settings, it remains challenging to reliably estimate the test performance. To address this challenge, a novel performance estimation framework is presented. This framework, called Learn2Evaluate, is based on learning curves: it fits a smooth monotone curve depicting test performance as a function of the sample size. Learn2Evaluate has several advantages compared to commonly applied performance estimation methodologies. Firstly, a learning curve offers a graphical overview of a learner. This overview assists in assessing the potential benefit of adding training samples and it provides a more complete comparison between learners than performance estimates at a fixed subsample size. Secondly, a learning curve facilitates estimating the performance at the total sample size rather than a subsample size. Thirdly, Learn2Evaluate allows the computation of a theoretically justified and useful lower confidence bound. Furthermore, this bound may be tightened by performing a bias correction. The benefits of Learn2Evaluate are illustrated by a simulation study and applications to omics data.
AB - In high-dimensional prediction settings, it remains challenging to reliably estimate the test performance. To address this challenge, a novel performance estimation framework is presented. This framework, called Learn2Evaluate, is based on learning curves: it fits a smooth monotone curve depicting test performance as a function of the sample size. Learn2Evaluate has several advantages compared to commonly applied performance estimation methodologies. Firstly, a learning curve offers a graphical overview of a learner. This overview assists in assessing the potential benefit of adding training samples and it provides a more complete comparison between learners than performance estimates at a fixed subsample size. Secondly, a learning curve facilitates estimating the performance at the total sample size rather than a subsample size. Thirdly, Learn2Evaluate allows the computation of a theoretically justified and useful lower confidence bound. Furthermore, this bound may be tightened by performing a bias correction. The benefits of Learn2Evaluate are illustrated by a simulation study and applications to omics data.
KW - Area under the receiver operating curve
KW - Bootstrap
KW - Cross-validation
KW - High-dimensional data
KW - Omics
KW - Predictive performance
UR - http://www.scopus.com/inward/record.url?scp=85138800330&partnerID=8YFLogxK
U2 - https://doi.org/10.1016/j.csda.2022.107622
DO - https://doi.org/10.1016/j.csda.2022.107622
M3 - Article
SN - 0167-9473
JO - Computational Statistics and Data Analysis
JF - Computational Statistics and Data Analysis
M1 - 107622
ER -