Estimation of predictive performance in high-dimensional data settings using learning curves

Jeroen M. Goedhart; Thomas Klausch; Mark A. van de Wiel

doi:https://doi.org/10.1016/j.csda.2022.107622

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

In high-dimensional prediction settings, it remains challenging to reliably estimate the test performance. To address this challenge, a novel performance estimation framework is presented. This framework, called Learn2Evaluate, is based on learning curves: it fits a smooth monotone curve depicting test performance as a function of the sample size. Learn2Evaluate has several advantages compared to commonly applied performance estimation methodologies. Firstly, a learning curve offers a graphical overview of a learner. This overview assists in assessing the potential benefit of adding training samples, and it provides a more complete comparison between learners than performance estimates at a fixed subsample size. Secondly, a learning curve facilitates estimating the performance at the total sample size rather than at a subsample size. Thirdly, Learn2Evaluate allows the computation of a theoretically justified and useful lower confidence bound. Furthermore, this bound may be tightened by performing a bias correction. The benefits of Learn2Evaluate are illustrated by a simulation study and applications to omics data.
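As a rough illustration of the learning-curve idea described in the abstract (this is not the authors' Learn2Evaluate implementation), the Python sketch below fits a monotone inverse power law, score ≈ a - b * n^(-c), to test-performance estimates obtained at several subsample sizes and then extrapolates to the total sample size. The functional form, the function names `fit_power_law_curve` and `predict`, the exponent grid, and the synthetic data are all assumptions made for illustration.

```python
def fit_power_law_curve(sizes, scores, exponents=None):
    """Fit score ~ a - b * n**(-c): grid search over c, least squares in (a, b).

    Hypothetical helper for illustration only; the curve family is an assumption.
    """
    if exponents is None:
        # Candidate exponents 0.05, 0.10, ..., 2.00 (an arbitrary grid).
        exponents = [k / 100 for k in range(5, 201, 5)]
    best = None
    for c in exponents:
        # With c fixed, the model is linear in x = n**(-c).
        x = [n ** (-c) for n in sizes]
        m = len(x)
        mean_x = sum(x) / m
        mean_y = sum(scores) / m
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, scores))
        # Clamp the slope at zero so the fitted curve never decreases in n.
        slope = min(sxy / sxx, 0.0)
        intercept = mean_y - slope * mean_x
        sse = sum((intercept + slope * xi - yi) ** 2 for xi, yi in zip(x, scores))
        if best is None or sse < best[0]:
            best = (sse, intercept, -slope, c)
    _, a, b, c = best
    return a, b, c


def predict(params, n):
    """Evaluate the fitted curve at sample size n."""
    a, b, c = params
    return a - b * n ** (-c)


# Synthetic (made-up) performance estimates at subsample sizes below N = 120.
sizes = [20, 40, 60, 80, 100]
scores = [0.85 - 0.9 * n ** -0.5 for n in sizes]

params = fit_power_law_curve(sizes, scores)
estimate_at_full_n = predict(params, 120)  # extrapolated performance at N
```

Because the slope is clamped at zero, the fitted curve is non-decreasing in the sample size, which mirrors the monotonicity assumption mentioned in the abstract. The confidence bound and bias correction described there are not covered by this sketch.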

Original language	English
Article number	107622
Journal	Computational Statistics and Data Analysis
Early online date	2022
DOIs	https://doi.org/10.1016/j.csda.2022.107622
Publication status	E-pub ahead of print - 2022

Keywords

Area under the receiver operating curve
Bootstrap
Cross-validation
High-dimensional data
Omics
Predictive performance

Access to Document

https://doi.org/10.1016/j.csda.2022.107622

Cite this

@article{713c567b1cda4b7cb72b3db60584ab5a,
title = "Estimation of predictive performance in high-dimensional data settings using learning curves",
abstract = "In high-dimensional prediction settings, it remains challenging to reliably estimate the test performance. To address this challenge, a novel performance estimation framework is presented. This framework, called Learn2Evaluate, is based on learning curves: it fits a smooth monotone curve depicting test performance as a function of the sample size. Learn2Evaluate has several advantages compared to commonly applied performance estimation methodologies. Firstly, a learning curve offers a graphical overview of a learner. This overview assists in assessing the potential benefit of adding training samples, and it provides a more complete comparison between learners than performance estimates at a fixed subsample size. Secondly, a learning curve facilitates estimating the performance at the total sample size rather than at a subsample size. Thirdly, Learn2Evaluate allows the computation of a theoretically justified and useful lower confidence bound. Furthermore, this bound may be tightened by performing a bias correction. The benefits of Learn2Evaluate are illustrated by a simulation study and applications to omics data.",
keywords = "Area under the receiver operating curve, Bootstrap, Cross-validation, High-dimensional data, Omics, Predictive performance",
author = "Goedhart, {Jeroen M.} and Klausch, Thomas and {van de Wiel}, {Mark A.}",
note = "Funding Information: The authors gratefully acknowledge the financial support by Stichting Hanarth Fonds. Publisher Copyright: {\textcopyright} 2022 The Author(s)",
year = "2022",
doi = "10.1016/j.csda.2022.107622",
language = "English",
journal = "Computational Statistics and Data Analysis",
issn = "0167-9473",
publisher = "Elsevier",
}

TY - JOUR
T1 - Estimation of predictive performance in high-dimensional data settings using learning curves
AU - Goedhart, Jeroen M.
AU - Klausch, Thomas
AU - van de Wiel, Mark A.
PY - 2022
Y1 - 2022
N2 - In high-dimensional prediction settings, it remains challenging to reliably estimate the test performance. To address this challenge, a novel performance estimation framework is presented. This framework, called Learn2Evaluate, is based on learning curves: it fits a smooth monotone curve depicting test performance as a function of the sample size. Learn2Evaluate has several advantages compared to commonly applied performance estimation methodologies. Firstly, a learning curve offers a graphical overview of a learner. This overview assists in assessing the potential benefit of adding training samples, and it provides a more complete comparison between learners than performance estimates at a fixed subsample size. Secondly, a learning curve facilitates estimating the performance at the total sample size rather than at a subsample size. Thirdly, Learn2Evaluate allows the computation of a theoretically justified and useful lower confidence bound. Furthermore, this bound may be tightened by performing a bias correction. The benefits of Learn2Evaluate are illustrated by a simulation study and applications to omics data.
KW - Area under the receiver operating curve
KW - Bootstrap
KW - Cross-validation
KW - High-dimensional data
KW - Omics
KW - Predictive performance
UR - http://www.scopus.com/inward/record.url?scp=85138800330&partnerID=8YFLogxK
U2 - https://doi.org/10.1016/j.csda.2022.107622
DO - 10.1016/j.csda.2022.107622
M3 - Article
SN - 0167-9473
JO - Computational Statistics and Data Analysis
JF - Computational Statistics and Data Analysis
M1 - 107622
ER -
