Evaluating pointwise reliability of machine learning prediction

Giovanna Nicora; Miguel Rios; Ameen Abu-Hanna; Riccardo Bellazzi

doi:10.1016/j.jbi.2022.103996

Research output: Contribution to journal › Article › Academic › peer-review

27 Citations (Scopus)

Abstract

Interest in Machine Learning applications to tackle clinical and biological problems is increasing. This is driven by promising results reported in many research papers, the increasing number of AI-based software products, and by the general interest in Artificial Intelligence to solve complex problems. It is therefore of importance to improve the quality of machine learning output and add safeguards to support their adoption. In addition to regulatory and logistical strategies, a crucial aspect is to detect when a Machine Learning model is not able to generalize to new unseen instances, which may originate from a population distant to that of the training population or from an under-represented subpopulation. As a result, the prediction of the machine learning model for these instances may be often wrong, given that the model is applied outside its “reliable” space of work, leading to a decreasing trust of the final users, such as clinicians. For this reason, when a model is deployed in practice, it would be important to advise users when the model's predictions may be unreliable, especially in high-stakes applications, including those in healthcare. Yet, reliability assessment of each machine learning prediction is still poorly addressed. Here, we review approaches that can support the identification of unreliable predictions, we harmonize the notation and terminology of relevant concepts, and we highlight and extend possible interrelationships and overlap among concepts. We then demonstrate, on simulated and real data for ICU in-hospital death prediction, a possible integrative framework for the identification of reliable and unreliable predictions. To do so, our proposed approach implements two complementary principles, namely the density principle and the local fit principle. The density principle verifies that the instance we want to evaluate is similar to the training set. The local fit principle verifies that the trained model performs well on training subsets that are more similar to the instance under evaluation. Our work can contribute to consolidating work in machine learning especially in medicine.
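
The two principles named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the k-nearest-neighbour distance statistic, the quantile cut-off, the accuracy threshold, the toy data and every function name below are assumptions chosen only to make the idea concrete.

```python
import numpy as np


def knn(X_train, x, k):
    """Indices and Euclidean distances of the k training points nearest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]


def density_ok(X_train, x, k=5, quantile=0.95):
    """Density principle (assumed form): x is 'similar to the training set' if its
    mean k-NN distance does not exceed a quantile of the same statistic computed
    for the training points themselves."""
    pairwise = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=2)
    # Column 0 after sorting is each point's zero distance to itself, so skip it.
    reference = np.sort(pairwise, axis=1)[:, 1:k + 1].mean(axis=1)
    _, d = knn(X_train, x, k)
    return bool(d.mean() <= np.quantile(reference, quantile))


def local_fit_ok(X_train, y_train, predict, x, k=5, min_accuracy=0.8):
    """Local fit principle (assumed form): the model must be accurate on the
    k training points most similar to x."""
    idx, _ = knn(X_train, x, k)
    local_accuracy = np.mean(predict(X_train[idx]) == y_train[idx])
    return bool(local_accuracy >= min_accuracy)


def is_reliable(X_train, y_train, predict, x, k=5):
    """A prediction is flagged reliable only if both checks pass."""
    return density_ok(X_train, x, k) and local_fit_ok(X_train, y_train, predict, x, k)


def make_toy_data(seed=0):
    """Two well-separated Gaussian clusters, one per class (hypothetical data)."""
    rng = np.random.default_rng(seed)
    X0 = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
    X1 = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(50, 2))
    X = np.vstack([X0, X1])
    y = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]
    return X, y


def threshold_model(A):
    """Stand-in classifier that separates the two clusters."""
    return (A.sum(axis=1) > 3.0).astype(int)


def always_zero_model(A):
    """Deliberately poor classifier: wrong everywhere in the class-1 cluster."""
    return np.zeros(len(A), dtype=int)


if __name__ == "__main__":
    X, y = make_toy_data()
    print(is_reliable(X, y, threshold_model, np.array([0.0, 0.0])))    # inside a cluster
    print(is_reliable(X, y, threshold_model, np.array([20.0, 20.0])))  # far from all data
    print(is_reliable(X, y, always_zero_model, np.array([3.0, 3.0])))  # poor local fit
```

In this sketch the two checks fail for different reasons: a point far from all training data is rejected by the density check regardless of the model, while a point inside a well-populated region is rejected only when the model performs poorly on its neighbours.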

Original language	English
Article number	103996
Journal	Journal of Biomedical Informatics
Volume	127
DOIs	https://doi.org/10.1016/j.jbi.2022.103996
Publication status	Published - 1 Mar 2022

Keywords

Machine learning trustworthiness
Predictive reliability
Uncertainty

Access to Document

https://doi.org/10.1016/j.jbi.2022.103996

Cite this

@article{d579631fdb0049788126288e5b69dfa3,

title = "Evaluating pointwise reliability of machine learning prediction",

abstract = "Interest in Machine Learning applications to tackle clinical and biological problems is increasing. This is driven by promising results reported in many research papers, the increasing number of AI-based software products, and by the general interest in Artificial Intelligence to solve complex problems. It is therefore of importance to improve the quality of machine learning output and add safeguards to support their adoption. In addition to regulatory and logistical strategies, a crucial aspect is to detect when a Machine Learning model is not able to generalize to new unseen instances, which may originate from a population distant to that of the training population or from an under-represented subpopulation. As a result, the prediction of the machine learning model for these instances may be often wrong, given that the model is applied outside its “reliable” space of work, leading to a decreasing trust of the final users, such as clinicians. For this reason, when a model is deployed in practice, it would be important to advise users when the model's predictions may be unreliable, especially in high-stakes applications, including those in healthcare. Yet, reliability assessment of each machine learning prediction is still poorly addressed. Here, we review approaches that can support the identification of unreliable predictions, we harmonize the notation and terminology of relevant concepts, and we highlight and extend possible interrelationships and overlap among concepts. We then demonstrate, on simulated and real data for ICU in-hospital death prediction, a possible integrative framework for the identification of reliable and unreliable predictions. To do so, our proposed approach implements two complementary principles, namely the density principle and the local fit principle. The density principle verifies that the instance we want to evaluate is similar to the training set. 
The local fit principle verifies that the trained model performs well on training subsets that are more similar to the instance under evaluation. Our work can contribute to consolidating work in machine learning especially in medicine.",

keywords = "Machine learning trustworthiness, Predictive reliability, Uncertainty",

author = "Giovanna Nicora and Miguel Rios and Ameen Abu-Hanna and Riccardo Bellazzi",

note = "Funding Information: We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions. This work was partially supported by the Department of Electrical, Computer and Biomedical Engineering of University of Pavia and by the European Commission as part of the PERISCOPE project (Grant Agreement 101016233), coordinated by the University of Pavia. Publisher Copyright: {\textcopyright} 2022",

year = "2022",

month = mar,

day = "1",

doi = "10.1016/j.jbi.2022.103996",

language = "English",

volume = "127",

journal = "Journal of Biomedical Informatics",

issn = "1532-0464",

publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Evaluating pointwise reliability of machine learning prediction

AU - Nicora, Giovanna

AU - Rios, Miguel

AU - Abu-Hanna, Ameen

AU - Bellazzi, Riccardo

N1 - Funding Information: We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions. This work was partially supported by the Department of Electrical, Computer and Biomedical Engineering of University of Pavia and by the European Commission as part of the PERISCOPE project (Grant Agreement 101016233), coordinated by the University of Pavia. Publisher Copyright: © 2022

PY - 2022/3/1

Y1 - 2022/3/1

N2 - Interest in Machine Learning applications to tackle clinical and biological problems is increasing. This is driven by promising results reported in many research papers, the increasing number of AI-based software products, and by the general interest in Artificial Intelligence to solve complex problems. It is therefore of importance to improve the quality of machine learning output and add safeguards to support their adoption. In addition to regulatory and logistical strategies, a crucial aspect is to detect when a Machine Learning model is not able to generalize to new unseen instances, which may originate from a population distant to that of the training population or from an under-represented subpopulation. As a result, the prediction of the machine learning model for these instances may be often wrong, given that the model is applied outside its “reliable” space of work, leading to a decreasing trust of the final users, such as clinicians. For this reason, when a model is deployed in practice, it would be important to advise users when the model's predictions may be unreliable, especially in high-stakes applications, including those in healthcare. Yet, reliability assessment of each machine learning prediction is still poorly addressed. Here, we review approaches that can support the identification of unreliable predictions, we harmonize the notation and terminology of relevant concepts, and we highlight and extend possible interrelationships and overlap among concepts. We then demonstrate, on simulated and real data for ICU in-hospital death prediction, a possible integrative framework for the identification of reliable and unreliable predictions. To do so, our proposed approach implements two complementary principles, namely the density principle and the local fit principle. The density principle verifies that the instance we want to evaluate is similar to the training set. The local fit principle verifies that the trained model performs well on training subsets that are more similar to the instance under evaluation. Our work can contribute to consolidating work in machine learning especially in medicine.

AB - Interest in Machine Learning applications to tackle clinical and biological problems is increasing. This is driven by promising results reported in many research papers, the increasing number of AI-based software products, and by the general interest in Artificial Intelligence to solve complex problems. It is therefore of importance to improve the quality of machine learning output and add safeguards to support their adoption. In addition to regulatory and logistical strategies, a crucial aspect is to detect when a Machine Learning model is not able to generalize to new unseen instances, which may originate from a population distant to that of the training population or from an under-represented subpopulation. As a result, the prediction of the machine learning model for these instances may be often wrong, given that the model is applied outside its “reliable” space of work, leading to a decreasing trust of the final users, such as clinicians. For this reason, when a model is deployed in practice, it would be important to advise users when the model's predictions may be unreliable, especially in high-stakes applications, including those in healthcare. Yet, reliability assessment of each machine learning prediction is still poorly addressed. Here, we review approaches that can support the identification of unreliable predictions, we harmonize the notation and terminology of relevant concepts, and we highlight and extend possible interrelationships and overlap among concepts. We then demonstrate, on simulated and real data for ICU in-hospital death prediction, a possible integrative framework for the identification of reliable and unreliable predictions. To do so, our proposed approach implements two complementary principles, namely the density principle and the local fit principle. The density principle verifies that the instance we want to evaluate is similar to the training set. The local fit principle verifies that the trained model performs well on training subsets that are more similar to the instance under evaluation. Our work can contribute to consolidating work in machine learning especially in medicine.

KW - Machine learning trustworthiness

KW - Predictive reliability

KW - Uncertainty

UR - http://www.scopus.com/inward/record.url?scp=85123373748&partnerID=8YFLogxK

U2 - https://doi.org/10.1016/j.jbi.2022.103996

DO - 10.1016/j.jbi.2022.103996

M3 - Article

C2 - 35041981

SN - 1532-0464

VL - 127

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

M1 - 103996

ER -
