Diagnosis of acute myeloid leukaemia on microarray gene expression data using categorical gradient boosted trees

Athanasios Angelakis; Ioanna Soulioti; Michael Filippakis

doi:https://doi.org/10.1016/j.heliyon.2023.e20530

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

We define an iterative method for dimensionality reduction using categorical gradient boosted trees and Shapley values, and create four machine learning models that could potentially be used as diagnostic tests for acute myeloid leukaemia (AML). For the final CatBoost model we use a dataset of 2177 individuals, with 16 probe sets and age as features, to classify whether an individual has AML or is healthy. The dataset is multicentric and consists of data from 27 organizations, 25 cities, 15 countries and 4 continents. Under ten-fold cross-validation, the final model achieves specificity 0.9909, sensitivity 0.9985, F1-score 0.9976 and ROC-AUC 0.9962. On an inference dataset, it achieves specificity 0.9909, sensitivity 0.9969, F1-score 0.9969 and ROC-AUC 0.9939. To the best of our knowledge, this is the best performance reported in the literature for the diagnosis of AML, whether on similar data or not. Moreover, no bibliographic reference has so far associated AML or any other type of cancer with the 16 probe sets used as features in our final model.
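For readers less familiar with the metrics quoted in the abstract, the following is a minimal sketch of how specificity, sensitivity and F1-score are conventionally derived from confusion-matrix counts, with AML treated as the positive class. The function name `binary_metrics` and the counts in the example are hypothetical illustrations; they are not taken from the paper and do not reproduce its results.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Derive standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positive (e.g. AML) cases classified correctly/incorrectly.
    tn/fp: negative (e.g. healthy) cases classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)    # positive predictive value
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

ROC-AUC is not covered by this sketch because it depends on the ranking of predicted scores rather than on a single confusion matrix.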

Original language	English
Article number	e20530
Journal	Heliyon
Volume	9
Issue number	10
DOIs	https://doi.org/10.1016/j.heliyon.2023.e20530
Publication status	Published - 1 Oct 2023

Cite this

@article{fa5ddf73b04d4b128f26533259fa6643,

title = "Diagnosis of acute myeloid leukaemia on microarray gene expression data using categorical gradient boosted trees",

abstract = "We define an iterative method for dimensionality reduction using categorical gradient boosted trees and Shapley values, and create four machine learning models that could potentially be used as diagnostic tests for acute myeloid leukaemia (AML). For the final CatBoost model we use a dataset of 2177 individuals, with 16 probe sets and age as features, to classify whether an individual has AML or is healthy. The dataset is multicentric and consists of data from 27 organizations, 25 cities, 15 countries and 4 continents. Under ten-fold cross-validation, the final model achieves specificity 0.9909, sensitivity 0.9985, F1-score 0.9976 and ROC-AUC 0.9962. On an inference dataset, it achieves specificity 0.9909, sensitivity 0.9969, F1-score 0.9969 and ROC-AUC 0.9939. To the best of our knowledge, this is the best performance reported in the literature for the diagnosis of AML, whether on similar data or not. Moreover, no bibliographic reference has so far associated AML or any other type of cancer with the 16 probe sets used as features in our final model.",

author = "Athanasios Angelakis and Ioanna Soulioti and Michael Filippakis",

note = "Publisher Copyright: {\textcopyright} 2023 The Author(s)",

year = "2023",

month = oct,

day = "1",

doi = "10.1016/j.heliyon.2023.e20530",

language = "English",

volume = "9",

journal = "Heliyon",

issn = "2405-8440",

publisher = "Elsevier BV",

number = "10",

}

TY - JOUR

T1 - Diagnosis of acute myeloid leukaemia on microarray gene expression data using categorical gradient boosted trees

AU - Angelakis, Athanasios

AU - Soulioti, Ioanna

AU - Filippakis, Michael

PY - 2023/10/1

Y1 - 2023/10/1

N2 - We define an iterative method for dimensionality reduction using categorical gradient boosted trees and Shapley values, and create four machine learning models that could potentially be used as diagnostic tests for acute myeloid leukaemia (AML). For the final CatBoost model we use a dataset of 2177 individuals, with 16 probe sets and age as features, to classify whether an individual has AML or is healthy. The dataset is multicentric and consists of data from 27 organizations, 25 cities, 15 countries and 4 continents. Under ten-fold cross-validation, the final model achieves specificity 0.9909, sensitivity 0.9985, F1-score 0.9976 and ROC-AUC 0.9962. On an inference dataset, it achieves specificity 0.9909, sensitivity 0.9969, F1-score 0.9969 and ROC-AUC 0.9939. To the best of our knowledge, this is the best performance reported in the literature for the diagnosis of AML, whether on similar data or not. Moreover, no bibliographic reference has so far associated AML or any other type of cancer with the 16 probe sets used as features in our final model.

AB - We define an iterative method for dimensionality reduction using categorical gradient boosted trees and Shapley values, and create four machine learning models that could potentially be used as diagnostic tests for acute myeloid leukaemia (AML). For the final CatBoost model we use a dataset of 2177 individuals, with 16 probe sets and age as features, to classify whether an individual has AML or is healthy. The dataset is multicentric and consists of data from 27 organizations, 25 cities, 15 countries and 4 continents. Under ten-fold cross-validation, the final model achieves specificity 0.9909, sensitivity 0.9985, F1-score 0.9976 and ROC-AUC 0.9962. On an inference dataset, it achieves specificity 0.9909, sensitivity 0.9969, F1-score 0.9969 and ROC-AUC 0.9939. To the best of our knowledge, this is the best performance reported in the literature for the diagnosis of AML, whether on similar data or not. Moreover, no bibliographic reference has so far associated AML or any other type of cancer with the 16 probe sets used as features in our final model.

UR - http://www.scopus.com/inward/record.url?scp=85174517934&partnerID=8YFLogxK

U2 - https://doi.org/10.1016/j.heliyon.2023.e20530

DO - https://doi.org/10.1016/j.heliyon.2023.e20530

M3 - Article

C2 - 37860531

SN - 2405-8440

VL - 9

JO - Heliyon

JF - Heliyon

IS - 10

M1 - e20530

ER -
