Identifying subtypes of heart failure from three electronic health record sources with machine learning: an external, prognostic, and genetic validation study

Amitava Banerjee; Ashkan Dashtban; Suliang Chen; Laura Pasea; Johan H. Thygesen; Ghazaleh Fatemifar; Benoit Tyl; Tomasz Dyszynski; Folkert W. Asselbergs; Lars H. Lund; Tom Lumbers; Spiros Denaxas; Harry Hemingway

doi:10.1016/S2589-7500(23)00065-1

Research output: Contribution to journal › Article › Academic › peer-review

8 Citations (Scopus)

Abstract

Background: Machine learning has been used to analyse heart failure subtypes, but not across large, distinct, population-based datasets, across the whole spectrum of causes and presentations, or with clinical and non-clinical validation by different machine learning methods. Using our published framework, we aimed to discover heart failure subtypes and validate them upon population representative data.

Methods: In this external, prognostic, and genetic validation study we analysed individuals aged 30 years or older with incident heart failure from two population-based databases in the UK (Clinical Practice Research Datalink [CPRD] and The Health Improvement Network [THIN]) from 1998 to 2018. Pre-heart failure and post-heart failure factors (n=645) included demographic information, history, examination, blood laboratory values, and medications. We identified subtypes using four unsupervised machine learning methods (K-means, hierarchical, K-Medoids, and mixture model clustering) with 87 of 645 factors in each dataset. We evaluated subtypes for (1) external validity (across datasets); (2) prognostic validity (predictive accuracy for 1-year mortality); and (3) genetic validity (UK Biobank), association with polygenic risk score (PRS) for heart failure-related traits (n=11), and single nucleotide polymorphisms (n=12).

Findings: We included 188 800, 124 262, and 9573 individuals with incident heart failure from CPRD, THIN, and UK Biobank, respectively, between Jan 1, 1998, and Jan 1, 2018. After identifying five clusters, we labelled heart failure subtypes as (1) early onset, (2) late onset, (3) atrial fibrillation related, (4) metabolic, and (5) cardiometabolic. In the external validity analysis, subtypes were similar across datasets (c-statistics: THIN model in CPRD ranged from 0·79 [subtype 3] to 0·94 [subtype 1], and CPRD model in THIN ranged from 0·79 [subtype 1] to 0·92 [subtypes 2 and 5]). In the prognostic validity analysis, 1-year all-cause mortality after heart failure diagnosis (subtype 1 0·20 [95% CI 0·14–0·25], subtype 2 0·46 [0·43–0·49], subtype 3 0·61 [0·57–0·64], subtype 4 0·11 [0·07–0·16], and subtype 5 0·37 [0·32–0·41]) differed across subtypes in CPRD and THIN data, as did risk of non-fatal cardiovascular diseases and all-cause hospitalisation. In the genetic validity analysis the atrial fibrillation-related subtype showed associations with the related PRS. Late onset and cardiometabolic subtypes were the most similar and strongly associated with PRS for hypertension, myocardial infarction, and obesity (p<0·0009). We developed a prototype app for routine clinical use, which could enable evaluation of effectiveness and cost-effectiveness.

Interpretation: Across four methods and three datasets, including genetic data, in the largest study of incident heart failure to date, we identified five machine learning-informed subtypes, which might inform aetiological research, clinical risk prediction, and the design of heart failure trials.

Funding: European Union Innovative Medicines Initiative-2.
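The abstract reports c-statistics for the external validity analysis without saying how they were computed. As a reading aid only, the following minimal Python sketch shows the standard pairwise definition of the c-statistic for a binary outcome: the proportion of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The function name `c_statistic` and the example values are hypothetical and are not taken from the study.

```python
from itertools import product


def c_statistic(scores, outcomes):
    """Pairwise concordance for a binary outcome (1 = positive, 0 = negative).

    Illustrative definition only; not the authors' implementation.
    """
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    concordant = 0.0
    for p, n in product(pos, neg):
        if p > n:
            concordant += 1.0   # positive case scored higher: concordant pair
        elif p == n:
            concordant += 0.5   # tied scores count as half
    return concordant / (len(pos) * len(neg))


# Made-up scores and labels, purely for illustration (not study data).
example_scores = [0.9, 0.4, 0.35, 0.6, 0.2, 0.5]
example_labels = [1, 1, 0, 1, 0, 0]
print(round(c_statistic(example_scores, example_labels), 3))
```

A value of 0·5 corresponds to chance-level discrimination and 1·0 to perfect separation, which is the scale on which the reported range of 0·79 to 0·94 should be read.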

Original language	English
Pages (from-to)	e370-e379
Journal	The Lancet Digital Health
Volume	5
Issue number	6
DOIs	https://doi.org/10.1016/S2589-7500(23)00065-1
Publication status	Published - 1 Jun 2023

Access to Document

https://doi.org/10.1016/S2589-7500(23)00065-1

Cite this

Banerjee, A., Dashtban, A., Chen, S., Pasea, L., Thygesen, J. H., Fatemifar, G., Tyl, B., Dyszynski, T., Asselbergs, F. W., Lund, L. H., Lumbers, T., Denaxas, S., & Hemingway, H. (2023). Identifying subtypes of heart failure from three electronic health record sources with machine learning: an external, prognostic, and genetic validation study. The Lancet Digital Health, 5(6), e370-e379. https://doi.org/10.1016/S2589-7500(23)00065-1

@article{53d0d3a08a614e3c9909e12667c20af3,

title = "Identifying subtypes of heart failure from three electronic health record sources with machine learning: an external, prognostic, and genetic validation study",

abstract = "Background: Machine learning has been used to analyse heart failure subtypes, but not across large, distinct, population-based datasets, across the whole spectrum of causes and presentations, or with clinical and non-clinical validation by different machine learning methods. Using our published framework, we aimed to discover heart failure subtypes and validate them upon population representative data. Methods: In this external, prognostic, and genetic validation study we analysed individuals aged 30 years or older with incident heart failure from two population-based databases in the UK (Clinical Practice Research Datalink [CPRD] and The Health Improvement Network [THIN]) from 1998 to 2018. Pre-heart failure and post-heart failure factors (n=645) included demographic information, history, examination, blood laboratory values, and medications. We identified subtypes using four unsupervised machine learning methods (K-means, hierarchical, K-Medoids, and mixture model clustering) with 87 of 645 factors in each dataset. We evaluated subtypes for (1) external validity (across datasets); (2) prognostic validity (predictive accuracy for 1-year mortality); and (3) genetic validity (UK Biobank), association with polygenic risk score (PRS) for heart failure-related traits (n=11), and single nucleotide polymorphisms (n=12). Findings: We included 188 800, 124 262, and 9573 individuals with incident heart failure from CPRD, THIN, and UK Biobank, respectively, between Jan 1, 1998, and Jan 1, 2018. After identifying five clusters, we labelled heart failure subtypes as (1) early onset, (2) late onset, (3) atrial fibrillation related, (4) metabolic, and (5) cardiometabolic. In the external validity analysis, subtypes were similar across datasets (c-statistics: THIN model in CPRD ranged from 0·79 [subtype 3] to 0·94 [subtype 1], and CPRD model in THIN ranged from 0·79 [subtype 1] to 0·92 [subtypes 2 and 5]). 
In the prognostic validity analysis, 1-year all-cause mortality after heart failure diagnosis (subtype 1 0·20 [95% CI 0·14–0·25], subtype 2 0·46 [0·43–0·49], subtype 3 0·61 [0·57–0·64], subtype 4 0·11 [0·07–0·16], and subtype 5 0·37 [0·32–0·41]) differed across subtypes in CPRD and THIN data, as did risk of non-fatal cardiovascular diseases and all-cause hospitalisation. In the genetic validity analysis the atrial fibrillation-related subtype showed associations with the related PRS. Late onset and cardiometabolic subtypes were the most similar and strongly associated with PRS for hypertension, myocardial infarction, and obesity (p<0·0009). We developed a prototype app for routine clinical use, which could enable evaluation of effectiveness and cost-effectiveness. Interpretation: Across four methods and three datasets, including genetic data, in the largest study of incident heart failure to date, we identified five machine learning-informed subtypes, which might inform aetiological research, clinical risk prediction, and the design of heart failure trials. Funding: European Union Innovative Medicines Initiative-2.",

author = "Amitava Banerjee and Ashkan Dashtban and Suliang Chen and Laura Pasea and Thygesen, {Johan H.} and Ghazaleh Fatemifar and Benoit Tyl and Tomasz Dyszynski and Asselbergs, {Folkert W.} and Lund, {Lars H.} and Tom Lumbers and Spiros Denaxas and Harry Hemingway",

note = "Funding Information: All authors are supported by the BigData@Heart Consortium, funded by the European Union Innovative Medicines Initiative-2 joint undertaking under grant agreement number 116074. This joint undertaking receives support from the EU's Horizon 2020 research and innovation programme and the European Federation of Pharmaceutical Industries and Associations. Publisher Copyright: {\textcopyright} 2023 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license",

year = "2023",

month = jun,

day = "1",

doi = "10.1016/S2589-7500(23)00065-1",

language = "English",

volume = "5",

pages = "e370--e379",

journal = "The Lancet Digital Health",

issn = "2589-7500",

publisher = "Elsevier Ltd",

number = "6",

}

Banerjee, A, Dashtban, A, Chen, S, Pasea, L, Thygesen, JH, Fatemifar, G, Tyl, B, Dyszynski, T, Asselbergs, FW, Lund, LH, Lumbers, T, Denaxas, S & Hemingway, H 2023, 'Identifying subtypes of heart failure from three electronic health record sources with machine learning: an external, prognostic, and genetic validation study', The Lancet Digital Health, vol. 5, no. 6, pp. e370-e379. https://doi.org/10.1016/S2589-7500(23)00065-1

Identifying subtypes of heart failure from three electronic health record sources with machine learning: an external, prognostic, and genetic validation study. / Banerjee, Amitava; Dashtban, Ashkan; Chen, Suliang et al.
In: The Lancet Digital Health, Vol. 5, No. 6, 01.06.2023, p. e370-e379.

TY - JOUR

T1 - Identifying subtypes of heart failure from three electronic health record sources with machine learning

T2 - an external, prognostic, and genetic validation study

AU - Banerjee, Amitava

AU - Dashtban, Ashkan

AU - Chen, Suliang

AU - Pasea, Laura

AU - Thygesen, Johan H.

AU - Fatemifar, Ghazaleh

AU - Tyl, Benoit

AU - Dyszynski, Tomasz

AU - Asselbergs, Folkert W.

AU - Lund, Lars H.

AU - Lumbers, Tom

AU - Denaxas, Spiros

AU - Hemingway, Harry

N1 - Funding Information: All authors are supported by the BigData@Heart Consortium, funded by the European Union Innovative Medicines Initiative-2 joint undertaking under grant agreement number 116074. This joint undertaking receives support from the EU's Horizon 2020 research and innovation programme and the European Federation of Pharmaceutical Industries and Associations. Publisher Copyright: © 2023 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license

PY - 2023/6/1

Y1 - 2023/6/1

AB - Background: Machine learning has been used to analyse heart failure subtypes, but not across large, distinct, population-based datasets, across the whole spectrum of causes and presentations, or with clinical and non-clinical validation by different machine learning methods. Using our published framework, we aimed to discover heart failure subtypes and validate them upon population representative data. Methods: In this external, prognostic, and genetic validation study we analysed individuals aged 30 years or older with incident heart failure from two population-based databases in the UK (Clinical Practice Research Datalink [CPRD] and The Health Improvement Network [THIN]) from 1998 to 2018. Pre-heart failure and post-heart failure factors (n=645) included demographic information, history, examination, blood laboratory values, and medications. We identified subtypes using four unsupervised machine learning methods (K-means, hierarchical, K-Medoids, and mixture model clustering) with 87 of 645 factors in each dataset. We evaluated subtypes for (1) external validity (across datasets); (2) prognostic validity (predictive accuracy for 1-year mortality); and (3) genetic validity (UK Biobank), association with polygenic risk score (PRS) for heart failure-related traits (n=11), and single nucleotide polymorphisms (n=12). Findings: We included 188 800, 124 262, and 9573 individuals with incident heart failure from CPRD, THIN, and UK Biobank, respectively, between Jan 1, 1998, and Jan 1, 2018. After identifying five clusters, we labelled heart failure subtypes as (1) early onset, (2) late onset, (3) atrial fibrillation related, (4) metabolic, and (5) cardiometabolic. In the external validity analysis, subtypes were similar across datasets (c-statistics: THIN model in CPRD ranged from 0·79 [subtype 3] to 0·94 [subtype 1], and CPRD model in THIN ranged from 0·79 [subtype 1] to 0·92 [subtypes 2 and 5]). 
In the prognostic validity analysis, 1-year all-cause mortality after heart failure diagnosis (subtype 1 0·20 [95% CI 0·14–0·25], subtype 2 0·46 [0·43–0·49], subtype 3 0·61 [0·57–0·64], subtype 4 0·11 [0·07–0·16], and subtype 5 0·37 [0·32–0·41]) differed across subtypes in CPRD and THIN data, as did risk of non-fatal cardiovascular diseases and all-cause hospitalisation. In the genetic validity analysis the atrial fibrillation-related subtype showed associations with the related PRS. Late onset and cardiometabolic subtypes were the most similar and strongly associated with PRS for hypertension, myocardial infarction, and obesity (p<0·0009). We developed a prototype app for routine clinical use, which could enable evaluation of effectiveness and cost-effectiveness. Interpretation: Across four methods and three datasets, including genetic data, in the largest study of incident heart failure to date, we identified five machine learning-informed subtypes, which might inform aetiological research, clinical risk prediction, and the design of heart failure trials. Funding: European Union Innovative Medicines Initiative-2.

UR - http://www.scopus.com/inward/record.url?scp=85160019705&partnerID=8YFLogxK

U2 - https://doi.org/10.1016/S2589-7500(23)00065-1

DO - 10.1016/S2589-7500(23)00065-1

M3 - Article

C2 - 37236697

SN - 2589-7500

VL - 5

SP - e370

EP - e379

JO - The Lancet Digital Health

JF - The Lancet Digital Health

IS - 6

ER -
