A Natural Language Processing Model for COVID-19 Detection Based on Dutch General Practice Electronic Health Records by Using Bidirectional Encoder Representations From Transformers: Development and Validation Study

Maarten Homburg; Eline Meijer; Matthijs Berends; Thijmen Kupers; Tim Olde Hartman; Jean Muris; Evelien de Schepper; Premysl Velek; Jeroen Kuiper; Marjolein Berger; Lilian Peters

doi: 10.2196/49944

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Background: Natural language processing (NLP) models such as bidirectional encoder representations from transformers (BERT) hold promise in revolutionizing disease identification from electronic health records (EHRs) by potentially enhancing efficiency and accuracy. However, their practical application in practice settings demands a comprehensive and multidisciplinary approach to development and validation. The COVID-19 pandemic highlighted challenges in disease identification due to limited testing availability and challenges in handling unstructured data. In the Netherlands, where general practitioners (GPs) serve as the first point of contact for health care, EHRs generated by these primary care providers contain a wealth of potentially valuable information. Nonetheless, the unstructured nature of free-text entries in EHRs poses challenges in identifying trends, detecting disease outbreaks, or accurately pinpointing COVID-19 cases.

Objective: This study aims to develop and validate a BERT model for detecting COVID-19 consultations in general practice EHRs in the Netherlands.

Methods: The BERT model was initially pretrained on Dutch language data and fine-tuned using a comprehensive EHR data set comprising confirmed COVID-19 GP consultations and non–COVID-19–related consultations. The data set was partitioned into a training and development set, and the model’s performance was evaluated on an independent test set that served as the primary measure of its effectiveness in COVID-19 detection. To validate the final model, its performance was assessed through 3 approaches. First, external validation was applied on an EHR data set from a different geographic region in the Netherlands. Second, validation was conducted using results of polymerase chain reaction (PCR) test data obtained from municipal health services. Lastly, correlation between predicted outcomes and COVID-19–related hospitalizations in the Netherlands was assessed, encompassing the period around the outbreak of the pandemic in the Netherlands, that is, the period before widespread testing.

Results: The model development used 300,359 GP consultations. We developed a highly accurate model for COVID-19 consultations (accuracy 0.97, F₁-score 0.90, precision 0.85, recall 0.85, specificity 0.99). External validations showed comparable high performance. Validation on PCR test data showed high recall but low precision and specificity. Validation using hospital data showed significant correlation between COVID-19 predictions of the model and COVID-19–related hospitalizations (F₁-score 96.8; P<.001; R²=0.69). Most importantly, the model was able to predict COVID-19 cases weeks before the first confirmed case in the Netherlands.

Conclusions: The developed BERT model was able to accurately identify COVID-19 cases among GP consultations even preceding confirmed cases. The validated efficacy of our BERT model highlights the potential of NLP models to identify disease outbreaks early, exemplifying the power of multidisciplinary efforts in harnessing technology for disease identification. Moreover, the implications of this study extend beyond COVID-19 and offer a blueprint for the early recognition of various illnesses, revealing that such models could revolutionize disease surveillance.
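
The performance measures named in the Results (accuracy, F₁-score, precision, recall, specificity) can all be derived from a binary confusion matrix. The following is a minimal Python sketch of those definitions, not the authors' evaluation code; the `ConfusionCounts` class, the `classification_metrics` function, and the example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts for a binary classifier (hypothetical container, not from the study)."""
    tp: int  # COVID-19 consultations correctly flagged
    fp: int  # non-COVID-19 consultations incorrectly flagged
    tn: int  # non-COVID-19 consultations correctly left unflagged
    fn: int  # COVID-19 consultations that were missed


def classification_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the standard metrics; assumes every denominator is nonzero."""
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)          # also called sensitivity
    specificity = c.tn / (c.tn + c.fp)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
    }


# Illustrative counts only (assumption): 1,000 consultations, 100 of them positive.
counts = ConfusionCounts(tp=80, fp=20, tn=880, fn=20)
metrics = classification_metrics(counts)
```

With imbalanced classes such as these, accuracy and specificity can be high even when precision and recall are lower, which is why the abstract reports all five measures together.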

Original language	English
Article number	e49944
Journal	Journal of Medical Internet Research
Volume	25
Issue number	1
DOIs	https://doi.org/10.2196/49944
Publication status	Published - 1 Jan 2023

Cite this

Homburg, M., Meijer, E., Berends, M., Kupers, T., Hartman, T. O., Muris, J., de Schepper, E., Velek, P., Kuiper, J., Berger, M., & Peters, L. (2023). A Natural Language Processing Model for COVID-19 Detection Based on Dutch General Practice Electronic Health Records by Using Bidirectional Encoder Representations From Transformers: Development and Validation Study. Journal of Medical Internet Research, 25(1), Article e49944. https://doi.org/10.2196/49944

@article{8b35cf490ce54b2d9cf983e128c721ce,

title = "A Natural Language Processing Model for COVID-19 Detection Based on Dutch General Practice Electronic Health Records by Using Bidirectional Encoder Representations From Transformers: Development and Validation Study",

abstract = "Background: Natural language processing (NLP) models such as bidirectional encoder representations from transformers (BERT) hold promise in revolutionizing disease identification from electronic health records (EHRs) by potentially enhancing efficiency and accuracy. However, their practical application in practice settings demands a comprehensive and multidisciplinary approach to development and validation. The COVID-19 pandemic highlighted challenges in disease identification due to limited testing availability and challenges in handling unstructured data. In the Netherlands, where general practitioners (GPs) serve as the first point of contact for health care, EHRs generated by these primary care providers contain a wealth of potentially valuable information. Nonetheless, the unstructured nature of free-text entries in EHRs poses challenges in identifying trends, detecting disease outbreaks, or accurately pinpointing COVID-19 cases. Objective: This study aims to develop and validate a BERT model for detecting COVID-19 consultations in general practice EHRs in the Netherlands. Methods: The BERT model was initially pretrained on Dutch language data and fine-tuned using a comprehensive EHR data set comprising confirmed COVID-19 GP consultations and non–COVID-19–related consultations. The data set was partitioned into a training and development set, and the model{\textquoteright}s performance was evaluated on an independent test set that served as the primary measure of its effectiveness in COVID-19 detection. To validate the final model, its performance was assessed through 3 approaches. First, external validation was applied on an EHR data set from a different geographic region in the Netherlands. Second, validation was conducted using results of polymerase chain reaction (PCR) test data obtained from municipal health services. 
Lastly, correlation between predicted outcomes and COVID-19–related hospitalizations in the Netherlands was assessed, encompassing the period around the outbreak of the pandemic in the Netherlands, that is, the period before widespread testing. Results: The model development used 300,359 GP consultations. We developed a highly accurate model for COVID-19 consultations (accuracy 0.97, $F_1$-score 0.90, precision 0.85, recall 0.85, specificity 0.99). External validations showed comparable high performance. Validation on PCR test data showed high recall but low precision and specificity. Validation using hospital data showed significant correlation between COVID-19 predictions of the model and COVID-19–related hospitalizations ($F_1$-score 96.8; P<.001; $R^2$=0.69). Most importantly, the model was able to predict COVID-19 cases weeks before the first confirmed case in the Netherlands. Conclusions: The developed BERT model was able to accurately identify COVID-19 cases among GP consultations even preceding confirmed cases. The validated efficacy of our BERT model highlights the potential of NLP models to identify disease outbreaks early, exemplifying the power of multidisciplinary efforts in harnessing technology for disease identification. Moreover, the implications of this study extend beyond COVID-19 and offer a blueprint for the early recognition of various illnesses, revealing that such models could revolutionize disease surveillance.",

author = "Maarten Homburg and Eline Meijer and Matthijs Berends and Thijmen Kupers and Hartman, {Tim Olde} and Jean Muris and {de Schepper}, Evelien and Premysl Velek and Jeroen Kuiper and Marjolein Berger and Lilian Peters",

note = "Funding Information: This research was conducted using the central databases of Academisch Huisartsen Ontwikkel Netwerk, Family Medicine Network, Research Network Family Medicine, and Rijnmond. We are grateful for the support provided by Feikje Groenhof and Ronald Wilmink (Academisch Huisartsen Ontwikkel Netwerk), Jose Donkers and Hans Peters (Family Medicine Network), Donovan de Jonge (Research Network Family Medicine), and Angeline Bosman (Rijnmond Central) in these networks. We also want to thank Robin Twickler and Karina Sulim, data scientists from the University Medical Center Groningen, who are affiliated with the Department of General Practice and Elderly Care Medicine and Data Science in Health Care. Finally, we thank Dr Roberts Sykes at Doctored Ltd for providing English language editing of the final drafts of this manuscript. The Netherlands Organization for Health Research and Development (ZonMW) funded this study: “Changes in the Use and Organization of Care in General Practices and Out-of-hours Services: Lessons Learned from the COVID-19 Pandemic” (10430022010006) and the “General Practice Research Infrastructure Pandemic Preparedness Program” (GRIP3) (10430112110001). The funder played no role in the study design, data collection, data analysis and interpretation, or writing of this manuscript. Publisher Copyright: {\textcopyright} Maarten Homburg, Eline Meijer, Matthijs Berends, Thijmen Kupers, Tim Olde Hartman, Jean Muris, Evelien de Schepper, Premysl Velek, Jeroen Kuiper, Marjolein Berger, Lilian Peters. Originally published in the Journal of Medical Internet Research.",

year = "2023",

month = jan,

day = "1",

doi = "10.2196/49944",

language = "English",

volume = "25",

journal = "Journal of Medical Internet Research",

issn = "1438-8871",

publisher = "Journal of Medical Internet Research",

number = "1",

}

Homburg, M, Meijer, E, Berends, M, Kupers, T, Hartman, TO, Muris, J, de Schepper, E, Velek, P, Kuiper, J, Berger, M & Peters, L 2023, 'A Natural Language Processing Model for COVID-19 Detection Based on Dutch General Practice Electronic Health Records by Using Bidirectional Encoder Representations From Transformers: Development and Validation Study', Journal of Medical Internet Research, vol. 25, no. 1, e49944. https://doi.org/10.2196/49944

A Natural Language Processing Model for COVID-19 Detection Based on Dutch General Practice Electronic Health Records by Using Bidirectional Encoder Representations From Transformers: Development and Validation Study. / Homburg, Maarten; Meijer, Eline; Berends, Matthijs et al.
In: Journal of Medical Internet Research, Vol. 25, No. 1, e49944, 01.01.2023.

TY - JOUR

T1 - A Natural Language Processing Model for COVID-19 Detection Based on Dutch General Practice Electronic Health Records by Using Bidirectional Encoder Representations From Transformers

T2 - Development and Validation Study

AU - Homburg, Maarten

AU - Meijer, Eline

AU - Berends, Matthijs

AU - Kupers, Thijmen

AU - Hartman, Tim Olde

AU - Muris, Jean

AU - de Schepper, Evelien

AU - Velek, Premysl

AU - Kuiper, Jeroen

AU - Berger, Marjolein

AU - Peters, Lilian

N1 - Funding Information: This research was conducted using the central databases of Academisch Huisartsen Ontwikkel Netwerk, Family Medicine Network, Research Network Family Medicine, and Rijnmond. We are grateful for the support provided by Feikje Groenhof and Ronald Wilmink (Academisch Huisartsen Ontwikkel Netwerk), Jose Donkers and Hans Peters (Family Medicine Network), Donovan de Jonge (Research Network Family Medicine), and Angeline Bosman (Rijnmond Central) in these networks. We also want to thank Robin Twickler and Karina Sulim, data scientists from the University Medical Center Groningen, who are affiliated with the Department of General Practice and Elderly Care Medicine and Data Science in Health Care. Finally, we thank Dr Roberts Sykes at Doctored Ltd for providing English language editing of the final drafts of this manuscript. The Netherlands Organization for Health Research and Development (ZonMW) funded this study: “Changes in the Use and Organization of Care in General Practices and Out-of-hours Services: Lessons Learned from the COVID-19 Pandemic” (10430022010006) and the “General Practice Research Infrastructure Pandemic Preparedness Program” (GRIP3) (10430112110001). The funder played no role in the study design, data collection, data analysis and interpretation, or writing of this manuscript. Publisher Copyright: © Maarten Homburg, Eline Meijer, Matthijs Berends, Thijmen Kupers, Tim Olde Hartman, Jean Muris, Evelien de Schepper, Premysl Velek, Jeroen Kuiper, Marjolein Berger, Lilian Peters. Originally published in the Journal of Medical Internet Research.

PY - 2023/1/1

Y1 - 2023/1/1

AB - Background: Natural language processing (NLP) models such as bidirectional encoder representations from transformers (BERT) hold promise in revolutionizing disease identification from electronic health records (EHRs) by potentially enhancing efficiency and accuracy. However, their practical application in practice settings demands a comprehensive and multidisciplinary approach to development and validation. The COVID-19 pandemic highlighted challenges in disease identification due to limited testing availability and challenges in handling unstructured data. In the Netherlands, where general practitioners (GPs) serve as the first point of contact for health care, EHRs generated by these primary care providers contain a wealth of potentially valuable information. Nonetheless, the unstructured nature of free-text entries in EHRs poses challenges in identifying trends, detecting disease outbreaks, or accurately pinpointing COVID-19 cases. Objective: This study aims to develop and validate a BERT model for detecting COVID-19 consultations in general practice EHRs in the Netherlands. Methods: The BERT model was initially pretrained on Dutch language data and fine-tuned using a comprehensive EHR data set comprising confirmed COVID-19 GP consultations and non–COVID-19–related consultations. The data set was partitioned into a training and development set, and the model’s performance was evaluated on an independent test set that served as the primary measure of its effectiveness in COVID-19 detection. To validate the final model, its performance was assessed through 3 approaches. First, external validation was applied on an EHR data set from a different geographic region in the Netherlands. Second, validation was conducted using results of polymerase chain reaction (PCR) test data obtained from municipal health services. 
Lastly, correlation between predicted outcomes and COVID-19–related hospitalizations in the Netherlands was assessed, encompassing the period around the outbreak of the pandemic in the Netherlands, that is, the period before widespread testing. Results: The model development used 300,359 GP consultations. We developed a highly accurate model for COVID-19 consultations (accuracy 0.97, F₁-score 0.90, precision 0.85, recall 0.85, specificity 0.99). External validations showed comparable high performance. Validation on PCR test data showed high recall but low precision and specificity. Validation using hospital data showed significant correlation between COVID-19 predictions of the model and COVID-19–related hospitalizations (F₁-score 96.8; P<.001; R²=0.69). Most importantly, the model was able to predict COVID-19 cases weeks before the first confirmed case in the Netherlands. Conclusions: The developed BERT model was able to accurately identify COVID-19 cases among GP consultations even preceding confirmed cases. The validated efficacy of our BERT model highlights the potential of NLP models to identify disease outbreaks early, exemplifying the power of multidisciplinary efforts in harnessing technology for disease identification. Moreover, the implications of this study extend beyond COVID-19 and offer a blueprint for the early recognition of various illnesses, revealing that such models could revolutionize disease surveillance.

UR - http://www.scopus.com/inward/record.url?scp=85175586952&partnerID=8YFLogxK

U2 - https://doi.org/10.2196/49944

DO - 10.2196/49944

M3 - Article

C2 - 37792444

SN - 1438-8871

VL - 25

JO - Journal of Medical Internet Research

JF - Journal of Medical Internet Research

IS - 1

M1 - e49944

ER -

Homburg M, Meijer E, Berends M, Kupers T, Hartman TO, Muris J et al. A Natural Language Processing Model for COVID-19 Detection Based on Dutch General Practice Electronic Health Records by Using Bidirectional Encoder Representations From Transformers: Development and Validation Study. Journal of Medical Internet Research. 2023 Jan 1;25(1):e49944. doi: 10.2196/49944
