Inaccurate recording of routinely collected data items influences identification of COVID-19 patients

Eva S. Klappe; Ronald Cornet; Dave A. Dongelmans; Nicolette F. de Keizer

doi:10.1016/j.ijmedinf.2022.104808

Research output: Contribution to journal › Article › Academic › peer-review

4 Citations (Scopus)

Abstract

Background: During the coronavirus disease 2019 (COVID-19) pandemic, it became apparent that it is difficult to extract standardized Electronic Health Record (EHR) data for secondary purposes such as public health decision-making. Accurate recording of, for example, standardized diagnosis codes and test results is required to identify all COVID-19 patients. This study aimed to investigate whether specific combinations of routinely collected data items for COVID-19 can be used to identify an accurate set of intensive care unit (ICU)-admitted COVID-19 patients.

Methods: The following routinely collected EHR data items for identifying COVID-19 patients were evaluated: positive reverse transcription polymerase chain reaction (RT-PCR) test results; problem list codes for COVID-19 registered by healthcare professionals; and COVID-19 infection labels. COVID-19 codes registered by clinical coders retrospectively after discharge were also evaluated. A gold standard dataset was created by evaluating two datasets of suspected and confirmed COVID-19 patients admitted to the ICU at a Dutch university hospital between February 2020 and December 2020: one manually maintained by intensivists and one extracted from the EHR by a research data management department. Patients were labeled ‘COVID-19’ if their EHR record showed that COVID-19 was diagnosed during or right before an ICU admission. Patients were labeled ‘non-COVID-19’ if the record indicated no COVID-19, exclusion or only suspicion of COVID-19 during or right before an ICU admission, or if COVID-19 was diagnosed and cured during non-ICU episodes of the hospitalization in which an ICU admission took place. Performance was determined for 37 queries including real-time and retrospective data items. We used the F₁ score, which is the harmonic mean of precision and recall. The gold standard dataset was split into one subset including admissions between February and April and one subset including admissions between May and December to determine accuracy differences.

Results: The total dataset consisted of 402 patients: 196 ‘COVID-19’ and 206 ‘non-COVID-19’ patients. F₁ scores of search queries including EHR data items that can be extracted in real time ranged between 0.68 and 0.97; for search queries including the data item that was retrospectively registered by clinical coders, F₁ scores ranged between 0.73 and 0.99. F₁ scores showed no clear pattern in variability between the two time periods.

Conclusions: Our study showed that one cannot rely on individual routinely collected data items, such as coded COVID-19 on problem lists, to identify all COVID-19 patients. If information is not required in real time, medical coding by clinical coders is most reliable. Researchers should be transparent about the methods they use to extract data. To maximize the ability to completely identify all COVID-19 cases, alerts for inconsistent data and policies for standardized data capture could enable reliable data reuse.
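The evaluation described in the abstract, scoring each search query against the gold standard with the F₁ score, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Patient` fields, the helper names `run_query` and `f1_score`, and the five synthetic patients are all assumptions made for illustration, and the two queries shown are not claimed to be among the 37 evaluated in the study.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Patient:
    """Hypothetical record; field names are illustrative, not from the study."""
    patient_id: str
    positive_pcr: bool       # positive RT-PCR test result
    problem_list_code: bool  # COVID-19 code on the problem list
    infection_label: bool    # COVID-19 infection label
    gold_covid: bool         # gold standard label ('COVID-19' vs 'non-COVID-19')


def run_query(patients: Iterable[Patient],
              predicate: Callable[[Patient], bool]) -> set[str]:
    """Return the IDs of patients selected by a query over the data items."""
    return {p.patient_id for p in patients if predicate(p)}


def f1_score(predicted: set[str], actual: set[str]) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Synthetic example data (made up for this sketch).
patients = [
    Patient("p1", True, True, True, True),
    Patient("p2", True, False, False, True),
    Patient("p3", False, True, False, False),   # coded on problem list, not COVID-19
    Patient("p4", False, False, False, False),
    Patient("p5", False, False, True, True),    # only the infection label is present
]
gold = {p.patient_id for p in patients if p.gold_covid}

# Query on a single data item versus a combination of data items.
pcr_only = run_query(patients, lambda p: p.positive_pcr)
any_item = run_query(
    patients,
    lambda p: p.positive_pcr or p.problem_list_code or p.infection_label,
)

f1_pcr_only = f1_score(pcr_only, gold)
f1_any_item = f1_score(any_item, gold)
print(f"PCR only: {f1_pcr_only:.2f}, any item: {f1_any_item:.2f}")
```

In this toy data the single-item query misses a patient (lower recall) while the combined query picks up a false positive (lower precision), which is the trade-off the F₁ score summarizes in one number.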

Original language	English
Article number	104808
Journal	International Journal of Medical Informatics
Volume	165
DOIs	https://doi.org/10.1016/j.ijmedinf.2022.104808
Publication status	Published - Sept 2022

Keywords

COVID-19
Data accuracy
Electronic Health Records
Problem list
Real-time data extraction
Routinely collected data

Access to Document

https://doi.org/10.1016/j.ijmedinf.2022.104808

Cite this

@article{e6f04686cbd74e27b488a0d825e0517f,

title = "Inaccurate recording of routinely collected data items influences identification of COVID-19 patients",

keywords = "COVID-19, Data accuracy, Electronic Health Records, Problem list, Real-time data extraction, Routinely collected data",

author = "Klappe, {Eva S.} and Ronald Cornet and Dongelmans, {Dave A.} and {de Keizer}, {Nicolette F.}",

note = "Funding Information: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. This study was funded by Amsterdam UMC 2019-AMC-JK-7. Amsterdam UMC did not have any role in the study design, collection, analysis, interpretation of the data, writing the report and the decision to submit the report for publication. Publisher Copyright: {\textcopyright} 2022 The Author(s)",

year = "2022",

month = sep,

doi = "10.1016/j.ijmedinf.2022.104808",

language = "English",

volume = "165",

journal = "International Journal of Medical Informatics",

issn = "1386-5056",

publisher = "Elsevier Ireland Ltd",

}

TY - JOUR

T1 - Inaccurate recording of routinely collected data items influences identification of COVID-19 patients

AU - Klappe, Eva S.

AU - Cornet, Ronald

AU - Dongelmans, Dave A.

AU - de Keizer, Nicolette F.

N1 - Funding Information: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. This study was funded by Amsterdam UMC 2019-AMC-JK-7. Amsterdam UMC did not have any role in the study design, collection, analysis, interpretation of the data, writing the report and the decision to submit the report for publication. Publisher Copyright: © 2022 The Author(s)

PY - 2022/9

Y1 - 2022/9

KW - COVID-19

KW - Data accuracy

KW - Electronic Health Records

KW - Problem list

KW - Real-time data extraction

KW - Routinely collected data

UR - http://www.scopus.com/inward/record.url?scp=85132871262&partnerID=8YFLogxK

U2 - https://doi.org/10.1016/j.ijmedinf.2022.104808

DO - https://doi.org/10.1016/j.ijmedinf.2022.104808

M3 - Article

C2 - 35767912

SN - 1386-5056

VL - 165

JO - International Journal of Medical Informatics

JF - International Journal of Medical Informatics

M1 - 104808

ER -
