TY - GEN
T1 - A hybrid data harmonization workflow using word embeddings for the interlinking of heterogeneous cross-domain clinical data structures
AU - Pezoulas, Vasileios C.
AU - Sakellarios, Antonis
AU - Kleber, Marcus
AU - Bosch, Jos A.
AU - van der Laan, Sander W.
AU - Lamers, Femke
AU - Lehtimäki, Terho
AU - März, Winfried
AU - Fotiadis, Dimitrios I.
N1 - Funding Information: This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 848146 (TO_AITION). Publisher Copyright: © 2021 IEEE
PY - 2021
Y1 - 2021
N2 - Retrospective data harmonization is an open issue in healthcare due to the emerging need to interlink data from multiple clinical centers in the absence of standardized data collection protocols. In this work, we present an automated data harmonization workflow which utilizes lexical and semantic analysis based on word embeddings and relational modeling to detect terminologies with a common lexical and conceptual basis. The method is built on top of a knowledge base to enable the interlinking of heterogeneous cross-domain data. A case study is conducted in two clinical domains, namely cardiovascular disease (CVD) and mental disorders, where the proposed method yielded matched terminologies with 85% precision in less execution time than the application of lexical analysis and manual mapping, which yielded 10% lower precision.
KW - Cardiovascular diseases
KW - Data harmonization
KW - Lexical matching
KW - Mental disorders
KW - Semantic matching
UR - http://www.scopus.com/inward/record.url?scp=85125466146&partnerID=8YFLogxK
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85125466146&origin=inward
U2 - https://doi.org/10.1109/BHI50953.2021.9508484
DO - https://doi.org/10.1109/BHI50953.2021.9508484
M3 - Conference contribution
SN - 9781665447706
T3 - BHI 2021 - 2021 IEEE EMBS International Conference on Biomedical and Health Informatics, Proceedings
SP - 88
EP - 91
BT - BHI 2021 - 2021 IEEE EMBS International Conference on Biomedical and Health Informatics, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
CY - Piscataway, NJ
T2 - 2021 IEEE EMBS International Conference on Biomedical and Health Informatics, BHI 2021
Y2 - 27 July 2021 through 30 July 2021
ER -