Biomedical heterogeneous data categorization and schema mapping toward data integration

Priya Deshpande; Alexander Rasin; Roselyne Tchoua; Jacob Furst; Daniela Raicu; Michiel Schinkel; Hari Trivedi; Sameer Antani

doi:10.3389/fdata.2023.1173038

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Data integration is a well-motivated problem in the clinical data science domain. The availability of patient data, reference clinical cases, and datasets for research has the potential to advance the healthcare industry. However, the unstructured (text, audio, or video data) and heterogeneous nature of the data, the variety of data standards and formats, and patient privacy constraints make data interoperability and integration a challenge. The clinical text is further categorized into different semantic groups and may be stored in different files and formats. Even the same organization may store cases in different data structures, making data integration more challenging. With such inherent complexity, domain experts and domain knowledge are often necessary to perform data integration. However, expert human labor is time- and cost-prohibitive. To overcome the variability in the structure, format, and content of the different data sources, we map the text into common categories and compute similarity within those. In this paper, we present a method to categorize and merge clinical data by considering the underlying semantics behind the cases and use reference information about the cases to perform data integration. Evaluation shows that we were able to merge 88% of clinical data from five different sources.

Original language	English
Article number	1173038
Journal	Frontiers in Big Data
Volume	6
DOIs	https://doi.org/10.3389/fdata.2023.1173038
Publication status	Published - 2023

Keywords

data categorization
data integration
datasets
heterogeneous data
schema mapping
semantic similarity
unstructured data

Access to Document

https://doi.org/10.3389/fdata.2023.1173038

Cite this

@article{7c81e9004ef940958ba7b29c2257e1f6,

title = "Biomedical heterogeneous data categorization and schema mapping toward data integration",

abstract = "Data integration is a well-motivated problem in the clinical data science domain. Availability of patient data, reference clinical cases, and datasets for research have the potential to advance the healthcare industry. However, the unstructured (text, audio, or video data) and heterogeneous nature of the data, the variety of data standards and formats, and patient privacy constraint make data interoperability and integration a challenge. The clinical text is further categorized into different semantic groups and may be stored in different files and formats. Even the same organization may store cases in different data structures, making data integration more challenging. With such inherent complexity, domain experts and domain knowledge are often necessary to perform data integration. However, expert human labor is time and cost prohibitive. To overcome the variability in the structure, format, and content of the different data sources, we map the text into common categories and compute similarity within those. In this paper, we present a method to categorize and merge clinical data by considering the underlying semantics behind the cases and use reference information about the cases to perform data integration. Evaluation shows that we were able to merge 88% of clinical data from five different sources.",

keywords = "data categorization, data integration, datasets, heterogeneous data, schema mapping, semantic similarity, unstructured data",

author = "Priya Deshpande and Alexander Rasin and Roselyne Tchoua and Jacob Furst and Daniela Raicu and Michiel Schinkel and Hari Trivedi and Sameer Antani",

note = "Funding Information: This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM), and Lister Hill National Center for Biomedical Communications (LHNCBC). Publisher Copyright: Copyright {\textcopyright} 2023 Deshpande, Rasin, Tchoua, Furst, Raicu, Schinkel, Trivedi and Antani.",

year = "2023",

doi = "10.3389/fdata.2023.1173038",

language = "English",

volume = "6",

journal = "Frontiers in Big Data",

issn = "2624-909X",

publisher = "Frontiers Media SA",

}

TY - JOUR

T1 - Biomedical heterogeneous data categorization and schema mapping toward data integration

AU - Deshpande, Priya

AU - Rasin, Alexander

AU - Tchoua, Roselyne

AU - Furst, Jacob

AU - Raicu, Daniela

AU - Schinkel, Michiel

AU - Trivedi, Hari

AU - Antani, Sameer

N1 - Funding Information: This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM), and Lister Hill National Center for Biomedical Communications (LHNCBC). Publisher Copyright: Copyright © 2023 Deshpande, Rasin, Tchoua, Furst, Raicu, Schinkel, Trivedi and Antani.

PY - 2023

Y1 - 2023

N2 - Data integration is a well-motivated problem in the clinical data science domain. Availability of patient data, reference clinical cases, and datasets for research have the potential to advance the healthcare industry. However, the unstructured (text, audio, or video data) and heterogeneous nature of the data, the variety of data standards and formats, and patient privacy constraint make data interoperability and integration a challenge. The clinical text is further categorized into different semantic groups and may be stored in different files and formats. Even the same organization may store cases in different data structures, making data integration more challenging. With such inherent complexity, domain experts and domain knowledge are often necessary to perform data integration. However, expert human labor is time and cost prohibitive. To overcome the variability in the structure, format, and content of the different data sources, we map the text into common categories and compute similarity within those. In this paper, we present a method to categorize and merge clinical data by considering the underlying semantics behind the cases and use reference information about the cases to perform data integration. Evaluation shows that we were able to merge 88% of clinical data from five different sources.

AB - Data integration is a well-motivated problem in the clinical data science domain. Availability of patient data, reference clinical cases, and datasets for research have the potential to advance the healthcare industry. However, the unstructured (text, audio, or video data) and heterogeneous nature of the data, the variety of data standards and formats, and patient privacy constraint make data interoperability and integration a challenge. The clinical text is further categorized into different semantic groups and may be stored in different files and formats. Even the same organization may store cases in different data structures, making data integration more challenging. With such inherent complexity, domain experts and domain knowledge are often necessary to perform data integration. However, expert human labor is time and cost prohibitive. To overcome the variability in the structure, format, and content of the different data sources, we map the text into common categories and compute similarity within those. In this paper, we present a method to categorize and merge clinical data by considering the underlying semantics behind the cases and use reference information about the cases to perform data integration. Evaluation shows that we were able to merge 88% of clinical data from five different sources.

KW - data categorization

KW - data integration

KW - datasets

KW - heterogeneous data

KW - schema mapping

KW - semantic similarity

KW - unstructured data

UR - http://www.scopus.com/inward/record.url?scp=85159918003&partnerID=8YFLogxK

U2 - https://doi.org/10.3389/fdata.2023.1173038

DO - https://doi.org/10.3389/fdata.2023.1173038

M3 - Article

C2 - 37139170

SN - 2624-909X

VL - 6

JO - Frontiers in Big Data

JF - Frontiers in Big Data

M1 - 1173038

ER -
