Systematically linking tranSMART, Galaxy and EGA for reusing human translational research data

Chao Zhang; Jochem Bijlard; Christine Staiger; Serena Scollen; David van Enckevort; Youri Hoogstrate; Alexander Senf; Saskia Hiltemann; Susanna Repo; Wibo Pipping; Mariska Bierkens; Stefan Payralbe; Bas Stringer; Jaap Heringa; Andrew Stubbs; Luiz Olavo Bonino Da Silva Santos; Jeroen Belien; Ward Weistra; Rita Azevedo; Kees van Bochove; Gerrit Meijer; Jan-Willem Boiten; Jordi Rambla; Remond Fijneman; J Dylan Spalding; Sanne Abeln

doi:https://doi.org/10.12688/f1000research.12168.1

Research output: Contribution to journal › Article › Academic › peer-review

7 Citations (Scopus)

Abstract

The availability of high-throughput molecular profiling techniques has provided more accurate and informative data for regular clinical studies. Nevertheless, complex computational workflows are required to interpret these data. Over the past years, the data volume has been growing explosively, requiring robust human data management to organise and integrate the data efficiently. For this reason, we set up an ELIXIR implementation study, together with the Translational research IT (TraIT) programme, to design a data ecosystem that is able to link raw and interpreted data. In this project, the data from the TraIT Cell Line Use Case (TraIT-CLUC) are used as a test case for this system. Within this ecosystem, we use the European Genome-phenome Archive (EGA) to store raw molecular profiling data; tranSMART to collect interpreted molecular profiling data and clinical data for corresponding samples; and Galaxy to store, run and manage the computational workflows. We can integrate these data by linking their repositories systematically. To showcase our design, we have structured the TraIT-CLUC data, which contain a variety of molecular profiling data types, for storage in both tranSMART and EGA. The metadata provided allows referencing between tranSMART and EGA, fulfilling the cycle of data submission and discovery; we have also designed a data flow from EGA to Galaxy, enabling reanalysis of the raw data in Galaxy. In this way, users can select patient cohorts in tranSMART, trace them back to the raw data and perform (re)analysis in Galaxy. Our conclusion is that the majority of metadata does not necessarily need to be stored (redundantly) in both databases, but that instead FAIR persistent identifiers should be available for well-defined data ontology levels: study, data access committee, physical sample, data sample and raw data file. This approach will pave the way for the stable linkage and reuse of data.
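The linkage described in the abstract can be illustrated with a minimal sketch. It assumes one persistent identifier per level (study, data access committee, physical sample, data sample, raw data file) and shows how a cohort selected by physical sample could be traced back to its raw data files. The class name, the function name and every identifier value below are hypothetical placeholders for illustration; they are not taken from tranSMART, EGA or Galaxy.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class IdentifierLink:
    """One row linking the five identifier levels named in the abstract.

    All field values used below are made-up placeholders, not real accessions.
    """
    study_id: str
    dac_id: str              # data access committee
    physical_sample_id: str
    data_sample_id: str
    raw_file_id: str


def raw_files_for_cohort(links: Iterable[IdentifierLink],
                         physical_sample_ids: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated raw file identifiers for a cohort."""
    wanted = set(physical_sample_ids)
    return sorted({link.raw_file_id
                   for link in links
                   if link.physical_sample_id in wanted})


# Hypothetical link table: two data samples for cell line A, one for B.
LINKS = [
    IdentifierLink("study-1", "dac-1", "cell-line-A", "rnaseq-A", "file-001"),
    IdentifierLink("study-1", "dac-1", "cell-line-A", "cnv-A", "file-002"),
    IdentifierLink("study-1", "dac-1", "cell-line-B", "rnaseq-B", "file-003"),
]

if __name__ == "__main__":
    print(raw_files_for_cohort(LINKS, ["cell-line-A"]))
```

Only identifiers are stored in the link table; descriptive metadata would stay in whichever repository owns it, which is the point the abstract makes about avoiding redundant storage.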

Original language	English
Journal	F1000Research
Volume	6
DOIs	https://doi.org/10.12688/f1000research.12168.1
Publication status	Published - 2017

Cite this

Zhang, C., Bijlard, J., Staiger, C., Scollen, S., van Enckevort, D., Hoogstrate, Y., Senf, A., Hiltemann, S., Repo, S., Pipping, W., Bierkens, M., Payralbe, S., Stringer, B., Heringa, J., Stubbs, A., Bonino Da Silva Santos, L. O., Belien, J., Weistra, W., Azevedo, R., ... Abeln, S. (2017). Systematically linking tranSMART, Galaxy and EGA for reusing human translational research data. F1000Research, 6. https://doi.org/10.12688/f1000research.12168.1

@article{f1459a229c534f918839e241da633086,

title = "Systematically linking tranSMART, Galaxy and EGA for reusing human translational research data",

abstract = "The availability of high-throughput molecular profiling techniques has provided more accurate and informative data for regular clinical studies. Nevertheless, complex computational workflows are required to interpret these data. Over the past years, the data volume has been growing explosively, requiring robust human data management to organise and integrate the data efficiently. For this reason, we set up an ELIXIR implementation study, together with the Translational research IT (TraIT) programme, to design a data ecosystem that is able to link raw and interpreted data. In this project, the data from the TraIT Cell Line Use Case (TraIT-CLUC) are used as a test case for this system. Within this ecosystem, we use the European Genome-phenome Archive (EGA) to store raw molecular profiling data; tranSMART to collect interpreted molecular profiling data and clinical data for corresponding samples; and Galaxy to store, run and manage the computational workflows. We can integrate these data by linking their repositories systematically. To showcase our design, we have structured the TraIT-CLUC data, which contain a variety of molecular profiling data types, for storage in both tranSMART and EGA. The metadata provided allows referencing between tranSMART and EGA, fulfilling the cycle of data submission and discovery; we have also designed a data flow from EGA to Galaxy, enabling reanalysis of the raw data in Galaxy. In this way, users can select patient cohorts in tranSMART, trace them back to the raw data and perform (re)analysis in Galaxy. Our conclusion is that the majority of metadata does not necessarily need to be stored (redundantly) in both databases, but that instead FAIR persistent identifiers should be available for well-defined data ontology levels: study, data access committee, physical sample, data sample and raw data file. This approach will pave the way for the stable linkage and reuse of data.",

author = "Chao Zhang and Jochem Bijlard and Christine Staiger and Serena Scollen and {van Enckevort}, David and Youri Hoogstrate and Alexander Senf and Saskia Hiltemann and Susanna Repo and Wibo Pipping and Mariska Bierkens and Stefan Payralbe and Bas Stringer and Jaap Heringa and Andrew Stubbs and {Bonino Da Silva Santos}, {Luiz Olavo} and Jeroen Belien and Ward Weistra and Rita Azevedo and {van Bochove}, Kees and Gerrit Meijer and Jan-Willem Boiten and Jordi Rambla and Remond Fijneman and Spalding, {J Dylan} and Sanne Abeln",

year = "2017",

doi = "10.12688/f1000research.12168.1",

language = "English",

volume = "6",

journal = "F1000Research",

issn = "2046-1402",

publisher = "F1000 Research Ltd.",

}

Zhang, C, Bijlard, J, Staiger, C, Scollen, S, van Enckevort, D, Hoogstrate, Y, Senf, A, Hiltemann, S, Repo, S, Pipping, W, Bierkens, M, Payralbe, S, Stringer, B, Heringa, J, Stubbs, A, Bonino Da Silva Santos, LO, Belien, J, Weistra, W, Azevedo, R, van Bochove, K, Meijer, G, Boiten, J-W, Rambla, J, Fijneman, R, Spalding, JD & Abeln, S 2017, 'Systematically linking tranSMART, Galaxy and EGA for reusing human translational research data', F1000Research, vol. 6. https://doi.org/10.12688/f1000research.12168.1

TY - JOUR

T1 - Systematically linking tranSMART, Galaxy and EGA for reusing human translational research data

AU - Zhang, Chao

AU - Bijlard, Jochem

AU - Staiger, Christine

AU - Scollen, Serena

AU - van Enckevort, David

AU - Hoogstrate, Youri

AU - Senf, Alexander

AU - Hiltemann, Saskia

AU - Repo, Susanna

AU - Pipping, Wibo

AU - Bierkens, Mariska

AU - Payralbe, Stefan

AU - Stringer, Bas

AU - Heringa, Jaap

AU - Stubbs, Andrew

AU - Bonino Da Silva Santos, Luiz Olavo

AU - Belien, Jeroen

AU - Weistra, Ward

AU - Azevedo, Rita

AU - van Bochove, Kees

AU - Meijer, Gerrit

AU - Boiten, Jan-Willem

AU - Rambla, Jordi

AU - Fijneman, Remond

AU - Spalding, J Dylan

AU - Abeln, Sanne

PY - 2017

Y1 - 2017

AB - The availability of high-throughput molecular profiling techniques has provided more accurate and informative data for regular clinical studies. Nevertheless, complex computational workflows are required to interpret these data. Over the past years, the data volume has been growing explosively, requiring robust human data management to organise and integrate the data efficiently. For this reason, we set up an ELIXIR implementation study, together with the Translational research IT (TraIT) programme, to design a data ecosystem that is able to link raw and interpreted data. In this project, the data from the TraIT Cell Line Use Case (TraIT-CLUC) are used as a test case for this system. Within this ecosystem, we use the European Genome-phenome Archive (EGA) to store raw molecular profiling data; tranSMART to collect interpreted molecular profiling data and clinical data for corresponding samples; and Galaxy to store, run and manage the computational workflows. We can integrate these data by linking their repositories systematically. To showcase our design, we have structured the TraIT-CLUC data, which contain a variety of molecular profiling data types, for storage in both tranSMART and EGA. The metadata provided allows referencing between tranSMART and EGA, fulfilling the cycle of data submission and discovery; we have also designed a data flow from EGA to Galaxy, enabling reanalysis of the raw data in Galaxy. In this way, users can select patient cohorts in tranSMART, trace them back to the raw data and perform (re)analysis in Galaxy. Our conclusion is that the majority of metadata does not necessarily need to be stored (redundantly) in both databases, but that instead FAIR persistent identifiers should be available for well-defined data ontology levels: study, data access committee, physical sample, data sample and raw data file. This approach will pave the way for the stable linkage and reuse of data.

U2 - https://doi.org/10.12688/f1000research.12168.1

DO - 10.12688/f1000research.12168.1

M3 - Article

C2 - 29123641

SN - 2046-1402

VL - 6

JO - F1000Research

JF - F1000Research

ER -