Assessing the complementary information from an increased number of biologically relevant features in liquid biopsy-derived RNA-Seq data

Stavros Giannoukakos; Silvia D'Ambrosi; Danijela Koppers-Lalic; Cristina Gómez-Martín; Alberto Fernandez; Michael Hackenberg

doi:10.1016/j.heliyon.2024.e27360

Pathology (VUmc)

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Liquid biopsy-derived RNA sequencing (lbRNA-seq) exhibits significant promise for clinic-oriented cancer diagnostics due to its non-invasiveness and ease of repeatability. Despite substantial advancements, obstacles like technical artefacts and process standardisation impede seamless clinical integration. Alongside addressing technical aspects such as normalising fluctuating low-input material and establishing a standardised clinical workflow, the lack of result validation using independent datasets remains a critical factor contributing to the often low reproducibility of liquid biopsy-detected biomarkers. Considering the outlined drawbacks, our objective was to establish a workflow/methodology characterised by: 1. Harnessing the rich diversity of biological features accessible through lbRNA-seq data, encompassing a holistic range of molecular and functional attributes. These components are seamlessly integrated via a Machine Learning-based Ensemble Classification framework, enabling a unified and comprehensive analysis of the intricate information encoded within the data. 2. Implementing and rigorously benchmarking intra-sample normalisation methods to heighten their relevance within clinical settings. 3. Thoroughly assessing its efficacy across independent test sets to ascertain its robustness and potential utility. Using ten datasets from several studies comprising three different sources of biological material, we first show that while the best-performing normalisation methods depend strongly on the dataset and coupled Machine Learning method, the rather simple Counts Per Million method is generally very robust, showing comparable performance to cross-sample methods. Subsequently, we demonstrate that the innovative biofeature types introduced in this study, such as the Fraction of Canonical Transcript, harbour complementary information. Consequently, their inclusion consistently enhances prediction power compared to models relying solely on gene expression-based biofeatures. Finally, we demonstrate that the workflow is robust on completely independent datasets, generally from different labs and/or different protocols. Taken together, the workflow presented here outperforms generally employed methods in prediction accuracy and may hold potential for clinical diagnostics application due to its specific design.
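
The Counts Per Million (CPM) method named in the abstract is commonly defined as each feature's raw read count divided by the sample's total count and multiplied by one million, so it needs no information from other samples. The following is a minimal Python sketch of that standard definition, not the authors' implementation; the function name, the dictionary layout, and the gene names and counts are illustrative assumptions.

```python
def counts_per_million(counts):
    """Scale one sample's raw read counts so that they sum to one million.

    `counts` maps a feature name (e.g. a gene) to its raw read count.
    Only this sample's own total is used, which is what makes CPM an
    intra-sample normalisation.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads; CPM is undefined")
    return {feature: n * 1_000_000 / total for feature, n in counts.items()}


# Hypothetical sample with made-up counts, for illustration only.
sample = {"GENE_A": 150, "GENE_B": 50, "GENE_C": 800}
cpm = counts_per_million(sample)

for feature, value in cpm.items():
    print(feature, value)
```

Because each sample is scaled by its own total only, a new clinical sample can be normalised without re-processing a reference cohort, which is the practical appeal of intra-sample methods in the setting the abstract describes.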

Original language	English
Article number	e27360
Journal	Heliyon
Volume	10
Issue number	6
DOIs	https://doi.org/10.1016/j.heliyon.2024.e27360
Publication status	Published - 30 Mar 2024

Keywords

Bioinformatics
Cancer diagnostics
Ensemble learning
Liquid biopsy
Machine learning
Normalisation
RNA-Seq
Transcriptomics
lbRNA-seq

Cite this

@article{88adaf54625b4dc795705ab0757ec3f6,
  title = "Assessing the complementary information from an increased number of biologically relevant features in liquid biopsy-derived {RNA-Seq} data",
  abstract = "Liquid biopsy-derived RNA sequencing (lbRNA-seq) exhibits significant promise for clinic-oriented cancer diagnostics due to its non-invasiveness and ease of repeatability. Despite substantial advancements, obstacles like technical artefacts and process standardisation impede seamless clinical integration. Alongside addressing technical aspects such as normalising fluctuating low-input material and establishing a standardised clinical workflow, the lack of result validation using independent datasets remains a critical factor contributing to the often low reproducibility of liquid biopsy-detected biomarkers. Considering the outlined drawbacks, our objective was to establish a workflow/methodology characterised by: 1. Harnessing the rich diversity of biological features accessible through lbRNA-seq data, encompassing a holistic range of molecular and functional attributes. These components are seamlessly integrated via a Machine Learning-based Ensemble Classification framework, enabling a unified and comprehensive analysis of the intricate information encoded within the data. 2. Implementing and rigorously benchmarking intra-sample normalisation methods to heighten their relevance within clinical settings. 3. Thoroughly assessing its efficacy across independent test sets to ascertain its robustness and potential utility. Using ten datasets from several studies comprising three different sources of biological material, we first show that while the best-performing normalisation methods depend strongly on the dataset and coupled Machine Learning method, the rather simple Counts Per Million method is generally very robust, showing comparable performance to cross-sample methods. Subsequently, we demonstrate that the innovative biofeature types introduced in this study, such as the Fraction of Canonical Transcript, harbour complementary information. Consequently, their inclusion consistently enhances prediction power compared to models relying solely on gene expression-based biofeatures. Finally, we demonstrate that the workflow is robust on completely independent datasets, generally from different labs and/or different protocols. Taken together, the workflow presented here outperforms generally employed methods in prediction accuracy and may hold potential for clinical diagnostics application due to its specific design.",
  keywords = "Bioinformatics, Cancer diagnostics, Ensemble learning, Liquid biopsy, Machine learning, Normalisation, RNA-Seq, Transcriptomics, lbRNA-seq",
  author = "Stavros Giannoukakos and Silvia D'Ambrosi and Danijela Koppers-Lalic and Cristina G{\'o}mez-Mart{\'i}n and Alberto Fernandez and Michael Hackenberg",
  note = "Publisher Copyright: {\textcopyright} 2024 The Authors",
  year = "2024",
  month = mar,
  day = "30",
  doi = "10.1016/j.heliyon.2024.e27360",
  language = "English",
  volume = "10",
  journal = "Heliyon",
  issn = "2405-8440",
  publisher = "Elsevier BV",
  number = "6",
  pages = "e27360",
}

TY  - JOUR
T1  - Assessing the complementary information from an increased number of biologically relevant features in liquid biopsy-derived RNA-Seq data
AU  - Giannoukakos, Stavros
AU  - D'Ambrosi, Silvia
AU  - Koppers-Lalic, Danijela
AU  - Gómez-Martín, Cristina
AU  - Fernandez, Alberto
AU  - Hackenberg, Michael
PY  - 2024/3/30
Y1  - 2024/3/30
AB  - Liquid biopsy-derived RNA sequencing (lbRNA-seq) exhibits significant promise for clinic-oriented cancer diagnostics due to its non-invasiveness and ease of repeatability. Despite substantial advancements, obstacles like technical artefacts and process standardisation impede seamless clinical integration. Alongside addressing technical aspects such as normalising fluctuating low-input material and establishing a standardised clinical workflow, the lack of result validation using independent datasets remains a critical factor contributing to the often low reproducibility of liquid biopsy-detected biomarkers. Considering the outlined drawbacks, our objective was to establish a workflow/methodology characterised by: 1. Harnessing the rich diversity of biological features accessible through lbRNA-seq data, encompassing a holistic range of molecular and functional attributes. These components are seamlessly integrated via a Machine Learning-based Ensemble Classification framework, enabling a unified and comprehensive analysis of the intricate information encoded within the data. 2. Implementing and rigorously benchmarking intra-sample normalisation methods to heighten their relevance within clinical settings. 3. Thoroughly assessing its efficacy across independent test sets to ascertain its robustness and potential utility. Using ten datasets from several studies comprising three different sources of biological material, we first show that while the best-performing normalisation methods depend strongly on the dataset and coupled Machine Learning method, the rather simple Counts Per Million method is generally very robust, showing comparable performance to cross-sample methods. Subsequently, we demonstrate that the innovative biofeature types introduced in this study, such as the Fraction of Canonical Transcript, harbour complementary information. Consequently, their inclusion consistently enhances prediction power compared to models relying solely on gene expression-based biofeatures. Finally, we demonstrate that the workflow is robust on completely independent datasets, generally from different labs and/or different protocols. Taken together, the workflow presented here outperforms generally employed methods in prediction accuracy and may hold potential for clinical diagnostics application due to its specific design.
KW  - Bioinformatics
KW  - Cancer diagnostics
KW  - Ensemble learning
KW  - Liquid biopsy
KW  - Machine learning
KW  - Normalisation
KW  - RNA-Seq
KW  - Transcriptomics
KW  - lbRNA-seq
UR  - http://www.scopus.com/inward/record.url?scp=85188011245&partnerID=8YFLogxK
U2  - 10.1016/j.heliyon.2024.e27360
DO  - 10.1016/j.heliyon.2024.e27360
M3  - Article
C2  - 38515664
SN  - 2405-8440
VL  - 10
JO  - Heliyon
JF  - Heliyon
IS  - 6
M1  - e27360
ER  - 
