Computational analyses to characterise hidden information in short and long read sequencing data of human genomes: there’s more than meets the reference

Jasper Linthorst

Research output: PhD Thesis › Phd-Thesis - Research and graduation internal

Abstract

Next-generation sequencing (NGS) has enabled us to accurately determine the nucleotide sequence of short fragments of DNA at a massive scale, which has led to various clinical applications of human genome sequencing. To extract information from these NGS experiments, virtually all analyses map the sequenced reads to a reference assembly of the human genome. Importantly, in these experiments a large fraction (~12%) of the sequenced DNA fragments is ignored because their origin cannot be traced back to a (single) position on the reference assembly. These ignored or unmapped fragments have two origins. Some originate from sequence that occurs more than once (repeats); others originate from sequence that is absent from the reference assembly. In practice, many of these unmapped fragments originate from so-called structural variations (SVs), where the sequenced genome differs from the reference assembly. In Part 1 of this thesis, we study this source of sequence variation using long-read sequencing technology and introduce methods for doing so. In Part 2 of this thesis, we specifically study the DNA fragments that cannot be traced back to the human reference assembly but instead seem to originate from DNA viruses.

Original language	English
Qualification	Doctor of Philosophy
Awarding Institution
Supervisors/Advisors	Sistermans, Erik (Supervisor); Reinders, M.J.T. (Supervisor, External person); Holstege, Henne (Co-supervisor)
Award date	9 Dec 2022
Place of Publication	s.l.
Publisher	s.n.
Publication status	Published - 9 Dec 2022

Keywords

NIPT
cell-free DNA
de-novo assembly
long-read sequencing
next-generation sequencing
non-invasive prenatal testing
structural variation
viral DNA
