Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver

BEEHIVE Collaboration

doi:10.1093/ve/vey007

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Studying the evolution of viruses and their molecular epidemiology relies on accurate viral sequence data, so that small differences between similar viruses can be meaningfully interpreted. Despite its higher throughput and more detailed minority variant data, next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of large between- and within-host diversity, including frequent indels, may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to pre-process reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with the user's choice of existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We used shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read whole-genome data produced with the Illumina platform, for sixty-five existing publicly available samples and fifty new samples. We show the systematic superiority of mapping to shiver's constructed reference compared with mapping the same reads to the closest of 3,249 real references: median values of 13 bases called differently and more accurately, 0 bases called differently and less accurately, and 205 bases of missing sequence recovered. We also successfully applied shiver to whole-genome samples of Hepatitis C Virus and Respiratory Syncytial Virus. shiver is publicly available from https://github.com/ChrisHIV/shiver.

Original language	English
Pages (from-to)	vey007
Journal	Virus Evolution
Volume	4
Issue number	1
DOIs	https://doi.org/10.1093/ve/vey007
Publication status	Published - 2018

Cite this

@article{53675edb97954c68b6dcf658d1f51e4d,

title = "Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver",

abstract = "Studying the evolution of viruses and their molecular epidemiology relies on accurate viral sequence data, so that small differences between similar viruses can be meaningfully interpreted. Despite its higher throughput and more detailed minority variant data, next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of large between- and within-host diversity, including frequent indels, may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to pre-process reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with the user's choice of existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We used shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read whole-genome data produced with the Illumina platform, for sixty-five existing publicly available samples and fifty new samples. We show the systematic superiority of mapping to shiver's constructed reference compared with mapping the same reads to the closest of 3,249 real references: median values of 13 bases called differently and more accurately, 0 bases called differently and less accurately, and 205 bases of missing sequence recovered. We also successfully applied shiver to whole-genome samples of Hepatitis C Virus and Respiratory Syncytial Virus. shiver is publicly available from https://github.com/ChrisHIV/shiver.",

author = "{BEEHIVE Collaboration} and Chris Wymant and Fran{\c c}ois Blanquart and Tanya Golubchik and Astrid Gall and Margreet Bakker and Daniela Bezemer and Croucher, {Nicholas J} and Matthew Hall and Mariska Hillebregt and Ong, {Swee Hoe} and Oliver Ratmann and Jan Albert and Norbert Bannert and Jacques Fellay and Katrien Fransen and Annabelle Gourlay and Grabowski, {M Kate} and Barbara Gunsenheimer-Bartmeyer and G{\"u}nthard, {Huldrych F} and Pia Kivel{\"a} and Roger Kouyos and Oliver Laeyendecker and Kirsi Liitsola and Laurence Meyer and Kholoud Porter and Matti Ristola and {van Sighem}, Ard and Ben Berkhout and Marion Cornelissen and Paul Kellam and Peter Reiss and Christophe Fraser",

year = "2018",

doi = "10.1093/ve/vey007",

language = "English",

volume = "4",

pages = "vey007",

journal = "Virus Evolution",

issn = "2057-1577",

publisher = "Oxford University Press",

number = "1",

}

TY - JOUR

T1 - Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver

AU - BEEHIVE Collaboration

AU - Wymant, Chris

AU - Blanquart, François

AU - Golubchik, Tanya

AU - Gall, Astrid

AU - Bakker, Margreet

AU - Bezemer, Daniela

AU - Croucher, Nicholas J

AU - Hall, Matthew

AU - Hillebregt, Mariska

AU - Ong, Swee Hoe

AU - Ratmann, Oliver

AU - Albert, Jan

AU - Bannert, Norbert

AU - Fellay, Jacques

AU - Fransen, Katrien

AU - Gourlay, Annabelle

AU - Grabowski, M Kate

AU - Gunsenheimer-Bartmeyer, Barbara

AU - Günthard, Huldrych F

AU - Kivelä, Pia

AU - Kouyos, Roger

AU - Laeyendecker, Oliver

AU - Liitsola, Kirsi

AU - Meyer, Laurence

AU - Porter, Kholoud

AU - Ristola, Matti

AU - van Sighem, Ard

AU - Berkhout, Ben

AU - Cornelissen, Marion

AU - Kellam, Paul

AU - Reiss, Peter

AU - Fraser, Christophe

PY - 2018

Y1 - 2018

AB - Studying the evolution of viruses and their molecular epidemiology relies on accurate viral sequence data, so that small differences between similar viruses can be meaningfully interpreted. Despite its higher throughput and more detailed minority variant data, next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of large between- and within-host diversity, including frequent indels, may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to pre-process reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with the user's choice of existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We used shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read whole-genome data produced with the Illumina platform, for sixty-five existing publicly available samples and fifty new samples. We show the systematic superiority of mapping to shiver's constructed reference compared with mapping the same reads to the closest of 3,249 real references: median values of 13 bases called differently and more accurately, 0 bases called differently and less accurately, and 205 bases of missing sequence recovered. We also successfully applied shiver to whole-genome samples of Hepatitis C Virus and Respiratory Syncytial Virus. shiver is publicly available from https://github.com/ChrisHIV/shiver.

U2 - https://doi.org/10.1093/ve/vey007

DO - 10.1093/ve/vey007

M3 - Article

C2 - 29876136

SN - 2057-1577

VL - 4

SP - vey007

JO - Virus Evolution

JF - Virus Evolution

IS - 1

ER -