TY - JOUR
T1 - Quality control and integration of genotypes from two calling pipelines for whole genome sequence data in the Alzheimer's disease sequencing project
AU - Alzheimer's Disease Sequencing Project (ADSP)
AU - Naj, Adam C.
AU - Lin, Honghuang
AU - Vardarajan, Badri N.
AU - White, Simon
AU - Lancour, Daniel
AU - Ma, Yiyi
AU - Schmidt, Michael
AU - Sun, Fangui
AU - Butkiewicz, Mariusz
AU - Bush, William S.
AU - Kunkle, Brian W.
AU - Malamon, John
AU - Amin, Najaf
AU - Choi, Seung Hoan
AU - Hamilton-Nelson, Kara L.
AU - van der Lee, Sven J.
AU - Gupta, Namrata
AU - Koboldt, Daniel C.
AU - Saad, Mohamad
AU - Wang, Bowen
AU - Nato, Alejandro Q.
AU - Sohi, Harkirat K.
AU - Kuzma, Amanda
AU - Wang, Li San
AU - Cupples, L. Adrienne
AU - van Duijn, Cornelia
AU - Seshadri, Sudha
AU - Schellenberg, Gerard D.
AU - Boerwinkle, Eric
AU - Bis, Joshua C.
AU - Dupuis, Josée
AU - Salerno, William J.
AU - Wijsman, Ellen M.
AU - Martin, Eden R.
AU - DeStefano, Anita L.
PY - 2019/7/1
Y1 - 2019/7/1
N2 - The Alzheimer's Disease Sequencing Project (ADSP) performed whole genome sequencing (WGS) of 584 subjects from 111 multiplex families at three sequencing centers. Genotype calling of single nucleotide variants (SNVs) and insertion-deletion variants (indels) was performed centrally using GATK-HaplotypeCaller and Atlas V2. The ADSP Quality Control (QC) Working Group applied QC protocols to project-level variant call format files (VCFs) from each pipeline, and developed and implemented a novel protocol, termed “consensus calling,” to combine genotype calls from both pipelines into a single high-quality set. QC was applied to autosomal bi-allelic SNVs and indels, and included pipeline-recommended QC filters, variant-level QC, and sample-level QC. Low-quality variants or genotypes were excluded, and sample outliers were noted. Quality was assessed by examining Mendelian inconsistencies (MIs) among 67 parent-offspring pairs, and MIs were used to establish additional genotype-specific filters for GATK calls. After QC, 578 subjects remained. Pipeline-specific QC excluded ~12.0% of GATK and 14.5% of Atlas SNVs. Between pipelines, ~91% of SNV genotypes across all QCed variants were concordant; 4.23% and 4.56% of genotypes were exclusive to Atlas or GATK, respectively; the remaining ~0.01% of discordant genotypes were excluded. For indels, variant-level QC excluded ~36.8% of GATK and 35.3% of Atlas indels. Between pipelines, ~55.6% of indel genotypes were concordant; 10.3% and 28.3% were exclusive to Atlas or GATK, respectively; and ~0.29% of discordant genotypes were excluded. The final WGS consensus dataset contains 27,896,774 SNVs and 3,133,926 indels and is publicly available.
AB - The Alzheimer's Disease Sequencing Project (ADSP) performed whole genome sequencing (WGS) of 584 subjects from 111 multiplex families at three sequencing centers. Genotype calling of single nucleotide variants (SNVs) and insertion-deletion variants (indels) was performed centrally using GATK-HaplotypeCaller and Atlas V2. The ADSP Quality Control (QC) Working Group applied QC protocols to project-level variant call format files (VCFs) from each pipeline, and developed and implemented a novel protocol, termed “consensus calling,” to combine genotype calls from both pipelines into a single high-quality set. QC was applied to autosomal bi-allelic SNVs and indels, and included pipeline-recommended QC filters, variant-level QC, and sample-level QC. Low-quality variants or genotypes were excluded, and sample outliers were noted. Quality was assessed by examining Mendelian inconsistencies (MIs) among 67 parent-offspring pairs, and MIs were used to establish additional genotype-specific filters for GATK calls. After QC, 578 subjects remained. Pipeline-specific QC excluded ~12.0% of GATK and 14.5% of Atlas SNVs. Between pipelines, ~91% of SNV genotypes across all QCed variants were concordant; 4.23% and 4.56% of genotypes were exclusive to Atlas or GATK, respectively; the remaining ~0.01% of discordant genotypes were excluded. For indels, variant-level QC excluded ~36.8% of GATK and 35.3% of Atlas indels. Between pipelines, ~55.6% of indel genotypes were concordant; 10.3% and 28.3% were exclusive to Atlas or GATK, respectively; and ~0.29% of discordant genotypes were excluded. The final WGS consensus dataset contains 27,896,774 SNVs and 3,133,926 indels and is publicly available.
KW - Atlas
KW - Consensus calling
KW - GATK
KW - Mendelian inconsistencies
KW - Quality control
KW - Whole genome sequencing
UR - http://www.scopus.com/inward/record.url?scp=85049300452&partnerID=8YFLogxK
U2 - https://doi.org/10.1016/j.ygeno.2018.05.004
DO - https://doi.org/10.1016/j.ygeno.2018.05.004
M3 - Article
C2 - 29857119
SN - 0888-7543
VL - 111
SP - 808
EP - 818
JO - Genomics
JF - Genomics
IS - 4
ER -