Abstract
The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202-sample WGS 1kGP resource, which now includes 602 complete trios, sequenced to a depth of 30X using Illumina. We performed single-nucleotide variant (SNV) and short insertion and deletion (INDEL) discovery and generated a comprehensive set of structural variants (SVs) by integrating multiple analytic methods through a machine learning model. We show gains in sensitivity and precision of variant calls compared to phase 3, especially among rare SNVs as well as INDELs and SVs spanning the frequency spectrum. We also generated an improved reference imputation panel, making variants discovered here accessible for association studies.
Original language | English |
---|---|
Pages (from-to) | 3426-3440.e19 |
Journal | Cell |
Volume | 185 |
Issue number | 18 |
DOIs | https://doi.org/10.1016/j.cell.2022.08.004 |
State | Published - Sep 1 2022 |
Keywords
- 1000 Genomes Project
- INDEL
- SNV
- population genetics
- reference imputation panel
- structural variation
- trio sequencing
- whole-genome sequencing
High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. / Human Genome Structural Variation Consortium.
In: Cell, Vol. 185, No. 18, 01.09.2022, p. 3426-3440.e19.
Research output: Contribution to journal › Article › peer-review
TY - JOUR
T1 - High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios
AU - Human Genome Structural Variation Consortium
AU - Byrska-Bishop, Marta
AU - Evani, Uday S.
AU - Zhao, Xuefang
AU - Basile, Anna O.
AU - Abel, Haley J.
AU - Regier, Allison A.
AU - Corvelo, André
AU - Clarke, Wayne E.
AU - Musunuri, Rajeeva
AU - Nagulapalli, Kshithija
AU - Fairley, Susan
AU - Runnels, Alexi
AU - Winterkorn, Lara
AU - Lowy, Ernesto
AU - Eichler, Evan E.
AU - Korbel, Jan O.
AU - Lee, Charles
AU - Marschall, Tobias
AU - Devine, Scott E.
AU - Harvey, William T.
AU - Zhou, Weichen
AU - Mills, Ryan E.
AU - Rausch, Tobias
AU - Kumar, Sushant
AU - Alkan, Can
AU - Hormozdiari, Fereydoun
AU - Chong, Zechen
AU - Chen, Yu
AU - Yang, Xiaofei
AU - Lin, Jiadong
AU - Gerstein, Mark B.
AU - Ye, Kai
AU - Zhu, Qihui
AU - Yilmaz, Feyza
AU - Xiao, Chunlin
AU - Flicek, Paul
AU - Germer, Soren
AU - Brand, Harrison
AU - Hall, Ira M.
AU - Talkowski, Michael E.
AU - Narzisi, Giuseppe
AU - Zody, Michael C.
N1 - Funding Information: The WGS data were generated at the New York Genome Center with funds provided by NHGRI grants 3UM1HG008901-03S1 and 3UM1HG008901-04S1. A.C., W.E.C., and M.C.Z. were partially supported by the NHGRI grant UM1HG008901. S.F., E.L., and P.F. were partially supported by the Wellcome Trust (WT104947/Z/14/Z) and the European Molecular Biology Laboratory. Support for analyses by X.Z. and M.E.T. was provided in part by NIMH MH115957, NICHD HD081256, and NHGRI UM1HG008895. I.M.H., H.J.A., and A.A.R. were supported in part by the NHGRI grant UM1HG008853. C.X. was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. The HGSVC was supported in part by a grant from the National Institutes of Health U24HG007497 (to C.L., E.E.E., J.O.K., and T.M.). We thank Justin Zook from the GIAB Consortium for his feedback and manual curation of a subset of singleton calls in the NA12878 sample across the technologies employed by the GIAB v4.2.1 truth set.
We conducted a series of analyses to benchmark SVs from each of the three methods described above, including their false discovery rate (FDR) as indicated by inheritance rates and support from orthogonal technologies, as well as their breakpoint precision, estimated by the deviation of their SV breakpoints from long-read assemblies in three genomes from analyses in the HGSVC (Chaisson et al., 2019). We also compared the three call sets to decide on the optimal integration strategy to maximize sensitivity and minimize FDR in the final ensemble call set (Figure S3, Table S5). Details of the comparison and integration strategies are described separately for insertions and all other variant classes below. For a pair of SVs of the same variant class (other than insertions) to be considered concordant, 50% reciprocal overlap was required for SVs larger than 5 kb and 10% reciprocal overlap was required for variants under 5 kb.
The FDR across variant calls was evaluated using the same measurements as described above. For deletions, duplications, and inversions, we observed low FDR (<5%) among variants that were shared by GATK-SV and svtools, but significantly higher FDR in the subsets that were uniquely discovered by either algorithm (Figure S3E-G). To restrict the final call set to high-quality variants, a machine learning model (LightGBM (Ke et al., 2017)) was trained on each SV class. Three samples that were previously analyzed in the HGSVC studies (HG00514, HG00733, NA19240) (Chaisson et al., 2019; Ebert et al., 2021) were selected to train the model. The true training subset was defined as SVs that were uni-parentally inherited, shared by GATK-SV and svtools, supported by VaPoR, and overlapped by PacBio call sets. The false training subset was defined as SVs that appeared as de novo in offspring genomes, were discovered by only one of GATK-SV or svtools, were not supported by VaPoR, and were not overlapped by PacBio call sets. The model features included the sequencing depth of each SV, the depth of the 1 kb region around each SV, the count of aberrant paired-end reads (PE) within 150 bp of each SV, the count of split reads (SR) within 100 bp of each breakpoint, the size, allele fraction, and genomic location (split into short repeats, segmental duplications, all remaining repeat-masked regions, and the remaining unique sequences) of each SV, and the fraction of offspring harboring a de novo variant among trios in which the SV is observed. Each SV per genome was assigned a ‘boost score’ by the LightGBM model, and SVs with a boost score >0.448 were labeled as ‘PASS’ in the model (Figure S3M, S3N). This threshold was selected to keep the estimated FDR below 5%. Call set-specific SVs that failed the LightGBM model in less than 48% of all examined samples were included in the final integrated call set (Figure S3N).
Author contributions: Writing of the manuscript and figure generation, M.B.-B., U.S.E., X.Z., and A.O.B.; SNV/INDEL calling and analysis, U.S.E., M.B.-B., and R.M.; SV calling, X.Z., H.B., H.J.A., A.A.R., A.C., W.E.C., and HGSVC; SV integration and analysis, X.Z. and H.B.; haplotype phasing, M.B.-B.; imputation evaluation, A.O.B.; production and quality control of the WGS data, L.W., A.A.R., U.S.E., M.B.-B., and K.N.; data coordination, data sharing, and user support, S.F., E.L., P.F., and C.X.; joint supervision of this work, S.G., I.M.H., M.E.T., G.N., and M.C.Z.
Declaration of interests: E.E.E. is a scientific advisory board (SAB) member of Variant Bio, Inc. P.F. is an SAB member of Fabric Genomics, Inc. and Eagle Genomics, Ltd.
Publisher Copyright: © 2022 The Authors
PY - 2022/9/1
Y1 - 2022/9/1
AB - The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202-sample WGS 1kGP resource, which now includes 602 complete trios, sequenced to a depth of 30X using Illumina. We performed single-nucleotide variant (SNV) and short insertion and deletion (INDEL) discovery and generated a comprehensive set of structural variants (SVs) by integrating multiple analytic methods through a machine learning model. We show gains in sensitivity and precision of variant calls compared to phase 3, especially among rare SNVs as well as INDELs and SVs spanning the frequency spectrum. We also generated an improved reference imputation panel, making variants discovered here accessible for association studies.
KW - 1000 Genomes Project
KW - INDEL
KW - SNV
KW - population genetics
KW - reference imputation panel
KW - structural variation
KW - trio sequencing
KW - whole-genome sequencing
UR - http://www.scopus.com/inward/record.url?scp=85136603644&partnerID=8YFLogxK
U2 - 10.1016/j.cell.2022.08.004
DO - 10.1016/j.cell.2022.08.004
M3 - Article
C2 - 36055201
AN - SCOPUS:85136603644
VL - 185
SP - 3426-3440.e19
JO - Cell
JF - Cell
SN - 0092-8674
IS - 18
ER -
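
The notes field of the record above states three numeric rules from the SV integration: a size-dependent reciprocal-overlap requirement for calling two SVs concordant (50% above 5 kb, 10% below), a boost-score cutoff of 0.448 for a per-genome PASS label, and retention of call set-specific SVs that fail the model in less than 48% of examined samples. The following is a minimal Python sketch of those rules, not the authors' implementation. All names (`SV`, `reciprocal_overlap`, `concordant`, `passes_boost`, `retained_site`) are hypothetical, and several details are assumptions the record does not specify: half-open coordinates, using the larger of the two calls to pick the size class, grouping an SV of exactly 5 kb with the smaller class, and dropping a site with no scores.

```python
from dataclasses import dataclass
from typing import Sequence

SIZE_CUTOFF_BP = 5_000         # the "5 kb" boundary quoted in the record
BOOST_PASS_THRESHOLD = 0.448   # boost-score cutoff quoted in the record
MAX_FAIL_FRACTION = 0.48       # "less than 48% of all examined samples"


@dataclass(frozen=True)
class SV:
    """Hypothetical SV record; coordinates are half-open (assumed)."""
    chrom: str
    start: int
    end: int
    svtype: str  # e.g. "DEL", "DUP", "INV"

    @property
    def length(self) -> int:
        return self.end - self.start


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smaller of (shared / len(a)) and (shared / len(b)); 0.0 if disjoint."""
    if a.chrom != b.chrom or a.length <= 0 or b.length <= 0:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / a.length, shared / b.length)


def concordant(a: SV, b: SV) -> bool:
    """Apply the size-dependent rule to two calls of the same non-insertion class."""
    if a.svtype != b.svtype or a.svtype == "INS":
        return False  # insertions are handled separately in the source
    # Assumption: the larger call decides which threshold applies.
    required = 0.5 if max(a.length, b.length) > SIZE_CUTOFF_BP else 0.1
    return reciprocal_overlap(a, b) >= required


def passes_boost(score: float) -> bool:
    """Per-genome PASS label: boost score strictly above the cutoff."""
    return score > BOOST_PASS_THRESHOLD


def retained_site(scores: Sequence[float]) -> bool:
    """Keep a call set-specific site failing in under 48% of examined samples."""
    if not scores:
        return False  # assumption: nothing to evaluate, so drop the site
    failed = sum(1 for s in scores if not passes_boost(s))
    return failed / len(scores) < MAX_FAIL_FRACTION
```

For example, two 8 kb deletions that share 6 kb have a reciprocal overlap of 0.75 and count as concordant, while the same pair sharing only 3 kb (0.375) does not. The boost scores themselves would come from a trained classifier, which this sketch leaves out.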