TY - JOUR
T1 - Semi-automated assembly of high-quality diploid human reference genomes
AU - Human Pangenome Reference Consortium
AU - Jarvis, Erich D.
AU - Formenti, Giulio
AU - Rhie, Arang
AU - Guarracino, Andrea
AU - Yang, Chentao
AU - Wood, Jonathan
AU - Tracey, Alan
AU - Thibaud-Nissen, Francoise
AU - Vollger, Mitchell R.
AU - Porubsky, David
AU - Cheng, Haoyu
AU - Asri, Mobin
AU - Logsdon, Glennis A.
AU - Carnevali, Paolo
AU - Chaisson, Mark J.P.
AU - Chin, Chen Shan
AU - Cody, Sarah
AU - Collins, Joanna
AU - Ebert, Peter
AU - Escalona, Merly
AU - Fedrigo, Olivier
AU - Fulton, Robert S.
AU - Fulton, Lucinda L.
AU - Garg, Shilpa
AU - Gerton, Jennifer L.
AU - Ghurye, Jay
AU - Granat, Anastasiya
AU - Green, Richard E.
AU - Harvey, William
AU - Hasenfeld, Patrick
AU - Hastie, Alex
AU - Haukness, Marina
AU - Jaeger, Erich B.
AU - Jain, Miten
AU - Kirsche, Melanie
AU - Kolmogorov, Mikhail
AU - Korbel, Jan O.
AU - Koren, Sergey
AU - Korlach, Jonas
AU - Lee, Joyce
AU - Li, Daofeng
AU - Lindsay, Tina
AU - Lucas, Julian
AU - Luo, Feng
AU - Marschall, Tobias
AU - Mitchell, Matthew W.
AU - McDaniel, Jennifer
AU - Nie, Fan
AU - Olsen, Hugh E.
AU - Olson, Nathan D.
AU - Pesout, Trevor
AU - Potapova, Tamara
AU - Puiu, Daniela
AU - Regier, Allison
AU - Ruan, Jue
AU - Salzberg, Steven L.
AU - Sanders, Ashley D.
AU - Schatz, Michael C.
AU - Schmitt, Anthony
AU - Schneider, Valerie A.
AU - Selvaraj, Siddarth
AU - Shafin, Kishwar
AU - Shumate, Alaina
AU - Stitziel, Nathan O.
AU - Stober, Catherine
AU - Torrance, James
AU - Wagner, Justin
AU - Wang, Jianxin
AU - Wenger, Aaron
AU - Xiao, Chuanle
AU - Zimin, Aleksey V.
AU - Zhang, Guojie
AU - Wang, Ting
AU - Li, Heng
AU - Garrison, Erik
AU - Haussler, David
AU - Hall, Ira
AU - Zook, Justin M.
AU - Eichler, Evan E.
AU - Phillippy, Adam M.
AU - Paten, Benedict
AU - Howe, Kerstin
AU - Miga, Karen H.
N1 - Funding Information:
Certain commercial equipment, instruments or materials are identified to specify adequately experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. T.M. and P.E. (asm14 team) acknowledge the computational infrastructure and support provided by the Centre for Information and Media Technology at Heinrich Heine University Düsseldorf. We thank P. Audano of the Eichler laboratory for bioinformatic support. This work utilized the computational resources of the National Institutes of Health HPC Biowulf cluster (https://hpc.nih.gov). M.C.S. and M. Kirsche (asm17 team) acknowledge A. Carroll and P.-C. Chang from Google AI for assistance running DeepVariant. The primary source of funding for this study was NHGRI grant U01HG010971 (https://grants.nih.gov/grants/guide/rfa-files/rfa-hg-19-004.html). Additional funding included: Howard Hughes Medical Institute (to E.D.J.), NHGRI grants HG002385 and HG010169 (to E.E.E.), and NHGRI grant R01HG011274-01 (to K.H.M.). The computational resources and personnel support for the DipAsm assemblies were funded by HG010906/HG/NHGRI NIH HHS and NNF21OC0069089 to S.G.; NCI U01CA253481 to M.C.S.; and NSF DBI-1627442 to M.C.S. T.M. and P.E. (asm14 team) acknowledge the support of the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D and 031A538A), the German Research Foundation (391137747 to T.M.), and the National Natural Science Foundation of China (62150048 to J. Wang). E.D.J., E.E.E. and D.H. are investigators of the Howard Hughes Medical Institute.
This work was supported in part by the National Center for Biotechnology Information of the National Library of Medicine, National Institutes of Health; the Intramural Research Program of the NHGRI, National Institutes of Health; and by National Institute of Standards and Technology intramural research funding.
Publisher Copyright:
© 2022, The Author(s).
PY - 2022/11/17
Y1 - 2022/11/17
AB - The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent–child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.
UR - http://www.scopus.com/inward/record.url?scp=85140231380&partnerID=8YFLogxK
U2 - 10.1038/s41586-022-05325-5
DO - 10.1038/s41586-022-05325-5
M3 - Article
C2 - 36261518
AN - SCOPUS:85140231380
SN - 0028-0836
VL - 611
SP - 519
EP - 531
JO - Nature
JF - Nature
IS - 7936
ER -