TY - JOUR
T1 - Discovery and genotyping of structural variation from long-read haploid genome sequence data
AU - Huddleston, John
AU - Chaisson, Mark J.P.
AU - Steinberg, Karyn Meltz
AU - Warren, Wes
AU - Hoekzema, Kendra
AU - Gordon, David
AU - Graves-Lindsay, Tina A.
AU - Munson, Katherine M.
AU - Kronenberg, Zev N.
AU - Vives, Laura
AU - Peluso, Paul
AU - Boitano, Matthew
AU - Chin, Chen Shin
AU - Korlach, Jonas
AU - Wilson, Richard K.
AU - Eichler, Evan E.
N1 - Publisher Copyright: © 2017 Huddleston et al.
PY - 2017/5
Y1 - 2017/5
N2 - In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that >89% of these variants have been missed as part of analysis of the 1000 Genomes Project even after adjusting for more common variants (MAF > 1%). We estimate that this theoretical human diploid differs by as much as 16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows for the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ~59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.
AB - In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that >89% of these variants have been missed as part of analysis of the 1000 Genomes Project even after adjusting for more common variants (MAF > 1%). We estimate that this theoretical human diploid differs by as much as 16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows for the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ~59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.
UR - http://www.scopus.com/inward/record.url?scp=85019083461&partnerID=8YFLogxK
U2 - 10.1101/gr.214007.116
DO - 10.1101/gr.214007.116
M3 - Article
C2 - 27895111
AN - SCOPUS:85019083461
VL - 27
SP - 677
EP - 685
JO - Genome Research
JF - Genome Research
SN - 1088-9051
IS - 5
ER -