TY - JOUR
T1 - Occupancy modeling of coverage distribution for whole genome shotgun DNA sequencing
AU - Wendl, Michael C.
N1 - Funding Information: This work was partially supported by a grant from the National Human Genome Research Institute (HG003079). The author is grateful for discussions with J. Wallis, R. Wilson, and L. Hillier of the Washington University Genome Sequencing Center.
PY - 2006/1
Y1 - 2006/1
N2 - Expected-value models have long provided a rudimentary theoretical foundation for random DNA sequencing. Here, we are interested in improving characterization of genome coverage in terms of its underlying probability distributions. We find that the mathematical notion of occupancy serves as a good model for evolution of the coverage distribution function and reveals new insights related to sequence redundancy. Established concepts, such as "full shotgun depth," have been assumed invariant, but actually depend on project size and decrease over time. For most microbial projects, the full shotgun milestone should be revised downward by about 30%. Accordingly, many already-completed genomes appear to have been over-sequenced. Results also suggest that read lengths for emerging high-throughput sequencing methods must be increased substantially before they can be considered as possible successors to the standard Sanger method. In particular, gains in throughput and sequence depth cannot be made to compensate for diminished read length. Limits are well approximated by a simple logarithmic equation, which should be useful in estimating maximum coverage-based redundancy for future projects.
KW - Genome coverage
KW - Probabilistic modeling
KW - Sequence redundancy
UR - http://www.scopus.com/inward/record.url?scp=33746590339&partnerID=8YFLogxK
U2 - 10.1007/s11538-005-9021-4
DO - 10.1007/s11538-005-9021-4
M3 - Article
C2 - 16794926
AN - SCOPUS:33746590339
SN - 0092-8240
VL - 68
SP - 179
EP - 196
JO - Bulletin of Mathematical Biology
JF - Bulletin of Mathematical Biology
IS - 1
ER -