Occupancy modeling of coverage distribution for whole genome shotgun DNA sequencing

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

Expected-value models have long provided a rudimentary theoretical foundation for random DNA sequencing. Here, we are interested in improving characterization of genome coverage in terms of its underlying probability distributions. We find that the mathematical notion of occupancy serves as a good model for evolution of the coverage distribution function and reveals new insights related to sequence redundancy. Established concepts, such as "full shotgun depth," have been assumed invariant, but actually depend on project size and decrease over time. For most microbial projects, the full shotgun milestone should be revised downward by about 30%. Accordingly, many already-completed genomes appear to have been over-sequenced. Results also suggest that read lengths for emerging high-throughput sequencing methods must be increased substantially before they can be considered as possible successors to the standard Sanger method. In particular, gains in throughput and sequence depth cannot be made to compensate for diminished read length. Limits are well approximated by a simple logarithmic equation, which should be useful in estimating maximum coverage-based redundancy for future projects.
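As a hedged illustration of the distinction the abstract draws, the sketch below contrasts the classical expected-value estimate of coverage, 1 - exp(-NL/G) for N reads of length L on a genome of length G, with the spread of coverage seen across simulated projects. It is not the paper's occupancy model or its logarithmic limit equation. The function names, the circular-genome simplification, and all parameter values are assumptions chosen only for illustration.

```python
import math
import random


def expected_coverage(redundancy):
    """Classical expected-value estimate: fraction of the genome covered
    at a given redundancy (N * L / G)."""
    return 1.0 - math.exp(-redundancy)


def simulate_coverage(genome_length, read_length, n_reads, rng):
    """Covered fraction for one simulated shotgun project.

    Assumption: the genome is circular (reasonable for many microbial
    genomes) and read start positions are uniform and independent.
    Requires read_length <= genome_length.
    """
    # Difference array: +1 where a read begins, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for _ in range(n_reads):
        start = rng.randrange(genome_length)
        end = start + read_length
        diff[start] += 1
        if end <= genome_length:
            diff[end] -= 1
        else:
            # The read wraps around the origin of the circular genome.
            diff[genome_length] -= 1
            diff[0] += 1
            diff[end - genome_length] -= 1

    covered = 0
    depth = 0
    for pos in range(genome_length):
        depth += diff[pos]
        if depth > 0:
            covered += 1
    return covered / genome_length


if __name__ == "__main__":
    # Hypothetical toy project: 10 kb genome, 100-base reads, 5x redundancy.
    G, L, N = 10_000, 100, 500
    rng = random.Random(0)
    fractions = [simulate_coverage(G, L, N, rng) for _ in range(200)]
    print("expected-value estimate:", expected_coverage(N * L / G))
    print("simulated mean:", sum(fractions) / len(fractions))
    print("simulated range:", min(fractions), max(fractions))
```

The expected-value formula returns a single number, whereas the simulated projects scatter around it. That scatter is the kind of distributional information the abstract argues an expected-value model cannot provide.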

Original language: English
Pages (from-to): 179-196
Number of pages: 18
Journal: Bulletin of Mathematical Biology
Volume: 68
Issue number: 1
State: Published - Jan 2006

Keywords

  • Genome coverage
  • Probabilistic modeling
  • Sequence redundancy
