TY - JOUR

T1 - Coverage theories for metagenomic DNA sequencing based on a generalization of Stevens' theorem

AU - Wendl, Michael C.

AU - Kota, Karthik

AU - Weinstock, George M.

AU - Mitreva, Makedonka

N1 - Funding Information:
The authors wish to acknowledge funding sources for this work: National Human Genome Research Institute grants HG003079 and HG004968.

PY - 2013/11

Y1 - 2013/11

N2 - Metagenomic project design has relied variously upon speculation, semi-empirical and ad hoc heuristic models, and elementary extensions of single-sample Lander-Waterman expectation theory, all of which are demonstrably inadequate. Here, we propose an approach based upon a generalization of Stevens' Theorem for randomly covering a domain. We extend this result to account for the presence of multiple species, from which are derived useful probabilities for fully recovering a particular target microbe of interest and for average contig length. These show improved specificities compared to older measures and recommend deeper data generation than the levels chosen by some early studies, supporting the view that poor assemblies were due at least somewhat to insufficient data. We assess predictions empirically by generating roughly 4.5 Gb of sequence from a twelve-member bacterial community, comparing coverage for two particular members, Selenomonas artemidis and Enterococcus faecium, which are the least (∼3 %) and most (∼12 %) abundant species, respectively. Agreement is reasonable, with differences likely attributable to coverage biases. We show that, in some cases, bias is simple in the sense that a small reduction in read length to simulate less efficient covering brings data and theory into essentially complete accord. Finally, we describe two applications of the theory. One plots coverage probability over the relevant parameter space, constructing essentially a "metagenomic design map" to enable straightforward analysis and design of future projects. The other gives an overview of the data requirements for various types of sequencing milestones, including a desired number of contact reads and contig length, for detection of a rare viral species.

AB - Metagenomic project design has relied variously upon speculation, semi-empirical and ad hoc heuristic models, and elementary extensions of single-sample Lander-Waterman expectation theory, all of which are demonstrably inadequate. Here, we propose an approach based upon a generalization of Stevens' Theorem for randomly covering a domain. We extend this result to account for the presence of multiple species, from which are derived useful probabilities for fully recovering a particular target microbe of interest and for average contig length. These show improved specificities compared to older measures and recommend deeper data generation than the levels chosen by some early studies, supporting the view that poor assemblies were due at least somewhat to insufficient data. We assess predictions empirically by generating roughly 4.5 Gb of sequence from a twelve-member bacterial community, comparing coverage for two particular members, Selenomonas artemidis and Enterococcus faecium, which are the least (∼3 %) and most (∼12 %) abundant species, respectively. Agreement is reasonable, with differences likely attributable to coverage biases. We show that, in some cases, bias is simple in the sense that a small reduction in read length to simulate less efficient covering brings data and theory into essentially complete accord. Finally, we describe two applications of the theory. One plots coverage probability over the relevant parameter space, constructing essentially a "metagenomic design map" to enable straightforward analysis and design of future projects. The other gives an overview of the data requirements for various types of sequencing milestones, including a desired number of contact reads and contig length, for detection of a rare viral species.

KW - Coverage

KW - DNA sequencing

KW - Metagenomics

KW - Microbiome

UR - http://www.scopus.com/inward/record.url?scp=84885381231&partnerID=8YFLogxK

U2 - 10.1007/s00285-012-0586-x

DO - 10.1007/s00285-012-0586-x

M3 - Article

C2 - 22965653

AN - SCOPUS:84885381231

SN - 0303-6812

VL - 67

SP - 1141

EP - 1161

JO - Journal of Mathematical Biology

JF - Journal of Mathematical Biology

IS - 5

ER -