TY - JOUR
T1 - Operon prediction without a training set
AU - Westover, B. P.
AU - Buhler, J. D.
AU - Sonnenburg, J. L.
AU - Gordon, J. I.
N1 - Funding Information:
The authors wish to thank Jeremy Weatherford for invaluable assistance in revising the manuscript and preparing the software for distribution. This work was supported by NSF awards DBI-0237902 and EF-0333284, and by NIH award CDK30292.
PY - 2005/4/1
Y1 - 2005/4/1
N2 - Motivation: Annotation of operons in a bacterial genome is an important step in determining an organism's transcriptional regulatory program. While extensive studies of operon structure have been carried out in a few species such as Escherichia coli, fewer resources exist to inform operon prediction in newly sequenced genomes. In particular, many extant operon finders require a large body of training examples to learn the properties of operons in the target organism. For newly sequenced genomes, such examples are generally not available; moreover, a model of operons trained on one species may not reflect the properties of other, distantly related organisms. We encountered these issues in the course of predicting operons in the genome of Bacteroides thetaiotaomicron (B.theta), a common anaerobe that is a prominent component of the normal adult human intestinal microbial community. Results: We describe an operon predictor designed to work without extensive training data. We rely on a small set of a priori assumptions about the properties of the genome being annotated that permit estimation of the probability that two adjacent genes lie in a common operon. Predictions integrate several sources of information, including intergenic distance, common functional annotation and a novel formulation of conserved gene order. We validate our predictor both on the known operons of E.coli and on the genome of B.theta, using expression data to evaluate our predictions in the latter.
UR - http://www.scopus.com/inward/record.url?scp=16344390358&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/bti123
DO - 10.1093/bioinformatics/bti123
M3 - Article
C2 - 15539453
AN - SCOPUS:16344390358
SN - 1367-4803
VL - 21
SP - 880
EP - 888
JO - Bioinformatics
JF - Bioinformatics
IS - 7
ER -