TY - JOUR
T1 - Identification of protein coding regions in genomic DNA
AU - Snyder, Eric E.
AU - Stormo, Gary D.
N1 - Funding Information:
We thank Roderic Guigo and Richard Mural for discussions on the general problem of gene identification, and Alan Lapedes and Andrzej Ehrenfeucht for helpful suggestions on the use of neural networks in this system. This work benefitted from discussions at the Recognizing Genes workshop at the Aspen Center for Physics. This work was supported by NIH grant HG00249 and DOE grant ER61606.
PY - 1995/4/21
Y1 - 1995/4/21
AB - We have developed a computer program, GeneParser, which identifies and determines the fine structure of protein genes in genomic DNA sequences. The program scores all subintervals in a sequence for content statistics indicative of introns and exons, and for sites that identify their boundaries. This information is weighted by a neural network to approximate the log-likelihood that each subinterval exactly represents an intron or exon (first, internal or last). A dynamic programming algorithm is then applied to this data to find the combination of introns and exons that maximizes the likelihood function. Using this method, we can rapidly generate ranked suboptimal solutions, each of which is the optimum solution containing a given intron-exon junction. We have tested the system on a large collection of human genes. On sequences not used in training, we achieved a correlation coefficient for exon nucleotide prediction of 0.89. For a subset of G + C-rich genes, a correlation coefficient of 0.94 was achieved. We have also quantified the robustness of the method to substitution and frame-shift errors and show how the system can be optimized for performance on sequences with known levels of sequencing errors.
KW - Artificial intelligence
KW - Coding sequence
KW - Dynamic programming
KW - Exon structure
KW - Gene identification
UR - http://www.scopus.com/inward/record.url?scp=0028965444&partnerID=8YFLogxK
U2 - 10.1006/jmbi.1995.0198
DO - 10.1006/jmbi.1995.0198
M3 - Article
C2 - 7731036
AN - SCOPUS:0028965444
SN - 0022-2836
VL - 248
SP - 1
EP - 18
JO - Journal of Molecular Biology
JF - Journal of Molecular Biology
IS - 1
ER -