TY - JOUR
T1 - Identification of coding regions in genomic DNA sequences
T2 - An application of dynamic programming and neural networks
AU - Snyder, Eric E.
AU - Stormo, Gary D.
N1 - Funding Information:
We would like to thank Roderic Guigo and Steen Knudsen for helpful discussion regarding performance analysis and test data selection as well as generously providing their data on the performance of GeneID and GRAIL on the test data. This work benefitted from discussions at the 'Recognizing Genes' workshop at the Aspen Center for Physics. This work was supported by NIH grant HG00249.
PY - 1993/2/11
Y1 - 1993/2/11
N2 - Dynamic programming (DP) is applied to the problem of precisely identifying internal exons and introns in genomic DNA sequences. The program GeneParser first scores the sequence of interest for splice sites and for these intron- and exon-specific content measures: codon usage, local compositional complexity, 6-tuple frequency, length distribution and periodic asymmetry. This information is then organized for interpretation by DP. GeneParser employs the DP algorithm to enforce the constraints that introns and exons must be adjacent and non-overlapping and finds the highest scoring combination of introns and exons subject to these constraints. Weights for the various classification procedures are determined by training a simple feedforward neural network to maximize the number of correct predictions. In a pilot study, the system has been trained on a set of 56 human gene fragments containing 150 internal exons in a total of 158,691 bps of genomic sequence. When tested against the training data, GeneParser precisely identifies 75% of the exons and correctly predicts 86% of coding nucleotides as coding while only 13% of non-exon bps were predicted to be coding. This corresponds to a correlation coefficient for exon prediction of 0.85. Because of the simplicity of the network weighting scheme, generalization performance is nearly as good as with the training set.
AB - Dynamic programming (DP) is applied to the problem of precisely identifying internal exons and introns in genomic DNA sequences. The program GeneParser first scores the sequence of interest for splice sites and for these intron- and exon-specific content measures: codon usage, local compositional complexity, 6-tuple frequency, length distribution and periodic asymmetry. This information is then organized for interpretation by DP. GeneParser employs the DP algorithm to enforce the constraints that introns and exons must be adjacent and non-overlapping and finds the highest scoring combination of introns and exons subject to these constraints. Weights for the various classification procedures are determined by training a simple feedforward neural network to maximize the number of correct predictions. In a pilot study, the system has been trained on a set of 56 human gene fragments containing 150 internal exons in a total of 158,691 bps of genomic sequence. When tested against the training data, GeneParser precisely identifies 75% of the exons and correctly predicts 86% of coding nucleotides as coding while only 13% of non-exon bps were predicted to be coding. This corresponds to a correlation coefficient for exon prediction of 0.85. Because of the simplicity of the network weighting scheme, generalization performance is nearly as good as with the training set.
UR - http://www.scopus.com/inward/record.url?scp=0027229632&partnerID=8YFLogxK
U2 - 10.1093/nar/21.3.607
DO - 10.1093/nar/21.3.607
M3 - Article
C2 - 8441672
AN - SCOPUS:0027229632
SN - 0305-1048
VL - 21
SP - 607
EP - 613
JO - Nucleic Acids Research
JF - Nucleic Acids Research
IS - 3
ER -