Identification of protein coding regions in genomic DNA

Eric E. Snyder, Gary D. Stormo

Research output: Contribution to journal (Article)

139 Scopus citations

Abstract

We have developed a computer program, GeneParser, which identifies and determines the fine structure of protein genes in genomic DNA sequences. The program scores all subintervals in a sequence for content statistics indicative of introns and exons, and for sites that identify their boundaries. This information is weighted by a neural network to approximate the log-likelihood that each subinterval exactly represents an intron or exon (first, internal or last). A dynamic programming algorithm is then applied to this data to find the combination of introns and exons that maximizes the likelihood function. Using this method, we can rapidly generate ranked suboptimal solutions, each of which is the optimum solution containing a given intron-exon junction. We have tested the system on a large collection of human genes. On sequences not used in training, we achieved a correlation coefficient for exon nucleotide prediction of 0.89. For a subset of G + C-rich genes, a correlation coefficient of 0.94 was achieved. We have also quantified the robustness of the method to substitution and frame-shift errors and show how the system can be optimized for performance on sequences with known levels of sequencing errors.
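The dynamic programming step described in the abstract can be illustrated with a minimal sketch. This is not the GeneParser implementation: it assumes the per-interval log-likelihood scores are already available through two hypothetical callables, `exon_score(i, j)` and `intron_score(i, j)`, it ignores the distinction between first, internal and last exons, and it returns only the single best parse rather than ranked suboptimal solutions.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, int, int]  # (kind, start, end) with half-open interval [start, end)
ScoreFn = Callable[[int, int], float]


def best_parse(n: int, exon_score: ScoreFn, intron_score: ScoreFn) -> Tuple[float, List[Segment]]:
    """Find the highest-scoring alternation exon, intron, exon, ... covering [0, n).

    Simplifying assumption (not from the paper): the parse starts and ends with an exon.
    """
    if n < 1:
        raise ValueError("sequence length must be at least 1")
    neg_inf = float("-inf")
    # best[kind][j]: best total score of a parse of [0, j) whose last segment is `kind`
    best = {"exon": [neg_inf] * (n + 1), "intron": [neg_inf] * (n + 1)}
    # back[kind][j]: start position of that last segment
    back = {"exon": [0] * (n + 1), "intron": [0] * (n + 1)}

    for j in range(1, n + 1):
        for i in range(j):
            # An exon [i, j) is either the first segment or follows an intron ending at i.
            prev = 0.0 if i == 0 else best["intron"][i]
            if prev > neg_inf and prev + exon_score(i, j) > best["exon"][j]:
                best["exon"][j] = prev + exon_score(i, j)
                back["exon"][j] = i
            # An intron [i, j) must follow an exon ending at i, so i > 0.
            prev = best["exon"][i]
            if i > 0 and prev > neg_inf and prev + intron_score(i, j) > best["intron"][j]:
                best["intron"][j] = prev + intron_score(i, j)
                back["intron"][j] = i

    # Trace back from the final exon, alternating segment kinds.
    segments: List[Segment] = []
    kind, j = "exon", n
    while j > 0:
        i = back[kind][j]
        segments.append((kind, i, j))
        kind = "intron" if kind == "exon" else "exon"
        j = i
    segments.reverse()
    return best["exon"][n], segments


# Toy per-position log-odds (made up for illustration): positive favours coding.
coding = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]


def exon_score(i: int, j: int) -> float:
    return float(sum(coding[i:j]))


def intron_score(i: int, j: int) -> float:
    return float(-sum(coding[i:j]))


score, segments = best_parse(len(coding), exon_score, intron_score)
print(score, segments)
```

The sketch evaluates every subinterval, so it runs in quadratic time in the sequence length. In the system the abstract describes, the interval scores would come from the neural network's weighting of content and site statistics rather than from a toy array.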

Original language: English
Pages (from-to): 1-18
Number of pages: 18
Journal: Journal of Molecular Biology
Volume: 248
Issue number: 1
State: Published - Apr 21 1995
Externally published: Yes

Keywords

  • Artificial intelligence
  • Coding sequence
  • Dynamic programming
  • Exon structure
  • Gene identification
