TY - JOUR
T1 - PromFD 1.0
T2 - A computer program that predicts eukaryotic pol II promoters using strings and IMD matrices
AU - Chen, Qing K.
AU - Hertz, Gerald Z.
AU - Stormo, Gary D.
N1 - Funding Information:
We would like to thank Dr Dan Prestridge of the University of Minnesota for the list of non-promoters and for his suggestions. We would also like to thank Dr Michael Main of the University of Colorado for the linked list toolkit used in the program. This research is supported by NIH grants HG00124 and HG00249.
PY - 1997
Y1 - 1997
N2 - Motivation: A large number of new DNA sequences with virtually unknown functions are generated as the Human Genome Project progresses. Therefore, it is essential to develop computer algorithms that can predict the functionality of DNA segments according to their primary sequences, including algorithms that can predict promoters. Although several promoter-predicting algorithms are available, they have high false-positive rates, and their rate of promoter detection needs to be improved further. Results: In this research, PromFD, a computer program to recognize vertebrate RNA polymerase II promoters, has been developed. Both vertebrate promoters and non-promoter sequences are used in the analysis. The promoters are obtained from the Eukaryotic Promoter Database and are divided into a training set and a test set. Non-promoter sequences are obtained from the GenBank sequence databank and are also divided into a training set and a test set. The first step is to search, among all possible strings 5–10 bp long, for patterns that are significantly over-represented in the promoter set. The program also searches for IMD (Information Matrix Database) matrices that have a significantly higher presence in the promoter set. The results of the searches are stored in the PromFD database, and the program PromFD scores input DNA sequences according to their content of the database entries. PromFD predicts promoters, reporting their locations and the location of potential TATA boxes, if found. The program can detect 71% of promoters in the training set with a false-positive rate of under 1 in every 13 000 bp, and 47% of promoters in the test set with a false-positive rate of under 1 in every 9800 bp. PromFD uses a new approach, and its false-positive rate is lower than that of other available promoter recognition algorithms. The source code for PromFD is written in C++.
Availability: PromFD is available for Unix platforms by anonymous ftp to beagle.colorado.edu (cd pub, get promFD.tar). A Java version of the program is also available for Netscape 2.0 at http://beagle.colorado.edu/~chenq. Contact: E-mail: [email protected].
UR - http://www.scopus.com/inward/record.url?scp=0030932088&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/13.1.29
DO - 10.1093/bioinformatics/13.1.29
M3 - Article
C2 - 9088706
AN - SCOPUS:0030932088
SN - 1367-4803
VL - 13
SP - 29
EP - 35
JO - Bioinformatics
JF - Bioinformatics
IS - 1
ER -