TY - JOUR
T1 - Information content of binding sites on nucleotide sequences
AU - Schneider, Thomas D.
AU - Stormo, Gary D.
AU - Gold, Larry
AU - Ehrenfeucht, Andrzej
N1 - Funding Information:
We thank many friends and colleagues for their suggestions, criticisms and patience during the years that this work evolved. We also thank Phil Bloch for a current estimate of the coding capacity of E. coli; F. W. Studier for sending us the sequence of T7 before publication; Michael Perry for a general proof for formula (A5); and Kathie Piekarski for typing the manuscript. Computer resources were generously provided by the University of Colorado Academic Computing Services. This work was supported by NIH grant GM28755.
PY - 1986/4/5
Y1 - 1986/4/5
N2 - Repressors, polymerases, ribosomes and other macromolecules bind to specific nucleic acid sequences. They can find a binding site only if the sequence has a recognizable pattern. We define a measure of the information (Rsequence) in the sequence patterns at binding sites. It allows one to investigate how information is distributed across the sites and to compare one site to another. One can also calculate the amount of information (Rfrequency) that would be required to locate the sites, given that they occur with some frequency in the genome. Several Escherichia coli binding sites were analyzed using these two independent empirical measurements. The two amounts of information are similar for most of the sites we analyzed. In contrast, bacteriophage T7 RNA polymerase binding sites contain about twice as much information as is necessary for recognition by the T7 polymerase, suggesting that a second protein may bind at T7 promoters. The extra information can be accounted for by a strong symmetry element found at the T7 promoters. This element may be an operator. If this model is correct, these promoters and operators do not share much information. The comparisons between Rsequence and Rfrequency suggest that the information at binding sites is just sufficient for the sites to be distinguished from the rest of the genome.
AB - Repressors, polymerases, ribosomes and other macromolecules bind to specific nucleic acid sequences. They can find a binding site only if the sequence has a recognizable pattern. We define a measure of the information (Rsequence) in the sequence patterns at binding sites. It allows one to investigate how information is distributed across the sites and to compare one site to another. One can also calculate the amount of information (Rfrequency) that would be required to locate the sites, given that they occur with some frequency in the genome. Several Escherichia coli binding sites were analyzed using these two independent empirical measurements. The two amounts of information are similar for most of the sites we analyzed. In contrast, bacteriophage T7 RNA polymerase binding sites contain about twice as much information as is necessary for recognition by the T7 polymerase, suggesting that a second protein may bind at T7 promoters. The extra information can be accounted for by a strong symmetry element found at the T7 promoters. This element may be an operator. If this model is correct, these promoters and operators do not share much information. The comparisons between Rsequence and Rfrequency suggest that the information at binding sites is just sufficient for the sites to be distinguished from the rest of the genome.
UR - http://www.scopus.com/inward/record.url?scp=0023042012&partnerID=8YFLogxK
U2 - 10.1016/0022-2836(86)90165-8
DO - 10.1016/0022-2836(86)90165-8
M3 - Article
C2 - 3525846
AN - SCOPUS:0023042012
SN - 0022-2836
VL - 188
SP - 415
EP - 431
JO - Journal of Molecular Biology
JF - Journal of Molecular Biology
IS - 3
ER -