TY - JOUR
T1 - Probabilistic code for DNA recognition by proteins of the EGR family
AU - Benos, Panayiotis V.
AU - Lapedes, Alan S.
AU - Stormo, Gary D.
N1 - Funding Information: This work was supported by NIH grant HG00249 to G.D.S. The research of A.S.L. was supported by the Department of Energy under contract W-7405-ENG-36. The authors thank the Santa Fe Institute, where part of this work was performed.
PY - 2002
Y1 - 2002
N2 - A recognition code for protein-DNA interactions would allow for the prediction of binding sites based on protein sequence, and the identification of binding proteins for specific DNA targets. Crystallographic studies of protein-DNA complexes showed that a simple, deterministic recognition code does not exist. Here, we present a probabilistic recognition code (P-code) that assigns energies to all possible base-pair-amino acid interactions for the early growth response factor (EGR) family of zinc-finger transcription factors. The specific energy values are determined by a maximum likelihood method using examples from in vitro randomisation experiments (namely, SELEX and phage display) reported in the literature. The accuracy of the model is tested in several ways, including the ability to predict in vivo binding sites of EGR proteins and other non-EGR zinc-finger proteins, and the correlation between predicted and measured binding affinities of various EGR proteins to several different DNA sites. We also show that this model improves significantly upon the prediction capabilities of previous qualitative and quantitative models. The probabilistic code we develop uses information about the interacting positions between the protein and DNA, but we show that such information is not necessary, although it reduces the number of parameters to be determined. We also employ the assumption that the total binding energy is the sum of the energies of the individual contacts, but we describe how that assumption can be relaxed at the cost of additional parameters.
KW - DNA-binding specificity
KW - DNA-protein interactions
KW - Recognition code
KW - Zinc-finger proteins
UR - http://www.scopus.com/inward/record.url?scp=0036977044&partnerID=8YFLogxK
U2 - 10.1016/S0022-2836(02)00917-8
DO - 10.1016/S0022-2836(02)00917-8
M3 - Article
C2 - 12419259
AN - SCOPUS:0036977044
SN - 0022-2836
VL - 323
SP - 701
EP - 727
JO - Journal of Molecular Biology
JF - Journal of Molecular Biology
IS - 4
ER -