TY - GEN
T1 - Statistical distribution of chemical fingerprints
AU - Swamidass, S. Joshua
AU - Baldi, Pierre
PY - 2006/6/23
Y1 - 2006/6/23
N2 - Binary fingerprints are binary vectors used to represent chemical molecules by recording the presence or absence of particular substructures, such as labeled paths in the 2D graph of bonds. Complete fingerprints are often reduced to a compressed format, of typical dimension n = 512 or n = 1024, by using a simple congruence operation. The statistical properties of complete or compressed fingerprint representations are important since fingerprints are used to rapidly search large databases and to develop statistical machine learning methods in chemoinformatics. Here we present an empirical and mathematical analysis of the distribution of complete and compressed fingerprints. In particular, we derive formulas that provide good approximations for the expected number of bits set to one in a compressed fingerprint, given its uncompressed version, and vice versa.
AB - Binary fingerprints are binary vectors used to represent chemical molecules by recording the presence or absence of particular substructures, such as labeled paths in the 2D graph of bonds. Complete fingerprints are often reduced to a compressed format, of typical dimension n = 512 or n = 1024, by using a simple congruence operation. The statistical properties of complete or compressed fingerprint representations are important since fingerprints are used to rapidly search large databases and to develop statistical machine learning methods in chemoinformatics. Here we present an empirical and mathematical analysis of the distribution of complete and compressed fingerprints. In particular, we derive formulas that provide good approximations for the expected number of bits set to one in a compressed fingerprint, given its uncompressed version, and vice versa.
UR - http://www.scopus.com/inward/record.url?scp=33745121144&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:33745121144
SN - 3540325298
SN - 9783540325291
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 11
EP - 18
BT - Fuzzy Logic and Applications - 6th International Workshop, WILF 2005, Revised Selected Papers
T2 - 6th International Workshop - Fuzzy Logic and Applications
Y2 - 15 September 2005 through 17 September 2005
ER -