TY - JOUR
T1 - Discovery of power-laws in chemical space
AU - Benz, Ryan W.
AU - Swamidass, S. Joshua
AU - Baldi, Pierre
PY - 2008/6
Y1 - 2008/6
N2 - Power-law distributions have been observed in a wide variety of areas. To our knowledge, however, there has been no systematic observation of power-law distributions in chemoinformatics. Here, we present several examples of power-law distributions arising from the features of small, organic molecules. The distributions of rigid segments and ring systems, the distributions of molecular paths and circular substructures, and the sizes of molecular similarity clusters all show linear trends on log-log rank/frequency plots, suggesting underlying power-law distributions. The number of unique features also follows Heaps'-like laws. The characteristic exponents of the power-laws lie in the 1.5-3 range, consistent with the exponents observed in other power-law phenomena. The power-law nature of these distributions leads to several applications, including the prediction of the growth of available data through Heaps' law and the optimal allocation of experimental or computational resources via the 80/20 rule. More importantly, we also show how the power-laws can be leveraged to efficiently compress chemical fingerprints in a lossless manner, useful for the improved storage and retrieval of molecules in large chemical databases.
AB - Power-law distributions have been observed in a wide variety of areas. To our knowledge, however, there has been no systematic observation of power-law distributions in chemoinformatics. Here, we present several examples of power-law distributions arising from the features of small, organic molecules. The distributions of rigid segments and ring systems, the distributions of molecular paths and circular substructures, and the sizes of molecular similarity clusters all show linear trends on log-log rank/frequency plots, suggesting underlying power-law distributions. The number of unique features also follows Heaps'-like laws. The characteristic exponents of the power-laws lie in the 1.5-3 range, consistent with the exponents observed in other power-law phenomena. The power-law nature of these distributions leads to several applications, including the prediction of the growth of available data through Heaps' law and the optimal allocation of experimental or computational resources via the 80/20 rule. More importantly, we also show how the power-laws can be leveraged to efficiently compress chemical fingerprints in a lossless manner, useful for the improved storage and retrieval of molecules in large chemical databases.
UR - http://www.scopus.com/inward/record.url?scp=47349090549&partnerID=8YFLogxK
U2 - 10.1021/ci700353m
DO - 10.1021/ci700353m
M3 - Article
C2 - 18522387
AN - SCOPUS:47349090549
SN - 1549-9596
VL - 48
SP - 1138
EP - 1151
JO - Journal of Chemical Information and Modeling
JF - Journal of Chemical Information and Modeling
IS - 6
ER -