Abstract

Power-law distributions have been observed in a wide variety of areas. To our knowledge, however, there has been no systematic observation of power-law distributions in chemoinformatics. Here, we present several examples of power-law distributions arising from the features of small organic molecules. The distributions of rigid segments and ring systems, the distributions of molecular paths and circular substructures, and the sizes of molecular similarity clusters all show linear trends on log-log rank/frequency plots, suggesting underlying power-law distributions. The number of unique features also follows Heaps'-like laws. The characteristic exponents of the power-laws lie in the 1.5-3 range, consistent with the exponents observed in other power-law phenomena. The power-law nature of these distributions leads to several applications, including the prediction of the growth of available data through Heaps' law and the optimal allocation of experimental or computational resources via the 80/20 rule. More importantly, we also show how the power-laws can be leveraged to compress chemical fingerprints efficiently and losslessly, which is useful for the improved storage and retrieval of molecules in large chemical databases.
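
As a rough illustration of the rank/frequency diagnostic mentioned in the abstract, the following minimal sketch sorts feature counts by rank and fits a straight line to log(frequency) against log(rank). The function names `rank_frequency` and `loglog_slope` and the synthetic counts are assumptions made for this example; they are not taken from the paper, and a least-squares fit on a log-log plot is only a quick check, not a rigorous test for a power law.

```python
import math


def rank_frequency(counts):
    """Sort counts in descending order and pair each with its 1-based rank."""
    ordered = sorted(counts, reverse=True)
    return [(rank, freq) for rank, freq in enumerate(ordered, start=1)]


def loglog_slope(points):
    """Least-squares slope and intercept of log(freq) against log(rank).

    All frequencies must be strictly positive, since their logarithm is taken.
    """
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(freq) for _, freq in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical synthetic data: frequency = 1000 / rank, an exact power law.
counts = [1000.0 / r for r in range(1, 201)]
slope, intercept = loglog_slope(rank_frequency(counts))
```

On this synthetic input the fitted slope is close to -1 and the intercept is close to log(1000). With real feature counts, a roughly linear trend with a stable slope over several orders of magnitude is what suggests an underlying power law.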

Original language: English
Pages (from-to): 1138-1151
Number of pages: 14
Journal: Journal of Chemical Information and Modeling
Volume: 48
Issue number: 6
State: Published - Jun 2008

Title: Discovery of power-laws in chemical space