One- To four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties

Chloé Agathe Azencott, Alexandre Ksikes, S. Joshua Swamidass, Jonathan H. Chen, Liva Ralaivola, Pierre Baldi

Research output: Contribution to journalArticlepeer-review

54 Scopus citations


Many chemoinformatics applications, including high-throughput virtual screening, benefit from being able to rapidly predict the physical, chemical, and biological properties of small molecules to screen large repositories and identify suitable candidates. When training sets are available, machine learning methods provide an effective alternative to ab initio methods for these predictions. Here, we leverage rich molecular representations including ID SMILES strings, 2D graphs of bonds, and 3D coordinates to derive efficient machine learning kernels to address regression problems. We further expand the library of available spectral kernels for small molecules developed for classification problems to include 2.5D surface and 3D kernels using Delaunay tetrahedrization and other techniques from computational geometry, 3D pharmacophore kernels, and 3.5D or 4D kernels capable of taking into account multiple molecular configurations, such as conformers. The kernels are comprehensively tested using cross-validation and redundancy-reduction methods on regression problems using several available data sets to predict boiling points, melting points, aqueous solubility, octanol/water partition coefficients, and biological activity with state-of-the art results. When sufficient training data are available, 2D spectral kernels in general tend to yield the best and most robust results, better than state-of-the art. On data sets containing thousands of molecules, the kernels achieve a squared correlation coefficient of 0.91 for aqueous solubility prediction and 0.94 for octanol/water partition coefficient prediction. Averaging over conformations improves the performance of kernels based on the three-dimensional structure of molecules, especially on challenging data sets. Kernel predictors for aqueous solubility (kSOL), LogP (kLOGP), and melting point (kMELT) are available over the Web through:

Original languageEnglish
Pages (from-to)965-974
Number of pages10
JournalJournal of Chemical Information and Modeling
Issue number3
StatePublished - 2007


Dive into the research topics of 'One- To four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties'. Together they form a unique fingerprint.

Cite this