Diversity deck, diversity set, diversity collection: most chemical screening efforts try to keep some set of compounds that have been selected for being as unlike each other as possible. Fragment-based collections, being smaller by design, are particularly combed through for this property, in order to cover as much chemical space as possible. But how, exactly, do you evaluate chemical diversity?
There are a lot of algorithmic approaches, and a new paper helpfully tries to sort them out for everyone. Here's the take-home:
We assessed both the similar behavior of the descriptors in assessing the diversity of chemical libraries, and their ability to select compounds from libraries that are diverse in bioactivity space, which is a property of much practical relevance in screening library design. This is particularly evident, given that many future targets to be screened are not known in advance, but that the library should still maximize the likelihood of containing bioactive matter also for future screening campaigns. Overall, our results showed that descriptors based on atom topology (i.e., fingerprint-based descriptors and pharmacophore-based descriptors) correlate well in rank-ordering compounds, both within and between descriptor types. On the other hand, shape-based descriptors such as ROCS and PMI showed weak correlation with the other descriptors utilized in this study, demonstrating significantly different behavior.
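For readers who want something concrete, here is a minimal sketch of what "picking a diverse subset" can look like with a fingerprint-type descriptor. It is not the paper's procedure: it assumes each compound is represented as a set of "on" bits, scores pairs by Tanimoto similarity, and uses a greedy MaxMin pick as a stand-in selection rule. The names `tanimoto`, `maxmin_pick`, and `toy_library` are made up for illustration, and the bit sets are invented rather than computed from real molecules.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def maxmin_pick(fingerprints, n_pick, seed_index=0):
    """Greedy MaxMin selection; returns indices of the picked compounds.

    Starting from a seed, repeatedly add the compound whose distance to
    its nearest already-picked neighbour is largest.
    """
    if n_pick <= 0 or not fingerprints:
        return []
    picked = [seed_index]
    # min_dist[i] = distance from compound i to its nearest picked compound
    min_dist = [1.0 - tanimoto(fp, fingerprints[seed_index]) for fp in fingerprints]
    while len(picked) < min(n_pick, len(fingerprints)):
        candidate = max(
            (i for i in range(len(fingerprints)) if i not in picked),
            key=lambda i: min_dist[i],
        )
        picked.append(candidate)
        # Update nearest-picked distances in light of the new pick
        for i, fp in enumerate(fingerprints):
            d = 1.0 - tanimoto(fp, fingerprints[candidate])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


# Toy fingerprints (invented bit sets, not real molecules):
# two near-duplicate pairs plus one outlier.
toy_library = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {10, 11, 12},
    {10, 11, 13},
    {20, 21},
]
selection = maxmin_pick(toy_library, 3)
print(selection)
```

On this toy library the pick takes one member from each near-duplicate pair plus the outlier. Swapping in a different descriptor only changes the distance function, which is why the question of whether descriptors rank compounds the same way matters so much.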
One of the best-performing methods was Bayes Activity Fingerprints, a technique proposed a few years ago by a group at Novartis. That (at least to my non-computational eyes) doesn't seem too surprising, since this new paper is trying to see how well diversity measures perform when compared to bioactivity space, and that earlier one was specifically adding in a measure to account for bioactivity space as well.
On the other hand, shape-based descriptors were problematic. One that turns up a lot is Principal Moments of Inertia (PMI), the scheme that separates compounds into rod-like, disk-like, and sphere-like shape families, but it and ROCS (based on overlaying molecular volumes) were definitely off in their own world when compared to the other descriptors. In fact, the authors found that there seemed to be no correlation at all between PMI diversity and diverse bioactivity, which is worth thinking about. You'd apparently do better just picking things randomly than using PMI.
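To make the rod/disk/sphere idea less abstract, here is a rough sketch of how a PMI classification is commonly done: compute the three principal moments of inertia (I1 <= I2 <= I3) from 3D coordinates, form the normalized ratios I1/I3 and I2/I3, and see which corner of the resulting triangle the point sits nearest, with rod at (0, 1), disk at (0.5, 0.5), and sphere at (1, 1). I'm assuming that normalized-ratio plot is the scheme meant here. The function names are hypothetical, the coordinates are toy point sets with unit masses rather than real conformers, and a real workflow would take conformers and atomic masses from a cheminformatics toolkit.

```python
import math


def principal_moments(coords, masses=None):
    """Return the three principal moments of inertia, sorted ascending."""
    if masses is None:
        masses = [1.0] * len(coords)  # assumption: unit masses for the toy case
    total = sum(masses)
    # Centre of mass
    cx = sum(m * x for m, (x, y, z) in zip(masses, coords)) / total
    cy = sum(m * y for m, (x, y, z) in zip(masses, coords)) / total
    cz = sum(m * z for m, (x, y, z) in zip(masses, coords)) / total
    # Inertia tensor about the centre of mass
    ixx = iyy = izz = ixy = ixz = iyz = 0.0
    for m, (x, y, z) in zip(masses, coords):
        x, y, z = x - cx, y - cy, z - cz
        ixx += m * (y * y + z * z)
        iyy += m * (x * x + z * z)
        izz += m * (x * x + y * y)
        ixy -= m * x * y
        ixz -= m * x * z
        iyz -= m * y * z
    # Closed-form eigenvalues of a symmetric 3x3 matrix (trigonometric method)
    p1 = ixy ** 2 + ixz ** 2 + iyz ** 2
    if p1 < 1e-12:  # already diagonal
        return sorted([ixx, iyy, izz])
    q = (ixx + iyy + izz) / 3.0
    p2 = (ixx - q) ** 2 + (iyy - q) ** 2 + (izz - q) ** 2 + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b11, b22, b33 = (ixx - q) / p, (iyy - q) / p, (izz - q) / p
    b12, b13, b23 = ixy / p, ixz / p, iyz / p
    det_b = (b11 * (b22 * b33 - b23 * b23)
             - b12 * (b12 * b33 - b13 * b23)
             + b13 * (b12 * b23 - b22 * b13))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    e_max = q + 2.0 * p * math.cos(phi)
    e_min = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    e_mid = 3.0 * q - e_max - e_min
    return [e_min, e_mid, e_max]


def normalized_pmi(coords, masses=None):
    """Return (I1/I3, I2/I3), the coordinates on the PMI triangle plot."""
    i1, i2, i3 = principal_moments(coords, masses)
    if i3 <= 0.0:
        raise ValueError("need at least two distinct points")
    return i1 / i3, i2 / i3


# Corners of the normalized PMI triangle
CORNERS = {"rod": (0.0, 1.0), "disk": (0.5, 0.5), "sphere": (1.0, 1.0)}


def nearest_shape(coords, masses=None):
    """Label a point set by the closest corner of the PMI triangle."""
    r1, r2 = normalized_pmi(coords, masses)
    return min(CORNERS,
               key=lambda k: math.hypot(r1 - CORNERS[k][0], r2 - CORNERS[k][1]))


# Toy shapes: three collinear points, and four points on a square
rod = [(-1, 0, 0), (0, 0, 0), (1, 0, 0)]
disk = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
print(nearest_shape(rod), nearest_shape(disk))
```

Note how little goes into this descriptor: two numbers per conformer, with no information about which atoms or functional groups are present. That may be part of why it tracks so poorly with bioactivity, though that is my speculation and not a claim from the paper.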