About this Author

Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany for his post-doc on a Humboldt Fellowship. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis and other diseases. To contact Derek, email him directly. Twitter: Dereklowe


In the Pipeline


December 20, 2013

Picking Diverse Compounds


Posted by Derek

Diversity deck, diversity set, diversity collection: most chemical screening efforts try to have some bunch of compounds that are selected for being as unlike each other as possible. Fragment-based collections, being smaller by design, are particularly combed through for this property, in order to cover the most chemical space possible. But how, exactly, do you evaluate chemical diversity?

There are a lot of algorithmic approaches, and a new paper helpfully tries to sort them out for everyone. Here's the take-home:

We assessed both the similar behavior of the descriptors in assessing the diversity of chemical libraries, and their ability to select compounds from libraries that are diverse in bioactivity space, which is a property of much practical relevance in screening library design. This is particularly evident, given that many future targets to be screened are not known in advance, but that the library should still maximize the likelihood of containing bioactive matter also for future screening campaigns. Overall, our results showed that descriptors based on atom topology (i.e., fingerprint-based descriptors and pharmacophore-based descriptors) correlate well in rank-ordering compounds, both within and between descriptor types. On the other hand, shape-based descriptors such as ROCS and PMI showed weak correlation with the other descriptors utilized in this study, demonstrating significantly different behavior.

One of the best-performing methods was Bayes Activity Fingerprints, a technique proposed a few years ago by a group at Novartis. That (at least to my non-computational eyes) doesn't seem too surprising, since this new paper is trying to see how well diversity measures perform when compared to bioactivity space, and that earlier one was specifically adding in a measure to account for bioactivity space as well.

On the other hand, shape-based descriptors were problematic. One that turns up a lot is Principal Moments of Inertia (PMI), the scheme that separates compounds into rod-like, disk-like, and sphere-like shape families, but it and ROCS (based on overlaying molecular volumes) were definitely off in their own world when compared to the other descriptors. In fact, the authors found that there seemed to be no correlation at all between PMI diversity and diverse bioactivity, which should be worth thinking about. You'd apparently do better just picking things randomly than using PMI.
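For readers who haven't seen how the rod/disk/sphere classification comes about, here is a minimal sketch. It assumes a single fixed conformer supplied as Cartesian coordinates plus atomic masses, and the function name `normalized_pmi_ratios` is made up for illustration. It is not the procedure used in the paper, just the usual normalized-ratio idea: sort the three principal moments and divide the two smaller ones by the largest.

```python
import numpy as np

def normalized_pmi_ratios(coords, masses):
    """Return (I1/I3, I2/I3) for one conformer, with I1 <= I2 <= I3."""
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)
    # Shift coordinates so the center of mass sits at the origin.
    center = np.average(coords, axis=0, weights=masses)
    r = coords - center
    # Inertia tensor: sum_i m_i * (|r_i|^2 * Identity - r_i r_i^T)
    sq_norms = np.sum(r * r, axis=1)
    tensor = np.sum(masses * sq_norms) * np.eye(3) - (masses[:, None] * r).T @ r
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    i1, i2, i3 = np.linalg.eigvalsh(tensor)
    return i1 / i3, i2 / i3

# Toy "molecules" built from unit-mass points (illustrative only).
rod = normalized_pmi_ratios([(-1, 0, 0), (0, 0, 0), (1, 0, 0)], [1, 1, 1])
disk = normalized_pmi_ratios([(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)], [1, 1, 1, 1])
sphere = normalized_pmi_ratios(
    [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)], [1] * 6
)
print(rod, disk, sphere)  # close to (0, 1), (0.5, 0.5), (1, 1)
```

The three toy shapes land on the corners of the familiar PMI triangle. A real molecule gives one point per conformer, so how you pick or average conformers is an extra decision that this sketch leaves out.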

Comments (13) + TrackBacks (0) | Category: In Silico


1. Anonymous on December 20, 2013 8:15 AM writes...

If only drug discovery processes were as diverse: We might find one that works much more efficiently.


2. Hap on December 20, 2013 8:22 AM writes...

Isn't PMI what the Schreiber/Broad groups use to argue (in part) that their diversity-oriented libraries are more diverse than other more combinatorial libraries?


3. Pete on December 20, 2013 8:50 AM writes...

Pure shape-based methods don't encode chemical information like an atom's interaction type. Benzamidinium cation and benzoate anion are very similar in shape. Using PMI to classify molecules as rod-like, disk-like and sphere-like is effectively a binning procedure, which raises the question of why not just use the PMI descriptors themselves. How one handles conformational space is a huge issue when using molecular shape-based descriptors.


4. Anonymous on December 20, 2013 9:26 AM writes...

Hap - yes, they have used PMI, among many other metrics, to compare "shape diversity" among small molecule libraries. Schreiber himself has said repeatedly that he doesn't know whether or not PMI is useful in this regard (and especially for predicting the overall "performance" of a library... which is another word worth defining), only that it's data that is easy to generate and allows one to test hypotheses regarding the "shape" (as defined by PMI) of a molecule, despite its limitations. And my sense is that this analysis is usually retrospective and not actually used in the planning of a library, although I could be wrong.

In the end, I'm happy to see these types of publications attempt to assess the assessments. As chemists, we intuitively know that shape is an important feature that affects a compound's activity. But is PMI the best calculation of "shape"? Probably not.


5. Hap on December 20, 2013 9:45 AM writes...

I wasn't trying to argue that it makes the DOS libraries he and Broad do worthless. I was concluding that PMI's use as a comparison measure and as evidence of diversity might not be all that helpful; if PMI doesn't correlate to biological diversity (which is the real point of the libraries) then whatever PMI diversity is shown isn't really relevant, other than in showing the DOS libraries are different than others.


6. Curious Wavefunction on December 20, 2013 10:14 AM writes...

It's a nice paper. PMI has lately been a popular metric; the Broad group along with David Spring's group at Cambridge and Derek Tan's group at Sloan-Kettering have used it to design libraries including ones with macrocycles, but since they haven't published the results of this design in terms of hit rate the verdict is still out on whether PMIs actually give you biological diversity. The original 2003 paper by Sauer and Schwarz in JCIM is worth reading though.

The problem with 3D methods is that they tend to introduce a lot of noise and other complications. 2D methods often work better not because they are better per se but because they strip off this extraneous noise. It's interesting how diverse (pun) studies (including Shoichet's Nature study on the similarity of drugs) have pointed to the utility of simple 2D fingerprints, especially ECFP4.

We in the macrocycle field especially struggle with this whole issue of diversity (and especially to what extent building block diversity correlates with product diversity). Ultimately there is really no final answer on what the 'correct' diversity metric is for getting biological diversity, so even after hours of brainstorming you inevitably end up throwing a few in the mix, crossing your fingers and spinning around thrice.


7. Anonymous on December 20, 2013 10:37 AM writes...

"Ultimately there is really no final answer on what the 'correct' diversity metric is for getting biological diversity, so even after hours of brainstorming you inevitably end up throwing a few in the mix, crossing your fingers and spinning around thrice."

I have tried spinning around only twice, but to no avail. I will have to try your 'thrice' method, thanks for the pointer! :-)


8. Anonymous on December 20, 2013 11:12 AM writes...

There is of course no perfect metric for chemical diversity. Probably the best metric is the geometric mean of all other diversity metrics one can dream up!


9. DCRogers on December 20, 2013 11:13 AM writes...

One reason diversity is a hard problem (as suggested by continuing publications on this topic) is that random selection sets a pretty reasonable baseline, and is computationally easy and cheap.

An analogous problem is the estimation of an integral over N-dimensional space from a set of sampled points. Certainly, optimally-positioned sample points are superior, but randomly-chosen points still give pretty damn good estimates.
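The integration analogy is easy to try out. The following is a minimal sketch, with an integrand and function names chosen purely for illustration: it estimates the integral of the sum of squared coordinates over the unit hypercube from uniformly random points and compares that with the exact value, which is dim/3.

```python
import random

def mc_integral(f, dim, n_samples, seed=0):
    """Estimate the integral of f over [0, 1]^dim from uniform random samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        point = [rng.random() for _ in range(dim)]
        total += f(point)
    # The hypercube has volume 1, so the mean of f is the integral estimate.
    return total / n_samples

def sum_of_squares(x):
    return sum(v * v for v in x)

dim = 6
estimate = mc_integral(sum_of_squares, dim=dim, n_samples=20000)
exact = dim / 3  # each coordinate contributes the integral of x^2 on [0, 1], i.e. 1/3
print(estimate, exact)
```

The error of the random estimate shrinks roughly as one over the square root of the number of samples, whatever the dimension, which is one way of seeing why random selection is such a stubborn baseline.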


10. @kayakphilip on December 20, 2013 1:22 PM writes...

Interesting timing for this blog article.

I'm working on a POC this break on using Spotfire to help with a similar issue. Note that Spotfire is only using the built-in, or a third-party, fingerprint to do the actual analysis of similarity/diversity.

The question posed to us, however, was this: if you have e.g. an SDFile of a set of commercial compounds, but can only afford to buy a given number, can you have something tell you which ones to buy once you tell it how many you can afford?

I hope to have a video or something up in the new year but it was a different way of thinking of things for me. I'd been thinking of chemical clustering as good for identifying things that were similar, but this use case seems very much more interesting in some ways.

Interesting thoughts re the different models, I'd have to research how many different models we could incorporate.
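One common way to approach the "which N should I buy" question is a greedy MaxMin pick: start from one compound, then repeatedly add whichever remaining compound is farthest from everything already chosen. The sketch below is only an illustration of that idea and is not how Spotfire does it. It assumes each compound has already been reduced to a set of "on" fingerprint bits by whatever fingerprint you prefer, and the function names and the toy library are invented.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union

def maxmin_pick(fingerprints, n_pick, first=0):
    """Greedily pick n_pick indices, each time taking the compound
    whose nearest already-picked neighbor is farthest away."""
    n_pick = min(n_pick, len(fingerprints))
    if n_pick == 0:
        return []
    picked = [first]
    # min_dist[i] = distance from compound i to its nearest picked compound
    min_dist = [tanimoto_distance(fp, fingerprints[first]) for fp in fingerprints]
    while len(picked) < n_pick:
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fingerprints[nxt]))
    return picked

# Toy library: two pairs of near-duplicates plus one singleton (made-up bits).
library = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {10, 11, 12},
    {10, 11, 13},
    {20, 21},
]
print(maxmin_pick(library, 3))
```

With a budget of three, the pick takes one compound from each near-duplicate pair plus the singleton. The result depends on the starting compound and on the fingerprint, which is exactly where the descriptor comparisons in the paper come in.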


11. DCRogers on December 20, 2013 2:18 PM writes...

One more thing: an under-appreciated aspect of descriptor selection has to do with their dimensionality, independent of their content. The choice of dimensionality itself is a choice about the distance structure of your data space.

Sets of whole-molecule descriptors (logP, molecular weight, etc) typically can be compressed using PCA to a handful of dimensions - useful for visualization, and with several choices of well-behaved distance metrics.

But this amounts to building in an assumption - if the information of interest cannot be represented in low dimensions, the results will be limited in quality.

High dimensions, on the other hand, are hard to visualize; worse, in high-dimensional spaces, our natural intuitions about distances break down. In short, from the perspective of any single sample, most other samples are 'the same' distance away - that is, far. Distance is basically uninformative other than in a tight near-neighborhood. (For an explanation of this, see Kanerva's "Sparse Distributed Memory" book.)

Such descriptors can be useful for de-cluttering local neighborhoods of near-duplicates, but that's about it -- measures of distance between different neighborhoods are effectively random.

TL;DR -- don't read too much into the value of the content of a descriptor when many effects are explained mostly by its dimensionality.
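The "everything is about equally far away" effect can be seen with a few lines of code. This is a minimal sketch under an assumption that is not in the comment above: points are drawn uniformly from a unit hypercube rather than from real descriptor data, and `relative_spread` is an invented helper that reports the standard deviation of the distances from one point to all the others, divided by their mean.

```python
import math
import random

def relative_spread(dim, n_points=200, seed=0):
    """Std/mean of Euclidean distances from one random point to the rest in [0, 1]^dim."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    ref = points[0]
    dists = [math.dist(ref, p) for p in points[1:]]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return math.sqrt(var) / mean

for dim in (2, 10, 100, 1000):
    # The ratio shrinks as dim grows: distances bunch up around their mean.
    print(dim, round(relative_spread(dim), 3))
```

As the dimension goes up, the spread relative to the mean collapses, so ranking far-apart points by distance carries less and less information. Real fingerprints are sparse and correlated, so the numbers will differ, but the trend is the one described above.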


12. Kelvin Stott on December 20, 2013 3:34 PM writes...

Graph theory might help solve this problem.


13. Der Hindenburg on December 20, 2013 6:49 PM writes...

#3 von K is sage in these matters.
