About this Author
Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship for his post-doc. He has worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis, and other diseases. To contact Derek, email him directly or find him on Twitter: Dereklowe


In the Pipeline


December 6, 2012

Four Million Compounds to Screen


Posted by Derek

There's a new paper out that does something unique: it compares the screening libraries of two large drug companies, both of which agreed to open their books to each other (up to a point) for the exercise. The closest analog that I know of is when Bayer merged with/bought Schering AG, and the companies published on the differences between the two compound collections as they worked on merging them. (As a sideline, I hope that they've culled some of the things that were in that collection when I worked there. I actually had a gallery of horrible compounds from the files that I kept around to amaze people - it was hard to come up with a functional group that wasn't represented somewhere). That combined Bayer collection (2.75 million compounds) has now been compared with AstraZeneca's (1.4 million compounds). The two of them have clearly been exploring precompetitive collaboration in high-throughput screening, and trying to figure out how much there is to gain.

The first question that comes to mind is how the companies managed this - after all, you wouldn't want another outfit actually strolling through your structures. They got around this problem by using 2-D fingerprints - the ECFP4 system, to be exact. That's a descriptor that carries a lot of structural information without being reversible: you can't reassemble the actual compound from the fingerprint.
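(For the curious, the irreversibility comes from hashing: many different substructure features get folded onto the same bits of a fixed-length vector, so there's no way to run the process backwards. Here's a toy Python sketch of that folding idea - not the real ECFP4 algorithm, and the `fold_features` function and its feature strings are invented for illustration:)

```python
import hashlib

def fold_features(features, n_bits=1024):
    """Fold arbitrary substructure identifiers into a fixed-length bit
    vector. Many features can collide onto the same bit, so the original
    structure can't be reconstructed from the fingerprint."""
    bits = [0] * n_bits
    for feature in features:
        # Stable hash of the feature string, mapped to one bit position
        digest = hashlib.sha1(feature.encode("utf-8")).hexdigest()
        bits[int(digest, 16) % n_bits] = 1
    return bits

# Toy "molecules" described by invented substructure feature strings
aspirin_like = fold_features(["phenyl", "ester", "carboxylic-acid"])
benzoic_like = fold_features(["phenyl", "carboxylic-acid"])
```

Real ECFP4 fingerprints are built in the same general spirit, except that the features are circular atom environments out to a diameter of four bonds, generated by a cheminformatics toolkit (RDKit's Morgan fingerprint is the usual open-source equivalent).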

So what's in these collections, and how much do the two overlap? I think that the main take-away from the paper is the answer to the second question, which is "Not as much as you'd think". Using Tanimoto similarity calculations (ratio of the intersecting set to the union set) for all those molecular fingerprints (with a cutoff of 0.70 for "similar"), they found that about 144,000 compounds in the Bayer collection seem to be duplicated in the AstraZeneca collection. Not surprisingly, these turned out to be commercially available; they'd been bought from the same vendors, most likely. That's not much!
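(For anyone who wants to see the arithmetic, here's a minimal Python sketch of that comparison. The `tanimoto` and `count_similar` functions are illustrative stand-ins for whatever tooling the companies actually used, and the tiny bit vectors stand in for real ECFP4 fingerprints:)

```python
def tanimoto(fp1, fp2):
    """Tanimoto coefficient: size of the intersecting set of "on" bits
    divided by the size of the union set."""
    on1 = {i for i, bit in enumerate(fp1) if bit}
    on2 = {i for i, bit in enumerate(fp2) if bit}
    if not (on1 or on2):
        return 0.0
    return len(on1 & on2) / len(on1 | on2)

def count_similar(query_fps, target_fps, cutoff=0.70):
    """Count query compounds whose nearest neighbor in the target
    collection clears the similarity cutoff."""
    return sum(
        1
        for q in query_fps
        if max(tanimoto(q, t) for t in target_fps) >= cutoff
    )

# Toy bit-vector fingerprints standing in for ECFP4 descriptors
a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 1, 0, 1, 0, 0, 1, 1]  # close analog of a
c = [0, 0, 1, 0, 1, 1, 0, 0]  # unrelated scaffold

print(round(tanimoto(a, b), 2))   # 4 shared "on" bits / 5 total -> 0.8
print(count_similar([a, c], [b])) # only a clears the 0.70 cutoff -> 1
```

Run this pairwise over a few million fingerprints per side (with the usual bit-vector tricks to make it tractable) and you get exactly the kind of overlap numbers the paper reports.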

Considering that all pharmaceutical companies can access the same external vendors this number is certainly lower than expected. There are 290K compounds that are not identical but very similar between both databases, with nearest neighbors with Tanimoto values in the range of 0.7–1.0. In a joint HTS campaign this would lead to a higher coverage of the chemical space in SAR exploration. The remaining 2.3M compounds of the Bayer collection have no similar compounds in the AstraZeneca collection, as is reflected in nearest neighbors with Tanimoto values ≤0.7. Thus, a practical interpretation is that AstraZeneca would extend their available chemical space with 2.3M novel, distinct chemical entities by testing the Bayer Pharma AG collection in a HTS campaign, provided that intellectual property issues could be resolved.

One interesting effect, though, is that compounds which would be classed as "singletons" in each collection (and thus could be a bit problematic to follow up on) had closer relatives over in the other company's collection. That could be a real advantage, rescuing what might otherwise be a collection of unrelated stuff - a few legitimate leads buried in a bunch of tedious compounds that would eventually have to be discarded one by one.

The teams also compared their collections to a large public one, the ChEMBL database:

The public ChEMBL database was chosen to simulate a third-party compound collection. It consisted of 600K molecules derived from medicinal chemistry publications annotated with pharmacological/biological data. Hence, we used this source as a proxy for ‘a pharmaceutical’ compound collection. We opted to avoid the use of commercial screening collections for this assessment as it would clearly reveal the number and source of acquisitions. In Fig. 6, we display the distribution of the nearest neighbors in the ChEMBL compounds (query collection) to the target collection corresponding to the merged AstraZeneca and Bayer Pharma AG compounds. Despite the huge set of more than 3.7 million compounds to which the relatively small ChEMBL collection is compared, more than 80% of this collection has their nearest neighbor with a Tanimoto index below 0.70. Consistent with the volume of published and patented compounds this result again emphasize that even in large collections there is still relevant unexplored chemical space accessed by other groups in industry and academia.

So the question comes up, after all these comparisons: have the two companies decided to do anything about this? The conclusions of the paper seem clear. If you're interested in high-throughput screening, combining the two collections would significantly improve the results obtained from screening either one alone. How much value does either company assign to that, compared to the intellectual property risks involved? The decision (or lack of decision) that's reached on this will serve as the best answer: revealed preference always wins out over stated preference.

Comments (15) + TrackBacks (0) | Category: Drug Assays


1. weirdo on December 6, 2012 9:17 AM writes...

Assigning value is certainly the key, to this potential sharing as well as some of the "pre-competitive library consortia" out there. As medicinal chemists recognize, a screening hit from a partner collection is not equivalent to a hit from an internal library for a number of reasons (setting aside LE, "attractiveness", etc. etc.):

1) There is no institutional knowledge of the chemotype -- how to make, handle, etc.
2) Any IP was filed by someone else.
3) There is no history of how dirty the molecule is -- what other HTS campaigns found it as a hit, no ADME data, no (non-public) information on in vivo behavior of analogues, etc. etc.
4) Probably no dry powder for confirmations -- someone will have to make it to simply reconfirm the screening data.

And the list goes on.

Being able to screen someone else's collection of equal size increases the odds of success, yes. But it certainly doesn't double them.

Permalink to Comment

2. DrSnowboard on December 6, 2012 10:29 AM writes...

As Mike Hann at GSK once said (and probably still says): "You don't find a needle in a haystack by building a bigger haystack." Particularly if you are actually looking for corn.

Permalink to Comment

3. Teddy Z on December 6, 2012 10:54 AM writes...

#2 is exactly correct. I am a long-time practitioner and proponent of fragments, and I still don't get why people think that putting more and more big things into their screens will increase the probability of finding the drug. Kubinyi is a good place to start:

Permalink to Comment

4. B on December 6, 2012 11:25 AM writes...

#3: I think it's just a different philosophy. Yes a large HTS is more like finding a needle in a haystack, but you are also trying to put probability on your side (with huge libraries, etc.).

Fragments are great, but after you get your fragment I think the prospect of transitioning that into a drug is daunting. Often it seems like the chemical space you can explore is so vast that it can be difficult to arrive at anything as creative as some of the large molecules you may find in full-molecule HTS.

I think they are more like opposite ends of the spectrum; both have their pros and cons. More important than either is doing the right assays, to make sure you can see binding and don't miss hits due to poor specificity and the like.

Permalink to Comment

5. schambrg on December 6, 2012 11:54 AM writes...

Mike Hann is quite right with his comment about haystacks. But he implies that the needle is already in the haystack he is currently looking at. On the other hand, one might find a needle in the second haystack, as there might not be one in the first.

Permalink to Comment

6. Sleepless in SSF on December 6, 2012 11:57 AM writes...

So the argument for a large library at Exelixis (ca. 4.5MM) was that they got low nM hits from the screen, and so the hit-to-lead time was reduced. Given the large audience of med chemists here, I'm interested to hear if there are any comments on that philosophy from people who hadn't drunk the koolaid.

Permalink to Comment

7. Anonymous on December 6, 2012 12:45 PM writes...

Looking at the impact that HTS has had on our industry's fantastic new drug productivity track record over the past 15 years, my conclusion is: size only matters in porn.

Permalink to Comment

8. anon on December 6, 2012 1:20 PM writes...

#6: Potency, like size, isn't everything. Potency without the ability to optimise all the many other parameters to get to the drug is nothing.

Permalink to Comment

9. KCMO_chemist on December 6, 2012 8:11 PM writes...

#6/#8: The Koolaid would appear to have been made out of potent optimized compounds moved into the clinic.

23 development candidates over 10 years, 6 of which remain in the clinic, 3 of which remain in preclinical, and many others of which were shelved (or never moved forward), mainly because of the financial constraints on the company as it focused on Cometriq (cabozantinib/XL184).

Those results suggest if you design it well, you get solid results.

Permalink to Comment

10. Loving it on December 6, 2012 11:43 PM writes...

Derek, in answer to your question "have the two companies decided to do anything about this?": they are indeed taking the extra step of screening each other's libraries:

#7: HTS has had the same great impact as medicinal chemistry, rational design, fragments, genomics, combinatorial chemistry, translational medicine, personalized medicine... Looking for scapegoats? You can choose any, and it will lead you nowhere - all tools and paradigms can be useful or misused, oversold or trashed just when they are starting to work.

Permalink to Comment

11. Chris Swain on December 7, 2012 3:49 AM writes...

What might be interesting would be if the companies looked at their own libraries with the same similarity cut-off. I suspect they would each find that they could significantly reduce the size of their own screening collection without compromising diversity.

Permalink to Comment

12. john on December 7, 2012 4:31 AM writes...

The real problem is recognizing the needle when you see it. All too easy to end up with a big heap of false positives while the gold nugget sits in a heap (of unknown size...) of false negatives because it didn't fancy dissolving in DMSO that morning.

Permalink to Comment

13. petros on December 7, 2012 9:34 AM writes...

One can imagine that both the Bayer and Schering libraries originally contained, from a med chem perspective, some horrible compounds because of the two companies' agrochem activities.

Early screening sets at Bayer were heavily biased towards dihydropyridines, while a good chunk of the initial large collection was made up of compounds from agro R&D.

Permalink to Comment

14. Mark Mackey on December 7, 2012 9:59 AM writes...

@11 - Possibly, but the issue is that 2D fingerprint similarity is only a partial proxy for activity. Given a compound is active, close analogues (Tanimoto>0.7, for example) are more likely to be active than random stuff, but it's far from guaranteed. If you aggressively go for a "diverse pick", you're picking one from any given cluster of related compounds, and it's more likely than not that you'll have discarded the actives and kept one of the inactives for any given assay.

Permalink to Comment

15. Jöns Jacob B on December 9, 2012 5:47 PM writes...

in adddition to @10: there is a large European consortium forming for cross-screening of pharma collections:

So, a key tenet of pharma medicinal chemistry research is becoming pre-competitive.

Permalink to Comment


