There's a new paper out that does something unique: it compares the screening libraries of two large drug companies, both of which agreed to open their books to each other (up to a point) for the exercise. The closest analog that I know of is when Bayer merged with/bought Schering AG, and the companies published on the differences between the two compound collections as they worked on merging them. (As a sideline, I hope that they've culled some of the things that were in that collection when I worked there. I actually had a gallery of horrible compounds from the files that I kept around to amaze people - it was hard to come up with a functional group that wasn't represented somewhere.) That combined Bayer collection (2.75 million compounds) has now been compared with AstraZeneca's (1.4 million compounds). The two of them have clearly been exploring precompetitive collaboration in high-throughput screening, and trying to figure out how much there is to gain.
The first question that comes to mind is how the companies managed this - after all, you wouldn't want another outfit to actually stroll through your structures. They used 2-D fingerprints to get around this problem, the ECFP4 system, to be exact. That's a descriptor that gives a lot of structural information without being reversible; you can't reassemble the actual compound from the fingerprint.
So what's in these collections, and how much do the two overlap? I think that the main take-away from the paper is the answer to the second question, which is "Not as much as you'd think". Using Tanimoto similarity calculations (ratio of the intersecting set to the union set) for all those molecular fingerprints (with a cutoff of 0.70 for "similar"), they found that about 144,000 compounds in the Bayer collection seem to be duplicated in the AstraZeneca collection. Not surprisingly, these turned out to be commercially available; they'd been bought from the same vendors, most likely. That's not much!
Considering that all pharmaceutical companies can access the same external vendors this number is certainly lower than expected. There are 290K compounds that are not identical but very similar between both databases, with nearest neighbors with Tanimoto values in the range of 0.7–1.0. In a joint HTS campaign this would lead to a higher coverage of the chemical space in SAR exploration. The remaining 2.3M compounds of the Bayer collection have no similar compounds in the AstraZeneca collection, as is reflected in nearest neighbors with Tanimoto values ≤0.7. Thus, a practical interpretation is that AstraZeneca would extend their available chemical space with 2.3M novel, distinct chemical entities by testing the Bayer Pharma AG collection in a HTS campaign, provided that intellectual property issues could be resolved.
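For anyone who'd like to see the mechanics, here's a minimal sketch of that nearest-neighbor binning. It is not the paper's code: the fingerprints are toy sets of "on" bit positions standing in for real ECFP4 output, the function names are made up for illustration, and putting a value of exactly 0.70 in the "similar" bin is my assumption (the quoted ranges overlap at that point). Note too that a Tanimoto of 1.0 only says the fingerprints match, which is a strong hint of a duplicate rather than proof.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List

# Toy stand-in for an ECFP4 fingerprint: the set of bit positions that are "on".
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Shared bits divided by the bits set in either fingerprint."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: treat as dissimilar (assumption)
    return len(a & b) / union


def nearest_neighbor(query: Fingerprint, collection: Iterable[Fingerprint]) -> float:
    """Highest Tanimoto value between the query and anything in the collection."""
    return max((tanimoto(query, fp) for fp in collection), default=0.0)


def classify(nn_value: float, cutoff: float = 0.70) -> str:
    """Bin a nearest-neighbor value; handling of exactly `cutoff` is an assumption."""
    if nn_value == 1.0:
        return "identical"
    if nn_value >= cutoff:
        return "similar"
    return "novel"


def summarize(queries: List[Fingerprint], collection: List[Fingerprint]) -> Counter:
    """Count how many query compounds land in each bin."""
    return Counter(classify(nearest_neighbor(q, collection)) for q in queries)


# Hypothetical miniature "collections", purely illustrative.
other_company = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12, 13, 14})]
our_compounds = [
    frozenset({1, 2, 3, 4}),      # same fingerprint as one of theirs
    frozenset({10, 11, 12, 13}),  # 4 shared bits out of 5 in the union
    frozenset({20, 21, 22}),      # nothing in common
    frozenset({1, 2, 30, 31}),    # 2 shared bits out of 6 in the union
]

print(summarize(our_compounds, other_company))
```

At real scale (millions against millions) no one would do this as a brute-force Python loop, but the logic of the comparison is the same: each compound gets scored by its single closest relative in the other collection, and the collection-wide picture comes from counting the bins.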
One interesting effect, though, is that compounds which would be classed as "singletons" in each collection (and thus could be a bit problematic to follow up on) had closer relatives over in the other company's collection. That could be a real advantage, rescuing what might otherwise be a collection of unrelated stuff - a few legitimate leads buried in a bunch of tedious compounds that would eventually have to be discarded one by one.
The teams also compared their collections to a large public one, the ChEMBL database:
The public ChEMBL database was chosen to simulate a third-party compound collection. It consisted of 600K molecules derived from medicinal chemistry publications annotated with pharmacological/biological data. Hence, we used this source as a proxy for ‘a pharmaceutical’ compound collection. We opted to avoid the use of commercial screening collections for this assessment as it would clearly reveal the number and source of acquisitions. In Fig. 6, we display the distribution of the nearest neighbors in the ChEMBL compounds (query collection) to the target collection corresponding to the merged AstraZeneca and Bayer Pharma AG compounds. Despite the huge set of more than 3.7 million compounds to which the relatively small ChEMBL collection is compared, more than 80% of this collection has their nearest neighbor with a Tanimoto index below 0.70. Consistent with the volume of published and patented compounds this result again emphasize that even in large collections there is still relevant unexplored chemical space accessed by other groups in industry and academia.
So the question comes up, after all these comparisons: have the two companies decided to do anything about this? The conclusions of the paper seem clear. If you're interested in high-throughput screening, combining the two collections would significantly improve the results obtained from screening either one alone. How much value does either company assign to that, compared to the intellectual property risks involved? The decision (or lack of decision) that's reached on this will serve as the best answer: revealed preference always wins out over stated preference.