I've been meaning to get around to a very interesting paper from the Shoichet group that came out a month or so ago in Nature Chemical Biology. Today's the day! It examines the content of screening libraries and compares them to what natural products generally look like, and they turn up some surprising things along the way. The main question they're trying to answer is: given the huge numbers of possible compounds, and the relatively tiny fraction of those we can screen, why does high-throughput screening even work at all?
The first data set they consider is the Generated Database (GDB), a calculated set of all the reasonable structures with 11 or fewer nonhydrogen atoms, which grew out of this work. Neglecting stereochemistry, that gives you between 26 and 27 million compounds. Once you're past the assumptions of the enumeration (which certainly seem defensible - no multiheteroatom single-bond chains, no gem-diols, no acid chlorides, etc.), then there are no human bias involved: that's the list.
The second list is everything from the Dictionary of Natural Products and all the metabolites and natural products from the Kyoto Encyclopedia of Genes and Genomes. That gives you 140,000+ compounds. And the final list is the ZINC database of over 9 million commercially available compounds, which (as they point out) is a pretty good proxy for a lot of screening collections as well.
One rather disturbing statistic comes out early when you start looking at overlaps between these data sets. For example, how many of the possible GDB structures are commercially available? The answer: 25,810 of them - in other words, you can only buy fewer than 0.01% of the possible compounds with 11 heavy atoms or below, making the "purchasable GDB" a paltry list indeed.
Now, what happens when you compare that list of natural products to these other data sets? Well, for one thing, the purchasable part of the GDB turns out to be much more similar to the natural product list than the full set. Everything in the GDB has at least 20% Tanimoto similarity to at least one compound in the natural products set, not that 20% means much of anything in that scoring system. But only 1% of the GDB has a 40% Tanimoto similarity, and less than 0.005% has an 80% Tanimoto similarity. That's a pretty steep dropoff!
But the "purchasable GDB" holds up much better. 10% of that list has 100% Tanimoto similarity (that is, 10% of the purchasable compounds are natural products themselves). The authors also compare individual commercial screening collections. If you're interested, ChemBridge and Asinex are the least natural-product-rich (about 5% of their collections), whereas IBS and Otava are the most (about 10%).
So one answer to "why does HTS ever work for anything" is that compound collections seem to be biased toward natural-product type structures, which we can reasonably assume have generally evolved to have some sort of biological activity. It would be most interesting to see the results of such an analysis run from inside several drug companies against their own compound collections. My guess is that the natural product similarities would be even higher than the "purchasable GDB" set's, because drug company collections have been deliberately stocked with structural series that have shown activity in one project or another.
That's certainly looking at things from a different perspective, because you can also hear a lot of talk about how our compound files are too ugly - too flat, too hydrophobic, not natural-product-like enough. These viewpoints aren't contradictory, though - if Shoichet is right, then improving those similarities would indeed lead to higher hit rates. Compared to everything else, we're already at the top of the similarity list, but in absolute terms there's still a lot of room for improvement.
So how would one go about changing this, assuming that one buys into this set of assumptions? The authors have searched through the various databases for ring structures, taking those as a good proxy for structural scaffolds. As it turns out 83% of the ring scaffolds among the natural products are unrepresented among the commercially available molecules - a result that I assume that Asinex, ChemBridge, Life Chemicals, Otava, Bionet and their ilk are noting with great interest. In fact, the authors go even further in pointing out opportunities, with a table of rings from this group that closely resemble known drug-like ring systems.
But wait a minute. . .when you look at those scaffolds, a number of them turn out to be rather, well, homely. I'd be worried about elimination to form a Michael acceptor in compound 19, for example. I'm not crazy about the N,S acetal in 21 or the overall stability of the acetals in 15, 17 and 31. The propiolactone in 23 is surely reactive, as is the quinone in 25, and I'd be very surprised if that's not what they owe their biological activities to. And so on.
All that said, there are still some structures in there that I'd be willing to check out, and there must be more of them in that 83%. No doubt a number of the rings that do sneak into the commercial list are not very well elaborated, either. I think that there is a real commercial opportunity here. A company could do quite well for itself by promoting its compound collection as being more natural-product similar than the competition, with tractable molecules, and a huge number of them unrepresented in any other catalog.
Now all you'd have to do is make these things. . .which would require hiring synthetic organic chemists, and plenty of them. These things aren't easy to make, or to work with. And as it so happens, there are quite a few good ones available these days. Anyone want to take this business model to heart?