So, how many good screening compounds are there to be had? We can now start to argue about the definition of "good"; that's the traditional next step in this process. But there's a new paper from Australia's Jonathan Baell on this very question that's worth a look.
He and his co-workers have already called attention to the number of compounds with possibly problematic functional groups for high-throughput screening. In this paper, they also quantify the way that commercial compound collections tend to go wild on certain scaffolds - giving you, say, three hundred of one series and one of another. Synthetic ease is presumably the reason for this. And it's not always bad - if you get a hit from the series, then you have an SAR collection ready to go for the follow-up. But you wouldn't necessarily want all of them in there for the first go-round.
But there are many other criteria, and as anyone who's done the exercise can appreciate, large lists of compounds tend to be cut down to size rapidly. The paper shows this in action with a commercial set of 400,000 compounds. Apply some not-too-stringent criteria (between 1 and 4 rings, molecular weights between 150 and 450, cLogP less than 6, no more than 5 hydrogen bond donors and no more than 8 acceptors, up to three chiral centers, and up to 12 rotatable bonds), and you're down to 250K compounds right there. Clear out some functional groups and the PAINS list, and you're down to 170K. Want to cut the molecular weight down to 400, and rotatable bonds down to 10? 130,000 compounds remain. cLogP only up to 5, donors down to 3 or fewer, acceptors down to 6 or fewer? 110,000.
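Those first-pass cuts amount to a simple property filter. Here's a minimal sketch in plain Python of the criteria as listed above - the compounds are assumed to arrive as dicts of precomputed descriptors, and the key names are my own invention, not anything from the paper:

```python
# Sketch of the "not-too-stringent" first-pass filter described above.
# Descriptor key names are illustrative; a real pipeline would compute
# these with a cheminformatics toolkit.

def passes_first_pass(cpd):
    """True if a compound survives the initial property cuts."""
    return (1 <= cpd["rings"] <= 4
            and 150 <= cpd["mw"] <= 450
            and cpd["clogp"] < 6
            and cpd["hbd"] <= 5            # hydrogen bond donors
            and cpd["hba"] <= 8            # hydrogen bond acceptors
            and cpd["chiral_centers"] <= 3
            and cpd["rot_bonds"] <= 12)

library = [
    {"rings": 2, "mw": 320.4, "clogp": 3.1, "hbd": 2, "hba": 4,
     "chiral_centers": 1, "rot_bonds": 5},   # passes everything
    {"rings": 5, "mw": 520.0, "clogp": 6.8, "hbd": 6, "hba": 9,
     "chiral_centers": 4, "rot_bonds": 14},  # fails several criteria
]

survivors = [c for c in library if passes_first_pass(c)]
print(len(survivors))  # → 1
```

The later, tighter rounds in the paper (MW down to 400, rotatable bonds down to 10, and so on) are just the same filter with the constants turned down.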
At this point, the paper says, further inspection of the list led to the realization that there were still a lot of problematic functional groups present. (I had a similar experience myself recently, filtering down a less humongous list. Even after several rounds, I was surprised to find, on looking more closely, how many oximes, hydrazones, Schiff bases, hydrazines, and N-hydroxyls were left). In Baell's case, clearing out the not-so-great at this point cut things down to 50,000 compounds. Then a Tanimoto cutoff (to get rid of things that were at least 90% similar to the existing screening compounds) cleared out all but 10,000. Applying the same cutoff, but getting rid of compounds on the list that were more than 90% similar to each other, reduced it to 6,000. So, in other words, one could make a good case for getting rid of over 98% of the vendor's list for high-throughput screening purposes. Similar results were obtained for many other commercial sets of compounds; the paper has the exact numbers (although not, alas, the vendor names involved!).
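For anyone who hasn't run into it, that Tanimoto cutoff is easy to sketch. The coefficient is just shared fingerprint bits over total distinct bits, and the internal-redundancy pass is a greedy sweep that keeps a compound only if it's under the similarity threshold against everything already kept. A toy version, with fingerprints represented as Python sets of "on" bit positions (real work would use a toolkit's hashed fingerprints; everything here is illustrative):

```python
# Toy version of the 90% Tanimoto cutoff described above.
# Fingerprints are sets of "on" bit positions, purely for illustration.

def tanimoto(fp1, fp2):
    """Tanimoto coefficient: shared bits over total distinct bits."""
    if not fp1 and not fp2:
        return 0.0
    return len(fp1 & fp2) / len(fp1 | fp2)

def dedupe(candidates, threshold=0.9):
    """Greedily keep candidates that are less than `threshold` similar
    to every compound already kept (the internal-redundancy pass)."""
    kept = []
    for fp in candidates:
        if all(tanimoto(fp, k) < threshold for k in kept):
            kept.append(fp)
    return kept

fps = [
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},      # reference compound
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 11},      # 9/11 ≈ 0.82 similar: kept
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11},  # 10/11 ≈ 0.91 vs first: dropped
]
print(len(dedupe(fps)))  # → 2
```

Screening against an existing collection (the 50K-to-10K step) is the same comparison, just run against the in-house fingerprints instead of against the list itself.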
There were other vendor considerations as well. By the time Baell and his group had gone through all this compound-crunching and placed orders, significant numbers of compounds turned out to be unavailable. (I'm willing to bet that quite a few of them would have turned out to be unavailable even if they'd placed their orders that afternoon, but I'm of a cynical bent). That catalog turnover also brings up the problem of being able to re-order compounds if they turn out to be hits:
. . .there were only two vendors whose resupply philosophy we considered to be sound, this philosophy being that around 40 mg stock was set aside and accessible exclusively to prior buyers of that compound for the purposes of resupply of ca. 1–2 mg for secondary assay of a confirmed screening hit. We believe this issue of resupply is in urgent need of attention by vendors and will provide a competitive edge to those vendors willing to better guarantee resupply.
By the time they'd surveyed the various large-scale compound vendors, the group had looked over the majority of commercially available screening compounds. Given the attrition rates, how many actual compounds would cover the world's purchasable chemical space? The best guess is about 340,000, of the many millions of potentially purchasable items.
Of course, all these numbers are subject to dispute - you may not agree with some of the functional group or property cutoffs, or you might want things cut down even more. The paper addresses this question, and the general one of why any particular compound should be in a screening collection at all. My own criterion is "Would I be willing to follow up on this compound if it were a hit?" But different chemists, as has been proven many times, will answer such questions in different ways.
A big part of the discussion concerns those Tanimoto similarity scores, and the paper has a good deal to say about them. You wouldn't want to cut everything down to just singleton compounds (most likely), but you don't need dozens and dozens of para-chloro/para-fluoro methyl-ethyl analogs in each series, either. The best guess is that most vendor catalogs are still rather unbalanced: they have far too many analogs for some compound classes, but too few for many more. Singleton compounds represent most of the chemical diversity for many collections, but you could make the case that, ideally, there shouldn't be any singletons at all. Even two or three representatives from each structural class would be a real improvement. A vendor collection of 400,000 compounds that consisted of 40,000 fairly distinct structures with ten members of each class would be something to see - but no one's ever seen such a thing.
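That "few members per class" ideal is trivial to express once compounds have been assigned to structural classes (which is, of course, where all the actual work lives). A sketch, with the class assignments and compound IDs invented for illustration:

```python
# Sketch of capping each structural class at a fixed number of
# representatives. Class labels and IDs are invented for illustration;
# assigning real compounds to classes is the hard part, not shown here.
from collections import defaultdict

def cap_per_class(compounds, max_per_class=10):
    """Keep at most `max_per_class` compounds from each class.
    `compounds` is an iterable of (class_id, compound_id) pairs."""
    by_class = defaultdict(list)
    for cls, cid in compounds:
        if len(by_class[cls]) < max_per_class:
            by_class[cls].append(cid)
    return by_class

# 300 near-identical analogs of one scaffold, one lonely singleton:
library = [("biaryl", i) for i in range(300)] + [("azetidine", 0)]
capped = cap_per_class(library, max_per_class=10)
print(len(capped["biaryl"]), len(capped["azetidine"]))  # → 10 1
```

The 300-of-one-series, one-of-another imbalance from the start of the post is exactly what this flattens - though it only fixes the overrepresented half of the problem; nothing can cap its way to more singleton analogs.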
This new paper, by the way, is full of references to the screening-collection literature, as well as discussing many of the issues itself. I recommend it to anyone thinking about these issues; there are a lot of things that you don't want to have to rediscover!