So you make a new chemical structure as part of a drug research program. What's it going to hit when it goes into an animal?
That question is a good indicator of the divide between the general public and actual chemists and pharmacologists. People without any med-chem background tend to think that we can predict these things, and people with it know that we can't predict much at all. Even just predicting activity at the actual desired target is no joke, and guessing what other targets a given compound might hit is, well, usually just guessing. We get surprised all the time.
That hasn't been for lack of trying, of course. Here's an effort from a few years ago on this exact question, and a team from Novartis has just published another approach. It builds on some earlier work of theirs (HTS fingerprints, HTSFP) that tries to classify compounds according to similar fingerprints of biological activity in suites of assays, rather than by their structures, and this latest one is called HTSFP-TID (target ID, and I think the acronym is getting a bit overloaded at that point).
We apply HTSFP-TID to make predictions for 1,357 natural products (NPs) and 1,416 experimental small molecules and marketed drugs (hereafter generally referred to as drugs). Our large-scale target prediction enables us to detect differences in the protein classes predicted for the two data sets, reveal target classes that so far have been underrepresented in target elucidation efforts, and devise strategies for a more effective targeting of the druggable genome. Our results show that even for highly investigated compounds such as marketed drugs, HTSFP-TID provides fresh hypotheses that were previously not pursued because they were not obvious based on the chemical structure of a molecule or against human intuition.
They have up to 230 or so assays to pick from, although it's for sure that none of the compounds have been through all of them. They required that any given compound have at least 50 different assays to its name, though (and these were dealt with as standard deviations off the mean, to keep things comparable). And what they found shows some interesting (and believable) discrepancies between the two sets of compounds. The natural product set gave mostly predictions for enzyme targets (70%), half of them being kinases. Proteases were about 15% of the target predictions, and only 4% were predicted GPCR targets. The drug-like set also predicted a lot of kinase interactions (44%), and this from a set where only 20% of the compounds were known to hit any kinases before. But it had only 5% protease target predictions, as opposed to 23% GPCR target predictions.
The group took a subset of compounds and ran them through new assays to see how the predictions came out, and the results weren't bad - overall, about 73% of the predictions were borne out by experiment. The kinase predictions, especially, seemed fairly accurate, although the GPCR calls were less so. They identified several new modes of action for existing compounds (a few of which they later discovered buried in the literature). They also tried a set of predictions based on chemical descriptor (the other standard approach), but found a lower hit rate. Interestingly, though, the two methods tended to give orthogonal predictions, which suggests that you might want to run things both ways if you care enough. Such efforts would seem particularly useful as you push into weirdo chemical or biological space, where we'll take whatever guidance we can get.
Novartis has 1.8 million compounds to work with, and plenty of assay data. It would be worth knowing what some other large collections would yield with the same algorithms: if you used (say) Merck's in-house data as a training set, and then applied it to all the compounds in the CHEMBL database, how similar would the set of predictions for them be? I'd very much like for someone to do something like this (and publish the results), but we'll see if that happens or not.