I mentioned an interesting paper that's coming out in the Journal of Medicinal Chemistry on molecular modeling. It's a long one from a large group of people scattered across GlaxoSmithKline's worldwide research facilities, entitled "A Critical Assessment of Docking Programs and Scoring Functions." And that's what it is, all right.
For the non-med-chem readers, those are two of the key techniques in computational molecular modeling. Docking refers to taking a modeled version of your small molecule and trying to fit it into a similarly modeled version of the binding site of your protein target. The program tries to take into account the size and shape of the molecule and the binding site, of course, as well as more subtle interactions between the various functional groups. Scoring functions are what the programs use to try to rate how well the docking procedure went for a given compound, and to compare it to others in a given data set.
The GSK team did a very thorough job, evaluating ten different docking programs. They started with seven varying types of protein targets, mostly different classes of enzymes, all of which are known drug targets. An expert computational chemist took each one and polished up the model of the binding site. At the same time, lists of between one and two hundred potential binding compounds were put together for each target, including several series of related compounds. Another modeling chemist took these structures and got them ready for docking. They made sure that a crystal structure of each structural class was known for each case (to check the accuracy of the modeling later on), and also made sure that the binding affinity of the compounds ranged over at least four orders of magnitude (from pretty darn good, in other words, to pretty darn awful). The goal was to make the whole exercise as real-world as possible. Then each of those binding site models and its associated list of potential ligands was turned over to separate chemists with experience in the various docking programs, who were told to have at it. As the paper puts it:
"To optimize the performance of each docking program, computational chemists with expertise in a particular program were identified from the worldwide GSK computational chemistry community. Each program expert was given complete freedom and sufficient time to maximize the performance of the docking program. . .No time deadlines were imposed so that even low-throughput docking programs could be evaluated. Indeed, no constraints whatsoever were placed on the level of agonizing over details of how each docking program was applied."
It's important to remember that the results of this paper come from experienced users who had a great deal of knowledge about the targets, and all the time they needed to mess with them. The aforementioned agonizing was devoted to three typical kinds of question that such software is designed to answer. The first was: what is the conformation (the 3-D physical "pose") of a small molecule once it's in a binding site? This is why they picked all these things with known crystal structures, since those provide a check with real data. Results of this test were OK, in some cases fairly good. Some of the target proteins seemed to have binding sites that were more suited for the capabilities of the programs, which could take the majority of the compounds in their list and fit them pretty close (within two angstroms) to the known crystal structures.
And every target had at least one program that could take at least a third or so of the test compounds and dock them fairly well. But the problem was, no one program could do that for more than 35% of the binding modes. The best performances were scattered among the different software packages, and there seems to be absolutely no way to know in advance whether a given program is going to perform well on a new target. The other problem, and it's a big one, was that the scoring functions couldn't reliably identify when the program had hit on one of the good answers. There wasn't much correlation between what the program thought was a well-docked conformation and its resemblance to the known crystal structure.
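For readers who want to see what "within two angstroms" means in practice, here is a minimal sketch of the usual check: a root-mean-square deviation (RMSD) between the docked atom positions and the crystal-structure ones. This is my own illustration, not code from the paper. The function names and the toy coordinates are made up, and it assumes both poses list the same atoms in the same order and already sit in the same protein frame, so no alignment is done. Real tools also have to deal with symmetry-equivalent atoms, which this ignores.

```python
import math

def pose_rmsd(docked, crystal):
    """RMSD in angstroms between two poses given as lists of (x, y, z).

    Assumes identical atom ordering and a shared coordinate frame
    (hypothetical simplification; no superposition, no symmetry handling).
    """
    if not docked or len(docked) != len(crystal):
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(docked, crystal)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(docked))

def is_well_docked(docked, crystal, cutoff=2.0):
    """Apply the two-angstrom criterion mentioned in the text."""
    return pose_rmsd(docked, crystal) <= cutoff

# Toy three-atom ligand: the docked pose is shifted 0.5 angstrom along x.
crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked_pose = [(0.5, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.5, 0.0)]

rmsd = pose_rmsd(docked_pose, crystal_pose)
print(f"RMSD = {rmsd:.2f} A, well docked: {is_well_docked(docked_pose, crystal_pose)}")
```

The catch described above is that this check needs the crystal structure. The scoring function is supposed to stand in for it when you don't have one, and that's the part that didn't correlate.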
The second question they looked at was: given a list of molecules (some active, some inactive), how well can the software pick out some active ones? This process is often known as "virtual screening". Again, the results were fairly good, but with some significant problems. For all but one of the targets, at least one of the programs could find at least half of the top 10% of the active compounds. (I know, that sounds like a lot of defensive hedging compared to what some people think these programs can do, but that's the real world for you). The programs also did pretty well at pulling a variety of structures out, and not just making their total by grabbing only the members of one particular class.
But that fairly-decent performance is for the programs as a group. As before, though, the best performances were scattered through all the software packages, with no real standout. Most of the programs, at one point or another, had to grind through a significant fraction of the compound lists to do the job, too, which is something you really don't want in real-world use. Another disturbing result was that some of the scoring functions seemed to be picking the right compounds for the wrong reasons – that is, based on incorrect binding modes.
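To make the virtual screening numbers a bit more concrete, here is a rough sketch of two ways you might score such a run: what fraction of the known actives turn up in the top slice of the ranked list, and how far down the list you have to grind to recover a given fraction of them. These are common ways of looking at it, but I'm not claiming they're the exact metrics the paper used. The function names and the ten-compound data set are invented for illustration, and the sketch assumes a higher score means a better predicted binder.

```python
import math

def fraction_of_actives_found(scored, top_fraction):
    """Fraction of all actives that appear in the top slice of the ranking.

    scored: list of (score, is_active) pairs; higher score = ranked earlier
    (an assumption of this sketch, since some programs score the other way).
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_actives = sum(1 for _, active in ranked if active)
    if total_actives == 0:
        raise ValueError("need at least one active compound")
    n_top = max(1, math.ceil(len(ranked) * top_fraction))
    found = sum(1 for _, active in ranked[:n_top] if active)
    return found / total_actives

def fraction_screened_to_find(scored, active_fraction):
    """Fraction of the ranked list you must go through to recover
    the requested fraction of the actives."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_actives = sum(1 for _, active in ranked if active)
    if total_actives == 0:
        raise ValueError("need at least one active compound")
    needed = max(1, math.ceil(total_actives * active_fraction))
    found = 0
    for position, (_, active) in enumerate(ranked, start=1):
        found += active
        if found >= needed:
            return position / len(ranked)
    return 1.0

# Made-up screening results: (docking score, experimentally active?)
results = [
    (9.1, True), (8.7, False), (8.2, True), (7.5, False), (6.9, False),
    (6.1, True), (5.5, False), (4.8, False), (3.9, True), (2.0, False),
]

print(fraction_of_actives_found(results, 0.5))   # actives found in the top half
print(fraction_screened_to_find(results, 0.5))   # list fraction needed for half the actives
```

Note that neither number tells you whether the actives were picked for the right reason. A compound can land near the top of the list with a completely wrong binding mode, which is the disturbing result mentioned above.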
Now we're ready for the third question, a hard one which (in my experience) is one of the ones that medicinal chemists would most like molecular modeling software to answer: given a list of compounds, can the program rank-order them according to their expected affinity for the target? Unfortunately, the answer is "absolutely not." No scoring function in any of the software packages could even come close. The compounds that the programs ranked as winners were just as likely to stink, and the ones that they put into the discard heap were just as likely to be fine.
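If you want to put a number on "rank-ordering," one standard choice is the Spearman rank correlation between the docking scores and the measured affinities: +1 means the program ordered the compounds perfectly, 0 means its ordering tells you nothing, and -1 means it got them exactly backwards. Here's a small self-contained sketch. I'm not saying this is the statistic the GSK team reported, and the scores and affinities below are invented purely to show what a useless ranking looks like.

```python
import math

def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: docking scores (higher = predicted tighter binder)
# and measured potencies as pIC50 (higher = actually tighter binder).
docking_scores = [7.2, 6.8, 6.5, 5.9, 5.1, 4.4]
measured_pic50 = [5.0, 8.1, 6.2, 9.0, 4.3, 7.4]

rho = spearman(docking_scores, measured_pic50)
print(f"Spearman rho = {rho:.3f}")  # close to zero for this made-up set
```

A correlation near zero is the quantitative version of "the winners were just as likely to stink."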
My way of looking at the first two tests is to say that if you have just one molecular modeling package, it is guaranteed to mislead you a fair amount of the time. And you have no way of knowing when it's doing that. If you have more than one program to work with, though, then they are guaranteed to disagree with each other a fair amount of the time, and you have no way of knowing which one of them is right – if either. I'll let the authors have the last word on the third test, and on the software in general:
". . .in the area of rank-ordering or affinity prediction, reliance on a scoring function alone will not provide broadly reliable or useful information. . .This study demonstrates unequivocally that significant improvements are needed before compound scoring by docking algorithms will routinely have a consistent and major impact on lead optimization. . .it is not completely obvious by what means these improvements will arise. . ."