About this Author

Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship for his post-doc. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis and other diseases. To contact Derek, email him directly: derekb.lowe@gmail.com. Twitter: Dereklowe


In the Pipeline


December 11, 2012

Did Kaggle Predict Drug Candidate Activities? Or Not?


Posted by Derek

I noticed this piece on Slate (originally published in New Scientist) about Kaggle, a company that's working on data-prediction algorithms. Actually, it might be more accurate to say that they're asking other people to work on data-prediction algorithms, since they structure their tasks as a series of open challenges, inviting all comers to submit their best shots via whatever computational technique they think appropriate.

PA: How exactly do these competitions work?
JH: They rely on techniques like data mining and machine learning to predict future trends from current data. Companies, governments, and researchers present data sets and problems, and offer prize money for the best solutions. Anyone can enter: We have nearly 64,000 registered users. We've discovered that creative data scientists can solve problems in every field better than experts in those fields can.

PA: These competitions deal with very specialized subjects. Do experts enter?
JH: Oh yes. Every time a new competition comes out, the experts say: "We've built a whole industry around this. We know the answers." And after a couple of weeks, they get blown out of the water.

I have a real approach-avoidance conflict with this sort of thing. I tend to root for outsiders and underdogs, but naturally enough, when they're coming to blow up what I feel is my own field of expertise, that's a different story, right? And that's just what this looks like: the Merck Molecular Activity Challenge, which took place earlier this fall. Merck seems to have offered up a list of compounds of known activity in a given assay, and asked people to see if they could recapitulate the data through simulation.

Looking at the data that were made available, I see that there's a training set and a test set. They're furnished as a long run of molecular descriptors, but the descriptors themselves are opaque, no doubt deliberately (Merck was not interested in causing themselves any future IP problems with this exercise). The winning team was a group of machine-learning specialists from the University of Toronto and the University of Washington. If you'd like to know a bit more about how they did it, here you go. No doubt some of you will be able to make more of their description than I did.
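For readers who haven't handled this kind of dataset, here's a minimal sketch of what fitting a baseline model to such a descriptor table might look like. This is not Merck's setup or the winning team's method, and the file and column names ("ACT1_training.csv", "MOLECULE", "Act") are made-up placeholders:

```python
# A minimal sketch of fitting a baseline model to a descriptor table like the one
# described above. NOT Merck's pipeline or the winning team's method; the file name
# and column names are hypothetical placeholders for whatever the real files use.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("ACT1_training.csv")                 # rows = compounds, columns = opaque descriptors
y = df.pop("Act")                                     # the measured activity
X = df.drop(columns=["MOLECULE"], errors="ignore")    # drop the compound ID, keep the descriptors

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=300, n_jobs=-1, random_state=0)
model.fit(X_tr, y_tr)
print("held-out R^2:", r2_score(y_val, model.predict(X_val)))
```

A random forest is just a convenient baseline here; the winners' own description points to far more elaborate ensembles built around deep neural networks.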

But I would be very interested in hearing some more details on the other end of things. How did the folks at Merck feel about the results, with the doors closed and the speaker phone turned off? Was it better or worse than what they could have come up with themselves? Are they interested enough in the winning techniques that they've approached the high-ranking groups with offers to work on virtual screening techniques? Because that's what this is all about: running a (comparatively small) test set of real molecules past a target, and then switching to simulations and screening as much of small molecule chemical space as you can computationally stand. Virtual screening is always promising, always cost-attractive, and sometimes quite useful. But you never quite know when that utility is going to manifest itself, and when it's going to be another goose hunt. It's a longstanding goal of computational drug design, for good reason.
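In code, that virtual screening workflow boils down to something like the toy below - simulated data throughout, with a random forest standing in for whatever model you actually trust:

```python
# A self-contained toy of the virtual-screening pattern described above: train on a
# (comparatively small) set of compounds with measured activity, then rank a much
# larger "virtual" library by predicted activity. All data here are simulated.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_assay = rng.normal(size=(2000, 50))                  # descriptors for compounds actually tested
y_assay = X_assay[:, 0] - 0.5 * X_assay[:, 1] + rng.normal(scale=0.5, size=2000)

model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X_assay, y_assay)

X_library = rng.normal(size=(50_000, 50))              # "virtual" compounds, never made or tested
predicted = model.predict(X_library)
shortlist = np.argsort(predicted)[::-1][:500]          # top 500 predicted actives to consider making
print("best predicted activities:", predicted[shortlist][:5])
```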

So, how good was this one? That also depends on the data set that was used, of course. All of these algorithm-hunting methods can face a crucial dependence on the training sets used, and their relations to the real data. Never was "Garbage In, Garbage Out" more appropriate. If you feed in numbers that are intrinsically too well-behaved, you can emerge with a set of rules that look rock-solid, but will take you completely off into the weeds when faced with a more real-world situation. And if you go to the other extreme, starting with wooly multi-binding-mode SAR with a lot of outliers and singletons in it, you can end up fitting equations to noise and fantasies. That does no one any good, either.
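That "fitting equations to noise" failure mode is easy to demonstrate on synthetic data. The sketch below (my own toy, nothing from the competition) trains a flexible model on pure noise and gets a respectable-looking training fit while the held-out performance collapses:

```python
# A quick, self-contained demonstration of the "fitting equations to noise" failure mode:
# with plenty of descriptors and a flexible model, even random "activities" can look
# well-modeled on the training data while predicting nothing on held-out compounds.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 200))                        # 300 "compounds", 200 descriptors
y = rng.normal(size=300)                               # pure noise standing in for assay data

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:200], y[:200])
print("train R^2:", r2_score(y[:200], model.predict(X[:200])))   # deceptively high
print("test  R^2:", r2_score(y[200:], model.predict(X[200:])))   # near zero or negative
```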

Back last year, I talked about the types of journal article titles that make me keep on scrolling past them, and invited more. One of the comments suggested "New and Original strategies for Predictive Chemistry: Why use knowledge when fifty cross-correlated molecular descriptors and a consensus of over-fit models will tell you the same thing?". What I'd like to know is, was this the right title for this work, or not?

Comments (28) + TrackBacks (0) | Category: In Silico


COMMENTS

1. eugene on December 11, 2012 10:24 AM writes...

This technology is still in the testing phase and hasn't been successfully used by Merck. You don't have a link to the Slate article as of now, but the quotes sound like a lot of hubris for something that is untested.

It also makes me a bit uncomfortable that there was no chemical structure out there at all during the contest. I don't know why that should be, since I guess they had other descriptors that could substitute well for it... but it just does.


2. Morten G on December 11, 2012 10:25 AM writes...

I still wonder how Kaggle determines the winner if the test set is released with the training set?

Also, when I read that article I went straight to the Merck challenge and noticed that it was won by machine learning experts, not amateurs. Hell, more than experts - these are the people who invent these algorithms / data treatment structures / whatever you want to call them.

PS Noticed the $3 mio challenge up now?


3. Jack H. Pincus on December 11, 2012 10:53 AM writes...

The best use of machine learning at this time would not be virtual screening but predicting outcomes of compounds with certain properties. That may be what Merck is hoping for. Machine learning has been useful for predicting consumer behavior or the outcome of elections. As you correctly point out, the outcome in this case may depend on the quality of the data and how the teams interpret it. I haven't looked at Merck's dataset but your description suggests it may be lacking details critical for successful projections.

Successful machine learning projects often include a domain expert to help analyze and interpret the results. Kaggle teams are almost exclusively data scientists, which could affect the outcome of a drug candidate analysis project.


4. LeeH on December 11, 2012 10:57 AM writes...

I competed in an earlier modeling exercise, from Boehringer, and came in 31st out of 703 entrants. Rather than being an exercise in creating a linear prediction of activity, it was a categorical one (i.e. what was the probability of a particular compound being active).

How did I do it? By successive tries, with various methods and combinations of methods, optimizing what I did by using feedback from the test set (which, by the way, is only 25% of the total test instances). Some would say that I overfit, but in fact the performance by the methods on the smaller test set mirrored the performance on the final test set extremely well. I would have done somewhat better (15th or so) by choosing one of my other methods, but I'm sure that's true of everyone. And I did choose the second best methods, out of about 100 candidates.

The good thing about the exercise was that I was able to test out various methods in a blinded way. I wouldn't have put as much effort into a non-blinded situation (money and besting the "experts" is a strong motivator), but I'm glad I did it once. I did learn some interesting things about the available data mining methods which I can use in everyday life. For example, in general, Forest of Trees is king.

On the other hand, it's important to note a few things. First, the amount of effort put into this competition is way more than you could hope to spend in a real-world environment. Second, as mentioned before, you didn't know what the descriptors referred to or anything about the structural features of the compounds. And third, and most importantly, the criteria used for winning are rather ridiculous. The actual score was a log-loss calculation, which is sensitive to the log of the difference between your predicted probability (of belonging to one class or the other) and the actual class. The difference between the winner and the 50th place finisher was 0.009, and the scores were reported to 5 decimal places. Under these conditions, you could predict the ranking extremely well, getting all the actives at the top of a ranked list, but if you predicted a probability of 0.5 for too many of the inactives (which should be 0.000) you would blow your overall performance. This scoring method artificially created a user ranking which was not appropriate to the real goal of the exercise, that is, finding active compounds.
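A small numerical sketch of that log-loss point (the numbers are invented, and the actual scoring details may have differed): two submissions that rank the actives identically can get wildly different scores if one of them hedges at 0.5 on the inactives.

```python
# Illustrative only: 10 actives among 1000 compounds, two submissions with identical
# rankings, scored with the standard binary log-loss formula.
import numpy as np

def log_loss(y_true, p):
    p = np.clip(p, 1e-15, 1 - 1e-15)                   # guard against log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1] * 10 + [0] * 990)                     # 10 actives, 990 inactives

confident = np.array([0.9] * 10 + [0.01] * 990)        # confident (and correct) on the inactives
hedged    = np.array([0.9] * 10 + [0.5]  * 990)        # same ranking, but hedging on the inactives

print("log-loss, confident:", round(log_loss(y, confident), 3))   # ~0.011
print("log-loss, hedged:   ", round(log_loss(y, hedged), 3))      # ~0.687 - a huge penalty
```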

I did voice my concerns on the forum boards related to that particular competition. My impression is that, with a few exceptions, people didn't really get it, and seemed impressed with improvements in the 3rd and 4th decimal places that we, in the modeling community, would consider trivial.

What would be interesting for everyone, I think, would be to see which methods, over the entirety of the submitted results, performed best. Perhaps the folks at Boehringer are writing up a paper.


5. Random observer on December 11, 2012 11:18 AM writes...

#2: According to the Kaggle data description, the test dataset as released had the activity information removed.

Still, an important caveat about their approach for determining the "winner" is that the difference between the 1st place score of 0.494 and the 2nd place score of 0.488 could very well be due to chance, and is not necessarily indicative of how the performance of these models would compare on a truly new dataset that has not been seen before. There were 99 teams with scores higher than Merck's internal result of 0.423 (to arbitrarily draw a line in the sand for choosing qualified and committed competitors), and they made over 2000 (two thousand!) submissions to solve this problem (pardon my quantitative upbringing and fondness for numbers). That is a lot of shots at the same goal (i.e. the same test set). Especially if the competitors know the results of their previous submissions as they work on the next one - and if they do, then the process is guaranteed to overfit (i.e. make the error look smaller than it would be on a new, undisclosed dataset) - an elaborate overfitting process, for sure, with many teams, a distributed environment, web submissions, etc.
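As a toy illustration of that overfitting-by-repeated-submission effect (all numbers invented): thousands of signal-free "models" scored against one fixed public test set will appear to improve steadily, and the apparent gain vanishes on a genuinely new test set.

```python
# Self-contained toy simulation: random "models" with no real signal, each scored against
# one fixed public test set. Keeping whichever looks best on that set inflates the apparent
# score, but the selected "model" does no better than chance on a fresh, undisclosed set.
import numpy as np

rng = np.random.default_rng(2)
n = 200                                                # compounds per test set
y_public = rng.normal(size=n)                          # the set everyone gets feedback on
y_private = rng.normal(size=n)                         # a truly new, undisclosed set

def r2(y, pred):
    return np.corrcoef(y, pred)[0, 1] ** 2             # squared correlation, as in the challenge

best_public, best_private = -1.0, None
for _ in range(2000):                                  # ~2000 submissions, as noted above
    pred_public = rng.normal(size=n)                   # "model" output with no real signal
    pred_private = rng.normal(size=n)
    score = r2(y_public, pred_public)
    if score > best_public:                            # keep whichever looks best on the public set
        best_public, best_private = score, r2(y_private, pred_private)

print("best public-leaderboard R^2:", round(best_public, 3))    # creeps up to ~0.05-0.08
print("same 'model' on fresh data: ", round(best_private, 3))   # back down near zero
```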

And, as another comparison of different (genomics-based) models - MAQC-II (certainly better conducted as far as process goes) - tells us, there are datasets with more information in them, with less, and with none to speak of. That, and team proficiency, were in that case what mostly determined the performance of the models on properly blinded test datasets - the choice of the modeling tool did not matter so much.

Then consider that some of the models allow for higher interpretability and some for much less (it is usually tough to understand what drives the predictions of neural networks), and the value of 0.494 versus 0.423 might be very, very questionable.


6. ptm on December 11, 2012 11:36 AM writes...

Only the training set is released; from what I understand of the rules, contestants submit models and people at Kaggle evaluate them on the test data.

Looking at the leaderboard, the winner's R2 correlation coefficient is 0.494, compared to Merck's internal standard of 0.423 - an improvement of 0.071.

I have no idea how valuable such an improvement is in practice. Naively it doesn't strike me as a groundbreaking result but it's clearly not negligible either.


7. drug_hunter on December 11, 2012 12:07 PM writes...

For all the reasons given by previous posters and Derek, the difference of 0.07 is NOT CONCLUSIVE of any actual improvement.

However, I am a fan of this kind of work. I think these kinds of computational experiments are going to be very useful to help us understand which methods work best under which circumstances, so over time I think we WILL get to the point where we can say with more confidence that an improvement of 0.07 is in fact useful.

We just aren't there yet. We need another dozen or so competitions and I bet at that point some trends will start to emerge.


8. gwern on December 11, 2012 12:17 PM writes...

It's worth noting that one of the team members was Geoff Hinton. Yes, *that* Hinton, the deep learning neural networks Hinton whose work has been routinely racking up rewards and records over the past few years.

> Since our goal was to demonstrate the power of our models, we did no feature engineering and only minimal preprocessing. The only preprocessing we did was occasionally, for some models, to log-transform each individual input feature/covariate. Whenever possible, we prefer to learn features rather than engineer them. This preference probably gives us a disadvantage relative to other Kaggle competitors who have more practice doing effective feature engineering. In this case, however, it worked out well. We probably should have explored more feature engineering and preprocessing possibilities since they might have given us a better solution.

I could believe that.


9. MTK on December 11, 2012 12:20 PM writes...

"Uh, I was told there would be no math."


10. Guy Cavet on December 11, 2012 1:43 PM writes...

Disclosure: I work for Kaggle.

Thanks for the nice post and discussion.

Regarding overfitting, there are two test sets: one used to give feedback on submissions and generate the leaderboard during the competition, and another used to determine the final standings. No information from the second test set is available to the participants during the competition. So, if the participants overfit the first test set, they will not win. And conversely, the winners will not have overfit the test set. There's a nice blog post about this at http://blog.kaggle.com/2012/07/06/the-dangers-of-overfitting-psychopathy-post-mortem/


11. MoMo on December 11, 2012 2:05 PM writes...

Sounds like another scheme by Big Pharma-Merck to get high quality work for free or next to nothing.

All of you who think you are doing something scientific are just being taken advantage of.

Hire some computational scientists MERCK!

You've been Kaggled!


12. Anthony Nicholls on December 11, 2012 2:18 PM writes...

As has been pointed out here, the improvement over basic methods is small - small enough that factors such as distribution drift, noise in the data, change in performance metric, assignment uncertainty or uneven sampling of the original distribution can easily swamp any differences. See David Hand's wonderful paper for a better description:
http://arxiv.org/pdf/math/0606441.pdf

What is really shocking about the Merck event, in particular, is how poor the statistics are - the performance metric is averaged over 15 systems, so the resulting average - reported to 5 decimal places - rests on just 15 numbers. Not reported is the variance over those 15 systems, but even a reasonable estimate of the variability of performance over 15 systems would suggest the winning entry is statistically indistinguishable from the entry from Merck scientists. We don't know, though, because all we get is the average R**2.

If you actually plot the scores of all the entries on the "leaderboard", they form a pretty good Gaussian distribution (centered around the Merck entry), which again leads me to think it's pretty much a case of random variance among methods.

I've heard Merck is very excited about the 0.07 increase in R**2 over their methods - an increase which equates, back of the envelope, to an improvement in predictive accuracy of about 0.05 of an activity unit over their own method. Good luck with that, Merck.

The great golfer Lee Trevino used to say, "Driving is for show and putting is for dough". These competitions are for show - even though Merck was happy to cough up real dough! This is the next great hype that upper management in pharma is going to fall for, hook, line and sinker.


13. LeeH on December 11, 2012 2:58 PM writes...

Anthony

Thanks for elaborating on the point I was trying to make. I would contend that things are EVEN WORSE, especially in this case.

Let's assume that the models are completely real (yes, I know, they're not, but let's pretend). The problem is that they are linear models in log space. Have you ever looked at the 95% confidence limits on the predictions (not on the model) from a linear model with an R2 in the neighborhood of 0.4 or 0.5? They're huge. They easily span 2 or 3 orders of magnitude (in non-log space), maybe more, at the extremes. That means that your predictions for the compounds that supposedly live at the high end of the activity curve (i.e. where you want to be) are really not much better (if at all) than the range of activities of all of the compounds that you started with. Some model.
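A back-of-the-envelope version of that calculation, with numbers assumed purely for illustration (activities in log10 units with a spread of about 1 log unit, a model R^2 of 0.45, leverage terms ignored):

```python
# Rough prediction-band arithmetic under assumed numbers, not data from the competition.
# Residual sd of a regression is roughly sd_y * sqrt(1 - R^2); the 95% prediction band is
# about +/- 1.96 residual sd around each prediction (ignoring leverage terms).
import math

sd_y, r2 = 1.0, 0.45                                   # assumed spread of log10(activity) and model R^2
resid_sd = sd_y * math.sqrt(1 - r2)                    # ~0.74 log units
band = 2 * 1.96 * resid_sd                             # full width of the 95% band, in log10 units
print(f"95% prediction band spans ~{band:.1f} log units (~{10 ** band:.0f}-fold in activity)")
```

Which works out to roughly three orders of magnitude, consistent with the point being made.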

This is why I never even attempt linear models.


14. Teddy Z on December 11, 2012 3:39 PM writes...

Back in a former life, I worked at a company that had a "black box" model for figuring out your SAR and predicting what the next set of compounds should look like. After a few sets of black-box-based compound picking, the compounds were looking very much like antibiotics (but interestingly of several different classes) - but this was a PPI. Well, it turns out the young biologist didn't understand his assay well: it used a luciferase reporter readout, and the compounds were simply shutting down protein synthesis, not at all indicative of inhibition of the PPI.


15. Q tsar on December 11, 2012 7:13 PM writes...

A stopped clock tells the right time twice a day. If you have 238 stopped clocks, one of them is likely to always be close to the right time.


16. chris on December 12, 2012 2:47 AM writes...

I've worked with academic groups who are developing novel computational methods, and things like this do give them the chance to try them out on a range of different problems.

I don't think the results are unexpected - they probably reflect the current state of virtual screening - and whilst it would have been interesting if one group had been substantially better than the basic method, if you have to select one prize winner then this is the result.


17. George on December 12, 2012 4:19 AM writes...

Hi Derek,
I'm the leader of the winning team in the competition. I have recently discovered your wonderful blog and I am ecstatic that my team's work has been mentioned on it!

I hope what we did will be useful to Merck and I think it probably will be, but of course these sorts of things need to be evaluated very carefully. I don't have the drug discovery expertise to know how it will play out. If people in these pharma companies provide lots of data to train models on and a metric they are interested in improving, we machine learning researchers can only try to improve the models according to the metric and hope that we are optimizing something useful and that there is enough data.


18. drug_hunter on December 12, 2012 8:19 AM writes...

Regarding George's post, and the general question of whether machine learning in the absence of any knowledge of chemistry can be useful at all, I recommend everyone take a look at:

https://xkcd.com/793/


19. Neo on December 12, 2012 11:00 AM writes...

Derek said: "Virtual screening is always promising, always cost-attractive, and sometimes quite useful. But you never quite know when that utility is going to manifest itself, and when it's going to be another goose hunt. It's a longstanding goal of computational drug design, for good reason."

The problem in this area is that there are a lot of software vendors who oversell their virtual screening methods using flawed retrospective "validations". So it is hard for end-users to distinguish between things that actually work and things that just look good on paper. Always look for prospective applications of the technique published in peer-reviewed journals. And correct for the number of scientists using that technique.


20. Neo on December 12, 2012 11:17 AM writes...

Also, regarding your comment that you never know when virtual screening is going to be another goose hunt.

Is it not the same with HTS when used against new targets? After all, you are only screening about 10^5 molecules out of an estimated 10^60 possible drug-like molecules. You cannot find hits where there are no hits to find. This is a very real possibility (e.g. antibacterial HTS). It is on challenging targets that virtual screening can be very useful.


21. George on December 12, 2012 3:52 PM writes...

drug_hunter, there is knowledge of chemistry in this process; I am just not the one who has it. Chemistry knowledge is not something that is in short supply at Merck. It is too much to ask that the people with specialized chemistry knowledge also have specialized machine learning knowledge, which is why it makes sense for chemists and machine learning researchers to collaborate.


22. ChrisL on December 12, 2012 5:50 PM writes...

The Merck experiment as described completely prevents any conclusions as to the relative abilities of experts compared to outsiders. The reason is that chemical structures were not shown. Reducing a chemical structure to a bit string of descriptors completely eliminates the power of chemical-structure-to-biomedical-information pattern recognition, which I would maintain is the forte of medicinal chemists. Imagine reducing a histopathology slide to the equivalent of a bit string, or an old world painting to a bit string. You would eliminate the pattern recognition skill of the pathologist in reading slides, or the pattern recognition skill of the art expert in detecting a bona fide versus a fake picture. I think the old maxim still holds - a computational expert system typically only does about 85% as well as the human expert. The book "Blink" by Malcolm Gladwell has an extensive and excellent discussion of the power of pattern recognition among human experts.


23. gogoosh on December 13, 2012 8:56 AM writes...

@22:
The power of these pattern recognition algorithms isn't that they outperform human experts, it's that they are automated. Experiments like scanning a huge virtual library cannot be done with human expertise alone.
Some of the comments in this thread make me think that the scientists writing them have never been involved in a productive collaboration between a medicinal chemist and a computational chemist.
In my opinion, any modeler worth her salt will acknowledge that most models are a poor substitute for human expertise, and any med chemist worth her salt will acknowledge that automated, objective methods of assessing the vastness that is chemical space are useful tools.


24. Chris on December 17, 2012 12:32 AM writes...

@12 Anthony Nicholls.

I'm unclear how the scores looking Gaussian implies that the methods were essentially random. If the scores from students taking a test followed a Gaussian curve, I'd think it was mostly because the students had different abilities, even if there were some randomness involved (i.e. test questions, test topic, etc.)

Disclaimer - I was on the winning Kaggle team.


25. Chris on December 17, 2012 12:35 AM writes...

Also, awesome post. It's a great question, and I hope we'll soon learn the answer.


26. Anthony Nicholls on December 19, 2012 9:04 AM writes...

@24 Chris

In answer to your question as to why I think the distribution of scores looking Gaussian is an indication that you were lucky, not good, I would suggest a simple statistical argument - the metric for success was an average of 15 numbers, a very small sample that inevitably will have a significant variance. Different methods will produce a different set of 15 numbers - essentially at 'random'. At issue here is whether such variance between methods could explain the distribution seen in the Kaggle event. By eye, it would look like the std of the contributions is about 0.02 (R**2), i.e. your entry was about 3 standard deviations out (1 in 100). Anyone in our field would be very comfortable with the concept of any two different methods giving (averaged) results that differ by 0.02 over a sample size of 15. Hence, it is a reasonable assumption that you were lucky.
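A toy version of this argument, with the per-system scatter (sd = 0.08) assumed purely for illustration, reproduces that eyeball estimate:

```python
# If "equivalent" methods differ on each of the 15 assay systems only by noise, how much
# should their 15-system average R^2 values spread? The noise level is an assumption.
import numpy as np

rng = np.random.default_rng(3)
system_baseline = rng.uniform(0.2, 0.6, size=15)       # "true" attainable R^2 for each assay
per_system_noise_sd = 0.08                             # method-to-method scatter on one assay

averages = [
    (system_baseline + rng.normal(scale=per_system_noise_sd, size=15)).mean()
    for _ in range(1000)                               # 1000 hypothetical, equally good methods
]
print("sd of the 15-system averages:", round(float(np.std(averages)), 3))   # ~0.02
```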

Of course, the organizers could have partially addressed this by providing all 15 scores, not just the average, because then we could have looked at the correlated improvement of your method - i.e. was your approach consistently better across all systems? That would have been interesting. But this was not provided, confirming my suspicions that a lot of people who do machine learning actually have a fairly poor grasp of statistics.


27. Vladimir Chupakhin on January 4, 2013 7:39 PM writes...

The true validation of the approach would be to take an absolutely NEW dataset with the same data nature/distribution, randomly split it 20-30 times into training and test sets, build models, and test them. The actual improvement of 0.07 is not really a big deal, but if the approach behaves in the same manner on new datasets, it's brilliant.


28. Sergio on March 11, 2013 12:58 AM writes...

In response to Anthony Nicholls's argument:

If the variation between teams were due to the addition of random effects on the 15 problems, you would expect that, given a different set of problems, the order of the teams would be scrambled.

This doesn't happen. Kaggle uses a separate test set to produce the final leaderboard, which is different from the one that is used by contestants throughout the competition. Nonetheless, the leaderboards before and after the deadline tend to look quite similar.
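A quick simulation of that point (all parameter values assumed for illustration): if leaderboard positions were pure per-test-set noise, rankings from two independent test sets would be essentially uncorrelated, whereas genuine skill differences keep them aligned.

```python
# Toy comparison of two scenarios: teams with real skill differences versus pure luck.
# Each scenario produces scores on two independent test sets; we then compare the rankings.
import numpy as np

rng = np.random.default_rng(4)
n_teams, noise_sd = 100, 0.02

def rank(x):
    return np.argsort(np.argsort(x))                   # simple ranks, ties ignored

def rank_correlation(a, b):
    return np.corrcoef(rank(a), rank(b))[0, 1]         # Spearman-style rank correlation

skill = rng.normal(scale=0.03, size=n_teams)           # genuine differences in ability (assumed spread)
no_skill = np.zeros(n_teams)                           # the "all luck" scenario

for label, ability in [("skill + noise", skill), ("pure noise  ", no_skill)]:
    public = ability + rng.normal(scale=noise_sd, size=n_teams)
    private = ability + rng.normal(scale=noise_sd, size=n_teams)
    print(label, "rank correlation:", round(rank_correlation(public, private), 2))
```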

