About this Author
Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship for his post-doc. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis and other diseases. To contact Derek, email him directly. Twitter: Dereklowe

In the Pipeline

May 17, 2011

Quis Custodiet Ipsos Custodes?

Posted by Derek

Yesterday's look into the Google Ngram data set brought up a discussion in the comments on how good the numbers are in it (and in other large datasets). "Garbage in, garbage out" is as true a statement as ever, so it's a real worry. (Even if the data were perfect, the numbers could still be misused and misinterpreted, of course).

An e-mail from a reader pointed me to another example of this sort of thing. The NIH Chemical Genomics Center (NCGC) has a collection of known pharmaceutically active compounds for use in screening and target ID. This is a good idea, and the same sort of thing is done internally in the drug industry. But the ChemConnector blog has some questions about how robust the dataset is. The rough estimate is that between 5 and 10% of the 7600+ structures are messed up in some way (stereochemistry, salt form, the dreaded pentavalent carbon, and so on).

Read the comments there for some interesting back-and-forthing with the NIH people. The NCGC folks realize that they have some problems, and are willing to put in the work to help clean things up. The problem is, they'd already published on this list, calling it "definitive, complete, and nonredundant", which now seems to be a bit premature. . .
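
For a sense of what catching the simplest of these errors involves, here is a minimal sketch of a valence check that would flag a pentavalent carbon. It is only an illustration, not how NCGC or ChemConnector actually validate structures: the `MAX_VALENCE` table and the `overvalent_atoms` helper are hypothetical names, and the table assumes neutral atoms with no charges or aromaticity handling.

```python
# Hypothetical maximum valences for a few neutral atoms (an assumption for
# illustration; real toolkits also handle charges, aromaticity, and more).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1}


def overvalent_atoms(atoms, bonds):
    """Return indices of atoms whose total bond order exceeds MAX_VALENCE.

    atoms: list of element symbols, e.g. ["C", "H", "H", "H", "H"]
    bonds: list of (i, j, order) tuples, where i and j index into atoms
    """
    totals = [0] * len(atoms)
    for i, j, order in bonds:
        totals[i] += order
        totals[j] += order
    return [
        idx
        for idx, (symbol, total) in enumerate(zip(atoms, totals))
        if symbol in MAX_VALENCE and total > MAX_VALENCE[symbol]
    ]


# Methane passes; a carbon carrying five hydrogens is flagged.
methane = (["C", "H", "H", "H", "H"], [(0, k, 1) for k in range(1, 5)])
pentavalent = (["C"] + ["H"] * 5, [(0, k, 1) for k in range(1, 6)])

print(overvalent_atoms(*methane))
print(overvalent_atoms(*pentavalent))
```

A check like this only catches the grossest errors. Wrong stereochemistry or the wrong salt form are perfectly valid structures that just don't match the compound, which is why they need curation against the original sources.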

Comments (11) + TrackBacks (0) | Category: Chemical News | The Scientific Literature


1. Martin on May 17, 2011 5:50 PM writes...

And it's not just NCGC. I've lost count of the number of times we've used PubChem's SMILES export to generate structures for modelling only to find that all the amide bonds in a molecule are "Smiled" into their enol forms. Looks really bad in a cyclic pentapeptide.

2. Noel Southall on May 17, 2011 5:57 PM writes...

The goal was to build a definitive, complete and nonredundant screening resource for the community and we say right in the paper that we will never be completely 'done'. This is a work in progress, and we are actively soliciting community feedback via the site - we've actually had a fantastic response on individual records so far from the community - and Tony in particular has graciously shared the results of his own work to improve the resource. This is what we were hoping for, as it gets us beyond what we could have possibly hoped to accomplish on our own. We point to Tony himself as an example in helping curate the db and correct the errors that were found rather than simply complaining about them. "Better to light a lamp than to curse the darkness" as the old saying goes.

Our endgame is really to provide a public screening resource. And as embarrassing as it might be for us in some of these cases, doing it in the open and publicly allows everyone else to reap the benefits.

3. Antony Williams, ChemConnector on May 17, 2011 8:48 PM writes...

Derek...I have stopped reviewing the dataset at this point as it would consume weeks of work to validate/curate. I have gathered data on particular classes of compounds and for certain classes am finding errors such as 1) Complete stereochemistry, but incorrect 2) No stereochemistry at all, but multiple sites are expected to be explicitly defined and 3) the usual situation of incomplete stereochemistry. For some of the classes I have examined I have observed over 70% "errors". But it goes much deeper than this.

Antibiotics represented as simple aliphatic chains with terminal carboxylic acid groups is quite surprising!

It appears that many of the heuristics referenced in the paper as being applied were not actually used, because if they were I would not see many of the issues I have referenced in other posts on the ChemConnector Blog. For example:

I appreciate Noel's comments that I have been helping out with reporting some of the issues. The reality is it would take me a long time to manually wade through the data set and report all of the errors through the NPC Browser one at a time. And, not meaning to sound selfish, but if I did it, where would the data be held regarding the contribution made to clean the data? We need to develop systems where people's contributions to improve/curate/validate data are tracked and recognized. This should all contribute to an AltMetrics representation of a scientist's contribution:

In relation to this, I'm on my way to an ORCID meeting (and stuck in an airport). Acceptance of ORCID and using such an identifier to track contributions to data clean-up such as that in the NCGC dataset might encourage scientists to offer their contributions more willingly. We already track contributions on ChemSpider and expose them in the curation panel but are more than willing to jump on with ORCID support when the time is right.

4. Derek Lowe on May 17, 2011 9:02 PM writes...

Thanks for coming by to comment! It really is a worthy goal, and it's a lot harder than it looks. I'm glad that you're getting so much feedback, because that's what it's going to take to whip something like this into shape.

But I think what struck me was the line from the Science paper's abstract: "We report here the creation of a definitive, complete, and nonredundant list of all approved molecular entities as a freely available electronic resource and a physical collection of small molecules amenable to high-throughput screening." That part, at any rate, made it sound as if the goal had already been achieved.

5. gippgig on May 17, 2011 10:09 PM writes...

This is a universal principle - databases (space launches, nucleotide sequences, subway systems, you name it) are always riddled with errors. Despite the fact that I was aware of this and took countermeasures, errors still managed to creep into my yeast gene maps back in the early 1990s. ALWAYS go back and check the original source.

6. Anonymous on May 17, 2011 10:46 PM writes...

It takes lots of patience and rigor to do a good job like that.

Think of how much trash is in the PDB database.

I also looked at some supporting data files in literature occasionally, but it's routine to see issues when the dataset is big.

It's probably why most papers do not have supporting molecular files (I mean 3D). Otherwise, it's easy to be trashed :-)

7. Sean Ekins on May 18, 2011 8:14 AM writes...

In some ways I feel personally responsible. Innocently I passed Antony the subset of NCGC structures a couple of weeks back. We had worked on finding drug structures for a couple of FDA datasets / databases (read Excel files) on drugs repurposed for rare diseases so this was really a very interesting resource. Antony did a quick analysis and found errors. He blogged on it, then I suggested on my own blog the need for a gold standard database. Following this the flood gates opened. We were criticized and told basically that we should be contributing rather than complaining and that no database is perfect. I have a few issues with this. NCGC have not posted a warning on their site that the structures are "problematic" despite the many blogs from Antony as he digs deeper in the even more error prone "highly curated" subset. I am a taxpayer and I expect my dollars to be used wisely. Building a "definitive" database and screening resource could be valuable. Yet there are errors and therefore as it stands the database is potentially useless. I will not accept second best and neither should Chris Austin or for that matter Francis Collins.

I feel strongly enough about this. Both Antony and I have contributed by raising the issue and sticking our necks out. It is not our day job to police NIH funded databases. Yes, communities can help point out errors, but when a database is so fundamentally flawed it needs attention immediately. The bigger problem, as we pointed out, is that these databases get reused and any errors proliferate and pollute our scientific environment. Our government has many agencies to regulate products like drugs being manufactured but they cannot even put a database of correct structures together for the drugs they have approved. It is beyond my comprehension. They should pull it now. Fix it and then re-release.

8. JeffHarris,untangledhealth on May 18, 2011 10:26 PM writes...

So while the wizards refer to their questionable reference datasets with new tools that expedite new products for the healthcare industry, we sit on the other side attempting to isolate 'therapeutic effect'. Same problem with our databases and lack of standards for communicating outcomes derived by the very people funded to implement pilot projects in population medicine. For example: healthcare claims, which carry wonderful information on diagnosis and procedure, are intrinsically unreliable; vendors who state they adhere to data communication standards such as HL7 often modify message segments for their customers who have a special program they are implementing through a grant emerging from one of many funders (NIH, BPHC, SAMHSA, CMS…) and need to figure out a quick workaround to generate their reports for the next round of funding. Next thing you know, we folks involved in the analysis of merged HL7 data find groups of patients with a median blood pressure of 6.8. Turns out the HL7 record was changed by the vendor to carry a quality of life score that was not annotated.

Ah, the tangled web we weave. Now the Feds are implementing Accountable Care Organizations that will be charged with ranking physician and system-wide performance. I sure hope the junior analysts performing the data cleaning have some sense of process for filtering out the junk. I hope even more that physicians and patients capture data in uniform structures, or the distribution of reward for improvement will be meaningless.

Someday we might have traceability between carbon compound, lifestyle, geographic position, genetic predisposition, demographics and treatment facility. Well, it's a nice dream anyway that is shared by many of us such as # and those of us who want to make a difference at all levels experienced by our patients including: Disease, Impairment, Disability and Handicap.

Jeff Harris

9. cliffintokyo on May 19, 2011 1:28 AM writes...

Fry your hard drive - pass the custard - take out the garbage - go on vacation (permanently) to NIH.....ah! life in the fat lane.

10. newnickname on May 20, 2011 8:59 AM writes...

An historical note, of sorts, from the earlier days (1980's) of compound and reaction databases (pre-PubChem). The computational chem grad student needed to re-encode several databases (SynLib, Reaccs, InfoChem, ...) into a large set in order to evaluate other parameters. Since not everyone was willing to share their file formats, that led to some skillful code cracking so he could re-encode into a single format. He discovered that a HUGE number of structures were corrupted of their intended chemical meaning and that resulted in spurious reaction "hits" (in the original programs, too). Among the errors: disconnected bonds (-C C- instead of -C-C-); atom labels covering / floating over a node but not replacing it (-C Br instead of -Br); etc.

I think he eliminated over 30% of a one million member compound library that way. I think the end results were published but with only brief commentary on the source dbases. ("We extracted 600,000 usable structures from the 1,000,000 member X database...") The companies were notified of the problems.

Rest assured that bibliographic info was also corrupted.

11. Sean Ekins on May 23, 2011 9:00 AM writes...

Derek, it appears NCGC just added a disclaimer to the dataset download page (I just blogged on it). While they do not address the compound quality issue, they focus on the browser adding curation mechanisms in future. This is not a fix of the errors nor an explicit acceptance of such. Still a big question as to why it was released in such a shape.
