May 17, 2011
Quis Custodiet Ipsos Custodes?
Yesterday's look into the Google Ngram data set brought up a discussion in the comments on how good its numbers are (and those in other large data sets). "Garbage in, garbage out" is as true a statement as ever, so it's a real worry. (Even if the data were perfect, the numbers could still be misused and misinterpreted, of course.)
An e-mail from a reader pointed me to another example of this sort of thing. The NIH Chemical Genomics Center (NCGC) has a collection of known pharmaceutically active compounds for use in screening and target ID. This is a good idea, and the same sort of thing is done internally in the drug industry. But the ChemConnector blog has some questions about how robust the dataset is. The rough estimate is that between 5 and 10% of the 7600+ structures are messed up in some way (stereochemistry, salt form, the dreaded pentavalent carbon, and so on).
Read the comments there for some interesting back-and-forthing with the NIH people. The NCGC folks realize that they have some problems, and are willing to put in the work to help clean things up. The problem is, they'd already published on this list, calling it "definitive, complete, and nonredundant", which now seems to be a bit premature. . .
Category: Chemical News | The Scientific Literature