Corante

About this Author

Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship on his post-doc. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis and other diseases. To contact Derek email him directly: derekb.lowe@gmail.com Twitter: Dereklowe

In the Pipeline

April 5, 2006

Which World Do We Live In, Anyway?

Posted by Derek

Keeping up with the literature? I'm clearly not, although I regularly (or semi-regularly) read nearly a dozen journals. There's too much stuff coming out, and has been for a long time. Even harder than just "keeping up" is integrating what you read into some sort of coherent whole. I'm pretty sure that only the greatest scientists have ever done a really good job on that front.

These thoughts are prompted by a recent paper in PNAS by some statisticians at Columbia and Yale. They reference the GeneWays project, an attempt at text-mining the biomedical literature for relevant information. To me, what the project looks like is an automated version of what I've done with papers at various points in my career - sit down and rewrite them into a condensed version that gets across their key points. Doing that forces you to read and comprehend the whole thing, in detail, and explain it to yourself in terms that you understand. It's labor-intensive, but worthwhile when you really have to absorb something you're unfamiliar with.

Here's a look at some of the information typically extracted out of a paper. As you can see, they're going for the basics - what interacts with what. This may seem like trivializing the papers involved, but it's a good first step, as any look into the molecular and cell biology literature will confirm. If you'd like to experience this for yourself, try running the term "NF-κB" through a search engine like PubMed, pretending that you had rashly agreed to do a review article on its various biological activities. There are, as of this evening, nine hundred and thirteen pages of search results. Next week there will be more.

The PNAS article uses the GeneWays data set to look at how scientific information is dealt with over time. They searched out statements about the same pair of substances - things like "Kinase W phosphorylates (or does not phosphorylate) Protein X". Then they tracked how these cascaded through the literature. There are, of course, plenty of opportunities for honestly conflicting statements about the same actors, given the variables involved.

But there are probably even more reasons for such assertions to reinforce each other, as the authors point out. The statements could indeed be true, and relatively easy to verify, through experiments with a low error rate. Or some of the later statements could be unverified restatements of the earlier ones: "As is well known. . .". There's also the way that some substances have of interacting with so many things that statements about their involvement with some other molecule have better than even odds of being true just by chance - some kinases come to mind.

So they used a probabilistic model, with parameters for each step along the way. These include possibilities like discarding negative data, performing (or not performing) an independent experiment before publishing a statement, the probabilities of getting false positives or false negatives in such experiments, the probabilities that positive (or negative) data are actually published and the probabilities that other scientists actually read them before publishing their own work, etc.
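
To make those dials concrete, here is a minimal Python sketch of one chain of statements about a single pair of substances. It is a toy, not the authors' model: the function name `simulate_chain`, its parameters, and every number used are assumptions made up for illustration, and it leaves out some of the steps listed above (for instance, whether later scientists actually read the earlier papers).

```python
import random


def simulate_chain(n_scientists, truth, fp_rate, fn_rate,
                   p_experiment, p_publish_negative, seed=0):
    """Toy chain of published statements about one pair of substances.

    Returns a list of booleans: True means "X phosphorylates Y" was
    published, False means "X does not phosphorylate Y" was published.
    All parameters are hypothetical dials, not values from the paper.
    """
    rng = random.Random(seed)
    chain = []
    for _ in range(n_scientists):
        if chain and rng.random() >= p_experiment:
            # No independent experiment: restate the most recent claim.
            chain.append(chain[-1])
            continue
        # Independent experiment, with error rates tied to the true state.
        if truth:
            result = rng.random() >= fn_rate  # false negative with prob. fn_rate
        else:
            result = rng.random() < fp_rate   # false positive with prob. fp_rate
        # Positive results always get published; negative ones only sometimes.
        if result or rng.random() < p_publish_negative:
            chain.append(result)
    return chain


if __name__ == "__main__":
    # Arbitrary illustrative settings: a true interaction, modest error rates,
    # few independent checks, and most negative results left in the drawer.
    chain = simulate_chain(n_scientists=200, truth=True, fp_rate=0.1,
                           fn_rate=0.1, p_experiment=0.3,
                           p_publish_negative=0.2)
    positives = sum(chain)
    print(f"{len(chain)} statements published, {positives} positive")
```

With `p_experiment=0` and a first report that happens to be a false positive, every later statement in this toy simply repeats the error, which is the "As is well known. . ." scenario in miniature.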

They come up with a whole range of hypothetical chains through the literature. Turning the dials on the model in different ways can give you strongly conforming virtual scientists, who believe everything they read, strongly argumentative ones who are fond of reversing earlier data, scientists who go with the flow until someone brave publishes a contrary instance and then switch to follow that, and many others. Then there's the style they call "mild scepticism":

"In this hypothetical world, scientists do read their peer's articles and try to compare their own results to the published ones but tend to trust their own results more than the data published by their peers. Patterns that resemble the mild skepticism were prevalent in our real-world data set, but analysis revealed the presence of all five hypothetical patterns."

Putting some numbers on that, it appears from the real publication data that scientists tend to weight their own personal results about ten times more than those that they read in the literature. (I'd love to see this broken down by author, I can tell you). They also found, as I'd expect, that positive statements make up more than 95% of the whole data set. (The authors seem a bit baffled by this - negative results are famously difficult to publish, guys). They also found very high correlations within individual chains of statements - reversals are rare indeed.

What they found completely unnerving was that, given their assumptions, the real-world data are explained equally well by two possible research universes: one where false-positive and false-negative rates are low, and there's a huge preponderance of positive statements among the set of true ones. The other one is a world with very high error rates, in which a given positive statement is much more likely to be false than it is to be true:

"Another major question also remains open: In which of the two alternative universes discovered in our analysis are we living? Our results indicate that the optimistic and pessimistic realities are almost equally likely given currently available data."
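
For a feel of how two such different worlds can look alike from the outside, here is a back-of-the-envelope Bayes calculation in Python. It is only a sketch: the prior and error rates below are invented for illustration and are not the paper's fitted values, and it ignores publication bias and the restating of earlier claims, both of which the real model accounts for.

```python
def positive_fraction(prior, fp_rate, fn_rate):
    """Share of experimental results that come out positive.

    `prior` is the (hypothetical) fraction of tested claims that are true.
    """
    return prior * (1 - fn_rate) + (1 - prior) * fp_rate


def prob_true_given_positive(prior, fp_rate, fn_rate):
    """Bayes' rule: chance that a positive statement is actually true."""
    true_positives = prior * (1 - fn_rate)
    return true_positives / positive_fraction(prior, fp_rate, fn_rate)


# Hypothetical numbers chosen so both worlds yield about 95% positives.
optimistic = dict(prior=0.96, fp_rate=0.01, fn_rate=0.01)
pessimistic = dict(prior=0.05, fp_rate=0.95, fn_rate=0.05)

for name, world in (("optimistic", optimistic), ("pessimistic", pessimistic)):
    share = positive_fraction(**world)
    ppv = prob_true_given_positive(**world)
    print(f"{name}: {share:.1%} positive, P(true | positive) = {ppv:.3f}")
```

Both made-up worlds produce roughly the same share of positive statements, yet in the second a positive statement is far more likely to be false than true, so counting positives alone cannot tell them apart.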

Category: The Scientific Literature


COMMENTS

1. Carlos on April 5, 2006 7:47 PM writes...

"Next week there will be more."

Oh it's much worse than that. A few years ago I had to perform a thorough, but never-to-be-published literature review of TNF-alpha for a biotech client. My search terms were:

TNF-alpha
tumor necrosis factor alpha
tumour necrosis factor alpha...

My point is that with synonyms and spelling variants and abbreviations, the number of articles is actually much higher than you might think, and thus growing much faster than you want!

I think it's impossible to keep up. At best, we can take an occasional "deep dive" as a situation warrants.

2. Theodore Price on April 5, 2006 10:06 PM writes...

I'm not on my work computer, so I'm not going to be able to tell you the databases, but, in a related vein, about a month ago I spent quite a bit of time utilizing one of these "literature interaction" search engines trying to find candidate binding partners for a protein of interest. I stumbled upon it quite by accident while trying to find a yeast 2 hybrid (y2h) database where real (at least in yeast) interacting partners are deposited from high throughput screens. While I found quite a few "lit interactions" for my protein of interest, of course it wasn't in the y2h database at all. Hence, I switched to some other more common proteins I'm interested in and the "lit interaction" database was actually a fairly good predictor for the y2h database. Stunned, I went through some of the papers and none of them really had any direct interaction information, just upstream/downstream cascade type stuff. The y2h database is still a bit sparse, but as the interactome projects really start to pick up steam I'd be very interested to see if someone does a large scale analysis to see if I just got lucky or if there is really something to it.

3. RKN on April 6, 2006 7:42 AM writes...

There are actually multiple efforts underway trying to accomplish this. Several of them have already been implemented in pathway analysis software, both freeware and commercial packages. I've been working with a number of them lately.

What I find an exciting possibility is not just discovering what "interacts" with what, but once most of the interactions for an organism are known (the interactome), being able to go the next step and understand the underlying kinetics and dynamics of the interactions. It would be great, for instance, to be able to run "what-if" scenarios with the software and predict what changes might occur in the interactome if I up/down regulated this or that protein.

4. Robin Goodfellow on April 6, 2006 9:57 AM writes...

Is this a true dichotomy? Can't we also live in a universe where some arbitrary mixture of options 1 & 2 holds?

5. Derek Lowe on April 6, 2006 10:37 AM writes...

Robin, the authors seem to have been just as puzzled. They say that under their assumptions (all of which seem reasonably realistic to me) the intermediate universes don't fit the data at all, but that the two extremes do an equally good job.

When they broke down the statements into the fuzzier ones (using verbs like "activate" or "regulate") versus the physical ones (with verbs like "methylate" or "phosphorylate"), the pessimist's universe had a high posterior probability for the physical statements, but the two universes were much more even in the fuzzier set.

6. Paul on April 6, 2006 10:57 AM writes...

"The authors seem a bit baffled by this - negative results are famously difficult to publish, guys."

I still do not see why a negative result is so hard to publish. One would think, at least I do in my universe, that a negative result is good information. Is it that a "failed" experiment is viewed by the researcher as a failure?

Think about it. What is the success rate of organic chemists? How many "failed" reactions are run for every novel product? Let's not forget all those "rationally designed systems" that never made it into the literature because they did not do what they were designed to do. I have always thought there should be an outlet for this information.

7. SRC on April 6, 2006 11:47 AM writes...

At the risk of starting another row (!), I'll plump for the pessimistic model for the biological literature.

I reviewed quite a bit (>200 papers/y, roughly one a day) while in academia, and find relatively few papers in the bio literature that would pass muster in chemistry.

JACS explicitly asks reviewers whether the data support the conclusions, and rejects out of hand papers for which that is not the case. The bio journals seem to apply an implicit standard that the data not contradicting the conclusions is good enough. Many times the data don't really speak to the validity of the conclusions (typically for an experiment where one outcome is dispositive, but the other ambiguous, e.g., the fishing for the Loch Ness monster scenario). The conclusions in such cases are one of the possible explanations for the data, but others remain unexcluded.

That sort of ambiguity gets a paper bounced from the chemistry literature, but apparently not from the biology journals, which owing to the complexity of the systems make room for that sort of thing. To put it another way, chemistry journals apply a reasonable doubt standard, biology ones a "more likely than not" standard.

8. Jim Hu on April 6, 2006 2:56 PM writes...

RKN,

There's a discussion of the different types of experiments over at Alex Palazzo's blog.

I'm not sure I buy your argument about standards of proof. This is a journal with chem in the name, for example.

I'd like to see data that supports your conclusions about chem vs. biology journals beyond a reasonable doubt before accepting that conclusion!

9. JSinger on April 6, 2006 4:50 PM writes...

"I still do not see why a negative result is so hard to publish. One would think, at least I do in my universe, that a negative result is good information. Is it that a "failed" experiment is viewed by the researcher as a failure?"

No, it's a question of Least Publishable Unit. Informative isn't necessarily publishable, because the universe of things that might possibly be true is so enormous that the publication system doesn't have room for 99.9% of the things that aren't.

While the new online-only journals have published more high-profile work than I had expected, one of the biggest wins with them (BioMedCentral, especially) is the amount of previously unpublishable work that makes it into press now. Several of those have saved me a huge amount of time.

10. RKN on April 6, 2006 6:36 PM writes...

"I'm not sure I buy your argument about standards of proof."

Me? I didn't make any such argument.

11. APalazzo on April 6, 2006 8:09 PM writes...

"Another major question also remains open: In which of the two alternative universes discovered in our analysis are we living? Our results indicate that the optimistic and pessimistic realities are almost equally likely given currently available data."

We live in world number 2.

And yes, negative data is almost IMPOSSIBLE to publish in Biology. Why? Well, until 2-3 years ago it was thought that neurogenesis did not occur. One positive result wiped all those negative results away. Why? Because before that one positive result, everyone else was looking at the problem in the wrong way.

Look, it's hard to publish negative results in biology and it's a fact that we all have to live by. And it's not going to change anytime soon.

12. Jim Hu on April 7, 2006 12:01 AM writes...

Oops...wrong 3 letter regular. SRC, not RKN. My apologies.

13. bioinfoguy on April 8, 2006 10:00 PM writes...

Take the study with a grain of salt. GeneWays is a very highly distilled feature extraction of the literature. It is meant to extract positive statements about interaction, which should not necessarily be confused with positive *results* (if A interacts with B and you didn't think it did, well...not so positive for your theory).

The conclusions in the article apply not to the literature at large, but to the GeneWays dataset, and are thus much less compelling. Had they made an attempt to mine negative statements from the literature itself it would be more convincing.
