
In the Pipeline


December 2, 2009

Data, Raw and Otherwise


Posted by Derek

Perhaps I should talk a bit about this phrase "raw data" that I and others have been throwing around. For people who don't do research for a living, it may be useful to see just what's meant by that term.

As an example, I'll use some work that I was doing a few years ago. I had a reaction that was being run under a variety of conditions (about a dozen different ways, actually), but in each case it was expected either to do nothing or to produce the same product molecule. (This was, as you can see, a screen to find which conditions did the best job of getting the reaction to work.) I set this up in a series of vials, taking care to run everything at the same concentration and to start all the reactions at as close to the same time as I could manage.

After a set period, the reaction vials were all analyzed by LC/MS, a common (and extremely useful) analytical technique. I'd already given the folks running that machine a sample of the known product I was looking for, and they'd worked up conditions that reproducibly detected it with high sensitivity. They ran all my samples through the machine, and each one gave a response at the other end.

And those numbers were my raw data - but it's useful to think about what they represented. The machine was set to monitor a particular combination of ions, which would be produced by my desired product. As the sample was pumped through a purification column, the material coming out the far end was continuously monitored for those specific ions, and when they showed up, the machine would count the response it detected and display this as a function of time: a flat line, then a pointed peak that went up and came back down as the material of interest emerged from the column and dwindled away again.

So the numbers the machine gave me were the area under the curve of that peak, and that means, technically, that we're one step away from raw numbers right there. After all, area-under-the-curve is something that's subject to the judgment of a program or a person - where, exactly, does this curve start, and where does it end? Modern analytical machines are quite good at judging this sort of thing, but it's always good to look over their little electronic shoulders to make sure that their calls look correct to you. If you want to be hard-core about it, the raw data would be the detector response for each individual reading, at whatever rate the machine was sampling. That's even more raw than most people need - actually, while writing this, I had to think for a moment to picture the data at that level, because it's not something I'd usually see or worry about. For my purposes, I took the areas that were calculated, since the peak shapes looked good, and the machine's software was able to evaluate them consistently and didn't have to apply any sort of correction to them to meet its own quality standards.
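
(For the curious, here's roughly what that integration step amounts to - a minimal sketch, not the vendor's actual algorithm, with invented numbers and hand-picked peak boundaries. The idea is just trapezoids minus a straight-line baseline.)

```python
# A minimal sketch of peak integration: trapezoidal area under the
# detector trace between a chosen peak start and end, minus the
# straight-line baseline drawn between those two boundary points.
# All numbers here are invented for illustration.

def peak_area(times, responses, start, end):
    pts = [(t, r) for t, r in zip(times, responses) if start <= t <= end]
    area = 0.0
    for (t0, r0), (t1, r1) in zip(pts, pts[1:]):
        area += 0.5 * (r0 + r1) * (t1 - t0)  # one trapezoid slice
    (t0, r0), (t1, r1) = pts[0], pts[-1]
    baseline = 0.5 * (r0 + r1) * (t1 - t0)   # area under the baseline
    return area - baseline

# A fake trace: flat baseline of 5 counts plus a triangular peak at 3.3 min.
times = [i * 0.01 for i in range(500)]
responses = [5 + 80 * max(0.0, 1 - abs(t - 3.3) / 0.3) for t in times]
print(peak_area(times, responses, 3.0, 3.6))  # roughly 24 area units
```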

So there's one set of numbers. But the person running the machine had taken the trouble (as they should have) to run a standard curve using my supplied reference compound. That is, they'd dissolved it up into a series of ever-more-dilute solutions, and run those through the machine beforehand. This, plotted as peak area versus the real concentration, gave pretty much a straight line (as it should), and the machine's software was set up to use this information to also calculate a concentration for every one of my product peaks. So the data set that I got had the standard plot, followed by the experiments themselves, with both the peak areas and the resulting calculated amounts. Since these were related by what was very nearly a straight line, I probably could have used either one. But it's important to realize the difference: by using the calculated concentrations, I could either be correcting for a defect in the machine (if its detector response really wasn't quite linear), or I could be introducing error (if the standard solutions hadn't been made up quite right). It's up to you, as a scientist, to decide which way to go. In my case, I worked up the data both ways, and found that the resulting differences were far too small to worry about. So far, so good.
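
(Again as a sketch: the calibration is just a least-squares line through the standards, inverted to turn a sample's peak area into a concentration. The fit routine and all the values below are illustrative assumptions, not the instrument software's actual code.)

```python
# Sketch of the standard-curve step: fit peak area vs. known concentration
# to a straight line, then invert the line for the experimental samples.

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

std_conc = [0.5, 1.0, 2.0, 4.0, 8.0]            # standards (made-up units)
std_area = [51.0, 98.0, 205.0, 396.0, 810.0]    # their measured peak areas
slope, intercept = fit_line(std_conc, std_area)

sample_area = 312.0                             # one experimental peak
print((sample_area - intercept) / slope)        # back-calculated concentration
```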

But there's another layer: I had done these experiments in triplicate. There were actually thirty-six vials for the twelve different conditions, because I wanted to see how reproducible the experiments were. For my final plots, then, I used the averages of the three runs for each reaction, and plotted the error bars thus generated to show how tight or loose these values really were. That's what I meant about the area numbers versus the concentration numbers question not meaning much in this case. Not only did they agree very well, but the variations between them were far smaller than the variations between different runs of the same experiments, and thus could safely be put in the "don't worry about it" category while interpreting the data.
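
(The averaging step is as simple as it sounds - a mean and a sample standard deviation per condition, with the latter drawn as the error bars. Invented values again:)

```python
import statistics

# Sketch of the triplicate workup: average the three vials per condition,
# and keep the spread for the error bars. The numbers are invented.
conditions = {
    "condition 1": [0.82, 0.79, 0.85],
    "condition 2": [0.31, 0.40, 0.28],
    "condition 3": [0.05, 0.07, 0.04],
}
for name, vals in conditions.items():
    mean = statistics.mean(vals)
    sd = statistics.stdev(vals)      # sample standard deviation, n - 1
    print(f"{name}: {mean:.2f} +/- {sd:.2f} (n={len(vals)})")
```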

What I did notice while doing this, though, was something else that was significant. My mass spec colleague had done something else which was very good practice: including a standard injection every so often during the runs of experimental determinations. Looking these over, I found that this same exact sample, of known concentration, was coming out as having less and less product in it as the process went on. That's certainly not unheard of - it usually means that the detector was getting less sensitive as time went on due to some sort of gradually accumulating gunk from my samples. Those numbers really should have been the same - after all, they were from the same vial - so I plotted out a curve to see how they declined with time. I then produced another column of numbers where I used that as a correction factor to adjust the data I'd actually obtained. The first runs needed little or no correction, as you'd figure, but by the end of the run, there were some noticeable changes.
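
(Here's one way that correction could look - a sketch that assumes simple linear interpolation between the standard injections. The injection numbers and areas are made up, and a real workup might well fit a smoother curve to the decline instead.)

```python
# Sketch of the drift correction: the same standard, injected periodically,
# reads lower and lower over the run, so each sample gets scaled up by the
# standard's apparent decline at that point in the queue. Invented numbers.

std_checks = [(1, 1000.0), (12, 955.0), (24, 905.0), (36, 860.0)]

def expected_standard(n):
    """Linearly interpolate the standard's response at injection n."""
    for (n0, a0), (n1, a1) in zip(std_checks, std_checks[1:]):
        if n0 <= n <= n1:
            return a0 + (a1 - a0) * (n - n0) / (n1 - n0)
    return std_checks[-1][1]          # past the last check, use that reading

nominal = std_checks[0][1]            # treat the first reading as "true"

def corrected_area(area, injection_number):
    return area * nominal / expected_standard(injection_number)

print(corrected_area(500.0, 30))      # a late-run sample, adjusted upward
```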

So now I had several iterations of data for the same experiment. There was the raw raw data set, which I never really saw, and which would have been quite a pile if printed out. This was stored on the mass spec machine itself, in its own data format. Then I had numbers that I could use, the calculated areas of all those peaks. After that I had the corresponding concentrations, corrected via the standard concentration curve run before the samples were injected. Then I had the values that I'd corrected for the detector response over time. And finally, once all this was done, I had the averages of the three replicate runs for each set of conditions.

When I saved my file of data for this experiment, I took care to label everything I'd done. (I was sometimes lazier about such things earlier in my career, but I've learned that you can save ten minutes now only to spend hours eventually trying to figure out what you've done.) The spreadsheet included all those iterations, each in a labeled column ("Area", "Concentration", "Corrected for response"), and both the standard curves and my response-versus-injection-number plots were included.
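
(Even something as simple as writing the labeled columns out to a CSV file does the job. The column names below follow the ones just mentioned; the rows are invented.)

```python
import csv

# Sketch of the record-keeping: one row per vial, one labeled column per
# iteration of the numbers, so the workup can be reconstructed later.
rows = [
    {"vial": 1, "condition": 1, "area": 1523.4,
     "concentration": 0.82, "corrected_for_response": 0.82},
    {"vial": 36, "condition": 12, "area": 512.7,
     "concentration": 0.27, "corrected_for_response": 0.31},
]
with open("screen_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```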

So how did my experiments look? Pretty good, actually. The error bars were small enough to see differences in the various conditions, which is what I'd hoped for, and some of those conditions were definitely much better than others. In fact, I thought I saw a useful trend in which ones worked best, and (as it turned out), this trend was even clearer after applying the correction for the detector response. I was glad to have the data; I've had far, far worse.

When presenting these results to my colleagues, I showed them a bar chart of the averages for the twelve different conditions, with the associated error bars plotted, which was good enough for everyone in my audience. If someone had asked to see my raw data, I would have sent them the file I mentioned above, with a note about how the numbers had been worked up. It's important to remember that the raw data are the numbers that come right out of the machine - the answers the universe gave you when you asked it a series of questions. The averages and the corrections are all useful (in fact, they can be essential), but it's important to have the source from which they came, and it's essential to show how that source material has been refined.

Comments (43) + TrackBacks (0) | Category: Life in the Drug Labs


COMMENTS

1. RB Woodweird on December 2, 2009 9:51 AM writes...

A goes to B. How to best make B from A - this is what we do all day long. But as long as you are using this as an analogy to the recent climate science brouhaha:

You present your conclusions at a meeting in your company. You get several responses.

"A goes to B? We don't believe that A can go to B, therefore you must be lying."

"A goes to B? But the existence of B will reduce our profits, so upper management says that you are mistaken. By the way, you are fired."

"A goes to B? It doesn't matter. The Bible says that God gave man dominion over B."

"A goes to B? Maybe in your limited set of experiments. Where is the data for running the reaction at 56.7 degrees? Suspiciouly missing from your presentation."

"A goes to B? Unlikely, because we had a guy here who worked with another guy who had a lab where they found that A never went to B."

"A goes to B? No way. I have a bottle of A in my lab right now, and it's not going to B. As a matter of fact, it is going to C. You need to rerun all your experiments to find out why A goes to C instead of B."


2. Chemjobber on December 2, 2009 10:13 AM writes...

Presumably, Derek still has his lab notebook (or e-version), with all the attached LC/MS files. If he had scaled up his experiments, I would hope that he would have PDFs of the 1H NMRs of his products, and maybe also the FIDs.

So I guess the key question is this, re Climatedatagate: what do they not have anymore?


3. gyges on December 2, 2009 10:35 AM writes...

Haven't NASA lost tapes of the moon landing?


4. Anon on December 2, 2009 11:16 AM writes...

Machines do work. Instruments (LC-MS) measure things.


5. Hap on December 2, 2009 11:24 AM writes...

RBW: the fact that people are dishonest (in lots of ways) about AGW and in pointing out its (perceived) flaws doesn't excuse one from trying to be honest. Since most people aren't going to be able to analyze the climate data (either raw or analyzed) or the methods used to obtain and analyze it, people are relying on only the conclusions, which depend significantly on how the data is operated on and analyzed. If the people (at least some of them) are dishonest in how they do so, it threatens their ability (and that of others) to mitigate the problems they foresee.

In addition, with something whose consequences (and the consequences of any mitigation) are likely to substantially affect human civilization (and other life forms on Earth), one would figure that the people studying it would be both careful and honest, to make sure at least to themselves that they have the correct conclusions and not just the ones they want. And if one is anticipating the political consequences of one's work, and knowing that those who oppose it will pick at any real or imagined bit of dishonesty in order to allow others to ignore their work, you would also figure that honesty and clarity (backed of course by data) would be both an effective strategy to get people to believe their work and a requirement for them to do so. That they didn't, and were cavalier with a significant part of the data set they use for their research, seems like a problem to me.

"In the end, your word is all there is, really."


6. rob on December 2, 2009 12:58 PM writes...

Derek,

Let's keep going with your description of raw data a bit.

First, as a good scientist, I would expect that you have all data generated by experiments performed by your employer in the early '80s. (In the event that your employer wasn't around then, you may substitute the raw data recorded by grad students working in the labs that later gave rise to your company.)

Some of the data may have been stored on floppy disks. Naturally, you'll be expected to provide the data in the original floppy-disk format for us to examine. You must also provide it in a more readable format than it's stored in now. You must also provide the continuous chain-of-custody information describing how the data was transformed from the floppy-disk storage to (perhaps) tape storage to whatever form it is in now.

Because this data likely backs up patents (long since published) I would expect you will make all of it available online. Now. Otherwise, your patents simply aren't valid.

The only possible reasons why you might not do any of these things (depending on exactly which therapy area you work in) are that 1) your employer knows about but is actively suppressing a cure for cancer, or 2) you know that HIV really isn't the cause of AIDS. (Actually, lots of scientists believe both points, but they can't get their papers published in good journals due to a vast scientific conspiracy.)

Oh, and while we're on the subject, please provide a copy of your company's entire internal email system for the last ten or fifteen years so we can look for further evidence of the conspiracy. And please understand that if we find any emails that look at all questionable-- or if any of your key data sets are missing-- that our allies in the media will ensure it is front page news.

Finally, please be prepared to stop whatever you're doing this week and prove to us that you are not suppressing a cancer cure. Please keep in mind that we don't know a ribozyme from a ribosome, and we simply don't have the time to learn or even pick up an introductory biology textbook. (But we can point you to this great blog which talks about ribozymes all the time, and which thinks you, personally, are part of the vast scientific conspiracy.)

And we-- and our well-funded, well-connected friends-- will get very mad if you don't treat us with the respect we deserve.


7. MarkySparky on December 2, 2009 1:15 PM writes...

#1 RBW:

Is the International A goes to B Committee meeting to decide the economic future of the human race? If so, the scientific validity of A going to B is subordinate to the political reality of the day.

Climategate will likely have very little impact on the scientific questions. Doesn't matter. Reshuffling trillions of dollars needs the absolute ethical high ground, or else it is just plain old stealing in the eyes of the public. Screaming "denier" is just as pointless as screaming "conspiracy".


8. Greg Q on December 2, 2009 1:59 PM writes...

RBW:

It's too bad you can't even get your analogies right. Since you left out the most important one:

You're measuring what makes A -> B? Didn't half the A in your reaction vessels convert to B before you did anything? What makes you think that what you did had any effect, especially since you only did one run of the experiment?

The current warming trend started in 1850. The planet heated more from 1900 to 1950 than it did from 1950 to 2000. CO2 levels are even more elevated, but the temp has stayed the same / gone down over the last decade, contrary to the predictions of the models.

In short, the critics are right to question and challenge the AGW fantasists.

And if they, and you, were real scientists, they'd focus on doing real science, which is to say transparent, reproducible experiments, rather than on playing political games.

If the claims were valid, then the CRU people would have had nothing to fear by releasing their data for others to play with.

If "the truth" (according to the CRU dogma) could only be "seeb" when you stood on one leg, with your hand over your right eye, and hopped with a certain frequency, then the problem isn't with the "deniers" who demanded to see the data for themselves, the problem is with the dishonest hacks who pretended that their method of hopping was the only valid way to look at the data.

And that, dear sir, is the nicest possible way to look at what the CRU people are doing.


9. Chemjobber on December 2, 2009 2:09 PM writes...

Rob: Can you answer my question? You sound like you can.


10. Jay on December 2, 2009 2:17 PM writes...

Derek, I mostly share your view but I would add a few minor revisions.

I think we all understand that there is no ideal concept of "raw data"; however, the term still has meaning in conversation.

The "raw" data was the millivolt reading coming out of the counter. This was processed into a count, and this was further processed by comparing it against the deflector's output to calculate the molecular weight of the ion, and so forth.

However, what is important for your work to be considered "peer reviewed", or "published", is that you provide us with enough information for us to reproduce the experiment. In this case, it's enough to say "I ran it through a mass spec and this is the output", because we all know what that means and it's enough for us to reproduce your results.

Now, unfortunately, it's not really possible to analogize this much further, but that won't stop me from trying. It would be most analogous to inventing the first mass spectrometer and publishing the results.

In this case, the CRU invented a "world thermometer", and this thermometer has yet to be calibrated, but they published their results anyway. Some people question their calibration method and would like them to explain the method by which they converted the "raw" readings of many sensors into the output, but the CRU simply would not provide their readings, or their calibration method, or how the device worked.

That is what really strikes at the heart of the request for the "raw data": the device is not a commercial-off-the-shelf (COTS) machine, it is a one-of-a-kind invention, and there is no real evidence that it works properly.

There is nothing wrong with asking a scientist how he processed the data. That topic should be his area of expertise.


11. Greg Q on December 2, 2009 2:22 PM writes...

Hey Rob,

What's it like to be totally in the wrong, know that you're totally in the wrong, but be unwilling to admit it? Does your stomach churn with acid all the time? Must be unpleasant.

Now, let's look at the CRU people and their situation. They collected a lot of data, and then modified that data. Now, this is not a "one off" event. If you're getting information from a weather station, and modifying it to get it normalized, you're going to have to apply those modifications every time you get data from that station. Which means, if you're at all competent, you're going to have to keep track of exactly what you did. Furthermore, at some point in the future you may discover that you made a mistake. In which case, it would be good to go back and run your new, better, normalizations over the data.

Hard to do if you've thrown it all away.

Which is why no honest and competent scientist would throw that data away.

Now, let's consider the situation they were working in. They knew that what they were working on was controversial. They knew that people were going to question and challenge their conclusions. So, what did they do?

They threw out the data that questioners would need in order to challenge their results.

Now, imagine you're in their shoes, you're competent, honest, and have done everything in a correct and reasonable manner. Are you going to throw out your supporting data?

Or are you going to keep it, and have a great deal of fun giving it to critics, and then mocking them when they come up with the same answers you got?

What's that? You say you fear that your critics would come up with different answers from your data? Well, one possibility is that they will do that by screwing up, in which case you get to mock them in published papers (a win-win for you).

The other is that they will expose holes in your work, invalid, or at least not necessarily valid, assumptions you made, use of the wrong statistical analysis, etc. In other words, they might do real science with the data, but their real science might show that you don't, in fact, own the One True Way.

And the fear of that happening is the most favorable explanation for what the CRU people did.

And if you don't find that indefensible, then I'm glad you're on the other team.


12. Mark on December 2, 2009 2:27 PM writes...

Practical question:

How do you decide how to label the columns? How do you keep track of what you've done to the raw numbers?

I'm trying to get better at this, but when I go back a few years to old stuff, I find myself trying to remember what I meant by the "Correction Applied" column. Or that I have a piece of code that transforms count-per-day into count-per-person. Or converts data from UTC to local time, based on the individual user. And it's not clear where to document all of that. (We don't have "lab notebooks" where I am.)

Are there standards in your group? Any that you can talk about?


13. Greg Q on December 2, 2009 2:38 PM writes...

RBW and Rob,

Have you guys read "Harry's" Readme? Think of how much happier he would have been, if the CRU people had had the basic competence to save off what they were doing, and how they were doing it. Consider how much the quality of their work has been degraded by the fact that they have a bunch of numbers, and programs, and procedures, where nobody knows what they're doing, or why they're doing it that way.

This isn't "oh, they did it the 'wrong' way." This is "these people are freaking incompetents who set up a magical black box and danced about it doing a ritual chant, rather than actually doing anything worth being called 'science'."

This is not just "no one else can reproduce their results", this is "they can't reproduce their own results." They've flat out admitted that. They don't know what data they used when.

Heck, it's not just an "admission", it's their defense. "Hey, we can't give you the data we used to write these papers. No, it's not because we committed fraud, it's because we're so incompetent that we didn't bother to do anything to let us know what data was in our database at the time we wrote the paper."

This is "first year undergraduate" levels of incompetence, from people who are supposed to be world leaders in the field.

And people are jumping up to defend this.

So, what's that tell us about the quality of the work of everyone else in the field?


14. Derek Lowe on December 2, 2009 3:05 PM writes...

Rob, if I were still publishing papers based on that data from the 1980s, I would definitely have had it transferred to another medium by now. In fact, since these numbers would presumably be irreplaceable, I would have made sure to have multiple backups. It should go without saying that if I were making my data an important part of an international advisory recommendation that could upend the world economy, I would want to make sure that I still had copies. Of everything. At every stage.

And no, I don't see the current situation as asking someone to produce some beige-colored 1980s floppy disks. I see it as (to use my real-life example in the post) my furnishing people with data, some of which (but not all) has been adjusted for detector response, some of which has an N of 1 and some of which has an N of 3, and some of which was obtained on a different machine entirely. And not telling people any of that. As one observer has put it:

"Datasets that were processed and developed decades ago and that are now regarded as essential elements of the climate data record often contain elements whose raw data or metadata were not preserved (this appears to be the case with HADCRUT). The HADCRU surface climate dataset needs public documentation that details the time period and location of individual station measurements used in the data set, statistical adjustments to the data, how the data were analyzed to produce the climatology, and what measurements were omitted and why. If these data and metadata are unavailable, I would argue that the data set needs to be reprocessed (presumably the original raw data is available from the original sources)."

I hope that last sentence is accurate, too.


15. milkshake on December 2, 2009 3:15 PM writes...

As someone pointed out, if the data were mangled and the tracks aggressively covered the way they were in this case - but by a pharma company, for the purpose of drug approval - the evil company would end up out of business and the people responsible in jail. Look how many billions it cost Merck to settle their tiny little Vioxx data subterfuge.


16. Hap on December 2, 2009 3:51 PM writes...

I thought that most journals didn't accept work based on datasets if the datasets were not publicly available (with exceptions - the early human genome papers, for example). If the data is unavailable, then readers have only your word as to what is going on - your paper isn't subject to independent testing, which is a key part of the scientific enterprise. Since, at least in some matters, the word of some of the people studying AGW is not sufficient, the data (and its independent validation) is kind of important.

In addition, you would figure that if you are concerned with, I don't know, the future of human existence and purpose on earth, and your research impinges directly on that, you might take care to make sure you have the data. People do transfer data between source materials, since the constant copying of data is the only real way to assure the persistence of data, technical or otherwise, so not having done so seems a significant error. It's hard and costly to do so (hence some of NASA's data losses), but if your research is as important as you say, then it's necessary. If you couldn't be bothered to do that, then you're either fatally stupid or substantially dishonest, either of which is fatal to the credibility of the work in question.


17. enzgrrl on December 2, 2009 4:31 PM writes...

Derek,
I read your post with bated breath, hoping you would finally tell us the rest of the "vial 33" story.
You're such a tease.


18. rob on December 2, 2009 5:32 PM writes...

Derek,

While I can't comment on your personal practices, I sometimes wonder if we work in the same industry.

To take just one example, was the raw data for Scott Reuben's Vioxx studies (the ones subsequently found to be fraudulent) made available? If so, please supply a link. If not, why didn't the lack of raw data raise eyebrows?

Where is the raw data for the VIGOR studies? (The ones which Merck scientists interpreted as showing a protective effect for naproxen.) Please supply a link. And, if this is available, please explain why no one reexamined this data before prescribing Vioxx on a scale large enough to cause ~100k cardiac events.

And when I say raw data, I mean *raw data.* The medical records of every single patient. Their family histories. The evidence that the randomization was appropriately conducted.

But, of course, this isn't really a very good analogy. Everybody knows that tens of thousands of lives (and billions of dollars) ride on the outcome of certain drug trials, so this absolutely should have been made available. By contrast, nobody had much of a clue back in the early '80s that a bunch of random weather records from all over the world were very important.

Climate scientists today are doing the best they can with old data gathered on a series of instruments designed for a very different purpose. They are very upfront about this. And they hold themselves (and are held to) much higher standards than scientists in most other fields-- and *far* higher than those who advocate the unrestricted dumping of CO2 into the atmosphere.

There are mountains of other evidence for the incredibly harmful effects dumping CO2 into the air will cause. This evidence has been completely ignored in the current debate for a variety of reasons, including a belief that all of climate science is one vast conspiracy.

If you are really curious about the underlying reality-- and it sounds like you are (although not all your commenters)-- I'd recommend actually learning about the field. You can start at the 50,000' level with this (http://www.realclimate.org/index.php/archives/2007/08/the-co2-problem-in-6-easy-steps/) or with books by Tim Flannery, Peter Ward or other climate scientists, or by Joe Romm, Mark Lynas or other climate science writers. Or you can read online presentations of science by scientists, such as those by James Hansen or Lonnie Thompson. Once you have the 50,000' view, you can delve into the primary literature.

The science is important, Derek. Read up on it.


19. srp on December 2, 2009 5:38 PM writes...

I second the sentiment on Vial 33.

The data hiding (and the confessed motives for such hiding) speak for themselves. And it's not like the outside critics of the paleoclimate reconstructions aren't better trained in statistics and econometrics than the insiders. When experts look at your results and tell you that your statistical methods look dodgy, it would behoove you to enlist their expertise to get it right on matters of such consequence. When you are unfamiliar with the term "heteroskedasticity" (as Gavin Schmidt of RealClimate once casually admitted) you probably shouldn't be debating proper regression techniques.


20. Martin on December 2, 2009 5:51 PM writes...

Arrggh, the GW debate rises to the top here too...

Anyway, back on topic: raw data can be faked too. A couple of years ago there was an illuminating stoush over a faked protein crystal structure in Nature (PDB code 2hr0).

google "ccp4 the importance of using our validation tools"

Basically, the PDB structure and the structure factors could not be reconciled, so the conclusion that had to be drawn was that the data had been faked. This led to suggestions that the original image files (several GB) need to be deposited too.

Interestingly, the same author just had a 1999 structure paper retracted by J Biol Chem. One-line retraction, no details.


21. rob on December 2, 2009 5:53 PM writes...

Also, see this editorial in the 12/3 Nature on the subject.

http://www.nature.com/nature/journal/v462/n7273/full/462545a.html

Nature calls claims that the emails demonstrate suppression of evidence "laughable" and "paranoid", which will doubtless be interpreted by some here as proof that Nature is also part of the Vast Climate Scientist Data-hiding Conspiracy.


22. srp on December 2, 2009 5:59 PM writes...

In point of fact, Nature has historically been one of the worst offenders in violating its own policies on data disclosure in this area.


23. Derek Lowe on December 2, 2009 6:29 PM writes...

Rob, the data for those studies is with the FDA, which is within its rights to ask for it in as raw a form as it wants. And the companies involved have to produce it, too.

As for Vioxx, Merck has settled about 3100 cases, not all of them fatalities, and the FDA's own estimate was that the drug might have contributed to as many as about 28,000 heart attacks (not all of them fatal, either). So I don't know where your 100K number comes from, actually. You'd think that Merck wouldn't be able to make their legal problems go away by settling 3% of that total.

And - as I'm sure you know - drug approvals are granted for their risk/benefit ratio, with no drug being warranted free from risk. Everything that goes out onto the market is capable of coming right back off if it starts to show trouble out in the real world. We can argue, of course, about how good a job the FDA does evaluating the numbers beforehand, or how good a job the companies do generating them.

Thanks for the reading recommendations. I'm fairly familiar with the debate in this area, although I don't think that fighting it out in this blog is a good use of anyone's time. But it's possible to be much more informed than I am and still not be completely sold on anthropogenic warming, or its severity.


24. Hap on December 2, 2009 6:32 PM writes...

Hearing about vial 33 would be neat, but I thought that it was likely related to stuff he does for his current employer and hence off-limits to blogging.


25. Hap on December 2, 2009 6:35 PM writes...

I think the relevant phrasing to Rob would be "Thank you, cowboy. I will take it under advisement."


26. dearieme on December 2, 2009 7:22 PM writes...

"And they hold themselves (and are held to) much higher standards than scientists in most other fields": hide the decline.


27. rob on December 2, 2009 8:45 PM writes...

Derek,

The ~100k Vioxx cardiac event estimate (~35% fatal) I cited came from a citation by Wikipedia (too lazy to do more than a quick google search), which appears to have gotten the number from a Lancet paper by an FDA analyst. Although I verified the number came from the Lancet, I must admit I stopped there, since I'm sure any study published in such a respected journal would be eminently reproducible-- as is typical for our field-- right down to the raw data, which will of course be online.

http://en.wikipedia.org/wiki/Rofecoxib#Withdrawal

So your argument now seems to be "well, people in our field don't publish raw data, but they would share with the government if told to."

I'm curious: would you accept an argument like "We climate scientists don't publish our raw data, but we would confidentially make it available to the IPCC (or UK gov't, or gov't of the Maldives) if told to"? Would this meet your high standards of scientific transparency?

And that's before we get into the issues of regulatory capture, and the numerous sad examples of regulators failing to ask difficult questions (Bernie Madoff, the financial crisis, Vioxx itself, etc.)

Finally, dearieme, I'm not sure if you read the whole sentence from the purloined email that contains the phrase "hide the decline", but if you did you'd know it doesn't refer either to temperatures or to data. Instead it refers to the construction of a graph. While such behavior is unfortunate, it is ubiquitous at every company I've ever worked at.


28. Greg Q on December 2, 2009 8:55 PM writes...

Something crystallized for me today. Rob said something that, in various forms, I've read from multiple "defenders" of the Jones, Mann, et al. crowd:
"Gee, what would you expect them to do when 'well-funded, well-connected' researchers are trying to destroy your work?"

This defense is so contrary to human nature I wonder at the species of the people making it.

When I'm shooting the breeze with my friends, I'm willing to toss out guesses and half-remembered bits of data in order to forward a discussion. OTOH, when I'm arguing with someone who strongly disagrees with me, I make sure of everything before I say it, and think carefully about my logic before I send my comments off.

So, what we're basically hearing is one of three possibilities:

1: The peer review process for "climate science" is a complete joke. Jones, Mann, etc. know that they can send out papers, and no one will actually try to challenge anything they say. IOW, they're "among friends", and can get away with whatever they want.

If this is true, we need to junk the entire "climate science" literature, and start over again from scratch.

2: Jones, Mann, et al. are frauds, their papers are total garbage, and they destroyed all the data behind them to keep anyone from figuring out what they did.

3: Jones, Mann, et al. are complete incompetents. They don't know the first thing about how to do real science, and that's why they never saved the raw data, didn't freeze their data sets for each paper (and record their contents), and why they sound like schoolyard thugs in their emails.

Note that these people were the top names in the field. If they are incompetent hacks, what's that say about the rest of the field?

In any event, for 2 or 3 the answer is the same: retract every single one of their papers, and every paper that references any of their papers. Scrap everything produced by the IPCC, it's all tainted crap.


Those are the conclusions to be drawn from what the "defenders" are arguing.


29. Greg Q on December 2, 2009 9:09 PM writes...

Rob,

One last thought: The amount of time the CRU hacks wasted fighting FOI and other information requests greatly exceeded the amount of time it would have taken them to package up the data and put it on a website where anyone could download it. Heck, they already did that for various other "favored" researchers (see the emails about the perils of leaving data available on FTP). So your whines about the time and effort involved are off point.

It is their refusal to share data that led to the emails being released. If they'd simply acted like honest scientists, and put the information out on the web, no one would have been filing FOI requests in the first place. (At least, not until they found obvious fraud in Jones', Mann's and friends' treatment of that data.)


30. weirdo on December 2, 2009 11:21 PM writes...

So now Rob's primary source is Wikipedia?

I guess we now know what he considers quality science.

Oh, and by the way, for any issued U.S. Patent, you have every right to demand all (ALL) of the raw data that is disclosed in that patent by challenging the patent. Go ahead. It ain't cheap, but generic companies do it all the time. Don't pretend it ain't so.

Why, oh why, am I wasting time on strawmen? Bored on a Wednesday night, I guess.


31. cliffintokyo on December 3, 2009 1:18 AM writes...

Final Word?

MAY THE SOURCE BE WITH YOU!


32. SteveSC on December 3, 2009 7:45 AM writes...

To most people I know doing 'real research', the raw data is precious, not something you get by going to Wikipedia. It takes a LOT of time to plan, collect, and organize, and if you believe in it, you are happy to share it with others (once you have published, of course).

Heck, we carted my wife's dissertation data (the real raw stuff, the paper forms with inked characters, not electrons) around in a book box for 20 years through 8 moves; I still feel guilty about dumping it even though she is totally out of the field.

As I see it, you are reluctant to share data when:
1) you haven't published yet,
2) you are squeezing the data for 'one more pub',
3) your boss has told you not to share,
4) you don't really believe in the data (e.g., you know there are errors or perhaps even fraud that someone else might find).

Regarding #2, there are some legitimate (I guess) researchers who have built a dataset and spent the rest of their life milking it for pub after pub. In my experience these people are, at best, mediocre scientists and the reason they don't share the original data is that they know other, more talented, scientists will quickly squeeze every publishable idea out of the data and leave them with nothing left to do (except attempt some more original research of course ;-)

One problem with Federally funded research now (especially the NIH, which is where I have my experience) is that a lot of money is going to fund development and prolonged analysis of 'proprietary' data sets that are outside of the public domain for decades. I have argued in review groups that once a Federally-funded data set has published a single article the data should be made public, but it usually receives some covert eye-rolling because many of the reviewers have (or hope to get) funding for similar gravy-trains.


33. tgibbs on December 3, 2009 12:33 PM writes...

"Where is the raw data for the VIGOR studies? (The ones which Merck scientists interpreted as showing a protective effect for naproxen.) Please supply a link. And, if this is available, please explain why no one reexamined this data before prescribing Vioxx on a scale large enough to cause ~100k cardiac events."

While this doesn't invalidate the point that you are making, what the data showed (and what the paper reported) was a higher frequency of cardiovascular events in the rofecoxib group than in the naproxen group. The authors speculated that there might be a protective effect of naproxen that was absent for rofecoxib. Estimates of the number of cardiac events caused by rofecoxib will probably have to be scaled down still further, as patients who took rofecoxib would almost certainly otherwise have taken another NSAID, and it is becoming increasingly clear that virtually all NSAIDs carry some cardiovascular risk at the elevated doses commonly used by patients with arthritis.


34. tgibbs on December 3, 2009 1:57 PM writes...

"One last thought: The amount of time the CRU hacks wasted fighting FOI and other information requests greatly exceeded the amount of time it would have taken them to package up the data and put it on a website where anyone could download it."

However, one has to consider the fact that it was not actually their data. The raw data belonged to the various meteorological services that provided it, and apparently was subject to various agreements, both formal and informal, regarding redistribution. I might be very free in handing out my own data, but more hesitant to hand out data belonging to a collaborator, especially to somebody whose intention seemed to be fault-finding rather than actual science--I'd be more likely to say, "If you want their data, you really should ask them for it."


35. Dana H. on December 3, 2009 2:23 PM writes...

"I'm not sure if you read the whole sentence from the purloined email that contains the phrase 'hide the decline', but if you did you'd know it doesn't refer either to temperatures or to data. Instead it refers to construction of a graph."

It's more complex than that. The data being plotted were proxy temperature data from tree rings. Over the period during which the ring widths could be compared to instrumental temperature data, they appeared to correlate well -- up to a certain time. After this time, the tree ring widths diverged strongly downward from measured temperatures. This is the "divergence problem" that anyone in the field was aware of long before the climategate story broke -- but which was not widely publicized.

"Hide the decline" refers to the trick of grafting the tree ring data onto the measured temperature data at the point the divergence appears. This results in a graph in which the "divergence problem" no longer appears. This is not an honest presentation of the data.

The real problem here is that the presence in the tree ring data of the "decline" raises the question: If tree rings aren't even a good proxy for current temperatures, why should we assume that they can track temperatures going back 1000 years? These data were key to the claim that the current observed warming is "unprecedented".


36. Tok on December 3, 2009 4:52 PM writes...

Greg Q - "When I'm shooting the breeze with my friends, I'm willing to toss out guesses, and half remembered bits of data in order to forward a discussion." - which is quite a bit like these emails, comments made to friends that might not have even been serious or followed through on, rather just someone venting steam.

As for me, I still like "innocent until proven guilty" and so I'll hold my judgment until the official investigation is completed and we get all sides of the story.


37. Pedro S on December 4, 2009 4:45 AM writes...

"innocent until proven guilty" should also be applied to skeptics. However, James Hansen, Al Gore, Michael Mann et al. have too often tarred all critics of catastrophic AGW as pseudo-scientists shills for the oil industry. And many (if any) of the present defenders of the CRU authors have claimed that skeptics should be considered "innocent until proven guilty" then.


38. PedroS on December 4, 2009 5:01 AM writes...

Oops... I meant
"And not many (if any)"

instead of
"And many (if any)"


39. Hap on December 4, 2009 12:57 PM writes...

I can understand not wanting to hand out someone else's data, but they did publish using the data, and so they should have either made it available through their collaborators (who probably should either have been noted as the sources of the data or invoked as coauthors) or made it available themselves (with the agreement of the original data sources). In either case, the data needed to be available to others, considering the importance of their work and the importance of the conclusions made from it. Basing really important decisions on secret data (and either a model which we can't necessarily validate or a person/organization whose rational thought processes are in question) is largely responsible for our current position in Iraq, among other (at least questionable) decisions.

"Trust us, the data says what we claim" is not an acceptable response.


40. Greg Q on December 7, 2009 1:27 PM writes...

Tok,

The problem isn't their emails, the problem is their papers. If their papers were done with scientific rigor, they wouldn't need to hide the data behind those papers. They wouldn't have spent years fighting FOI requests.

These guys wrote "scientific papers" that could not stand up to scrutiny by anyone who didn't already agree with them. That is what we're learning from the source code to their modeling programs. We're learning the same from their emails, not from their word choices, but from their actions as described in the emails.


41. Greg Q on December 7, 2009 1:48 PM writes...

TGibbs,

1: If you're going to publish a paper based upon data, you damn well ought to make sure you can publish the data before you try to use it in a paper.

2: At a minimum, they should have said "here's all the freely available data, here's how we modified that data, and here are our other sources of data, whom you can contact for permission for us to give you the data we used, because we have a written contract with them (included as an attachment in this email) forbidding us to give their data to anyone else."

They didn't do that, because those "agreements" were merely an excuse for why they didn't pass out the data.

3: You wrote "I might be very free in handing out my own data, but more hesitant to hand out data belonging to a collaborator, especially to somebody whose intention seemed to be fault-finding rather than actual science"

This, more than anything else, shows your ignorance about "actual science". 99% of "actual science" is looking at other people's work, and then trying to find fault with it.

"Where did they stop short?" "What questions haven't they answered?" "Do the data really support the graphs they're showing?" "Do these graphs actually mean what they say they mean?" "Did they really include all the information I need to reproduce their results?"

Those are just some of the questions an "actual scientist" would ask when reading a paper. Those questions can be the basis for future research, for deciding the direction a lab's research will take (holes in other people's work == papers you can write).

Jones et al. don't hate the "skeptics" because the skeptics aren't real scientists; they hate them because they are real scientists, not groupies who will slavishly support anything Jones et al. say and do.


42. tgibbs on December 9, 2009 4:00 PM writes...

If somebody with no credentials in my area of research contacts me asking for a collaborator's raw data, I'll probably take a few minutes to draft a polite reply explaining that he is barking up the wrong tree, and directing him to my collaborator. If he persists in barraging me with similar requests, I'll likely figure that he is some kind of crank and add him to my spam filter. I most certainly would not ask lab personnel to take time away from productive research to dig through the lab's archives looking for a copy of the data to send him. Why would anybody want my secondary copy of the raw data, anyway? If you want the original data, you go to the source. That way, you are less likely to trip over data processing errors or data corruption.

I don't agree that 99% of actual science is "finding fault." This is an error of thought that I often encounter in beginning graduate students. The problem is that if you look hard enough, you can almost always find some sort of fault or limitation in any study, and human nature being what it is, it is natural to look harder for faults in results that you'd prefer not to believe, and then use the existence of flaws as an excuse to dismiss the results. What scientists learn in their early training is to focus on the positive: what conclusions can be reached from the results? If there are weaknesses (as there almost always are), how specifically do they impact the conclusions? What other hypotheses are compatible with the result? Is there other work in the field that can distinguish among the hypotheses? Has anybody else carried out similar investigations, ideally using different methodology (and hence subject to different flaws and artifacts), and are those results consistent with the same conclusions? If there are still unanswered questions, are there studies that we can do that could resolve the issue?


43. Sili on December 20, 2009 1:16 PM writes...

Thanks, Derek, this was a lot more reasoned than I'd feared.

As for 'fighting FOI requests for years' [citation needed], the demands for raw data in that case remind me of Andy Schlafly and his desire to get his hands on Lenski's samples of citrate-metabolising E. coli.

