In the Pipeline


December 2, 2009

Data, Raw and Otherwise


Posted by Derek

Perhaps I should talk a bit about this phrase "raw data" that I and others have been throwing around. For people who don't do research for a living, it may be useful to see just what's meant by that term.

As an example, I'll use some work that I was doing a few years ago. I had a reaction that was being run under a variety of conditions (about a dozen different ways, actually), but in each case was expected to either do nothing or produce the same product molecule. (This was, as you can see, a screen to find out which conditions did the best job of getting the reaction to work). I set this up in a series of vials, taking care to run everything at the same concentration and to start all the reactions off at as close to the same time as I could manage.

After a set period, the reaction vials were all analyzed by LC/MS, a common (and extremely useful) analytical technique. I'd already given the folks running that machine a sample of the known product I was looking for, and they'd worked up conditions that reproducibly detected it with high sensitivity. They ran all my samples through the machine, and each one gave a response at the other end.

And those numbers were my raw data - but it's useful to think about what they represented. The machine was set to monitor a particular combination of ions, which would be produced by my desired product. As the sample was pumped through a purification column, the material coming out the far end was continuously monitored for those specific ions, and when they showed up, the machine would count the response it detected and display this as a function of time: a flat line, then a curved, pointed peak which went up and then came back down as the material of interest emerged from the column and dwindled away again.

So the numbers the machine gave me were the area under the curve of that peak, and that means, technically, that we're one step away from raw numbers right there. After all, area-under-the-curve is something that's subject to the judgment of a program or a person - where, exactly, does this curve start, and where does it end? Modern analytical machines are quite good at judging this sort of thing, but it's always good to look over their little electronic shoulders to make sure that their calls look correct to you. If you want to be hard-core about it, the raw data would be the detector response for each individual reading, at whatever frame rate the machine was sampling at. That's even more raw than most people need - actually, while writing this, I had to think for a moment to picture the data at that level, because it's not something I'd usually see or worry about. For my purposes, I took the areas that were calculated, since the peak shapes looked good, and the machine's software was able to evaluate them consistently and didn't have to apply any sort of correction to them to meet its own quality standards.
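
For readers who like to see such things spelled out, here's a minimal sketch in Python of what the area-under-the-curve step amounts to. The boundary choices are exactly the judgment call described above, and everything here (the function, its arguments) is illustrative, not the actual instrument software:

    import numpy as np

    def peak_area(time, response, t_start, t_end):
        """Integrate the detector trace between the chosen peak boundaries."""
        # Picking t_start and t_end is the judgment call discussed above;
        # here they're simply handed in as two numbers.
        mask = (time >= t_start) & (time <= t_end)
        # Trapezoidal integration of the selected slice of the trace.
        return np.trapz(response[mask], time[mask])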

So there's one set of numbers. But the person running the machine had taken the trouble (as they should have) to run a standard curve using my supplied reference compound. That is, they'd dissolved it up into a series of ever-more-dilute solutions, and run those through the machine beforehand. This, plotted as peak area versus the real concentration, gave pretty much a straight line (as it should), and the machine's software was set up to use this information to also calculate a concentration for every one of my product peaks. So the data set that I got had the standard plot, followed by the experiments themselves, with both the peak areas and the resulting calculated amounts. Since these were related by what was very nearly a straight line, I probably could have used either one. But it's important to realize the difference: by using the calculated concentrations, I could either be correcting for a defect in the machine (if its detector response really wasn't quite linear), or I could be introducing error (if the standard solutions hadn't been made up quite right). It's up to you, as a scientist, to decide which way to go. In my case, I worked up the data both ways, and found that the resulting differences were far too small to worry about. So far, so good.
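
In code, the standard-curve step is just a linear fit that then gets inverted. A rough sketch, with invented numbers standing in for the real standards:

    import numpy as np

    # Serial dilutions of the reference compound (hypothetical units)
    # and the peak areas they produced - illustrative values only.
    std_conc = np.array([0.1, 0.5, 1.0, 5.0, 10.0])
    std_area = np.array([215., 1040., 2090., 10450., 20800.])

    # Fit area = slope * concentration + intercept.
    slope, intercept = np.polyfit(std_conc, std_area, 1)

    def area_to_conc(area):
        """Invert the standard curve: calculated concentration for a peak."""
        return (area - intercept) / slope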

But there's another layer: I had done these experiments in triplicate. There were actually thirty-six vials for the twelve different conditions, because I wanted to see how reproducible the experiments were. For my final plots, then, I used the averages of the three runs for each reaction, and plotted the error bars thus generated to show how tight or loose these values really were. That's what I meant about the area numbers versus the concentration numbers question not meaning much in this case. Not only did they agree very well, but the variations between them were far smaller than the variations between different runs of the same experiments, and thus could safely be put in the "don't worry about it" category while interpreting the data.
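
The averaging step, sketched the same way - with random placeholder numbers rather than my actual results:

    import numpy as np

    # areas[i, j]: replicate j (of 3) for condition i (of 12).
    rng = np.random.default_rng(0)
    areas = rng.normal(1000., 50., size=(12, 3))  # placeholder data

    means = areas.mean(axis=1)
    # Sample standard deviation across the triplicates; these become
    # the error bars on the final plot.
    errors = areas.std(axis=1, ddof=1)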

What I did notice while doing this, though, was something else that was significant. My mass spec colleague had done something else which was very good practice: including a standard injection every so often during the runs of experimental determinations. Looking these over, I found that this same exact sample, of known concentration, was coming out as having less and less product in it as the process went on. That's certainly not unheard of - it usually means that the detector was getting less sensitive as time went on due to some sort of gradually accumulating gunk from my samples. Those numbers really should have been the same - after all, they were from the same vial - so I plotted out a curve to see how they declined with time. I then produced another column of numbers where I used that as a correction factor to adjust the data I'd actually obtained. The first runs needed little or no correction, as you'd figure, but by the end of the run, there were some noticeable changes.
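
And the correction-factor step, sketched once more: fit the decline of the repeated standard injections, then rescale each sample back to time-zero sensitivity. The numbers are invented, and a linear fit is assumed here purely for simplicity:

    import numpy as np

    # Injection numbers at which the known standard was re-run, and the
    # areas it gave each time (illustrative values showing the decline).
    check_idx = np.array([0, 10, 20, 30])
    check_area = np.array([2100., 2010., 1930., 1840.])

    # Fit the drop in detector response over the course of the run.
    slope, intercept = np.polyfit(check_idx, check_area, 1)

    def drift_correction(n):
        """Factor that rescales injection n back to time-zero sensitivity."""
        return intercept / (slope * n + intercept)

    # Example: correct a sample that was injected 25th in the sequence.
    corrected_area = 1900. * drift_correction(25)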

So now I had several iterations of data for the same experiment. There was the raw raw data set, which I never really saw, and which would have been quite a pile if printed out. This was stored on the mass spec machine itself, in its own data format. Then I had numbers that I could use, the calculated areas of all those peaks. After that I had the corresponding concentrations, corrected using the standard concentration curve run before the samples were injected. Then I had the values that I'd corrected for the detector response over time. And finally, once all this was done, I had the averages of the three replicate runs for each set of conditions.

When I saved my file of data for this experiment, I took care to label everything I'd done. (I was sometimes lazier about such things earlier in my career, but I've learned that you can save ten minutes now only to spend hours eventually trying to figure out what you've done.) The spreadsheet included all those iterations, each in a labeled column ("Area", "Concentration", "Corrected for response"), and both the standard curves and my response-versus-injection-number plots were included.
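
One way to keep those labels honest - a sketch, not my actual file - is to build the table with every iteration of the workup as its own named column, so the provenance of each number is visible in the saved file:

    import pandas as pd

    # Placeholder values; one labeled column per stage of the workup.
    df = pd.DataFrame({
        "Vial":                   [1, 2, 3],
        "Area":                   [2100., 2050., 1980.],
        "Concentration":          [1.01, 0.98, 0.95],
        "Corrected for response": [1.01, 0.99, 0.97],
    })
    df.to_csv("screen_results_labeled.csv", index=False)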

So how did my experiments look? Pretty good, actually. The error bars were small enough to see differences in the various conditions, which is what I'd hoped for, and some of those conditions were definitely much better than others. In fact, I thought I saw a useful trend in which ones worked best, and (as it turned out), this trend was even clearer after applying the correction for the detector response. I was glad to have the data; I've had far, far worse.

When presenting these results to my colleagues, I showed them a bar chart of the averages for the twelve different conditions, with the associated error bars plotted, which was good enough for everyone in my audience. If someone had asked to see my raw data, I would have sent them the file I mentioned above, with a note about how the numbers had been worked up. It's important to remember that the raw data are the numbers that come right out of the machine - the answers the universe gave you when you asked it a series of questions. The averages and the corrections are all useful (in fact, they can be essential), but it's important to have the source from which they came, and it's essential to show how that source material has been refined.
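
For completeness, the final presentation step in the same sketched-out style - a bar chart with the error bars attached, using placeholder values rather than my real averages:

    import matplotlib.pyplot as plt
    import numpy as np

    conditions = [f"Cond {i + 1}" for i in range(12)]
    means = np.random.default_rng(1).uniform(0.3, 1.0, 12)  # placeholder averages
    errors = np.full(12, 0.05)                              # placeholder error bars

    plt.bar(conditions, means, yerr=errors, capsize=3)
    plt.ylabel("Product concentration (relative)")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()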

Comments (43) + TrackBacks (0) | Category: Life in the Drug Labs


COMMENTS

1. RB Woodweird on December 2, 2009 9:51 AM writes...

A goes to B. How to best make B from A - this is what we do all day long. But as long as you are using this as an analogy to the recent climate science brouhaha:

You present your conclusions at a meeting in your company. You get several responses.

"A goes to B? We don't believe that A can go to B, therefore you must be lying."

"A goes to B? But the existence of B will reduce our profits, so upper management says that you are mistaken. By the way, you are fired."

"A goes to B? It doesn't matter. The Bible says that God gave man dominion over B."

"A goes to B? Maybe in your limited set of experiments. Where is the data for running the reaction at 56.7 degrees? Suspiciouly missing from your presentation."

"A goes to B? Unlikely, because we had a guy here who worked with another guy who had a lab where they found that A never went to B."

"A goes to B? No way. I have a bottle of A in my lab right now, and it's not going to B. As a matter of fact, it is going to C. You need to rerun all your experiments to find out why A goes to C instead of B."


2. Chemjobber on December 2, 2009 10:13 AM writes...

Presumably, Derek still has his lab notebook (or e-version), with all the attached LC/MS files. If he had scaled up his experiments, I would hope that he would have PDFs of the 1H NMRs of his products, and maybe also the FIDs.

So I guess the key question is this, re Climatedatagate: what do they not have anymore?


3. gyges on December 2, 2009 10:35 AM writes...

Haven't NASA lost tapes of the moon landing?


4. Anon on December 2, 2009 11:16 AM writes...

Machines do work. Instruments (LC-MS) measure things.


5. Hap on December 2, 2009 11:24 AM writes...

RBW: the fact that people are dishonest (in lots of ways) about AGW and in pointing out its (perceived) flaws doesn't excuse one from trying to be honest. Since most people aren't going to be able to analyze the climate data (either raw or analyzed) or the methods used to obtain and analyze it, people are relying on only the conclusions, which depend significantly on how the data is operated on and analyzed. If the people (at least some of them) are dishonest in how they do so, it threatens their ability (and that of others) to mitigate the problems they foresee.

In addition, with something whose consequences (and the consequences of any mitigation) are likely to substantially affect human civilization (and other life forms on Earth), one would figure that the people studying it would be both careful and honest, to make sure at least to themselves that they have the correct conclusions and not just the ones they want. And if one is anticipating the political consequences of one's work, and knowing that those who oppose it will seize on any real or imagined bit of dishonesty in order to allow others to ignore their work, you would also figure that honesty and clarity (backed of course by data) would be both an effective strategy to get people to believe their work and a requirement for them to do so. That they didn't, and were cavalier with a significant part of the data set they use for their research, seems like a problem to me.

"In the end, your word is all there is, really."


6. rob on December 2, 2009 12:58 PM writes...

Derek,

Let's keep going with your description of raw data a bit.

First, as a good scientist, I would expect that you have all data generated by experiments performed by your employer in the early '80s. (In the event that your employer wasn't around then, you may substitute the raw data recorded by grad students working in the labs that later gave rise to your company.)

Some of the data may have been stored on floppy disks. Naturally, you'll be expected to provide the data in the original floppy-disk format for us to examine. You must also provide it in the more readable format it's stored in now. You must also provide the continuous chain-of-custody information describing how the data was transformed from floppy-disk storage to (perhaps) tape storage to whatever form it is in now.

Because this data likely backs up patents (long since published) I would expect you will make all of it available online. Now. Otherwise, your patents simply aren't valid.

The only possible reasons why you might not do any of these things are (depending on exactly which therapy area you work in) because 1) your employer knows about but is actively suppressing a cure for cancer, or 2) you know that HIV really isn't the cause of AIDS. (Actually, lots of scientists believe both points, but they can't get their papers published in good journals due to a vast scientific conspiracy.)

Oh, and while we're on the subject, please provide a copy of your company's entire internal email system for the last ten or fifteen years so we can look for further evidence of the conspiracy. And please understand that if we find any emails that look at all questionable-- or if any of your key data sets are missing-- our allies in the media will ensure it is front page news.

Finally, please be prepared to stop whatever you're doing this week and prove to us that you are not suppressing a cancer cure. Please keep in mind that we don't know a ribozyme from a ribosome, and we simply don't have the time to learn or even pick up an introductory biology textbook. (But we can point you to this great blog which talks about ribozymes all the time, and which thinks you, personally, are part of the vast scientific conspiracy.)

And we-- and our well-funded, well-connected friends-- will get very mad if you don't treat us with the respect we deserve.


7. MarkySparky on December 2, 2009 1:15 PM writes...

#1 RBW:

Is the International A goes to B Committee meeting to decide the economic future of the human race? If so, the scientific validity of A going to B is subordinate to the political reality of the day.

Climategate will likely have very little impact on the scientific questions. Doesn't matter. Reshuffling trillions of dollars needs the absolute ethical high ground, or else it is just plain old stealing in the eyes of the public. Screaming "denier" is just as pointless as screaming "conspiracy".


8. Greg Q on December 2, 2009 1:59 PM writes...

RBW:

It's too bad you can't even get your analogies right. Since you left out the most important one:

You're measuring what makes A -> B? Didn't half the A in your reaction vessels convert to B before you did anything? What makes you think that what you did had any effect, especially since you only did one run of the experiment?

The current warming trend started in 1850. The planet heated more from 1900 to 1950 than it did from 1950 to 2000. CO2 levels are even more elevated, but the temp has stayed the same / gone down over the last decade, contrary to the predictions of the models.

In short, the critics are right to question and challenge the AGW fantasists.

And if they, and you, were real scientists, they'd focus on doing real science, which is to say transparent, reproducible experiments, rather than on playing political games.

If the claims were valid, then the CRU people would have had nothing to fear by releasing their data for others to play with.

If "the truth" (according to the CRU dogma) could only be "seeb" when you stood on one leg, with your hand over your right eye, and hopped with a certain frequency, then the problem isn't with the "deniers" who demanded to see the data for themselves, the problem is with the dishonest hacks who pretended that their method of hopping was the only valid way to look at the data.

And that, dear sir, is the nicest possible way to look at what the CRU people are doing.


9. Chemjobber on December 2, 2009 2:09 PM writes...

Rob: Can you answer my question? You sound like you can.


10. Jay on December 2, 2009 2:17 PM writes...

Derek, I mostly share your view but I would add a few minor revisions.

I think we all understand that there is no ideal concept of "raw data", however the word still has meaning in conversation.

The "raw" data was the millivolt reading coming out of the counter. This was processed into a count, and this was further processed by comparing it against the deflector's output to calculate the molecular weight of the ion, and so forth.

However, what is important for your work to be considered "peer reviewed", or "published", is that you provide us with enough information for us to reproduce the experiment. In this case, it's enough to say "I ran it through a mass spec and this is the output", because we all know what that means and it's enough for us to reproduce your results.

Now, unfortunately, it's not really possible to analogize this much further, but that won't stop me from trying. It would be most analogous to inventing the first mass spectrometer and publishing the results.

In this case, the CRU invented a "world thermometer", and this thermometer has yet to be calibrated, but they published their results anyway. Some people question their calibration method and would like them to explain the method by which they converted the "raw" readings of many sensors into the output, but the CRU simply would not provide their readings, or their calibration method, or how the device worked.

That is what really strikes to the heart of the request for the "raw data", and that is that the device is not a commercial-off-the-shelf (COTS) machine, it is a one-of-a-kind invention and there is no real evidence that it works properly.

There is nothing wrong with asking a scientist how he processed the data. That topic should be his area of expertise.


11. Greg Q on December 2, 2009 2:22 PM writes...

Hey Rob,

What's it like to be totally in the wrong, know that you're totally in the wrong, but be unwilling to admit it? Does your stomach churn with acid all the time? Must be unpleasant.

Now, let's look at the CRU people and their situation. They collected a lot of data, and then modified that data. Now, this is not a "one off" event. If you're getting information from a weather station, and modifying it to get it normalized, you're going to have to apply those modifications every time you get data from that station. Which means, if you're at all competent, you're going to have to keep track of exactly what you did. Furthermore, at some point in the future you may discover that you made a mistake. In which case, it would be good to go back and run your new, better, normalizations over the data.

Hard to do if you've thrown it all away.

Which is why no honest and competent scientist would throw that data away.

Now, let's consider the situation they were working in. They knew that what they were working on was controversial. They knew that people were going to question and challenge their conclusions. So, what did they do?

They threw out the data that questioners would need in order to challenge their results.

Now, imagine you're in their shoes, you're competent, honest, and have done everything in a correct and reasonable manner. Are you going to throw out your supporting data?

Or are you going to keep it, and have a great deal of fun giving it to critics, and then mocking them when they come up with the same answers you got?

What's that? You say you fear that your critics would come up with different answers from your data? Well, one possibility is that they will do that by screwing up, in which case you get to mock them in published papers (a win-win for you).

The other is that they will expose holes in your work, invalid, or at least not necessarily valid, assumptions you made, use of the wrong statistical analysis, etc. In other words, they might do real science with the data, but their real science might show that you don't, in fact, own the One True Way.

And the fear of that happening is the most favorable explanation for what the CRU people did.

And if you don't find that indefensible, then I'm glad you're on the other team.


12. Mark on December 2, 2009 2:27 PM writes...

Practical question:

How do you decide how to label the columns? How do you keep track of what you've done to the raw numbers?

I'm trying to get better at this, but when I go back a few years to old stuff, I have trouble remembering what I meant by the "Correction Applied" column. Or that I have a piece of code that transforms count-per-day into count-per-person. Or converts data from UTC to local time, based on the individual user. And it's not clear where to document all of that. (We don't have "lab notebooks" where I am.)

Are there standards in your group? Any that you can talk about?


13. Greg Q on December 2, 2009 2:38 PM writes...

RBW and Rob,

Have you guys read "Harry's" Readme? Think of how much happier he would have been, if the CRU people had had the basic competence to save off what they were doing, and how they were doing it. Consider how much the quality of their work has been degraded by the fact that they have a bunch of numbers, and programs, and procedures, where nobody knows what they're doing, or why they're doing it that way.

This isn't "oh, they did it the 'wrong' way." This is "these people are freaking incompetents who set up a magical black box and danced about it doing a ritual chant, rather than actually doing anything worth being called 'science'."

This is not just "no one else can reproduce their results", this is "they can't reproduce their own results." They've flat out admitted that. They don't know what data they used when.

Heck, it's not just an "admission", it's their defense. "Hey, we can't give you the data we used to write these papers. No, it's not because we committed fraud, it's because we're so incompetent that we didn't bother to do anything to let us know what data was in our database at the time we wrote the paper."

This is "first year undergraduate" levels of incompetence, from people who are supposed to be world leaders in the field.

And people are jumping up to defend this.

So, what's that tell us about the quality of the work of everyone else in the field?


14. Derek Lowe on December 2, 2009 3:05 PM writes...

Rob, if I were still publishing papers based on that data from the 1980s, I would definitely have had it transferred to another medium by now. In fact, since these numbers would presumably be irreplaceable, I would have made sure to have multiple backups. It should go without saying that if I were making my data an important part of an international advisory recommendation that could upend the world economy, I would want to make sure that I still had copies. Of everything. At every stage.

And no, I don't see the current situation as asking someone to produce some beige-colored 1980s floppy disks. I see it as (to use my real-life example in the post) my furnishing people with data, some of which (but not all) has been adjusted for detector response, some of which has an N of 1 and some of which has an N of 3, and some of which was obtained on a different machine entirely. And not telling people any of that. As one observer has put it:

"Datasets that were processed and developed decades ago and that are now regarded as essential elements of the climate data record often contain elements whose raw data or metadata were not preserved (this appears to be the case with HADCRUT). The HADCRU surface climate dataset needs public documentation that details the time period and location of individual station measurements used in the data set, statistical adjustments to the data, how the data were analyzed to produce the climatology, and what measurements were omitted and why. If these data and metadata are unavailable, I would argue that the data set needs to be reprocessed (presumably the original raw data is available from the original sources)."

I hope that last sentence is accurate, too.


15. milkshake on December 2, 2009 3:15 PM writes...

As someone pointed out, if the data were mangled and the tracks aggressively covered, as in this case, but by a pharma company for the purpose of drug approval, the evil company would end up out of business and the people responsible in jail. Look how many billions it cost Merck to settle their tiny little Vioxx data subterfuge.


16. Hap on December 2, 2009 3:51 PM writes...

I thought that most journals didn't accept work based on datasets if the datasets were not publicly available (with exceptions - the early human genome papers, for example). If the data is unavailable, then readers have only your word as to what is going on - your paper isn't subject to independent testing, which is a key part of the scientific enterprise. Since, at least in some matters, the word of some of the people studying AGW is not sufficient, the data (and its independent validation) is kind of important.

In addition, you would figure that if you are concerned with, I don't know, the future of human existence and purpose on earth, and your research impinges directly on that, you might take care to make sure you have the data. People do transfer data between source materials, since the constant copying of data is the only real way to assure the persistence of data, technical or otherwise, so not having done so seems a significant error. It's hard and costly to do so (hence some of NASA's data losses), but if your research is as important as you say, then it's necessary. If you couldn't be bothered to do that, then you're either fatally stupid or substantially dishonest, either of which is fatal to the credibility of the work in question.


17. enzgrrl on December 2, 2009 4:31 PM writes...

Derek,
I read your post with bated breath, hoping you would finally tell us the rest of the "vial 33" story.
You're such a tease.


18. rob on December 2, 2009 5:32 PM writes...

Derek,

While I can't comment on your personal practices, I sometimes wonder if we work in the same industry.

To take just one example, was the raw data for Scott Reuben's Vioxx studies (the ones subsequently found to be fraudulent) made available? If so, please supply a link. If not, why didn't the lack of raw data raise eyebrows?

Where is the raw data for the VIGOR studies? (The ones which Merck scientists interpreted as showing a protective effect for naproxen.) Please supply a link. And, if this is available, please explain why no one reexamined this data before prescribing Vioxx on a scale large enough to cause ~100k cardiac events.

And when I say raw data, I mean *raw data.* The medical records of every single patient. Their family histories. The evidence that the randomization was appropriately conducted.

But, of course, this isn't really a very good analogy. Everybody knows that tens of thousands of lives (and billions of dollars) ride on the outcome of certain drug trials, so this absolutely should have been made available. By contrast, nobody had much of a clue back in the early '80s that a bunch of random weather records from all over the world were very important.

Climate scientists today are doing the best job they can with old data gathered on a series of instruments designed for a very different purpose. They are very upfront about this. And they hold themselves to (and are held to) much higher standards than scientists in most other fields-- and *far* higher than those who advocate the unrestricted dumping of CO2 into the atmosphere.

There are mountains of other evidence for the incredibly harmful effects dumping CO2 into the air will cause. This evidence has been completely ignored in the current debate for a variety of reasons, including a belief that all of climate science is one vast