Perhaps I should talk a bit about this phrase "raw data" that I and others have been throwing around. For people who don't do research for a living, it may be useful to see just what's meant by that term.
As an example, I'll use some work that I was doing a few years ago. I had a reaction that was being run under a variety of conditions (about a dozen different ways, actually), but in each case was expected either to do nothing or to produce the same product molecule. (This was, as you can see, a screen to find which conditions did the best job of getting the reaction to work). I set this up in a series of vials, taking care to run everything at the same concentration and to start all the reactions off at as close to the same time as I could manage.
After a set period, the reaction vials were all analyzed by LC/MS, a common (and extremely useful) analytical technique. I'd already given the folks running that machine a sample of the known product I was looking for, and they'd worked up conditions that reproducibly detected it with high sensitivity. They ran all my samples through the machine, and each one gave a response at the other end.
And those numbers were my raw data - but it's useful to think about what they represented. The machine was set to monitor a particular combination of ions, which would be produced by my desired product. As the sample was pumped through a separation column, the material coming out the far end was continuously monitored for those specific ions, and when they showed up, the machine would count the response it detected and display it as a function of time: a flat line, then a curved, pointed peak that went up and came back down as the material of interest emerged from the column and dwindled away again.
So the numbers the machine gave me were the area under the curve of that peak, and that means, technically, that we're one step away from raw numbers right there. After all, area-under-the-curve is something that's subject to the judgment of a program or a person - where, exactly, does this curve start, and where does it end? Modern analytical machines are quite good at judging this sort of thing, but it's always good to look over their little electronic shoulders to make sure that their calls look correct to you. If you want to be hard-core about it, the raw data would be the detector response for each individual reading, at whatever frame rate the machine was sampling at. That's even more raw than most people need - actually, while writing this, I had to think for a moment to picture the data at that level, because it's not something I'd usually see or worry about. For my purposes, I took the areas that were calculated, since the peak shapes looked good, and the machine's software was able to evaluate them consistently and didn't have to apply any sort of correction to them to meet its own quality standards.
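For readers who like to see the mechanics, that integration step can be sketched in a few lines. The numbers here are invented for illustration; the real judgment call, as above, is deciding which readings are baseline and which belong to the peak.

```python
import numpy as np

# Invented detector trace: time (seconds) and detector response (counts)
# sampled across one peak, with flat baseline at both ends.
times = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
response = np.array([0.0, 2.0, 10.0, 22.0, 10.0, 2.0, 0.0])

# Area under the curve by the trapezoidal rule, over the window we've
# decided counts as "the peak" - that window choice is the judgment call.
area = np.sum((response[1:] + response[:-1]) / 2 * np.diff(times))
print(area)  # 23.0
```

Integration software does considerably more than this (baseline fitting, shoulder detection, quality checks), but at bottom it is summing detector response over a chosen window, just as here.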
So there's one set of numbers. But the person running the machine had taken the trouble (as they should have) to run a standard curve using my supplied reference compound. That is, they'd made it up into a series of ever-more-dilute solutions and run those through the machine beforehand. This, plotted as peak area versus the known concentration, gave pretty much a straight line (as it should), and the machine's software was set up to use this information to calculate a concentration for every one of my product peaks as well. So the data set that I got had the standard plot, followed by the experiments themselves, with both the peak areas and the resulting calculated amounts. Since these were related by what was very nearly a straight line, I probably could have used either one. But it's important to realize the difference: by using the calculated concentrations, I could either be correcting for a defect in the machine (if its detector response really wasn't quite linear), or I could be introducing error (if the standard solutions hadn't been made up quite right). It's up to you, as a scientist, to decide which way to go. In my case, I worked up the data both ways and found that the resulting differences were far too small to worry about. So far, so good.
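The calibration step can be sketched minimally, with made-up numbers standing in for the real dilution series: fit a straight line to the standards, then invert it to turn an experimental peak area into a concentration.

```python
import numpy as np

# Hypothetical standard curve: known concentrations of the reference
# compound (say, micromolar) and the peak areas they produced.
std_conc = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
std_area = np.array([10.2, 19.8, 40.5, 79.9, 160.3])

# Least-squares straight line: area = slope * conc + intercept
slope, intercept = np.polyfit(std_conc, std_area, 1)

def area_to_conc(area):
    """Back-calculate a concentration from a measured peak area."""
    return (area - intercept) / slope

# An experimental peak of area 100 comes out just under 10 micromolar here
print(round(area_to_conc(100.0), 2))
```

Note that the fitted line is exactly where a nonlinear detector response or a badly made dilution series would sneak error into every "calculated concentration" downstream, which is the trade-off described above.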
But there's another layer: I had done these experiments in triplicate. There were actually thirty-six vials for the twelve different conditions, because I wanted to see how reproducible the experiments were. For my final plots, then, I used the averages of the three runs for each reaction, and plotted the error bars thus generated to show how tight or loose these values really were. That's what I meant about the area numbers versus the concentration numbers question not meaning much in this case. Not only did they agree very well, but the variations between them were far smaller than the variations between different runs of the same experiments, and thus could safely be put in the "don't worry about it" category while interpreting the data.
What I did notice while doing this, though, was something else significant. My mass spec colleague had followed another piece of very good practice: including a standard injection every so often during the runs of experimental determinations. Looking these over, I found that this same exact sample, of known concentration, was coming out as having less and less product in it as the process went on. That's certainly not unheard of - it usually means that the detector was getting less sensitive as time went on, thanks to some sort of gradually accumulating gunk from my samples. Those numbers really should have been the same - after all, they were from the same vial - so I plotted out a curve to see how they declined with time. I then produced another column of numbers in which I used that decline as a correction factor to adjust the data I'd actually obtained. The first runs needed little or no correction, as you'd figure, but by the end of the run there were some noticeable changes.
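That correction can be sketched as well, again with invented numbers: fit the standard's decline against injection number, then scale each sample back up by however much the detector had faded by the time that sample went in. This assumes the drift is linear, which is one modeling choice among several.

```python
import numpy as np

# Hypothetical repeat injections of the same standard over the run:
# the response drifts downward as the detector fouls.
std_injection = np.array([0, 10, 20, 30])
std_response = np.array([100.0, 95.0, 91.0, 86.0])

# Straight-line fit of response versus injection number
drift_slope, drift_intercept = np.polyfit(std_injection, std_response, 1)

def corrected(raw_value, injection_no):
    """Scale a raw reading up to what the fresh detector would have seen."""
    predicted = drift_slope * injection_no + drift_intercept
    return raw_value * drift_intercept / predicted

# Early injections are barely touched; late ones get a noticeable boost
early = corrected(42.0, injection_no=1)
late = corrected(42.0, injection_no=28)
```

A correction like this is only as good as the drift model behind it, which is why plotting the standards first, rather than assuming a functional form, is the right order of operations.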
So now I had several iterations of data for the same experiment. There was the raw raw data set, which I never really saw, and which would have been quite a pile if printed out. This was stored on the mass spec machine itself, in its own data format. Then I had numbers that I could use, the calculated areas of all those peaks. After that, I had the corresponding concentrations, derived from the standard concentration curve run before the samples were injected. Then I had the values that I'd corrected for the detector response over time. And finally, once all this was done, I had the averages of the three replicate runs for each set of conditions.
When I saved my file of data for this experiment, I took care to label everything I'd done. (I was sometimes lazier about such things earlier in my career, but I've learned that you can save ten minutes now only to spend hours eventually trying to figure out what you've done). The spreadsheet included all those iterations, each in a labeled column ("Area", "Concentration", "Corrected for response"), and both the standard curves and my response-versus-injection-number plots were included.
So how did my experiments look? Pretty good, actually. The error bars were small enough to see differences in the various conditions, which is what I'd hoped for, and some of those conditions were definitely much better than others. In fact, I thought I saw a useful trend in which ones worked best, and (as it turned out), this trend was even clearer after applying the correction for the detector response. I was glad to have the data; I've had far, far worse.
When presenting these results to my colleagues, I showed them a bar chart of the averages for the twelve different conditions, with the associated error bars plotted, which was good enough for everyone in my audience. If someone had asked to see my raw data, I would have sent them the file I mentioned above, with a note about how the numbers had been worked up. It's important to remember that the raw data are the numbers that come right out of the machine - the answers the universe gave you when you asked it a series of questions. The averages and the corrections are all useful (in fact, they can be essential), but it's important to have the source from which they came, and it's essential to show how that source material has been refined.