Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship on his post-doc. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis and other diseases.
To contact Derek email him directly: derekb.lowe@gmail.com
Twitter: Dereklowe

That's the take-home of this post by Adam Feuerstein about La Jolla Pharmaceuticals and their kidney drug candidate GCS-100. It's a galectin inhibitor, and it's had a rough time in development. But investors in the company were cheered up a great deal by a recent press release, stating that the drug had shown positive effects.

But look closer. The company's bar-chart presentation looks reasonably good, albeit with a binary dose-response (the low dose looks like it worked; the high dose didn't). But scroll down on that page to see the data expressed as means with error bars. Oh dear. . .

Update: it's been mentioned in the comments that the data look better with standard error rather than standard deviations. Courtesy of a reader, here's the graph in that form. And it does look better, but not by all that much:

If you don't understand or barely understand what error estimates of sample distributions are, you should study that topic more. When people use terminology you don't understand (or barely understand), it doesn't mean they are being disingenuous. It might just mean that you don't understand.

In my humble opinion, Curt F. is technically correct, but johnnyboy's point is much more solid and streetwise.
Basically, it does not matter how accurately you measure minuscule differences to boost SE; "effect of the treatment" is the ONE and ONLY thing which should matter to the people/insurers who pay $$$ for the treatment. Judicious use of SE in press releases is very helpful in the $$$ business: it makes tiny, tidy error bars that promote stock valuations and attract partnerships/acquisitions while delivering nothing of value.
Just IMHO again.

It's hard to understand (and with reference to Comment #30) how, if the standard deviations nearly overlap, the data could have been originally reported as p = .07 (see graphics on the linked page / Feuerstein).

@33 Anonymous. Thanks for weighing in. It is certainly true that the standard deviation is a more conservative metric than standard error for characterizing differences in populations.

However, there are certainly times where it is too conservative. For example, if there were a drug that made everyone an inch taller than they used to be, and we used standard deviations as advocated by many in this thread, we'd be forced to conclude that the drug had no significant effect, since the height distributions of the drugged and undrugged populations would still significantly overlap. Using s.d.s means that non-drug sources of variation in height -- which are much greater than one inch -- would dominate over the "small" one inch boost provided by the drug.
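To put a rough number on the height example, here is a minimal Python sketch. It assumes a common height SD of 3 inches (a made-up figure, not taken from the thread) and asks how many people per group are needed before a one-inch shift is 1.96 standard errors of the difference away from zero. The function name `n_per_group_for_z` is ours, and this is only the threshold-crossing point, not a proper power calculation.

```python
import math

def n_per_group_for_z(shift: float, sd: float, z: float = 1.96) -> int:
    """Smallest equal group size at which shift / SE(difference) reaches z.

    With equal groups and a common SD, SE(difference) = sd * sqrt(2 / n),
    so the condition shift / SE >= z rearranges to n >= 2 * (z * sd / shift) ** 2.
    """
    return math.ceil(2 * (z * sd / shift) ** 2)

# One-inch shift from the comment above; the 3-inch SD is an assumption.
print(n_per_group_for_z(shift=1.0, sd=3.0))  # 70
```

The two height distributions overlap just as much at 70 per group as at 7 per group; only the uncertainty in the difference of the means shrinks.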

The question of interest to drug developers is often not whether the drug is the *only* source of significant variation between two populations, but whether it *significantly affects* the difference between the two populations.

Ahem. Guys. These 2 samples are NOT significantly different. Period.

And if you want to use SE to visually see if two samples are different, you should be plotting (as Curt F. notes) 1.97*SE, not just 1*SE, to view the 95% confidence limits. If the mean of either group falls inside the limits of the other, you cannot show statistical significance (at p

And even IF there was a statistically significant difference, the effect is so small as to be useless.

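SD vs SEM is always a thorny issue. You are fooling no one when you use SEM to make data "prettier", since SEM bars are always smaller. You basically have to double the SEM bars in your head to get a rough 95% confidence interval, and scale them up by the square root of the sample size to recover the SD.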

Confidence intervals are much better than SD or SEM for data that contains more than a few measurements.

johnnyboy said: "But the thing is, as your average non-statistician researcher, when I look at a graph purporting to show the effect of a treatment, I look to the error bars for a quick visual that illustrates the spread of the values and possible overlap, period. [...] You say that the SE is appropriate to look at whether an effect exists, but isn't that what statistical testing is for?"

Oh absolutely, plotting the means with SE is just a poor man's statistical test. And strictly inferior to plotting a real confidence interval. (Being a programmer I like resampling-based CIs; good fit for our simple minds and fast hardware.)
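For readers who have not met resampling-based intervals, here is a minimal sketch of a percentile bootstrap for a difference in means, using only the Python standard library. The data values are hypothetical and are not trial data; the function name `bootstrap_ci_diff`, the resample count, and the fixed seed are all illustrative choices rather than anything specified in the thread.

```python
import random

def bootstrap_ci_diff(test, control, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean(test) - mean(control)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        t = rng.choices(test, k=len(test))
        c = rng.choices(control, k=len(control))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    lo = diffs[round(n_resamples * alpha / 2)]
    hi = diffs[round(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical measurements, for illustration only.
control = [27.1, 29.8, 31.2, 28.4, 30.5, 26.9, 32.0, 29.3]
test = [28.0, 31.5, 33.1, 29.2, 30.8, 27.6, 34.4, 30.1]

low, high = bootstrap_ci_diff(test, control)
print(f"95% bootstrap CI for the difference in means: {low:.2f} to {high:.2f}")
```

If the interval comfortably includes zero, the resampling is telling the same story a t-test would, without leaning on a normality assumption.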

It always has driven me crazy that people quote p-values a hundred times more often than they quote any measure of effect size, because effect size is often what matters, so I would never argue with your looking at effect size first. And yeah, people shoot for "statistically significant" because it's easier to hit than "substantial effect size". But I have to admit solid work showing statistical significance as legitimate to publish even if the effect size is small. It might still be good for something, even if it's weak as a drug. Maybe you can eventually stack five of them together to get a noticeable effect size, or maybe the small effect is a public-health intervention that helps on a population level, etc.

Are people being disingenuous beyond just going straight for the least publishable unit? In my field I don't think most people know enough about statistics to be disingenuous. Am I cynical to think most people follow the technique they learned was the way it's done, and when that has limitations on its validity they don't *realize* they're doing it wrong? But if somebody uses an atypical technique that gives a favorable gloss to their result, that is suspicious.

If you want to show sample dispersion, it seems to me you should be plotting histograms of your two samples - shows the sample spread and distribution both. I'd then put a vertical line with bars to indicate the mean and standard error so the viewer could easily judge whether there's a statistically significant difference as well as (what the histograms give you) how the drug effect compares to intrinsic sample variance. But then I'm coming from radio/X-ray astronomy, where you have to average the living daylights out of anything to get a meaningful measured quantity, so quoting the un-averaged standard deviation means almost nothing.

COMMENTS

1. Anonymous on March 11, 2014 10:36 AM writes...

Um, those error bars were (incorrectly) built using the standard deviation rather than the standard error - they don't take into account the sample size. As dodgy as the 'feedback loop' explanation and the lack of dose-response is, the actual numbers seem reasonably legitimate.

2. myma on March 11, 2014 10:40 AM writes...

And it's not even advanced statistics that causes the "oh dear", it's the basic math of plotting a bar chart with the error bars.

3. alig on March 11, 2014 10:48 AM writes...

There was a larger difference between groups (placebo vs treatment) prior to dosing than post dosing. This drug had zero effect.

4. petros on March 11, 2014 10:51 AM writes...

One look at the SDs given makes it clear that there is no significant effect at any dose!

Still at least La Jolla showed the SD values

5. Anonymous on March 11, 2014 11:01 AM writes...

If you were to plot the results of virtually any clinical trial as mean +/- SD, you would get plots looking like the one produced here - massive error bars apparently showing no difference between groups. Error bars must be based around standard error, which takes account of sample size. This whole article is a statistically invalid attempt to criticise the results. If there are problems with them, this is not the way to show it!

6. Curt F. on March 11, 2014 11:34 AM writes...

I have to agree with the others harping on the difference between standard deviation and standard error.

LJPC did put p-values on their tables.

7. sd on March 11, 2014 11:52 AM writes...

Were these SDs of the averages or the true SD of all the data points? I don't have the math, but I think the SD of the averages may be the same as the SE of the mean? Can someone confirm?

8. alig on March 11, 2014 12:21 PM writes...

This SD vs SE is a red herring. The treated group did not surpass the baseline placebo group in any measure. The randomization procedure at the start of the trial led to a significant difference at baseline and all we see during treatment is a regression to mean.

9. Anonymous on March 11, 2014 1:11 PM writes...

Perhaps employers and grant funding bodies should check papers to see whether candidate scientists have the first clue about stats, rather than number of citations? It's easy to publish a paper with big headlines and crap stats that will get cited (torn apart) by many subsequent papers... Like the recent AD blood test, for example.

10. Hap on March 11, 2014 1:43 PM writes...

Isn't it a problem when the standard error is twice as big as the effect you're trumpeting? That, and the strange dose-response effect, makes the results look like a fluke and not a real effect.

What is the reason the company used doses of 1.5 mpk and 30 mpk (20x low dose) as their dosages? Why not 1.5mpk and 3 mpk? Next time?

11. Anonymous on March 11, 2014 2:00 PM writes...

Let's short it!

12. annon fore on March 11, 2014 2:28 PM writes...

Good enough for at least one big Pharma to take it to Phase III.

13. johnnyboy on March 11, 2014 2:38 PM writes...

hmm, the anonymous comments here are the first time I've ever seen the statement that the SD is an 'incorrect' measure of sample variability, and that the SE is actually the correct one. That is actually the exact opposite of the truth.

Standard deviation: A measure of the dispersion of a set of data from its mean.

Standard error: the standard deviation of the sampling distribution of a statistic.

Which one of these two statistics looks more appropriate to evaluate the variability of a sample ? Here's a hint: it's probably the one for which you can actually understand the definition.

There are practically no statistically-based reasons to use the SE in standard medical research; the only reason everyone uses it is that it makes data look better.

The formula for calculating SE is actually the SD divided by the square root of group size. Which means that if you have differing group sizes, like here, that will affect the SE, in ways that have nothing to do with sample spread. It's not a kind of normalization for group size, as one of the commenters states; the formula for the SD already takes into account group size.
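The SD-to-SE relationship described here is easy to check. A minimal Python sketch, using the SD and N figures quoted in the GraphPad output in comment 30; the function name `sem` is ours, for illustration only:

```python
import math

def sem(sd: float, n: int) -> float:
    """Standard error of the mean: the sample SD divided by sqrt(n)."""
    return sd / math.sqrt(n)

# SD and N as quoted in the GraphPad output in comment 30.
print(round(sem(2.90, 40), 4))  # 0.4585
print(round(sem(4.93, 41), 4))  # 0.7699

# Same spread, four times as many patients: the SE halves, the SD is unchanged.
print(round(sem(2.90, 160), 4))  # 0.2293
```

Both sides of the argument are visible here: the SE does depend on group size independently of spread, and that dependence is exactly what makes it a statement about the mean rather than about individual patients.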

14. Curt F. on March 11, 2014 2:47 PM writes...

@johnnyboy: I think the s.e. is much more appropriate than the s.d. since what we want to estimate is the "effect size" of the drug.

We want to know if a mean metric in the drugged population differs from the same mean metric in the placebo population. Since we are interested in the difference in means (i.e. the population mean is the statistic we are estimating by sampling), the s.e. is the appropriate measure of uncertainty.

The standard deviation would be appropriate if we wanted to determine which population *an individual patient* was from based on that individual's personal metric.

15. Anonymous on March 11, 2014 4:33 PM writes...

@14: That's what Bessel's correction is for in the true formula of SD...

16. LeeH on March 11, 2014 4:43 PM writes...

Curt - johnnyboy's got it right...

...but let's assume (incorrectly) that there IS a non-random difference between the two groups. Do we really care that we improved GFR by 3% (29.3 to 30.3)? I'm not a nephrologist, but I'll bet not.

17. Curt F. on March 11, 2014 6:06 PM writes...

I'm not taking any position as to whether the statistically significant result is medically significant. It could be statistically sound but medically useless. I have no expertise to resolve that particular issue.

But statistically, I'm holding my ground. The s.e. is the relevant parameter to characterize uncertainty in effect sizes. LeeH and johnnyboy are wrong.

1. See the definition of "t statistic" at http://en.wikipedia.org/wiki/T-statistic. See the "s.e." in the denominator? Note that it doesn't say "s.d." Every time a t-test is performed, the standard error in the parameter is estimated.

2. The standard error in the parameter, not the standard deviation, is returned by most statistical software packages, including [R] (lm function) and Excel (linest function). So whoever wrote those functions thought (rightly) that the s.e. is more important.

3. Bessel's correction has nothing to do with s.d. vs. s.e., so that issue is a red herring. Bessel's-corrected *variance* (not s.d.) as calculated from a sample is an unbiased estimator of the *population* variance. The uncorrected sample variance is a biased estimator of population variance. But that is a different issue than s.d. vs. s.e. S.d.s are for estimating dispersion in a population, while s.e.s are for estimating uncertainties in the parameters that characterize that distribution, such as the mean.

4. Think of the example of sex differences in height. With modest sample sizes, say of 20 men and 20 women, we would likely find a significant difference in the mean height of men and women. Here's a random example I found that illustrates some data nicely. http://www.theatlantic.com/sexes/archive/2013/01/why-its-so-rare-for-a-wife-to-be-taller-than-her-husband/272585/ . But mean height difference is less than one standard deviation large. That means that if I told you a particular person was 5'9" you wouldn't be very able to identify their sex, but it doesn't change the fact that men and women have statistically different heights, on average.
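Points 1 and 4 can be tied together with a little arithmetic. The sketch below assumes two equal-sized groups with a common SD and expresses the difference in means in SD units; the 0.8 SD gap is a hypothetical value chosen to match the "less than one standard deviation" description, not a figure from the linked article, and the function name is ours.

```python
import math

def t_from_effect_size(d: float, n_per_group: int) -> float:
    """Two-sample t statistic for equal group sizes and equal SDs.

    d is the difference in means in units of the common SD. The standard
    error of the difference is SD * sqrt(2 / n), so t = d * sqrt(n / 2).
    """
    return d * math.sqrt(n_per_group / 2)

# Hypothetical: means 0.8 SD apart, 20 people per group.
print(round(t_from_effect_size(0.8, 20), 2))  # 2.53
```

With 38 degrees of freedom the two-sided 5% cutoff is a little above 2, so a gap well under one SD would still register as significant at this sample size, even though a single person's height says little about which group they came from.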

18. cpchem on March 11, 2014 7:33 PM writes...

I think you've forgotten to close an somewhere... everything's gone italic!!

19. cpchem on March 11, 2014 7:35 PM writes...

(i or emph tag is what should have been there...)

20. Nick K on March 11, 2014 9:06 PM writes...

The strange inverse dose-response curve of this drug is disturbing...

21. Matt on March 12, 2014 1:21 AM writes...

The top graph at the link says the change in eGFR is -0.58 but it went from 30.5 to 29.3 which is a change of -1.2 making the placebo have as large a change (in the opposite direction) as the low dose which changed +1.26.

Is the -.58 making it onto the graph a typo, data from somewhere else or shady?

22. tangent on March 12, 2014 2:21 AM writes...

"hmm, the anonymous comments here are the first time I've ever seen the statement that the SD is an 'incorrect' measure of sample variability, and that the SE is actually the correct one. That is actually the exact opposite of the truth."

The SD of the sample is the measure of sample variability. The SE of the estimated mean is the measure of uncertainty in the estimate. So it's a matter of which you want.

To look at whether an effect exists, you'd choose the SE, to see whether the shift in means is probably non-zero. To look at the practical significance of an effect, you'd choose the SD, e.g. to compute Cohen's d: (shift in means) / SD.

Plotting with "plus or minus SD" is certainly a valid thing to do, for a visual of how the control and experimental distributions look next to each other, how much they overlap. But plotting "plus or minus SE" is for a visual of whether the distributions are different at all. Can we all get along with that?

(But whichever plot you use, it will be confused with the other, and I don't blame people at all. Is there any standard graphical convention for showing which one you're plotting, or to show both? I don't know one.

Maybe I'll mock up a box-and-whiskers plot, with the mean "x" in a different color, and the error on the mean also in that color, and see if people get that. It doesn't show the standard deviation per se, but that's good.)
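As a rough illustration of the Cohen's d calculation mentioned above, here is a minimal Python sketch using the group means, SDs and Ns quoted in the GraphPad output in comment 30. Using the pooled SD as the denominator is our assumption (the comment only says "SD"), and the function names are illustrative.

```python
import math

def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def cohens_d(mean1, sd1, n1, mean2, sd2, n2) -> float:
    """Cohen's d: shift in means divided by the pooled SD (an assumed choice)."""
    return (mean2 - mean1) / pooled_sd(sd1, n1, sd2, n2)

# Means, SDs and Ns as quoted in comment 30.
d = cohens_d(29.3, 2.90, 40, 30.26, 4.93, 41)
print(round(d, 2))  # 0.24
```

By the usual rule of thumb a d of around 0.2 counts as a small effect, which is the "practical significance" reading, separate from whether the shift is distinguishable from zero.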

23. Anonymous on March 12, 2014 5:39 AM writes...

Perhaps the most relevant and important "metric" here is the magnitude of the effect x the probability that the effect is real? Or some weighted average of that value?

24. Anonymous on March 12, 2014 5:50 AM writes...

In other words, what is the probability-weighted average effect across all potential scenarios of test vs control? Note this is not the same as the difference between mean values.

25. Robert on March 12, 2014 6:57 AM writes...

SE rather than SD would be more commonly used in preclinical studies in my experience in presentations of data such as this. However, I would argue in favor of using 95% confidence intervals rather than either SE or SD. I see 95% C.I. used more commonly in clinical data. In this case, use of 95% C.I. would indicate no significant effect. As an estimate, 95% C.I would be about 1.96 times SE.
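The 1.96 × SE rule of thumb can be applied directly to the group means and SEMs quoted in the GraphPad output in comment 30. This is a minimal sketch using the large-sample multiplier from the comment above; an exact interval would use a t-distribution multiplier, which is slightly larger at these sample sizes. The names `Z_95` and `ci95` are ours.

```python
Z_95 = 1.96  # large-sample multiplier quoted in the comment above

def ci95(mean: float, se: float) -> tuple[float, float]:
    """Approximate 95% confidence interval for a mean: mean +/- 1.96 * SE."""
    half_width = Z_95 * se
    return (mean - half_width, mean + half_width)

# Means and SEMs as quoted in comment 30.
group_one = ci95(29.3, 0.4585)
group_two = ci95(30.26, 0.7699)

print([round(x, 2) for x in group_one])  # [28.4, 30.2]
print([round(x, 2) for x in group_two])  # [28.75, 31.77]
print(group_one[1] > group_two[0])       # True: the intervals overlap
```

Overlapping intervals are not a formal test on their own, but here they point the same way as the t-test reported later in the thread.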

26. Dr. Z on March 12, 2014 8:23 AM writes...

Another important thing to consider is the normality of the data. If the data are not Gaussian, then all standard statistics are useless. One or two outliers will throw everything off, and in my practice people hardly ever bother to check how well their distribution matches the normal curve; they go on drawing all kinds of conclusions from t-tests etc. when none of them are statistically legal... I would love to see what these data distributions look like.

27. Hap on March 12, 2014 9:28 AM writes...

...and if you follow tangent's advice, the difference in averages/SD is significantly less than 1 [from Feuerstein's post, it looks like about 0.2 - delta(avg) = 2, SD = 10]. That seems to imply that differences in individual populations swamp the differences in effects. Alternatively, the difference in means looks to be about the standard error, so that the difference in averages could easily be due to chance. That doesn't look like success to me.

28. Anonymous on March 12, 2014 10:23 AM writes...

@23, 24: The best metrics for comparing 2 data sets for statistical difference would be:

1. Average difference (mean value of difference distribution) across all combinations of test vs control

2. Standard deviation of difference distribution across all combinations of test vs control

Furthermore, one could also look at the skewness and kurtosis of the difference distribution.
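One way to read this proposal is to form every test-minus-control pair and summarise the resulting differences. The sketch below does that with the Python standard library; the data values are hypothetical (not trial data), and the function name and the choice of the population SD (`pstdev`) are our assumptions.

```python
from itertools import product
from statistics import mean, pstdev

def difference_distribution(test, control):
    """All pairwise differences: each test value minus each control value."""
    return [t - c for t, c in product(test, control)]

# Hypothetical measurements, for illustration only.
control = [28.0, 29.5, 31.0]
test = [29.0, 30.0, 32.5]

diffs = difference_distribution(test, control)
print(len(diffs))               # 9
print(round(mean(diffs), 2))    # 1.0
print(round(pstdev(diffs), 2))  # spread of the pairwise differences
```

For this construction the average of all pairwise differences works out to the same number as the difference between the two group means (30.5 minus 29.5 here), so any extra information is in the spread and shape of the differences rather than in their average.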

29. from BMJ on March 13, 2014 2:50 AM writes...

"So, if we want to say how widely scattered some measurements are, we use the standard deviation. If we want to indicate the uncertainty around the estimate of the mean measurement, we quote the standard error of the mean. The standard error is most useful as a means of calculating a confidence interval. For a large sample, a 95% confidence interval is obtained as the values 1.96×SE either side of the mean."

30. LeeH on March 13, 2014 12:51 PM writes...

Here are the results, according to a t-test web page from GraphPad (assuming the 2 samples are normally distributed).

http://www.graphpad.com/quickcalcs/ttest1/?Format=SD

Unpaired t test results

P value and statistical significance:

The two-tailed P value equals 0.2902

By conventional criteria, this difference is considered to be not statistically significant.

Confidence interval:

The mean of Group One minus Group Two equals -0.9600

95% confidence interval of this difference: From -2.7546 to 0.8346

Intermediate values used in calculations:

t = 1.0648

df = 79

standard error of difference = 0.902

Review your data:

Group    Group One    Group Two

Mean     29.3         30.26

SD       2.9000       4.9300

SEM      0.4585       0.7699

N        40           41

So, if there were no real difference between the groups, a gap at least this large would turn up by chance about 29% of the time. That is far higher than the 5% threshold that is generally used.
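For anyone who wants to check these figures without the web page, here is a minimal Python sketch of a pooled-variance (Student's) unpaired t test computed from the summary statistics above. It assumes equal variances, which is consistent with the df = 79 reported. The function names are hypothetical, and the p-value comes from a simple Simpson's-rule integration of the t density rather than from a statistics library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))

def two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value: 1 minus the probability mass between -|t| and |t|.

    The mass is found by Simpson's rule on [0, |t|] (steps must be even).
    """
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 1 - 2 * area

def pooled_t_test(m1, s1, n1, m2, s2, n2):
    """Unpaired t test from means, SDs and sample sizes (equal variances assumed)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (m1 - m2) / se_diff
    return t, df, se_diff, two_tailed_p(t, df)

# Summary statistics as quoted in the comment above.
t, df, se_diff, p = pooled_t_test(29.3, 2.9, 40, 30.26, 4.93, 41)
print(f"t = {t:.4f}, df = {df}, SE of difference = {se_diff:.3f}, p = {p:.4f}")
```

The sign of t only reflects which group is subtracted from which; the web page reports its absolute value.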

Permalink to Comment31. johnnyboy on March 13, 2014 1:11 PM writes...

"Plotting with "plus or minus SD" is certainly a valid thing to do, for a visual of how the control and experimental distributions look next to each other, how much they overlap. But plotting "plus or minus SE" is for a visual of whether the distributions are different at all. Can we all get along with that?"

I can get along with that. But the thing is, as your average non-statistician researcher, when I look at a graph purporting to show the effect of a treatment, I look to the error bars for a quick visual that illustrates the spread of the values and possible overlap, period. I am not looking for error estimates of the sample distributions (I barely understand what that means anyway). And perhaps I'm being presumptuous (though I really don't think so) but I assume that 99% of other average non-statistician researchers look at these error bars the same way I do. You say that the SE is appropriate to look at whether an effect exists, but isn't that what statistical testing is for? So I think the people justifying their use of SE in such studies with statistical-based rationalizations are being disingenuous. You can bet that if the SE happened to be a metric that gave larger values than the SD, then no one would ever use it.

Permalink to Comment32. Curt F. on March 13, 2014 1:36 PM writes...

@johnnyboy: If you are really trying to look at graphs that try to "show the effect of a treatment", you really should be focusing on standard errors (or even better, confidence intervals -- which will be multiples of the standard *error* where the multiple depends on the confidence that you want), not standard deviations.

The effect of a treatment is different than the spread in values in a population. Depending on sample size, it may be possible to estimate very small but nonzero treatment effects even in the presence of many other sources of variability.

If you don't understand or barely understand what error estimates of sample distributions are, you should study that topic more. When people use terminology you don't understand (or barely understand), it doesn't mean they are being disingenuous. It might just mean that you don't understand.

Permalink to Comment33. Anonymous on March 13, 2014 5:47 PM writes...

In my humble opinion, Curt F. is technically correct, but johnnyboy's point is much more solid and streetwise.

Basically, it does not matter how accurately you measure minuscule differences to boost SE; "effect of the treatment" is the ONE and ONLY thing which should matter to the people/insurers who pay $$$ for the treatment. Judicious use of SE in press releases is very helpful in $$$ business by making tiny tidy error bars to promote stock evaluations and attract partnership/acquisitions while delivering nothing of value.

Just IMHO again.

Permalink to Comment34. bruce on March 13, 2014 9:13 PM writes...

It's hard to understand (and with reference to Comment #30) how, if the standard deviations nearly overlap, the data could have been originally reported as p = .07 (see graphics on the linked page / Feuerstein).

Permalink to Comment35. Curt F. on March 13, 2014 11:05 PM writes...

@33 Anonymous. Thanks for weighing in. It is certainly true that the standard deviation is a more conservative metric than standard error for characterizing differences in populations.

However, there are certainly times where it is too conservative. For example, if there were a drug that made everyone an inch taller than they used to be, if we used standard deviations as advocated by many in this thread, we'd be forced to conclude that the drug had no significant effect, since the height distribution of drugged and undrugged populations would still significantly overlap. Using s.d.s means that non-drug sources of variation in height -- which are much greater than one inch -- would dominate over the "small" one inch boost provided by the drug.
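The height example can be sketched as a small simulation. Everything numeric here is an assumption made for illustration: a baseline mean of 69 inches, an SD of 3 inches, 10,000 people per group, and a fixed random seed. Only the one-inch boost comes from the comment.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

N = 10_000        # hypothetical people per group
BASE_MEAN = 69.0  # hypothetical mean height in inches
BASE_SD = 3.0     # hypothetical SD of height in inches
BOOST = 1.0       # the one-inch effect from the example

undrugged = [random.gauss(BASE_MEAN, BASE_SD) for _ in range(N)]
drugged = [random.gauss(BASE_MEAN + BOOST, BASE_SD) for _ in range(N)]

diff = statistics.mean(drugged) - statistics.mean(undrugged)
sd_u = statistics.stdev(undrugged)
sd_d = statistics.stdev(drugged)

# Standard error of the difference between two independent means.
se_diff = math.sqrt(sd_u ** 2 / N + sd_d ** 2 / N)

print(f"difference in means:         {diff:.3f} in")
print(f"SD within a group:           {sd_u:.3f} in (distributions overlap heavily)")
print(f"SE of the difference:        {se_diff:.3f} in")
print(f"difference in units of SE:   {diff / se_diff:.1f}")
```

With these assumed numbers the difference is a fraction of one SD, so SD error bars overlap almost completely, yet it is many standard errors away from zero, so the shift itself is estimated precisely.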

The question of interest to drug developers is often not whether the drug is the *only* source of significant variation between two populations, but whether it *affects* significantly the difference in two populations.

Permalink to Comment36. LeeH on March 14, 2014 9:09 AM writes...

Ahem. Guys. These 2 samples are NOT significantly different. Period.

And if you want to use SE to visually see if two samples are different, you should be plotting (as Curt F. notes) 1.97*SE, not just 1*SE, to view the 95% confidence limits. If the mean of either group falls inside the limits of the other, you cannot show statistical significance (at p < 0.05).

And even IF there was a statistically significant difference, the effect is so small as to be useless.

Permalink to Comment37. Anonymous on March 14, 2014 5:21 PM writes...

SD vs SEM is always a thorny issue. You are fooling no one when you use SEM to make data "prettier" since they're usually smaller. You always have to basically double SEM bars in your head to observe SD.

Confidence intervals are much better than SD or SEM for data that contains more than a few measurements.

Permalink to Comment38. tangent on March 18, 2014 2:23 AM writes...

johnnyboy said: "But the thing is, as your average non-statistician researcher, when I look at a graph purporting to show the effect of a treatment, I look to the error bars for a quick visual that illustrates the spread of the values and possible overlap, period. [...] You say that the SE is appropriate to look at whether an effect exists, but isn't that what statistical testing is for?"

Oh absolutely, plotting the means with SE is just a poor man's statistical test. And strictly inferior to plotting a real confidence interval. (Being a programmer I like resampling-based CIs; good fit for our simple minds and fast hardware.)
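A resampling-based interval of the kind mentioned here could look like the following minimal percentile-bootstrap sketch. The raw trial data are not available in this thread, so the two samples are synthetic, drawn from normal distributions with the summary statistics quoted in comment 30. That choice, the seed, the resample count and the function name are all assumptions.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Synthetic stand-ins for the two arms (assumed, not real patient data).
group_one = [random.gauss(29.3, 2.9) for _ in range(40)]
group_two = [random.gauss(30.26, 4.93) for _ in range(41)]

def bootstrap_ci_diff(a, b, resamples=2000, level=0.95):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    diffs = []
    for _ in range(resamples):
        # Resample each group with replacement, keeping its original size.
        resampled_a = random.choices(a, k=len(a))
        resampled_b = random.choices(b, k=len(b))
        diffs.append(statistics.mean(resampled_a) - statistics.mean(resampled_b))
    diffs.sort()
    tail = (1 - level) / 2
    low = diffs[int(tail * resamples)]
    high = diffs[int((1 - tail) * resamples) - 1]
    return low, high

observed = statistics.mean(group_one) - statistics.mean(group_two)
low, high = bootstrap_ci_diff(group_one, group_two)
print(f"observed difference in means: {observed:.3f}")
print(f"95% bootstrap CI: {low:.3f} to {high:.3f}")
```

If the interval comfortably spans zero, the data do not distinguish the two groups. No distributional formula is needed, which is the appeal for the "simple minds and fast hardware" approach.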

It always has driven me crazy that people quote p-values a hundred times more often than they quote any measure of effect size, because effect size is often what matters, so I would never argue with your looking at effect size first. And yeah, people shoot for "statistically significant" because it's easier to hit than "substantial effect size". But I have to admit solid work showing statistical significance as legitimate to publish even if the effect size is small. It might still be good for something, even if it's weak as a drug. Maybe you can eventually stack five of them together to get a noticeable effect size, or maybe the small effect is a public-health intervention that helps on a population level, etc.

Are people being disingenuous beyond just going straight for the least publishable unit? In my field I don't think most people know enough about statistics to be disingenuous. Am I cynical to think most people follow the technique they learned was the way it's done, and when that has limitations on its validity they don't *realize* they're doing it wrong? But if somebody uses an atypical technique that gives a favorable gloss to their result, that is suspicious.

Permalink to Comment39. Anne on March 18, 2014 6:17 AM writes...

If you want to show sample dispersion, it seems to me you should be plotting histograms of your two samples - shows the sample spread and distribution both. I'd then put a vertical line with bars to indicate the mean and standard error so the viewer could easily judge whether there's a statistically significant difference as well as (what the histograms give you) how the drug effect compares to intrinsic sample variance. But then I'm coming from radio/X-ray astronomy, where you have to average the living daylights out of anything to get a meaningful measured quantity, so quoting the un-averaged standard deviation means almost nothing.
