Corante

About this Author

Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship on his post-doc. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis and other diseases. To contact Derek email him directly: derekb.lowe@gmail.com Twitter: Dereklowe

In the Pipeline

March 22, 2010

Benford's Law, Revisited

Posted by Derek

I mentioned Benford's Law in passing in this post (while speculating on how long people report their reactions to have run when publishing their results). That's the rather odd result that many data sets don't show a uniform distribution of leading digits. Instead, 1 is the first digit around 30% of the time, 2 leads off about 18% of the time, and so on down.
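
For reference, the usual statement of Benford's Law is that a leading digit d turns up with probability log10(1 + 1/d). The short sketch below just tabulates that formula; the function name is made up for illustration.

```python
import math

def benford_expected(d: int) -> float:
    """Expected fraction of values whose first digit is d (1-9) under Benford's Law."""
    return math.log10(1 + 1 / d)

# Tabulate the expected share of each leading digit.
for d in range(1, 10):
    print(d, f"{benford_expected(d):.1%}")
```

The first two rows come out at roughly 30% and 18%, matching the figures quoted above, and the nine fractions add up to 1.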

For data that come from some underlying power-law distribution, this actually makes some sense. In that case, the data points spend more time being collected in the "lag phase" when they're more likely to start with a 1, and proportionally less and less time out in the higher-number-leading areas. The law only holds up when looking at distributions that cover several orders of magnitude - but all the same, it also seems to apply to data sets where there's no obvious exponential growth driving the numbers.
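
The "more time spent on a leading 1" argument can be checked with a toy simulation. This is only a sketch under assumed conditions: a quantity that grows by a fixed 5% per step for 2000 steps (both numbers are arbitrary choices, not anything from a real data set), with the leading digit recorded at every step.

```python
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:.15e}"[0])

def growth_digit_fractions(rate: float = 0.05, steps: int = 2000) -> dict:
    """Fraction of steps spent on each leading digit during steady percentage growth."""
    value = 1.0
    counts = Counter()
    for _ in range(steps):
        counts[leading_digit(value)] += 1
        value *= 1 + rate  # fixed percentage growth per step
    return {d: counts[d] / steps for d in range(1, 10)}

fractions = growth_digit_fractions()
for d, frac in fractions.items():
    print(d, f"{frac:.3f}")
```

Going from 1 to 2 takes a doubling, while going from 9 to 10 takes only about an 11% increase, so the growing quantity lingers on the low leading digits.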

Lack of adherence to Benford's Law can be accepted as corroborating evidence of financial fraud. Now a group from Astellas reports that several data sets used in drug discovery (such as databases of water solubility values) obey the expected distribution. What's more, they're suggesting that modelers and QSAR people check their training data sets to make sure that those follow Benford's Law as well, as a way to make sure that the data have been randomly selected.
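
Here is a rough sketch of what such a check might look like. It is not the Astellas group's procedure, just one plausible approach: a Pearson chi-square statistic on the leading-digit counts, compared against 15.51, the conventional 5% critical value for 8 degrees of freedom. The function names and the two toy data sets are assumptions made for illustration.

```python
import math
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a nonzero number."""
    return int(f"{abs(float(x)):.15e}"[0])

def benford_chi_square(values) -> float:
    """Pearson chi-square statistic of leading digits against Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one nonzero value")
    counts = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (counts[d] - expected) ** 2 / expected
    return chi2

# Conventional 5% critical value for 8 degrees of freedom (9 digits minus 1).
CRITICAL_5_PERCENT = 15.51

powers_of_two = [2 ** k for k in range(1, 301)]  # spans about 90 orders of magnitude
narrow_range = list(range(100, 1000))            # spans less than one order of magnitude

for name, data in [("powers of two", powers_of_two), ("100..999", narrow_range)]:
    stat = benford_chi_square(data)
    verdict = "consistent with" if stat < CRITICAL_5_PERCENT else "deviates from"
    print(f"{name}: chi2 = {stat:.1f}, {verdict} Benford's Law")
```

A large statistic only says that the leading digits don't look Benford-like. As the narrow-range example shows, that can happen simply because the data don't cover enough orders of magnitude.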

Is anyone willing to try this out on a bunch of raw clinical data to see what happens? Could this be a way to check the integrity of reported data from multiple trial centers? You'd have to pick your study set carefully - a lot of the things we look for don't cover a broad range - but it's worth thinking about. . .

Category: Clinical Trials | In Silico | The Dark Side


COMMENTS

1. sgcox on March 22, 2010 1:14 PM writes...

Would be interesting to look at the numbers of comments on this blog (the health reform post is skyrocketing, almost as fast as the XMRV controversy) and see how they follow the power law. May tell what stories make people interested.

2. LeeH on March 22, 2010 3:54 PM writes...

Solubility data is a bad example for this exercise. Most solubilities are measured in a range of about 2, or at most 3 orders of magnitude (in the drug discovery world anyway), so a model that spans 9(?) orders of magnitude would be useless.

3. milkshake on March 22, 2010 6:20 PM writes...

most clinical trials perhaps are not big enough for Benford

4. hibob on March 22, 2010 6:45 PM writes...

if you are just eyeballing a column of data, I could see using Benford's law as a proxy for whether data follows a power law or if there is something fishy going on. But if the situation is more important/formal, why not just see if the data follows a power law directly by fitting the data?

5. Curt F. on March 22, 2010 8:12 PM writes...

For data that come from some underlying power-law distribution, this actually makes some sense. In that case, the data points spend more time being collected in the "lag phase" when they're more likely to start with a 1, and proportionally less and less time out in the higher-number-leading areas. The law only holds up when looking at distributions that cover several orders of magnitude - but all the same, it also seems to apply to data sets where there's no obvious exponential growth driving the numbers.

This seems a bit muddled to me.

Benford's law often applies to power-law distributions, as you correctly note. Power-law distributions are scale-free, meaning that the values of the parameters characterizing the distribution do not depend on the scale of the data (for example, whether you measure in eons or nanoseconds).

Exponential growth does not result in power-law distributions. Exponential curves have a built-in characteristic scale. The parameters of an exponential distribution change value when the scale changes. E.g., a first-order rate constant has a different numerical value when measured in ns^-1 vs when measured in eons^-1.

Benford's law only makes sense for scale-free data. (I.e., data following Benford's law when expressed in nanoseconds had better also follow it when expressed in eons.) I don't think that means that all Benford-law data sets are power law distributions, but I could be wrong about that.
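
The scale-free point can be illustrated with a small sketch. Everything here is an assumption chosen for the demonstration: a synthetic data set spread evenly on a log scale over six orders of magnitude, a flat data set covering 100 to 999, and a unit-conversion factor of 60. The code measures how much the leading-digit fractions move when the units change.

```python
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a positive number."""
    return int(f"{x:.15e}"[0])

def digit_fractions(values) -> list:
    """Fractions of values with leading digit 1 through 9, in that order."""
    counts = Counter(leading_digit(v) for v in values)
    n = sum(counts.values())
    return [counts[d] / n for d in range(1, 10)]

def max_shift(values, factor: float) -> float:
    """Largest change in any leading-digit fraction after multiplying by factor."""
    before = digit_fractions(values)
    after = digit_fractions([v * factor for v in values])
    return max(abs(a - b) for a, b in zip(before, after))

n = 6000
log_uniform = [10 ** (6 * (i + 0.5) / n) for i in range(n)]  # even on a log scale
flat = list(range(100, 1000))                                 # even on a linear scale

factor = 60.0  # a hypothetical change of units, e.g. minutes to seconds
print("log-uniform data:", round(max_shift(log_uniform, factor), 4))
print("flat data:       ", round(max_shift(flat, factor), 4))
```

The log-uniform set keeps essentially the same leading-digit fractions after rescaling, while the flat set's fractions shift noticeably, which is the sense in which Benford-type data cannot care about units.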

6. QSAR_skeptic on March 23, 2010 8:19 AM writes...

Benford's law only applies to scale-free / power law distributions. It happens that financial transactions follow a power-law distribution. (For example, my own household regularly buys things for small amounts, like food; rarely, but often enough, buys larger things, like cars; and even more rarely buys even larger things, like houses.)

If you make the hypothesis that the distribution of biochemical potency follows a power law, what does this actually mean in operational terms? In particular, what is your x-axis?

In finance your x-axis is time and the y is the size of the transaction. In geography your x-axis is longitude and latitude and the height of the mountain is the y-axis.

7. sgcox on March 23, 2010 8:32 AM writes...

x-axis - potency/affinity
y-axis - number of targets falling into the interval

8. MattF on March 23, 2010 10:01 AM writes...

One way to think of Benford's Law is that it's a consequence of "conservation of relative accuracy." If you have measurements that all have about the same number of significant digits, and if your number-crunching preserves significant digits (as it should) then your results will follow Benford's Law.

Another way of saying this is to note that one has, e.g., a gut feeling that a leading '1' in a number isn't really a significant digit-- Benford's Law is the precise statement of that feeling.

9. Robert Harder on November 18, 2010 5:25 PM writes...

For all the research and "verifying" that's out there for Benford's Law, I was hoping to find a way to generate random numbers that "comply" with Benford's Law -- you know, just in case you wanted to generate data sets for, uh, pedagogical purposes. Anyway I wrote a script to generate random numbers that pass a Benford's Law test. Use only for good. http://iharder.sourceforge.net/benford.php
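
For anyone curious how such a generator might work, here is a minimal sketch. It is not necessarily how the linked script does it. It relies on a standard trick: if U is uniform on a whole number of decades, then 10**U has Benford-distributed leading digits. The function names, the six-decade default, and the seed are all assumptions for illustration.

```python
import random

def benford_sample(n: int, decades: int = 6, seed=None) -> list:
    """Draw n positive numbers whose leading digits follow Benford's Law.

    Each number is 10**U, with U uniform on [0, decades].
    """
    rng = random.Random(seed)
    return [10 ** rng.uniform(0, decades) for _ in range(n)]

def leading_digit(x: float) -> int:
    """First significant digit of a positive number."""
    return int(f"{x:.15e}"[0])

sample = benford_sample(20000, seed=0)
ones = sum(1 for v in sample if leading_digit(v) == 1) / len(sample)
print(f"fraction starting with 1: {ones:.3f}")
```

With a sample this size, the fraction of values starting with 1 should land close to the expected 30%.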
