Corante

About this Author

Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship during his post-doc. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis and other diseases. To contact Derek, email him directly: derekb.lowe@gmail.com Twitter: Dereklowe


In the Pipeline


March 24, 2014

Google's Big Data Flu Flop

Posted by Derek

Some of you may remember the "Google Flu" effort, where the company was going to try to track outbreaks of influenza in the US by mining Google queries. There was never much clarification about what terms, exactly, they were going to flag as being indicative of someone coming down with the flu, but the hype (or hope) at the time was pretty strong:

Because the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. . .

So how'd that work out? Not so well. Despite a 2011 paper that seemed to suggest things were going well, the 2013 epidemic wrong-footed the Google Flu Trends (GFT) algorithms pretty thoroughly.

This article in Science finds that the real-world predictive power has been pretty unimpressive. And the reasons behind this failure are not hard to understand, nor were they hard to predict. Anyone who's ever worked with clinical trial data will see this one coming:

The initial version of GFT was a particularly problematic marriage of big and small data. Essentially, the methodology was to find the best matches among 50 million search terms to fit 1152 data points. The odds of finding search terms that match the propensity of the flu but are structurally unrelated, and so do not predict the future, were quite high. GFT developers, in fact, report weeding out seasonal search terms unrelated to the flu but strongly correlated to the CDC data, such as those regarding high school basketball. This should have been a warning that the big data were overfitting the small number of cases—a standard concern in data analysis. This ad hoc method of throwing out peculiar search terms failed when GFT completely missed the nonseasonal 2009 influenza A–H1N1 pandemic.
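The failure mode described in that excerpt is easy to reproduce in miniature. The sketch below is a toy simulation, not Google's actual pipeline, and the sizes are scaled well down from GFT's 50 million terms and 1152 data points: it correlates tens of thousands of pure-noise "search term" series against a pure-noise "flu" series, and some of them match strikingly well by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)

n_weeks = 50          # few data points (GFT had 1152)
n_terms = 50_000      # many candidate predictors (GFT screened 50 million)

# Both the "flu" target and every "search term" series are pure noise,
# so any correlation we find is spurious by construction.
flu = rng.normal(size=n_weeks)
terms = rng.normal(size=(n_terms, n_weeks))

# Pearson correlation of each term series with the target.
flu_z = (flu - flu.mean()) / flu.std()
terms_z = (terms - terms.mean(axis=1, keepdims=True)) / terms.std(axis=1, keepdims=True)
corrs = terms_z @ flu_z / n_weeks

best = np.abs(corrs).max()
print(f"best spurious correlation: {best:.2f}")
```

With 50,000 candidate series and only 50 weekly points, the best chance correlation typically lands around 0.6 — strong enough to look like signal if it is never checked against data the model hasn't seen.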

The Science authors have a larger point to make as well:

“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. Elsewhere, we have asserted that there are enormous scientific possibilities in big data. However, quantity of data does not mean that one can ignore foundational issues of measurement and construct validity and reliability and dependencies among data. The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.

The quality of the data matters very, very much, and quantity is no substitute. You can make a very large and complex structure out of toothpicks and scraps of wood, because those units are well-defined and solid. You cannot do the same with a pile of cotton balls and dryer lint, not even if you have an entire warehouse full of the stuff. If the individual data points are squishy, adding more of them will not fix your analysis problem; it will make it worse.

Since 2011, GFT has missed (almost invariably on the high side) in 108 out of 111 weeks. As the authors show, even low-tech extrapolation from three-week-lagging CDC data would have done a better job. But then, the CDC data are a lot closer to being real numbers. Something to think about next time someone's trying to sell you on a Big Data project. Only trust the big data when the little data are trustworthy in turn.
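For concreteness, the "low-tech extrapolation" baseline amounts to something like the following sketch (the function name and the example values are mine, purely for illustration — not the paper's): predict this week's flu activity by carrying forward the most recent CDC figure, which arrives about three weeks late.

```python
def lagged_forecast(cdc_series, lag=3):
    """Predict week t's activity as the CDC value reported for week t - lag,
    i.e. the newest figure actually available at week t given the reporting delay."""
    return [cdc_series[max(t - lag, 0)] for t in range(len(cdc_series))]

# Hypothetical weekly %ILI (influenza-like illness) values, for illustration only.
cdc = [1.0, 1.2, 1.5, 2.1, 3.0, 3.4, 3.1, 2.5]
print(lagged_forecast(cdc))  # → [1.0, 1.0, 1.0, 1.0, 1.2, 1.5, 2.1, 3.0]
```

Trivial as it is, a baseline like this is what any fancier predictor has to beat before it has earned its keep.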

Update: a glass-half-full response in the comments.

Comments (18) + TrackBacks (0) | Category: Biological News | Clinical Trials | Infectious Diseases


COMMENTS

1. RM on March 24, 2014 12:35 PM writes...

There was never much clarification about what terms, exactly, they were going to flag as being indicative of someone coming down with the flu,

If my understanding of these Google-type big data analysis projects is correct, they never specify what terms they're looking for because they don't have an a priori list of terms they're imposing. Instead they look at the complete set of data, and see what pops up.

For example, you would think that medical-related terms might be decent predictors ("fever", "runny nose", "coughing"), but "daycare" and "sprite" might be even better positive predictors, and things like "concert tickets" might be really good negative predictors. Depending on what they're using (which may be something like a deep belief artificial neural network), things that are relevant might have complex relationships and might not even make direct sense (e.g. "car" might come up in a complex relationship). My guess is that even if they told you what terms are indicative of flu in their model, most of them wouldn't make immediate sense.

This "unbiased" approach allows them to pick up things they might not have expected, but has the downside of possibly giving spurious correlations and over-fitting. They try to correct this with advanced statistical criteria, but this doesn't always work.

Permalink to Comment

2. Robb on March 24, 2014 12:55 PM writes...

If the problem really was that the GFT was overfitting search terms then Google must have neglected to validate their classifier on a separate dataset. That's a common mistake, particularly in medicine, and is the primary reason why clinical criteria never seem to work as well in practice as they do in the original paper. You absolutely must validate your classifier on different data than you used to train it. Even if that classifier is a set of diagnostic criteria to be applied by a human.

Alternately, the GFT may just be confused by non-seasonal flu. If you train your classifier on seasonal flu it's not surprising that it would perform poorly on the non-seasonal variety. Particularly if you went and eliminated correlated but seemingly unrelated seasonal search terms.

Train, then test. Then publish.
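The commenter's point generalizes: any sufficiently flexible model fit to noise will look good on its training data and collapse on held-out data. A minimal numpy sketch (illustrative only — nothing here resembles GFT's real features) makes the effect concrete.

```python
import numpy as np

rng = np.random.default_rng(1)

# Deliberately over-flexible model: 30 free coefficients for 40 observations,
# and the target is pure noise, so there is nothing real to learn.
n_train, n_test, n_features = 40, 40, 30
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.normal(size=n_test)

# Least-squares fit on the training data only.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def r2(X, y):
    """Fraction of variance 'explained' by the fitted coefficients."""
    resid = y - X @ coef
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

r2_train, r2_test = r2(X_train, y_train), r2(X_test, y_test)
print(f"train R^2: {r2_train:.2f}, test R^2: {r2_test:.2f}")
```

In-sample R² comes out high simply because of the parameter count; out-of-sample it drops to zero or below — which is exactly why the validation set has to stay untouched during training.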

Permalink to Comment

3. emjeff on March 24, 2014 12:56 PM writes...

Does this mean we might start to see the end of the term "big data"? Because I've been hearing how big data are going to make the sun shine brighter, candy taste sweeter and the whole world better for a few years now, but have not actually seen any real results. Like so many buzzwords, this is probably another where the real promise (and I am sure there is some merit in big data techniques, my snarky comments above notwithstanding) does not come close to the hype.

Permalink to Comment

4. oldnuke on March 24, 2014 1:25 PM writes...

No surprise -- sorting through a big pile of garbage won't usually give you any better information than a somewhat smaller pile.

On the other hand, it sells bigger computers and drives more "research".

I'm not so sure that this was ever needed - I thought that the Centers for Disease Control had a pretty good system (based on facts) already.

Permalink to Comment

5. mittimithai on March 24, 2014 2:25 PM writes...

I've got a set of big data comics planned which I will hopefully complete before the big data bubble bursts... but hopefully this provides entertainment in the meantime:

http://mittimithai.com/2014/03/the-big-data/

Permalink to Comment

6. gippgig on March 24, 2014 3:10 PM writes...

Off topic but may be of interest:
Frontline on PBS is supposed to cover drug-resistant tuberculosis. Check your local TV listing.

Permalink to Comment

7. PorkPieHat on March 24, 2014 4:05 PM writes...

Reminds me of the Combichem days...just substitute "somewhere in this big library(s) is the drug and we'll find it" with "somewhere in this big dataset is the pattern and we'll track it".

Permalink to Comment

8. jbosch on March 24, 2014 7:15 PM writes...

Big data, that rings a bell - an NSA bell, to be precise. I wonder if they have the same trouble.

Permalink to Comment

9. pgwu on March 24, 2014 8:39 PM writes...

I went to a SQL Saturday two weeks ago in Mountain View, CA. One big data consultant gave a session on big data. After his talk, I asked about his view on Google Flu. He said it's pretty good, maybe 10-15% off. I then told him about the Science article on the inaccuracies. He said it's just another opinion. That arrogant and ignorant attitude is probably not the exception among the big data crowd.

Permalink to Comment

10. Am I Lloyd peptide on March 24, 2014 8:46 PM writes...

This is what happens when technology overtakes our abilities to interpret it.

Permalink to Comment

11. I Prefer "Husky" Data, Thank You Very Much on March 25, 2014 1:32 AM writes...

A lot of the response (including the linked article) seems to be a variation on The Robots Are Coming to Take Our Jobs. But it's a strawman. I don't see any grand claims from Google (or anyone else) that this predictor should replace real-time flu surveillance. It's just another tool in the toolbox, and as this article shows, the tool can improve other predictors even with its flaws.

I think the predictor absolutely should be critically evaluated and I wish Google did more to publish their methods. But I'm struck by how much the authors missed the forest for the trees. Exhibit A:

http://www.sciencemag.org/content/343/6176/1203/F2.expansion.html

The authors claim that the graph shows GFT reports overly high flu prevalence most of the time. It does, but they seem overly concerned about the amplitude while ignoring the overall correlation.

My takeaway from that graph is that GFT seems to accurately reflect overall flu trends about as well as real CDC data. The overlap and first derivative seem pretty darn good to me. This is despite the fact that the model is naive, based on almost no medical knowledge, and was entirely trained on pre-2010 data.

When did we get so cynical? That a model could do so well with such noisy, uninformed data is pretty amazing if you ask me. Contrary to the authors' conclusion I think it represents a pretty convincing validation that so-called Big Data has an important role to play in improving predictions of societal and medical trends. Not the only role, but who is suggesting that?

Permalink to Comment

12. Anonymous BMS Researcher on March 25, 2014 7:38 AM writes...


Actually, one of the best ways to predict not only the level of flu in an upcoming North American flu season but also which subtypes will be the most common is to look at the Australian health service flu reports from their prior flu season. There are several reasons for this:

1. Australia's flu seasons are, of course, offset roughly half a year from ours.

2. The Australian public health authorities do an excellent job of collecting and collating their data.

3. Australia is geographically close to Asia, while it has deep cultural ties to other English-speaking countries. Thus, Australians regularly go basically everywhere. Therefore, the spectrum of viruses to which Australians are exposed represents a good sample of current flu strains from all over the world.

Permalink to Comment

13. Anon on March 25, 2014 9:29 AM writes...

So what does this say about Calico and all the other tech influenced startups?
This may make one question how effective Google is at its bread and butter. They are supposedly able to track people that search through their browser, build a profile, and sell access to advertisers. The GFT project should be no different from what they do day-to-day, in that a person searching for certain terms is a prime candidate for a Nyquil advertisement (what they call targeted advertising).
This really should be a big red light as this technique is supposed to be why they are valued at nearly 400 billion USD.

Permalink to Comment

14. Anonymous on March 25, 2014 11:12 AM writes...

@12: That all sounds very logical ... except that by the same logic, the reverse is also true, so you could predict Australian flu by what happens earlier in the US. And thus you end up with a system where you're just making predictions based on earlier flu seasons, like saying the flu will be bad this year because it was bad last year, which is hardly a prediction at all.

Permalink to Comment

15. Anonymous on March 25, 2014 7:35 PM writes...

@14 New flu strains are thought to mostly originate in S.E. Asia, so it seems reasonable that the Australians would tend to get them first. I don't have any data to back that up though

Permalink to Comment

16. Anonymous on March 26, 2014 4:35 AM writes...

Douglas Merrill, former CIO/VP of Engineering at Google, has issued an important warning about big data:

“With too little data, you won’t be able to make any conclusions that you trust. With loads of data you will find relationships that aren’t real… On net, having a degree in math, economics, AI, etc., isn’t enough. Tool expertise isn’t enough. You need experience in solving real world problems, because there are a lot of important limitations to the statistics that you learned in school. Big data isn’t about bits, it’s about talent.”

Permalink to Comment

17. Nony on March 26, 2014 8:28 AM writes...

There is a lot of "big data" talk that is just hype. People referring to it who don't even have a definition. Reminds me of TQM, lean, etc.

In some cases, the examples are just examples of data analysis itself. (So if you are inexperienced with data analysis, then you may think "yippee" with the premise of big data, but it's not anything new.) Look at companies with the six sigma fad.

In some cases, it really does have promise, but it won't work everywhere. For instance, retail purchases or google searches may have a high number of data pieces and of factors. But commodity pricing may be more suitably analyzed by cost curves and demand curves with a limited number of segments. In other cases, the hype seems to ignore fields that were doing big data work long before the buzzword came out. For example, actuarial science!

Permalink to Comment

18. Anonymous on March 26, 2014 9:02 AM writes...

#17 Nony, there are a lot of job titles on LinkedIn that read something like "IT Lead for R&D Big Data Strategy at ". In such instances, you might also notice that these people have additional experience in Six Sigma, Lean, TQM, etc. Coincidence?

Permalink to Comment

