Corante

About this Author

Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship for his post-doc. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis and other diseases. To contact Derek, email him directly: derekb.lowe@gmail.com Twitter: Dereklowe


In the Pipeline

May 16, 2011

A Google Oddity

Posted by Derek

A comment to the last post mentioned that if you search the word "biotechnology" in Google's Ngram search engine, something odd happens. There's the expected rise in the 1970s and 80s, but there's also a bump in the early 1900s, for no apparent reason. Curious about this, I ran several other high-tech phrases through and found the exact same effect.

Here's a good example, with some modern physics phrases. And you get the same thing if you search "nanotechnology", "ribosome", "atomic force microscope", "RNA interference", "laser", "gene transfer", "mass spectrometer" or "nuclear magnetic resonance". There's always a jump in exactly the same period in the early 1900s.
[Image: dark matter.jpg]
So what's going on? I can understand some OCR errors, but why do these things show up in this specific Edwardian-age window? Can anyone at Google shed any light on this?

Comments (28) + TrackBacks (0) | Category: General Scientific News | The Scientific Literature


COMMENTS

1. Chemjobber on May 16, 2011 10:11 AM writes...

The sadly short steampunk age, naturally. Who's up for a steampunk NMR?

2. mph on May 16, 2011 10:21 AM writes...

Probably catalog errors, where books from 200x are cataloged as 190x, due to algorithmic mishandling of two-digit years.

Please don't disregard my comment in the other thread about case-sensitivity. Your "merck/pfizer/novartis" graph is totally bogus because of this.

3. Tony on May 16, 2011 10:28 AM writes...

My guess? All of this is searched in a database, and it is common for an unknown date to be stored as 1/1/1900, or 12/31/1900, or other dates all around then. My guess is it's a sign of lots of undated material and the program that creates the chart not taking that into account. Just a guess though.

4. Derek Lowe on May 16, 2011 10:31 AM writes...

The database error explanation sounds plausible. And mph, I'm fixing that cap/lower case issue now, adding new graphs to the earlier post. Thanks!

5. SVI on May 16, 2011 10:32 AM writes...

In the period prior to WWI there was a "Futurist" literary movement. A lot of the writing dealt with the advent of new technologies, and involved imagining new technologies and branches of science. It is possible that some of these terms were first used by the futurists, and later adopted by actual scientists.

The futurists sincerely believed that the new century would bring a period of peace and prosperity courtesy of the radical technological transformation. Most of them either died in the trenches, or were soured by the pointy ends of technology. However, it is my understanding that a lot of midcentury scientists were inspired by the futurist writings in their youth to actually take up science (von Braun would be one example).

6. Gav on May 16, 2011 10:41 AM writes...

Someone apparently discovered graphene in the late 1800s too.

http://twitter.com/#!/Gav33/status/15761496826974208

7. Curious on May 16, 2011 11:00 AM writes...

Perhaps the blip around 1905 is a real Year 2K effect making itself manifest somewhere in the document database. Perhaps a group of documents were dated incorrectly as 1905 when they should have been dated as 2005 when they were logged.

8. Anonymous on May 16, 2011 11:00 AM writes...

y2k compliance or lack thereof?

In grad school, I used a GC whose computer routinely dated traces to the second century AD. The 100s obviously followed the 90s, at least in Win 3.1.

If you don't know the origin of "carbene," then that search could really blow your mind.

http://ngrams.googlelabs.com/graph?content=carbene&year_start=1820&year_end=2008&corpus=0&smoothing=0

9. Lev on May 16, 2011 11:56 AM writes...

I'm 99.9% (or should it be 1999.9? ;-) sure it's the Y2K problem manifestation. Some of the '01' entries are wrongfully assigned to the 19.. instead of 20...

10. Stiv on May 16, 2011 12:48 PM writes...

you also see the ~Y1900 bump with the term "microbrewery".....who knew?

11. dvanbaak on May 16, 2011 12:53 PM writes...

If you play with the 'smoothing' control of Google's Ngram Viewer, you'll find the premature appearances of 'string theory' and the like are all found in the year 1901 -- which others have hypothesized are citations from 2001, mis-allocated to 1901.

12. Anon anon anon on May 16, 2011 2:19 PM writes...

Suppose `9' -> `0' is a frequent OCR error. It's not so absurd, given the large overlap in the glyphs. That would move some fraction of events from 1997 to 1907. It would also explain why the bump disappears at 1910.

We could test this theory, if we see a similar bump from 1997 becoming 1097 and 1007, but the data only goes back to 1500.
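
A minimal sketch of the misreadings this hypothesis predicts, assuming any subset of the `9` digits in a year can be read as `0`. The function name `nine_to_zero_misreads` is made up for illustration and has nothing to do with Google's actual OCR pipeline.

```python
from itertools import combinations


def nine_to_zero_misreads(year):
    """Return every year obtained by reading one or more '9' digits as '0'."""
    digits = str(year)
    # Positions of every '9' in the printed year.
    nines = [i for i, d in enumerate(digits) if d == "9"]
    results = set()
    # Try every non-empty subset of those positions.
    for r in range(1, len(nines) + 1):
        for positions in combinations(nines, r):
            chars = list(digits)
            for p in positions:
                chars[p] = "0"
            results.add(int("".join(chars)))
    return sorted(results)


print(nine_to_zero_misreads(1997))  # [1007, 1097, 1907]
```

Under this assumption only years from the 1990s can land in 1900-1909, which fits the bump ending at 1910. Years from 2000 onward contain no `9` to misread unless they end in one.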

13. RB Woodweird on May 16, 2011 2:50 PM writes...

SVI said "The futurists sincerely believed that the new century would bring a period of peace and prosperity courtesy of the radical technological transformation. Most of them either died in the trenches, or were soured by the pointy ends of technology. However, it is my understanding that a lot of midcentury scientists were inspired by the futurist writings in their youth to actually take up science (von Braun would be one example)."

If an educated person from say 1880 woke up today would they be amazed at our progress or disappointed that we haven't gone farther than we have?

http://www.scribd.com/doc/13561852/Breakfast-in-the-Next-Century-an-original-screenplay

14. Henrik Olsen on May 16, 2011 3:10 PM writes...

The strange thing is that when you try to actually find any books where the phrase is used, in the 1899-1911 period, Google comes up empty.

I'm voting bug.

15. Henrik Olsen on May 16, 2011 3:16 PM writes...

Fast update, a large amount of the weirdness is due to the smoothing function used, which will make a single mention in a single book look like the phrase was used repeatedly in a 7 year period. Try the searches again with smoothing set to 0.
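
A minimal sketch of the smoothing effect, assuming the viewer uses a centered moving average in which a smoothing value of k averages each year with up to k years on either side (so 3 gives a 7-year window). The `smooth` helper and the toy data are illustrative, not Google's code.

```python
def smooth(values, k):
    """Centered moving average over up to k neighbours on each side."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - k)
        hi = min(len(values), i + k + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


years = list(range(1895, 1908))
# Toy data: a single hit in 1901, nothing in any other year.
raw = [1.0 if y == 1901 else 0.0 for y in years]

smoothed = smooth(raw, 3)
nonzero_years = [y for y, v in zip(years, smoothed) if v > 0]
print(nonzero_years)  # [1898, 1899, 1900, 1901, 1902, 1903, 1904]
```

With smoothing set to 0 the series comes back unchanged, so the single 1901 hit shows up as one spike instead of a seven-year plateau.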

16. dave w on May 16, 2011 4:00 PM writes...

My guess would be 2-digit year ambiguities, as others have suggested: items dated, e.g., 4/30/05 (intended as April 30, 2005) being counted as "1905" dates etc. - it wouldn't affect anything of 1999 or earlier, and the current year just rolled over from 2010 to 2011 (so there isn't anything later than 2011, and not much yet with that date)... so the net effect is going to be that stuff from 2000-2010 would show up (if affected by the ambiguities) as 1900-1910 search hits (which is roughly the visible bump in the graph).

I suspect that similar search graphs made 5 years from now will show a bump from 1900-1916, if affected by the same issue.
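
A minimal sketch of the two-digit-year ambiguity, assuming a cataloging step that always prepends "19" to a two-digit year. Both function names and the pivot value are illustrative, not anything Google is known to use.

```python
def naive_expand(two_digit_year):
    """Assume every two-digit year belongs to the 1900s."""
    return 1900 + two_digit_year


def pivot_expand(two_digit_year, pivot=50):
    """Treat values below the pivot as 20xx and the rest as 19xx."""
    century = 2000 if two_digit_year < pivot else 1900
    return century + two_digit_year


for yy in (5, 8, 99):
    print(yy, naive_expand(yy), pivot_expand(yy))
# 5 1905 2005
# 8 1908 2008
# 99 1999 1999
```

The naive version is right for anything from 1999 or earlier and wrong only for 2000 onward, which would put the misdated items in the first decade or so of the 1900s.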

17. Esteban on May 16, 2011 4:04 PM writes...

Searching on 'internet' also gives the pronounced bump in the 190x's, with nothing on either side of it. The two-digit year hypothesis seems like the only possible explanation.

18. victor on May 16, 2011 7:38 PM writes...

if you look farther back in time, there are even more oddities that are most likely due to the smoothing function or incorrectly entered data (e.g. volume 1900 entered as year 1900).

I may have also stumbled on the first documented instance of jokes being told "too soon."

http://bit.ly/lb2VWW

i wonder what the joke was and if the offended party was british.

19. Neil on May 17, 2011 1:20 AM writes...

Whatever the reason, it's a good illustration that "big data" is not necessarily "good data." I've seen so many posts where people are practically wetting themselves over Google n-gram, with no regard for the possibility that a lot of the data may simply be complete rubbish.

20. Donough on May 17, 2011 2:30 AM writes...

From google about ngram

Publishing was a relatively rare event in the 16th and 17th centuries. (There are only about 500,000 books published in English before the 19th century.) So if a phrase occurs in one book in one year but not in the preceding or following years, that creates a taller spike than it would in later years.

Plateaus are usually simply smoothed spikes. Change the smoothing to 0.

21. Terry on May 17, 2011 7:27 AM writes...

There's an old saying from the database industry world - "garbage in, garbage out".

I agree with some of the others here - most likely the system is reading data with typos.

22. CR on May 17, 2011 11:44 AM writes...

@Terry and others...

There is another old saying...'Who cares?'

23. Slurpy on May 17, 2011 3:07 PM writes...

Definitely a date-parsing issue. "Nanotechnology," "dendrimer," "meitnerium" and "fullerene" were not terms coined in the 1900s.

24. PB on May 17, 2011 8:27 PM writes...

@8. Here's a rather interesting take on the origins of carbenes. :-)

http://www.cracked.com/article_19021_5-amazing-things-invented-by-donald-duck-seriously_p2.html

25. dataguy on May 18, 2011 2:00 PM writes...

This is a common problem in public databases of all kinds. According to the FDA BMIS data set, there were a small number of form 1572 (clinical investigator) filings back in the days of William the Conqueror.

26. Pavel Kasík on May 19, 2011 7:36 AM writes...

I suspect that this is a bad year recognition. 2008, written as 08, recognized as 1908. Explains why this goes from 1900 to 1910. Right?

27. JM on June 6, 2011 5:25 PM writes...

One of the worst spikes occurs with "Google" itself!

28. Logan on March 3, 2014 3:17 PM writes...

It's actually not a mistake at all!

If you look at each of them individually, it makes sense.

Dark matter refers to a dark material found near the red spot of Jupiter, which was written about in those years.

String theory is about the biology theory about basilar membranes.

Nuclear fusion is also about biology, the fusion of two nuclei of cells.
