About this Author

Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship for his post-doc. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis, and other diseases. To contact Derek, email him directly. Twitter: Dereklowe


In the Pipeline


September 13, 2012

ENCODE And What It All Means

Posted by Derek

You'll have heard about the massive data wave that hit (30 papers!) courtesy of the ENCODE project. That stands for Encyclopedia of DNA Elements, and it's been a multiyear effort to go beyond the bare sequence of human DNA and look for functional elements. We already know that only around 1% of the human sequence is made up of what we can recognize as real, traditional genes: stretches that code for proteins, have start and stop codons, and so on. And it's not like that's so straightforward, either, what with all the introns and whatnot. But that leaves an awful lot of DNA that's traditionally been known by the disparaging name of "junk", and surely it can't just be that - can it?
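That "traditional gene" picture - a start codon, a run of in-frame codons, then a stop codon - can be sketched as a toy scan. This is purely my illustration, not anything from the ENCODE papers, and real gene-finding is far messier (introns, splicing, the reverse strand, regulatory context), which is exactly the post's point:

```python
# Naive forward-strand open-reading-frame (ORF) scan, for illustration only.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) index pairs of naive ORFs: ATG ... in-frame stop."""
    orfs = []
    for frame in range(3):          # check all three reading frames
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                # walk forward codon by codon until an in-frame stop
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j       # resume scanning after this ORF (+3 below)
                        break
            i += 3
    return orfs

print(find_orfs("CCATGGCTTGATAA"))  # [(2, 11)] -> ATG GCT TGA
```

Even on this cartoon level, only a sliver of a random sequence qualifies; the hard question ENCODE tackles is what the rest is doing.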

Some of it does its best to make you think that way, for sure. Transposable elements like Alu sequences, which are repeated relentlessly hundreds of thousands of times throughout the human DNA sequence, must either be junk, inert spacer, or so wildly important that we just can't have too many copies of them. But DNA is three-dimensional (and how), and its winding and unwinding is crucial to gene expression. Surely a good amount of that apparently useless stuff is involved in these processes and other epigenetic phenomena.

And the ENCODE group has indeed discovered a lot of this sort of thing. But as this excellent overview from Brendan Maher at Nature shows, it hasn't discovered quite as much as the headlines might lead you to think. (And neither has it demolished the idea that the noncoding 99% is all junk, because you can't find anyone who actually believed that one, either). The figure that's in all the press writeups is that this work has assigned functions to 80% of the human genome, which would be an astonishing figure on several levels. For one thing, it would mean that we'd certainly missed an awful lot before, and for another, it would mean that the genome is a heck of a lot more information-rich than we ever thought it might be.

But neither of those quite seems to be the case. It all depends on what you mean by "functional", and opinions most definitely vary. See this post by Ed Yong for some of the categories, which range out to some pretty broad, inclusive definitions of "function". A better estimate is that maybe 20% of the genome can directly influence gene expression, which is very interesting and useful, but ain't no 80%, either. That Nature post provides a clear summary of the arguments about these figures.
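For a sense of scale, those percentages can be turned into base pairs. This is my back-of-the-envelope arithmetic; the ~3.1 billion bp haploid genome size is a round-number assumption, not a figure from the post:

```python
GENOME_BP = 3.1e9  # assumed haploid human genome size, in base pairs

fractions = {
    "protein-coding (traditional genes)": 0.01,
    "ENCODE headline 'functional'":       0.80,
    "directly influences expression":     0.20,
}
for label, frac in fractions.items():
    print(f"{label}: ~{frac * GENOME_BP / 1e9:.2f} billion bp")
```

Even the conservative 20% figure is on the order of six hundred million base pairs - twenty times the protein-coding fraction - which is why it will keep people busy for a long time.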

But even that more-solid 20% figure is going to keep us all busy for a long time. Learning how to affect these gene transcription mechanisms should be a very important route to new therapies. If you remember all the hype about how the genome was going to unlock cures to everything - well, this is the level we're actually going to have to work at to make anything in that line come true. There's a lot of work to be done, though. Somehow, different genes are expressed at different times, in different people, in response to a huge variety of environmental cues. It's quite a tangle, but in theory, it's a tangle that can be unraveled, and as it is, it's going to provide a lot of potential targets for therapy. Not easy targets, mind you - those are probably gone - but targets nonetheless.

One of the best ways to get a handle on all this work is this very interesting literature experiment at Nature - a portal into the ENCODE project data, organized thematically, and with access to all the papers involved across the different journals. If you're interested in epigenetics at all, this is a fine place to read up on the results of this work. And if you're not, it's still worth exploring to see how the scientific literature might be presented and curated. This approach, it seems to me, potentially adds a great deal of value. Eventually, the PDF-driven looks-like-a-page approach to the literature will go extinct, and something else will replace it. Some of it might look a bit like this.

Note, just for housekeeping purposes - I wrote this post for last Friday, but only realized today that it didn't publish, thus the lack of an entry that day. So here it is, better late, I hope, than never. There's more to say about epigenetics, too, naturally. . .

Comments (16) + TrackBacks (0) | Category: Biological News | The Scientific Literature


1. PtX on September 13, 2012 8:00 AM writes...

One thing such repeated elements may provide is binding sites for proteins/RNA organizing 3D structure of DNA. Having them on transposons provides a quick and easy way to select beneficial 3D arrangements.

One other way in which non-coding DNA might determine structure is by having stalled polymerases whose RNA is bound by adapter proteins to nucleoskeleton.


2. Curious Wavefunction on September 13, 2012 8:46 AM writes...

If you understand that chemistry and evolution are messy, it's not hard at all to believe that most of the genome would be junk. The ENCODE PR disaster is a reminder of how humans try to constantly find signal in the noise. But sometimes noise is just noise; that's the overarching lesson of biology.


3. John Wayne on September 13, 2012 11:57 AM writes...

It never occurred to me that science would be muddied by press releases, but here we are.


4. barry on September 13, 2012 12:24 PM writes...

We are going to have to reassess our molecular clocks for evolutionary distance again and again as we learn what parts of our DNA are truly free to mutate without cost and which are not. Surely that latter class includes more than the genes which get transcribed/translated to peptides/proteins. Exactly how much more isn't at all clear.


5. MolecularGeek on September 13, 2012 2:30 PM writes...

Ash (@2),
In some ways, it's even worse than that. By evolution, Homo sapiens is neurally wired as a pattern-detection engine. As one author whom I can't remember or find right now put it: "when your survival depends on noticing that there were leopard paw prints in the mud by the water hole yesterday, and Uncle Thakk disappeared shortly thereafter, you are rightfully cautious when you notice leopard tracks in the mud today." We try to see patterns everywhere. This doesn't serve us well when trying to make sense of truly random data.


6. PtX on September 13, 2012 3:34 PM writes...

Actually, looking for patterns everywhere serves us remarkably well; the whole of scientific knowledge is composed of such patterns (ones that have also passed empirical verification).

What doesn't serve us well is assuming from the get go there are no patterns to be found since something is be noise.


7. PtX on September 13, 2012 3:36 PM writes...

just noise*


8. Curious Wavefunction on September 13, 2012 4:13 PM writes...

True. The problem is that nature is smarter than we think. Sometimes the pattern is very significant but it's hidden in much noise. Sometimes only part of the noise is related to the pattern. In some sense of course it's impossible to be truly agnostic about patterns when we encounter noise, precisely because of our evolutionary history which you mentioned. What I find remarkable about junk DNA is that in its case, the noise actually makes evolutionary sense. I have a bit of this on my blog today.


9. metaphysician on September 13, 2012 4:20 PM writes...

Actually, I can think of a very good reason why 99% of your DNA can't be completely useless: physiological burden. Copying DNA takes energy, and the more DNA you need to copy, the greater the energy burden. If 99% of your DNA were useless and irrelevant, then any copying error in the gametes that knocked out a random chunk would result in an evolutionary advantage, since you'd have less DNA to spend energy copying.

Sure, there could still be value: material for potential random mutations, space to make a copying error less likely to hit something important. But I'm pretty sure the balance between those advantages, and the costs of copying unnecessary DNA, would occur somewhere way the hell closer together than 99% useless.
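The commenter's cost argument can be made semi-quantitative. If shedding some junk gives a lineage an assumed per-generation replication advantage s (the value 0.001 below is purely illustrative, not a measured number), the relative lineage size compounds as (1+s)^t, so even small advantages dominate over evolutionary timescales - though in a real finite population, drift can swamp tiny advantages, which is part of why the question stays open:

```python
s = 0.001  # assumed replication advantage from copying less DNA (illustrative)

# ratio of the junk-shedding lineage's size to the full-genome lineage's size
for generations in (100, 1000, 10000):
    advantage = (1 + s) ** generations
    print(f"after {generations:>6} generations: lineage size ratio ~{advantage:,.1f}x")
```

At s = 0.001 the ratio is barely 1.1x after 100 generations but over 20,000x after 10,000 - compounding is what makes the "junk costs energy" argument bite, if the per-copy cost is real.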


10. Anonymous on September 13, 2012 4:35 PM writes...


As you point out, the sequencing of the human genome was supposed to give us all the answers to every disease. It didn't - it showed us how little we understand about the human body. The same is true again with this ENCODE data.

And yet the temptation to speculate on 'potential new targets' is just too great. I'm sure some 'targets' will come out, a few biotechs will be set up, some IP sold to big Pharma and a lot of hype will fly around. But PhIII success - no.

'In theory' there is nothing wrong. But science is about data, and when we actually get some, the theories don't look so good so we adapt them. Then some more data and we adapt again. It takes a lot of time/$$/scientists to go from this kind of basic Biology data to understanding its relationship to physiology and diseases (that we also don't really understand). That is where we find medicines.

With my science hat on I love this stuff, stepping into the unknown, finding stuff out. With my pragmatic, applied science, hard-nosed Drug Discovery hat on, I know that it is not going to deliver over the time frame of any investment we can afford to make, so we should stay away.

However, in my big Pharma, senior leaders are already jumping up and down, fighting over who is going to lead the new initiative in this exciting new area, who is going to set up a new group, get new resources, set up collaborations, get promoted etc. Oh, and deliver candidates within 3 years.

Our response to new basic science is dumb and we are failing our investors and patients. And we don't learn.


11. Am I Lloyd peptide on September 13, 2012 4:46 PM writes...

@9: "Copying DNA takes energy, and the more DNA you need to copy, the more energy burden."

This is true, but getting rid of extra DNA and reconfiguring your genome likely takes even more energy. If a pseudogene has been accidentally inserted into a sequence, I think it would be far less risky to just copy the pseudogene and retain it (especially if it's causing no harm) rather than delete it and deal with the break.


12. Renegade Sci on September 13, 2012 4:54 PM writes...

To anon,

I have a BBA, a BS, and a few years of graduate research under my belt.

You sound like you have your science cap on, but let me put my business cap on. This sounds like enthusiasm to push a stock deal based on epigenetics hype.

It's like the current dotcom-2.0 situation, where new social media companies are advised to have ZERO revenue and not even try to make money before going public. The point is to sell the company on "all the money this will make" instead of "the money we are currently making", with the latter being limited by reality.

I love epigenetics, and I think it holds the key to an overall understanding of development and aging. I agree it's a baby; epigenetics isn't even in the dictionary yet. It puts you in a bad spot. You're dead-on correct that it's way too soon.

I just disagree with the assessment that they don't know exactly what they're doing by jumping into the hype with both feet: they're selling stock. (Yes, I'm jaded)


13. Anonymous BMS Researcher on September 13, 2012 10:40 PM writes...

I have a couple of 1 terabyte USB hard drives that I use for backing up computers at home. Probably 80% or more of the data on those drives will never be needed, but disks are so cheap that I am unwilling to spend hours deciding which files to keep: simpler to just maintain two sets of complete backups. Perhaps the genome is like that to some extent.

It is clear that epigenetics can do important things that we are just beginning to understand. Look up the Dutch Hunger Winter: near the end of WWII the Germans punished the Dutch Resistance by cutting off food shipments to the part of the Netherlands they still controlled. Millions suffered severe hunger, and thousands died. Studies of survivors have been very informative, because unlike most famine victims these people had good nutrition all their lives before and after one episode of starvation. Infants whose mothers starved during pregnancy showed lifelong epigenetic effects, as did the grandchildren of mothers who starved during pregnancy.


14. Petec on September 14, 2012 8:14 PM writes...

@13: Your hard drive metaphor makes sense only as long as the extra data isn't affecting the organism's fitness or replication rate. Deleting one file from your hard drive would let you copy the contents to another hard drive slightly faster (and the more hard drives, the better). That's not a big deal when doing it once, but when copying the "children's" contents to their "children" and so on, one would end up with orders of magnitude more copies of the hard drive without that one file than when starting with a completely full one.

I guess, as someone already mentioned, we're just not at the stage yet where one could say that 80% of the genome is junk DNA. We simply don't know. For even if it were noise, it would be nice to have it proven.


15. metaphysician on September 15, 2012 8:53 AM writes...


Well, actually, that's why I don't think it *is* junk. And it's exactly those copying errors that I was factoring in, not any kind of energy-intensive "junk removal" by the organism in question. *That* would be counterproductive, certainly, particularly since it's not totally clear how an organism would tell junk from not-junk. Whereas the hand of Random Copying Errors would do so quite efficiently: "If you live and prosper, it was junk. If it was necessary DNA, you die."


16. anon1 on September 21, 2012 5:25 PM writes...

Doesn't the ENCODE research have substantial possible implications for the future of human society? (specifically, with respect to the ability to change human intelligence)

Intelligence researchers have been confronted with the predicament that human intelligence is closely related to genetics, though after many years of genetic studies it has not been clear which genes are involved. The current research suggests that the effect sizes of individual genes are extremely small. If specific malfunctioning genes and proteins were involved, current gene therapy technology might be inadequate to effectively treat this problem.

However, if differences in human intelligence were more related to the non-coding parts of the genome (such as those investigated by ENCODE), it might be easier to change transcription factors and thus profoundly change the distribution of human intelligence.



