Last fall we had the landslide of data from the ENCODE project, along with a similar landslide of headlines proclaiming that 80% of the human genome was functional. That link shows that many people (myself included) were skeptical of this conclusion at the time, and since then others have weighed in with their own doubts.
A new paper from Dan Graur at Houston (with co-authors from Houston and Johns Hopkins) is really stirring things up. And whether you agree with its authors or not, it's well worth reading - you just don't see thunderous dissents like this one in the scientific literature very often. Here, try this out:
Thus, according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection, which implies that (at least 70%) of the genome is perfectly invulnerable to deleterious mutations, either because no mutation can ever occur in these “functional” regions, or because no mutation in these regions can ever be deleterious. This absurd conclusion was reached through various means, chiefly (1) by employing the seldom used “causal role” definition of biological function and then applying it inconsistently to different biochemical properties, (2) by committing a logical fallacy known as “affirming the consequent,” (3) by failing to appreciate the crucial difference between “junk DNA” and “garbage DNA,” (4) by using analytical methods that yield biased errors and inflate estimates of functionality, (5) by favoring statistical sensitivity over specificity, and (6) by emphasizing statistical significance rather than the magnitude of the effect.
Other than that, things are fine. The paper goes on to detailed objections in each of those categories, and the tone does not moderate. One of the biggest objections is around the use of the word "function". The authors are at pains to distinguish selected effect functions from causal role functions, and claim that one of the biggest shortcomings of the ENCODE claims is that they blur this boundary. "Selected effects" are what most of us think of as well-proven functions: a TATAAA sequence in the genome binds a transcription factor, with effects on the gene(s) downstream of it. If there is a mutation in this sequence, there will almost certainly be functional consequences (and these will almost certainly be bad). Imagine, however, a random sequence of nucleotides that's close enough to TATAAA to bind a transcription factor. In this case, there are no functional consequences - genes aren't transcribed differently, and nothing really happens other than the transcription factor parking there once in a while. That's a "causal role" function, and the whopping majority of the ENCODE functions appear to be in this class. "It looks sort of like something that has a function, therefore it has one". And while this can lead to discoveries, you have to be careful (a quick calculation after the quote shows just how cheap such sequences are):
The causal role concept of function can lead to bizarre outcomes in the biological sciences. For example, while the selected effect function of the heart can be stated unambiguously to be the pumping of blood, the heart may be assigned many additional causal role functions, such as adding 300 grams to body weight, producing sounds, and preventing the pericardium from deflating onto itself. As a result, most biologists use the selected effect concept of function. . .
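To put a number on how cheap "causal role" function is: here's a back-of-the-envelope sketch (mine, not the paper's) of how often a short motif like TATAAA should turn up purely by chance in a genome-sized random sequence, assuming equal base frequencies for simplicity:

```python
# Toy estimate: chance occurrences of a 6-base motif in a random genome.
GENOME_SIZE = 3_000_000_000   # roughly 3 billion bases in the human genome
MOTIF = "TATAAA"

# Assuming all four bases are equally likely (a simplification - the real
# genome is about 41% GC), any given position starts an exact match with
# probability (1/4)**len(MOTIF).
p_match = 0.25 ** len(MOTIF)
expected_hits = GENOME_SIZE * p_match

print(f"P(exact match at a given position) = {p_match:.2e}")
print(f"Expected chance occurrences of {MOTIF}: ~{expected_hits:,.0f}")
```

That works out to roughly 700,000 exact TATAAA matches expected from randomness alone - plenty of places for a transcription factor to park without any selected function behind it.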
A mutation in that random TATAAA-like sequence would be expected to be silent, compared to what would happen in a real binding motif. So one would want to know what percent of the genome is under selection pressure - that is, what part of it can't be mutated without something happening. Those studies are where we get the figures of perhaps 10% of the DNA sequence being functional. Almost all of what ENCODE has declared to be functional, though, can absorb mutations with relative impunity:
From an evolutionary viewpoint, a function can be assigned to a DNA sequence if and only if it is possible to destroy it. All functional entities in the universe can be rendered nonfunctional by the ravages of time, entropy, mutation, and what have you. Unless a genomic functionality is actively protected by selection, it will accumulate deleterious mutations and will cease to be functional. The absurd alternative, which unfortunately was adopted by ENCODE, is to assume that no deleterious mutations can ever occur in the regions they have deemed to be functional. Such an assumption is akin to claiming that a television set left on and unattended will still be in working condition after a million years because no natural events, such as rust, erosion, static electricity, and earthquakes can affect it. The convoluted rationale for the decision to discard evolutionary conservation and constraint as the arbiters of functionality put forward by a lead ENCODE author (Stamatoyannopoulos 2012) is groundless and self-serving.
Basically, if you can't destroy a function by mutation, then there is no function to destroy. Even the most liberal definitions take this principle to apply to about 15% of the genome at most, so the 80%-or-more figure really does stand out. But this paper has more than philosophical objections to the ENCODE work. Its authors point out that the consortium used tumor cell lines for its work, and that these are notoriously permissive in their transcription. One of the principles behind the 80% figure is that "if it gets transcribed, it must have a function", but you can't say that about HeLa cells and the like, which read off all sorts of pseudogenes and other nonfunctional sequences (introns, mobile DNA elements, etc.).
One of the other criteria the ENCODE studies used for assigning function was histone modification. Now, this bears on a lot of hot topics in drug discovery these days, because an awful lot of time and effort is going into such epigenetic mechanisms. But (as this paper notes) this recent study illustrated that not all histone modifications are equal - there may, in fact, be a large number of silent ones. Another ENCODE criterion had to do with open (accessible) regions of chromatin, but there's a potential problem here, too:
They also found that more than 80% of the transcription start sites were contained within open chromatin regions. In yet another breathtaking example of affirming the consequent, ENCODE makes the reverse claim, and adds all open chromatin regions to the “functional” pile, turning the mostly true statement “most transcription start sites are found within open chromatin regions” into the entirely false statement “most open chromatin regions are functional transcription start sites.”
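The base-rate arithmetic behind that fallacy is worth spelling out. With made-up counts (purely illustrative, not ENCODE's numbers), a conditional probability can be high in one direction and tiny in the other:

```python
# Hypothetical counts, chosen only to illustrate the logical point.
tss_sites = 100_000            # transcription start sites (illustrative)
open_regions = 2_000_000       # open chromatin regions (illustrative)
tss_in_open = 0.8 * tss_sites  # per the quote: >80% of TSSs sit in open chromatin

# P(open chromatin | TSS) is high by construction...
p_open_given_tss = tss_in_open / tss_sites
# ...but P(TSS | open chromatin) depends on the base rates, and is tiny here:
p_tss_given_open = tss_in_open / open_regions

print(f"P(open chromatin | TSS) = {p_open_given_tss:.0%}")   # 80%
print(f"P(TSS | open chromatin) = {p_tss_given_open:.1%}")   # 4.0%
```

Treating every open chromatin region as a functional start site quietly swaps the first probability for the second.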
Similar arguments apply to the 8.5% of the genome that ENCODE assigns to transcription factor binding sites. When you actually try to experimentally verify function for such things, the huge majority of them fall out. (It's also noted that there are some oddities in ENCODE's definitions here - for example, they seem to be annotating 500-base stretches as transcription factor binding sites, when most of the verified ones are below 15 bases in length).
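To see how much that annotation width matters, here's some rough arithmetic. I'm assuming here that the 8.5% coverage figure is built from those 500-base annotations - the paper implies as much, but I haven't verified it:

```python
# Rough sketch of how annotation width inflates coverage claims.
GENOME_SIZE = 3_000_000_000
annotated_len = 500   # per the paper, ENCODE annotates ~500-base stretches
motif_len = 15        # most verified binding sites are under ~15 bases

# If 8.5% of the genome is covered by 500-base annotations...
n_sites = 0.085 * GENOME_SIZE / annotated_len
# ...the same sites at realistic motif widths would cover far less:
true_coverage = n_sites * motif_len / GENOME_SIZE

print(f"Implied number of annotated sites: ~{n_sites:,.0f}")
print(f"Coverage at {motif_len}-base widths: {true_coverage:.3%}")  # ~0.255%
```

In other words, the same set of binding events shrinks from 8.5% of the genome to about a quarter of a percent just by measuring the sites at their verified lengths.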
Now, it's true that the ENCODE studies did try to address the idea of selection on all these functional sequences. But this new paper has a lot of very caustic things to say about the way this was done, and I'll refer you to it for the full picture. To give you some idea, though:
By choosing primate specific regions only, ENCODE effectively removed everything that is of interest functionally (e.g., protein coding and RNA-specifying genes as well as evolutionarily conserved regulatory regions). What was left consisted among others of dead transposable and retrotransposable elements. . .
. . .Because polymorphic sites were defined by using all three human samples, the removal of two samples had the unfortunate effect of turning some polymorphic sites into monomorphic ones. As a consequence, the ENCODE data includes 2,136 alleles each with a frequency of exactly 0. In a miraculous feat of “next generation” science, the ENCODE authors were able to determine the frequencies of nonexistent derived alleles.
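If the mechanics of that aren't clear, here's a minimal sketch (with invented genotypes, not ENCODE's data) of how defining polymorphic sites on all samples and then dropping some of them yields derived alleles with a frequency of exactly zero:

```python
# 0 = ancestral allele, 1 = derived allele; one genotype per sample per site.
all_samples = {
    "site_A": [1, 0, 1],  # polymorphic across the three samples
    "site_B": [0, 1, 1],  # polymorphic, but only samples 2 and 3 carry the derived allele
}

# Polymorphic sites are defined using ALL three samples...
polymorphic = [s for s, g in all_samples.items() if 0 < sum(g) < len(g)]

# ...but frequencies are then computed after dropping two of the samples:
kept = 1
for site in polymorphic:
    genotypes = all_samples[site][:kept]
    freq = sum(genotypes) / len(genotypes)
    print(f"{site}: derived allele frequency = {freq:.1f}")
# site_B comes out at 0.0 - a "polymorphic" site whose derived allele
# no longer exists in the data actually being analyzed.
```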
That last passage brings up one of the objections that many people may have to this paper - it does take on a rather bitter tone. I actually don't mind it - who am I to object, given some of the things I've said on this blog? But it could be counterproductive, leading to arguments over the insults rather than arguments over the things being insulted (and over whether they're worthy of the scorn). People could end up waving their hands and running around shouting in all the smoke, rather than figuring out how much fire there is and where it's burning. The last paragraph of the paper is a good illustration:
The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.
Well, maybe that was necessary. The amount of media hype was huge, and the only way to counter it might be to try to generate a similar amount of noise. It might be working, or starting to work - normally, a paper like this would get no popular press coverage at all. But will it make CNN? The Science section of the New York Times? ENCODE's results certainly did.
But what the general public thinks about this controversy is secondary. The real fight is going to be here in the sciences, and some of it is going to spill out of academia and into the drug industry. As mentioned above, a lot of companies are looking at epigenetic targets, and a lot of companies would (in general) very much like to hear that there are a lot more potential drug targets than we know about. That was what drove the genomics frenzy back in 1999-2000, an era that was not without its consequences. The coming of the ENCODE data was (for some people) the long-delayed vindication of the idea that gene sequencing was going to lead to a vast landscape of new disease targets. There was already a comment on my entry at the time suggesting that some industrial researchers were jumping on the ENCODE work as a new area to work in, and it wouldn't surprise me to see many others thinking similarly.
But we're going to have to be careful. Transcription factors and epigenetic mechanisms are hard enough to work on, even when they're carefully validated. Chasing after ephemeral ones would truly be a waste of time. . .
More reactions around the science blogging world: Wavefunction, Pharyngula, SciLogs, Openhelix. And there are (and will be) many more.