Another cannon has gone off in the noncoding-genome wars. Here's a paper in PLOS Genetics detailing what the authors are calling Long Intergenic Noncoding RNAs (lincRNAs):
Known protein coding gene exons compose less than 3% of the human genome. The remaining 97% is largely uncharted territory, with only a small fraction characterized. The recent observation of transcription in this intergenic territory has stimulated debate about the extent of intergenic transcription and whether these intergenic RNAs are functional. Here we directly observed with a large set of RNA-seq data covering a wide array of human tissue types that the majority of the genome is indeed transcribed, corroborating recent observations by the ENCODE project. Furthermore, using de novo transcriptome assembly of this RNA-seq data, we found that intergenic regions encode far more long intergenic noncoding RNAs (lincRNAs) than previously described, helping to resolve the discrepancy between the vast amount of observed intergenic transcription and the limited number of previously known lincRNAs. In total, we identified tens of thousands of putative lincRNAs expressed at a minimum of one copy per cell, significantly expanding upon prior lincRNA annotation sets. These lincRNAs are specifically regulated and conserved rather than being the product of transcriptional noise. In addition, lincRNAs are strongly enriched for trait-associated SNPs suggesting a new mechanism by which intergenic trait-associated regions may function.
Emphasis added, because that's been one of the key points in this debate. The authors regard the ENCODE data as "firmly establishing the reality of pervasive transcription", so you know where their sympathies lie. And their results are offered up as a strong corroboration of the ENCODE work, with lincRNAs serving as the, well, missing link.
One thing I notice is that these new data strongly suggest that many of these RNAs are expressed at very low levels. The authors set cutoffs for "fragments per kilobase of transcript per million mapped reads" (FPKM), discarding everything that came out as less than 1 (roughly one copy per cell). The set of RNAs with FPKM>1 is over 50,000. If you ratchet the cutoff up a bit, things drop off steeply, though. FPKM>10 knocks that down to between three and four thousand, and FPKM>30 gives you 925 lincRNAs. My guess is that those are where the next phase of this debate will take place, since those expression levels get you away from the noise. But the problem is that the authors are explicitly making the case for thousands upon thousands of lincRNAs being important, and that interpretation won't be satisfied by everyone agreeing on a few hundred new transcripts. These things also seem to be very tissue-specific, so it looks like the arguing is going to get very granular indeed.
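For anyone who wants to see how much work those cutoffs are doing, here is a minimal sketch of the standard FPKM arithmetic and a tally of how many transcripts survive each threshold. This is not the authors' pipeline: the function names, the toy transcript names, and the counts in `toy_transcripts` are all made up for illustration, and only the three cutoffs (1, 10, 30) come from the paper as described above.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Standard definition: fragments / (length in kb) / (total mapped in millions),
    which simplifies to fragments * 1e9 / (length_bp * total_mapped).
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def count_above_cutoffs(fpkm_values, cutoffs=(1, 10, 30)):
    """Return {cutoff: number of transcripts with FPKM strictly above it}."""
    return {c: sum(1 for v in fpkm_values if v > c) for c in cutoffs}


# Hypothetical data: name -> (fragments, length in bp, total mapped fragments).
# These numbers are invented purely to exercise the functions.
toy_transcripts = {
    "lincA": (50, 2000, 25_000_000),
    "lincB": (500, 2000, 25_000_000),
    "lincC": (3000, 2000, 25_000_000),
    "lincD": (100, 1000, 25_000_000),
}

toy_fpkm = {name: fpkm(*args) for name, args in toy_transcripts.items()}
toy_counts = count_above_cutoffs(toy_fpkm.values())

if __name__ == "__main__":
    for name, value in toy_fpkm.items():
        print(f"{name}: FPKM = {value:.1f}")
    for cutoff, n in toy_counts.items():
        print(f"FPKM > {cutoff}: {n} transcript(s)")
```

The comparison here is strict (`>`), matching the "FPKM>1" phrasing, so a transcript sitting exactly on a cutoff is not counted. Whether the paper treats the boundary the same way is an assumption on my part.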
Here's a quote from the paper that sums up the two worldviews that are now fighting it out:
Almost half of all trait-associated SNPs (TASs) identified in genome-wide association studies are located in intergenic sequence while only a small portion are in protein coding gene exons. This curious observation points to an abundance of functional elements in intergenic sequence.
Or that curious observation could be telling you that there's something wrong with your genome-wide association studies. I lean towards that view, but the battles aren't over yet.