About this Author
Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship for his post-doc. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis and other diseases. Twitter: Dereklowe


In the Pipeline


July 1, 2013

Corroboration for ENCODE?


Posted by Derek

Another cannon has gone off in the noncoding-genome wars. Here's a paper in PLOS Genetics detailing what the authors are calling Long Intergenic Noncoding RNAs (lincRNAs):

Known protein coding gene exons compose less than 3% of the human genome. The remaining 97% is largely uncharted territory, with only a small fraction characterized. The recent observation of transcription in this intergenic territory has stimulated debate about the extent of intergenic transcription and whether these intergenic RNAs are functional. Here we directly observed with a large set of RNA-seq data covering a wide array of human tissue types that the majority of the genome is indeed transcribed, corroborating recent observations by the ENCODE project. Furthermore, using de novo transcriptome assembly of this RNA-seq data, we found that intergenic regions encode far more long intergenic noncoding RNAs (lincRNAs) than previously described, helping to resolve the discrepancy between the vast amount of observed intergenic transcription and the limited number of previously known lincRNAs. In total, we identified tens of thousands of putative lincRNAs expressed at a minimum of one copy per cell, significantly expanding upon prior lincRNA annotation sets. These lincRNAs are specifically regulated and conserved rather than being the product of transcriptional noise. In addition, lincRNAs are strongly enriched for trait-associated SNPs suggesting a new mechanism by which intergenic trait-associated regions may function.

Emphasis added, because that's been one of the key points in this debate. The authors regard the ENCODE data as "firmly establishing the reality of pervasive transcription", so you know where their sympathies lie. And their results are offered up as a strong corroboration of the ENCODE work, with lincRNAs serving as the, well, missing link.

One thing I notice is that these new data strongly suggest that many of these RNAs are expressed at very low levels. The authors set cutoffs for "fragments per kilobase of transcript per million mapped reads" (FPKM), discarding everything that came out as less than 1 (roughly one copy per cell). The set of RNAs with FPKM>1 is over 50,000. If you ratchet up a bit, things drop off steeply, though. FPKM>10 knocks that down to between three and four thousand, and FPKM>30 gives you 925 lincRNAs. My guess is that those are where the next phase of this debate will take place, since those expression levels get you away from the noise. But the problem is that the authors are explicitly making the case for thousands upon thousands of lincRNAs being important, and this interpretation won't be satisfied with everyone agreeing on a few hundred new transcripts. These things also seem to be very tissue-specific, so it looks like the arguing is going to get very granular indeed.
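For readers who haven't worked with FPKM, here is a minimal Python sketch of how such a cutoff behaves. It uses the standard definition (fragments divided by transcript length in kilobases and by library size in millions of mapped fragments). The function names, the toy transcripts, and the library size are all made up for illustration; none of this is the paper's pipeline or its data.

```python
def fpkm(fragments: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


def count_above(values, cutoff):
    """Count how many expression values exceed the cutoff."""
    return sum(1 for v in values if v > cutoff)


# Hypothetical toy data: (fragments mapped to the transcript, transcript length in bp).
toy_transcripts = [(5, 2_000), (40, 1_000), (900, 3_000), (12_000, 4_000)]
total_mapped = 20_000_000  # assumed library size, for illustration only

values = [fpkm(frags, length, total_mapped) for frags, length in toy_transcripts]

# Raising the cutoff thins out the list, as it does in the paper's real numbers.
for cutoff in (1, 10, 30):
    print(f"FPKM > {cutoff}: {count_above(values, cutoff)} transcripts")
```

The point of the sketch is only that the size of the "expressed" set depends heavily on where the threshold is drawn, which is why the choice of cutoff matters so much in this argument.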

Here's a quote from the paper that sums up the two worldviews that are now fighting it out:

Almost half of all trait-associated SNPs (TASs) identified in genome-wide association studies are located in intergenic sequence while only a small portion are in protein coding gene exons. This curious observation points to an abundance of functional elements in intergenic sequence.

Or that curious observation could be telling you that there's something wrong with your genome-wide association studies. I lean towards that view, but the battles aren't over yet.

Comments (25) + TrackBacks (0) | Category: Biological News


1. C on July 1, 2013 11:50 AM writes...

Derek, genetics isn't your area. It is clear from GWAS, as well as other studies such as exome sequencing, that much of the risk for complex disease resides in non-coding portions of the genome. This shouldn't be so surprising. GWAS is a time-tested technique and has found many replicable associations.


2. Puff the Mutant Dragon on July 1, 2013 12:04 PM writes...

IMHO, the problem with GWAS is the underlying assumption, the common disease common variant hypothesis. I think the common disease rare variant hypothesis is probably more accurate for many complex disorders (e.g. autism, schizophrenia etc.) where GWAS has failed


3. puff the mutant dragon on July 1, 2013 12:04 PM writes...

although I am not a geneticist either ;)


4. C on July 1, 2013 12:12 PM writes...

I think it is clear that a combination of common and rare variants is responsible for complex diseases such as schizophrenia and autism, and that GWAS is more sensitive to the common variants. However, rare variants are difficult to study using the tools of human genetics unless they are highly penetrant and confer a high degree of risk. The abundant rare variants that confer modest degrees of risk, and that underlie much of complex genetics, are simply difficult to study with statistical methods. This doesn't make GWAS a failure, though.


5. ESIMS on July 1, 2013 12:55 PM writes...

always the good old question: yes it is statistically significant (by your method of choice), but is it also biologically relevant?

lincRNAs are cool, but as long as you don't show what they are doing in particular it might just be an artifact/misassignment of your sequence analysis.
And this "specifically regulated and conserved rather than being the product of transcriptional noise" part... based on the Kd of TFs, PolII assembly, insufficient termination... not really convinced


6. RKN on July 1, 2013 2:49 PM writes...

Yet another -ome. Where will it end!


7. Jonathan on July 1, 2013 3:12 PM writes...

@Puff the mutant dragon - that is looking less and less likely now:

The genomic component of common diseases is probably going to be the cumulative effect of a lot of weak, common variants, not highly penetrant but very rare ones.


8. Mad Dog on July 1, 2013 3:46 PM writes...

@1 and 4

What is (are) your measure(s) of success or failure of GWAS? If your measure is the ability to quantify smaller and smaller amounts of variance, then GWAS blows away any competition. If, however, your measure is the ability to help researchers find new targets for drugs (gene therapy included), then GWAS is a long and painful road to a lot of publications that don't help drug hunters worth a darn.

Drug Discovery isn't your area.


9. ESIMS on July 1, 2013 4:05 PM writes...

Even if you think about helping MDs, the algorithms (or the errors/bias produced by the HT sequencers) are so far simply not good enough.

What is the medical relevance if you sequence person X and in the end you can tell him only that he has a 5% chance (and that is really a high number!) of developing disease Y? If your genome sequence reached a higher medical value/importance than your family history, that would be a real leap forward.


10. Anonymous on July 1, 2013 4:10 PM writes...

GWAS is a study of noise. There is no underlying science there, in the sense of proposing a hypothesis and then either proving or falsifying it.


11. C on July 1, 2013 4:37 PM writes...


Can you back up your assertion with any real evidence? Same for @8. This is why I stopped commenting on internet posts years ago unless highly motivated. It's just a bunch of people talking out their ass.


12. Mad Dog on July 1, 2013 5:39 PM writes...

@ 11

I think I politely asked you to back up your assertion that GWAS had value in drug discovery...I will go as far as to say medically relevant. I can back up "traditional" pharmacology-based drug discovery with 60+ years of success, most notably the work from James Black and colleagues.

Let's take a recent example of the power of genetic testing....and the reasons why there are genetic counselors. There are terrible diseases that can be predicted with a high degree of accuracy....and people make life decisions based on them: Down syndrome and the recent HER2 diagnosis with Angelina Jolie and others. There you are playing a clear risk-reward analysis, but it is still a partial gamble (an 88% chance is not a guarantee).

Now look at Anne Wojcicki (Sergey Brin's wife)...the head of 23andMe. At one point she mentioned an analysis of her genome pointed to a large (~40 fold?) increase in her susceptibility to breast cancer (I think that was the disease). Literally three months later, a "new" analysis showed that she may have a compensatory gene that partially ablated the risk, now only to a ~2 fold risk....seriously, you call that science!!! (caveat....I may have the precise numbers wrong)

GWAS is like playing roulette in zero gravity with the light flashing on and off and the dice tunneling between dimensions.

Too bad Douglas Adams is not alive....he would love the pure uncertainty of GWAS.


13. sgcox on July 1, 2013 6:20 PM writes...

#11 C
Can I politely note that your very first post was an ad hominem attack on Derek? It has not improved much since (post #10 is me).
#12 MD
Angelina had a defective BRCA1, I think. If I am not mistaken, there is no HER2 cancer predisposition. But its overexpression indicates poor prognosis, and it is the target of Herceptin.


14. gcc on July 1, 2013 6:58 PM writes...

"Almost half of all trait-associated SNPs (TASs) identified in genome-wide association studies are located in intergenic sequence while only a small portion are in protein coding gene exons. This curious observation points to an abundance of functional elements in intergenic sequence."

About this quote from the paper, doesn't that assume that the trait-associated SNPs themselves are functional? I'm not familiar with the recent GWAS literature, but I remember hearing a few years ago that some people felt that most of the SNPs associated with diseases didn't actually directly contribute to the diseases, but were simply located near the genetic variants that actually did have an effect on disease susceptibility.

Perhaps with more SNPs having been identified in recent years, this is no longer the case, but I'd be interested to hear from anybody who knows the GWAS field better than I do. Are most recently discovered trait-associated SNPs believed to be functional or are they just surrogate markers for other nearby genetic variants?


15. ESIMS on July 1, 2013 9:18 PM writes...

Those are the numbers that you hear at the end of talks when an MD asks the question...

But think of it like this: SNP A is enriched by a factor 300 in the patients vs control group. Does this mean SNP A is deterministic for the disease? Well if you have mutation X, Y, and Z at the same time maybe, otherwise the chance is really low.

Most diseases are multifactorial and we often have no clue why a certain patient develops it and others don't. Hopefully the accumulation of more sequenced genomes will help us in the future.

About cancer: Angelina's familial BRCA1 is the one and only 90% example, this single mutation alone is able to make you develop cancer at some point (but only in the breast tissue).

Cancers are heterogeneous and accumulate mutations over time. At the moment we have identified 200-400 driver mutations, and you need something like 8-12 that are, in combination, sufficient to make you develop a certain cancer type. BTW those driver mutations are in fact exclusively(?) in genes that have assigned functions, like KRAS and p53.

So the future would be: sequence a tumor (higher sequencing power & more sample can at least partially get you around the heterogeneity), go only for the driver genes (cancers = high background mutations that play no role), have access to the data from other patients with those mutations (which therapies failed vs. which worked) = increase the chance for a successful treatment.


16. Poul-Henning Kamp on July 2, 2013 1:08 AM writes...

Just because some part is transcribed doesn't automatically imply that the resulting compound has any function at all.

The "one per cell" could easily be just noise, from a transcription-process "running the stopsigns" so to speak, it would be amazing if that never happened, given the sheer numbers.

Obviously, having random transcribed gunk lying around, even at very low concentrations, is probably going to have some kind of effect, but most of the time it might just make more work for the cleanup crew.


17. Mad Dog on July 2, 2013 10:30 AM writes...

@15 ESIMS:

Yes, I agree that there is value in looking at genetic variants and correlating that with other readouts. I originally chimed in because "GWAS is a time-tested technique and has found many replicable associations." is not accurate IMO. Or, maybe more correctly, it has yet to show any value. GWAS may not be a "failure", but I have yet to hear of any success other than the creation of Big Data.
I agree with @16 (PHK) that there is a problem with the noise in these studies. An analytical chemist once quipped that "Our ability to measure has far outpaced our ability to understand", which is my problem with GWAS. I take particular issue when this "data" could be used GATTACA style by the health industry; VERY slippery slope indeed as the landscape is constantly changing.

@11 C: Try to approach your online postings from an educational standpoint. Maybe you have a unique perspective that could further educate me and other members of this forum. Provide evidence such as case studies or an example that highlights the benefits of GWAS. Better yet, spend a sabbatical working in the clinical or biomarkers group of a pharma company. Both parties may benefit from the experience if you are as willing to learn as you are to share.


18. Frank Landis on July 2, 2013 1:48 PM writes...

I've been out of the game for almost a decade now, but is anyone looking at what those lincRNAs actually do?

Yes, it's entirely possible they do something.

Back awhile ago, I helped with a little project that demonstrated that ITS2 sequences were more-or-less species specific. As I recall from that research, ITS2, which is non-coding, did this neat folding trick that lined the two pieces of the ribosome sequence that were "coded" so that the ribosome could function. Since the key function of ITS2 was folding so that the ends matched up, it turned out that the second fold could evolve more or less at random, and that area turns out to be really useful for identifying eukaryotes (it doesn't always work, but it works most of the time).

Anyway, here's a modest proposal: all that lincRNA is scaffolding, and its job is to twist, torque, and tangle properly to do some plethora of short-term assembly jobs. Of course it's not expressed as a protein, but it's functional nonetheless.

Just a random hypothesis.


19. biochem on July 2, 2013 2:16 PM writes...

@18: F. Landis "Anyway, here's a modest proposal: all that lincRNA is scaffolding, and its job is to twist, torque, and tangle properly to do some plethora of short-term assembly jobs. Of course it's not expressed as a protein, but it's functional nonetheless."

This is what I've been thinking for a while. At first people thought that the lincRNAs would likely function by targeting or recruitment. While this may still be the case, there appears to be little solid evidence despite considerable interest. RNA secondary structure could function as de facto architecture in order to induce protein allosteric change, thereby modulating protein enzyme or binding activity.


20. Cellbio on July 2, 2013 3:29 PM writes...

As someone who spent a few years trying to commercialize value from GWAS, I reach the opposite conclusion from C regarding the value of GWAS in serving patients' needs, whether diagnostic, prognostic or in driving development of therapies.

First, a clear look at the literature reveals many associations that do not repeat, but also many that do. If one then asks about the value of the SNPs that are confirmed, there are in fact very few that prove useful, despite replicable association values, because of the very small difference in frequencies between groups. What I mean is that the frequency of an allele in one population, say the disease group, may be 43%, and in the non-disease controls, 38%. If n is large enough, the p-value becomes highly significant, but value in the clinic is limited at best. Also, in the clinic, all are patients, so case-control defined SNPs are not the best starting point anyway.
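The 43% versus 38% example can be made concrete with a rough Python sketch. It assumes equal numbers of sampled alleles in each group and uses a plain two-proportion z-test with a normal approximation; the sample sizes and function names are illustrative, not taken from any real study.

```python
import math


def two_proportion_p_value(freq_cases: float, freq_controls: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in allele frequency (normal approximation).

    Assumes the same number of sampled alleles, n_per_group, in cases and controls.
    """
    pooled = (freq_cases + freq_controls) / 2  # valid because the groups are equal in size
    standard_error = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (freq_cases - freq_controls) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


def odds_ratio(freq_cases: float, freq_controls: float) -> float:
    """Odds of carrying the allele in cases relative to controls."""
    return (freq_cases / (1 - freq_cases)) / (freq_controls / (1 - freq_controls))


# The effect size is fixed by the frequencies and does not change with sample size.
print(f"odds ratio: {odds_ratio(0.43, 0.38):.2f}")

# The p-value, by contrast, shrinks steadily as the (hypothetical) sample grows.
for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: p = {two_proportion_p_value(0.43, 0.38, n):.2e}")
```

Under these assumptions the same modest odds ratio goes from unremarkable to overwhelmingly significant purely by adding samples, which is the gap between statistical and clinical usefulness described here.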

Also, though hotly debated, the work of Mary-Claire King suggesting that GWAS results are largely cryptic heritage markers strikes me as just as plausible an explanation for non-coding SNP association with disease as invoking a functional role for those regions. Careful control of ethnicity by heritable marker analysis led to most everything disappearing in our case. Discovery of markers within tightly defined groups led to entirely non-overlapping sets of genes associated with disease phenotype. A plausible finding, but not very comforting about the ability to use common markers to identify single loci associated with disease or therapy outcomes. I think, as stated above, common disease is driven by a multi-gene interaction with environment that is very hard to clarify with GWAS.


21. Peter Ellis on July 3, 2013 4:38 AM writes...

"a transcription-process "running the stopsigns" so to speak"

Or indeed the opposite! Quite a number of these lincRNAs seem to be transcribed from upstream enhancer elements, in the same direction as the nearest downstream transcript.

One very plausible hypothesis is that these enhancers function as a PolII delivery mechanism - recruit your polymerase and let it trundle off down the tracks to the promoter of the target gene. Once you get there, terminate the (irrelevant) ncRNA and re-initiate at the protein-coding gene. If you need more transcription in a given tissue, stick in another upstream enhancer to deliver even more PolII.

This hypothesis explains the location, orientation and tissue specificity of the ncRNAs. The ncRNAs are in the main taken care of by nonsense mediated decay, we just happen to be able to now detect the slightly more abundant / long-lived ones before they get degraded.


22. Mary Kuhner on July 4, 2013 12:48 PM writes...

I'd like to see a comparison of the frequency with which family studies implicate non-coding DNA compared to the frequency with GWAS. If there are a lot of important things in the non-coding, some of them should presumably pop out in family studies as well.

Of course it's possible that being in the non-coding fraction correlates strongly with having an effect too weak for family studies. But if so, that suggests that GWAS results will on average be less useful, since loci of small effect are hard to do anything with.

I am puzzled, though, by the idea that the uselessness of finding small effects is a criticism of GWAS. If a disease is, in fact, caused by multiple interacting loci of small effect, shooting the messenger won't make that fact change. And we can at least hope that the small-effect GWAS hits provide hints as to the underlying disease mechanism--that would be useful even if the specific SNPs are totally useless for patient prediction.


23. JBosch on July 4, 2013 2:18 PM writes...

I guess this study was not done on single cell sequencing, so you will get an average over say 100-200 cells.
This year at the Biophysical Society meeting in Philadelphia there was an outstanding talk by
Chenghang Zong, Harvard University, Whole Genome Amplification and Sequencing of Single Human Cells.


24. Paul on July 5, 2013 9:14 PM writes...

More on lncRNAs and the X chromosome (in next Science):


25. matt on January 26, 2014 1:42 PM writes...

A year later, perhaps, but...

If I understand it correctly, though I admit I haven't read the references yet, they specifically picked lincRNAs whose sequence was not conserved. Doesn't that indicate they are outside the percentage of DNA that was considered evolutionarily useful? And what are the animals that show surprisingly little non-conserved DNA? Does this indicate they really are lacking some epigenetic chromatin-sorting/activation mechanisms that the "junkers" have?

Independently of how many lincRNAs to which this applies, sorting out what is going on here seems to be new and interesting territory.


