Sunday, June 17, 2007

Time to Rethink the Gene?

After the tremendous discoveries in basic biology of the last 100 years, you might think that we would understand by now what a gene is. But the big news in genome biology this week is the publication of the results of the ENCODE project, a large scale experimental (as opposed to purely computational) survey of the human genome. The leaders of the ENCODE project suggest that we need to, yet again, rethink just what exactly a gene is.

I plan to cover this subject in two posts. Today I'll go over a very brief history of the gene and the basics of what the ENCODE project is doing. In a subsequent post, I'll dive into the ENCODE results, and tell you why I think the results are interesting, but not necessarily revolutionary.

A Brief History of the Gene

Mark Gerstein and his colleagues have written an interesting perspective piece on how the ENCODE results fit into our historical understanding of the gene. To put the ENCODE results into context, here is a brief history (with some big gaps - go read Gerstein's paper, or check out Evelyn Fox Keller's book The Century of the Gene; and if you don't know the basics of the Central Dogma, check out my summary):

Something responsible for heritable traits: Beginning with Mendel (who did not use the word "gene"), geneticists at first thought of genes as something within an organism that makes fruit fly eyes red, or peas wrinkled, that is passed on to offspring. The key idea is that heritable traits were passed on as units of something, although of course no one knew what. Early in the 20th Century, some geneticists began to get the idea that genes were arrayed in a linear fashion, and thus were placed at various distances from each other.

Something that makes an enzyme: George Beadle and Edward Tatum, performing genetic studies on bread mold, worked out the idea that a gene somehow was responsible for making an enzyme. Their concept is sometimes referred to as the "one gene one enzyme" idea.

An open reading frame (ORF): After the genetic code was worked out, a gene was recognized as a stretch of DNA that coded for protein, starting with the DNA bases ATG (which means 'start' in the genetic code) and ending with the sequence TAG, TAA or TGA (meaning, naturally, 'stop'). This concept is useful because you can look at a big chunk of DNA sequence and see where all of the protein-coding regions are. Also included in this concept of a gene is the idea that DNA elements outside of the coding region regulate the transcription of the gene, especially the region immediately before the starting ATG.
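To make the "look at a big chunk of DNA and see where the coding regions are" idea concrete, here is a minimal Python sketch of an ORF scan. It is a toy under stated assumptions, not a real gene finder: it reads only the given strand (no reverse complement), knows nothing about introns, and the names find_orfs and min_codons are made up for this illustration.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop stretch.

    Hypothetical helper: scans the three reading frames of the given
    strand only. `min_codons` counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                # Walk codon by codon until the first in-frame stop.
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, dna[i:j + 3]))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs

print(find_orfs("CCATGAAATTTTAGGG"))
```

On that toy sequence the scan reports one ORF, ATGAAATTTTAG, starting at position 2 (counting from 0). Real genomes need much more than this, which is part of why the ORF picture of a gene turned out to be too simple.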

Genes in pieces: In a twist on the open reading frame idea, biologists discovered that protein-coding chunks of genes (called exons) are interspersed with long non-coding chunks, called introns. Before the final protein can be produced, the exons have to be spliced together. In mammals, exons tend to be fairly short, while introns are extremely long, so a gene can be spread out over long stretches of DNA. An extra twist is that exons can get spliced together in a variety of different combinations, so that one gene, consisting of multiple exons, can produce many different proteins. In addition, we now know that the non-coding regulatory elements are dispersed much more widely than previously appreciated.
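To get a feel for how quickly the splicing combinations add up, here is a small Python sketch. It assumes a deliberately simplified model that is my own illustration, not anything from the ENCODE papers: the first and last exons are always kept, exon order never changes, and each internal exon is either included or skipped. The function name splice_variants and the exon labels are hypothetical.

```python
from itertools import combinations

def splice_variants(exons):
    """List the exon combinations under a simple exon-skipping model.

    Assumes at least two exons; the first and last are always kept and
    each internal exon is independently included or skipped.
    """
    first, *middle, last = exons
    variants = []
    for k in range(len(middle) + 1):
        # combinations() preserves the original exon order.
        for kept in combinations(middle, k):
            variants.append((first, *kept, last))
    return variants

for variant in splice_variants(["E1", "E2", "E3", "E4"]):
    print("-".join(variant))
```

Under this model a gene with n exons has 2^(n-2) possible transcripts, so four exons give four variants and ten exons give 256. Real cells use only a subset of the possibilities, and real splicing has more options than simple exon skipping.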

No protein needed: Not all genes code for proteins. MicroRNAs are genes which are transcribed and are flanked by regulatory elements just like ORFs, but they don't code for protein. They seem to be involved in regulating the expression of other genes, and several hundred microRNA genes have been reliably confirmed in the human genome.

The ENCODE Project

A major goal of the ENCODE project is to identify all of the functional elements in the human genome. If one includes all of the known ORFs, regulatory elements, and microRNAs, they make up a few percent of the genome. The remaining DNA unquestionably includes a lot of junk, such as LINEs and SINEs and other DNA parasites that exist simply because they are able to perpetuate themselves. On rare occasions in evolutionary history, some of these parasites get recruited to perform a beneficial function. But most of the parasites are inactive, mere molecular fossils. Other molecular fossils include once-functional genes that have been irreparably scarred by mutation.

But we also know that there is more functional material in the genome; for example, about 5% of the genome shows evidence of being under natural selection, and this 5% covers more than just the functional elements we know about. So far, our best attempts to find functional elements have been based on computer searches to find DNA that has been conserved through evolution, and that resembles known functional elements. But the ENCODE research groups have now performed extensive experimental tests on 1% of the genome. 1% may not sound like a lot, but it is enough to give a good idea of what we're going to learn when results for more of the genome come out.

I'll go into more detail in my next post, but there are a few highlights that the ENCODE researchers have emphasized:

- Much more of the genome is transcribed than we previously knew about, although a lot of this may be unregulated, non-functional transcription. Many apparently functional transcripts are extremely long and transcripts of one gene frequently contain sequence that overlaps with another gene.

- Regulatory elements are frequently found both upstream and downstream of genes on the DNA strand; previously most (but not all) regulatory elements were thought to be upstream.

- There is more extensive gene splicing than we once thought - different exons are mixed up in previously unrecognized combinations.

- About 5% of the genome is under the constraint of natural selection, and more than half of this consists of non-protein-coding elements.

What is the significance of all this? I'm not inclined to view it as revolutionary; it seems like much of this confirms many things we previously suspected about the genome, except perhaps that features we once thought were unusual are now known to be much more prevalent.

So this is the ENCODE project in context; tune in for the next post, in which I'll delve into the details some more and offer a much more opinionated outlook.


Dave Bridges said...

You'll probably have more on this in your next post, but the thing that amazes me is how much is transcribed, but is not under evolutionary constraint. And how much does it vary from person to person (since it varies so much from species to species)?

Mike said...

The pervasive transcription is intriguing, and raises interesting questions about the regulatory strategy that our cells are using. Is random, excessive transcription just not worth repressing when tight post-transcriptional regulation is in place?

And what significance does this have for evolutionary plasticity? These results seem to mean that it's very easy to randomly gain a transcriptional start site.

In my next post I plan on indulging in some most likely incorrect speculation, but these things are fun to think about.