What has the ENOCODE project done, and how do their results change our understanding of the human genome? In the last post I put this project into perspective by briefly outlining some past concepts of the gene and highlighting some of the ENCODE findings. Now it's time to take a closer look at the results of the ENCODE project and their significance for our understanding of the human genome. ENCODE's genome snapshot is unquestionably fascinating, and it suggests that some features of genome regulation that were previously viewed as exceptions to the norm are really quite common. But are these results revolutionary? Do they overturn any long-cherished notions about genes that scientists have heavily relied on in their understanding of gene regulation, as some have suggested? And do they support intelligent design? I don't think so.
What ENCODE Did
In one sense, the ENCODE project can be thought of as the third big Human Genome Project - the first project being the actual genome sequencing, and the second being the HapMap Project to extensively study genome variation in different human populations. The ENCODE project is an effort to find and study, on an encyclopedic scale, all of the functional elements in the human genome.
For the first phase of this project, the ENCODE researchers examined a small but reasonably representative chunk of the human genome (roughly 1%, or 30 million DNA bases) by running that chunk through a battery of experimental tests and computational analyses. Most of the experimental techniques and results are unfortunately beyond the scope of this little summary. This first round of the ENOCDE project produced a big paper in Nature, and the journal Genome Research has devoted its entire June issue to papers from the ENCODE project. I'm going to winnow down this mass of material to two of the most interesting topics: transcription and evolution.
Transcription (if you don't know what transcription is, look here):
The researchers attempted to identify regions of DNA that were transcribed. Why? Because our presumption has generally been that most (note the qualifier!) transcripts contain some functional material, such as protein-coding genes or non-coding RNAs that have some functional role (such as miRNAs, snoRNAs, rRNAs, etc.). Therefore by looking for transcribed regions, we can find new functional portions of the genome.
The transcribed regions were identified using tiling arrays, which are DNA-chips, or microarrays, that cover the entire genome and thus can detect transcription from any place in the genome. This is in contrast to more traditional microarrays that only detect the transcription of known genes. Thus by using tiling arrays and a handful of other complementary techniques, the ENOCDE researchers found that a large fraction of the genome region in the study was transcribed, including many places that have no recognizable genes. They estimate that up to 93% of the genome is transcribed, although the evidence for much of this is indirect and other explanations of the experimental results are possible. The actual transcribed fraction may be substantially lower, although it is still likely to be large.
The most interesting finding of these transcription studies is that a lot of strange stuff is ending up in these RNA transcripts. We have long known that different protein-coding regions (exons) from a single gene can be spliced together in various combinations to create many different proteins. The ENCODE researchers confirmed this (the protein-coding genes they studied produce on average 5.4 differently spliced forms), but they also found that chunks of other sequence end up in the transcripts, such as coding and non-coding portions of neighboring genes. Why this is happening is not yet clear, although part of the explanation is surely that the transcription and splicing machinery are more noisy than we previously (and naively) appreciated.
Another major part of the ENOCODE project is to find out just where transcription starts. Transcription start sites (TSSs) are important, because key regulatory events take place there. Regulatory sequences in the DNA, together with regulatory proteins, act at TSSs to control the protein machinery that carries out transcription; this control is critical for deciding which genes in the cell are 'on' or 'off'.
The ENCODE researchers found many new TSSs, sometimes very far away from known genes. Interestingly, the TSSs far away from known genes had different characteristics from those close to known genes, suggesting two distinct functional roles. One possible role for these distant TSSs is to control the higher-order structure (i.e., chromatin structure) of big regions of the genome, and thus to some degree regulating entire sets of genes. This work lays a good foundation for studying these control systems.
The ENCODE researchers searched for regions of the human genome that have changed little throughout mammalian evolutionary history; these are the regions that have been constrained by natural selection. They compared portions of the human genome with the genomes of 14 other mammalian species, and found that 5% of the genome is under evolutionary constraint, a result that agrees with earlier studies.
The immediate question then is, how much of the 5% consists of known functional elements? The ENCODE researchers reported the following breakdown:
Of the 5% of the genome that is evolutionarily constrained:
- 40% consists of protein-coding genes
- 20% consists of known, functional, non-coding elements
- 40% consists of sequence with no known function
The sequence with no known function is not too surprising. Functional DNA elements other than protein-coding genes are difficult to find, and in spite of many recent studies we know we're missing a lot. These results tell us roughly how much more functional, non-coding sequence we need to find, and where it is probably located.
The ENCODE researchers also looked at evolutionary conservation from another angle: how much of known, functional DNA falls into conserved regions? Protein-coding genes and their immediate flanking regions are generally well-conserved, while known, non-coding functional elements are less conserved. Again, this is nothing too surprising; non-coding elements tend to be very short and have what is called 'low information content', and they are more easily created and destroyed by by random mutations.
Many potentially functional elements, picked up in the experimental data analyzed by the ENOCODE groups, are not evolutionarily constrained - about 50%, when these elements are compared across all mammalian genomes in the study. This means that there are regions of the genome that are bound by regulatory proteins or that are transcribed, but which have not been constrained by natural selection.
Intelligently Designed Transcription?
I need to pause here and answer the obvious question here that those of you who aren't molecular biologists are probably asking: So does this mean that evolution can't explain much of the functional parts of the genome? Intelligent design advocates are already on the web, misreading the ENCODE work and claiming that it somehow supports the fuzzy claims of intelligent design. My advice: don't believe what you hear about this from people who only have the vaguest understanding of how ENCODE's experiments and analyses work (and that includes biochemist Michael Behe).
The ENCODE results do not cast doubt on evolution. Here are some of the reasons why:
1. Just because something is transcribed or bound by a regulatory protein does not mean that it is actually functional. The machinery of the cell does not literally read the DNA sequence like you and I do - it reads DNA chemically, based on thermodynamics. As I mentioned before, DNA regulatory elements are short, and thus are likely to occur just by chance in the genome. An 8-base element is expected to show up just by chance every 65,000 bases, and would occur randomly over 45,000 times in a 3 billion base pair genome. Nature does work with such small elements, but their random occurrence is hard to control. In a genome as large and complex as ours, we should expect that there is a significant amount of random, insignificant protein binding and transcription. Incidentally, such random biochemical events probably make it easier for currently non-functional events to be occasionally recruited for some novel function. We already know from earlier studies that this kind of thing does happen.
2. To say that something is truly functional requires a higher standard of evidence than the ENCODE research provides. The ENCODE researchers did a fine job detecting transcription and regulatory protein binding with state-of-the-art experimental and computational techniques, but confirming a functional role for these elements will require more experiments aimed at addressing that issue.
3. Some of the functional elements that don't appear to be conserved really are conserved. When you're comparing a small functional element in a stretch of DNA between say, humans and mice, it is often difficult to find the corresponding region in each species. The mice and humans may have the same functional element, but in slightly different places. Thus conserved elements can be missed. The ENOCODE researchers note this, and people like myself who study these small elements know from experience that this happens frequently.
4. Despite what you may read, there is still a lot of junk DNA. The ENOCDE project does not "sound the death-knell for junk DNA." Our genomes are filled with fossils of genetic parasites, inactive genes, and other low-complexity, very repetitive sequence, and it's extremely clear that most of this stuff no functional role. Much of this sequence may be transcribed, but remember that the ENCODE evidence for most of this transcription is indirect - their direct measurements only detected transcripts for ~14% of the regions they studied. Even if much of it is transcribed, this mainly suggests that it is not worth expending energy to actively repress this transcription, since there are so many other controls in place to deal with unwanted transcripts in the cell.
Enlightening but not revolutionary
Moving on from intelligent design, some people, around the web and in a few journals, are making the ENCODE results out to be more revolutionary than they really are. For example, writing in a Nature piece stuffed with exaggerated claims about what our "preconceptions" supposedly are (subscription required), John Greally states that "Now, on page 799 of this issue, the ENCODE Project Consortium shows through the analysis of 1% of the human genome that the humble, unpretentious non-gene sequences have essential regulatory roles," and "the researchers of the ENCODE consortium found that non-gene sequences have essential regulatory functions, and thus cannot be ignored."
Every biologist I know could have told you that "non-gene sequences have essential regulatory roles," years ago, before ENCODE. Larry Moran, over at Sandwalk says that he hasn't "had a 'protein-centric' view of a gene since I learned about tRNA and ribosomal RNA genes as an undergraduate in 1967." Where has Greally been all this time? I'm not sure why he is so surprised.
Also, as I mentioned above, not all (or maybe not even most) of the transcribed, intergenic sequences found by ENCODE are believed to have "essential regulatory roles." Non-coding DNA regulatory elements have been the subject of intense study by many groups for many years now. To claim that we have not paid enough attention to them is wrong. None of the types of transcripts discovered by ENOCDE are really novel; we've seen examples in earlier studies of they found. What is significant about the ENCODE results is the extent of this unusual transcription; what were once thought to be exceptions are now seen to be much more common.
I'm happy to see the ENCODE results; many of us will use their results in our own research, and projects like this certainly help to make the human genome much less of a black box. But they haven't shattered any paradigms that weren't already on their way out, or revolutionized the field of genomics.