Tuesday, April 24, 2007

The Rhesus Macaque Genome - Can it help us learn about ourselves?

Just recently Science published the paper describing the latest primate genome - the rhesus macaque genome. (Check out Science's macaque website for some good (and free) articles on the subject.) Sequencing a large genome like this one is resource intensive (unlike microbial genomes, which are now easily and routinely sequenced), so why did scientists sequence yet another primate genome? In addition to the human genome we already have the chimp genome, and we also have several non-primate mammalian genomes - the mouse, rat, cow, dog, and opossum genomes. Is this a good use of our money? Why put in so much effort just to study evolution?

Evolution is worth studying in and of itself, but it is also so tightly connected with every field of biology that it's hard to avoid when you're studying anything else. We sequence these genomes because we know we can use evolutionary principles to understand the nuts and bolts of the genome. This strategy has already been used with great success in major genetic model organisms, including flies, worms, and yeast.

Most of us, I would bet, are more interested in the human genome than any other, and ultimately we sequence these primate genomes to understand our own genome. The chimp genome is helpful, but to use genome comparison effectively and learn about all those parts in our DNA, we need a more distantly related species. The rhesus macaque is a great pick, because it has been used extensively in medical research, and it is an Old World monkey - one of our closest relatives outside of the great apes.

Genome sequencing simply gives us raw sequence, such as this region from human chromosome 11:
(Sequence is read from left to right, line by line, like regular text.)

GAGGAGGCGGCGGAGGAGGGGCCGCCCGCGGCCCCCGGCTCACTCCGGCACTCCG
GGCCGCTCGGCCCCCATGCCTGCCCGACCGCGCTGCCGGAGCCCCAGGTCCGGGG
GCGGAGGGGAGCGCTGCCGCGGGGGTGGGCGGGCGGGGCGCGGGGGCCATGTGCG
AGCGCGGCAGGGAGGCGGGCGGGGCGGGCTGCAGGCGGGGTCCGACTCTGGGGCC
AGTCCGGGCCACGGTTGGGACCCAGTCGAGGGTCGGACTGGTCAGGGTTCAGGCG
GGATCCGGCGTCCGAGTCCTGGTGGGCCGGCCTGGGGCAGGATCTGGCTCTGGCT
GCGGGTCCTGACTCGGGTCAGGGTTGGGCCTCCGATCCAGCCCGCTCCGGGGCAG
GGTTCAATCCCGCATTTGCCGAAGTCCCTGGGGCTGGCCGGGGTGGAAGACGGGG
AGGGCTCTATGTCTGGGAAGGGGCTCTGAAGACCACGTGGGGGCGCTCGAAGGGG
CCTGGGGCCACCCTCCTCTCTGGGTCAAAGGTCATCGCACCGGCAGGGGAGAACT
TCCTCCTCCTTGGCTCTCCCCACTTACTTCCTGATAACCTGGTAGAGGTCTCCCG
CGGGCGGGGAGGGGGAGGCGTAGCAACTTTAGGCAACTTCCCAAAGGTGTGCGCA
GGTTGGGGGCGGGACGCGGCGCCCCGGGAGGTGGCGGCCTCTGCGACAGCGGGAG
TATAAGAGTGGACCTGCAGGCTGGTCGCGAGGAGGTGGAGCGGCGCCCGCCGTGT
GCCTGGGACCGGCATGCTGGGGCAGGAGGGCAGCCGCGTGTCAGGTGTGAAAAGC
TCTGGAGGTGTTTTCATGAGTCCGTGCCTGTGCGTGTGGATGTGGGGAGACCTAG
TGAGAGTGTGTGTGATCATGAGCCTTGACTGAGTTCGTGGATGGGGTGTGCGCTC
CAGGAGAAGTGTGTGAGCACAAGTGTGAGCAGGAGTGAGCACGGGTTTGGGAAGG
CCGGTGCAAGTGTGAAAGCCCTCAGCAGAGAGCGAGCCTGCGTGGGCTTGTGGGG
CTCCTGAGCACCCCGGTGAGTGGAGTGTGTGAACTCGGTGTGAGCACGTCCACTG
GCCTTGGGTCTGCTCTCCAATGCAGAATACCCAGATGAGGGCAGGGTCTCAGAGG
TCCCCCCAACATCTGGAGAAAACTGGGAAGTATCCTGCTCCTGGCTAGGGATTCC
AGGTGGGGTTGAAGGTTGCCTGGGGGCTACGGTTACCCTGCTCCCTGGCCTGGGT
GGGAGTAGGGGCTTTCTAAGCCTCCCCCAGGTTCCCAAGGGGGAGACCTGCTGTC
AGTTACTGGCCCTGAAGACTCTGTTTCCATGGCAACAGCTAGGAGGGGGCAGTGT
TCCTGGGCAGTCCTTCCTTGGACTCTGCCCCCCTTCTTCCCCACTTGCTGGGCTT
GGAAGCCTGGCCCTAGGCCCGAGGTTGGGCAACCCGTGTGGCAGGGTGTCTCCCA
TCCCCCATACCAGTGCTTTCCTGCGAACCTATGGGTCTCTCCGTGCAGGTGACCA
GCGCCATGTCCAGCCAGGTGGTGGGCATTGAGCCTCTCTACATCAAGGCAGAGCC
GGCCAGCCCTGACAGTCCAAAGGGTTCCTCGGAGACAGAGACCGAGCCTCCTGTG
GCCCTGGCCCCTGGTCCAGCTCCCACTCGCTGCCTCCCAGGCCACAAGGAAGAGG
AGGATGGGGAGGGGGCTGGGCCTGGCGAGCAGGGCGGTGGGAAGCTGGTGCTCAG
CTCCCTGCCCAAGCGCCTCTGCCTGGTCTGTGGGGACGTGGCCTCCGGCTACCAC
TATGGTGTGGCATCCTGTGAGGCCTGCAAAGCCTTCTTCAAGAGGACCATCCAGG
GTGAGCCCCCAGCCCACTCCCCTGTCCTTTGCCCTGCACCCTCTGGGTACACTGC
TGGGTGCAATAGGCCCCCTGATGGCTGTGGCACCGCTTGAGGCTAACAATCTGGT
GTTTCCAGTCCCTCTACCTCCCAGAGACACTCTTTCCCTGAGAAGTATGGTAAAA
GCACCGGGTGTGCTGATGCATTGCAGTGGATGTGAGTGAGTTCAGGGTACCACCT
GGGTACTCTAGGCCCAGCACCTTCTACAGTGGCTCTGAAAGAGTCCAAGGCAGCC
TCTGTCTGTTCCTAAGCTTTGTTCTTGTTTCTGGCAGCTTCTGACCTCTCCCCAG
CATAGAACATGTCCCCTTTTTGTTAATTTTCCCAAAGCAGCACCAACACAAGGCA
GATTTTAATTTTTTTTTTTTTGAGACAGAGTCTCACTCTGTTGTTCAGGCTAGAG
TGCAGTGGCACAATCTCTGCTCACTGCAACCTTTGCCCCTGGGTTCAAGAGATTC
TCCTGCCTCAGCCTCCTGAGTAGCTGAGACTGCAGGTGTGCACCACCACGCCCAG
CTAATTTTTGTATTTTTAGTAGAGACGACGTTTCACCATGTCGGCCAGGCTGGTC
TGGAATTCCTGACCACAAATGATCCACCTGCCTCGGCCTCCCAAAACAAGGCAGA
TTTTTATCAGTACTTGAGAGGGGCTACATCATAGTTTAGCACCCAACTTTAAAAA
GACTAACAGGCAAGGCCGGACACAGTTGCTCACACCTGTAATCCCAGCACTTTGG
GAGGCCAAGGTGGGCGGATCACCTGAGGTCAGGAGATCGAGACCAGCCTGGCCAG
GGTGGTGAAACCGCATCTCTACTAAAAATGCAAAAAATTAGCTGGGCATGGTGGC
TCGCGCCTGTAATCTCAGCTACTTGCTACTTGAGAGGCTGAGGCAGGAGAATTGC
TTGAACCCAGGAGGCAGAGGTTGCAGTGAGCCAAGATCACACCACTGTACTCCAG
CCTGGGTGACAGAGCGAGATTCCATCTCAAAAAAAAAAAAAAAAGGCCGGGCACT
GTGGCTCATGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCATGAGGTC
AGGAGATTGAGAACATCCTGGCTAACACGGTGAAACACTGTCTCTACTAAAAATA
CAAAAAATTAGCTGGGCATGGTGGCGGGCGCCTGTAATCCCAGCTACTTGGGAGG
CTGAGGCAGGAGAATGGCGTGAACCCAGGAGGCGGAGGTTGCAGTGAGCCAAGAT
CACGCCACTGCACTCCAGCCTGGGCGACAGAGTGAGACTCCGTCTCAAAAAAAAA
AAAAAAAAGGCTGGGCGCGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGC
CGAGACGGGCGGATCACCTGAGGTCAGGAGTTTGAGACCAGCCTGACCAATGTGA
TGAAACCCCGTCTCTACTGAAAATACAAAAATTAGCCAAGCATGGTGGCATGCGC
CTGTCATCCCACTCAAGAGGCTGAGACAGGAGAATTGCTTGAACCTGGGAGGCAG
AGGTTGCAATGAGCCCAGATCGCGCCATTGCACTCTAGCCTGCGCAACAAAAGTG
AAACTCCACCTCAAAAAACAAAAACAAAAACAAAAACAAAAAAACCCAAAAACGC
TGGGCTTGGTGGCTCATGGCCTGTAATCCCAGCACTTTGGGAGGCTGAGGCAGAC
GGATCACGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATGGTAAAACCCCGTC
TCTACTAAAAATACAAAAATTAGCCGGGCGTGGTGGTGAGTGCCTGTAATCCCAC
TACTTGGGAGGCTGAGGCAGGAGAATTGCTTGAACCCGGGAGGCAGAGGTTGCAG
TGAGCTGAGATCATGCCACAGCACTCTAGTCTGGGCAACAGAATGAGACACTCTC
ATCTCAAAAAAAAAAAAAAAAGGACTTACAGGCATGTCTGCTCTTAAAAGTCACT
AATTTTTTTCTCACTCAGGAAAGCTTATCAGAATTTGGGGGAATGAGCAAGATGC
TGACATTAAGCATTGCCTGGGAAGGGCCTATTATTTCCGTTATTTCTGCTTTTAT
GTAACCATTGGTTACTTTGGGGGCTATAACACGTATAATTAAAAAAAAAAAAAAA
AAGGCCAAGTGTGGTGGCTCACACCTGTAATCTCAGCACTTTCGGAGGCTAAGAT
GGGAGGATCACAAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCC
TGTCTGTACTAGAAATACAAAAATTAGCCAGGTGTCGTGGTGGGTGCCTGTAGTC
CCAGCTACTCAGGAGGCTGAGGCAGGAGAATTGCTGGAACCCAGGAGGCAGAGGT
TGGAGTTAGCCAAGATCGTGCCACTGCACTCCCAGCCTGGGTGACAGAGTGAGAG
TTCGTATCAAAAAAAAAAAAAAAAAAAAAATCTTGAGTGCTTACCTTGTGCTAGG
CACTGTATTCTTTTATGATCTCAGTTAGTCCCCACAGCAACCCTATAAGGTGTCA
GTACTGTTATAACTGAAACTAAGAGAGGCATTTGAAACTTTGTTGAAGTCTCACA
ACTAGGAAATGGCAGAACCAAGATTTGAACTTGGGTCAGTATAGGTCCAGAGCTG
AGCTCTTCAATGTTAGACTGCTTCCTCTGCTTATTACTAATAACACCGAACTTTG
GACAGACGCTGAATGACTGATTGTGACATTCCAGCACGTTTTTTTTTTTTTTTTT
GAGACAGTCTCGTGTGGTCGCCCAGGCTGGAGTGCAGTGGCACGATCTCGGCTCA
CTGCAAGCTCCGCCTCCCGGGTTCACACCATTCTCCTGCCTCAGCCTCCTGAGTA
GCTGGGACTACAGGTGCCCGCCACCACGCCTGGCTAATTTTTTGTACTTTTAGTA
GAGACGGGGTTTCAGCGTGTTAGCCAAGATGGTCTTGATTTCCTGACCTCGAGAT
CCACCTGCCTTGGACTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACTGCTCCT
GGCCAGGTTTTTTTTTTTTTTTTTTTTTTTTTTGAGATGGAGTTTTGCTCTTGTT
GTCCAGGCTGGAGTGCAACGGCCTGCAGTCGTGGTTCACTGCAACCTCTGCCTCC
CGGGTTCAAGCCATTCACCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGC
CTGCCACCATGCCCGGCTAATTTTTGTGTTTTTAGTAGGGATGGGGTTTCACCAT
GTTGGCCAGGCTGGCCTCAAACTCCTGACCTCAGGCGATCTGCCCTCCTCGGGCT
TCCAAAGTGCTGGGATTATAGGTGTGAGCCACTGCACCCCGCCAATCCAGCAAGT
TTTAACTTGGCCAAAATCCACCAATCTTAAACTTTGTGCACCCTTCCCACTCTGA
AGAACAGTGAGCCAGCCGGCCAGGGTGCGGGTATCTCCTACCTACCCTGGGGCCC
CTCACTGTATGTTGACTATTGACAAATATTTATTGTGTGCTGGCTGTGAATAGGA
CTTGTATATTGAGCACTTAGGTGTCATGAACCATGCTGGATGTTTTGACCATATT
ATCCCCTTTAATTCTCACGACCCAACTCTGTGGGGCACTTTTACAGCTGGGAAAC
TGAGGGTTCAAGGGGTTAGGTATGGGACTTGCCCAAGGTCATAAAGGTATGTGGT
AGCCAGAGTCCCTGTTCGGCACAGACCTGTTCTTTGCTGTCCTGGCCAGTGTTCC
AGGCCTTGGGGACATAGCTGGGGCTGAAGCAGGGCTGTTTCTGCCCTCAGGCAGT
TTACATCCTGGCAGAGGGGAGAGCTGGGCAACAGTGAGTTGCACAGACTTGTCTT
ATTACCGCTGTGGTATGTGCAGGAAGGGGAGGTGCTGGTTCTGAGGCTCCAGAGG
GCTTGTCTTTTTTTTTTTTTTTTTGAGACGGAGTCTCGCTTTGTTGCCCAGGCTA
GAGTCCAGTGGCGCGATCTCGGCTCAGTGCAAGCTCCGCCTCCCGGGTTCAAGCG
ATTCTCCTGCCTCAGCCTCCCCAATAGCTGGGATTACAGGCGCATGGCACCACGC
ACGGCTAATTTTGGTATTTTTAGTAGAGACTGGGTTTCACCATGTTAGCCAGGAT
GGTCTCGATCTCCTGACCTCGTGATCCACCCGCCTCGGCCTCCCAAAGTGCTGGG
ATTACGCTCCCGGCCTCTTTTTTTTTTTAGACAGAGTCTCACTCTGTTGCCAGGC
TATAGTACAGTGGCACGATCTCAGCTTACTGCAACCTCCGCCTCCCAGGTTCAAG
CGATTGTTCTCCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCACACGCCCAGCT
AATTTTTGTATTTTTAGTAGAGACAGGGTTTCACCGTGTTGGTCAGGCTGGTCTC
AAACTCCTCACCTCGTGATCTGCCTGCCTCGGCCTCCCAAAGTGCTGGGATTATA
GGCGTGAGCCACTGCGCCTGGCCTTTTTTTTTTTTTGGTACAGAGTTTCGCTCTG
GTTGCCCAGGCTGGAGTGCAATGGCACGATCTTGGCTCACTGCAGCCTCTGCCTC
CCGGGTTCAAGCGATTCTCCTGCCTCAGCCTCCGGAGCAGCTGGGATTACAGACA
TGCACCACCATGTCCGGCTAATTTTTTTTTTTCGAGATGGAGTCTCACTGTGTCA
CCCAGGCTGGAGTGCAGTGGCACAATCTCGGCTCACTGCAACCTCTGCCTCCCGG
GTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGACTACAGGTGCCTG
CCACCACACCCAGCTAATTTTTGTACTTTTAGTAGAGACGGGGTTTTACCATGTT
GGCCAGGCTGGTCTTGAACTTCTGACCTCAGGTGATCCACCCACCTCGGTCTCCC
AAAGTGCTGGGATTACAGGCGTGAGCCACCGTGCCCGGCCGTGGTGTCTTGAGCT
GAGTGCAGAAGCGCAAATAGGGGGTAGGAGAAAATGCACCGCGAGGAGAAATGTG
CTGCGGGCCTGCTGTCTAGCTGTGTCATTTGGTCGTTGCGGGGCCCTGTGAGGCC
GGGAGGGCTGCCAGCACCCACCATGTGCCAGGCCTCGTTGCTAGTGCTGGGGCCA
GTTCCTGCCCCGGTGGAGCTGCCACTGAAGGGGGAGGCGTAATAAACAAGATAGG
TGAGTGCATATGCAGCGTGGTCTGTTGTGCTGAGGGCTGAAGAGAAACCAGAAGC
AGGGCTCAGAGGCCAGGAGGACTCTGCAAAGGGATTTGGCATTATCACAGGGTGG
CCAGGGAAGATCTTCAAGGTGACAGTGAGCAGAGGGAGGTGAGGGAGCCTGTGTG
GACTTCAGGACTAGAGCTCCAGGCAGGGCCTGTTTGAGGAACATGGAGGAGGCGA
GAGCAAGGAGTAGAGGTCAAAAGGAGGCAAGAAGCAGGGGCGTAGGCCTAGGAGG
ACATAGGTTCGCTTTGGCTTGGACTCAGAGAAGGGAAATCCCCAGAGGGTTTTGA
GAAGAGGAGGTACAGGATGTAATGGAGGCTTAATAGGACCCTCTTGGCTGCTGAG
TCGAGAACAGACTGGAGCAAGCAGGGACAGCCAAGCGAGGGGCGAGGTGACAGTG
ACTATCAGGTCAAGGGTGGAAGTAGTTGCCAGGGGCAGGAGGCGGATTCTGGACC
TTGGAGGAGGTAAAGCCCACCAGAATGTGTCGGTGGCTTGGATGTGGGGTGTGAG
AGGAACCAGAGATTCTGCCTAGGTTTCTTCTTGGGCAAGTGAACACGTGGAGTCC
ACGTAGGCTGTGTTCGGTCCGAGATGCCTTCTAGACATGCAGGATGTCAAGGAGG
CAGCTGGAGAGATGGGTCTGGAGCTCACAGCAAGTCCAGGCTAGAGGTAGAAACG
TGAGAGCCCCACGGCTGGGGAAGATTGCCATGGGATTGGAGATGAGCTCCAAGGA
CAGCCCTGGCAGTCTGGATGGAAGAGCTTGGGAAGATGCTCAGAAACCACAAAGT
GGCTGGTGCGGTGGGAGGAAAACCAGAGTGTATGCTGTCCTAGAAGCAAAAGAAG
AAAGTGTTTCAGTGTTTCTAGGAGCAGGAAGTGATCAACAGCCTTAGATCCTCCT
TTTAGGCCAAGTAACATGAGGACTAAGAATTGACCACTGGATTTAGCAATGCAGA
GGTCCTTGTGGCCCTTGATGTCGGCAGATGAGGGCAGTGTGGTCCAGAGATGAGG
CTTGGGGCTGAGATGCAGCCCCGCTGCCTGGTCCAGCTCCTCCCTCATCCAGGCA
GGGCTCCCCCGCCCAGCAGCCACTCCCCTCCCTGCCTGCTCATGGCCCCCTGCTC
TCCCTTTCCTCCCCATACCCCCAGACCTGTGCTTGCCCGGGGAGAGTCAGGGCTC
TCCTGTCAGCTGGGTCCCCTCCCAGCCCCGGGAGGCCGCCACTGGAGCCCTGCCT
CTTCCTGGCAGGGAGCATCGAGTACAGCTGTCCGGCCTCCAACGAGTGTGAGATC
ACCAAGCGGAGACGCAAGGCCTGCCAGGCCTGCCGCTTCACCAAGTGCCTGCGGG
TGGGCATGCTCAAGGAGGGTGAGCGCTGGGCAGGGGCTGGGCGAGGGCTGGGGGA
GTCGGGGACCCGGGCCAGGTGGGGGTGAGGCCTGGGAGTTCTGGTGAGTGGACTC
GGG

I purposely included a long chunk (though it's really a very tiny piece of chromosome 11) just to convey what this vast sea of unannotated sequence looks like. About all you could do with this is use the genetic code to see if there is something that looks like a protein-coding segment in there. But actual protein-coding regions are very sparse, and they are broken into fragments called exons, which are spliced together before the final protein is made.
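To make "use the genetic code" concrete, here is a minimal sketch in Python of that naive scan: translate the DNA in each of the three forward reading frames and report stretches that start with a methionine (ATG) and run for a while without hitting a stop codon. The function names and the `min_codons` cutoff are my own choices for illustration, not part of any real gene-finding tool, and the sketch ignores the reverse strand and (importantly) exon splicing, which is exactly why this approach falls short on human DNA.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def open_reading_frames(dna, min_codons=30):
    """Return (frame, dna_start, protein) for M-initiated stretches without
    a stop codon, on the forward strand only (a simplifying assumption)."""
    orfs = []
    for frame in range(3):
        protein = translate(dna[frame:])
        offset = 0  # position in `protein` where the current segment begins
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start >= min_codons:
                dna_start = frame + 3 * (offset + start)
                orfs.append((frame, dna_start, segment[start:]))
            offset += len(segment) + 1  # +1 skips the stop codon itself
    return orfs


if __name__ == "__main__":
    # Toy sequence (made up): two junk bases, then ATG + five alanines + stop.
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    print(open_reading_frames(toy, min_codons=5))
```

On a real chunk like the one above, a scan like this mostly turns up short, meaningless stretches, which is the point: raw sequence plus the genetic code is not enough.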

What we really want to know is where the gene is (in this case, an estrogen receptor gene) and where its controlling elements are. You won't be able to see the details below, but here is the big picture:




And here we're looking for only a few elements - I haven't included promoter regions, enhancers, non-coding RNAs, transposable elements... To find these elements requires three things:

- computer tools to build models of these elements and search the sequence
- sequence from related species for comparison
- experiments to test your computer predictions.

We have all three of these things for yeast, flies, and worms, but in the case of humans, we have sorely needed more sequence from an animal like the rhesus monkey.


I'll finish up with an example from my own work in yeast. Certain proteins, which are master regulators of cell division, modify target proteins at the sequence 'TP..any letter..R or K'. (Now we're talking about protein sequence, so we don't just have A's, T's, G's, and C's.) To understand how these master regulators carry out their role, we would like to know exactly which proteins are their targets. How do we find those targets? Easy - just look for any protein that has 'TP..any letter..R or K' in it, and you have a candidate protein that you can test in the lab!
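The "easy" search described above can be sketched in a few lines of Python. I'm writing 'TP..any letter..R or K' as the regular expression `TP.[RK]`; the function name and the example protein are made up for illustration.

```python
import re

# 'TP..any letter..R or K' as a regular expression. The lookahead lets us
# report overlapping occurrences as well.
MOTIF = re.compile(r"(?=(TP.[RK]))")


def find_motif_sites(protein):
    """Return (position, matched_text) for every TP.[RK] site in a protein."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(protein.upper())]


if __name__ == "__main__":
    # Hypothetical protein sequence, not a real yeast protein.
    print(find_motif_sites("MSTPAKLLTPQRG"))
```

Run over every protein in a genome, this returns a candidate list, and any protein with at least one hit is a potential target.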

Well, it turns out it's not so easy - many proteins have this 'TP..any letter..R or K' just by chance - too many to test in the lab. So we want to choose the most likely targets - those whose sequence has been conserved throughout evolution. You can line up the sequences from different species, and easily see the 'TP..any letter..R or K' which has been conserved over 100 million years of evolution:



The sequence on the top line is from baker's yeast, and the sequence on the bottom is from a yeast that shared an ancestor with baker's yeast 100 million years back.
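Here is a rough Python sketch of that conservation filter, assuming you already have two protein sequences aligned to the same length (with '-' for gaps). It keeps only the sites where `TP.[RK]` appears at the same alignment columns in both species. The function name and the toy sequences are my own inventions, and a real analysis would use more than two species and handle gaps inside the motif, which this sketch does not.

```python
import re

# Same pattern as in the text: T, P, any letter, then R or K.
MOTIF = re.compile(r"(?=(TP.[RK]))")


def conserved_sites(aligned_ref, aligned_other):
    """Return alignment columns where the motif starts in BOTH sequences."""
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have the same length")
    ref_cols = {m.start() for m in MOTIF.finditer(aligned_ref.upper())}
    other_cols = {m.start() for m in MOTIF.finditer(aligned_other.upper())}
    return sorted(ref_cols & other_cols)


if __name__ == "__main__":
    # Toy alignment (made up): the first site is conserved, the second is not.
    top = "MSTPAKLL-TPQRG"
    bottom = "MATPSRLLGTAQRG"
    print(conserved_sites(top, bottom))
```

Sites that survive this filter are the ones worth taking to the bench first.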

Comparative genomics really works. It has helped us learn a tremendous amount about flies, worms, and yeast. With the macaque genome, we'll hopefully have the same success learning about our own genome.
