Sunday, November 12, 2006

The problem with computational biology papers

OK, my title is too general - it should be, "The problem with some computational biology papers that deal with certain research questions." There is a type of trendy science that frequently crops up in many journals (including good ones like Nature and Science). It basically goes like this (for a prime example, look here):

1. A computational biology lab sees one or more genomic-scale datasets that they can do some calculations on (usually microarray data).

2. The computational biologists come up with some algorithm that's supposedly better than what's out there, and they crunch the numbers on the genomic data. This results in some predictions of novel regulatory interactions - for example, they predict that certain transcription factors regulate certain genes involved in cell division. At this point we have no idea whether their predictions are right, or even persuasive enough to be worth testing. But it's a start.

3. The computational biologists "validate" their results by using (notoriously incomplete) database annotations about the genes in their predictions, or by a shallow, cursory scan of the experimental literature (which the authors are usually not that familiar with). They then state something like "75% of our predicted transcription factor-gene interactions have some basis in the literature." Up to this point things are fine (they have made predictions, and given us some reason to believe that the predictions have a chance of being right), but then they usually go on and say something like this: "Therefore, we have demonstrated that our algorithm has the ability to find new regulatory interactions..." They have demonstrated no such thing. They have made predictions, but haven't bothered to test them; instead, they do a crappy literature survey (usually with significant omissions).

The result is that you get different groups coming up with all sorts of new analyses of the same genomic data (in my field, cell cycle gene expression and genome-wide transcription factor binding data are big ones), but never really making any serious progress towards improving our understanding of the biological process in question. The worst part is that, over time, the researchers doing this kind of work start talking as if we are making progress in our understanding, even though we haven't really tested that understanding. You start getting an echo chamber resonating with these guys who are citing each other for validation more than they are citing the people who actually study the relevant genes in the lab.

This means that the experimentalists ignore the echo chamber, and then computational biology becomes irrelevant to experimental biology - which is a sad thing. There are so many 'validated' predictions out there that the experimentalists don't really know where to start or where the good predictions are. And the computational researchers don't care enough to really work with someone who will actually go test things in the lab, even though they would get more notice from the experimentalists if they did.

The problem is bad enough that one journal, Nucleic Acids Research, changed its policy on computational papers:

"Computational biology
Manuscripts will be considered only if they describe new algorithms that are a substantial improvement over current applications and have direct biological relevance. The performance of such algorithms must be compared with current methods and, unless special circumstances prevail, predictions must be experimentally verified. The sensitivity and selectivity of predictions must be indicated. Small improvements or modifications of existing algorithms will not be considered. Manuscripts must be written so as to be understandable to biologists. The extensive use of equations should be avoided in the main text and any heavy mathematics should be presented as supplementary material. All source code must be freely available upon request."

This is a move in the right direction. But until more journals adopt this stance, beware of researchers who claim to have calculated the gene regulatory network for this or that process, or have identified 'modules' of interacting proteins that perform a function in the cell. If these claims, usually based on noisy, less than ideal genomic data, haven't been tested with serious experiments, they remain unproven hypotheses.
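To make the "sensitivity and selectivity" requirement in the policy quoted above a bit more concrete, here is a minimal Python sketch of how predictions might be scored against interactions that were actually tested at the bench. Everything in it is hypothetical: the function name, the TF and gene labels, and the data are invented for illustration. It also assumes "selectivity" means specificity (precision is reported alongside), since the policy does not define the term.

```python
def score_predictions(predicted, confirmed, refuted):
    """Score predicted TF-gene pairs against experimentally tested pairs only.

    predicted: set of (tf, gene) pairs the algorithm calls as interactions
    confirmed: set of pairs shown to interact in the lab
    refuted:   set of pairs tested in the lab and found NOT to interact
    (confirmed and refuted are assumed to be disjoint)
    """
    tested = confirmed | refuted
    tp = len(predicted & confirmed)   # predicted and confirmed
    fp = len(predicted & refuted)     # predicted but refuted
    fn = len(confirmed - predicted)   # real interactions the algorithm missed
    tn = len(refuted - predicted)     # non-interactions correctly not predicted

    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "precision": tp / (tp + fp) if tp + fp else None,
        # Predictions nobody has tested say nothing either way.
        "untested_predictions": len(predicted - tested),
    }


# Hypothetical example data (not from any real study).
predicted = {("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneC"), ("TF3", "geneD")}
confirmed = {("TF1", "geneA"), ("TF2", "geneC"), ("TF2", "geneE")}
refuted = {("TF1", "geneB"), ("TF3", "geneF")}

result = score_predictions(predicted, confirmed, refuted)
print(result)
```

Note that the untested predictions are counted separately rather than folded into any of the rates: until someone does the experiment, they are neither hits nor misses.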

2 comments:

Anonymous said...

I agree with most of what you said. For my take on the problem, see my blog Sandwalk. The real problem arises when we have to award Ph.D.'s in Biochemistry to students who are graduating from a bioinformatics lab.

Unknown said...

Someone who gets a degree in biochemistry definitely should know how to do biochemistry in the lab. I think comp. bio students who are in a biochemistry dept. sell themselves short if they don't use that opportunity to gain some wet lab skills as part of their thesis research - it will be much more difficult to pick up those skills later.

As biologists become more and more adept at working in both worlds, many of these computational guys who largely analyze other people's data will get squeezed out, unless they are really hardcore computer scientists or statisticians who can invent significant new techniques.

BTW, thanks for the link - a skeptical biochemist! You're a scientist after my own heart.