
Changing software, hardware a nightmare for tracking scientific data

In a multipart series, we take a look at why a simple principle—scientific …

We've gone into all the problems involved with preserving and sharing scientific data in some detail, but the challenges don't end there. Typically, data doesn't speak for itself; it has to be analyzed and interpreted. And, these days, that analysis generally involves computer tools. Even basic images of cells can end up being processed to look for things like signal intensity and total area of signal. The results of that analysis may end up plugged into a spreadsheet and subjected to further analysis. This general approach—a pipeline of software tools—makes it difficult to document and reproduce exactly what happened to generate a final result.
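To make that concrete, here is a minimal sketch of one such pipeline step in Python, using scikit-image to threshold a cell image and measure total signal area and mean intensity. The file name, the Otsu thresholding choice, and the library itself are assumptions for illustration, not details from any particular study.

# A minimal sketch of one pipeline step: threshold a single-channel cell
# image and measure the signal. The file name and Otsu thresholding are
# illustrative choices; real pipelines vary.
import numpy as np
from skimage import io, filters

def measure_signal(path):
    """Return the total signal area (in pixels) and its mean intensity."""
    image = io.imread(path)                    # load the raw microscope image
    threshold = filters.threshold_otsu(image)  # automatic intensity cutoff
    mask = image > threshold                   # pixels counted as "signal"
    area = int(np.count_nonzero(mask))         # total area of signal, in pixels
    mean_intensity = float(image[mask].mean()) if area else 0.0
    return area, mean_intensity

# These are the numbers that typically end up in a spreadsheet downstream.
area, intensity = measure_signal("cells_field_01.tif")
print(f"signal area: {area} px, mean intensity: {intensity:.1f}")

Even in a toy example like this, the final numbers depend on the library version and the thresholding method, which is exactly the sort of detail that has to be recorded if anyone is to reproduce the result.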

We touched on some of these issues in our story on reproducibility in computer analysis, but they're worth emphasizing again.

Into the pipeline

Part I: Preserving science: what to do with raw research material?
Part II: Preserving science: what data do we keep? What do we discard?
Part III: Jaz drives, spiral notebooks, and SCSI: how we lose scientific data

Ensuring that it's possible to track everything about a computer analysis is incredibly challenging. My own experience in a lab bears that out. Of the 16 databases and online tools that I'd bookmarked while working on one project about five years ago, only nine are still accessible via the same URL. Of the remaining seven, only two had relocated in a way that enabled them to be found via a Google search.

Even over the span of a single project, versioning issues came into play. I first identified the gene that eventually turned out to be responsible for a mutation I was working on because a gene-prediction program had flagged it in an early version of the mouse genome. Partway through the project, the group hosting the genome sequence changed its prediction tool, and the gene vanished; it didn't reappear for three iterations of the genome.

At other times, my work relied on desktop software packages that were discontinued, along with plenty of incompatible file formats. The key message is that, even for careful researchers, forces beyond their control can eliminate any chance of reproducing computerized analyses, sometimes within a matter of months.

As a result, having a fully reproducible analysis pipeline is much, much harder than it appears. Researchers have to document all the software involved, the precise version of each package, and the settings used during the process. To actually reproduce it, each of these software packages has to be archived in case updates alter the software's behavior. And, for operating systems where backwards compatibility isn't maintained (hello, Apple), the operating system and the hardware to run it may need to be kept around.
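Some of that bookkeeping can at least be automated. As a rough sketch, and assuming a Python-based pipeline, the snippet below writes the interpreter version, the platform, the version of every installed package, and the analysis settings into a JSON manifest saved alongside the results; the manifest layout and the example settings are invented for illustration rather than any standard.

# A rough sketch of automated record-keeping for an analysis run, assuming a
# Python pipeline. The manifest layout and example settings are illustrative.
import json
import platform
import sys
from importlib import metadata

def write_manifest(settings, path="analysis_manifest.json"):
    """Save the software environment and analysis settings next to the results."""
    manifest = {
        "python": sys.version,
        "platform": platform.platform(),
        # Name and version of every package installed in this environment
        "packages": {d.metadata["Name"]: d.version for d in metadata.distributions()},
        "settings": settings,  # e.g. thresholds, filter sizes, genome build
    }
    with open(path, "w") as out:
        json.dump(manifest, out, indent=2)

# Hypothetical settings for the kind of analysis described above
write_manifest({"threshold_method": "otsu", "genome_build": "mm39"})

A manifest like this doesn't archive the software itself, but it at least records what would need to be archived.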

Any of this might be possible in a single lab, but reproducibility doesn't simply mean showing that you can get the same results twice. Instead, other labs need to be able to apply the analysis to new data and extend it with additional tools in order to move the field forward. And reproducing the precise analysis environment—from processor to software tools—is an incredible challenge. Some research groups prefer different operating systems, or can't afford proprietary software tools; others don't have access to out-of-date versions of software.

Home-grown software

And that's just the commercial and open source software. Plenty of research projects have extremely specialized needs, and end up having to create their own code. Unfortunately, the number of individuals who have the requisite expertise and good coding habits tends to be very small, which means that a lot of specialized scientific code is the product of self-taught amateurs (including some of my own contributions to the field). Given the circumstances, the focus tends to be on getting the job done, not writing easy-to-maintain or well-commented code. And, most likely, that code will need to be maintained and improved over several generations of lab members.

All of these factors are a recipe for the sorts of situations that faced the now-famous Harry, who was forced to wade through the CRU's code (in FORTRAN, no less), and get it to play nice with an unruly data set. The frustrated note he left in the code doesn't suggest any deliberate fraud so much as someone confronted with the mess left behind after a decade or more of contributions from researchers of various skill levels.

Again, research teams face a series of tradeoffs when it comes to the code. Training scientists to program well takes a while, and keeps them from being as productive, research-wise, in the meantime. Spending time cleaning up and commenting code instead of focusing on using what works can also set research back. And, in the end, all that can mean fewer grants and publications. So, the incentives aren't there to ensure that home-grown code meets the sort of standards that produce consistent results other researchers can extend.

This is not to say that there isn't some excellent, high-quality code that's well-documented and available to the public. But it's not as common as it should be, and the incentives to make it more common simply haven't been there. However, as with the issue of data preservation, the scientific community is increasingly aware of the problems, and proposals are being floated to try to bring things into line. We'll take a look at these in our final installment.
