Open data: we need to share research results, even when they are wrong

This article is more than 10 years old

There are huge flaws in the way research data is uploaded, says Mark Hahnel, but how far are we from a universal solution?

Can we risk losing the raw data future researchers will need to do their work, says Mark Hahnel. Photograph: William Whitehurst/Corbis

As self-effacement goes, it's hard to beat Isaac Newton: "If I have seen further it is by standing on the shoulders of giants." Yet while modern scientists continue to build on the concepts and ideas of their forerunners, they face a unique problem that Newton or his peers would not have anticipated – the inability to access crucial research data generated by other people's work.

Scientific research today typically yields enormous volumes of information, with individual projects easily able to generate gigabytes or terabytes of data. The problem for the scientific community is that the vast majority of this information never makes it into published research, which tends, by necessity, to be limited to topline conclusions or summaries of the key findings. The raw data – including the data from hundreds of unsuccessful experiments – is left out, lost to the scientific community and to future researchers.

One of the aims of the open science data movement is to eradicate these scientific blind spots by encouraging the sharing of research data using existing web-based technology, such as Flickr or blogs. Unfortunately, recent research by the Web Science and Digital Libraries Research Group found that 10% of the artefacts they were tracking vanished within the space of a year. That loss rate is unacceptable for such valuable archival information.
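To see why that rate matters for archival material, it helps to let it compound. The short sketch below assumes, purely for illustration, that the 10% annual loss rate holds constant year on year (real link rot need not behave this way):

```python
# Sketch: compound decay of tracked artefacts at an assumed constant
# 10% annual loss rate (an illustrative assumption, not a measured model).
def surviving_fraction(years: int, annual_loss: float = 0.10) -> float:
    """Fraction of artefacts expected to survive after `years` years."""
    return (1 - annual_loss) ** years

for y in (1, 5, 10):
    print(f"after {y:2d} years: {surviving_fraction(y):.1%} of artefacts survive")
```

On this assumption, only about a third of the material would remain after a decade – a sobering figure for anything meant to serve as a permanent scientific record.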

Given that the large majority of scientific information is 'negative data' (where hypotheses are found to have failed), the loss of these findings to the wider community leads to the needless repetition of time-consuming experiments. This is a self-perpetuating problem. If researchers are basing their hypotheses on published literature alone, they will likely be wasting time and money on repeating failed experiments – at best, reproducing false data; at worst, creating a new set of false positives.

One area where this has the greatest impact is clinical trials. A lack of comprehensive research data has led to drugs with placebo-level, or even detrimental, effects being released to the market, at great profit. Without the underlying data, subsequent researchers cannot make new analyses of the same results, combine them with other data sets, or put them to uses the original producer or collector may not have anticipated.

Also of critical importance is the ability of other research groups to reproduce published findings. This cannot be achieved unless all of the relevant information is available, including the raw data, the exact software used, and correct metadata for each file.

A 2012 paper published in Nature by C Glenn Begley set out to establish whether cancer research has a reproducibility problem. The paper identified 53 'landmark' studies and acknowledged from the outset that some of the findings might not hold up, because it deliberately selected papers describing something completely new, such as fresh approaches to targeting cancers or alternative clinical uses for existing therapeutics. Nevertheless, the scientific findings were confirmed in only six cases (11%). Even allowing for the known limitations of preclinical research, this was a shocking result.
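The headline percentage is straightforward arithmetic over the figures reported above:

```python
# Reproducibility figures as reported in the article:
# 6 of the 53 selected 'landmark' cancer studies were confirmed.
landmark_studies = 53
confirmed = 6
rate = confirmed / landmark_studies
print(f"{confirmed}/{landmark_studies} confirmed = {rate:.0%}")
```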

There are clearly huge flaws in the way that research data is being uploaded and shared, and we are still at a very early stage in finding and agreeing on a universal solution for open data sharing. We are, however, beginning to head in the right direction, with national science academies and research organisations working regionally, nationally and globally to pursue greater public access to the results of publicly funded research.

A good example of this in the UK is the Royal Society's landmark report, Science as an open enterprise, which gives a conceptual view of what academic papers could look like in the future; for example, giving readers the ability to access primary data and recompute it to validate the conclusions presented. Subject to the authors' approval, readers would also be able to obtain access to the underlying code of the experiments presented in the publication.

At the moment, the world is just beginning to experiment with different mixtures of technology and policy to find a solution to open data that will be acceptable to all. Since this is a matter of the greatest importance to the scientific community, I urge every interested party to join the conversation and cooperate in the development of common standards for data sharing to ensure that the scientific discoveries of our time will provide a platform for future generations.

Mark Hahnel is founder of Figshare, an open platform for sharing research data at Digital Science – follow it on Twitter @figshare, @Digitalsci and Mark @MarkHahnel

This content is brought to you by Guardian Professional.

