LiveSerials: There is 99% more to do: realising the potential of scientific data

Publishers, says Peter Murray-Rust, are not helping in the dissemination of scholarly research: they are destroying it by locking it up or scrambling the data in bad formats (PDFs) and by failing to publish so much of it. "There is 99% more to do," and if publishers aren't doing it then scientists will do it themselves. Scientists, he argues, are not interested in the human-readable discourse; they're interested in the data behind it. We're still emulating the way the Victorians shared information: barely including any data, and doing little to let what data there is be readily extracted and re-used.

Some publishers do publish the data alongside the articles - reams of additional information which help to prove that the ensuing article is scientifically accurate. In some cases the data is protected by copyright, or firewall barriers, which Murray-Rust considers inappropriate. "This is not a work of art to be copyrighted by the publisher; this is a scientific fact and should be freely available ... We have to get away from this culture of restricting the flow of scientific information."

"Young people", however, are not held back from the future and "have no fear of changing the world". Nick Day's Crystal Eye robot searches the web for crystallography data (which is usually exposed explicitly and not subject to copyright, but is not well structured). Nick has stored his data in RDF so that it can be mashed up with other data - plotted on a map, for example, to demonstrate the changing geographical balance of research output. "We've got to get into the habit of publishing data, as well as text." When the data is not published, we need to resort to text mining. The Oscar tool (written by Cambridge undergrads) "reads" documents and mines them for specific subjects (e.g. chemistry). But copyright restrictions on scientific literature limit wide-scale data mining.

Murray-Rust admits to being "slightly polemic" as he accuses publishers of "being desperate" to prevent literature being more widely opened up. Talis and the Royal Society of Chemistry, on the other hand, he praises for the promulgation of open data and their support for some of the activities discussed. Semantic authoring presents a huge opportunity, but "science will be impoverished" until data is widely, openly available.

IOP's Jerry Cowhig (also chair of STM) tries to redress the balance: it would be fine if everything scientists did were simply given away to each other, but what publishers contribute to the process does incur costs, and all the giant leaps of recent years (such as the move to online publishing) have been funded by publishers. With costs of $2,000-3,000 per paper, is it reasonable to expect publishers to then give the content away free? Scientists do acknowledge that publishers' role is important. PM-R, in reply, points to the Wellcome Trust's initiative to make data available and then pay for publishing upfront: funder, rather than reader, paying. CERN too is funding SCOAP3 to support upfront open publishing. Publishers will be remunerated, but at the opposite end of the process. It's not an easy transition to make but it is a viable way forward. JC agrees; a wider debate is necessary, but certainly the NIH model of mandated deposit is "nobody pays" rather than reader- or author-pays.

Labels: data mining, mashup, raw data, rdf, supplementary data, text mining

Wednesday, April 09, 2008
