Semantic Open Data in Scientific Publication - Peter Murray-Rust, University of Cambridge
The conventional view is that journal articles are the means of scholarly communication, but in practice journals are very much how your work is recognised. Data is really important though and, as far as Peter is concerned, publishers are a problem in terms of data: they mangle and restrict it.
A graph of carbon dioxide growth in the atmosphere (with no annotations) is up on screen and Peter is explaining why converting this sort of graph to a paper/electronic paper type format is a very inefficient way to do things. There is 99% more that could be done with scientific publications. Peter is showing (live) a paper from JoVE, which is in the form of a video. Unfortunately it's not working just at the moment (maybe a publisher nobbled it?!). This video would give an idea of how scientists look at and present their information (video and data).
At the moment journals contain human readable discourse (or Full Text) + facts and tables, and this is usually only available on a subscription basis. Most scientists want the facts and the tables; the readable content is not as useful to them. Flicking through the electronic version of Nature you can see that it is impenetrable - we emulate the Victorians with our formatting, abbreviations, references etc. You can copy and paste but that's all. The hard information needed for reproducing experiments is discarded. Some journals require that information to be retained, which makes for a huge journal, but it does help ensure that you have all the data, so you can see how accurate the work is and judge it better. Some publishers reject this extra information or hide it behind a firewall, cover it in copyright notices etc. Peter contends that this material is not a work of art by the publisher and should be freely available to the world. Otherwise you spend your whole time photocopying and measuring to recover this information. This is not fantasy - last year a student posted a graph with 10 data points on it on her blog and got a legal notice from a publisher.
Peter is showing us how young people - who use social networking sites and have no fear of changing the world - are using technology. The example is a robot built by a student to pick up information from across the web. The robot (Nick Day's CrystalEye robot) goes out at night and finds information on crystallography from tables of contents etc. It is almost maintenance and cost free, though it does need changes whenever the tables of contents and other sources being searched change.
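As a rough illustration of the kind of job such a robot does (this is not Nick Day's code, just a minimal sketch), the Python below scans the HTML of a table-of-contents page and collects links to data files. It assumes the data files are linked with a ".cif" suffix; the page content, the journal.example address and the names DataLinkCollector and find_data_links are all made up for the example. It also hints at why the robot needs attention when a publisher changes its table-of-contents layout: the link pattern is the only thing it knows about the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DataLinkCollector(HTMLParser):
    """Collects href targets of <a> tags that end with a given file suffix."""

    def __init__(self, base_url, suffix=".cif"):
        super().__init__()
        self.base_url = base_url
        self.suffix = suffix.lower()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: data files are recognisable purely by their suffix.
        if href and href.lower().endswith(self.suffix):
            # Resolve relative links against the page address.
            self.links.append(urljoin(self.base_url, href))


def find_data_links(toc_html, base_url, suffix=".cif"):
    """Return absolute URLs of data files linked from a table-of-contents page."""
    collector = DataLinkCollector(base_url, suffix)
    collector.feed(toc_html)
    return collector.links


# Invented table-of-contents fragment, standing in for a fetched page.
SAMPLE_TOC = """
<ul>
  <li><a href="/issue/1/paper1.html">Paper 1</a>
      <a href="/issue/1/paper1.cif">data</a></li>
  <li><a href="/issue/1/paper2.html">Paper 2</a></li>
  <li><a href="supp/paper3.CIF">data</a></li>
</ul>
"""

if __name__ == "__main__":
    for link in find_data_links(SAMPLE_TOC, "https://journal.example/issue/1/"):
        print(link)
```

A real robot would fetch the pages on a schedule and respect each site's terms; the sketch deliberately works on a string so it runs without network access.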
Acta Crystallographica have done great work in creating a rich scientific item. We have to get into the habit of publishing data as well as full text. What Nick Day has done with this data is turn it into RDF and do mashups (a mapping video shows one of these).
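To give a feel for what "turning data into RDF" means, here is a minimal Python sketch that writes a small record out as N-Triples, one of the plain-text RDF syntaxes. It is not the conversion used in the talk: the vocabulary address (example.org), the record fields and their values are invented for illustration, and every value is written as a plain string literal.

```python
def escape_literal(value):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(subject_uri, record, vocab="http://example.org/vocab#"):
    """Turn a flat dict into N-Triples lines, one triple per field.

    The vocab prefix is a placeholder, not a real published vocabulary.
    """
    lines = []
    for key, value in sorted(record.items()):
        literal = escape_literal(str(value))
        lines.append(f'<{subject_uri}> <{vocab}{key}> "{literal}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented record standing in for data harvested from a paper.
    record = {"formula": "C6 H6", "sourceJournal": "Example Journal"}
    print(record_to_ntriples("http://example.org/crystal/1", record))
```

Once the facts are in this subject-predicate-object form they can be merged with other sources that use the same identifiers, which is what makes mashups possible.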
Text is important to the web. Robots can help filter through this information in PDFs - you can go through a thesis to find relevant information. The only thing stopping us from going through, analysing and reusing data is the restrictions imposed by publishers.
The NIH mandate requires all NIH-funded research to be publicly available for free. However, you can only read it; you can't reuse it or trawl it with robots, and this is due to the destructive force of the publishers.
At the moment bioscientists spend enormous amounts of time and effort annotating journals etc. Project Prospect is the first step in semantic publication - for instance it shows chemical compounds when you click on the name. Wouldn't it be great if this happened at the authoring stage? There is a huge opportunity in the semantic authoring of papers.
Open data is crucial to this whole process though. Restricted access helps publishers but cripples science. You also need to capture information on the fly and add it to departmental repositories - there are ongoing projects to do this.
Funders should be requiring open data. It should not be held by publishers and you certainly shouldn't give your data to publishers to sell back to you. Use of scientific CC (Creative Commons) licences will go a long way towards this.
Peter says if you want more info, Google: Murray Rust
Q & A
Q (from a publisher): There are costs associated with putting data onto the web and with the infrastructures put in place by publishers. If scientists were just exchanging data that would be different.
A: Answering this would get us into a long debate but I would point to an initiative from the Wellcome Trust (who are actually paying for the system). This gives a difficult transition but there is a model where the funder pays not the publisher. CERN is also investing in open publishing in the same sort of way. The SCOPE project looks towards funding of publication.
A (response from questioner): To some extent I agree. The weakness of the NIH system is that no-one pays effectively!