Wednesday, April 09, 2008

There is 99% more to do: realising the potential of scientific data

Publishers, says Peter Murray-Rust, are not helping in the dissemination of scholarly research: they are destroying it by locking it up or scrambling the data in bad formats (PDFs) and by failing to publish so much of it. "There is 99% more to do," and if publishers aren't doing it then scientists will do it themselves. Scientists are not interested in the human-readable discourse; they're interested in the data behind it. We're still emulating the way the Victorians shared information, barely including any data in it, and not enabling what data is there to be readily extracted and re-used.

Some publishers do publish the data alongside the articles - reams of additional information which help to prove that the ensuing article is scientifically accurate. In some cases the data is protected by copyright, or firewall barriers, which Murray-Rust considers inappropriate. "This is not a work of art to be copyrighted by the publisher; this is a scientific fact and should be freely available ... We have to get away from this culture of restricting the flow of scientific information."

"Young people", however, are not held back from the future and "have no fear of changing the world". Nick Day's Crystal Eye robot searches for crystallography data on the web (this data is usually exposed explicitly, and not subject to copyright, but it is not well structured). Nick has stored his data in RDF so that it can be mashed up with other data - plotted on a map, for example, to demonstrate the changing geographical balance of research output. "We've got to get into the habit of publishing data, as well as text." When the data is not published, we need to resort to text mining. The Oscar tool (written by Cambridge undergrads) "reads" documents and mines them for specific subjects (e.g. chemistry). But copyright restrictions on scientific literature limit wide-scale data mining.

Murray-Rust admits to being "slightly polemic" as he accuses publishers of "being desperate" to prevent literature being more widely opened up. Talis and the Royal Society of Chemistry, on the other hand, he praises for the promulgation of open data and their support for some of the activities discussed. Semantic authoring presents a huge opportunity, but "science will be impoverished" until data is widely, openly available.

IOP's Jerry Cowhig (also chair of STM) tries to redress the balance: it would be one thing if everything scientists did was simply given away to each other, but what publishers contribute to the process does incur costs, and all the giant leaps of recent years (such as the move to online publishing) have been funded by publishers. With costs of $2,000-3,000 per paper, is it reasonable to expect publishers to then give the content away free? Scientists do acknowledge that publishers' role is important. PM-R, in reply, points to the Wellcome Trust's initiative to make data available and then pay for publishing upfront: funder, rather than reader, paying. CERN too are funding SCOAP3 to support upfront open publishing. Publishers will be remunerated but at the opposite end of the process. It's not an easy transition to make but it is a viable way forward. JC agrees; a wider debate is necessary, but certainly the NIH model of mandated deposit is "nobody pays" rather than reader- or author-pays.

Wednesday, April 18, 2007

A beginner's guide to mining, and why you shouldn't do it anyway

Geoffrey Bilder contends that, when asked to deliver this session for UKSG, "I knew nothing about text mining". By the end of today's session, I suspected this was purely a comedy opener - either that, or he's really done his homework in the meantime.

Bilder promised to help us understand the concept of text mining and reach the stage where "you can avoid having to do it". He began by clarifying what data mining is *not*:
  • Data mining is not information retrieval. Tools which filter and refine searches to find specific bits of information are retrieving, not mining, data.
  • Data mining is not information extraction. Tools that allow you to extract and normalise data from many sources, for further analysis, are extracting, not mining, data.
  • Data mining is not information analysis. Tools that allow you to load, manipulate and analyse data are analysing, not mining, data.
However - put these together and you may have something closer to the concept of data mining. Data mining collates information - masses of it, and perhaps seemingly disparate - and looks at it in a new way to reveal something new; something previously unknown. Bilder cites an apocryphal example of data mining which, despite its lack of veracity, captures the spirit: a supermarket discovered that people who buy nappies (sorry, Geoff; I can't bring myself to use the word d*apers) will also often buy beer. More prosaically, data mining helped researchers make the connection between magnesium deficiency and migraine.

Text mining is an extension of data mining. There's a false belief out there that people want to read scholarly articles - yet lots of evidence that suggests they are doing everything within their power to avoid reading, because they can't keep up with the literature. Text mining helps us to extract the core facts - from data that is designed for human, not machine, reading. It parses texts for data which can be reliably extracted and interpreted to create keyword-type labels for that text.

Bilder showcased the GATE tool (General Architecture for Text Engineering) and noted that its accuracy and value vary depending on the subject area and type of text being mined. But then comes the crunch: "the thing that keeps striking me is: if hiding information in unstructured text is a problem, shouldn't we be exploring new ways to publish?"

So Bilder proposes some new approaches which we could deploy to help users avoid text/data mining in future. He used an initial example of human reading being able to identify the different reasons why words in different types of phrase might be italicised (for emphasis; because the word is foreign; etc). He then showed the machine-readable version of the example, which would require the words not simply to be tagged with italic tags, but to be tagged with more useful, more granular tags denoting the different meanings intended by the italicisation. Bilder cited IngentaConnect's semantic tagging of data which can then be machine read by, for example, social bookmarking tools and RSS readers.
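
To make the italics example concrete, here is a minimal Python sketch. It is only an illustration: the role names ("emphasis", "foreign"), the `Span` class and the sample sentence are my own assumptions, not anything Bilder or IngentaConnect showed. It contrasts presentational tagging, where every italicised phrase looks the same to a machine, with tagging that records why the phrase is italic.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A run of text plus an optional, hypothetical semantic role."""
    text: str
    role: str = ""  # e.g. "emphasis" or "foreign"; empty means plain text


def render_presentational(spans):
    # Every tagged span becomes <i>...</i>; the reason for the italics is lost.
    return "".join(f"<i>{s.text}</i>" if s.role else s.text for s in spans)


def render_semantic(spans):
    # Each tagged span keeps its role, so a machine can tell the cases apart.
    # (No HTML escaping here; this is a sketch, not production markup.)
    return "".join(
        f'<span class="{s.role}">{s.text}</span>' if s.role else s.text
        for s in spans
    )


def find_by_role(spans, role):
    # With semantic tags, "find all foreign phrases" is a trivial lookup.
    return [s.text for s in spans if s.role == role]


# Invented sample sentence with two different reasons for italics.
sentence = [
    Span("I "),
    Span("really", "emphasis"),
    Span(" enjoyed the "),
    Span("hors d'oeuvres", "foreign"),
    Span("."),
]

print(render_presentational(sentence))
print(render_semantic(sentence))
print(find_by_role(sentence, "foreign"))
```

A human reader copes perfectly well with the first rendering; only the second lets software answer a question such as "which phrases are foreign?" without guessing.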

He then introduced Nature Publishing's Open Text Mining Initiative which moves beyond tagging of metadata to tagging of full text, to enable researchers to make use of a full article without necessarily having access to the human-readable full text. An OTMI file pre-identifies the number of times particular words appear in the article, and includes out of order snippets - so that a text mining tool can make use of the text, but humans cannot read it. OTMI thus allows providers to open up paid archives of content to allow machines to mine it, thus making it more useful for users.
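
The idea of word counts plus out-of-order snippets can be sketched in a few lines of Python. This is emphatically not the real OTMI file format, which I have not reproduced here; the function names, the sentence-splitting rule and the sample text are all assumptions made for illustration.

```python
import random
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mineable_view(text, seed=0):
    """Return (word counts, shuffled snippets) for a piece of text.

    The counts let a mining tool see which terms occur and how often;
    shuffling the snippets makes the result awkward for a human to read
    as a continuous article.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    snippets = split_sentences(text)
    random.Random(seed).shuffle(snippets)  # seeded so runs are repeatable
    return counts, snippets


# Invented sample text, loosely echoing the magnesium/migraine example.
SAMPLE = (
    "Magnesium deficiency was studied. "
    "Migraine was studied too. "
    "Magnesium appeared in both literatures."
)

counts, snippets = mineable_view(SAMPLE)
print(counts.most_common(3))
print(snippets)
```

The design point is that nothing a mining tool needs (term frequencies, local context within a snippet) is lost, while the reading experience that subscribers pay for is not given away.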

But oh, says Bilder, so much more is possible (and everyone in the room sits wide-eyed with wonder at this emerging new dawn).

The semantic web, he reminds us, is "web as database", where every item of information is categorised to aid its integration and usage elsewhere in the web. Information items are identified as either subjects (Bill), predicates (is the brother of) or objects (Ben), which are then linked together in a simple data structure called a "triple" (Bill is the brother of Ben). A query language (such as SPARQL) can be pointed at an RDF data file (made up of triples) thus enabling the web to be queried in a way that was previously restricted to databases.
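
A toy version of the "triple" idea, in plain Python, may help. It is a sketch under stated assumptions: the `Triple` type and `match` function are hypothetical names of mine, the extra triples beyond "Bill is the brother of Ben" are invented, and the pattern matching here only imitates the flavour of a SPARQL query rather than implementing SPARQL or RDF.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# "Bill is the brother of Ben" comes from the talk; the rest is made up.
TRIPLES = [
    Triple("Bill", "is the brother of", "Ben"),
    Triple("Ben", "is the brother of", "Bill"),
    Triple("Bill", "lives in", "a flowerpot"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# "Who is the brother of Ben?" expressed as a pattern with an open subject.
brothers_of_ben = [
    t.subject for t in match(TRIPLES, predicate="is the brother of", obj="Ben")
]
print(brothers_of_ben)

# "What do we know about Bill?" leaves predicate and object open.
facts_about_bill = match(TRIPLES, subject="Bill")
print(facts_about_bill)
```

Because every fact has the same three-part shape, one generic query function can answer questions nobody anticipated when the data was written down, which is the sense in which the web starts to behave like a database.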

Given that we *can* provide data in such a well-tagged and structured way, users shouldn't *have* to data mine. It's like the early evolution of publishing - once we had created the concept of page numbers and tables of contents, wasn't it only logical to then implement these in order to make life as easy as possible? "Before we go out and get everybody text mining, I think we should ask ourselves the question: why are we publishing text? We can also publish data. We don't have to strip it out, we can supplement it and help our users."

For a full moment there was an awestruck silence - and then, as testament to Bilder's ability to make non-technical audiences comprehend densely technical subjects, the questions came.

Where might this RDF data come from - who has to create it? Bilder replies that publishers generally have it and are already doing things with it, e.g. sending it to CrossRef. Plus Nature's OTMI has a tool that can convert data from the PubMed DTD to OTMI.

How many researchers are attempting to do text analysis in this way - is it a small number but likely to grow, or? Bilder says "a lot of organisations [e.g. PubMedCentral] justify what they do on the basis that the data they collate will be data mined". He notes that it's not, of course, necessary for data to be gathered in one place, as machines that can read data can also retrieve it.

What's the typical publisher policy, given that text mining activities have in the past set off the security systems and brought up IP blocks? Bilder notes that agreements may be necessary between miner and provider to ensure the activity can take place. Any interface can create an area for this kind of usage of its data.
