LiveSerials

Wednesday, April 09, 2008

There is 99% more to do: realising the potential of scientific data

Publishers, says Peter Murray-Rust, are not helping in the dissemination of scholarly research: they are destroying it by locking it up or scrambling the data in bad formats (PDFs) and by failing to publish so much of it. "There is 99% more to do," and if publishers aren't doing it then scientists will do it themselves. They're not interested in the human-readable discourse, they're interested in the data behind it. We're still emulating the way the Victorians shared information, and barely including any data in it, or enabling that which is there to be readily extracted and re-used.

Some publishers do publish the data alongside the articles - reams of additional information which help to prove that the ensuing article is scientifically accurate. In some cases the data is protected by copyright, or firewall barriers, which Murray-Rust considers inappropriate. "This is not a work of art to be copyrighted by the publisher; this is a scientific fact and should be freely available ... We have to get away from this culture of restricting the flow of scientific information."

"Young people", however, are not held back from the future and "have no fear of changing the world". Nick Day's Crystal Eye robot searches for crystallography on the web (this data is usually exposed explicitly, and not subject to copyright, but the data is not well structured). Nick has stored his data in RDF so that it can be mashed up with other data - plotted on a map, for example, to demonstrate the changing geographical balance of research output. "We've got to get into the habit of publishing data, as well as text." When the data is not published, we need to resort to text mining. The Oscar tool (written by Cambridge undergrads) "reads" documents and mines them for specific subjects (e.g. chemistry). But, copyright restrictions on scientific literature limit widescale data mining.

Murray-Rust admits to being "slightly polemic" as he accuses publishers of "being desperate" to prevent literature being more widely opened up. Talis and the Royal Society of Chemistry, on the other hand, he praises for the promulgation of open data and their support for some of the activities discussed. Semantic authoring presents a huge opportunity, but "science will be impoverished" until data is widely, openly available.

IOP's Jerry Cowhig (also chair of STM) tries to redress the balance: it might seem simple if everything scientists did was just given away to each other, but what publishers contribute to the process does incur costs, and all the giant leaps of recent years (such as the move to online publishing) have been funded by publishers. With costs of $2,000-3,000 per paper, is it reasonable to expect publishers to then give the content away free? Scientists do acknowledge that publishers' role is important. PM-R, in reply, points to the Wellcome Trust's initiative to make data available and then pay for publishing upfront: funder, rather than reader, paying. CERN too is funding SCOAP3 to support upfront open publishing. Publishers will be remunerated but at the opposite end of the process. It's not an easy transition to make but it is a viable way forward. JC agrees; a wider debate is necessary, but certainly the NIH model of mandated deposit is "nobody pays" rather than reader- or author-pays.

Labels: data mining, mashup, raw data, rdf, supplementary data, text mining

Translating Geek to English: exploring the possibilities of the Semantic Web

"I've made a good business of translating Geek to English," says Geoff Bilder (who hates buzzwords and can't remember submitting a paper with the title Web 3.0 - mea culpa, possibly).

Back in 2004, Geoff talked at UKSG about mash-ups, syndication, RSS and FOAF. Those were the balmy days when the term Web 2.0 had not been coined and we could talk about these individual technologies - and let them get on with changing the web - without lumping them together in a faceless buzzword bundle.

We can draw analogies between our current situation and the huge explosion of content that occurred shortly after the invention of the printing press. But if you compare the timelines, we're still in the primitive stages of developing our technology - "we haven't reached our Martin Luther moment". And just as we are uploading facsimiles of printed works onto the web, early modern European printers illuminated their incunabula to make them more palatable to an audience bred on monk-y manuscripts.

But we're uploading masses of this stuff. Too much. Who can read the glut of data that is available - and relevant - to them? Researchers are inundated. "People would really like to try to avoid reading," in order to get on with research rather than background tasks. Web 2.0's "read + write" capabilities help researchers to help each other find what's out there. Blogs are ubiquitous and emerging tools are enabling easier distinction between research-related and other postings. Social bookmarking allows us to share with others, quickly and easily, the information we are interested in. Tagging enables filtering of bookmarks; ultimately it's a process of subscribing to a colleague's brain.

Web 3.0 takes us beyond "read + write" to "read + write + identity + compute": it promises that we don't need to strip data out of published articles (extracting HTML from a print facsimile), and analyse it before stuffing it back in ... we'll create consistent metadata, structure it, share it in easily-computer-digestible forms (standard ones) and make better use of the content that is out there: it's the semantic web. Storing data in formats such as RDF allows for modeling of relationships between data; metadata encoded in this way allows HTML pages to be queried (using SPARQL) to extract metadata NOT by harvesting and parsing (unreliable, prone to error) but by extracting tagged fields: the page is not only human-readable, but also machine-readable. This machine-compatibility is key to the semantic web and to Web 3.0 (whatever that is). Just as tables of contents, page numbers and many other tools were developed - over centuries - to make printed content more accessible and useful, so we are now developing new tools that make our current content formats more accessible, more useful.
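To make the "tagged fields" idea concrete, here is a minimal sketch in Python. It is not SPARQL and not what Geoff showed; it assumes a page that carries Dublin Core-style meta elements, and the page content, the class name MetaFieldParser and the author names are all invented for illustration. The point is only that named fields can be read directly, with no guessing at the page layout.

```python
from html.parser import HTMLParser


class MetaFieldParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content:
            # A field may repeat (e.g. several creators), so keep a list.
            self.fields.setdefault(name, []).append(content)


# Hypothetical page: the title and authors are invented for illustration.
PAGE = """
<html><head>
  <meta name="DC.title" content="An example article">
  <meta name="DC.creator" content="A. Author">
  <meta name="DC.creator" content="B. Author">
</head><body><p>Human-readable text goes here.</p></body></html>
"""

parser = MetaFieldParser()
parser.feed(PAGE)
print(parser.fields["DC.title"])
print(parser.fields["DC.creator"])
```

The same page still renders normally for a human reader; the machine simply ignores the body and reads the labelled fields.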

Richard Gedye asks whether the technology exists to track how many times an article is bookmarked across multiple social bookmarking sites (answer: yes) and to drill down and explore who has bookmarked it (yes, theoretically, but there are privacy issues).

Mark Ware has been exploring Geoff's del.icio.us page during the presentation, and has picked up an article entitled "Scientists shun Web 2.0". Connotea has 50,000 users averaging fewer than 10 tags per user; Ginsparg's review of social bookmarking shows low uptake. Why? Answer: We had the same reaction to personal computers, to email, and to many other technologies at this stage of their development. [We're still fairly early on the adoption curve]. Some things like RSS have only really become useable in the last year or so, as browsers become more intelligent. Only when technologies mature and people recognise the value they add will there be good uptake. Mark responds that scientists don't see that value yet - no time is being saved, they think. Geoff says it IS more efficient; don't knock it till you've tried it. Our current means of interacting as a community, and sharing information, is going to conferences and networking with our peers. That's a much higher-bandwidth method than sharing content digitally.

Labels: incunabula, publishing technology, web 2.0, rdf, sparql, web 3.0

Wednesday, April 18, 2007

A beginner's guide to mining, and why you shouldn't do it anyway

Geoffrey Bilder contends that, when asked to deliver this session for UKSG, "I knew nothing about text mining". By the end of today's session, I suspected this was purely a comedy opener - either that, or he's really done his homework in the meantime.

Bilder promised to help us understand the concept of text mining and reach the stage where "you can avoid having to do it". He began by clarifying what data mining is *not*:

Data mining is not information retrieval. Tools which filter and refine searches to find specific bits of information are retrieving, not mining, data.
Data mining is not information extraction. Tools that allow you to extract and normalise data from many sources, for further analysis, are extracting, not mining, data.
Data mining is not information analysis. Tools that allow you to load, manipulate and analyse data are analysing, not mining, data.

However - put these together and you may have something closer to the concept of data mining. Data mining collates information - masses of it, and perhaps seemingly disparate - and looks at it in a new way to reveal something new; something previously unknown. Bilder cites an apocryphal example of data mining which despite lack of veracity espouses the spirit: a supermarket discovered that people who buy nappies (sorry, Geoff; I can't bring myself to use the word d*apers) will also often buy beer. More prosaically, data mining helped researchers make the connection between magnesium deficiency and migraine.
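For the curious, the nappies-and-beer story can be sketched in a few lines of Python. This is only a toy, under the assumption that "looking at it in a new way" means counting which items turn up together; the baskets are made up, and the function names pair_counts and confidence are my own rather than anything from the session.

```python
from collections import Counter
from itertools import combinations

# Invented shopping baskets, purely for illustration.
BASKETS = [
    {"nappies", "beer", "bread"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"bread", "milk"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(baskets, if_item, then_item):
    """Share of baskets containing if_item that also contain then_item."""
    with_if = [b for b in baskets if if_item in b]
    if not with_if:
        return 0.0
    return sum(then_item in b for b in with_if) / len(with_if)


pairs = pair_counts(BASKETS)
print(pairs.most_common(1))
print(confidence(BASKETS, "nappies", "beer"))
```

Nobody asked "do nappy buyers buy beer?" in advance; the pairing simply falls out of the counts, which is the spirit of the thing.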

Text mining is an extension of data mining. There's a false belief out there that people want to read scholarly articles - yet lots of evidence that suggests they are doing everything within their power to avoid reading, because they can't keep up with the literature. Text mining helps us to extract the core facts - from data that is designed for human, not machine, reading. It parses texts for data which can be reliably extracted and interpreted to create keyword-type labels for that text.

Bilder showcased the GATE tool (General Architecture for Text Engineering) and noted that it has more or less accuracy/value depending on the subject area and type of text being mined. But then comes the crunch: "the thing that keeps striking me is: if hiding information in unstructured text is a problem, shouldn't we be exploring new ways to publish?"

So Bilder proposes some new approaches which we could deploy to help users avoid text/data mining in future. He used an initial example of human reading being able to identify the different reasons why words in different types of phrase might be italicised (for emphasis; because the word is foreign; etc). He then showed the machine-readable version of the example, which would require the words not simply to be tagged with italic tags, but to be tagged with more useful, more granular tags denoting the different meanings intended by the italicisation. Bilder cited IngentaConnect's semantic tagging of data which can then be machine read by, for example, social bookmarking tools and RSS readers.
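A rough Python sketch of the italics example follows. The tag names emph and foreign, the lang attribute and the sample sentence are assumptions chosen for illustration, not the markup Bilder displayed. With plain italic tags a machine sees two indistinguishable spans; with semantic tags it can ask for one kind only.

```python
import xml.etree.ElementTree as ET

# Presentational markup: both spans are just "italic".
PRESENTATIONAL = "<p>It was <i>very</i> cold, a real <i>annus horribilis</i>.</p>"

# Semantic markup: hypothetical tags that record why each span is italic.
SEMANTIC = (
    "<p>It was <emph>very</emph> cold, a real "
    '<foreign lang="la">annus horribilis</foreign>.</p>'
)


def spans(markup, tag):
    """Return the text of every element with the given tag."""
    return [el.text for el in ET.fromstring(markup).iter(tag)]


print(spans(PRESENTATIONAL, "i"))   # emphasis and foreign phrase mixed together
print(spans(SEMANTIC, "emph"))      # only the emphasised word
print(spans(SEMANTIC, "foreign"))   # only the foreign phrase
```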

He then introduced Nature Publishing's Open Text Mining Initiative which moves beyond tagging of metadata to tagging of full text, to enable researchers to make use of a full article without necessarily having access to the human-readable full text. An OTMI file pre-identifies the number of times particular words appear in the article, and includes out of order snippets - so that a text mining tool can make use of the text, but humans cannot read it. OTMI thus allows providers to open up paid archives of content to allow machines to mine it, thus making it more useful for users.
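The idea of "counts plus out-of-order snippets" can be sketched in Python as below. This is emphatically not the OTMI file format, just an illustration of the principle under simple assumptions: words are runs of letters, sentences end at a full stop, and the function name mining_view and the sample text are invented.

```python
import random
import re
from collections import Counter


def mining_view(text, seed=0):
    """Build word counts plus sentence snippets in shuffled order."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    snippets = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Shuffle so the snippets no longer read as continuous prose.
    random.Random(seed).shuffle(snippets)
    return counts, snippets


# Invented text standing in for an article body.
ARTICLE = (
    "The sample was heated. The sample turned blue. "
    "A gas was released. The test was repeated."
)

counts, snippets = mining_view(ARTICLE)
print(counts["sample"])
print(snippets)
```

A mining tool gets what it needs (which words occur, how often, and in what local context) while a human is left with a jumble rather than a readable article.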

But oh, says Bilder, so much more is possible (and everyone in the room sits wide-eyed with wonder at this emerging new dawn).

The semantic web, he reminds us, is "web as database", where every item of information is categorised to aid its integration and usage elsewhere in the web. Information items are identified as either subjects (Bill), predicates (is the brother of) or objects (Ben), which are then linked together in a simple data structure called a "triple" (Bill is the brother of Ben). A query language (such as SPARQL) can be pointed at an RDF data file (made up of triples) thus enabling the web to be queried in a way that was previously restricted to databases.
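Here is a minimal Python sketch of triples and a pattern query. It is a plain-Python stand-in, not RDF or SPARQL themselves: real RDF identifies things with URIs and a real query would go through a SPARQL engine. Bill and Ben come from the example above; the other names, the predicate spellings and the match function are assumptions for illustration.

```python
# Each triple is (subject, predicate, object). Bill and Ben come from the
# example above; the other entries are invented for illustration.
TRIPLES = [
    ("Bill", "brotherOf", "Ben"),
    ("Ben", "brotherOf", "Bill"),
    ("Bill", "worksAt", "ExamplePress"),
    ("Carol", "friendOf", "Ben"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Roughly the question "who is the brother of Ben?"
brothers = [s for (s, _, _) in match(TRIPLES, predicate="brotherOf", obj="Ben")]
print(brothers)

# Everything known about Bill.
print(match(TRIPLES, subject="Bill"))
```

Leaving a slot empty is what turns a statement into a question, which is essentially what a query language does over a much larger pile of triples.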

Given that we *can* provide data in such a well-tagged and structured way, users shouldn't *have* to data mine. It's like the early evolution of publishing - once we had created the concept of page numbers and tables of contents, wasn't it only logical to then implement these in order to make life as easy as possible? "Before we go out and get everybody text mining, I think we should ask ourselves the question: why are we publishing text? We can also publish data. We don't have to strip it out, we can supplement it and help our users."

For a full moment there was an awestruck silence - and then, as testament to Bilder's ability to make non-technical audiences comprehend densely technical subjects, the questions came.

Where might this RDF data come from - who has to create it? Bilder replies that publishers generally have it and are already doing things with it e.g. sending it to CrossRef. Plus Nature's OTMI has a tool that can convert data from the PubMed DTD to OTMI.

How many researchers are attempting to do text analysis in this way - is it a small number but likely to grow, or? Bilder says "a lot of organisations [e.g. PubMedCentral] justify what they do on the basis that the data they collate will be data mined". He notes that it's not, of course, necessary for data to be gathered in one place, as machines that can read data can also retrieve it.

What's the typical publisher policy, given that text mining activities have in the past set off the security systems and brought up IP blocks? Bilder notes that agreements may be necessary between miner and provider to ensure the activity can take place. Any interface can create an area for this kind of usage of its data.

Labels: data mining, gate, otmi, rdf, semantic web, text mining, triple