Monday, April 12, 2010

Plenary 1: Technology and Change, Richard Wallis

“Libraries have always been at the leading edge of technology” says Richard Wallis, Technology Evangelist from Talis, as he shows us some photos of very early OPAC systems, many times removed from the modern day library website. To further illustrate the changes of the past several decades he shows us a calculator from the 1970s, whose £39.95 price is the equivalent of £400 today, but whose functionality is now a free app on almost any electronic device you care to name.

The information industry is all about helping people to find things and linking students to the resources that they need. We need to rethink how we do this, bringing the information directly to the user, in the format that they want. There should be no need to bounce the user via resolvers and multiple URLs to a site that eventually proclaims “Here it is!”. It should just be delivered.

Getting users to the answers they need can be done via search (searching the OPAC or more likely Googling) but answers can also be computed (WolframAlpha) or navigated to. All of these methods rely on metadata. Librarians and metadata, Wallis points out, go back a long way, but there’s still a way to go. Library metadata for the most part is “built on principles that worked for physical stuff” and is more often than not only available from the library. This metadata can be made more useful.

And so to Linked Data, which, put simply, identifies things, links to those things and describes those things. The four principles of Linked Data are:
  1. Use URIs as names for things
  2. Use HTTP URIs so people can look them up
  3. When someone looks up a URI provide useful info about the thing, in the right format (for human or machine consumption)
  4. Provide links to the URIs of other things to aid navigation.
To show some examples of Linked Data Wallis bravely attempts - and pulls off - a web demo. He shows us education.data.gov.uk, a semantic store of data about UK schools, and BBC Wildlife, which pulls in its descriptions of animals from Wikipedia.
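As a rough illustration of the four principles (not anything Wallis showed), here is a minimal Python sketch that stands in for the web with an in-memory dictionary. The example.org URIs, the field names and the `look_up` and `follow` helpers are all assumptions made up for this sketch; a real Linked Data service would answer HTTP requests rather than dictionary lookups.

```python
import json

# Hypothetical in-memory "web": each HTTP URI names a thing (principles 1 and 2).
# The example.org URIs and field names are illustrative assumptions.
DATA = {
    "http://example.org/animal/otter": {
        "label": "Otter",
        "description": "A semi-aquatic mammal.",
        "links": {"habitat": "http://example.org/habitat/river"},
    },
    "http://example.org/habitat/river": {
        "label": "River",
        "description": "A natural flowing watercourse.",
        "links": {},
    },
}


def look_up(uri, accept="text/html"):
    """Return a description of the thing named by uri (principle 3)."""
    thing = DATA[uri]
    if accept == "application/json":
        # Machine-readable form, including links to other URIs (principle 4).
        return json.dumps(thing, sort_keys=True)
    # Human-readable form.
    return f"{thing['label']}: {thing['description']}"


def follow(uri, relation):
    """Navigate from one thing to another via a named link (principle 4)."""
    return DATA[uri]["links"][relation]


otter = "http://example.org/animal/otter"
print(look_up(otter))
print(look_up(follow(otter, "habitat")))
```

The point of the sketch is that the same URI serves both a person and a program, and that the description itself carries the links that let either of them move on to the next thing.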

What about libraries? Some are experimenting with Linked Data - the Library of Congress and the National Library of Sweden, for example - so we’re seeing the start of library linking hubs. But we need to go further. Library linking hubs should link to non-library hubs (government data, the internet movie database, many other resources) and the library catalogue should become a set of links between concepts, becoming part of the Linked Data web.

The drivers for this evolution are not likely to be local cataloguers. Metadata for e-resources needs to be good and needs to be delivered with the resource. Article-level metadata can’t be catalogued: there’s just too much of it. But it is being aggregated (by CrossRef, for example). As this metadata gets better we’ll start to see non-library hubs linking to library data.

In summary, technology is evolving extremely quickly, and consumers are driving delivery methods - “get it to me on my device”. Education needs to link students to resources and search is only one way of doing this. Linked Data is powering the web but mostly outside of libraries, and libraries and publishers need to catch up.

“You can add great value to the web, but you need to be proactively of the web to do it. They won’t come just because you build it.”

See the slides here.


Monday, March 30, 2009

Solving organisation underload: rethinking scholarly communications to add new conceptual value

"Open Access is going now," says Jan Velterop. "So I feel I can talk about something else - Beyond Open Access." That 'something' is organisation underload.

Too much of our data is too deeply hidden - we struggle to get the most out of it. Jan suggests the problem is not information overload but 'organisation underload'; a lack of organisational conceptual structures to manage all this information. The information overload aspect is going to increase; think expanding communication mechanisms e.g. blogs, peer-reviewed wikis - and why, says Jan, are these not being initiated by publishers?

He uses water as an analogy for information. When there's just a bit, we take it in (we drink it). When there's too much, we have to devise a means to navigate it - a boat. We need to find ways of presenting knowledge that helps us to do something useful and immediate with it. This means not just publishing articles but creating visualisations of conceptually connected data - "this is where the future lies". And as the communications process changes, we will need to think again about what skills and workflows are required to manage it.

As scientists we have traditionally focussed on the detail; now we complement this with a step back to see the bigger picture of how things are connected. This isn't feasible in the traditional manner of ingesting research, but it's this lateral thinking that produces breakthroughs (revealing new conceptual connections).

Jan talks about Knewco's software that mines data for concepts rather than simply words, and (I paraphrase, but I think the essence is that it) breaks the data down into triples that codify the relationships between lots of different pieces of data. Mapping these connections can reveal a powerful picture but once you scale this to millions of triples there will be redundancy that needs to be removed in order to focus on the valuable connections. This kind of semantic data analysis and mapping can make scientific literature even more useful by helping users find new, valuable sources of information - using the connections between literature to support new browse options in library catalogues and publisher websites.
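To make the redundancy point concrete, here is a small Python sketch, assuming (my assumption, not a description of Knewco's software) that redundancy means the same relationship being asserted many times across the literature. The triples and the `min_support` threshold are invented for illustration.

```python
from collections import Counter

# Illustrative triples as they might be mined from several articles.
# The concept names and relations are placeholders, not real findings.
mined = [
    ("concept A", "relates to", "concept B"),
    ("concept A", "relates to", "concept B"),
    ("concept B", "relates to", "concept C"),
    ("concept A", "relates to", "concept B"),
]

# Collapse duplicates, keeping how often each relationship was asserted.
support = Counter(mined)

# Keep only relationships asserted at least min_support times.
min_support = 2
strong = [triple for triple, n in support.items() if n >= min_support]
print(strong)
```

Counting repeats rather than simply discarding them keeps a crude measure of how well supported each connection is, which is one plausible way to decide which connections are the valuable ones.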

[I have not done Jan's paper justice and would urge you to check out the wealth of additional interesting comments rippling around the Twittersphere (#UKSG09) - and must acknowledge that if I have managed to grasp his thesis at all it is because this is almost exactly what powers Publishing Technology's pub2web platform, which I have spent a good long time getting my head round!]


Wednesday, April 18, 2007

A beginner's guide to mining, and why you shouldn't do it anyway

Geoffrey Bilder contends that, when asked to deliver this session for UKSG, "I knew nothing about text mining". By the end of today's session, I suspected this was purely a comedy opener - either that, or he's really done his homework in the meantime.

Bilder promised to help us understand the concept of text mining and reach the stage where "you can avoid having to do it". He began by clarifying what data mining is *not*:
  • Data mining is not information retrieval. Tools which filter and refine searches to find specific bits of information are retrieving, not mining, data
  • Data mining is not information extraction. Tools that allow you to extract and normalise data from many sources, for further analysis, are extracting, not mining, data
  • Data mining is not information analysis. Tools that allow you to load, manipulate and analyse data are analysing, not mining, data.
However - put these together and you may have something closer to the concept of data mining. Data mining collates information - masses of it, and perhaps seemingly disparate - and looks at it in a new way to reveal something new; something previously unknown. Bilder cites an apocryphal example of data mining which, though probably untrue, captures the spirit: a supermarket discovered that people who buy nappies (sorry, Geoff; I can't bring myself to use the word d*apers) will also often buy beer. More prosaically, data mining helped researchers make the connection between magnesium deficiency and migraine.

Text mining is an extension of data mining. There's a false belief out there that people want to read scholarly articles - yet lots of evidence that suggests they are doing everything within their power to avoid reading, because they can't keep up with the literature. Text mining helps us to extract the core facts - from data that is designed for human, not machine, reading. It parses texts for data which can be reliably extracted and interpreted to create keyword-type labels for that text.

Bilder showcased the GATE tool (General Architecture for Text Engineering) and noted that its accuracy and value vary depending on the subject area and type of text being mined. But then comes the crunch: "the thing that keeps striking me is: if hiding information in unstructured text is a problem, shouldn't we be exploring new ways to publish?"

So Bilder proposes some new approaches which we could deploy to help users avoid text/data mining in future. He used an initial example of human reading being able to identify the different reasons why words in different types of phrase might be italicised (for emphasis; because the word is foreign; etc). He then showed the machine-readable version of the example, which would require the words not simply to be tagged with italic tags, but to be tagged with more useful, more granular tags denoting the different meanings intended by the italicisation. Bilder cited IngentaConnect's semantic tagging of data which can then be machine read by, for example, social bookmarking tools and RSS readers.

He then introduced Nature Publishing's Open Text Mining Interface (OTMI), which moves beyond tagging of metadata to tagging of full text, to enable researchers to make use of a full article without necessarily having access to the human-readable full text. An OTMI file pre-identifies the number of times particular words appear in the article, and includes out-of-order snippets - so that a text mining tool can make use of the text, but humans cannot read it. OTMI thus allows providers to open up paid archives of content for machines to mine, making that content more useful for users.
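The idea of word counts plus out-of-order snippets can be sketched in a few lines of Python. This is only an illustration of the concept as described in the session, not the actual OTMI file format; the `mining_view` function, the sample text and the sentence-splitting rule are assumptions of mine.

```python
import random
import re
from collections import Counter


def mining_view(text, seed=0):
    """Build a word-count table and a shuffled list of sentence snippets."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Split into sentence-like snippets after terminal punctuation.
    snippets = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Shuffle the snippets; a real tool would need to guarantee that the
    # result really is scrambled, which a plain shuffle does not.
    random.Random(seed).shuffle(snippets)
    return counts, snippets


# Made-up sample text standing in for an article.
article = (
    "Magnesium levels were measured. Migraine frequency was recorded. "
    "Low magnesium was associated with migraine."
)
counts, snippets = mining_view(article)
print(counts["magnesium"], counts["migraine"])
print(snippets)
```

A mining tool can still work with the counts and the individual snippets, while a human reader no longer gets the article in its original running order.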

But oh, says Bilder, so much more is possible (and everyone in the room sits wide-eyed with wonder at this emerging new dawn).

The semantic web, he reminds us, is "web as database", where every item of information is categorised to aid its integration and usage elsewhere in the web. Information items are identified as either subjects (Bill), predicates (is the brother of) or objects (Ben), which are then linked together in a simple data structure called a "triple" (Bill is the brother of Ben). A query language (such as SPARQL) can be pointed at an RDF data file (made up of triples) thus enabling the web to be queried in a way that was previously restricted to databases.
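For readers who think better in code, here is a minimal Python sketch of triples and a pattern query. It is a plain-Python stand-in for what SPARQL does over RDF, not SPARQL itself; the `match` function and the third triple are illustrative assumptions.

```python
# Each triple is (subject, predicate, object); the data is illustrative.
TRIPLES = [
    ("Bill", "is the brother of", "Ben"),
    ("Ben", "is the brother of", "Bill"),
    ("Bill", "is the author of", "Article 1"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Who is the brother of Ben?" leaves the subject open as the unknown.
brothers = [s for (s, _, _) in match(TRIPLES, predicate="is the brother of", obj="Ben")]
print(brothers)
```

Leaving one slot of the pattern open is roughly what a variable does in a SPARQL query: the query describes the shape of the triple and the store returns whatever fits.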

Given that we *can* provide data in such a well-tagged and structured way, users shouldn't *have* to data mine. It's like the early evolution of publishing - once we had created the concept of page numbers and tables of contents, wasn't it only logical to then implement these in order to make life as easy as possible? "Before we go out and get everybody text mining, I think we should ask ourselves the question: why are we publishing text? We can also publish data. We don't have to strip it out, we can supplement it and help our users."

For a full moment there was an awestruck silence - and then, as testament to Bilder's ability to make non-technical audiences comprehend densely technical subjects, the questions came.

Where might this RDF data come from - who has to create it? Bilder replies that publishers generally have it and are already doing things with it e.g. sending it to CrossRef. Plus Nature's OTMI has a tool that can convert data from the PubMed DTD to OTMI.

How many researchers are attempting to do text analysis in this way - is it a small number that is likely to grow? Bilder says "a lot of organisations [e.g. PubMedCentral] justify what they do on the basis that the data they collate will be data mined". He notes that it's not, of course, necessary for data to be gathered in one place, as machines that can read data can also retrieve it.

What's the typical publisher policy, given that text mining activities have in the past set off the security systems and brought up IP blocks? Bilder notes that agreements may be necessary between miner and provider to ensure the activity can take place. Any interface can create an area for this kind of usage of its data.
