LiveSerials

Wednesday, April 01, 2009

Plenary presentation summary: Journal Spend, use and research outcomes: A UK perspective on Value for Money. Presented by: Ian Rowlands, CIBER

During the second plenary session on Tuesday at UKSG, Mr. Rowlands presented some preliminary data from part of a Research Information Network-funded research project. He is halfway through the project and will be continuing into next year. There are some very interesting visualization tools to explore the data online.

There has been unprecedented growth in access to journal material over the past decade as content has moved from print to electronic. However, it is critical to assess the impact that this increase in access and availability of content has had over that decade. Has this increase in access led to higher productivity and more innovative research?

In exploring the research outcomes, Rowlands is looking at many quantifiable criteria, including the number of COUNTER downloads, the number of PhDs, the number of grants, institutional spending patterns, and deep log analysis in a variety of disciplines.

It should come as little surprise to the community that the transition from print to electronic publication is nearly complete: 96.1% of science journals and 88.5% of arts and humanities journals are online. In 2007, the academic community spent £80 million on e-journal licenses. Collectively those purchases have yielded more than 102 million downloads, or about £0.80 per download.
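The per-download figure is simple division, and a minimal sketch makes the rounding visible. The two inputs are the rounded numbers quoted in the talk; the variable names are my own. With these inputs the result comes out at roughly £0.78, which is consistent with the quoted figure of about £0.80 given that both inputs are rounded.

```python
# Figures as quoted in the talk summary (both are rounded).
spend_gbp = 80_000_000   # 2007 spend on e-journal licences
downloads = 102_000_000  # "more than 102 million" downloads

# Average cost of one download across the whole sector.
cost_per_download = spend_gbp / downloads

print(f"£{cost_per_download:.2f} per download")
```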

There has been tremendous end-user take-up of these resources. The number of downloads roughly doubled from 2004 to 2007, which represents 21.7% per annum growth in downloads over that period. The core proposition of providing online articles is “very popular” among researchers.
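The relationship between an overall increase and a per-annum rate can be checked with the standard compound-growth formula. The sketch below assumes that 2004 to 2007 counts as three annual steps, which is my reading rather than something stated in the talk; the function names are illustrative. Under that assumption an exact doubling would need about 26% a year, while 21.7% a year compounds to about 1.8 times, so "doubled" is probably best read as a rounded description.

```python
def annual_growth_rate(ratio: float, years: int) -> float:
    """Per-annum rate that compounds to the given overall ratio."""
    return ratio ** (1 / years) - 1


def total_growth(rate: float, years: int) -> float:
    """Overall ratio reached by compounding a per-annum rate."""
    return (1 + rate) ** years


YEARS = 3  # assumption: 2004 -> 2007 treated as three annual steps

doubling_rate = annual_growth_rate(2.0, YEARS)
implied_ratio = total_growth(0.217, YEARS)

print(f"Rate needed for an exact doubling: {doubling_rate:.1%}")
print(f"Overall growth at 21.7% a year:    {implied_ratio:.2f}x")
```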

There has also been a rapid increase in the number of journals available at an average institution. The average number of titles per researcher is up from just above 4 to just below 8. {TC – Given the present economic environment it is likely these figures will decrease in the coming year, but they will certainly remain at a higher average level.}

Citation analysis is showing that users are drawing on more sources, and including more references per paper. The use of navigation and discovery tools, together with increased access, has created a situation where research is now more deeply founded in previous work.

University administrations are looking for clear and compelling justifications for the continued expense of information purchases and Mr. Rowlands thinks that compelling information is now available.

This change in availability has affected the information-seeking behavior of end-users. It is not surprising that Google is the “librarian's friend”. Many researchers are using gateways, such as Google, PubMed, etc., to get access to content. Examples of the increase in traffic abound. One OUP journal saw a two-fold increase in use after opening up its content to Google.

The access provided by online content is also having a profound impact on resource use. The convenience of 24 X 7 access is tremendous. 17% of activity is taking place on weekends and the “Working day is growing” with 1/3 of activity taking place outside of “normal office hours” of 9:00am – 5:00pm. This access was more difficult in a print-based world.

However, questions remain about whether efficient search is the same as, or necessarily yields, successful research. There is a strong negative correlation between the research rating of the scientists in institutions and the average session length on ScienceDirect. The most “successful researchers” were the group spending the least amount of time online with content. Trends pointed to the fact that the most successful researchers use gateways. Much more search activity is taking place outside the library, typically on services like PubMed, Google, and Google Scholar.

There was natural clustering of intensive use, and the figures for the differences between moderate, high and super users correlated significantly with outputs such as the number of papers produced, the amount of grant funding received and the number of PhDs the institution produces. In addition, while the average cost per download is consistent across institutions, the more active the institution, the less per article it paid.

Mr. Rowlands stressed that these data merely show associations, not causation. Nor do the data show any directionality. Is it that a lot of research creates demand for lots of information, or is it that research institutions put the resources in place for research, which in turn affects results?

The next stage of this research will look at historical information. Among the topics to be explored are the linkages between products, spending and outcomes. He is working to produce a computer model that shows, for example, the effect that an increase in the number of titles and/or downloads might have on research outcomes.

The initial information points to the fact that downloads and research outputs are like “gears on a bicycle” that move in tandem. The bigger one gear gets, the faster the other gear turns. Although the causality question still needs to be understood, knowing that the connection exists is a useful addition to knowledge about assessment and performance measurement.

{NB Disclaimer: Much of this summary is verbatim and/or paraphrased from Mr. Rowlands' talk – very little in this post is interpreted and should not be credited to me. Apologies to Mr. Rowlands for any errors.}

Labels: analysis, assessment, CIBER, data mining, e-journals, performance, research, RIN, uksg09, usage metrics, usage statistics

Monday, March 30, 2009

Solving organisation underload: rethinking scholarly communications to add new conceptual value

"Open Access is going now," says Jan Velterop. "So I feel I can talk about something else - Beyond Open Access." That 'something' is organisation underload.

Too much of our data is too deeply hidden - we struggle to get the most out of it. Jan suggests the problem is not information overload but 'organisation underload'; a lack of organisational conceptual structures to manage all this information. The information overload aspect is going to increase; think expanding communication mechanisms e.g. blogs, peer-reviewed wikis - and why, says Jan, are these not being initiated by publishers?

He uses water as an analogy for information. When there's just a bit, we take it in (we drink it). When there's too much, we have to devise a means to navigate it - a boat. We need to find ways of presenting knowledge that helps us to do something useful and immediate with it. This means not just publishing articles but creating visualisations of conceptually connected data - "this is where the future lies". And as the communications process changes, we will need to think again about what skills and workflows are required to manage it.

As scientists we have traditionally focussed on the detail; now we complement this with a step back to see the bigger picture of how things are connected. This isn't feasible in the traditional manner of ingesting research, but it's this lateral thinking that produces breakthroughs (revealing new conceptual connections).

Jan talks about Knewco's software that mines data for concepts rather than simply words, and (I paraphrase, but I think the essence is that it) breaks the data down into triples that codify the relationships between lots of different pieces of data. Mapping these connections can reveal a powerful picture but once you scale this to millions of triples there will be redundancy that needs to be removed in order to focus on the valuable connections. This kind of semantic data analysis and mapping can make scientific literature even more useful by helping users find new, valuable sources of information - using the connections between literature to support new browse options in library catalogues and publisher websites.
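To make the redundancy point concrete, here is a minimal sketch of collapsing duplicate triples. It is not Knewco's method, which the talk did not describe in detail; the placeholder concept names, the `normalise` rule (trim and lower-case) and the `collapse` function are all assumptions of mine for illustration.

```python
from collections import Counter

# A triple is (subject, predicate, object); the names are placeholders.
Triple = tuple[str, str, str]


def normalise(triple: Triple) -> Triple:
    """Trim and lower-case each part so trivial variants compare equal."""
    subject, predicate, obj = (part.strip().lower() for part in triple)
    return (subject, predicate, obj)


def collapse(triples: list[Triple]) -> Counter:
    """Merge duplicate triples, keeping a count of how often each was seen."""
    return Counter(normalise(t) for t in triples)


raw = [
    ("Concept A", "relates to", "Concept B"),
    ("concept a ", "relates to", "concept b"),  # same statement, different form
    ("Concept B", "relates to", "Concept C"),
]

unique = collapse(raw)
for triple, seen in unique.most_common():
    print(seen, triple)
```

Keeping the count rather than discarding it is a design choice: how often a connection recurs across the literature may itself be a useful signal of how well supported it is.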

[I have not done Jan's paper justice and would urge you to check out the wealth of additional interesting comments rippling around the Twittersphere (#UKSG09) - and must acknowledge that if I have managed to grasp his thesis at all it is because this is almost exactly what powers Publishing Technology's pub2web platform, which I have spent a good long time getting my head round!]

Labels: data mining, information, information overload, semantic web, uksg09

Wednesday, April 09, 2008

There is 99% more to do: realising the potential of scientific data

Publishers, says Peter Murray-Rust, are not helping in the dissemination of scholarly research: they are destroying it by locking it up or scrambling the data in bad formats (PDFs) and by failing to publish so much of it. "There is 99% more to do," and if publishers aren't doing it then scientists will do it themselves. They're not interested in the human-readable discourse, they're interested in the data behind it. We're still emulating the way the Victorians shared information, and barely including any data in it, or enabling that which is there to be readily extracted and re-used.

Some publishers do publish the data alongside the articles - reams of additional information which help to prove that the ensuing article is scientifically accurate. In some cases the data is protected by copyright, or firewall barriers, which Murray-Rust considers inappropriate. "This is not a work of art to be copyrighted by the publisher; this is a scientific fact and should be freely available ... We have to get away from this culture of restricting the flow of scientific information."

"Young people", however, are not held back from the future and "have no fear of changing the world". Nick Day's Crystal Eye robot searches for crystallography on the web (this data is usually exposed explicitly, and not subject to copyright, but the data is not well structured). Nick has stored his data in RDF so that it can be mashed up with other data - plotted on a map, for example, to demonstrate the changing geographical balance of research output. "We've got to get into the habit of publishing data, as well as text." When the data is not published, we need to resort to text mining. The Oscar tool (written by Cambridge undergrads) "reads" documents and mines them for specific subjects (e.g. chemistry). But, copyright restrictions on scientific literature limit widescale data mining.

Murray-Rust admits to being "slightly polemic" as he accuses publishers of "being desperate" to prevent literature being more widely opened up. Talis and the Royal Society of Chemistry, on the other hand, he praises for the promulgation of open data and their support for some of the activities discussed. Semantic authoring presents a huge opportunity, but "science will be impoverished" until data is widely, openly available.

IOP's Jerry Cowhig (and chair of STM) tries to redress the balance: everything scientists did could simply be given away to each other, but what publishers contribute to the process does incur costs, and all the giant leaps of recent years (the move to online publishing) have been funded by publishers. With costs of $2,000-3,000 per paper, is it reasonable to expect publishers to then give the content away free? Scientists do acknowledge that publishers' role is important. PM-R, in reply, points to the Wellcome Trust's initiative to make data available and then pay for publishing upfront: funder, rather than reader, paying. CERN too are funding SCOAP3 to support upfront open publishing. Publishers will be remunerated but at the opposite end of the process. It's not an easy transition to make but it is a viable way forward. JC agrees; a wider debate is necessary, but certainly the NIH model of mandated deposit is "nobody pays" rather than reader- or author-pays.

Labels: data mining, mashup, raw data, rdf, supplementary data, text mining

Wednesday, April 18, 2007

A beginner's guide to mining, and why you shouldn't do it anyway

Geoffrey Bilder contends that, when asked to deliver this session for UKSG, "I knew nothing about text mining". By the end of today's session, I suspected this was purely a comedy opener - either that, or he's really done his homework in the meantime.

Bilder promised to help us understand the concept of text mining and reach the stage where "you can avoid having to do it". He began by clarifying what data mining is *not*:

Data mining is not information retrieval. Tools which filter and refine searches to find specific bits of information are retrieving, not mining, data.
Data mining is not information extraction. Tools that allow you to extract and normalise data from many sources, for further analysis, are extracting, not mining, data.
Data mining is not information analysis. Tools that allow you to load, manipulate and analyse data are analysing, not mining, data.

However - put these together and you may have something closer to the concept of data mining. Data mining collates information - masses of it, and perhaps seemingly disparate - and looks at it in a new way to reveal something new; something previously unknown. Bilder cites an apocryphal example of data mining which despite lack of veracity espouses the spirit: a supermarket discovered that people who buy nappies (sorry, Geoff; I can't bring myself to use the word d*apers) will also often buy beer. More prosaically, data mining helped researchers make the connection between magnesium deficiency and migraine.
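The supermarket story is essentially about counting which items turn up together. The sketch below shows one simple way to do that; the baskets are invented toy data (in keeping with the apocryphal example), and `pair_counts` and `confidence` are illustrative names of mine, not anything Bilder presented.

```python
from collections import Counter
from itertools import combinations

# Invented shopping baskets, purely for illustration.
baskets = [
    {"nappies", "beer", "bread"},
    {"nappies", "beer"},
    {"bread", "milk"},
    {"nappies", "milk", "beer"},
    {"beer", "crisps"},
]


def pair_counts(baskets: list[set[str]]) -> Counter:
    """Count how many baskets contain each pair of items."""
    counts: Counter = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical order, e.g. (beer, nappies).
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(baskets: list[set[str]], antecedent: str, consequent: str) -> float:
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


print(pair_counts(baskets).most_common(3))
print("nappies -> beer:", confidence(baskets, "nappies", "beer"))
print("beer -> nappies:", confidence(baskets, "beer", "nappies"))
```

Note that the two directions differ: in this toy data every nappy buyer also buys beer, but not every beer buyer buys nappies. The mining step only surfaces the association; explaining it is still up to a human.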

Text mining is an extension of data mining. There's a false belief out there that people want to read scholarly articles - yet lots of evidence that suggests they are doing everything within their power to avoid reading, because they can't keep up with the literature. Text mining helps us to extract the core facts - from data that is designed for human, not machine, reading. It parses texts for data which can be reliably extracted and interpreted to create keyword-type labels for that text.

Bilder showcased the Gate tool (General Architecture for Text Engineering) and noted that it has more or less accuracy/value depending on the subject area and type of text being mined. But then comes the crunch: "the thing that keeps striking me is: if hiding information in unstructured text is a problem, shouldn't we be exploring new ways to publish?"

So Bilder proposes some new approaches which we could deploy to help users avoid text/data mining in future. He used an initial example of human reading being able to identify the different reasons why words in different types of phrase might be italicised (for emphasis; because the word is foreign; etc). He then showed the machine-readable version of the example, which would require the words not simply to be tagged with italic tags, but to be tagged with more useful, more granular tags denoting the different meanings intended by the italicisation. Bilder cited IngentaConnect's semantic tagging of data which can then be machine read by, for example, social bookmarking tools and RSS readers.

He then introduced Nature Publishing's Open Text Mining Initiative which moves beyond tagging of metadata to tagging of full text, to enable researchers to make use of a full article without necessarily having access to the human-readable full text. An OTMI file pre-identifies the number of times particular words appear in the article, and includes out of order snippets - so that a text mining tool can make use of the text, but humans cannot read it. OTMI thus allows providers to open up paid archives of content to allow machines to mine it, thus making it more useful for users.
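The idea of exposing word counts plus out-of-order snippets can be sketched in a few lines. This is only an illustration of the concept as described above; it does not reproduce the actual OTMI file format, and the function names, the sentence-level snippet size and the sample text are all assumptions of mine.

```python
import random
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count how often each word appears in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def shuffled_snippets(text: str, seed: int = 0) -> list[str]:
    """Split the text into sentences and return them in shuffled order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    rng = random.Random(seed)  # seeded so the output is repeatable
    rng.shuffle(sentences)
    return sentences


def mining_view(text: str, seed: int = 0) -> dict:
    """Machine-oriented view of an article: counts plus out-of-order snippets."""
    return {"counts": word_counts(text), "snippets": shuffled_snippets(text, seed)}


article = "Text mining extracts facts. Mining needs structured text. Readers avoid reading."
view = mining_view(article)

print(view["counts"].most_common(3))
for snippet in view["snippets"]:
    print("-", snippet)
```

A tool can still compute term statistics and inspect local context from such a view, while a human would find the shuffled snippets a poor substitute for the article itself.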

But oh, says Bilder, so much more is possible (and everyone in the room sits wide-eyed with wonder at this emerging new dawn).

The semantic web, he reminds us, is "web as database", where every item of information is categorised to aid its integration and usage elsewhere in the web. Information items are identified as either subjects (Bill), predicates (is the brother of) or objects (Ben), which are then linked together in a simple data structure called a "triple" (Bill is the brother of Ben). A query language (such as SPARQL) can be pointed at an RDF data file (made up of triples) thus enabling the web to be queried in a way that was previously restricted to databases.
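A minimal sketch of the triple idea, using plain Python tuples as a stand-in for an RDF file and a small pattern-matching function as a stand-in for a query language. This is not SPARQL and not a real RDF store; the `match` function is hypothetical, and every triple other than "Bill is the brother of Ben" is invented for illustration.

```python
from typing import Optional

# A triple is (subject, predicate, object).
Triple = tuple[str, str, str]

triples: list[Triple] = [
    ("Bill", "is the brother of", "Ben"),
    ("Ben", "is the brother of", "Bill"),  # invented for illustration
    ("Bill", "knows", "Weed"),             # invented for illustration
]


def match(
    triples: list[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> list[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# "Who is the brother of Ben?"
print(match(triples, predicate="is the brother of", obj="Ben"))

# "What do we know about Bill?"
print(match(triples, subject="Bill"))
```

Leaving a slot empty plays roughly the role that a variable plays in a query language: the question is asked by stating the parts of the triple that are known and letting the data fill in the rest.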

Given that we *can* provide data in such a well-tagged and structured way, users shouldn't *have* to data mine. It's like the early evolution of publishing - once we had created the concept of page numbers and tables of contents, wasn't it only logical to then implement these in order to make life as easy as possible? "Before we go out and get everybody text mining, I think we should ask ourselves the question: why are we publishing text? We can also publish data. We don't have to strip it out, we can supplement it and help our users."

For a full moment there was an awestruck silence - and then, as testament to Bilder's ability to make non-technical audiences comprehend densely technical subjects, the questions came.

Where might this RDF data come from - who has to create it? Bilder replies that publishers generally have it and are already doing things with it e.g. sending it to CrossRef. Plus Nature's OTMI has a tool that can convert data from the PubMed DTD to OTMI.

How many researchers are attempting to do text analysis in this way - is it a small number but likely to grow, or? Bilder says "a lot of organisations [e.g. PubMedCentral] justify what they do on the basis that the data they collate will be data mined". He notes that it's not, of course, necessary for data to be gathered in one place, as machines that can read data can also retrieve it.

What's the typical publisher policy, given that text mining activities have in the past set off the security systems and brought up IP blocks? Bilder notes that agreements may be necessary between miner and provider to ensure the activity can take place. Any interface can create an area for this kind of usage of its data.

Labels: data mining, gate, otmi, rdf, semantic web, text mining, triple