Monday, April 03, 2006

It's Monday morning ... "Mice Love Lard"

UKSG has officially started, and first on the podium (following retiring chairman Keith Courtney's welcome, above) is Carole Goble from the University of Manchester, with an excellent review of how workflows can be employed to better connect researchers to the collective content they use. My on the spot notes below:

Bioinformaticians' (people working in life sciences) daily work = identifying new overlapping sequences of interest -- looking them up in databases and annotating to indicate similarity to genetic sequence under investigation.

Example: 35 different resources; all with web interfaces; many publication-centric. Copy and pasting content from different resources, annotating by hand. Can't replicate or log activity to see if it's been done accurately

Bioinformaticians do not distinguish between data and publications; publishers need to recognise there's not a difference between these 2 types of content for users.

Heretical view: CG doesn't read journals -- but does read content on a pre-print service (journals are outdated).
Where conference papers turn into journal papers -- the first iteration may well be the Powerpoint.
"Google is the Lord's work" -- "I haven't been to the library for 14 years!" -- can find it from laptop and send a PhD student to the library if really necessary ...

Workflows: computerising the research process. Enabling machines to interoperate and execute the necessary processes.
"Workflow at its simplest is the movement of documents and or tasks through a work process" (Wikipedia)
Simple scripting language specifies how steps of a pipeline link together -- hides the backend fiddling about
Linking together and cross referencing data in different repositories -- including serials.
Everthing needs to be accessible to the workflow machinery -- including serials.
Results can then be annotated -- semantic metadata annotation of data -- *and* provenance is tracked accurately for future checking/use. You can then reuse, or amend and reuse, your workflow. So the workflow protocol itself becomes valuable, not just the data (therefore need to test thoroughly to make sure it runs oin different platforms etc.) CG cites Taverna Workbench research project: still just a flaky research project but 150 biocentres already using it. And it's just one of many workflow systems.

Workflows can cut processes down from 2 weeks to 2 hours. Publishing workflows enables them to be adapted and shared throughout the community.
e.g. Use of PubMedCentral portal to make into web service for machines to read. Life sciences databases interlink e.g. Interpro links to Medline - these links can be used to retrieve article. XML result is "just" part of the workflow and can be processed and used further down the worklow. Extra value service e.g. Chilibot -- text mining, sits on top of PubMed and tries to build relationship between genes, proteins. Can again be made into a computable workflow. (Using this workflow, the scientist was able to discover that Mice Love Lard.)

Some results will need somebody to read them! -- mixture of machinery and people.

Termina software (Imperial College & Univ Manchester?) looks for terms and recognises them to associate them with a term from a gene ontology -- using text mining -- but would be easier if text mining wasn't necessary i.e. if terms could be identified and flagged at point of publication. The information/knowledge (that these terms are controlled vocabulary) is there at the point of publication -- so why lose it, only to have to reverse engineer it later.

(This reminds me of Leigh Dodds' paper "The Journal Article as Palimpsest", given at Ingenta's Publisher Forum in Dec 2005 -- view slides .pps).

Several projects working on this -- Liz Lyon's eBank project "confusogram" -- escience workflows & research feeding institutional repositories, but also conference proceedings etc. At the time of data creation, annotation is done -- publication & data are deeply intertwined -- breaking up the silo between data, experiment & publication.

Active data forms a web of data object and publications -- all combined together. Workflows also offer provenance tracking - at the point of capture, giving you evidence for your publication which should also be used within the publication.

Web services, workflows
-> publications need to be machine-accessible.
-> Licensing needs to work, so workflows can be shared
-> DRM, authorisation, authentication all need to work
Integration of data and publications
->workflows need to link results -- need common IDs
Semantic mark-up at source
->need better ways to interpret content
Text mining
-> retro-extraction is more useful if it can read full text not just abstract

Why isn't workflow/data published with publications?
-- privacy/intellectual property; workflows give too much away! Need to be slightly modified before publication. They do also need to be licensed -- what model? -- to enable reuse/sharing of results/workflows.

o machines, not just people, are reading journals
o if journals are not online, they are unread
o workflows are another form of outcome which should be publishing alongside data, metadata and publications
o Google rocks!

Q: Anthony Watkinson: what should we do to support our editors to offer the best they can to life scientists?
A: Life scientists want semantic mark-up & web services, so they can avoid expensive, unreliable text-mining. So we need to be able to access the journal's content through a computational process -- and ensure that the same identifiers are being used across databases.

Q: Richard Gedye, OUP. Happy to see an OUP journal being used ... wrt controlled vocabularies-- how many of these are there? Should we ask our editors for their advice on which to use? Are there standards?
A: -- big US project bringing together all the bioontologies being developed in the community; controlled vocabularies only make sense if there's community consensus. Is very much the case in life sciences but different levels of endorsement.

Q: Peter Burnhill, EDINA: fingerprinting vs controlled vocab -- is the need to access full text primarily to discover relevant material, or to provide access to it?
A: Both, we want to be able to put it into the pipeline so need to enable access. But also need it for discovery, and primarily (now) for text mining.
Re. fingerprinting -- helping to create Controlled vocab as well as identifying common terms.
(to what extent is there a limit to controlled vocab and does it need to rely on a lower level identiication structure?)
You do need both -- and identifiers representing a concept, because the words being mined will change. Building controlled vocab is an entire discipline in itself...


Post a Comment

Links to this post:

Create a Link

<< Home