Tuesday 19 April 2011

Quick project update ...

(Extracted from previous post for the sake of brevity)


In terms of project progress, we have a workable and easily customizable (CSV-configurable) Marc21 to RDF-triples export script nearing completion. We hope to be able to share this towards the end of the project. Getting a suitable triple-store and associated software frameworks in place for a 'data.lib.cam.ac.uk' domain will be the next focus of our technical work. Work also continues on identifying records for sharing, and on internal discussions of licensing issues within the project, which remain the main barrier to eventual data release. We've also been in contact with Eric Childress and his colleagues over at OCLC about enhancing our data with identifiers for the FAST and VIAF services.
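To give a flavour of what 'CSV configurable' could mean in practice, here is a minimal Python sketch. It is not the project's export script: the mapping columns, the choice of Dublin Core predicates, the example.org record URI and the pre-parsed (tag, subfield, value) tuples are all assumptions made for illustration. A real exporter would read Marc21 with a proper parser and would need to handle far more cases.

```python
import csv
import io

# Hypothetical mapping file: Marc21 tag, subfield code, RDF predicate URI.
# Editing this CSV changes the output without touching the code.
MAPPING_CSV = """tag,subfield,predicate
245,a,http://purl.org/dc/elements/1.1/title
100,a,http://purl.org/dc/elements/1.1/creator
260,b,http://purl.org/dc/elements/1.1/publisher
"""


def load_mapping(text):
    """Return {(tag, subfield): predicate} from the CSV configuration."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["tag"], row["subfield"]): row["predicate"] for row in reader}


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def record_to_ntriples(subject_uri, fields, mapping):
    """Turn (tag, subfield, value) tuples into N-Triples lines.

    Fields with no entry in the mapping are skipped.
    """
    lines = []
    for tag, subfield, value in fields:
        predicate = mapping.get((tag, subfield))
        if predicate is None:
            continue
        lines.append(f'<{subject_uri}> <{predicate}> "{escape_literal(value)}" .')
    return lines


if __name__ == "__main__":
    # Made-up record, already broken into (tag, subfield, value) tuples.
    example_fields = [
        ("245", "a", "An example title"),
        ("100", "a", "Example, Author"),
        ("999", "z", "local field with no mapping"),
    ]
    mapping = load_mapping(MAPPING_CSV)
    for line in record_to_ntriples("http://example.org/record/1", example_fields, mapping):
        print(line)
```

The point of the design is that the people who know the data can adjust which fields are exported, and under which predicates, by editing a spreadsheet rather than a program.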

Presentation at the 'Open Data Open Doors' event and other musings ...

Yesterday in Manchester, I was asked to give a brief presentation on our reasons for pursuing Open Data. Beyond the "it's a good thing to do" arguments (which better people than myself have put across more ably), I wanted to try and give an impression of the 'internal' reasons for doing so. In particular, I was keen to draw on real-world examples, including the work done by Rufus Pollock in estimating the size and growth of the printed public domain from CUL's bibliographic data. I also touched on my own personal ideas for resource discovery services at a national level, and how libraries could be providing developer-orientated services.



There were many useful discussions on both Open and Linked data that day, with Paul Walk at UKOLN on-hand to remind us that we should not jump at lumping the two together, and that each was itself something of a band-wagon.

A real cost-driven business case for opening up data was mentioned, which gave me cause for concern. Benefits of publishing data will only be fully realized when developers get to produce useful outputs, which could take several cycles to emerge. Identifying successes outside of cultural heritage was seen as a good way to sidestep this.

Much attention was given to licensing, and the library-centric issues of record ownership again came to the fore. For many years, and in pre-Internet eco-systems, libraries and librarians have benefited greatly from shared practices and resources in cataloguing. It would be a real shame to let the technical and legal frameworks developed to support previous generations of activity get in the way of finding better ways to share data between and beyond the library community.

The linked data approach of the Comet project was compared to the formidable API-orientated work on Jerome, taking place over at Lincoln. There was some discussion over the relative merits of each approach.

My personal take right now is that both have fairly separate use cases, and that publishing large amounts of data as RDF (or in 'community' formats such as MODS, Marc21 etc.) will be more useful for aggregation services than straight API provision, but that any eventual shared data service should itself expose data through APIs of the highest quality. Thus the work done by Jerome will be of great importance to the RDTF no matter which way things move. Lincoln are also themselves gaining an excellent platform for future service development.

As I argued in my talk, Linked Data still has a high entry-bar, and many developers are much happier with a simple bit of JSON than with XML/RDF. RDF may not be the easiest means for aggregation (OAI-PMH works for me) but it's arguably a great tool for sharing library data beyond the library community in bulk. Apart from anything else, self-describing data means we don't have to explain Marc21 to people with useful things to do.

A third alternative, not currently being investigated by the RDTF (to my knowledge) would be crawler exposure of existing catalogues with RDFa or some kind of useful microformat in place.
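As a rough idea of what that might look like, the Python sketch below renders a catalogue record as an HTML fragment carrying RDFa 1.1 Lite attributes (vocab, resource, property), so that a crawler reading the ordinary record page could pick out the same statements a person sees. The function name, the use of Dublin Core as the vocabulary and the example.org URI are my own assumptions for illustration, not anything an existing catalogue does.

```python
from html import escape

# Assumed vocabulary for the illustration: Dublin Core elements.
DC_VOCAB = "http://purl.org/dc/elements/1.1/"


def record_to_rdfa(record_uri, fields):
    """Render a record as an HTML fragment with RDFa 1.1 Lite attributes.

    fields maps a Dublin Core element name (e.g. "title") to its text value.
    """
    rows = [
        f'  <span property="{escape(name)}">{escape(value)}</span>'
        for name, value in fields.items()
    ]
    opening = f'<div vocab="{DC_VOCAB}" resource="{escape(record_uri)}">'
    return "\n".join([opening] + rows + ["</div>"])


if __name__ == "__main__":
    # Made-up record for demonstration only.
    print(record_to_rdfa(
        "http://example.org/record/1",
        {"title": "An example title", "creator": "Example, Author"},
    ))
```

The attraction is that nothing new has to be hosted: the markup rides along with pages the catalogue already serves.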

I raised a point which no-one seemed able to answer: what types of license are applicable to feeds of data, i.e. a JSON or XML API such as those we provide at www.lib.cam.ac.uk/api, or even an Atom/RSS feed?
Would Creative Commons licenses suffice, or do such feeds need the data-specific Open Data Commons licenses? If anything, they are more of a service than a resource. How can we signal complete openness (or otherwise) in easily understandable terms?