Thursday, 28 July 2011

Final post

It's time for the final project post for COMET. Here we summarize the major outputs of the past six months, the skills we've gained and the most significant lessons we've learnt. We'll also take a look at what could be done to follow up on the project.

COMET was perhaps overly ambitious for a six month project, but we've made some firm progress in a number of areas relating to libraries and their open distribution of data.

Major outputs
- Data ownership analysis - a document describing the major sources of data in the Cambridge University Library catalogue, with comments on ownership and licensing around re-use.

- Workflow proposal and tool for record segmentation by vendor code, based on the above work. Suggests a methodology for sorting records when a vendor specifically requests a license other than PDDL.

- Marc21 to RDF triples conversion utility - a standalone tool designed to get data out of the dead Marc21 format and into something better, quickly. It features extensive CSV-based customization; see the readme file for more details. Our digital metadata specialist Huw Jones was largely responsible for making this happen.

- Our first run at a library-centric open data service. It includes application framework code for the above - a PHP / MySQL application framework to store and deliver RDF data in a variety of formats, with a flexible SPARQL endpoint. It also includes an experimental Library of Congress Subject Headings enrichment utility. A 'getting started' document covers installation and data loading. (For a flashier alternative, take a look at the Open Biblio suite.)

- An interesting sideline into the world of microdata and search engines

- Talks and presentations on Open Bibliographic Data at Birmingham and Manchester
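The CSV-driven mapping idea behind the Marc21 conversion utility mentioned above can be sketched roughly like this. The field tags, predicates and output here are illustrative only, not the tool's actual configuration:

```python
# Hypothetical sketch: a mapping table decides which MARC field/subfield
# becomes which RDF predicate, and matching values become triples.
MAPPING = {  # (MARC tag, subfield) -> predicate URI (illustrative choices)
    ("245", "a"): "http://purl.org/dc/terms/title",
    ("100", "a"): "http://purl.org/dc/terms/creator",
}

def record_to_triples(record_uri, marc_fields):
    """marc_fields: iterable of (tag, subfield, value) tuples.
    Returns N-Triples lines for every field the mapping knows about."""
    triples = []
    for tag, sub, value in marc_fields:
        predicate = MAPPING.get((tag, sub))
        if predicate:
            triples.append('<%s> <%s> "%s" .' % (record_uri, predicate, value))
    return triples

fields = [("245", "a", "A Book"), ("100", "a", "Someone"), ("999", "z", "ignored")]
for line in record_to_triples("http://example.org/bib/1", fields):
    print(line)
```

The appeal of keeping the mapping in CSV is that cataloguers can adjust it without touching code.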

Next steps
Publishing 'more open/linked data' would be useful, but data alone will not solve the challenge of improving resource discovery in the UK cultural sector. What follows is an eclectic list of ideas and musings on next steps the Discovery programme could take, with some deeper focus on RDF:

1) Usable services for a wider audience
Open bibliographic data is one thing, but a certain level of skill and understanding is required to fully appreciate it - a criticism often levelled at the wider open data movement. To spread the word and enthuse a wider audience beyond 'data geeks', it would be great to see working services built on a framework of open data, or at least some impressive tech demos. (It's worth mentioning Bibliographica here, which is already a great step in this direction ...)

2) RDF
If RDF is to continue in use as a mechanism for publishing open bibliographic data, its application needs further thought and development. Here are four suggestions:

2.1) Move beyond pure bibliography into holdings data.
In library systems and services, the real interactions that matter to library users are focused around library holdings. This data could potentially be published openly, and modeled in RDF. Links could be established to activity data to provide a framework for user driven discovery services.
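To make the idea concrete, here is a minimal sketch of what a holding modelled as RDF triples might look like. The URIs and vocabulary are entirely hypothetical - a real effort would need an agreed holdings ontology:

```python
# Hypothetical sketch: emit N-Triples linking one holding (item) to its
# bibliographic record, with a location and shelfmark attached.
def holding_triples(item_uri, bib_uri, location, shelfmark):
    """Return N-Triples lines for a single library holding."""
    def lit(s):  # quote a plain literal
        return '"%s"' % s.replace('"', '\\"')
    return [
        "<%s> <http://example.org/vocab/holdingOf> <%s> ." % (item_uri, bib_uri),
        "<%s> <http://example.org/vocab/location> %s ." % (item_uri, lit(location)),
        "<%s> <http://example.org/vocab/shelfmark> %s ." % (item_uri, lit(shelfmark)),
    ]

for line in holding_triples("http://example.org/item/1",
                            "http://example.org/bib/123",
                            "Main Library", "QA76.9"):
    print(line)
```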

2.2) 'Enliven' linked RDF data.
Like most open bib data releases, ours is a static dump of our catalogue at a single point in time. It would be great to see pipes and processes in place to reflect changes and possibly track provenance. This is not as simple as it sounds: do we provide regular full updates, or track incremental changes?
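One way to think about incremental changes is as a set difference between two dumps. A minimal sketch, treating each dump as a set of N-Triples lines:

```python
# Sketch: diff two catalogue dumps to get the additions and retractions
# a downstream consumer would need to apply to stay in sync.
def diff_dumps(old_lines, new_lines):
    """Return (added, removed) triples between two dumps."""
    old, new = set(old_lines), set(new_lines)
    return sorted(new - old), sorted(old - new)

old = {'<s> <p> "a" .', '<s> <p> "b" .'}
new = {'<s> <p> "b" .', '<s> <p> "c" .'}
added, removed = diff_dumps(old, new)
print(added)    # ['<s> <p> "c" .']
print(removed)  # ['<s> <p> "a" .']
```

The hard part in practice is not the diff itself but attaching provenance and publishing the changesets reliably.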

2.3) Better ways to get to RDF.
RDF data is valuable in its own right, but arguably needs easier access methods than SPARQL. Combining RDF data with better indexing and REST API technologies would be useful in widening its access and making it a more 'developer friendly' format. Thankfully, many RDF based tools offer this functionality, including the Talis platform. The Neo4J graph database technology also looks promising.
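A minimal sketch of the kind of 'developer friendly' access meant here: index triples by subject so a simple REST-style lookup can return JSON, rather than requiring a SPARQL query. The triples and predicate names are illustrative:

```python
# Sketch: a tiny subject index over (subject, predicate, object) tuples,
# and the JSON a GET /resource?uri=... handler might return for one subject.
import json
from collections import defaultdict

def build_index(triples):
    index = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        index[s][p].append(o)
    return index

triples = [
    ("http://example.org/bib/1", "dc:title", "A Book"),
    ("http://example.org/bib/1", "dc:creator", "Someone"),
]
index = build_index(triples)

def get_resource(uri):
    """JSON description of one subject, empty object if unknown."""
    return json.dumps(index.get(uri, {}), sort_keys=True)

print(get_resource("http://example.org/bib/1"))
```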

2.4) Recommendations for RDF vocabularies and linking targets for linked bibliographic data.
I think this needs to happen soonish, otherwise we will keep producing different attempts at the same thing over and over again. It does not need to be complete or final, but a useful set of starting places and guidelines for bibliographic RDF is required. The Discovery programme is well placed to provide these recommendations to the UK sector. That would be a great start internationally. Then we can just get on with producing it and improving it :)

3) Cloud based platforms and services for publishing bibliographic data

COMET has shown that this is not yet as easy or cheap as it could be. With library systems teams and infrastructure often overstretched, taking on new publishing practices that do not have an obvious immediate in-house benefit is a hard sell.

To make it more palatable, better mechanisms for sharing are needed. The Extensible Catalog toolkit already provides a great set of tools for doing this with OAI-PMH. Imagine a similar but cloud-based data distribution service whereby all a library has to do is (S)FTP a dump of its catalogue once a week. This is transformed on the fly into a variety of formats (RDF, XML, JSON etc.) for simple re-use, with licenses automatically applied depending on set criteria.
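The format fan-out such a service might perform can be sketched as follows; the record structure and converters here are illustrative only:

```python
# Sketch: one uploaded record serialised to several formats on the fly.
import json

def to_json(record):
    return json.dumps(record, sort_keys=True)

def to_xml(record):
    fields = "".join("<%s>%s</%s>" % (k, v, k) for k, v in sorted(record.items()))
    return "<record>%s</record>" % fields

CONVERTERS = {"json": to_json, "xml": to_xml}

def fan_out(record, formats=("json", "xml")):
    """Return {format_name: serialised_record} for each requested format."""
    return {fmt: CONVERTERS[fmt](record) for fmt in formats}

record = {"title": "A Book", "author": "Someone"}
outputs = fan_out(record)
print(outputs["xml"])
print(outputs["json"])
```

A licence statement could be attached at the same stage, chosen by the same sort of criteria the segmentation work above uses.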

4) Microdata, Microformats and sitemaps
This is how Google and Bing want to index sites, and thus how web-based data sharing and discovery largely happens outside academia and libraries. The rest of the Internet gets by on these technologies - could they be applied to the aims of the Discovery programme? What challenges stand in the way, and how do they compare to current approaches? We've made some first steps into this area by using microdata in a standard library catalogue interface.
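For illustration, a catalogue page might embed schema.org Book microdata along these lines; the helper below is hypothetical and just builds the markup as a string:

```python
# Sketch: generate a schema.org/Book microdata snippet for one record.
def book_microdata(title, author):
    return (
        '<div itemscope itemtype="http://schema.org/Book">'
        '<span itemprop="name">%s</span>'
        '<span itemprop="author">%s</span>'
        "</div>" % (title, author)
    )

print(book_microdata("A Book", "Someone"))
```

Search engines can lift this markup straight out of a normal HTML catalogue page, with no separate data endpoint required.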

Evidence of reuse
We were late in the day releasing our data, so re-use has so far been limited. We've been trying to consume it ourselves in development and our colleagues at OCLC and the British Library have provided useful feedback. We are glad to see it included in the recent developer competition. We've pledged to support our data outputs for a year, so will respond actively to any feedback from consumers over that time.

This project entailed a large amount of 'stepping up', not least on my part. Other than the odd Talis presentation, I had only a conceptual understanding of RDF. Now I've helped write tools to create it and worked with RDF stores and application frameworks. The time out to gain this skillset has been invaluable for me. The book 'Programming the Semantic Web' has saved my sanity on a number of occasions.

In terms of embedding this knowledge, our SPARQL workshop is designed to provide the first rung on the ladder for librarians and developers interested in RDF.

Despite this, we've suffered by doing everything in house, and the steep learning curve around RDF has meant that progress has not always been as fast as it could have been. Our current datastore holds over 30 million triples, and we've still not been able to load all of our data output. This has hit the limits of ARC2/MySQL, and we will need a more robust back-end if we are to progress further.

Our RDF vocab choice is also a bit of a shot in the dark in places, and there are aspects of our data structure that could do with improvement.

If we are to continue to work with RDF data, we would like to bring in external assistance on scaling and development, as well as RDF vocab and modelling.

Most significant lessons
To finish, here are some random reflective thoughts ...

1) Don't aim for 100% accuracy in publishing data. In six months, with 2.2 million records that were written over 20 years in a variety of environments, this was never going to happen. I would hope that at least 80% of our data is fit for purpose. This is 80% more open data than we had six months ago.

2) Ask others. There are strong communities built around both open and linked data. Often, it's the same people. They can be intimidating, but are useful. With hindsight, we should have asked for community guidance on issues of vocab and technology choice earlier.

3) It's an oldie, but please use open source and don't code from scratch if at all possible. The ARC2 framework may have its limitations at our scale of data, but it allowed a workable data site hosting 25k records to be assembled in two days. God bless Github.
