Thursday 16 February 2012

Code moved to GitHub

I've finally transferred all code output from last year's project to GitHub. The Perl-based analysis and RDF conversion tools, as well as the PHP site source, are all available as a single repository.

https://github.com/edchamberlain/COMET

Monday 31 October 2011

Back with more data!

As announced by Jim Michalko on the OCLC research blog, we've launched another dataset!

It's yet more bib data, this time comprising over 600,000 records originating from Worldcat as RDF triples. We've also loaded most of this into our triplestore. OCLC have enhanced this data with links to the FAST and VIAF authority services.

Even better, the previous two datasets we released have also been enhanced with the same links. There are still some things that could be better, especially our vocab choices around VIAF expression, but the data is there.

This data is licensed under an ODC-By Attribution License and is one of the first to make use of OCLC's newly updated community norms (details here), their preference for licensing Worldcat data for re-use.

This is slightly in contrast to the pain-free PDDL we've managed to provide so far, but we and OCLC are interested to see what users will make of this. The attribution is handled at a dataset level and should be relatively easy to implement and maintain.

Dealing with attribution stacking was a major problem we encountered with COMET. That was partly due to Marc21's inability to manage multiple record identifiers well, which necessitated complex decision making regarding record ownership. Hopefully, the clear attribution policy set out here should be much easier to handle than the 'hobo stew' we encountered in our catalogue (as Jim puts it)!

I'd like to thank various folk at OCLC (especially our lead contact Eric Childress) for their support and patience over the past few months whilst we worked through a number of technical, administrative and legal points. They were voluntary partners on COMET but have given us a lot of time and assistance.

Next up, (when I find the time), will be enhanced links to Library of Congress subject headings and the recently released Name Authority File for everything in our triplestore.

Thursday 28 July 2011

Final post

It's time for the final project post for COMET. Here we summarize major outputs over the past six months, cover what we've gained in terms of skills and the most significant lessons we've learnt. We will also take a look at what could be done to follow up on the project.

COMET was perhaps overly ambitious for a six month project, but we've made some firm progress in a number of areas relating to libraries and their open distribution of data.




Major outputs
- Document on data ownership analysis - A document describing the major sources of data in the Cambridge University Library catalogue, and making some comments about ownership and licensing around re-use.

- Workflow proposal and tool for record segmentation by vendor code, based on the above work. Suggests a methodology for sorting records when a vendor specifically requests a license other than PDDL.

- Marc21 to RDF triples conversion utility - A standalone tool designed to get data out of the dead Marc21 format and into something better, quickly. It features extensive CSV-based customization; see the readme file for more details. Our digital metadata specialist Huw Jones was largely responsible for making this happen.
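The utility itself is written in Perl; as a rough Python sketch of the CSV-driven idea (the column names, predicates, record URI and mapping below are invented for illustration, not those of the actual tool):

```python
import csv
import io

# Hypothetical mapping in the spirit of the tool's CSV customization:
# Marc tag + subfield -> RDF predicate URI. Illustrative values only.
MAPPING_CSV = """tag,subfield,predicate
245,a,http://purl.org/dc/terms/title
100,a,http://purl.org/dc/terms/creator
260,c,http://purl.org/dc/terms/issued
"""

def marc_to_ntriples(record_uri, fields):
    """fields: list of (tag, subfield, value) tuples parsed from Marc21.
    Returns one N-Triples line per mapped field; unmapped fields are skipped."""
    mapping = {
        (row["tag"], row["subfield"]): row["predicate"]
        for row in csv.DictReader(io.StringIO(MAPPING_CSV))
    }
    triples = []
    for tag, sub, value in fields:
        pred = mapping.get((tag, sub))
        if pred:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'<{record_uri}> <{pred}> "{escaped}" .')
    return triples

triples = marc_to_ntriples(
    "http://data.lib.cam.ac.uk/id/entity/example",
    [("245", "a", "An example title"), ("100", "a", "Doe, Jane")],
)
```

The point of keeping the mapping in CSV is that cataloguers can adjust vocabulary choices without touching the conversion code.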

- data.lib.cam.ac.uk - Our first run at a library-centric open data service. It includes application framework code for the above: a PHP / MySQL application framework to store and deliver RDF data in a variety of formats, with a flexible SPARQL endpoint. It also includes an experimental Library of Congress subject headings enrichment utility. A 'getting started' document covers installation and data loading. (For a flashier alternative, take a look at the Open Biblio suite.)
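A SPARQL endpoint of this sort can be queried over plain HTTP GET. A minimal Python sketch (the endpoint path and result-format parameter are assumptions; check the getting-started document for the real details):

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually send the query

# Assumed endpoint path; illustrative only.
ENDPOINT = "http://data.lib.cam.ac.uk/sparql"

def build_sparql_request(query, fmt="application/sparql-results+json"):
    """Return a GET URL for a SPARQL query, asking for JSON results."""
    params = urlencode({"query": query, "format": fmt})
    return f"{ENDPOINT}?{params}"

url = build_sparql_request(
    "SELECT ?s ?title WHERE { ?s <http://purl.org/dc/terms/title> ?title } LIMIT 10"
)
# data = urlopen(url).read()  # would fetch the results over HTTP
```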

- An interesting sideline into the world of microdata and search engines

- Talks and presentations on Open Bibliographic Data at Birmingham and Manchester





Next steps
Publishing 'more open/linked data' would be useful, but data alone will not meet the challenge of improving resource discovery in the UK cultural sector. Here is an eclectic list of ideas and musings on next steps the Discovery programme could take, with some deeper focus around RDF:

1) Useable services for a wider audience
Open bibliographic data is one thing, but a certain level of skill and understanding is required to fully appreciate it, a criticism often levelled at the wider open data movement. To spread the word and enthuse a wider audience beyond 'data geeks', it would be great to see working services built around a framework of Open Data, or at least some impressive tech demos. (It's worth mentioning Bibliographica here, which is already a great step in this direction ...)

2) RDF
If RDF is to continue in use as a mechanism for publishing open bibliographic data, its application needs further thought and development. Here are four suggestions:

2.1) Move beyond pure bibliography into holdings data.
In library systems and services, the real interactions that matter to library users are focused around library holdings. This data could potentially be published openly, and modeled in RDF. Links could be established to activity data to provide a framework for user-driven discovery services.

2.2) 'Enliven' linked RDF data.
Like most open bib data, what we've published is a static dump of our catalogue at a single point in time. It would be great to see pipes and processes in place to reflect changes and possibly track provenance. This is not as simple as it sounds: do we provide regular full updates, or track incremental changes?
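One hypothetical incremental approach: fingerprint each record and diff successive dumps, publishing only the changeset. A sketch, with record shapes and ids invented for illustration:

```python
import hashlib
import json

def fingerprint(record):
    """Stable hash of a record's content, used to detect changes."""
    blob = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def incremental_diff(old_dump, new_dump):
    """Compare two dumps keyed by record id; return the update set."""
    added = [rid for rid in new_dump if rid not in old_dump]
    removed = [rid for rid in old_dump if rid not in new_dump]
    changed = [
        rid for rid in new_dump
        if rid in old_dump
        and fingerprint(new_dump[rid]) != fingerprint(old_dump[rid])
    ]
    return {"added": added, "removed": removed, "changed": changed}

old = {"r1": {"title": "A"}, "r2": {"title": "B"}}
new = {"r1": {"title": "A"}, "r2": {"title": "B (rev.)"}, "r3": {"title": "C"}}
diff = incremental_diff(old, new)
```

Even something this simple raises the provenance question: consumers need to know which dump a changeset applies to.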

2.3) Better ways to get to RDF.
RDF data is valuable in its own right, but arguably needs easier access methods than SPARQL. Combining RDF data with better indexing and REST API technologies would be useful in widening its access and making it a more 'developer-friendly' format. Thankfully, many RDF-based tools offer this functionality, including the Talis platform. The Neo4J graph database technology also looks promising.

2.4) Recommendations for RDF vocabularies and linking targets for linked bibliographic data.
I think this needs to happen soonish, otherwise we will keep producing different attempts at the same thing over and over again. It does not need to be complete or final, but a useful set of starting places and guidelines for bibliographic RDF is required. The Discovery programme is well placed to provide these recommendations to the UK sector. That would be a great start internationally. Then we can just get on with producing it and improving it :)

3) Cloud based platforms and services for publishing bibliographic data

COMET has shown that this is not yet as easy or cheap as it could be. With library systems teams and infrastructure often overstretched, taking on new publishing practices that do not have an obvious immediate in-house benefit is a hard sell.

To make it more palatable, better mechanisms for sharing are needed. The Extensible Catalog toolkit already provides a great set of tools for doing this with OAI-PMH. Imagine a similar but cloud-based data distribution service whereby all a library has to do is (S)FTP a dump of its catalogue once a week. This is transformed on the fly into a variety of formats (RDF, XML, JSON etc.) for simple re-use, with licenses automatically applied depending on set criteria.
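At its core, such a service is just a format dispatcher over parsed records. A toy Python sketch (record shape and output formats are illustrative; a real service would transform full Marc21 dumps and apply the licence chosen for each vendor segment):

```python
import json
from xml.etree.ElementTree import Element, SubElement, tostring

def to_json(record):
    """Serialize a flat record dict as JSON."""
    return json.dumps(record, sort_keys=True)

def to_xml(record):
    """Serialize a flat record dict as a simple <record> XML element."""
    root = Element("record")
    for key, value in record.items():
        SubElement(root, key).text = value
    return tostring(root, encoding="unicode")

# One converter per requested output format.
FORMATS = {"json": to_json, "xml": to_xml}

def publish(record, fmt):
    """Transform an uploaded record into the requested output format."""
    return FORMATS[fmt](record)

rec = {"title": "An example title", "creator": "Doe, Jane"}
```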

4) Microdata, Microformats and sitemaps
This is how Google and Bing want to index sites, and thus how web-based data sharing and discovery largely happens outside of academia and libraries. The rest of the Internet gets by on these technologies; could they be applied to the aims of the Discovery programme? What are the challenges standing in the way, and how do they compare to current approaches? We've made some first steps into this area by using schema.org microdata in a standard library catalogue interface.
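As a flavour of what such markup exposes, here is a minimal Python extractor run over an invented snippet (the properties are examples from the schema.org Book type, not our actual catalogue markup):

```python
from html.parser import HTMLParser

# Invented example of schema.org microdata markup, roughly the sort of
# thing we added to a catalogue record page.
SNIPPET = """
<div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">An example title</span>
  <span itemprop="author">Doe, Jane</span>
</div>
"""

class MicrodataParser(HTMLParser):
    """Collect itemprop -> text pairs from a microdata-marked page."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        # Remember the itemprop name (if any) of the tag just opened.
        self._current = dict(attrs).get("itemprop")

    def handle_data(self, data):
        if self._current and data.strip():
            self.props[self._current] = data.strip()
            self._current = None

parser = MicrodataParser()
parser.feed(SNIPPET)
```

Search engines do essentially this at scale: no SPARQL, no triplestore, just properties embedded in ordinary HTML.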




Evidence of reuse
We were late in the day releasing our data, so re-use has so far been limited. We've been trying to consume it ourselves in development and our colleagues at OCLC and the British Library have provided useful feedback. We are glad to see it included in the recent developer competition. We've pledged to support our data outputs for a year, so will respond actively to any feedback from consumers over that time.




Skills
This project entailed a large amount of 'stepping up', not least on my part. Other than the odd Talis presentation, I had only a conceptual understanding of RDF. Now I've helped write tools to create it and worked with RDF stores and application frameworks. The time out to gain this skillset has been invaluable for me. The book 'Programming the Semantic Web' has saved my sanity on a number of occasions.

In terms of embedding this knowledge, our SPARQL workshop is designed to provide the first rung on the ladder for librarians and developers interested in RDF.

Despite this, we've suffered by doing everything in house, and the steep learning curve around RDF has meant that progress has not always been as fast as it could have been. Our current datastore holds over 30 million triples, and we've still not been able to load all of our data output. This has hit the limits of ARC2/MySQL, and we will need a more robust back-end if we are to progress further.

Our RDF vocab choice is also a bit of a shot in the dark in places, and there are things with our data structure that could do with improvement.

If we are to continue to work with RDF data, we would like to bring in external assistance on scaling and development, as well as RDF vocab and modelling.




Most significant lessons
To finish, here are some random reflective thoughts ...

1) Don't aim for 100% accuracy in publishing data. In six months, with 2.2 million records that were written over 20 years in a variety of environments, this was never going to happen. I would hope that at least 80% of our data is fit for purpose. This is 80% more open data than we had six months ago.

2) Ask others. There are strong communities built around both open and linked data. Often, it's the same people. They can be intimidating, but are useful. With hindsight, we should have asked for community guidance on issues of vocab and technology choice earlier.

3) It's an oldie, but please use Open Source and don't code from scratch if at all possible. The ARC2 framework may have its limitations with our scale of data, but it allowed a workable data site hosting 25k records to be assembled in two days. God bless GitHub.

Where exactly DOES a record come from?

Early on in the COMET project, Hugh Taylor assembled a complex document attempting to describe problems inherent in understanding the origin of a Marc-encoded bibliographic record. It also included a thorough analysis of Cambridge University Library data and the number and nature of vendor codes contained therein.

We've updated this document with information on the various contracts and agreements associated with each vendor code to reflect the final work. The next problem was how to make sense of it all.

A major sticking point is related to Marc21 and its usage.

In our Marc data, we have four separate fields (015, 038, 994, 035) that could indicate ownership, some of which may be repeated multiple times in a record. There is to my knowledge no mechanism in Marc21 or AACR2 to indicate which field, and thus which vendor code, takes precedence over the others (although cataloguers have some 'community knowledge' in this area).

Furthermore, many vendors have changed the code and field they use over time. Most rely on prefixes; some are simply unhelpful strings of numbers.

In terms of practicalities, we need to ensure that:

1) Records from vendors who explicitly and contractually prohibit re-sharing in any format are excluded; this includes most ebook and ejournal records. Otherwise, there is no good reason not to share a record, although its origin may have an impact on license choice

2) In our case, records from OCLC are segmented due to a need to publish data from that vendor under an attribution license

3) Data from vendors who prefer non-Marc output to be shared openly but want Marc21 output restricted (RLUK and the BNB in our case) is segmented, so these records can be split off

4) Data produced in house (usually that with no identifier) can be segmented for clarification

5) Everything else from smaller/specialist record vendors is segmented together with a view to publishing openly

We've had to make some decisions over which field and vendor takes precedence based largely on this order of importance. To do this, we came up with a rough decision tree regarding record ownership:



The above JPG is also available as a scalable vector graphics (SVG) file created in MS Visio.
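As a flavour of what the tree encodes, here is a hypothetical precedence check in Python. The buckets, field-to-vendor assignments and prefixes below are invented for illustration; the real rules are in the chart and the released Perl script:

```python
def has_prefix(rec, tag, prefix):
    """True if any value in the given Marc field starts with the prefix."""
    return any(v.startswith(prefix) for v in rec.get(tag, []))

# Checked in order; the first match wins. These tests are examples only,
# not Cambridge's actual decision tree.
PRECEDENCE = [
    ("restricted", lambda r: r.get("ebook_vendor")),            # never shared
    ("oclc",       lambda r: has_prefix(r, "035", "(OCoLC)")),  # attribution licence
    ("rluk_bnb",   lambda r: "015" in r),                       # non-Marc output only
    ("in_house",   lambda r: not r.get("035")),                 # our own records
]

def segment(rec):
    """Return the first matching ownership bucket, else 'other'."""
    for bucket, test in PRECEDENCE:
        if test(rec):
            return bucket
    return "other"

bucket = segment({"035": ["(OCoLC)12345678"]})
```

Encoding the precedence as an ordered list keeps the messy judgement calls in one place, which matters when vendors' codes overlap within a single record.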


One of my final tasks on COMET was to take this decision tree and turn it into a script to export record data for our final exercises in data publishing. I've also released a Perl script as output for COMET on our code page. (A warning / apology: this script is as ugly as the situation it attempts to resolve. It was pulled together at the last minute and could really do with a rethink.)

Both the script and the chart reflect Cambridge's specific and current situation, but should hopefully be useful for those wishing to replicate this activity.

As a personal opinion, I see this confusion regarding ownership as a key barrier that prevents libraries from openly sharing their data.

Furthermore, it is important that we do NOT see a repeat of this problem with the next set of record container and delivery standards.

It's my worry that stacking attribution statements in records at the bibliographic level could lead to similar problems down the road. Attribution at a dataset level, with some indication of the relationship between a record and a dataset, seems more practical.

A standardization of practice across the library community with regards to licensing could help ease this pain in the future.

Because we always need more standards.

Wednesday 27 July 2011

More data and status of future updates

I'm pleased to announce our next data release at data.lib.cam.ac.uk - 1,741,080 records of linked RDF distributed under a Public Domain Dedication and License (PDDL).

  • This data is from two major UK record suppliers, RLUK and the British Library BNB. Both have indicated to us that they have no problem with records being redistributed as RDF, but would rather we did not redistribute Marc21. See a blog post on licensing for explanations as to why.
  • This dataset is bulk download only for now - we hope to have it in our triple-store in the future
Although major development work on the COMET project has ended, we will continue to open up as much of our catalogue as possible, publishing and updating datasets over the next year using the open source tools we've developed. In particular, we hope to carry data from OCLC under an Open Data Commons attribution license and enrich all datasets with FAST and VIAF links. We've been experimenting with these for the past month and results are promising.

In a follow-up post, I'll explain how we decided which data could be shared.

Friday 22 July 2011

Friday update ...

A few things on a Friday.

Firstly, I've written a small piece on getthedata.org about data.lib.cam.ac.uk and the various mechanisms for querying and retrieving data, also mirrored on our FAQ. May be of interest to those in the Discovery developer competition.

Development work is all but done, so we've also published the application framework code behind data.lib.cam.ac.uk on the code page. This PHP-based site provides a lightweight approach to RDF publishing and makes a great starting / exploration point for libraries wanting to publish data as RDF.
More details and a read-me are available on the code page. As with all our output, it's provided 'as-is' under the GPL, but we welcome feedback.

As the COMET deadline approaches next week, we are still working to release as much data as we can. Sadly, we are still waiting on final confirmation from some external bodies. As such we will work to publish and republish data using existing tools throughout the following year, as we can.

Tuesday 19 July 2011

Cost benefits

The JISC has asked us to blog on the cost benefits of providing open data. I'll give a rough indication of costs based on time spent, an idea of what I think the benefits may be, and an indication of how the two weigh up.

Costs:

1) Marc21 Data 'ownership' analysis - (5 days staff time at SP64)
Mapping and conversion of bibliographic information. An experimental and iterative process.

2) Marc21 to RDF data conversion - (2 developers at SP53)
Again, this has been drawn out through experimental work. Several methods and iterations were tried. Those aiming to repeat this may not incur the same cost.

3) Web infrastructure development and record curation - (2 developers at SP53)
A lightweight approach to development was taken using existing application frameworks. Time was also spent understanding underlying principles of RDF stores and associated best practice for linked data. Several iterative loads of data were undertaken in parallel with Marc to RDF conversion.

4) Hosting and sustainability costs - costs tbc
COMET's web infrastructure makes use of existing VM and MySQL infrastructure at CARET, so additional infrastructure costs were negligible and hard to determine. We've promised to keep the service running for a year.

5) Other stuff
Project management etc.

External benefits:
  • Substantial contribution to Open Bibliography - Open data is arguably a good thing, and whilst ours has flaws, it is hopefully good enough to be useful to others in its own right
  • Clarification on licensing agreements with record vendors - Much headway has been made into this issue by the COMET project, with some clarification on licensing preferences for RDF data from three major UK record vendors: OCLC, RLUK and the British Library. Down the line, we hope that these organizations will formalize their agreements with us so that others can benefit, which will hopefully help in publishing more data
  • Advice on how to analyse records to determine 'ownership', plus lightweight (Perl, PHP, MySQL based) tools to create and publish RDF linked data from Marc21
  • Experiments with FAST and VIAF - Two potentially useful data sources

In house benefits:
  • Community interaction - There is strong interest in Open Bibliography and its benefits. The University Library has also benefited greatly from its interaction with the open and linked data communities, in its work with OCLC and with others through the JISC Discovery programme
  • In house skills - We've gained vital in-house understanding of the design and publication of RDF. We've developed basic training materials around SPARQL for non-developers, which could pay off down the line

Summary:
External benefits clearly outweigh internal benefits, although as external benefits affect the whole library community, they also benefit us!

What's clear is that Open Data is not free data, at least not to us. We could have simply dumped our Marc21 or Dublin Core XML and been done with it, and for many that would have sufficed.

Instead, combining our wish to publish more Open Data with a need to learn about Linked Data (and thus lashing two fast bandwagons nicely together) has pushed the costs far higher.

However, by publishing linked data we've hopefully made our output more useful to a wider community than library metadata specialists, and in that sense added value.

More data being published means greater community feedback to draw upon, which should result in lower costs for those repeating this exercise.

It may indeed be several development cycles before we or others fully reap the benefits of this work. Alternatively, things could move in a different direction, with RDF based linked data falling by the wayside in favour of more accessible mechanisms for record sharing, in which case, our work could be useful in avoiding mistakes.