Friday, 20 May 2011

Metadata and standards - URI construction

We are continuing to finalise our RDF conversion and work through linking to OCLC resources. As we are also finalising the datasets we can make available under a permissive license, we are currently working with some random samples of catalogue data.

One issue worth highlighting at this stage is that of URI construction. URIs for records and other important entities described in a catalogue are a key component of linked data. We are taking a standards-based approach to URI construction, trying to follow the guidelines set out by the Cabinet Office for the UK public sector (PDF link).

Our record URI string is quite simple:

http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705

The /id/entry/ portion denotes that the URI relates to an identifier for either a catalogue entry or an entity described in our dataset. The identifier string that follows is a mixture of a prefix for the dataset (which we may remove) and the catalogue record's identifier, already used in persistent URLs for our catalogue interface.

One issue we've not tackled is human-readable unique identifiers for creators. The GUID portion at the end is constructed by taking a string of characters (say the 100$a in a MARC record), stripping it of punctuation (where errors tend to occur) and running it through an MD5 checksum.


http://data.lib.cam.ac.uk/id/entity/cul_comet_pddl_0a72dd0c8fe090f78970db02b336900f
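As a rough sketch of that process (the whitespace normalisation and prefix handling below are illustrative choices, not lifted from our actual conversion code):

import hashlib
import string

BASE = "http://data.lib.cam.ac.uk/id/entity/"
DATASET_PREFIX = "cul_comet_pddl_"  # dataset portion, which we may yet remove

def entity_uri(creator_heading):
    """Build an entity URI from a creator heading such as a MARC 100$a value."""
    # Strip punctuation, where transcription errors tend to occur, then
    # collapse whitespace so near-identical headings hash to the same value.
    cleaned = creator_heading.translate(str.maketrans("", "", string.punctuation))
    cleaned = " ".join(cleaned.split())
    digest = hashlib.md5(cleaned.encode("utf-8")).hexdigest()
    return BASE + DATASET_PREFIX + digest

print(entity_uri("Dickens, Charles, 1812-1870."))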


Human-readable URIs would be nice, but some attempt at keeping these unique is probably better. If the Library of Congress were to follow suit on their excellent subject work and publish their name authority file as linked data, we could utilise any GUIDs used there. Hopefully, we will be able to provide links to relevant VIAF (Virtual International Authority File) entries for authors, where they can be matched by OCLC.

I'll follow this up shortly with a post about how we are ensuring the data behind a URI is easily referenced by both humans and machines.

Thursday, 19 May 2011

Small (but fiddly) win for URIs ...


Work on RDF conversion goes on. In addition to eventual complete dumps of data, we've also started putting together the pieces for our application to support RDF queries via SPARQL and HTTP.
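By way of illustration, once a SPARQL endpoint is in place, data could be queried programmatically along these lines (the endpoint URL below is hypothetical and the query is only a sketch, not part of our actual setup):

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint; the real service location has not been announced yet.
sparql = SPARQLWrapper("http://data.lib.cam.ac.uk/sparql")
sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?entry ?title
    WHERE { ?entry dcterms:title ?title }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["entry"]["value"], binding["title"]["value"])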


We are using the Apache extension mod_rewrite to turn human-readable URIs like the one below ...


http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705


into those easily parsed by the web application dishing out the record content:


http://data.lib.cam.ac.uk/record.php?uri=http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705&format=html



It's also considered best practice with linked data to serve records in the format requested by the agent in its HTTP request. This practice is referred to as 'cool URIs'. As an example, if I want to view 'http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705' in a browser, where the standard HTTP request accepts content returned as 'text/html', then I should see HTML in my browser.

Conversely, if they want to see rdf+xml content, they may request it via a script or the command line, e.g.:



curl -H "Accept: application/rdf+xml" http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705


They should not have to add any kind of file extension (.rdf) to the request URI, although it's also nice to support this.


We can handle this within the web application framework, which would involve monitoring requests and parsing incoming URI strings for file extensions, but that adds precious lines of code. It is much easier to let the web server take over, which is where mod_rewrite again comes in. It allows you to specify a set of rules that check for file extensions and accepted content types and rework URIs so that the web application can dish out the required format.
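The rules themselves are written in Apache's mod_rewrite syntax, but the mapping they perform is simple. Purely as a conceptual sketch of that logic in Python (the extension, MIME-type and format-parameter names here are illustrative; only format=html appears in the example URI above):

# Map either a file extension or an Accept header onto the format parameter
# the web application expects. This mirrors what the rewrite rules do at the
# web-server level; the extension/MIME-type pairs are illustrative.
FORMATS = {
    ".rdf": "rdfxml", ".xml": "rdfxml",
    ".json": "json",
    ".nt": "triples",
    ".ttl": "turtle",
    ".html": "html",
}
ACCEPT_TYPES = {
    "application/rdf+xml": "rdfxml",
    "application/json": "json",
    "text/plain": "triples",
    "text/turtle": "turtle",
    "text/html": "html",
}

def choose_format(path, accept_header):
    """Pick an output format from a requested path and Accept header."""
    for ext, fmt in FORMATS.items():
        if path.endswith(ext):
            return fmt
    for mime in accept_header.split(","):
        mime = mime.split(";")[0].strip()  # drop any quality values
        if mime in ACCEPT_TYPES:
            return ACCEPT_TYPES[mime]
    return "html"  # sensible default for browsers

print(choose_format("/id/entry/cul_comet_pddl_4589705", "application/rdf+xml"))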


Sadly, we can't escape regular expressions, and mod_rewrite relies heavily on them. Logging is also essential for debugging. Here is our htaccess ruleset, with each rule commented. We are still not supporting all the formats available for RDF distribution, but sticking to XML, JSON, plain triples and Turtle.

mod_rewrite or equivalent tools are a vital part of semantic web infrastructure, and whilst fiddly, a little knowledge goes a long way. Here are three great tutorials:


Tuesday, 19 April 2011

Quick project update ...

(Extracted from previous post for the sake of brevity)


In terms of project progress, we have a workable and easily customisable (CSV-configurable) MARC21 to RDF-triples export script nearing completion. We hope to be able to share this towards the end of the project. Getting a suitable triple store and associated software frameworks in place for a 'data.lib.cam.ac.uk' domain will be the next focus of our technical work. Work also continues on identifying records for sharing and on internal discussions of licensing issues within the project, the main barrier to eventual data release. We've also been in contact with Eric Childress and his colleagues over at OCLC about enhancing our data with identifiers from the FAST and VIAF services.
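To give a flavour of the CSV-configurable approach, here is a heavily simplified sketch using the pymarc and rdflib libraries, with an invented mapping-file layout; it is not the project's actual export script:

import csv
from pymarc import MARCReader                       # assumed MARC parsing library
from rdflib import Graph, Literal, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")
ENTRY_BASE = "http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_"

def load_mapping(path):
    """Read a CSV of (MARC tag, subfield, predicate URI) rows,
    e.g. 245,a,http://purl.org/dc/terms/title"""
    with open(path, newline="") as f:
        return [(tag, sub, URIRef(pred)) for tag, sub, pred in csv.reader(f)]

def convert(marc_path, mapping):
    graph = Graph()
    graph.bind("dcterms", DCTERMS)
    with open(marc_path, "rb") as f:
        for record in MARCReader(f):
            # Use the control number as the record identifier (an illustrative choice).
            subject = URIRef(ENTRY_BASE + record["001"].value())
            for tag, subfield, predicate in mapping:
                for field in record.get_fields(tag):
                    for value in field.get_subfields(subfield):
                        graph.add((subject, predicate, Literal(value)))
    return graph

# Example usage:
#   graph = convert("records.mrc", load_mapping("mapping.csv"))
#   print(graph.serialize(format="nt"))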

Presentation at the 'Open Data Open Doors' event and other musings ...

Yesterday in Manchester, I was asked to give a brief presentation on our reasons for pursuing Open Data. Beyond the "it's a good thing to do" arguments (which better people than me have put across far more ably), I wanted to try to give an impression of the 'internal' reasons for doing so. In particular, I was keen to draw on real-world examples, including the work done by Rufus Pollock in estimating the size and growth of the printed public domain from CUL's bibliographic data. I also touched on my own personal ideas for resource discovery services at a national level, and how libraries could be providing developer-orientated services.



There were many useful discussions on both Open and Linked Data that day, with Paul Walk of UKOLN on hand to remind us that we should not rush to lump the two together, and that each was itself something of a bandwagon.

The need for a real, cost-driven business case for opening up data was mentioned, which gave me cause for concern. The benefits of publishing data will only be fully realised once developers get to produce useful outputs, which could take several cycles to emerge. Identifying successes outside of cultural heritage was seen as a good way to sidestep this.

Much concern was given to licensing, and the library-centric issues of record ownership again came to the fore. For many years, and in pre-Internet ecosystems, libraries and librarians have benefited greatly from shared practices and resources in cataloguing. It would be a real shame to let the technical and legal frameworks developed to support previous generations of activity get in the way of finding better ways to share data between and beyond the library community.

The linked data approach of the COMET project was compared to the formidable API-orientated work on Jerome taking place over at Lincoln. There was some discussion over the relative merits of each approach.

My personal take right now is that both have fairly separate use cases, and that publishing large amounts of data as RDF (or in 'community' formats such as MODS, MARC21, etc.) will be more useful for aggregation services than straight API provision, but that any eventual shared data service should itself expose data through APIs of the highest quality. Thus the work done by Jerome will be of great importance to the RDTF no matter which way things move. Lincoln are also themselves gaining an excellent platform for future service development.

As I argued in my talk, Linked Data still has a high barrier to entry, and many developers are much happier with a simple bit of JSON than with XML/RDF. RDF may not be the easiest means for aggregation (OAI-PMH works for me), but it's arguably a great tool for sharing library data in bulk beyond the library community. Apart from anything else, self-describing data means we don't have to explain MARC21 to people with useful things to do.

A third alternative, not currently being investigated by the RDTF (to my knowledge), would be crawler exposure of existing catalogues with RDFa or some kind of useful microformat in place.

I raised a point which no one seemed able to answer: what types of license are applicable to feeds of data, i.e. a JSON or XML API such as those we provide at www.lib.cam.ac.uk/api, or even an Atom/RSS feed?
Would Creative Commons licenses suffice, or do they need the data-specific Open Data Commons licenses? If anything, these are more of a service than a resource. How can we convey complete openness (or otherwise) in easily understandable terms?

Friday, 18 March 2011

The inevitable metadata post

Inspired by a recent post on the LOCAH (Linked Open Copac Archives Hub) blog, I'd like to begin to tackle the prickly subject of RDF vocabularies. At the recent Infrastructure for Resource Discovery kick-off meeting in Birmingham, the choice of vocabularies for linked data was seen as one of the biggest issues facing projects. Alongside licensing issues for bibliographic data, it's taking up a lot of time at this early planning stage of the COMET project.

Archives Hub have a specialised form of EAD to convert into RDF, and have been putting a serious amount of work into modelling their data as well as identifying useful existing vocabularies. In approaching COMET, we hoped to avoid any such modelling of bibliographic data to RDF and instead make use of existing work, as part of our broader project philosophy of minimising any 'coding from scratch'.

After all, with plenty of previous attempts to model bibliographic metadata already made, it should be simple enough not to reinvent the wheel?

Rather frustratingly for a standards-loving librarian such as myself, there is no single accepted set of vocabularies in place for publishing library bibliographic metadata. Whilst the W3C incubator group is closely examining the issue, actual output and recommendations seem a while off. If this is symptomatic of anything, it's that publishing linked data is still the exception for libraries rather than standard practice.

When scoping our bid, we initially looked at the SIMILE project, specifically its XSLT-based MARC21->MODS->RDF conversion tool. Initial tests proved promising, and MODS is a sufficiently rich standard not to lose 'data richness', although we would still have to do some work to create URIs for the entities described by the data.

However, outside of this project there has been little take-up of this concept (and we could not see much movement around SIMILE). Indeed, very few library metadata standards have been directly expressed in RDF, although the Library of Congress are looking at MADS.

At the same time, we want our data to be as easily reusable as possible. Whilst this is an inherent feature of RDF due to its self-describing nature, we felt that using popular vocabularies helps to minimise effort and makes our data more easily readable by those more familiar with existing linked data practices than with library standards.

Looking at the Open Bibliography project, their choice of vocabularies for the OpenBiblio software underpinning their Bibliographica service comprises several generally used vocabularies, most notably BIBO, the Bibliographic Ontology, for additional bibliographic elements, including some elements of FRBRisation. It also includes Dublin Core for general descriptive terms and FOAF for people (Will Waites, a developer at the Open Knowledge Foundation, has gone into some further detail on their development email list).

The initial knee-jerk concern for any metadata-fixated librarian with a MARC21 to Dublin Core conversion is that of 'data loss'. As an example, the many MARC21 fields for author or creator (100, 110, 700, 710, etc.) are generally flattened to dc:creator. Given the use case of development within a linked data environment, one has to sit back and question the value of having several different types of author/creator, or indeed the myriad additional alternative and uniform title fields that MARC21-based data may present. Following accepted practice from outside the library sphere could be more useful in this context.
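As an illustration of what such a flattened description might look like, here is a hand-made sketch built with rdflib using BIBO, Dublin Core terms and FOAF; the title, name and the choice of dcterms over dc are placeholders rather than output from our conversion:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF

BIBO = Namespace("http://purl.org/ontology/bibo/")

g = Graph()
g.bind("bibo", BIBO)
g.bind("dcterms", DCTERMS)
g.bind("foaf", FOAF)

book = URIRef("http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705")
author = URIRef("http://data.lib.cam.ac.uk/id/entity/cul_comet_pddl_0a72dd0c8fe090f78970db02b336900f")

# Whether the heading came from MARC 100, 110 or 700, it ends up as a single
# creator statement pointing at a FOAF person.
g.add((book, RDF.type, BIBO.Book))
g.add((book, DCTERMS.title, Literal("An example title")))
g.add((book, DCTERMS.creator, author))
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.name, Literal("Example, Author")))

print(g.serialize(format="turtle"))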

As such, we are currently aiming to broadly adopt the set of vocabularies used by Open Bibliography, with some modifications. BIBO in particular looks to be a good vocabulary to use, with growing support; Dortmund University Library has recently adopted it for its open metadata. Sadly, they have no MARC21 conversion script to hand. Our options right now are to adopt the already outdated MARC import script that is part of Bibliographica/OpenBiblio or to create our own.

In following posts, we will discuss URI naming conventions for RDF graphs and accepted practice for bibliographic entities.

Monday, 21 February 2011

"Ownership" of MARC 21 records - please comment

The case for open bibliographic data has been well made. The Open Knowledge Foundation's Bibliographic Working Group has established a set of Open Bibliographic Principles. The JISC Resource Discovery Taskforce has itself produced a comprehensive Open Bibliographic Data Guide, examining reasons for publishing through use cases and the wider context of open data elsewhere within the UK.

With such principles and use cases firmly established, one barrier to publishing open data lies in establishing the 'ownership' of a record, ensuring that, as far as a library is aware, no existing license agreements with record vendors are breached.

COMET's initial document on "Ownership" of MARC21 records is designed to help identify where MARC21-encoded metadata originates and to assist in establishing its provenance.

The documentation and underlying investigation were carried out by Hugh Taylor, Head of Collection Description and Development at Cambridge University Library. Hugh is as familiar as anyone with the vast and varied dataset at the University Library. Given the size and scope of our data, the issues and examples raised will hopefully be of use to anyone else considering publishing Open Data.

This guide is something of a work in progress, which we will revisit as COMET progresses. Next up is a brief summary of relevant licenses, aiming to provide an overview of what is and is not allowed with the array of data we have.

We would welcome feedback in the comments below.

Tuesday, 15 February 2011

Welcome

Welcome to the COMET (Cambridge Open Metadata) project blog. COMET is a JISC-funded collaboration between Cambridge University Library and CARET, University of Cambridge. It is funded under the JISC Infrastructure for Resource Discovery programme.

COMET will release a large subset of bibliographic data from Cambridge University Library catalogues as open metadata, under a Public Domain Dedication License. It will also explore and test a number of technologies and methodologies for publishing XML/RDF.

COMET aims to build upon the successes of previous work in this area.

The library has previously contributed a dataset of 132,130 bibliographic records to the JISC-funded Open Bibliography project led by the Unilever Centre for Molecular Science Informatics at the University of Cambridge, in partnership with the Open Knowledge Foundation and the International Union of Crystallography.

This collaboration began to develop our understanding of the intellectual property and technical issues relating to the exposure of bibliographic data and potential value in linking the data.

COMET will have a particular focus on library-catalogue-derived bibliographic data, aiming to provide the University of Cambridge and the wider academic community with a readily accessible RDF store for bibliographic data. The development and installation work behind this will be documented in such a way as to be repeatable by others.

We will also investigate and document the availability of metadata for the library’s collections which can be released openly in machine-readable formats and the barriers which prevent other data from being exposed in this way.

The project will also explore the value of a linked approach to enrichment of records using services provided by OCLC to assign FAST (Faceted Application of Subject Terminology) and VIAF (Virtual International Authority File) headings to the metadata, allowing the development of innovative services for information retrieval and resource discovery.

You can find detailed information regarding COMET on our about page, including full aims, objectives and expected outputs, as well as a project plan.