Friday 18 March 2011

The inevitable metadata post

Inspired by a recent post on the LOCAH (Linked Open Copac Archives Hub) blog, I'd like to begin to tackle the prickly subject of RDF vocabularies. At the recent Infrastructure for Resource Discovery kick-off meeting in Birmingham, the choice of vocabularies for linked data was seen as one of the biggest issues facing projects. Alongside licensing issues for bibliographic data, it's taking up a lot of time at this early planning stage of the COMET project.

Archives Hub have a specialised form of EAD to convert into RDF, and have been putting a serious amount of work into modelling their data as well as identifying useful existing vocabularies. When we approached COMET, we hoped to avoid any such modelling of bibliographic data in RDF and instead make use of existing work, as part of our wider project philosophy of minimising any 'coding from scratch'.

After all, plenty of previous attempts have been made to model bibliographic metadata. Surely it should be simple enough not to reinvent the wheel?

Rather frustratingly for a standards-loving librarian such as myself, there is no single accepted set of vocabularies for publishing library bibliographic metadata. Whilst the W3C incubator group is closely examining the issue, actual output and recommendations seem a while off. If this is symptomatic of anything, it's that publishing linked data is still an exception for libraries rather than standard practice.

When scoping our bid, we initially looked at the SIMILE project, specifically its XSLT-based MARC-21 -> MODS -> RDF conversion tool. Initial tests proved promising, and MODS is a rich enough standard not to lose 'data richness', although we would still have to do some work to create URIs for entities described by the data.

However, outside of this project there has been little take-up of this concept (and we could not see much movement around SIMILE). Indeed, very few library metadata standards have been directly expressed in RDF, although the Library of Congress are looking at MADS.

At the same time, we want our data to be as easily reusable as possible. Whilst this is an inherent feature of RDF due to its self-describing nature, we felt that using popular vocabularies would help to minimise effort and make our data more easily readable by those more familiar with existing linked data practices than with library standards.

Looking at the Open Bibliography project, the OpenBiblio software underpinning their Bibliographica service draws on several widely used vocabularies, most notably Bibo, the Bibliographic Ontology, for additional bibliographic elements, including some elements of FRBRisation. It also includes Dublin Core for general descriptive terms and FOAF for people. (Will Waites, a developer at the Open Knowledge Foundation, has gone into some further detail on their development email list.)
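To make that mix of vocabularies a little more concrete, here is a minimal sketch of how one book might be described using Bibo, Dublin Core and FOAF together. It is an illustration only, not the OpenBiblio data model: the example.org URIs, the function names and the sample record are made up, and the choice of the Dublin Core "terms" namespace (rather than the older "elements" one) is an assumption on my part.

```python
# Illustrative sketch only: not the OpenBiblio / Bibliographica model.
# All example.org URIs and the sample book are hypothetical.

BIBO = "http://purl.org/ontology/bibo/"
DCT = "http://purl.org/dc/terms/"  # assumption: DC terms, not DC elements
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples expects."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes only."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def describe_book(book_uri, title, author_uri, author_name):
    """Return triples typing the book with Bibo, describing it with
    Dublin Core and describing its author with FOAF."""
    return [
        (uri(book_uri), uri(RDF_TYPE), uri(BIBO + "Book")),
        (uri(book_uri), uri(DCT + "title"), literal(title)),
        (uri(book_uri), uri(DCT + "creator"), uri(author_uri)),
        (uri(author_uri), uri(RDF_TYPE), uri(FOAF + "Person")),
        (uri(author_uri), uri(FOAF + "name"), literal(author_name)),
    ]


def to_ntriples(triples):
    """Serialise the triples one per line in N-Triples syntax."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = describe_book(
    "http://example.org/book/1",
    "Pride and Prejudice",
    "http://example.org/person/1",
    "Jane Austen",
)
print(to_ntriples(triples))
```

The point of the sketch is that each vocabulary does one job: Bibo says what kind of thing the resource is, Dublin Core carries the general description, and FOAF describes the person at the other end of the creator link.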

The initial knee-jerk concern for any metadata-fixated librarian faced with a MARC-21 to Dublin Core conversion is that of 'data loss'. As an example, the many MARC21 fields for author or creator (100, 110, 700, 710 etc.) are generally flattened to DC:creator. Given the use case of development within a linked data environment, one has to sit back and question the value of having several different types of author/creator, or indeed the myriad additional alternative and uniform title fields that MARC21-based data may present. Following accepted practice from outside the library sphere could be more useful in this context.
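The flattening described above can be sketched in a few lines. This is a rough illustration under simplifying assumptions, not a real conversion script: a record is reduced to a hypothetical list of (tag, value) pairs, subfields and indicators are ignored, and the sample data is invented.

```python
# Rough sketch of flattening MARC21 creator fields to a single dc:creator list.
# Assumption: a record is a simple list of (tag, value) pairs; real MARC21
# records carry indicators and subfields, which are ignored here.

# 100 = main entry, personal name     110 = main entry, corporate name
# 700 = added entry, personal name    710 = added entry, corporate name
CREATOR_TAGS = {"100", "110", "700", "710"}


def flatten_creators(fields):
    """Return every creator-type value in record order, dropping the tag."""
    return [value for tag, value in fields if tag in CREATOR_TAGS]


# Invented sample record for illustration.
record = [
    ("100", "Austen, Jane"),
    ("245", "Pride and prejudice"),
    ("700", "Thomson, Hugh"),
    ("710", "Example Press"),
]

creators = flatten_creators(record)
print(creators)
```

What is lost is exactly what the tags encoded: whether a name was the main or an added entry, and whether it belonged to a person or a corporate body. Whether that matters is the question for the linked data use case.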

As such, we are currently aiming to adopt the set of vocabularies used by Open Bibliography, with some modifications. Bibo in particular looks to be a good vocabulary to use, with growing support: Dortmund University Library has recently adopted it for its open metadata. Sadly, they have no MARC-21 conversion script to hand. Our options right now are to adopt the already outdated MARC import script that is part of Bibliographica / OpenBiblio or to create our own.

In following posts, we will discuss URI naming conventions for RDF graphs and accepted practice for bibliographic entities.