Monday, 20 June 2011

On licensing ...

Background
Licensing of bibliographic metadata is far too complex a subject.
One of the major aims of COMET has been to see how easy it is to identify records from major record vendors in the UK HE environment and to address issues and concerns around data reuse. This work is still ongoing, but it's high time we got a post out on the subject, explaining where things are at.

Like most university libraries, Cambridge University Library relies heavily on external record vendors to meet its cataloguing needs and keep up to speed with a high intake of material. Much of this data has its potential reuse and republication covered by an explicit contractual agreement. At the same time, we understand and support the need to produce Open Data as a platform for a better set of services for Higher Education.

State of play
Through the COMET project we have been investigating our data for traces of 'ownership' and examining contracts. We've contacted the major record providers, and some have indicated a preference for certain types of license for data re-publishing.
As an example, the British Library have published the British National Bibliography as RDF-formatted data under a PDDL and are happy for others holding BNB data in their catalogues to do the same (although there is not yet any formal announcement to this effect!).

OCLC, perhaps the biggest record supplier, have recently expressed a preference for ODC-By attribution licensing. We are one of a number of libraries working with OCLC to investigate the practicalities around this.

We in turn produce a substantial amount of data in-house, and would still like to publish this under a Public Domain Data License. Identifying this data was actually more difficult than it should have been: we insert no 'made in Cambridge' label on our records, so we had to identify this set via a process of elimination.

Given this disparity between approaches to licensing, we will be aiming to produce several different datasets under established Open Data Commons licenses.

In terms of URI structure and vocabulary choice they will be identical, but each whole set will be represented by a separate named graph in our RDF datastore, itself linked to the appropriate license information. For data produced under anything other than a PDDL, license information will also be made explicitly obvious to those downloading in bulk.
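By way of a sketch of how a graph can carry its own license statement (the graph URIs, the license URLs and the use of Python's rdflib below are purely illustrative, not a description of our actual datastore), it can be as simple as a dcterms:license triple about the graph URI:

# Illustrative sketch only: one named graph per license, with the graph URI
# linked to the relevant Open Data Commons license via dcterms:license.
from rdflib import Dataset, URIRef, Namespace

DCTERMS = Namespace("http://purl.org/dc/terms/")

ds = Dataset()

# Hypothetical graph URIs for a PDDL set and an ODC-By set
pddl_graph = ds.graph(URIRef("http://data.lib.cam.ac.uk/graph/pddl"))
odcby_graph = ds.graph(URIRef("http://data.lib.cam.ac.uk/graph/odc-by"))

# License statements about the graphs themselves sit in the default graph
ds.add((pddl_graph.identifier, DCTERMS.license,
        URIRef("http://opendatacommons.org/licenses/pddl/1.0/")))
ds.add((odcby_graph.identifier, DCTERMS.license,
        URIRef("http://opendatacommons.org/licenses/by/1.0/")))

# Record-level triples then go into whichever graph matches their provenance,
# e.g. pddl_graph.add((record_uri, DCTERMS.title, Literal("...")))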

A final solution?
This area is still in flux. We feel that although licenses may vary, there should be no barrier to publishing data for others to reuse. We hope that over time the library community will work towards a set of established practices and community norms for data publishing. This work represents one of the first steps taken in this area.

Public Domain Data Licensing is an obvious ideal and the one we prefer, but adopting a pragmatic approach now can get more useful data out in the wild quickly. Whilst stepping back from PDDL or CC0 once data is released is next to impossible, adopting a slightly less open license as an initial position, which can be rethought downstream, may be more palatable. Just steer clear of non-commercial licenses for data!

MARC21 - another reason for deviation
Whilst there is strong interest in and backing for Open Bibliographic Data within the international HE library community, concerns have been raised about its impact on organizations that rely on commercial MARC21 record supply to maintain and develop services.
We recognize that partner institutions have valid commercial interests in this, and we ourselves benefit from such services. As such, we are only releasing MARC21 that we can claim total ownership of. Other data is being released as RDF only. We believe our RDF output is sufficiently altered to make cross-walking it back to useful MARC21 next to impossible.

This may not be an approach suited to everyone's tastes, but it is pragmatic. To put this in perspective, how many open data consumers really care about MARC21? It's a format that really deserves to die and is irrelevant to the wider conversation.

Some of this post has been distilled down into a forthcoming FAQ for data.lib.cam.ac.uk.

Friday, 20 May 2011

Metadata and standards - URI construction

We are continuing to finalise our RDF conversion and work through linking to OCLC resources. As we are also finalising the datasets we can make available under a usefully permissive license, we are currently working off some random samples of catalogue data.

One issue worth highlighting at this stage is that of URI construction. URIs for records and other important entities described in a catalogue are a key component of linked data. We are taking a standards-based approach to URI construction, trying to follow the guidelines set out by the Cabinet Office for the UK public sector (PDF link).

Our record URI string is quite simple:

http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705

The /id/entry/ element denotes that the URI relates to an identifier for a catalogue entry; entities described in our dataset use /id/entity/ instead (see the example further down). The identifier string that follows is a mixture of a prefix for the dataset (which we may remove) and the catalogue record's identifier, which is already used in persistent URLs for our catalogue interface.

One issue we've not tackled is human-readable unique identifiers for creators. The GUID portion at the end is constructed by taking a string of characters (say, the 100$a of a MARC record), stripping it of punctuation (where errors tend to occur) and running it through an MD5 checksum:


http://data.lib.cam.ac.uk/id/entity/cul_comet_pddl_0a72dd0c8fe090f78970db02b336900f
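For the curious, the process boils down to something like the following. This is a minimal sketch; the example heading, the prefix handling and any extra normalisation in our actual conversion script may well differ.

# Minimal sketch: strip punctuation from a name heading (say a MARC 100$a)
# and run it through MD5 to get the final part of the entity URI.
import hashlib
import string

def entity_uri(name_heading, prefix="cul_comet_pddl"):
    cleaned = name_heading.translate(str.maketrans("", "", string.punctuation))
    digest = hashlib.md5(cleaned.encode("utf-8")).hexdigest()
    return "http://data.lib.cam.ac.uk/id/entity/%s_%s" % (prefix, digest)

print(entity_uri("Austen, Jane, 1775-1817."))  # heading is a made-up example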


Human-readable URIs would be nice, but some attempt at keeping this unique is probably better. If the Library of Congress were to follow suit on their excellent subject work and publish their name authority file as linked data, we could utilize any GUIDs used there. Hopefully, we will be able to provide links to relevant VIAF (Virtual International Authority File) entries for authors, where they can be matched by OCLC.

I'll follow this up shortly with a post about how we are ensuring the data behind a URI is easily referenced by both humans and machines.

Thursday, 19 May 2011

Small (but fiddly) win for URIs ...


Work on RDF conversion goes on. In addition to eventual complete dumps of data, we've also started putting together the pieces for our application to support RDF queries via SPARQL and HTTP.


We are using the Apache module mod_rewrite to turn human-readable URIs like the one below ...


http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705


into those easily parsed by the web application dishing out the record content:


http://data.lib.cam.ac.uk/record.php?uri=http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705&format=html
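A single rewrite rule is enough for this basic mapping. The rule below is an illustrative .htaccess sketch rather than our exact configuration (the format parameter name is simply taken from the URL above):

# Illustrative sketch: map the public identifier URI onto the script that
# dishes out the record, defaulting to HTML output.
RewriteEngine On
RewriteRule ^id/entry/(.+)$ /record.php?uri=http://data.lib.cam.ac.uk/id/entry/$1&format=html [L]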



It's also considered best practice with linked data to dish up records in the format required by the requesting agent in its HTTP request. This practice is referred to as 'cool URIs'. As an example, if I want to view 'http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705' in a browser, where the standard HTTP request accepts content returned as 'text/html', then I should see HTML in my browser.

Conversely, if they want to see rdf+xml content, they can request it via a script or the command line, e.g.:



curl -H "Accept: application/rdf+xml" http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705


They should not have to add any kind of file extension (.rdf) to the request URI, although it's also nice to support this.


We can handle this within the web application framework, which will involve monitoring requests and parsing incoming URI strings for file extensions, but that will add precious lines of code. Much easier to let the web server take over, which is where mod_rewrite again comes in. It allows you to specify a set of rules that monitor for file extensions and accepted content types and rework URIs so a web application can dish out the required format.


Sadly, we can't escape regular expressions, and mod_rewrite relies heavily on them. Logging is also essential for debugging. Here is our .htaccess ruleset, with each rule commented. We are still not supporting all the formats available for RDF distribution, but sticking to XML, JSON, baseline triples and Turtle.
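To give a flavour of the general shape (this is an illustrative sketch rather than the ruleset referred to above, and the format parameter values are invented for the example), content negotiation in mod_rewrite looks something like this:

# Clients asking for RDF/XML in their Accept header get the RDF/XML output
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^id/entry/(.+)$ /record.php?uri=http://data.lib.cam.ac.uk/id/entry/$1&format=rdfxml [L]

# Likewise for Turtle and JSON
RewriteCond %{HTTP_ACCEPT} text/turtle
RewriteRule ^id/entry/(.+)$ /record.php?uri=http://data.lib.cam.ac.uk/id/entry/$1&format=ttl [L]
RewriteCond %{HTTP_ACCEPT} application/json
RewriteRule ^id/entry/(.+)$ /record.php?uri=http://data.lib.cam.ac.uk/id/entry/$1&format=json [L]

# A file extension on the URI is also honoured, e.g. .rdf
RewriteRule ^id/entry/(.+)\.rdf$ /record.php?uri=http://data.lib.cam.ac.uk/id/entry/$1&format=rdfxml [L]

# Requests matching none of the above fall through to the catch-all HTML rule
# shown in the previous post, which must come last.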

mod_rewrite or equivalent tools are a vital part of semantic web infrastructure, and whilst fiddly, a little knowledge can go a long way. Here are three great tutorials:


Tuesday, 19 April 2011

Quick project update ...

(Extracted from previous post for the sake of brevity)


In terms of project progress, we have a workable and easily customizable (CSV-configurable) MARC21-to-RDF-triples export script nearing completion. We hope to be able to share this towards the end of the project. Getting a suitable triple-store and associated software frameworks in place for a 'data.lib.cam.ac.uk' domain will be the next focus of our technical work. Work also continues on identifying records for sharing and on internal discussions of licensing issues within the project, the main barrier to eventual data release. We've also been in contact with Eric Childress and his colleagues over at OCLC about enhancing our data with identifiers for the FAST and VIAF services.
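To give an idea of the CSV-configurable approach, here is a minimal sketch using pymarc and rdflib. It is not the project's actual export script: the mapping file format, file names and record URI handling are invented for the example.

# Sketch only: a CSV mapping of the form "245,a,http://purl.org/dc/terms/title"
# decides which MARC fields and subfields become which RDF predicates.
import csv
from pymarc import MARCReader
from rdflib import Graph, URIRef, Literal

def load_mapping(path):
    with open(path, newline="") as f:
        return [(tag, sub, URIRef(pred)) for tag, sub, pred in csv.reader(f)]

def convert(marc_path, mapping):
    g = Graph()
    with open(marc_path, "rb") as f:
        for record in MARCReader(f):
            ctrl = record.get_fields("001")
            if not ctrl:
                continue
            uri = URIRef("http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_%s"
                         % ctrl[0].value().strip())
            for tag, sub, predicate in mapping:
                for field in record.get_fields(tag):
                    for value in field.get_subfields(sub):
                        g.add((uri, predicate, Literal(value)))
    return g

g = convert("sample.mrc", load_mapping("mapping.csv"))
g.serialize(destination="sample.nt", format="nt")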

Presentation at the 'Open Data Open Doors' event and other musings ...

Yesterday in Manchester, I was asked to give a brief presentation on our reasons for pursuing Open Data. Beyond the "it's a good thing to do" arguments (which better people than myself have been able to put across), I wanted to try and give an impression of the 'internal' reasons for doing so. In particular, I was keen to draw on real-world examples, including the work done by Rufus Pollock in estimating the size and growth of the printed public domain from CUL's bibliographic data. I also touched on my own personal ideas for resource discovery services at a national level, and how libraries could be providing developer-orientated services.



There were many useful discussions on both Open and Linked Data that day, with Paul Walk of UKOLN on hand to remind us that we should not jump at lumping the two together, and that each was itself something of a bandwagon.

The need for a real, cost-driven business case for opening up data was mentioned, which gave me cause for concern: the benefits of publishing data will only be fully realized when developers get to produce useful outputs, which could take several cycles to emerge. Identifying successes outside of cultural heritage was seen as a good way to sidestep this.

Much concern was given to licensing, and the library-centric issues of record ownership again came to the fore. For many years, and in pre-Internet ecosystems, libraries and librarians have benefited greatly from shared practices and resources in cataloguing. It would be a real shame to let the technical and legal frameworks developed to support previous generations of activity get in the way of finding better ways to share data between and beyond the library community.

The linked data approach of the Comet project was compared to the formidable API-orientated work on Jerome, taking place over at Lincoln. There was some discussion over the relative merits of each approach.

My personal take right now is that both have fairly separate use cases, and that publishing large amounts of data as RDF (or in 'community' formats such as MODS, MARC21, etc.) will be more useful for aggregation services than straight API provision, but that any eventual shared data service should itself expose data through APIs of the highest quality. Thus the work done by Jerome will be of great importance to the RDTF no matter which way things move. Lincoln are also themselves gaining an excellent platform for future service development.

As I argued in my talk, Linked Data still has a high entry bar, and many developers are much happier with a simple bit of JSON than with XML/RDF. RDF may not be the easiest means of aggregation (OAI-PMH works for me) but it's arguably a great tool for sharing library data in bulk beyond the library community. Apart from anything else, self-describing data means we don't have to explain MARC21 to people with useful things to do.

A third alternative, not currently being investigated by the RDTF (to my knowledge) would be crawler exposure of existing catalogues with RDFa or some kind of useful microformat in place.

I raised a point which no one seemed able to answer: what types of license are applicable to feeds of data, i.e. a JSON or XML API such as those we provide at www.lib.cam.ac.uk/api, or even an Atom/RSS feed?
Would Creative Commons licenses suffice, or do they need the data-specific Open Data Commons licenses? If anything, they are more of a service than a resource. How can we imply complete openness (or otherwise) in easily understandable terms?

Friday, 18 March 2011

The inevitable metadata post

Inspired by a recent post on the LOCAH (Linked Open Copac Archives Hub) blog, I'd like to begin to tackle the prickly subject of RDF vocabularies. At the recent Infrastructure for Resource Discovery kick-off meeting in Birmingham, the choice of vocabularies for linked data was seen as one of the biggest issues facing projects. Alongside licensing issues for bibliographic data, it's taking up a lot of time at this early planning end of the COMET project.

Archives Hub have a specialised form of EAD to convert into RDF, and have been putting a serious amount of work into modelling their data as well as identifying useful existing vocabularies to use. When we approached COMET, we hoped to avoid any such modelling of bibliographic data to RDF and instead make use of existing work, as part of our greater project philosophy of aiming to minimise any 'coding from scratch'.

After all, plenty of previous attempts to model bibliographic metadata have been made; it should be simple enough not to reinvent the wheel?

Rather frustratingly for a standards-loving librarian such as myself, there is no accepted single set of vocabularies in place for publishing library bibliographic metadata. Whilst the W3C incubator group is closely examining the issue, actual output and recommendations seem a while off. If this is symptomatic of anything, it's that publishing linked data is still an exception for libraries rather than standard practice.

When scoping our bid, we initially looked at the SIMILE project, specifically its XSLT-based MARC21->MODS->RDF conversion tool. Initial tests proved promising, and MODS is a sufficiently rich standard not to lose 'data richness', although we would still have to do some work to create URIs for entities described by the data.

However, outside of this project there has been little take-up of this concept (and we could not see much movement around SIMILE). Indeed, very few library metadata standards have been directly expressed in RDF, although the Library of Congress are looking at MADS.

At the same time, we want our data to be as easily reusable as possible. Whilst this is an inherent feature of RDF due to its self-describing nature, we felt that using popular vocabularies helps to minimise effort and makes our data more easily readable by those more familiar with existing linked data practices than with library standards.

Looking at the Open Bibliography project, their choice of vocabularies for the OpenBiblio software underpinning their Bibliographica service comprises several generally used vocabularies, most notably Bibo, the Bibliographic Ontology, for additional bibliographic elements, including some elements of FRBRisation. It also includes Dublin Core for general descriptive terms and FOAF for people (Will Waites, a developer at the Open Knowledge Foundation, has gone into some further detail on their development email list).

The initial knee-jerk concern for any metadata-fixated librarian with a MARC21 to Dublin Core conversion is that of 'data loss'. As an example, the many MARC21 fields for author or creator (100, 110, 700, 710 etc.) are generally flattened to dc:creator, as in the sketch below. Given the use case of development within a linked data environment, one has to sit back and question the value of having several different types of author/creator, or indeed the myriad of additional alternative and uniform title fields that MARC21-based data may present. Following accepted practice from outside the library sphere could be more useful in this context.
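As a concrete, and purely illustrative, sketch of that flattening (the URIs and the heading below are invented for the example, and the exact choice of dc versus dcterms is ours here), a 100$a main entry and a 700$a added entry both end up expressed in exactly the same way:

# Illustrative only: whatever the MARC tag (100, 110, 700, 710 ...), the
# heading ends up as a dc:creator link to a foaf:Person.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DC, FOAF, RDF

g = Graph()
work = URIRef("http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705")
author = URIRef("http://data.lib.cam.ac.uk/id/entity/cul_comet_pddl_0a72dd0c8fe090f78970db02b336900f")

g.add((work, DC.creator, author))
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.name, Literal("Austen, Jane, 1775-1817.")))

print(g.serialize(format="turtle"))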

As such, we are currently aiming to adopt, in general, the set of vocabularies used by Open Bibliography, with some modifications. Bibo in particular looks to be a good vocab to use, with growing support; Dortmund University Library has recently adopted it for its open metadata. Sadly, they've no MARC21 conversion script to hand. Our options right now are to adopt the already outdated MARC import script that is part of Bibliographica / OpenBiblio or to create our own.

In following posts, we will discuss URI naming conventions for RDF graphs and accepted practice for bibliographic entities.

Monday, 21 February 2011

"Ownership" of MARC 21 records - please comment

The case for open bibliographic data has been well made. The Open Knowledge Foundation's Bibliographic Working Group has established a set of Open Bibliographic Principles. The JISC Resource Discovery Taskforce has itself produced a comprehensive Open Bibliographic Data Guide, examining reasons for publishing through use cases and the wider context of open data elsewhere within the UK.

With such principles and use cases firmly established, one barrier to publishing open data lies in establishing the 'ownership' of a record, ensuring that, as far as a library is aware, no existing license agreements with record vendors are breached.

COMET's initial document on "Ownership" of MARC21 records is designed to help identify where MARC21-encoded metadata originates from and to assist in establishing its provenance.

The documentation and underlying investigation were performed by Hugh Taylor, Head of Collection Description and Development at Cambridge University Library. Hugh is as familiar as anyone with the vast and varied dataset at the University Library. Given the size and scope of our data, the issues and examples raised will hopefully be of use to anyone else considering publishing Open Data.

This guide is something of a work in progress, which we will revisit as COMET progresses. Next up is a brief summary of relevant licenses, aiming to provide an overview of what is allowed and not allowed with the array of data we have.

We would welcome feedback in the comments below.