Friday, 20 May 2011

Metadata and standards - URI construction

We are continuing to finalise our RDF conversion and work through linking to OCLC resources. As we are also finalising the datasets we can make available under a permissive licence, we are currently working on some random samples of catalogue data.

One issue worth highlighting at this stage is that of URI construction. URIs for records and other important entities described in a catalogue are a key component of linked data. We are taking a standards-based approach to URI construction, trying to follow the guidelines set out by the Cabinet Office for the UK public sector (PDF link).

Our record URI string is quite simple:

http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705

The /id/entry/ portion denotes that the URI relates to an identifier for either a catalogue entry or an entity described in our dataset. The identifier string that follows is a mixture of a prefix for the dataset (which we may remove) and the catalogue record's identifier, which is already used in persistent URLs for our catalogue interface.

One issue we've not tackled is human-readable unique identifiers for creators. The GUID portion at the end is constructed by taking a string of characters (say, the 100$a field in a MARC record), stripping it of punctuation (where errors tend to occur) and running it through an MD5 checksum:


http://data.lib.cam.ac.uk/id/entity/cul_comet_pddl_0a72dd0c8fe090f78970db02b336900f
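A minimal sketch of that construction in PHP (the function name is hypothetical, and the exact punctuation-stripping rule is an assumption rather than our production code):

<?php
// Build an entity URI from a creator heading (e.g. a MARC 100$a).
// The normalisation here is illustrative: strip punctuation, where
// errors tend to occur, then take an MD5 checksum of what remains.
function entity_uri($heading) {
    $clean = preg_replace('/[[:punct:]]/', '', $heading);
    $guid  = md5($clean); // 32-character hex string
    return 'http://data.lib.cam.ac.uk/id/entity/cul_comet_pddl_' . $guid;
}

echo entity_uri('Austen, Jane, 1775-1817.');
?>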


Human-readable URIs would be nice, but some attempt at keeping these identifiers unique is probably better. If the Library of Congress were to follow suit on their excellent subject work and publish their name authority file as linked data, we could use any GUIDs minted there. Hopefully, we will be able to provide links to relevant VIAF (Virtual International Authority File) entries for authors, where they can be matched by OCLC.

I'll follow this up shortly with a post about how we are ensuring the data behind a URI is easily referenced by both humans and machines.

Thursday, 19 May 2011

Small (but fiddly) win for URIs ...


Work on RDF conversion goes on. In addition to eventual complete dumps of data, we've also started putting together the pieces for our application to support RDF queries via SPARQL and HTTP.


We are using the Apache module mod_rewrite to turn human-readable URIs like the one below ...


http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705


into those easily parsed by the web application dishing out the record content:


http://data.lib.cam.ac.uk/record.php?uri=http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705&format=html
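The rule behind this mapping looks something like the following sketch (assuming the .htaccess file sits at the document root; the exact pattern is illustrative):

RewriteEngine On

# Rewrite a clean identifier URI onto the script that renders the record;
# $1 captures the record identifier, e.g. cul_comet_pddl_4589705.
RewriteRule ^id/entry/(.+)$ /record.php?uri=http://data.lib.cam.ac.uk/id/entry/$1&format=html [L]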



It's also considered best practice with linked data to dish up records in the format required by the requesting agent in its HTTP request, a practice referred to as 'cool URIs'. As an example, if I want to view 'http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705' in a browser, where the standard HTTP request accepts content returned as 'text/html', then I should see HTML in my browser.

Conversely, a client that wants rdf+xml content may request it via a script or the command line, e.g.:



curl -H "Accept: application/rdf+xml" http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705


Clients should not have to add any kind of file extension (.rdf) to the request URI, although it's also nice to support this.
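For instance, something like this should also return RDF/XML (assuming the extension mapping sketched in the ruleset below):

curl http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705.rdf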


We could handle this within the web application framework, which would involve monitoring requests and parsing incoming URI strings for file extensions, but that would add precious lines of code. It is much easier to let the web server take over, which is where mod_rewrite again comes in. It allows you to specify a set of rules that watch for file extensions and accepted content types, and rework URIs so that a web application can dish out the required format.
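For comparison, the application-level alternative would look something like this PHP sketch (the function name and format values are illustrative):

<?php
// Decide on an output format from the request URI and Accept header.
// This is the in-application approach we would rather avoid.
function negotiate_format($uri, $accept) {
    // An explicit file extension wins, if present.
    if (preg_match('/\.(rdf|json|nt|ttl)$/', $uri, $m)) {
        $map = array('rdf' => 'rdfxml', 'json' => 'json',
                     'nt'  => 'ntriples', 'ttl' => 'turtle');
        return $map[$m[1]];
    }
    // Otherwise fall back to the Accept header.
    if (strpos($accept, 'application/rdf+xml') !== false) return 'rdfxml';
    if (strpos($accept, 'text/turtle') !== false) return 'turtle';
    return 'html'; // sensible default for browsers
}
?>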


Sadly, we can't escape regular expressions, and mod_rewrite relies heavily on them. Logging is also essential for debugging. Our .htaccess ruleset follows the general shape below, with each rule commented. We are still not supporting all the formats available for RDF distribution, but sticking to RDF/XML, JSON, N-Triples and Turtle.
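A simplified sketch of that ruleset (the patterns, format parameter values and media types are illustrative, not the production file):

# NB: in Apache 2.2 the RewriteLog and RewriteLogLevel directives,
# essential for debugging, must be set in the main server config;
# they are not permitted inside a .htaccess file.

RewriteEngine On

# An explicit file extension takes precedence over content negotiation.
RewriteRule ^id/(entry|entity)/(.+)\.rdf$  /record.php?uri=http://data.lib.cam.ac.uk/id/$1/$2&format=rdfxml [L]
RewriteRule ^id/(entry|entity)/(.+)\.json$ /record.php?uri=http://data.lib.cam.ac.uk/id/$1/$2&format=json [L]
RewriteRule ^id/(entry|entity)/(.+)\.nt$   /record.php?uri=http://data.lib.cam.ac.uk/id/$1/$2&format=ntriples [L]
RewriteRule ^id/(entry|entity)/(.+)\.ttl$  /record.php?uri=http://data.lib.cam.ac.uk/id/$1/$2&format=turtle [L]

# Otherwise, negotiate on the Accept header of the request.
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^id/(entry|entity)/(.+)$ /record.php?uri=http://data.lib.cam.ac.uk/id/$1/$2&format=rdfxml [L]

RewriteCond %{HTTP_ACCEPT} application/json
RewriteRule ^id/(entry|entity)/(.+)$ /record.php?uri=http://data.lib.cam.ac.uk/id/$1/$2&format=json [L]

RewriteCond %{HTTP_ACCEPT} text/plain
RewriteRule ^id/(entry|entity)/(.+)$ /record.php?uri=http://data.lib.cam.ac.uk/id/$1/$2&format=ntriples [L]

# Turtle's media type is still settling, so accept both common forms.
RewriteCond %{HTTP_ACCEPT} text/turtle [OR]
RewriteCond %{HTTP_ACCEPT} application/x-turtle
RewriteRule ^id/(entry|entity)/(.+)$ /record.php?uri=http://data.lib.cam.ac.uk/id/$1/$2&format=turtle [L]

# Fall back to HTML for browsers and anything else.
RewriteRule ^id/(entry|entity)/(.+)$ /record.php?uri=http://data.lib.cam.ac.uk/id/$1/$2&format=html [L]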

mod_rewrite and equivalent tools are a vital part of semantic web infrastructure, and whilst fiddly, a little knowledge can go a long way. Here are three great tutorials: