Friday, 20 May 2011
One issue worth highlighting at this stage is that of URI construction. URIs for records and other important entities described in a catalogue are a key component of linked data. We are taking a standards-based approach to URI construction, trying to follow the guidelines set out by the Cabinet Office for the UK public sector (PDF link).
Our record URI structure is quite simple:

http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705
The /id/entry/ portion denotes that the URI relates to an identifier for either a catalogue entry or an entity described in our dataset. The identifier string that follows is a mixture of a string of characters for the dataset (which we may remove) and the catalogue record's identifier, already used in persistent URLs for our catalogue interface.
One issue we've not tackled is human-readable unique identifiers for creators. The GUID portion at the end is constructed from a string of characters (say the 100$a field in a MARC record) stripped of punctuation (where errors tend to occur) and run through an MD5 checksum.
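A minimal sketch of that checksum step, assuming a simple punctuation-strip and case-fold as the normalisation (the heading below is a made-up example, and the exact normalisation rules are illustrative, not necessarily the ones we use):

```shell
# Sketch: derive a creator GUID from a MARC 100$a heading.
# The heading is a made-up example; stripping punctuation and
# lowercasing are illustrative normalisation choices.
heading='Dickens, Charles, 1812-1870.'
normalised=$(printf '%s' "$heading" | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]')
guid=$(printf '%s' "$normalised" | md5sum | awk '{print $1}')
echo "$guid"
```

Because the checksum is computed after punctuation is removed, small cataloguing variations in commas and full stops don't change the identifier.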
Human-readable URIs would be nice, but some attempt at keeping these unique is probably better. If the Library of Congress were to follow suit on their excellent subject headings work and publish their name authority file as linked data, we could utilize the GUIDs used there. Hopefully, we will be able to provide links to relevant VIAF (Virtual International Authority File) entries for authors, where they can be matched by OCLC.
I'll follow this up shortly with a post about how we are ensuring the data behind a URI is easily referenced by both humans and machines.
Thursday, 19 May 2011
Work on RDF conversion goes on. In addition to eventual complete dumps of the data, we've also started putting together the pieces for our application to support RDF queries via SPARQL and HTTP.
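To give a flavour of the sort of query this should enable, here is an illustrative SPARQL query; the use of Dublin Core for titles is an assumption for the sake of the example, not a statement about our actual modelling:

```sparql
# Illustrative only: fetch titles for ten records, assuming
# Dublin Core is used for titles in the published data.
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?record ?title
WHERE { ?record dc:title ?title . }
LIMIT 10
```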
We are using the Apache module mod_rewrite to turn human-readable URIs like the one below ...

http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705
into those easily parsed by the web application dishing out the record content:
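A sketch of the kind of rule involved; the internal path /app/record and the id parameter are assumptions for illustration, not our actual application paths:

```apache
# Illustrative only: map the public identifier URI onto a
# hypothetical internal application path.
RewriteEngine On
RewriteRule ^id/entry/(.+)$ /app/record?id=$1 [L]
```

The captured identifier ($1) is handed to the application as a query parameter, so the public URI never exposes the application's internal structure.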
It's also considered best practice with linked data to dish up records in the format specified by the requesting agent in its HTTP request. This practice is referred to as 'cool URIs'. As an example, if I want to view 'http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705' in a browser, where the standard HTTP request accepts content returned as 'text/html', then I should see HTML in my browser.
Conversely, if they want to see rdf+xml content, they may request it via a script or the command line, e.g.:
curl -H "Accept: application/rdf+xml" http://data.lib.cam.ac.uk/id/entry/cul_comet_pddl_4589705
They should not have to add any kind of file extension (.rdf) to the request URI, although it's also nice to support this.
We could handle this within the web application framework, which would involve monitoring requests and parsing incoming URI strings for file extensions, but that would add precious lines of code. Much easier to let the web server take over, which is where mod_rewrite again comes in. It allows you to specify a set of rules that check for file extensions and accepted content types and rework URIs so a web application can dish out the required format.
Sadly, we can't escape regular expressions, and mod_rewrite relies heavily on them. Logging is also essential for debugging. Here is our .htaccess ruleset, with each rule commented. We are still not supporting all the formats available for RDF distribution, but sticking to XML, JSON, baseline triples and Turtle.
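To show the shape of such a ruleset, here is a cut-down illustrative version covering just two formats; the internal paths and parameter names are assumptions, not our actual rules:

```apache
# Illustrative only: content negotiation via mod_rewrite.
# Internal path /app/record and the format parameter are assumptions.
RewriteEngine On

# 1. Explicit file extensions win: .rdf serves RDF/XML, .ttl serves Turtle.
RewriteRule ^id/entry/(.+)\.rdf$ /app/record?id=$1&format=rdfxml [L]
RewriteRule ^id/entry/(.+)\.ttl$ /app/record?id=$1&format=turtle [L]

# 2. Otherwise negotiate on the Accept header.
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^id/entry/(.+)$ /app/record?id=$1&format=rdfxml [L]

RewriteCond %{HTTP_ACCEPT} text/turtle
RewriteRule ^id/entry/(.+)$ /app/record?id=$1&format=turtle [L]

# 3. Default: anything else (e.g. a browser asking for text/html)
#    gets the HTML view.
RewriteRule ^id/entry/(.+)$ /app/record?id=$1&format=html [L]
```

The ordering matters: extension rules are checked before Accept-header rules, so a .rdf request always gets RDF/XML regardless of what the client says it accepts.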
Mod_rewrite or equivalent tools are a vital part of semantic web infrastructure, and whilst fiddly, a little knowledge can go a long way. Here are three great tutorials: