Friday, 22 July 2011

Friday update ...

A few things on a Friday.

Firstly, I've written a small piece on getthedata.org about data.lib.cam.ac.uk and the various mechanisms for querying and retrieving data, also mirrored on our FAQ. It may be of interest to those in the Discovery developer competition.
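For a flavour of the simplest retrieval mechanism, the sketch below asks the service for the RDF behind a single entity URI via HTTP content negotiation. Treat it as a sketch: whether the service honours the Accept header in exactly this way is an assumption, though the entity URI is one of the real subject entries we've published.

    #!/usr/bin/perl
    # Minimal sketch: fetch the RDF describing one data.lib.cam.ac.uk entity.
    # Assumption: the service returns RDF/XML when asked via the Accept header.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $uri = 'http://data.lib.cam.ac.uk/id/entry/cambrdgedb_c1574b4e36a34f04bda61b3ea57b2379';
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get($uri, Accept => 'application/rdf+xml');

    die 'Request failed: ' . $res->status_line unless $res->is_success;
    print $res->decoded_content;   # the RDF for this entity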

Development work is all but done, so we've also published the application framework code behind data.lib.cam.ac.uk on the code page. This PHP-based site provides a lightweight approach to RDF publishing and makes a good starting and exploration point for libraries wanting to publish data as RDF.
More details and a read-me are available on the code page. As with all our output, it's provided 'as is' under the GPL, but we welcome feedback.
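For the curious, the heart of such a lightweight approach is plain content negotiation over stored descriptions. The released framework is PHP, so the Perl CGI below is purely an illustrative rendering of the pattern, with hypothetical file locations and parameters:

    #!/usr/bin/perl
    # Illustrative sketch of lightweight linked data publishing: inspect the
    # Accept header and serve either RDF/XML or HTML for an entity.
    # The id parameter and per-entity file layout are hypothetical.
    use strict;
    use warnings;
    use CGI;

    my $q  = CGI->new;
    my $id = $q->param('id') || '';
    $id =~ s/[^A-Za-z0-9_]//g;    # crude sanitisation, sketch only

    if (($ENV{HTTP_ACCEPT} || '') =~ m{application/rdf\+xml}) {
        print $q->header(-type => 'application/rdf+xml');
        print slurp("rdf/$id.rdf");      # hypothetical per-entity RDF
    } else {
        print $q->header(-type => 'text/html');
        print slurp("html/$id.html");    # hypothetical HTML view
    }

    sub slurp {
        my ($path) = @_;
        open my $fh, '<', $path or die "Cannot read $path: $!";
        local $/;
        return scalar <$fh>;
    }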

As the COMET deadline approaches next week, we are still working to release as much data as we can. Sadly, we are still waiting on final confirmation from some external bodies. As such, we will continue to publish and republish data using existing tools throughout the following year, as and when we can.

Tuesday, 19 July 2011

Cost benefits

The JISC has asked us to blog on the cost benefits of providing open data. I'll give a rough indication of costs based on time spent, an idea of what I think the benefits may be, and an indication of how the two weigh up.

Costs:

1) Marc21 data 'ownership' analysis (5 days staff time at SP64)
Mapping and conversion of bibliographic information. An experimental and iterative process.

2) Marc21 to RDF data conversion (2 developers at SP53)
Again, this was drawn out by the experimental nature of the work. Several methods and iterations were tried. Those aiming to repeat this may not incur the same cost.

3) Web infrastructure development and record curation (2 developers at SP53)
A lightweight approach to development was taken using existing application frameworks. Time was also spent understanding the underlying principles of RDF stores and the associated best practice for linked data. Several iterative loads of data were undertaken in parallel with the Marc to RDF conversion.

4) Hosting and sustainability costs (costs TBC)
COMET's web infrastructure makes use of existing VM and MySQL infrastructure at CARET, so additional infrastructure costs were negligible and hard to itemise separately. We've promised to keep the service running for a year.

5) Other stuff
Project management etc.

External benefits:
  • Substantial contribution to Open Bibliography - Open data is arguably a good thing, and whilst ours has flaws, it is hopefully useful to others in its own right
  • Clarification on licensing agreements with record vendors - The COMET project has made much headway on this issue, obtaining some clarification on licensing preferences for RDF data from three major UK record vendors: OCLC, RLUK and the British Library. Down the line, we hope that these organizations will formalize their agreements with us so that others can benefit, which should help in publishing more data
  • Advice on how to analyse records to determine 'ownership', and lightweight (Perl, PHP and MySQL based) tools to create and publish RDF linked data from Marc21
  • Experiments with FAST and VIAF - two potentially useful data sources

In-house benefits:
  • Community interaction - There is strong interest in Open Bibliography and its benefits. The University Library has also benefited greatly from its interaction with the open and linked data communities, in its work with OCLC and with others through the JISC Discovery programme
  • In-house skills - We've gained vital in-house understanding of the design and publication of RDF. We've developed basic training materials around SPARQL for non-developers, which could pay off down the line

Summary:
External benefits clearly outweigh internal benefits, although as external benefits affect the whole library community, they also benefit us!

What's clear is that Open Data is not free data, at least not to us. We could have simply dumped our Marc21 or Dublin Core XML and been done with it, and for many that would have sufficed.

Instead, combining our wish to publish more Open Data with a need to learn about Linked Data (and thus lashing two fast bandwagons nicely together) has pushed the costs far higher.

However, by publishing linked data we've hopefully made our output more useful to a wider community than library metadata specialists, and in that sense added value.

More data being published means greater community feedback to draw upon, which should result in lower costs for those repeating this exercise.

It may indeed be several development cycles before we or others fully reap the benefits of this work. Alternatively, things could move in a different direction, with RDF-based linked data falling by the wayside in favour of more accessible mechanisms for record sharing; in which case, our work could still be useful in helping others avoid our mistakes.

Wednesday, 13 July 2011

Project update and following in our footsteps

As the COMET project comes to a close, we are working through the final piece of ownership analysis to identify more data for RDF conversion and publication.

We've loaded sample records enriched with FAST and VIAF identifiers and are in discussion with OCLC about the best way to model them.

In the interim, we've been asked to blog briefly about helping others to 'follow in our footsteps'. We ourselves were very much following the work done by the Open Bibliography project, even if we had a slightly different focus and toolset. There was a reason for this: one of the aims of COMET, at least in my mind, was to see how easy it would be for an average library systems team to attempt the impressive work seen on projects such as Open Bibliography, work done by people who already had considerable experience of linked data and open licensing.

Here are a few tips based on our experiences.

1) Be aware of your licensing. Whilst there is no good reason not to share data, some vendors have explicitly prohibited it. We hope to have a better summary of our work examining our contracts up soon, but the main thing to look for is explicit contractual agreements from vendors that prohibit re-sharing.

Otherwise, you then have to choose an appropriate license. We've ended up 'chunking' our data so that whatever can be placed in the public domain under the PDDL will be. Otherwise, some form of attribution license would be required.

Thankfully, few other libraries will have collections of data as complex as Cambridge's, with most relying on one or two vendors.

2) Think about the backend and issues of scaling before you start. We approached COMET with an exploratory hat on; the world of triplestores and SPARQL was new to us and we were not sure how much data we would be able to publish. The ARC2 datastore we eventually chose was great to develop with, but ultimately unable to adequately store our entire data output. For libraries with smaller datasets (under half a million records, or around 16 million triples), it's well worth a look. (At least we are in good company with this; I've noticed that the DBpedia backend does not provide access to everything...)

3) Take a look at our tools. We have a Perl Marc21-to-RDF generation utility ready to go; a minimal sketch of the approach follows these tips. We chose Perl as it is often used by systems librarians to 'munge' and export data. Our mapping is customisable, and the baseline triples it produces are easy to load. We've based a lot of the final output on work done by the British Library in modelling the British National Bibliography.

4) RDF vocabulary modelling is itself something of a burden: you can give it a lot of thought and concern, try numerous different schemas and still not be sure of the usefulness of your output. Our advice is to focus on useful elements such as subject entries and identifiers. Be careful with the structure; too many links and nodes can lead to data that is 'linked to death'.

Don't expect to get it right first time.
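As promised in tip 3, here is a minimal sketch of the kind of Marc21-to-RDF conversion our utility performs, built on the widely used MARC::Batch module. The record URI pattern and the Dublin Core properties are illustrative assumptions rather than our production mapping, and the output is baseline N-Triples:

    #!/usr/bin/perl
    # Minimal sketch of Marc21-to-RDF conversion. The record URI pattern and
    # the Dublin Core properties are illustrative assumptions only.
    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new('USMARC', 'records.mrc');

    # Escape quotes and backslashes for N-Triples literals.
    sub escape { my ($s) = @_; $s =~ s/(["\\])/\\$1/g; return $s }

    while (my $record = $batch->next) {
        my $f001 = $record->field('001') or next;    # need a control number
        my $uri  = '<http://data.lib.cam.ac.uk/id/record/' . $f001->data . '>';  # hypothetical pattern

        # Title from 245 $a as a dcterms:title literal.
        if (my $title = $record->subfield('245', 'a')) {
            printf "%s <http://purl.org/dc/terms/title> \"%s\" .\n", $uri, escape($title);
        }

        # One dcterms:subject triple per 650 $a heading.
        for my $field ($record->field('650')) {
            my $heading = $field->subfield('a') or next;
            printf "%s <http://purl.org/dc/terms/subject> \"%s\" .\n", $uri, escape($heading);
        }
    }

The point is less the specific mapping than the shape of the work: walk the records, pick the fields you trust, and emit simple, loadable triples.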

Monday, 11 July 2011

And now for something completely different ...

Well, fairly different. It's not record licensing or RDF vocabs, but it has relevance to the wider aims of the Discovery programme. Prompted by a recent tweet, I'm blogging about search engine exposure in an academic library context.

Google, Yahoo and Bing have recently clubbed together to create schema.org, a set of data standards that lets websites provide richer information for search engines; it's a very lightweight 'semantic' approach. Many, including Eric Hellman at ALA, have said that libraries should be on board with this, and that Search Engine Optimisation should be one of our aims. Some have taken issue with this approach, but his ideas seem a far cry from the traditional closed library OPAC.

For me at least, his timing was right. As well as working on COMET to produce open, linked library metadata, I've been quietly experimenting with Search Engine Optimisation (SEO) on the side.

The main problem lies in crawling. Since Google dropped OAI-PMH support a few years back, they now only accept sitemaps: large XML files of persistent URLs which their robots can crawl over.

Many library catalogues do not even have persistent URLs, let alone sitemaps, and many libraries lack the means to develop their own.

Thanks, however, to some nifty 'unsupported' features in Aquabrowser, our current discovery service for library materials, I've been able to generate sitemaps covering every record in the system, with persistent URLs as standard. I understand that VuFind, an open source equivalent, can also do this.
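The sitemap format itself is trivial; the practical wrinkles are the 50,000-URL cap per sitemap file and the need for a persistent URL pattern to point at. Here is a minimal Perl sketch, with a hypothetical record URL pattern and input list:

    #!/usr/bin/perl
    # Minimal sketch of sitemap generation from a file of record IDs.
    # The record URL pattern and input file are hypothetical; sitemap files
    # are capped at 50,000 URLs each, hence the splitting.
    use strict;
    use warnings;

    my $base    = 'http://search.lib.cam.ac.uk/?itemid=';   # hypothetical URL pattern
    my $limit   = 50_000;
    my $file_no = 0;
    my $count   = 0;
    my $fh;

    sub close_map { print {$fh} "</urlset>\n"; close $fh }

    open my $ids, '<', 'record_ids.txt' or die "Cannot open id list: $!";
    while (my $id = <$ids>) {
        chomp $id;
        if ($count % $limit == 0) {
            close_map() if $fh;    # finish the previous file
            open $fh, '>', sprintf('sitemap-%03d.xml', ++$file_no) or die $!;
            print {$fh} qq{<?xml version="1.0" encoding="UTF-8"?>\n};
            print {$fh} qq{<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};
        }
        print {$fh} "  <url><loc>$base$id</loc></url>\n";
        $count++;
    }
    close_map() if $fh;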

With help from colleagues at Harvard and Chicago, I've also customised the full record display to work with two microformat standards: COinS (Z39.88) and schema.org microdata. This allowed me to get around some inadequacies in the catalogue page design, allowing record titles to be read over the page title.
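To give a feel for what those two standards add to a record page, the sketch below emits both for a single record. The values and field choices are illustrative only; our real templates carry rather more metadata:

    #!/usr/bin/perl
    # Sketch of the two kinds of markup added to the full record display.
    # Record values and field choices are illustrative only.
    use strict;
    use warnings;
    use URI::Escape qw(uri_escape);

    my %rec = (title => 'An example title', author => 'Bloggs, Joe');

    # COinS (Z39.88): an OpenURL ContextObject packed into the title attribute
    # of an empty span ('&' is written as '&amp;' for the HTML source).
    my $ctx = join '&amp;',
        'ctx_ver=Z39.88-2004',
        'rft_val_fmt=' . uri_escape('info:ofi/fmt:kev:mtx:book'),
        'rft.btitle='  . uri_escape($rec{title}),
        'rft.au='      . uri_escape($rec{author});
    print qq{<span class="Z3988" title="$ctx"></span>\n};

    # schema.org microdata: the same record as typed, visible HTML.
    print qq{<div itemscope itemtype="http://schema.org/Book">\n};
    print qq{  <span itemprop="name">$rec{title}</span>\n};
    print qq{  by <span itemprop="author">$rec{author}</span>\n};
    print qq{</div>\n};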

Good news, right?

Yes and no. Since April, about 180,000 record pages have been indexed by Google. Sounds promising, except that we submitted sitemaps of over 5 million URLs. By the time they finish, we may well have moved from Aquabrowser to a new platform. We can of course throttle up the crawling by Google, but we need to watch its effect on our live service, even with some fairly beefy servers.

A few thoughts:
  • This was not easy and a bit experimental, but undoubtedly a useful exercise and a nice comparison to our work on COMET. I can publish 1.3 million fairly rich records as RDF in a week, but no search engine right now would want to touch them. Outside of the linked and open data communities, few would take notice and it will probably not get extra people through my doors
  • Realistically, however, search engine exposure will bring few extra people to Cambridge libraries unless we can get record pages linked to in a useful manner and drive results rankings up to the front page. One rough search for an indexed record, 'the cat sacase library', gets me Open Library and Worldcat as top hits, but no Cambridge :(
  • Schema.org is aimed at e-commerce and improving services like Google Shopping. Metadata choices are limited to author/title, description, identifiers, and availability. That seems fair enough, given that Google is an advertising company, but where does academic research or even basic library use actually fit in? It's designed to be extensible. Could an 'academic web crawler' make better use of the tags? What about the clever bods at Wolfram Alpha or True Knowledge? (They are also welcome to some RDF ...)
  • Few other libraries have even bothered with search engine exposure and optimisation, mainly due to problems with integrated library systems (Huddersfield and Lincoln being two known exceptions). Their reasons are practical: one rampant crawler could bring down both back and front office systems, and few systems support permanent URLs. Sadly, this trend may not be reversing (Aquabrowser being an exception). Services like Summon, EBSCO Discovery and Primo Central are not search engine friendly, being large closed indexes themselves. Permanent URLs for records may not be a given. Summon even does away with a full page per record, ironically because people don't 'expect that from a search engine'...
  • Will schema.org really take off? I get the feeling that I've been here before. I remember being told in training sessions many years back to 'always tag my URL' and to include meta tags in web page headers. As a young, budding librarian, this sounded great. I was very disappointed to later learn that most engines ignored them, as they were a great way of breaking ranking systems. How will this 'system gaming' be avoided with schema.org and other microdata formats?
So to summarise: right now I can expose a little data to a lot of people and hope they see it amongst a lot of other data, or expose a lot of data to a small set of people, who just might do something great with it. Meanwhile, those that use our library will probably still know to use the catalogue.

You can try our Google customised search for http://search.lib.cam.ac.uk here. I'd be interested to see what people think.

Tuesday, 5 July 2011

Two more updates ...

After this week's launch of data.lib.cam.ac.uk, it's good to follow up with some more updates.

First up is a SPARQL workshop, a three-part tutorial on RDF and the SPARQL query language aimed at IT workers with little to no knowledge of the semantic web, technically minded librarians, web designers, and those with an interest in metadata and its (re)use.

One of the primary (and justifiable) criticisms of RDF is the high entry barrier. Much of the literature assumes a high level of technical and semantic web knowledge.

In an attempt to 'help others follow in our footsteps', I've tried to capture what I myself learned using SPARQL to query our dataset. This may not actually lower the entry barrier, but it will hopefully provide those with an interest in RDF with a baseline starting place.
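To give a taste of the level the tutorial starts from, here is the sort of first query it builds towards, run from Perl: list ten titles from the dataset. The endpoint path and the use of dcterms:title are assumptions for the sketch rather than documented specifics:

    #!/usr/bin/perl
    # Sketch of a beginner's SPARQL query run over HTTP.
    # The endpoint path and dcterms:title property are assumptions.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $endpoint = 'http://data.lib.cam.ac.uk/sparql';   # assumed endpoint path
    my $query = 'PREFIX dcterms: <http://purl.org/dc/terms/> '
              . 'SELECT ?book ?title WHERE { ?book dcterms:title ?title } LIMIT 10';

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->post($endpoint,
        { query => $query },
        Accept => 'application/sparql-results+json');

    die 'Query failed: ' . $res->status_line unless $res->is_success;
    print $res->decoded_content;   # JSON results table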

Secondly, we are beginning to better link our Linked Data!

We've made some experimental gains in URI enrichment, supplementing our graphs for catalogue subject entries with links to the Library of Congress vocabularies.

See these examples:

http://data.lib.cam.ac.uk/id/entry/cambrdgedb_c1574b4e36a34f04bda61b3ea57b2379

http://data.lib.cam.ac.uk/id/entry/cambrdgedb_2ca5328ca9bebe20f37a7718d5e1f67b

http://data.lib.cam.ac.uk/id/entry/cambrdgedb_2883408d7b714bb6423d5c1ebcb40a48


As our labels are made up of a number of Library of Congress subject components (subject, geographic, chronological, etc.), we are taking the initial main entry and representing it with a 'skos:broader' link. We would love some feedback on this approach, which is little more than a starting point. As we are using HTTP requests against the id.loc.gov service, we are also running into scaling issues with our 600,000+ subject entries.
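In outline, the enrichment loop looks something like the sketch below: resolve a heading component against the id.loc.gov label lookup, which redirects to the matching authority URI, and assert skos:broader from our entry. Treat the lookup path and redirect handling as illustrative assumptions rather than production code:

    #!/usr/bin/perl
    # Sketch of subject enrichment via id.loc.gov's label lookup.
    # Lookup path and redirect handling are illustrative assumptions.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI::Escape qw(uri_escape);

    my $entry   = '<http://data.lib.cam.ac.uk/id/entry/cambrdgedb_c1574b4e36a34f04bda61b3ea57b2379>';
    my $heading = 'Chemistry';    # the initial main entry component

    my $ua  = LWP::UserAgent->new(max_redirect => 0);   # inspect the redirect ourselves
    my $res = $ua->get('http://id.loc.gov/authorities/label/' . uri_escape($heading));
    my $loc = $res->header('Location');

    if ($res->is_redirect && $loc) {
        # Emit the skos:broader triple from our entry to the LC authority.
        print "$entry <http://www.w3.org/2004/02/skos/core#broader> <$loc> .\n";
    } else {
        warn "No id.loc.gov match for '$heading'\n";
    }

One HTTP round trip per heading is what creates the scaling problem mentioned above; batching or caching lookups would be the obvious next step.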

Enrichment is being done directly in our RDF store, so for now it is not reflected in our bulk data downloads.

Monday, 4 July 2011

data.lib.cam.ac.uk now live

Good news, everyone! Our open data service, the major output from the COMET project, is now live. You can find it, along with our first dataset of over a million records, at data.lib.cam.ac.uk. We've been running in stealth mode for a while, but are happy to announce the live service today.

This dataset is also part of the first Discovery developer competition. It's our first 'in-house' attempt at producing Linked Data and we welcome feedback on it. See our FAQ for more information.

As we wrap COMET up over the coming month, we will have additional outcomes and datasets available at data.lib.cam.ac.uk.

In related news, the Open Bibliography project blog has published its end-of-project post. It's an excellent read that highlights some great achievements and provides strong, example-led arguments for the value of Open Bibliography.

Monday, 20 June 2011

On licensing ...

Background
Licensing of bibliographic metadata is far too complex a subject.
One of the major aims of COMET has been to see how easy it is to identify records from major record vendors in the UK HE environment and to address issues and concerns around data reuse. This work is still ongoing, but it's high time we got a post out on the subject, explaining where things are at.

Like most university libraries, Cambridge University Library relies heavily on external record vendors to meet its cataloging needs and keep up to speed with a high intake of material. Much of this data has its potential reuse and republication covered by an explicit contractual agreement. At the same time, we understand and support the need to produce Open Data as a platform for a better set of services for Higher Education.

State of play
Through the COMET project we have been investigating our data for traces of 'ownership' and have been examining contracts. We've contacted the major record providers, and some have indicated a preference for certain types of licenses for data republishing.
As an example, the British Library has published the British National Bibliography as RDF-formatted data under a PDDL and is happy for others holding BNB data in their catalogues to do the same (although there is not yet any formal announcement to this effect!).

OCLC, perhaps the biggest record supplier, has recently expressed a preference for ODC-By attribution licensing. We are one of a number of libraries working with OCLC to investigate the practicalities around this.

We in turn produce a substantial amount of data in-house, and would still like to publish this under the Public Domain Dedication and License (PDDL). Identifying this data was actually more difficult than it should have been: we ourselves insert no 'made in Cambridge' label on our records, so we had to identify this set via a process of elimination (sketched below).
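At its simplest, elimination means checking each record's cataloguing source codes (Marc 040) against a list of known vendor codes and treating the remainder as candidates for in-house origin. A minimal sketch, with placeholder vendor codes:

    #!/usr/bin/perl
    # Minimal sketch of ownership analysis by elimination: flag records whose
    # 040 (cataloguing source) codes match a known vendor list; the remainder
    # are candidates for in-house origin. Vendor codes are placeholders.
    use strict;
    use warnings;
    use MARC::Batch;

    my %vendor = map { $_ => 1 } qw(VENDOR1 VENDOR2);   # placeholder codes

    my $batch = MARC::Batch->new('USMARC', 'records.mrc');
    my ($in_house, $from_vendor) = (0, 0);

    while (my $record = $batch->next) {
        my $f040  = $record->field('040');
        my @codes = $f040
            ? grep { defined } ($f040->subfield('a'), $f040->subfield('c'), $f040->subfield('d'))
            : ();
        if (grep { $vendor{$_} } @codes) { $from_vendor++ }
        else                             { $in_house++ }
    }

    print "$in_house candidate in-house records, $from_vendor vendor-sourced\n";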

Given this disparity between approaches to licensing, we will be aiming to produce several different datasets under established Open Data Commons licenses.

In terms of URI structure and vocab choice, the datasets will be identical, but each whole set will be represented by a separate graph in our RDF datastore, itself linked to the appropriate license information. For data produced under anything other than a PDDL, license information will also be made explicitly obvious to those downloading in bulk.
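Expressed as N-Quads, the pattern looks something like this: a fourth term scopes each triple to its dataset graph, and one further triple ties that graph to its license. The graph URI, example entry and property choice below are illustrative assumptions, not our final modelling:

    #!/usr/bin/perl
    # Sketch of the graph-per-license pattern as N-Quads. The graph URI,
    # example entry and property choice are illustrative assumptions.
    use strict;
    use warnings;

    my $graph   = '<http://data.lib.cam.ac.uk/graph/pddl>';               # hypothetical graph URI
    my $license = '<http://www.opendatacommons.org/licenses/pddl/1.0/>';

    # A data triple scoped to the PDDL graph (fourth term = graph).
    print '<http://data.lib.cam.ac.uk/id/entry/example> '
        . '<http://www.w3.org/2004/02/skos/core#prefLabel> '
        . '"An example subject entry" '
        . "$graph .\n";

    # The graph linked to its license (dcterms:license is one plausible choice).
    print "$graph <http://purl.org/dc/terms/license> $license .\n";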

A final solution?
This area is still in flux. We feel that although licenses may vary, there should be no barrier to publishing data for others to reuse. We hope that over time, the library community will work to a set of established practices and community norms over data publishing. This work represents one of the first steps taken in this area.

Public Domain Data Licensing is an obvious ideal and the one we prefer, but adopting a pragmatic approach now can get more useful data out in the wild quickly. Whilst stepping back from PDDL or CC0 is next to impossible, adopting a slightly less open license as an initial position, one which can be rethought downstream, may be more palatable. Just steer clear of non-commercial licenses for data!

Marc21 - another reason for deviation
Whilst there is strong interest in and backing for Open Bibliographic Data within the international HE library community, concerns have been raised about its impact on organizations that rely on commercial Marc21 record supply to maintain and develop services.
We recognize that partner institutions have valid commercial interests in this, and we benefit ourselves from such services. As such, we are only releasing Marc21 that we can claim total ownership of. Other data is being released as RDF only. We believe our RDF output is sufficiently altered to make cross-walking it back to useful Marc21 next to impossible.

This may not be an approach suited to everyone's tastes, but it is pragmatic. To put this in perspective: how many open data consumers really care about Marc21? It's a format that really deserves to die and is irrelevant to the wider conversation.

Some of this post has been distilled down into a forthcoming FAQ for data.lib.cam.ac.uk.