COMET (Cambridge Open METadata) project blog: July 2011

Thursday, 28 July 2011

Final post

It's time for the final project post for COMET. Here we summarize the major outputs of the past six months, cover what we've gained in terms of skills, and set out the most significant lessons we've learnt. We will also take a look at what could be done to follow up on the project.

COMET was perhaps overly ambitious for a six-month project, but we've made some firm progress in a number of areas relating to libraries and their open distribution of data.

Major outputs
- Document on data ownership analysis - A document describing the major sources of data in the Cambridge University Library catalogue, and making some comments about ownership and licensing around re-use.

- Workflow proposal and tool for record segmentation by vendor code, based on the above work. Suggests a methodology for sorting records when a vendor specifically requests a license other than PDDL.

- Marc21 to RDF triples conversion utility - A standalone tool designed to get data out of the dead Marc21 format and into something better quickly. It features extensive CSV-based customization; see the readme file for more details. Our digital metadata specialist Huw Jones was largely responsible for making this happen.

- data.lib.cam.ac.uk - Our first run at a library-centric open data service. It includes:

  - 2.2 million linked RDF bibliographic records licensed under a PDDL, with more to come
  - A SPARQL endpoint providing access to some of the above data
  - Supporting documentation, including an FAQ and SPARQL tutorial, aimed at a 'first rung on the ladder' for RDF and SPARQL novices

- Application framework code for the above - a PHP / MYSQL application framework to store and deliver RDF data in a variety of formats, with a flexible SPARQL endpoint. Also includes an experimental Library of Congress subject headings enrichment utility. A 'getting started' document covers installation and data loading. (For a flashier alternative, take a look at the Open Biblio suite.)

- An interesting sideline into the world of microdata and search engines

- Talks and presentations on Open Bibliographic Data at Birmingham and Manchester

Next steps
Publishing 'more open/linked data' would be useful, but data alone will not solve the challenge to improve resource discovery in the UK cultural sector. Here are some thoughts on what could come next. This is quite an eclectic list of ideas and musings on next steps that the Discovery programme could take, with some deeper focus around RDF:

1) Useable services for a wider audience
Open bibliographic data is one thing, but a certain level of skill and understanding is required to fully appreciate it, a criticism also made of the wider open data movement. To spread the word and enthuse a wider audience beyond 'data geeks', it would be great to see working services built around a framework of Open Data, or at least some impressive tech demos. (It's worth mentioning Bibliographica here, which is already a great step in this direction ...)

2) RDF
If RDF is to continue in use as a mechanism for publishing open bibliographic data, its application needs further thought and development. Here are four suggestions:

2.1) Move beyond pure bibliography into holdings data.
In library systems and services, the real interactions that matter to library users are focused around library holdings. This data could potentially be published openly, and modeled in RDF. Links could be established to activity data to provide a framework for user driven discovery services.

2.2) 'Enliven' linked RDF data.
Like most open bib data, ours is a static dump of our catalogue taken at a single point in time. It would be great to see pipes and processes in place to reflect changes and possibly track provenance. This is not as simple as it sounds: do we provide regular full updates or track incremental changes?

2.3) Better ways to get to RDF.
RDF data is valuable in its own right, but arguably needs easier access methods than SPARQL. Combining RDF data with better indexing and REST API technologies would be useful in widening its access and making it a more 'developer friendly' format. Thankfully, many RDF based tools offer this functionality, including the Talis platform. The Neo4J graph database technology also looks promising.

2.4) Recommendations for RDF vocabularies and linking targets for linked bibliographic data.
I think this needs to happen soonish, otherwise we will still be producing different attempts at the same thing over and over again. It does not need to be complete or final, but a useful set of starting places and guidelines for bibliographic RDF is required. The Discovery programme is well placed to provide these recommendations to the UK sector. That would be a great start internationally. Then we can just get on with producing it and improving it :)

3) Cloud based platforms and services for publishing bibliographic data
COMET has shown that this is not yet as easy or cheap as it could be. With library systems teams and infrastructure often overstretched, taking on new publishing practices that do not have an obvious immediate in-house benefit is a hard sell.

To make it more palatable, better mechanisms for sharing are needed. The Extensible Catalog toolkit already provides a great set of tools for doing this with OAI-PMH. Imagine a similar but cloud based data distribution service whereby all a library has to do is (S)FTP a dump of its catalogue once a week. This is transformed on the fly into a variety of formats (RDF, XML, JSON etc.) for simple re-use, with licenses automatically applied depending on set criteria.

4) Microdata, Microformats and sitemaps
This is how Google and Bing want to index sites, and thus how web based data sharing and discovery largely happens outside of academia and libraries. The rest of the Internet gets by on these technologies, could they be applied to the aims of the Discovery programme? What are the challenges standing in the way, how do they compare to current approaches? We've made some first steps into this area by using schema.org microdata in a standard library catalogue interface.

Evidence of reuse
We were late in the day releasing our data, so re-use has so far been limited. We've been trying to consume it ourselves in development and our colleagues at OCLC and the British Library have provided useful feedback. We are glad to see it included in the recent developer competition. We've pledged to support our data outputs for a year, so will respond actively to any feedback from consumers over that time.

Skills
This project entailed a large amount of 'stepping up', not least on my part. Other than the odd Talis presentation, I had only a conceptual understanding of RDF. Now I've helped write tools to create it and worked with RDF stores and application frameworks. The time out to gain this skillset has been invaluable for me. The book 'Programming the Semantic Web' has saved my sanity on a number of occasions.

In terms of embedding this knowledge, our SPARQL workshop is designed to provide the first rung on the ladder for librarians and developers interested in RDF.

Despite this, we've suffered by doing everything in house, and the steep learning curve around RDF has meant that progress is not always as it could have been. Our current datastore is holding over 30 million triples, and we've still not been able to load all of our data output. This has hit the limits of ARC2/MYSQL and we will need a more robust back-end if we are to progress further.

Our RDF vocab choice is also a bit of a shot in the dark in places, and there are things with our data structure that could do with improvement.

If we are to continue to work with RDF data, we would like to bring in external assistance on scaling and development, as well as RDF vocab and modelling.

Most significant lessons
To finish, here are some random reflective thoughts ...

1) Don't aim for 100% accuracy in publishing data. In six months, with 2.2 million records that were written over 20 years in a variety of environments, this was never going to happen. I would hope that at least 80% of our data is fit for purpose. This is 80% more open data than we had six months ago.

2) Ask others. There are strong communities built around both open and linked data. Often, it's the same people. They can be intimidating, but are useful. With hindsight, we should have asked for community guidance on issues of vocab and technology choice earlier.

3) It's an oldy, but please use Open Source and don't code from scratch if at all possible. The ARC2 framework may have its limitations with our scale of data, but it allowed a workable data site hosting 25k records to be assembled in two days. God bless Github.

Where exactly DOES a record come from?

Early on in the COMET project, Hugh Taylor assembled a complex document attempting to describe problems inherent in understanding the origin of a Marc-encoded bibliographic record. It also included a thorough analysis of Cambridge University Library data and the number and nature of vendor codes contained therein.

We've updated this document with information on the various contracts and agreements associated with each vendor code to reflect the final work. The next problem was how to make sense of it all.

A major sticking point is related to Marc21 and its usage.

In our Marc data, we have four separate fields (015, 038, 994, 035) that could indicate ownership, some of which may be repeated multiple times in a record. There is to my knowledge no mechanism in Marc21 or AACR2 to indicate which field, and thus which vendor code, takes precedence over others (although cataloguers have some 'community knowledge' in this area).

Furthermore, many vendors change code and field used. Most rely on prefixes. Some are simply unhelpful strings of numbers.

In terms of practicalities, we need to ensure that:

1) Records from vendors who explicitly and contractually prohibit re-sharing in any format are excluded; this includes most ebook and ejournal records. Otherwise, there is no good reason not to share a record, although its origin may have an impact on license choice

2) In our case, records from OCLC are segmented due to a need to publish data from that vendor under an attribution license

3) Data from vendors who are happy for non-Marc output to be shared openly, but want Marc21 output restricted, is segmented (RLUK and the BNB in our case), so these records will need to be split off

4) Data produced in house (usually that with no identifier) can be segmented for clarification

5) Everything else from smaller/ specialist record vendors is segmented together with a view to publishing openly

We've had to make some decisions over which field and vendor takes precedence based largely on this order of importance. To do this, we came up with a rough decision tree regarding record ownership:

The above JPG is also available as a Scalable Vector Graphics (SVG) file created in MS Visio.

One of my final tasks on COMET was to take this decision tree and turn it into a script to export record data for our final exercises in data publishing. I've also released a Perl script as output for COMET on our code page. (A warning / apology: this script is as ugly as the situation it attempts to resolve. It was pulled together at the last minute and could really do with a rethink.)

Both the script and the chart reflect Cambridge's specific and current situation, but should hopefully be useful for those wishing to replicate this activity.
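The five practical rules listed earlier in this post can be sketched roughly in code. This is an illustrative Python sketch under stated assumptions, not the Perl script we released: the vendor prefixes, the `ebook_vendor` name, the bucket names and the idea that the most restrictive vendor found anywhere in a record wins are all made up for the example.

```python
# Illustrative only: prefixes, vendor names and bucket names are hypothetical.
OWNERSHIP_FIELDS = ["015", "038", "994", "035"]  # the four Marc fields discussed above

# Hypothetical mapping of identifier prefixes to vendors.
PREFIX_TO_VENDOR = {
    "(OCoLC)": "oclc",
    "(RLUK)": "rluk",
    "GBA": "bnb",
    "(EBOOKCO)": "ebook_vendor",
}

# Buckets are checked in order of importance, mirroring rules 1 to 3.
RESTRICTED = {"ebook_vendor"}   # rule 1: re-sharing contractually prohibited
ATTRIBUTION = {"oclc"}          # rule 2: publish under an attribution license
RDF_ONLY = {"rluk", "bnb"}      # rule 3: open as RDF, Marc21 restricted


def vendors_in(record):
    """Return the set of vendors found in any ownership field.

    `record` is a dict mapping a Marc tag to a list of field values.
    """
    found = set()
    for tag in OWNERSHIP_FIELDS:
        for value in record.get(tag, []):
            for prefix, vendor in PREFIX_TO_VENDOR.items():
                if value.startswith(prefix):
                    found.add(vendor)
    return found


def segment(record):
    """Assign a record to one export segment, most restrictive rule first."""
    has_identifier = any(record.get(tag) for tag in OWNERSHIP_FIELDS)
    vendors = vendors_in(record)
    if vendors & RESTRICTED:
        return "exclude"
    if vendors & ATTRIBUTION:
        return "attribution"
    if vendors & RDF_ONLY:
        return "rdf_only"
    if not has_identifier:
        return "in_house"       # rule 4: no identifier, assumed produced in house
    return "open"               # rule 5: smaller / specialist vendors
```

Checking the buckets in a fixed order is one simple way to resolve records that carry codes from several vendors; a real implementation would need the actual prefixes and the full decision tree.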

As a personal opinion, I see this confusion regarding ownership as a key barrier that prevents libraries from openly sharing their data.

Furthermore, it is important that we do NOT see a repeat of this problem with the next set of record container and delivery standards.

It's my worry that stacking attribution statements in records at the bibliographic level could lead to similar problems down the road. Attribution at a data-set level, with some indication of the relationship between a record and a data-set, seems more practical.

A standardization of practice across the library community with regards to licensing could help ease this pain in the future.

Because we always need more standards.

Wednesday, 27 July 2011

More data and status of future updates

I'm pleased to announce the next data release at data.lib.cam.ac.uk - 1,741,080 records of linked RDF distributed under a Public Domain Dedication and License (PDDL).

This data is from two major UK record suppliers, RLUK and the British Library BNB. Both have indicated to us that they have no problem with records being redistributed as RDF, but would rather we did not redistribute Marc21. See a blog post on licensing for explanations as to why.

This dataset is bulk download only for now - we hope to have it in our triple-store in the future.

Although major development work on the COMET project has ended, we will continue to open-up as much of our catalogue as possible, publishing and updating datasets over the next year using the open source tools we've developed. In particular, we hope to carry data from OCLC under an Open Data Commons attribution license and enrich all datasets with FAST and VIAF links. We've been experimenting with these for the past month and results are promising.

In a follow-up post, I'll explain how we decided which data could be shared.

Friday, 22 July 2011

Friday update ...

A few things on a Friday.

Firstly, I've written a small piece on getthedata.org about data.lib.cam.ac.uk and the various mechanisms for querying and retrieving data, also mirrored on our FAQ. May be of interest to those in the Discovery developer competition.

Development work is all but done, so we've also published the application framework code behind data.lib.cam.ac.uk on the code page. This PHP based site provides a lightweight approach to RDF publishing and makes a great starting / exploration point for libraries wanting to publish data as RDF.
More details and a read-me are available on the code page. As with all our output, it's provided 'as-is' under a GPL, but we welcome feedback.

As the COMET deadline approaches next week, we are still working to release as much data as we can. Sadly, we are still waiting on final confirmation from some external bodies. As such we will work to publish and republish data using existing tools throughout the following year, as we can.

Tuesday, 19 July 2011

Cost benefits

The JISC has asked us to blog on the cost benefits of providing open data. I'll give a rough indication of costs based on time spent, an idea of what I think the benefits may be, and an indication of how the two weigh up.

Costs:

1) Marc21 data 'ownership' analysis - (5 days staff time at SP64)
Mapping and conversion of bibliographic information. An experimental and iterative process.

2) Marc21 to RDF data conversion - (x2 developers at SP53)
Again, this has been drawn out through experimental work. Several methods and iterations were tried. Those aiming to repeat this may not incur the same cost.

3) Web infrastructure development and record curation - (x2 developers at SP53)
A lightweight approach to development was taken using existing application frameworks. Time was also spent understanding underlying principles of RDF stores and associated best practice for linked data. Several iterative loads of data were undertaken in parallel with Marc to RDF conversion.

4) Hosting and sustainability costs - costs tbc
COMET's web infrastructure makes use of existing VM and MYSQL infrastructure at CARET, so additional infrastructure costs were negligible and hard to determine. We've promised to keep the service running for a year.

5) Other stuff
Project management etc.

External benefits:

Substantial contribution to Open Bibliography - Open data is arguably a good thing, and whilst ours has flaws, it is hopefully good enough to be useful to others in its own right

Clarification on licensing agreements with record vendors - Much headway has been made into this issue by the COMET project, with some clarification on licensing preferences for RDF data from three major UK record Vendors, OCLC, RLUK and the British Library. Down the line, we hope that these organizations will formalize their agreements with us so that others can benefit, which will hopefully help in publishing more data

Advice on how to analyse records to determine 'ownership' and lightweight (Perl, PHP, MYSQL based) tools to create and publish RDF linked data from Marc21

Experiments with FAST and VIAF - Two potentially useful data sources

In house benefits:

Community interaction - There is strong interest in Open Bibliography and its benefits. The University Library has also benefited greatly from its interaction with the open and linked data communities, in its work with OCLC and with others through the JISC Discovery programme

In house skills - We've gained vital in-house understanding of the design and publication of RDF. We've developed basic training materials around SPARQL for non-developers, which could pay off down the line

Summary:
External benefits clearly outweigh internal benefits, although as external benefits affect the whole library community, they also benefit us!

What's clear is that Open Data is not free data, at least not to us. We could have simply dumped our Marc21 or Dublin Core XML and have been done with it, and for many that would have sufficed.

Instead, combining our wish to publish more Open Data with a need to learn about Linked Data (and thus lashing two fast bandwagons nicely together) has pushed the costs far higher.

However, by publishing linked data we've hopefully made our output more useful to a wider community than library metadata specialists, and in that sense added value.

More data being published means greater community feedback to draw upon, which should result in lower costs for those repeating this exercise.

It may indeed be several development cycles before we or others fully reap the benefits of this work. Alternatively, things could move in a different direction, with RDF based linked data falling by the wayside in favour of more accessible mechanisms for record sharing, in which case, our work could be useful in avoiding mistakes.

Wednesday, 13 July 2011

Project update and following in our footsteps

As the COMET project comes to a close, we are working through the final piece of ownership analysis to identify more data for RDF conversion and publication.

We've loaded sample records with FAST and VIAF and are in discussion with OCLC about the best way to model them.

In the interim, we've been asked to briefly blog about helping others to 'follow in our footsteps'. We ourselves were very much following the work done by the Open Bibliography project, even if we had a slightly different focus and toolset. There was a reason for this. One of the aims of COMET, at least in my mind, was to see how easy it would be for an average library systems team to attempt the impressive work seen on projects such as Open Bibliography, work done by those who already had considerable experience of linked data and open licensing.

Here are a few tips based on our experiences.

1) Be aware of your licensing. Whilst there is no good reason not to share data, some vendors have explicitly prohibited it. We hope to have a better summary of our work examining our contracts up soon, but the main thing to look for is explicit contractual agreements from vendors that prohibit re-sharing.

Otherwise, you then have to choose an appropriate license. We've ended up 'chunking' our data so that whatever can be released under the PDDL will be. For the rest, some form of attribution license would be required.

Thankfully, few other libraries should have as complex collections of data as Cambridge, with most relying on one or two vendors.

2) Think about the backend and issues of scaling before you start. We approached COMET with an exploratory hat on; the world of triplestores and SPARQL was new and we were not sure how much data we would be able to publish. The ARC2 datastore we eventually chose was great to develop with, but ultimately unable to adequately store our entire data output. For libraries with smaller datasets (under half a million records or 16 million triples), it's well worth a look. (At least we are in good company with this; I've noticed that the DBpedia backend does not provide access to everything...)

3) Take a look at our tools. - We have a Perl Marc21 to RDF generation utility ready to go. We chose Perl as it is often used by systems librarians to 'munge' and export data. Our mapping is customisable, and the baseline triples it produces are easy to load. We've based a lot of the final output on work done by the British Library in modelling the British National Bibliography.

4) RDF vocab modelling is itself something of a burden. You can give it a lot of thought and concern, try numerous different schemas and still not be sure as to the usefulness of your output. Our advice is to focus on useful elements such as subject entries and identifiers. Be careful with the structure; too many links and nodes can lead to data that is 'linked to death'.

Don't expect to get it right first time.
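To give a feel for what tip 3 means by a customisable mapping, here is a minimal Python sketch of the general idea: a CSV file maps Marc tag and subfield pairs to RDF predicates, and each mapped value becomes one N-Triples line. It is not our Perl utility and does not reproduce its CSV layout; the column names, the three mapping rows and the function names are assumptions made for illustration.

```python
import csv
import io

# Hypothetical mapping file: Marc tag, subfield code, RDF predicate URI.
MAPPING_CSV = """tag,subfield,predicate
245,a,http://purl.org/dc/terms/title
260,b,http://purl.org/dc/terms/publisher
020,a,http://purl.org/ontology/bibo/isbn
"""


def load_mapping(text):
    """Read the CSV into a {(tag, subfield): predicate} lookup."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["tag"], row["subfield"]): row["predicate"] for row in reader}


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def record_to_ntriples(subject_uri, fields, mapping):
    """Turn (tag, subfield, value) tuples into N-Triples lines.

    Fields with no entry in the mapping are skipped.
    """
    lines = []
    for tag, subfield, value in fields:
        predicate = mapping.get((tag, subfield))
        if predicate is None:
            continue
        lines.append(f'<{subject_uri}> <{predicate}> "{escape_literal(value)}" .')
    return lines
```

Keeping the mapping in a CSV file means a cataloguer can change the vocabulary choices without touching the code, which helps when you do not expect to get it right first time.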

Monday, 11 July 2011

And now for something completely different ...

Well, fairly different. It's not record licensing or RDF vocabs, but has relevance to the wider aims of the Discovery programme. Prompted by a recent tweet, I'm blogging about search engine exposure in an academic library context.

Google, Yahoo and Bing have recently clubbed together to create schema.org, a set of data standards that lets websites provide richer information to search engines. It's a very lightweight 'semantic' approach. Many, including Eric Hellman at ALA, have said that libraries should be on board with this, and that Search Engine Optimisation should be one of our aims. Some have taken issue with this approach, but his ideas seem a far cry from the traditional closed library OPAC.

For me at least, his timing was right. As well as working on COMET to produce open, linked library metadata I've been quietly experimenting with Search Engine Optimisation (SEO) on the side.

The main problem lies in crawling. Since Google dropped OAI-PMH support a few years back, they now only accept sitemaps, large XML files of persistent URLs which their robots can crawl over.

Many library catalogues do not even have persistent URLs, let alone sitemaps, and many lack the means to develop their own.

Thanks however to some nifty 'unsupported' features in Aquabrowser, our current discovery service for library materials, I've been able to generate sitemaps for each record in the system, which have persistent URLs as standard. I understand that VuFind, an open source equivalent, can also do this.
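For those without such features to hand, a sitemap is simple enough to build from a list of record identifiers. The following Python sketch is an assumption-laden illustration, not what Aquabrowser does: the URL pattern and function names are hypothetical, while the namespace and the 50,000 URL limit per file come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # limit set by the sitemaps.org protocol


def record_url(record_id):
    """Build a persistent record URL (hypothetical pattern, for illustration)."""
    return f"http://catalogue.example.org/record/{record_id}"


def build_sitemaps(record_ids, chunk_size=MAX_URLS_PER_FILE):
    """Return a list of sitemap XML strings, one per chunk of record IDs."""
    sitemaps = []
    for start in range(0, len(record_ids), chunk_size):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for record_id in record_ids[start:start + chunk_size]:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = record_url(record_id)
        sitemaps.append(ET.tostring(urlset, encoding="unicode"))
    return sitemaps
```

At 50,000 URLs per file, a catalogue of several million records ends up as a hundred or more sitemap files, which are normally listed in a sitemap index file.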

With help from colleagues at Harvard and Chicago, I've also customised the full record display to work with two microformat standards, COinS (Z39.88) and schema.org microdata. This allowed me to get around some inadequacies in the catalogue page design, by allowing record titles to be read in preference to the page title.
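As a rough idea of what schema.org microdata looks like on a record page, here is a small Python sketch that renders a record as HTML with `itemscope`, `itemtype` and `itemprop` attributes. The function name and the surrounding markup are assumptions for illustration and are not taken from our Aquabrowser templates; `Book`, `name`, `author` and `isbn` are schema.org terms.

```python
from html import escape


def book_microdata(title, author, isbn=None):
    """Render a minimal schema.org Book snippet as an HTML string."""
    parts = [
        '<div itemscope itemtype="http://schema.org/Book">',
        f'  <h1 itemprop="name">{escape(title)}</h1>',
        f'  <span itemprop="author">{escape(author)}</span>',
    ]
    if isbn:
        # Only emit an ISBN element when the record actually has one.
        parts.append(f'  <span itemprop="isbn">{escape(isbn)}</span>')
    parts.append("</div>")
    return "\n".join(parts)
```

Marking the title up with `itemprop="name"` is what lets a crawler pick out the record title even when the HTML page title is something generic.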

Good news, right?

Yes and no. Since April, about 180,000 record pages have been indexed by Google. Sounds promising, except we submitted sitemaps of over 5 million URLs. By the time they finish, we may well have moved from Aquabrowser to a new platform. We can of course throttle up the crawling by Google, but need to watch its effect on our live service, even with some fairly beefy servers.

A few thoughts:

This was not easy and a bit experimental, but undoubtedly a useful exercise and a nice comparison to our work on COMET. I can publish 1.3 million fairly rich records as RDF in a week, but no search engine right now would want to touch them. Outside of the linked and open data communities, few would take notice and it will probably not get extra people through my doors

Realistically however, search engine exposure will bring few extra people to Cambridge libraries, unless we can get record pages linked to in a useful manner and drive results rankings up to the front page. One rough search for an indexed record 'the cat sacase library' gets me Open Library and Worldcat as top hits, but no Cambridge :(

Schema.org is aimed at e-commerce and improving services like Google shopping. Metadata choices are limited to author/title, description, identifiers, and availability. Seems fair enough, given that Google is an advertising company, but where does academic research or even basic library use actually fit in? It's designed to be extensible. Could an 'academic web crawler' make better use of the tags? What about the clever bods at Wolfram Alpha or True Knowledge? (They are also welcome to some RDF ...)

Few other libraries have even bothered with search engine exposure and optimisation, mainly due to problems with integrated library systems (Huddersfield and Lincoln being two known exceptions). Their reasons are practical: one rampant crawler could bring down both back and front office systems, and few systems support permanent URLs. Sadly, this trend may not be reversing (Aquabrowser being an exception). Services like Summon, EBSCO Discovery and Primo Central are not search engine friendly, being large closed indexes themselves. Permanent URLs for records may not be a given. Summon even does away with a full page per record, ironically because people don't 'expect that from a search engine'...

Will schema.org really take off? I am getting the feeling that I've been here before. I remember being told in training sessions many years back to 'always tag my URL' and include meta tags in web page headers. As a young, budding Librarian, this sounded great. I was very disappointed to later learn that most engines ignored them, as they were a great way of breaking ranking systems. How will this 'system gaming' be avoided with schema.org and other microdata formats?

So to summarise, right now I can expose a little data to a lot of people and hope they see it amongst a lot of other data, or expose a lot of data to a little set of people, who just might do something great with it. Meanwhile, those that use our library will probably still know to use the catalogue.

You can try our Google customised search for http://search.lib.cam.ac.uk here. I'd be interested to see what people think.

Tuesday, 5 July 2011

Two more updates ...

After this week's launch of data.lib.cam.ac.uk, it's good to follow up with some more updates.

First up is a SPARQL workshop, a three-part tutorial on RDF and the SPARQL query language aimed at I.T. workers with little to no knowledge of the semantic web, technically minded librarians, web designers and those with an interest in metadata and its (re)use.

One of the primary (and justifiable) criticisms of RDF is the high entry barrier. Much of the literature assumes a high level of technical and semantic web knowledge.

In an attempt to 'help others follow in our footsteps', I've tried to represent what I learnt myself while using SPARQL to query our dataset. This may not actually lower the entry barrier, but will hopefully provide those with an interest in RDF with a base-line starting place.
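As a taster of the kind of first step the workshop covers, here is a minimal Python sketch that wraps a simple SPARQL query in a request URL. The endpoint address is a placeholder and the use of `dcterms:title` is an assumption about how titles might be modelled, so check both against the documentation for the service you are querying. Passing the query in a `query` parameter follows the standard SPARQL protocol.

```python
from urllib.parse import urlencode

# Placeholder endpoint address; substitute the real SPARQL endpoint URL.
ENDPOINT = "http://data.example.org/sparql"

# Ask for ten records and their titles (predicate choice is an assumption).
QUERY = """PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?record ?title
WHERE { ?record dcterms:title ?title }
LIMIT 10"""


def query_url(endpoint, query):
    """Build a GET URL that passes the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})
```

The resulting URL can be pasted into a browser or fetched with any HTTP client, which is a gentle way to see raw results before moving on to a proper client library.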

Secondly, we are beginning to better link our Linked Data!

We've made some experimental gains in URI enrichment, supplementing our graphs for catalogue subject entries with links to the Library of Congress vocabularies.

See these examples:

http://data.lib.cam.ac.uk/id/entry/cambrdgedb_c1574b4e36a34f04bda61b3ea57b2379

http://data.lib.cam.ac.uk/id/entry/cambrdgedb_2ca5328ca9bebe20f37a7718d5e1f67b

http://data.lib.cam.ac.uk/id/entry/cambrdgedb_2883408d7b714bb6423d5c1ebcb40a48

As our labels are made up of a number of Library of Congress subject components (subject, geographic, chronological etc.), we are taking the initial main entry and representing it with a 'skos:broader' vocab. We would love some feedback on this approach, which is little more than a starting point. As we are using http requests against the id.loc.gov service, we are also running into scaling issues with our 600,000 + subject entries.

Enrichment is being done directly in our RDF store, so for now this is not being reflected in our bulk data downloads.
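The 'skos:broader' approach described above can be sketched in a few lines of Python. This is a simplified illustration rather than our enrichment utility: the `--` separator, the example heading, the placeholder target URI and the in-memory lookup table (standing in for http requests to id.loc.gov) are all assumptions.

```python
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"

# Stand-in for lookups against id.loc.gov. The URI is a placeholder,
# not a real Library of Congress identifier.
KNOWN_HEADINGS = {"Cats": "http://example.org/lcsh/cats"}


def main_entry(label, separator="--"):
    """Return the initial component of a compound subject label."""
    return label.split(separator)[0].strip()


def enrich(entry_uri, label, lookup=KNOWN_HEADINGS.get):
    """Return a skos:broader triple for the entry, or None if no match."""
    target = lookup(main_entry(label))
    if target is None:
        return None
    return f"<{entry_uri}> <{SKOS_BROADER}> <{target}> ."
```

Because many subject entries share the same main entry, remembering the result of each lookup rather than repeating the request would be one way to ease the scaling problem.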

Monday, 4 July 2011

data.lib.cam.ac.uk now live

Good news everyone! Our open data service, the major output from the COMET project, is now live. You can find it along with our first dataset of over a million records at data.lib.cam.ac.uk. We've been running in stealth mode for a while, but are happy to announce the live service today.

This dataset is also part of the first Discovery developer competition. It's our first 'in-house' attempt at producing Linked Data and we welcome feedback on it. See our FAQ for more information.

As we wrap COMET up over the coming month, we will have additional outcomes and datasets available at data.lib.cam.ac.uk.

In related news, the Open Bibliography project blog has published its end of project post. It's an excellent read that highlights some great achievements and provides strong example-led arguments for the value of Open Bibliography.
