Monday 11 July 2011

And now for something completely different ...

Well, fairly different. It's not record licensing or RDF vocabs, but it has relevance to the wider aims of the discovery programme. Prompted by a recent tweet, I'm blogging about search engine exposure in an academic library context.

Google, Yahoo and Bing have recently clubbed together to create schema.org, a set of data standards that lets websites provide richer information to search engines; it's a very lightweight 'semantic' approach. Many, including Eric Hellman at ALA, have said that libraries should be on board with this, and that Search Engine Optimisation should be one of our aims. Some have taken issue with this approach, but his ideas seem a far cry from the traditional closed library OPAC.

For me at least, his timing was right. As well as working on COMET to produce open, linked library metadata, I've been quietly experimenting with Search Engine Optimisation (SEO) on the side.

The main problem lies in crawling. Since Google dropped OAI-PMH support a few years back, they now only accept sitemaps: large XML files of persistent URLs that their robots can crawl over.
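
For anyone who hasn't met the format, a sitemap is nothing exotic: just a plain XML list of URLs. Here's a minimal sketch in Python, with the caveat that the /record/<id> URL pattern is invented for illustration rather than being our actual persistent URLs:

    # Write a tiny sitemap for a handful of catalogue records.
    # The /record/<id> pattern below is hypothetical.
    record_ids = ["1234567", "2345678", "3456789"]

    entries = "".join(
        "  <url><loc>http://search.lib.cam.ac.uk/record/%s</loc></url>\n" % rid
        for rid in record_ids
    )

    with open("sitemap.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        f.write(entries)
        f.write('</urlset>\n')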

Many library catalogues do not even have persistent URLs, let alone sitemaps, and many lack the means to develop their own.

Thanks, however, to some nifty 'unsupported' features in Aquabrowser, our current discovery service for library materials, I've been able to generate sitemaps covering every record in the system, each of which has a persistent URL as standard. I understand that VuFind, an open source equivalent, can also do this.
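
The Aquabrowser specifics aren't worth reproducing here, but the overall shape of the output is easy to sketch. The sitemap protocol caps each file at 50,000 URLs, so a catalogue-sized export means many sitemap files plus a sitemap index that points at them, along the lines below (again with a made-up /record/ URL pattern and toy record IDs):

    # Split a large set of record IDs into sitemap files of at most 50,000 URLs
    # each, then write a sitemap index pointing at them. The IDs and the
    # /record/ URL pattern are placeholders for whatever the discovery layer exports.
    BASE = "http://search.lib.cam.ac.uk"
    MAX_URLS = 50000

    def write_sitemaps(record_ids):
        sitemap_files = []
        for n, start in enumerate(range(0, len(record_ids), MAX_URLS), 1):
            name = "sitemap-%04d.xml" % n
            with open(name, "w") as f:
                f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
                f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
                for rid in record_ids[start:start + MAX_URLS]:
                    f.write("  <url><loc>%s/record/%s</loc></url>\n" % (BASE, rid))
                f.write("</urlset>\n")
            sitemap_files.append(name)

        # The index file is the thing actually submitted to the search engine.
        with open("sitemap_index.xml", "w") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for name in sitemap_files:
                f.write("  <sitemap><loc>%s/%s</loc></sitemap>\n" % (BASE, name))
            f.write("</sitemapindex>\n")

    write_sitemaps(["%07d" % i for i in range(1, 120001)])  # toy run: 120,000 fake IDs -> 3 files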

With help from colleagues at Harvard and Chicago, I've also customised the full record display to carry two embedded metadata standards, COinS (Z39.88) and schema.org microdata. This allowed me to get around some inadequacies in the catalogue page design, by letting record titles be read in preference to the generic page title.
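
For the curious, the kind of markup this adds to a full record page looks roughly like the fragment below: a schema.org Book item expressed as microdata attributes, plus an empty COinS span whose title attribute carries a Z39.88 OpenURL context object (which tools like Zotero can pick up). The example record and the exact fields are illustrative, not a copy of our live markup:

    # Sketch of the embedded metadata on a record page: schema.org microdata
    # plus a COinS span. The record below is purely illustrative.
    record = {"title": "The Cat in the Hat", "author": "Seuss, Dr.", "isbn": "9780394800011"}

    microdata = """
    <div itemscope itemtype="http://schema.org/Book">
      <h1 itemprop="name">{title}</h1>
      <p>by <span itemprop="author">{author}</span></p>
      <p>ISBN <span itemprop="isbn">{isbn}</span></p>
    </div>
    """.format(**record)

    # COinS: an empty span whose title attribute holds the OpenURL context object.
    coins = ('<span class="Z3988" title="ctx_ver=Z39.88-2004'
             '&amp;rft_val_fmt=info:ofi/fmt:kev:mtx:book'
             '&amp;rft.btitle=The+Cat+in+the+Hat'
             '&amp;rft.au=Seuss,+Dr.'
             '&amp;rft.isbn=9780394800011"></span>')

    print(microdata + coins)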

Good news, right?

Yes and no. Since April, about 180,000 record pages have been indexed by Google. Sounds promising, except that we submitted sitemaps of over 5 million URLs. By the time they finish, we may well have moved from Aquabrowser to a new platform. We can of course throttle up the crawling by Google, but we need to watch its effect on our live service, even with some fairly beefy servers.

A few thoughts:
  • This was not easy and a bit experimental, but it was undoubtedly a useful exercise and a nice comparison to our work on COMET. I can publish 1.3 million fairly rich records as RDF in a week, but no search engine right now would want to touch them. Outside of the linked and open data communities, few would take notice, and it will probably not get extra people through my doors.
  • Realistically, however, search engine exposure will bring few extra people to Cambridge libraries unless we can get record pages linked to in a useful manner and drive results rankings up to the front page. One rough search for an indexed record, 'the cat sacase library', gets me Open Library and WorldCat as top hits, but no Cambridge :(
  • Schema.org is aimed at e-commerce and improving services like Google Shopping. Metadata choices are limited to author/title, description, identifiers, and availability. Seems fair enough, given that Google is an advertising company, but where does academic research, or even basic library use, actually fit in? It's designed to be extensible. Could an 'academic web crawler' make better use of the tags? What about the clever bods at Wolfram Alpha or True Knowledge? (They are also welcome to some RDF ...)
  • Few other libraries have even bothered with search engine exposure and optimisation, mainly due to problems with integrated library systems (Huddersfield and Lincoln being two known exceptions). Their reasons are practical: one rampant crawler could bring down both back- and front-office systems, and few systems support permanent URLs. Sadly, this trend may not be reversing (Aquabrowser being an exception). Services like Summon, EBSCO Discovery and Primo Central are not search engine friendly, being large closed indexes themselves. Permanent URLs for records may not be a given. Summon even does away with a full page per record, ironically because people don't 'expect that from a search engine' ...
  • Will schema.org really take off? I am getting the feeling that I've been here before. I remember being told in training sessions many years back to 'always tag my URL' and include meta tags in web page headers. As a young, budding librarian, this sounded great. I was very disappointed to learn later that most engines ignored them, as they were a great way of gaming ranking systems. How will this 'system gaming' be avoided with schema.org and other microdata formats?
So to summarise: right now I can expose a little data to a lot of people and hope they see it amongst a lot of other data, or expose a lot of data to a small set of people, who just might do something great with it. Meanwhile, those who use our library will probably still know to use the catalogue.

You can try our customised Google search for http://search.lib.cam.ac.uk here. I'd be interested to see what people think.

5 comments:

  1. If you are interested in attempting this, Eric Hellman has posted a fantastic 'how-to':

    http://go-to-hellman.blogspot.com/2011/07/spoonfeeding-library-data-to-search.html?showComment=1311344493893#c945323345908787626

  2. Thanks for writing up your efforts.

    I have experimented with sitemaps for a subset of a library catalog (the last six months' cataloging) and at last check about a third of the links (20,000 out of 60,000) were in Google's index. This number seems to shift. Interestingly, Bing hasn't crawled it yet.

    http://library.brown.edu/titles/

    and

    http://library.brown.edu/titles/sitemap.xml

    Ted
