Yay, my entry for the Discovery & DevCSI Developers Competition, Lodopac, was awarded a commendation for its use of the Cambridge University Library (CUL) dataset. During the judging I was asked for searches that were known to work well, since the timeout issues I discussed under Limitations are significant, especially with author or title searches. I submitted a version of the following brief general notes, which I hope are helpful to anyone else who wants to play:
The British National Bibliography (BNB) server is generally more responsive than the Cambridge University Library one, and title searches seem to work better than author searches. The following examples are hopefully useful:
- author="fisher" seems to work (at least in BNB) so long as the number of hits to return is small (e.g. 5).
- title="fisher" works well for both.
- ISBN="0709039425" should very quickly give you a single hit in both BNB and CUL.
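For anyone wondering why these behave so differently, here is a rough Python sketch of the two kinds of query involved: a regex filter capped with a small LIMIT, and an exact ISBN match. It is not the actual Lodopac code. The predicates (dct:title, bibo:isbn10) and the plain-literal ISBN are assumptions, and the BNB and CUL datasets may model these differently.

```python
# Assumed vocabularies: the real BNB/CUL data may use other predicates.
PREFIXES = (
    "PREFIX dct: <http://purl.org/dc/terms/>\n"
    "PREFIX bibo: <http://purl.org/ontology/bibo/>\n"
)


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def title_regex_query(term, limit=5):
    """Case-insensitive regex search on titles, capped with a small LIMIT."""
    return (
        PREFIXES
        + "SELECT ?book ?title WHERE {\n"
        + "  ?book dct:title ?title .\n"
        + f'  FILTER regex(str(?title), "{escape_literal(term)}", "i")\n'
        + "}\n"
        + f"LIMIT {int(limit)}"
    )


def isbn_query(isbn):
    """Exact-match lookup on an ISBN, which needs no regex at all."""
    return (
        PREFIXES
        + "SELECT ?book WHERE {\n"
        + f'  ?book bibo:isbn10 "{escape_literal(isbn)}" .\n'
        + "}"
    )


if __name__ == "__main__":
    print(title_regex_query("fisher"))
    print()
    print(isbn_query("0709039425"))
```

The regex version makes the server test every title string, whereas the ISBN version is a straight match on a known value, which is presumably why it comes back so quickly.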
I would really like to think of ways of improving free-text regular expression search times for things like author and title in SPARQL*, although I doubt there are any that don’t rely on the configuration, processing power, or indexing of the server being searched.
* Thinking aloud, some ideas might include: downloading a larger imprecise set for further local searching (e.g. for an author/title search, downloading the title matches and searching the authors locally: although this would also be slow, it would at least get round the timeout); forcing a look-up in a controlled vocabulary first in order to get an exact string match (especially for authors, although even if this is possible, it forces the user to do more work, which isn’t the point); local indexing of the triple store (this is probably the best way, but I’m not sure how to go about it, whether I really have the server capabilities to do it, or whether I can commit to the updating required).
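The first idea above (fetch the title matches, then search the authors locally) could look something like this minimal Python sketch. The row layout (dictionaries with "title" and "author" keys) and the sample rows are invented for illustration; real rows would be whatever the endpoint returns for the title search.

```python
import re


def filter_by_author(rows, author_pattern):
    """Keep rows whose 'author' field matches the pattern, ignoring case."""
    pattern = re.compile(author_pattern, re.IGNORECASE)
    return [row for row in rows if pattern.search(row.get("author", ""))]


# Placeholder rows standing in for a downloaded title-search result set.
title_matches = [
    {"title": "Example title one", "author": "Fisher, A."},
    {"title": "Example title two", "author": "Smith, B."},
    {"title": "Example title three", "author": "FISHER, C."},
    {"title": "Example title four"},  # no author recorded
]

if __name__ == "__main__":
    for row in filter_by_author(title_matches, "fisher"):
        print(row["author"], "|", row["title"])
```

The regex work moves from the server to the client, so the remote query only has to do the (apparently cheaper) title match and the timeout is less of a risk, at the cost of transferring more rows than are finally needed.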
Glad to see our service is being used, and I echo your arguments around SPARQL.
Improvements are needed, but sadly at the back end, and with SPARQL generally. With our internal ARC2-based store, every regex query is layered via PHP to MySQL (presumably as a SQL regex) and passed back. It’s a far cry from the world of free-text Lucene and Sphinx indexes. Our ARC2 is a very sub-optimal datastore, but even better, bigger versions such as the Talis service have the same drawbacks, hence Talis also offer a better indexing/search solution alongside SPARQL.
SPARQL is great for getting ‘known item’ data from solid identifiers and tracing patterns and links through data. It’s rubbish at returning free-text results.
It was fascinating to use the CUL data, and especially to compare it with the BNB data. I think I was seduced by how powerful SPARQL has the potential to be, but doing this exercise has taught me that it does have severe real-world limitations. I did some tests on the Kasabi version of the BNB, expecting it to somehow be wizard quick, but it was much the same.
The most effective, if unorthodox, solution I have come across seems to be the bif:contains construct available in Virtuoso, which made a similar script run far more quickly on a previous version of the BNB (sadly defunct since the BNB site went down). This is a variety of a two-stage approach: effectively doing a rough search first to get a manageable set, then searching that set to get the precise results required.
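As a hedged illustration of that two-stage shape, the Python sketch below builds a query where bif:contains does the rough full-text match and a regex filter then narrows the result. It only works on Virtuoso endpoints with full-text indexing enabled, the dct:title predicate is an assumption, and this is a reconstruction of the idea rather than the script I actually ran.

```python
def bif_contains_query(word, limit=5):
    """Build a Virtuoso-specific query: full-text match first, regex second."""
    # Restrict to a single alphanumeric word so the text expression stays simple.
    if not word.isalnum():
        raise ValueError("expected a single alphanumeric word")
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"  # assumed predicate vocabulary
        "SELECT ?book ?title WHERE {\n"
        "  ?book dct:title ?title .\n"
        # Stage 1: rough match using Virtuoso's full-text index.
        f"  ?title bif:contains \"'{word}'\" .\n"
        # Stage 2: precise, case-insensitive regex over the reduced set.
        f'  FILTER regex(str(?title), "{word}", "i")\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(bif_contains_query("fisher"))
```

Because bif:contains is a Virtuoso extension rather than standard SPARQL, a query like this would be rejected by other stores such as ARC2.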