You can’t go far these days without tripping over commentary on Google’s strategy. I’ve not really paid this much attention, but it’s been interesting watching the launch of Google Scholar and the reactions from the library communities, because it directly intersects with my day job: managing the team that has built and is enhancing IngentaConnect, my employer’s new scholarly content aggregation site.
I thought it might be interesting to share some perspectives on working with Google and a couple of notes on Google Scholar itself.
My involvement with Google goes back about nine months, to when they contacted us to see if we wanted to collaborate with them on an initiative to add more scholarly content to the Google indexes. Of course we jumped at the chance; this is undeniably a Good Thing (both for us and the publishers we work with).
So the first step there was to help them ensure that the crawler could get to all the content. Our original site had a fairly crufty link syntax (too much reliance on query strings) so the first issue was tweaking their crawler to work around this. The new site is much cleaner as, like a lot of people, we’ve learnt a thing or two about REST recently.
The second issue was to ensure that the crawler got the full text, so they could work their magic on the full content rather than just the titles and abstracts. A bit of sleight-of-hand at our end ensured that the crawler got what it needed, but with the URLs in the Google index being a suitable entry point for an end user.
Like any search engine, the Googlebot simply adds the URLs that it GETs to the index, so you have to think a bit about your URL structure and where you route the bot if it’s different to where you’d normally route a user. The crawler doesn’t seem to have any real notion of a “preferred” URL for content: it investigates every link and uses content-based checksums to de-duplicate the data.
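To make the de-duplication point concrete, here is a minimal Python sketch of content-based checksums. It is only an illustration of the general idea, not a description of what Google actually does: the whitespace normalisation, the choice of SHA-256, and the example.org URLs are all my own assumptions.

```python
import hashlib


def content_checksum(body: str) -> str:
    """Return a digest of the page body, ignoring whitespace differences."""
    # Collapsing whitespace is an assumed normalisation step, chosen for the demo.
    normalised = " ".join(body.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def deduplicate(pages: dict[str, str]) -> dict[str, list[str]]:
    """Group URLs by the checksum of the content they served.

    `pages` maps URL -> fetched body. Each value in the result lists every
    URL that returned the same content, in the order the URLs were seen.
    """
    groups: dict[str, list[str]] = {}
    for url, body in pages.items():
        groups.setdefault(content_checksum(body), []).append(url)
    return groups


# Hypothetical URLs: an old query-string link and a cleaner link to the same article.
pages = {
    "http://example.org/content?id=42": "<h1>An Article</h1>\n<p>Full text</p>",
    "http://example.org/content/42": "<h1>An Article</h1> <p>Full text</p>",
    "http://example.org/content/43": "<h1>Another Article</h1>",
}
groups = deduplicate(pages)
duplicates = [urls for urls in groups.values() if len(urls) > 1]
```

The practical upshot is that two routes to the same page collapse into one entry, but nothing here says which of the two URLs survives, which is why it pays to think about where you send the bot.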
You can also make the bot’s life easier by providing it with a “sitemap” so it can quickly harvest all the content. So this is my first tip to site owners: publish an index of your site specifically for the Googlebot (and other crawlers) and you’ll be indexed much more quickly. If you contact Google you should be able to get the index added to the crawl, which is useful if you don’t care to publish the sitemap to end users.
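A crawler index of this kind can be as simple as a page of links. The Python sketch below builds one from a list of URLs; the function name, the page layout and the example.org addresses are hypothetical, and it isn’t the code we used on our own site.

```python
from html import escape


def build_crawler_index(urls, title="Site index"):
    """Return a plain HTML page containing one link per URL."""
    links = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in urls
    )
    return (
        "<html><head><title>" + escape(title) + "</title></head>\n"
        "<body><ul>\n" + links + "\n</ul></body></html>"
    )


# Hypothetical article URLs; in practice these would come from your content database.
article_urls = [
    "http://example.org/content/42",
    "http://example.org/content/43",
]
index_html = build_crawler_index(article_urls)
```

For a large site you would probably want to split the index across several pages, so that no single page carries an unreasonable number of links.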
We turned all this work around very quickly and it was then just a matter of sitting back and watching the Googlebot wash over us. Well, that and play with the Google frisbee that their marketing department sent me. Actually, that’s a lie. They must breed them differently in the Googleplex. Go outside? Run, like, around? Surely some mistake?!(*)
Early on in the discussions I’d checked with Anurag whether we could co-ordinate to ensure that the ‘bot came in at quiet times to avoid swamping the servers. But that’s not how it works: the Googlebot wanders where it might and can’t be trained to index particular sites at particular times. It is performance sensitive though, so it will back off from a site if the response times start to increase.
So, another tip I’d share is to rap the ‘bot on the nose by throttling it (e.g. via Apache) so that it becomes a much friendlier beast to work with. You’re then in a better position to control when and how quickly the bot hits your site. Even with their built-in rate limiting you can get sudden peaks of load that could swamp a server.
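We did our throttling in the web server, but the underlying idea is easy to show in a few lines of Python. This is a sketch of a token bucket, which is one common way to rate limit and not necessarily what any particular Apache module does. The class and function names, the user-agent test and the choice of a 503 response are all assumptions for the sake of the example.

```python
import time


class BotThrottle:
    """Token bucket: roughly `rate` requests per second, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        """Refill tokens for the time elapsed, then spend one if available."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def is_crawler(user_agent):
    """Crude, assumed check: treat anything announcing itself as Googlebot as a crawler."""
    return "googlebot" in user_agent.lower()


def status_for(user_agent, throttle):
    """Return 503 for a crawler over its budget, 200 otherwise."""
    if is_crawler(user_agent) and not throttle.allow():
        return 503
    return 200


# Demo with a fake clock: a burst of two is allowed, the third request is refused.
fake_now = [0.0]
throttle = BotThrottle(rate=1.0, burst=2, clock=lambda: fake_now[0])
burst_statuses = [status_for("Googlebot/2.1", throttle) for _ in range(3)]
fake_now[0] = 1.0  # one second later a single token has been refilled
later_status = status_for("Googlebot/2.1", throttle)
user_status = status_for("Mozilla/5.0", throttle)
```

Ordinary users never touch the bucket, so only the crawler feels the limit. Since the bot backs off when a site slows down, refusing or delaying its requests should have much the same calming effect.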
With our content appearing in the Google indexes it’s been interesting to watch the referral traffic increase very nicely. Now that Google Scholar has launched, the referrals from the new site have similarly jumped into life; they must already have attracted a large user base, which isn’t that surprising.
I also had a bit of fun with bookmarklets to help me highlight our content in the Scholar indexes, check what’s been indexed, etc. Note these are only certified for a real man’s browser at the moment. There are more to come, to better tie in Google Scholar results both with our own site and others.
Surprisingly, Google seem to be being a bit cagey about who is in/out of the Scholar indexes and their criteria for selection. I know we’re in, and I also know they were working with the CrossRef folk among others, so that’s a fair percentage of scholarly publishers. I’ve also seen PubMed and other well-known sites cropping up repeatedly in test searches I’ve done on the site.
This highlights another misconception I’ve seen in some of the recent commentary: as far as I can tell Scholar is not yet making more of the invisible web visible; it’s mainly a subset of Google’s existing index. I don’t see that they’ve created a custom crawler, so I’m expecting data to appear in both the main index and Scholar. The latter has just had some limited editorial input (domain selection from what I can see) and some extra processing, e.g. citation extraction and analysis.
Based on hard-won experience I can predict a number of debates about Google Scholar that are still to come, but one that’s worth mentioning now is the old structured metadata versus text indexing debate. In fact Danny is on this tack already.
For what it’s worth IngentaConnect has had Dublin Core metadata embedded in article pages since the first beta, with RDF to follow soon. This ought to help anyone interested in writing a scutter. Again more details to follow.
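For anyone tempted to write that scutter, pulling Dublin Core out of a page is not much work. The Python sketch below assumes the metadata is carried in HTML meta elements whose names start with “DC.”, which is the usual convention; the sample page and its values are made up for illustration and aren’t taken from our site.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect the content of <meta> elements whose name starts with 'DC.'."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        # Elements such as DC.creator can repeat, so keep a list per name.
        if name.lower().startswith("dc.") and content is not None:
            self.metadata.setdefault(name, []).append(content)


# A made-up page header standing in for a fetched article page.
page = """<html><head>
<meta name="DC.title" content="An Example Article">
<meta name="DC.creator" content="A. Author">
<meta name="DC.creator" content="B. Author">
<meta name="description" content="Not Dublin Core">
</head><body></body></html>"""

parser = DublinCoreParser()
parser.feed(page)
dc = parser.metadata
```

The result is a plain dictionary keyed by element name, which is enough to start building citations or feeding another tool.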
In fact, the embedded metadata and a cleaner site design are already bearing fruit in the form of the rather del.icio.us (pun intended!) CiteULike. Richard Cameron is making a nice job of that site, and hopefully gadgets like mattb’s Python API will be appearing for it shortly. You can follow the development of that site in the CiteULike devblog.
(*) Actually Wayne Davey did use to work for us, but again, finance people are a different breed entirely :)