Slug: A Simple Semantic Web Crawler

Back in March I was tinkering with writing a Scutter. I’d never written a web crawler before, so was itching to give it a go as a side project. I decided to call it Slug because I was pretty sure it’d end up being slow and probably icky; crafting a decent web crawler is an art in itself.

I got as far as putting together a basic framework that did the essential stuff: reading a scutter plan, fetching the documents using multi-threaded workers, etc. But I ended up getting sucked into a work project that ate up all my time so didn’t get much further with it.

Anyway, because the world is obviously sorely in need of another half-finished Scutter implementation, I’ve spent a few hours this evening tidying up some of the code so that it’s suitable for sharing.

If you’re just interested in the code, then let’s get the links out of the way first:

The code is published under a Creative Commons Attribution-ShareAlike licence.

To run the code using the supplied batch file (sorry, don’t have access to a *nix box at the moment to add a shell script) do the following from the directory into which you unpack the zip:

slug -mem memory.rdf -workers 10 -plan sample-plan.rdf

This will kick off a scutter with 10 worker threads, as well as telling it where to find its memory and new scutter plan.

As Slug is basically a prototype it doesn’t do anything clever with what it finds. It simply GETs every URL from its RDF scutter plan and writes a copy of the original RDF file to the filesystem, which it then parses with Jena to find any rdfs:seeAlso links. The new URLs it finds as a result are then added to its ongoing list of tasks. And so on ad infinitum: it’ll just keep on sliming its way across the semantic web until you kill it. You can merrily Ctrl-C the process, as there’s a shutdown hook registered that’ll ensure the process tidies up after itself.
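If it helps to picture it, that fetch/parse/enqueue cycle boils down to something like the following. This is a hand-wavy Java sketch, not Slug’s actual API: the class and method names are mine, and the real HTTP and Jena plumbing is stubbed out.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of the crawl cycle described above: GET a URL,
// cache the document, parse it for seeAlso links, enqueue anything new.
public class CrawlLoop {
    private final Queue<String> tasks = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    public CrawlLoop(List<String> plan) {
        for (String url : plan) {
            enqueue(url);
        }
    }

    // Add a URL to the task list unless it has already been scheduled.
    void enqueue(String url) {
        if (seen.add(url)) {
            tasks.add(url);
        }
    }

    // One worker iteration; returns false once the queue is empty.
    public boolean step() {
        String url = tasks.poll();
        if (url == null) {
            return false;
        }
        String rdf = fetch(url);               // GET + write copy to the cache
        for (String next : findSeeAlsos(rdf)) {
            enqueue(next);                     // and so on, ad infinitum
        }
        return true;
    }

    // Stubs standing in for the real HTTP client and Jena parsing.
    String fetch(String url) { return ""; }
    List<String> findSeeAlsos(String rdf) { return List.of(); }
}
```

In the real thing the queue is shared between the multi-threaded workers, but the shape of the loop is the same.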

The reason it doesn’t add the triples directly to a triple store is because I wanted to be able to collect a chunk of RDF files locally for processing in different ways, e.g. to test out smushing algorithms, look for common authoring mistakes, etc. By default these files are stored in a slug-cache directory under your user home — but you can override that with the -cache parameter.
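For illustration, a cache layout along these lines would do the job. This is just a guess at the approach — hashing the URL to get a stable file name — not necessarily how Slug actually names its cached files.

```java
import java.io.File;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: derive a stable cache file for a URL under a
// slug-cache directory in the user's home (overridable, like -cache).
public class CachePaths {
    public static File cacheFileFor(String url, String cacheDir) {
        try {
            // Hash the URL so the file name is filesystem-safe and stable.
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            String name = new BigInteger(1, digest).toString(16) + ".rdf";
            return new File(cacheDir, name);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // The default location: a slug-cache directory under the user home.
    public static String defaultCacheDir() {
        return System.getProperty("user.home") + File.separator + "slug-cache";
    }
}
```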

The one novel thing it does do (at least as far as I’m aware) is to use the ScutterVocab to record what it did when. This is what gets stored in the memory. Here’s an extract from the example included in the distribution:

<scutter:Representation>
  <scutter:source rdf:resource=""/>
  <scutter:origin rdf:resource=""/>
  <scutter:origin rdf:resource=""/>
  <scutter:origin rdf:resource=""/>
  <scutter:lastModified>Mon, 05 Jul 2004 13:52:28 GMT</scutter:lastModified>
  <scutter:fetch rdf:nodeID="A164"/>
  <scutter:latestFetch rdf:nodeID="A164"/>
  <scutter:origin rdf:resource=""/>
  <scutter:origin rdf:resource=""/>
  ...
</scutter:Representation>

The source property indicates the source URL of the Representation, and the origin properties indicate references to it from elsewhere.

The Scutter stores the results of its GET in a Fetch resource that includes details such as the date of the fetch, the HTTP response code, the Last-Modified and ETag headers (Slug supports Conditional GET behaviour), and the number of triples in the file. If Slug encountered an error then a Reason is recorded too, and it’ll avoid refetching that URL. See the ScutterVocab page for more details.
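The Conditional GET side of that just means replaying the stored header values on the next request, so the server can answer 304 Not Modified when nothing has changed. Roughly like this — a sketch using java.net.HttpURLConnection, not Slug’s actual code:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: replay the Last-Modified and ETag values recorded in the
// memory so an unchanged document comes back as 304 Not Modified.
public class ConditionalGet {
    public static void addConditionalHeaders(HttpURLConnection conn,
                                             String lastModified,
                                             String etag) {
        if (lastModified != null) {
            conn.setRequestProperty("If-Modified-Since", lastModified);
        }
        if (etag != null) {
            conn.setRequestProperty("If-None-Match", etag);
        }
    }

    // Convenience wrapper; openConnection() does no network I/O yet.
    public static HttpURLConnection open(String url) {
        try {
            return (HttpURLConnection) new URL(url).openConnection();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

If the response code comes back as HttpURLConnection.HTTP_NOT_MODIFIED, the cached copy is still fresh and the worker can skip the download and parse.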

That’s pretty much it. No fancy crawling strategies, no loop detection, no clever handling of HTML responses to look for referenced metadata, and no LiveJournal avoidance tactics. If you want to do something more clever with it, though, then the framework is reasonably extensible:

For example, if you want to put the triples directly into a triple store, then just add a new Consumer implementation. The DelegatingConsumerImpl I’m already using can create a simple pipeline for handling the results of a GET.
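By way of illustration, the pipeline idea looks something like the following. The interface shape here is my paraphrase of the framework, not its actual signatures, and CountingConsumer is a made-up example delegate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Consumer pipeline: each consumer handles
// the result of a GET, and a delegating consumer fans that result out
// to a list of others, in the spirit of DelegatingConsumerImpl.
interface Consumer {
    void consume(String url, String rdf);
}

class DelegatingConsumer implements Consumer {
    private final List<Consumer> delegates = new ArrayList<>();

    public void addConsumer(Consumer c) {
        delegates.add(c);
    }

    @Override
    public void consume(String url, String rdf) {
        for (Consumer c : delegates) {
            c.consume(url, rdf);
        }
    }
}

// Example delegate: a real extension would add the triples to a
// triple store here instead of just counting documents.
class CountingConsumer implements Consumer {
    int seen = 0;

    @Override
    public void consume(String url, String rdf) {
        seen++;
    }
}
```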

Or if you want to add a user interface then there are hooks for that too: look at the Controller and Monitor interfaces. There are methods there for monitoring how many threads are active, and for dynamically adjusting the number of workers.
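A minimal monitor might be little more than a thread-safe counter. Again, this is a sketch of the idea; the real Controller and Monitor interfaces will differ.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: workers report when they start and finish, and
// a UI can poll the active count or adjust the target worker count.
public class WorkerMonitor {
    private final AtomicInteger active = new AtomicInteger();
    private volatile int targetWorkers;

    public WorkerMonitor(int targetWorkers) {
        this.targetWorkers = targetWorkers;
    }

    public void workerStarted()  { active.incrementAndGet(); }
    public void workerFinished() { active.decrementAndGet(); }

    public int activeWorkers() { return active.get(); }

    // Dynamically adjust how many workers the controller should run.
    public void setTargetWorkers(int n) { targetWorkers = n; }
    public int getTargetWorkers() { return targetWorkers; }
}
```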

But if you’re just interested in analysing links between resources on the semantic web, getting estimates of numbers of triples, or analysing the RDF that’s out there to look for common authoring mistakes, etc., then just collecting data in Slug’s memory and offline cache may be sufficient for your needs.

Anyway, if you do find this useful, or want help getting it up and running and/or integrated into your own applications then please feel free to get in touch.

At the moment I’m noodling with an alternate version which uses asynchronous messaging over JMS as the basic Scutter kernel. Matt Biddulph’s Crawling the Semantic Web paper mentions using asynchronous messaging to provide co-ordination between a Scutter and applications interested in RDF data, so I may take a crack at something in that vein.