Lunchtime Lecture: “How you (yes, you) can contribute to open data”

The following is a written version of the lunchtime lecture I gave today at the Open Data Institute. I’ll put in a link to the video when it comes online. It’s not a transcript; I’m just writing down what I had planned to say.

Hello!

I’m going to talk today about some of the projects that first got me excited about data on the web and open data specifically. I’m hopefully going to get you excited about them too. And show some ways in which you can individually get involved in creating some open data.

Open data is not (just) open government data

I’ve been reflecting recently about the shape of the open data community and ecosystem, to try and understand common issues and areas for useful work.

For example, we spend a lot of time focusing on Open Government Data. And so we talk about how open data can drive economic growth, create transparency, and be used to help tackle social issues.

But open data isn’t just government data. It’s a broader church that includes many different communities and organisations who are publishing and using open data for different purposes.

Open data is not (just) organisational data

More recently, as a community, we’ve focused some of our activism on encouraging commercial organisations to not just use open data (which many have been doing for years), but also to publish open data.

And so we talk about how open data can be supported by different business models and the need for organisational change to create more open cultures. And we collect evidence of impact to encourage more organisations to also become more open.

But open data isn’t just about data from organisations. Open data can be created and published by individuals and communities for their own needs and purposes.

Open data can (also) be a creative activity

Open data can also be a creative activity. A means for communities to collaborate around sharing what they know about a topic that is important or meaningful to them. Simply because they want to do it. I think sometimes we overlook these projects in the drive to encourage governments and other organisations to publish open data.

So I’m going to talk through eight (you said six in the talk, idiot! – Ed) different example projects. Some you will have definitely heard about before, but I suspect there will be a few that you haven’t. In most cases the primary goals of these projects are to create an openly licensed dataset. So when you contribute to the project, you’re directly helping to create more open data.

Of course, there are other ways in which we each contribute to open data. But these are often indirect contributions. For example, where our personal data that is held in various services is aggregated, anonymised and openly published. But today I want to focus on more direct contributions.

For each of the examples I’ve collected a few figures that indicate the date the project started, the number of contributors, and an indication of the size of the dataset. Hopefully this will help paint a picture of the level of effort that is already going into maintaining these resources. (Psst, see the slides for the figures – Ed)

Wikipedia

The first example is Wikipedia. Everyone knows that anyone can edit Wikipedia. But you might not be aware that Wikipedia can be turned into structured data and used in applications. There are lots of projects that do this, e.g. dbpedia, which brings Wikipedia into the web of data.

The bits that get turned into structured data are the “infoboxes” that give you the facts and figures about the person (for example) that you’re reading about. So if you add to Wikipedia, and specifically add to the infoboxes, then you’re adding to an openly licensed dataset.
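If you want a feel for what that structured data looks like, dbpedia exposes a public SPARQL endpoint you can query from the command line. Here’s a rough sketch; the resource chosen and the 20-row limit are just illustrative:

#Sketch: ask dbpedia for the structured facts it extracted from a Wikipedia page
curl -G 'http://dbpedia.org/sparql' \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT ?property ?value WHERE { <http://dbpedia.org/resource/The_Hobbit> ?property ?value } LIMIT 20'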

The most obvious example of where this data is used is in Google search results. The infoboxes you see on search results whenever you google for a person, place or thing are partly powered by Wikipedia data.

A few years ago I added a Wikipedia page for Gordon Boshell, the author of some children’s books I loved as a kid. There wasn’t a great deal of information about him on the web, so I pulled together whatever I could find and created a page for him. Now when anyone searches for Gordon Boshell they can see some information about him right on Google. And the results now link out to the books that he wrote. It’s nice to think that I’ve helped raise his profile.

There’s also a related project from the Wikimedia Foundation called Wikidata. Again, anyone can edit it, but it’s a database of facts and figures rather than an encyclopedia.

OpenStreetMap

The second example is OpenStreetMap. You’ll definitely have already heard about its goal to create a crowd-sourced map of the world. OpenStreetMap is fascinating because it’s grown this incredible ecosystem of tools and projects that make it easier to contribute to the database.

I’ve recently been getting involved with contributing to OpenStreetMap. My initial impression was that I was probably going to have to get a commercial GPS and go out and do complicated surveying. But it’s not like that at all. It’s really easy to add points to the map, and to use their tools to trace buildings from satellite imagery. They provide great tutorials to help you get started.

It’s surprisingly therapeutic. I’ve spent a few evenings drinking a couple of beers and tracing buildings. It’s a bit like an adult colouring book, except you’re creating a better map of the world. Neat!

There are a variety of other tools that let you contribute to OpenStreetMap. For example Wheelmap allows you to add wheelchair accessibility ratings to locations on the map. We’ve been using this in the AccessibleBath project to help crowd-source data about wheelchair accessibility in Bath. One afternoon we got a group of around 25 volunteers together for a couple of hours and mapped 30% of the city centre.

There’s a lot of humanitarian mapping that happens using OpenStreetMap. If there’s been a disaster or a disease outbreak then aid workers often need better maps to help reach the local population and target their efforts. Missing Maps lets you take part in that. They have a really nice workflow that lets you contribute towards improving the map by tracing satellite imagery.

There’s a related project called MapSwipe. It’s a mobile application that presents you with a grid of satellite images. All you have to do is tap the tiles that contain a building and then swipe left. Behind the scenes this data is used to direct Missing Maps volunteers towards the areas where more detailed mapping would be most useful. This focuses contributors’ attention where it’s best needed and so is really respectful of people’s time.

MapSwipe can also be used offline. So you can download a work package to do when you’re on your daily commute. Easy!

Zooniverse

You’ve probably also heard of Zooniverse, which is my third example. It’s a platform for citizen science projects. That just means using crowd-sourcing to create scientific datasets.

Their most famous project is probably GalaxyZoo which asked people to help classify objects in astronomical imagery. But there are many other projects. If you’re interested in biology then perhaps you’d like to help catalogue specimens held in the archives of the Natural History Museum?

Or there’s Old Weather, which I might get involved with. In that project you can help to build a picture of our historical climate by transcribing the weather reports that whaling ship captains wrote in their logs. By collecting that information we can build a dataset that tells us more about our climate.

I think it’s a really innovative way to use historical documents.

MusicBrainz

This is my fourth and favourite example. MusicBrainz is a database of music metadata: information about artists, albums, and tracks. It was created in direct response to commercial music databases that were asking people to contribute to their dataset, but then were taking all of the profits and not returning any value to the community. MusicBrainz created a free, open alternative.

I think MusicBrainz was the first open dataset I got involved with. I wrote a client library to help developers use the data. (14 years ago, and you’re still talking about it – Ed)

MusicBrainz has also grown a commercial ecosystem around it, which has helped it be sustainable. There are a number of projects that use the dataset, including Spotify. And it’s been powering the BBC Music website for about ten years.

Discogs

My fifth example, Discogs, is also a music dataset. But it’s a dataset about vinyl releases, so it focuses more on the releases, labels, etc. Discogs is a little different because it started as, and still is, a commercial service. At its core it’s a marketplace for record collectors. But to power that marketplace you need a dataset of vinyl releases. So they created tools to help the community build it. And, over time, it’s become progressively more open.

Today all of the data is in the public domain.

OpenPlaques

My sixth example is OpenPlaques. It’s a database of the commemorative plaques that you can see dotted around on buildings and streets. The plaques mark that an important event happened in that building, or that someone famous was born or lived there. Volunteers take photos of the plaques and share them with the service, along with the text and names of anyone who might be mentioned in the plaque.

It provides a really interesting way to explore the historical information in the context of cities and buildings. All of the information is linked to Wikipedia so you can find out more information.

Rebrickable

My seventh example is Rebrickable, which you’re unlikely to have heard about. I’m cheating a little here as it’s a service and not strictly a dataset. But it’s Lego, so I had to include it.

Rebrickable has a big database of all the official Lego sets and what parts they contain. If you’re a fan of Lego (they’re called AFOLs – Ed) who designs and creates your own custom Lego models (they’re known as MOCs – Ed), then you can upload the designs and instructions to the service in machine-readable LEGO CAD formats.

Rebrickable exposes all of the information via an API under a liberal licence. So people can build useful tools. For example using the service you can find out which other official and custom sets you can build with bricks you already own.

Grand Comics Database

My eighth and final example is the Grand Comics Database. It’s also the oldest project as it was started in 1994. The original creators started with desktop tools before bringing it to the web.

It’s a big database of 1.3m comics. It contains everything from The Dandy and The Beano through to Marvel and DC releases. It’s not just data on the comics, but also story arcs, artists, authors, etc. If you love comics you’ll love GCD. I checked, and this week’s 2000AD (published 2 days ago – Ed) is in there already.

So those are my examples of places where you could contribute to open data.

Open data is an enabler

The interesting thing about them all is that open data is an enabler. Open data isn’t creating economic growth, or being used as a business model. Open licensing is being applied as a tool.

It creates a level playing field that means that everyone who contributes has an equal stake in the results. If you and I both contribute then we can both use the end result for any purpose. A commercial organisation is not extracting that value from us.

Open licensing can help to encourage people to share what they know, which is the reason the web exists.

Working with data

The projects are also great examples of ways of working with data on the web. They’re all highly distributed projects, accepting submissions from people internationally who will have very different skill sets and experience. That creates a challenge that can only be dealt with by having good collaboration tools and by having really strong community engagement.

Understanding how and why people contribute to your open database is important. Because often those reasons will change over time. When OpenStreetMap had just started, contributors had the thrill of filling in a blank map with data about their local area. But now contributions are different. It’s more about maintaining data and adding depth.

Collaborative maintenance

In the open data community we often talk about making things open to make them better. It’s the tenth GDS design principle. And making data open does make it better, in the sense that more people can use it. And perhaps more eyes can help spot flaws.

But if you really want to let people help make something better, then you need to put your data into a collaborative environment. Then the data can get better at the pace of the community, not at the pace at which you can accept feedback.

It’s not work if you love it

Hopefully the examples give you an indication of the size of these communities and how much has been created. It struck me that many of them have been around since the early 2000s. I’ve not really found any good recent examples (Maybe people can suggest some – Ed). I wonder why that is?

Most of the examples were born around the Web 2.0 era (Mate. That phrase dates you. – Ed) when we were all excitedly contributing different types of content to different services. Bookmarks and photos and playlists. But now we mostly share things on social media. It feels like we’ve lost something. So it’s worth revisiting these services to see that they still exist and that we can still contribute.

While these fan communities are quietly hard at work, maybe we in the open data community can do more to support them?

There are a lot of examples of “open” datasets that I didn’t use because they’re not actually open. The licences are restrictive. Or the community has decided not to think about it. Perhaps we can help them understand why being a bit more open might be better?

There are also examples of openly licensed content that could be turned into more data. Take Wikia, for example. It contains 360,000 wikis, all with openly licensed content. They get 190m views a month and the system contains 43 million pages. That’s about the same size as the English version of Wikipedia currently. They’re all full of infoboxes that are crying out to be turned into structured data.

I think it’d be great to make all this fan-produced data a proper part of the open data commons, sitting alongside the government and organisational datasets that are being published.

Thank you (yes, you!)

That’s the end of my talk. I hope I’ve piqued your interest in looking at one or more of these projects in more detail. Hopefully there’s a project that will help you express your inner data geek.

Photo Attributions

Lego Spaceman, Edwin Andrade, Jamie Street, Olu Elet, Aaron Burden, Volkan Olmez, Alvaro Serrano, RawPixel.com, Jordan Whitfield, Anthony DELANOIX

 

“Open”

For the purposes of having something to point to in future, here’s a list of different meanings of “open” that I’ve encountered.

XYZ is “open” because:

  • It’s on the web
  • It’s free to use
  • It’s published under an open licence
  • It’s published under a custom licence, which limits some types of use (usually commercial, often everything except personal)
  • It’s published under an open licence, but we’ve not checked too deeply into whether we can do that
  • It’s free to use, so long as you do so within our app or application
  • There’s a restricted/limited access free version
  • There’s documentation on how it works
  • It was (or is) being made in public, with equal participation by anyone
  • It was (or is) being made in public, led by a consortium or group that has limitations on membership (even if just fees)
  • It was (or is) being made privately, but the results are then being made available publicly for you to use

I gather that at IODC “open washing” was a frequently referenced topic. It’s not surprising given the variety of ways in which the word “open” is used. Many of which are not open at all. And the list I’ve given above is hardly comprehensive. This is why the Open Definition is such an important reference. Even if it may have its faults.

Depending on your needs, any or all of those definitions might be fine. But “open” for you, may not be “open” for everyone. So let’s not lose sight of the goal and keep checking that we’re using that word correctly.

And, importantly, if we’re really making things open to make them better, then we might need to be more open to collaboration. Open isn’t entirely about licensing either.

 

101100

Today I am 101100.

That’s XLIV in Roman numerals.

44 is also the square root of 1936. 1936 was a leap year starting on a Wednesday.

The Year 44 was also a leap year starting on a Wednesday.

It was also known as the Year of the Consulship of Crispus and Taurus. Which is another coincidence because I like crisps and I’m also a Taurus.

And while we’re on Wikipedia, we can use the API to find out that page id 101100 is Sydney Harbour National Park which opened when I was 3.
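(If you want to check that for yourself, the lookup is a one-liner against the MediaWiki API; this is just a sketch using the standard action=query parameters.)

#Look up a Wikipedia page by its numeric page id
curl 'https://en.wikipedia.org/w/api.php?action=query&pageids=101100&format=json'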

Wolfram Alpha reminds me that 44 is the ASCII code for a comma.

Whichever way you look at it #101100 is a disappointing colour.

But if we use the random art generator then we can make a more colourful image from the number. But actually the image with that identifier is more interesting. Glitchy!

The binary number is also a car multimedia entertainment system. But £200 feels a bit steep, even if it is my birthday.

A 12 year old boy once bid £101,100 for a flooded Seat Toledo on EBay. Because reasons.

101100, or tubulin tyrosine ligase-like family, member 3 to its friends, also seems to do important things for mice.

I didn’t really enjoy Jamendo album 101100, the Jamez Anthony story.

Care of Cell Block 101100 was a bit better in my opinion. But only a bit.

Discogs release 101100 is The Sun’s Running Out by Perfume Tree. Of which the most notable thing is that track six includes a sample from a Dr Who episode.

I’m not really sure what the tag 101100 on flickr means.

IMDB entry 101100 is “Flesh ‘n’ Blood”.

The Board Game Geek identifier 101100 is for an XBox 360 version of 1 vs 100. That’s not even a board game!

Whereas Drive Thru RPG catalogues product 101100 as Battlemage. Which sounds much more interesting.

If I search for “101100 coordinates” on Google, then it tells me that it’s somewhere in China. I should probably know why.

There are 26 results for 101100 on data.gov.uk. But none on data.gov. Which explains why the UK is #1 in the world for open data.

But HD 101100 is also a star.

And a minor planet discovered on 14th September 1998.

CAS 101-10-0 is 2-(3-Chlorophenoxy)propionic acid. I think it’s a herbicide. Anyway, this is what it looks like.

It’s also a marine worm.

And an insect.

In the database of useful biological numbers, we discover that entry 101100 is the maximal emission wavelength for Venus fluorophore. Which is, of course, 528 nm.

I think the main thing I’ve learnt in my 44 years is that the web is an amazing place.

Data marketplaces, we hardly knew ye

I’m on a panel at the ODI lunchtime lecture this week, where I’m hoping to help answer the question of “what does a good data market look like?“.

As many of you know I was previously the product manager/lead for a data marketplace called Kasabi. That meant that I spent quite a bit of time exploring options for building both free and commercial services around data, business models for data supply, etc. At the time data marketplaces seemed to be “a thing”. See also this piece from 2011. There were suddenly a number of data marketplaces springing up from a variety of organisations.

The idea of data marketplaces, perhaps as an evolution of current data portals, is one that seems to be resurfacing. I’ve already written about why I think “data marketplace” isn’t the right framing for encouraging more collaboration around data, particularly in cities.

I’m not going to rehash that here, but, as preparation for Friday, I thought I’d take a look at how the various data marketplaces are faring. Here’s a quick run down.

If you think I’ve misrepresented anything then leave a comment and I’ll correct the post.

  • Data Market was originally focused on delivering data to businesses, and offered sophisticated charting and APIs, drawing largely on national and international statistics. Great platform and a really nice team (disclaimer: I have previously done some freelance work with them). They were acquired by Qlik. My understanding is that this rounded out their product offering by giving them an off-the-shelf platform for visualising on-demand data. This is no longer what I’d consider a marketplace, more a curated set of data feeds.
  • Azure Data Marketplace is still around but seems to be largely offering only Microsoft’s own data and APIs. Seems to be in the middle of a revamp and refocus on cloud apps and more general APIs rather than a marketplace. In its early stages Microsoft explored iterating this into an enterprise data portal as well as deeper integration with some of their products like SQL Server.
  • Kasabi. Shutdown. Sob.
  • BuzzData. Shutdown.
  • FreeBase. Acquired by Google, continued as a free service for a while and was shut down in 2015. The data is now part of Wikidata.
  • Infochimps. Originally a data marketplace, the team spent a lot of time building out a data processing pipeline using Big Data technologies. They were acquired for this technology.
  • Timetric started out as a data platform focusing on statistical and time series data, now seems to have evolved in a slightly different direction.
  • Factual continue to focus on location data. I was always intrigued by their approach which (at least originally) included businesses pooling their data together to create a richer resource, which was then used to drive additional revenue and sales. While there were suggestions they may expand into other sectors, that hasn’t happened.
  • Gnip and Datasift are still around, both still focusing on services and data analysis around social media data.

There are others that could be included in the list. There’s one interesting new contender that shares a lot of similarity with some things that we were building in Kasabi, but they’re currently in stealth mode so I won’t share more.

I also don’t include Amazon Public Datasets or Google Public Data as they’re not really marketplaces. They’re collections of large datasets that Amazon or Google are providing as an enabler or encouragement to use some of their cloud services. Difficult to demonstrate big data analysis unless there’s a nice collection of demo datasets.

So, really only the Microsoft offering is still around in its original form of a data marketplace, and it’s clear that the emphasis is shifting elsewhere. The other services that are still around are all focused on a specific vertical or business sector rather than offering a general purpose (“horizontal”) platform for the supply and selling of data.

This matches what we can see elsewhere: there are lots of businesses that have been selling data for some time. While the original emphasis was on the data, the move now is to sell services on top of it. But they’re all focused on a specific sector or vertical. I think cities are neither.

 

On accessibility of data

My third open data “parable”. You can read the first and second ones here. With apologies to Borges.

. . . In that Empire, the Art of Information attained such Perfection that the data of a single City occupied the entirety of a Spreadsheet, and the datasets of the Empire, the entirety of a Portal. In time, those Unconscionable Datasets no longer satisfied, and the Governance Guilds struck a Register of the Empire whose coverage was that of the Empire, and which coincided identifier for identifier with it. The following Governments, who were not so fond of the Openness of Data as their Forebears had been, saw that that vast register was Valuable, and not without some Pitilessness was it, that they delivered it up to the Voraciousness of Privatisation and Monopolies. In the Repositories of the Net, still today, there are Stale Copies of that Data, crowd-sourced by Startups and Citizens; in all the Commons there is no other Relic of the Disciplines of Transparency.

Sharon More, The data roads less travelled. London, 2058.

 

Loading the British National Bibliography into an RDF Database

This is the second in a series of posts (1, 2, 3, 4) providing background and tutorial material about the British National Bibliography. The tutorials were written as part of some freelance work I did for the British Library at the end of 2012. The material was used as input to creating the new documentation for their Linked Data platform but hasn’t been otherwise published. They are now published here with permission of the BL.

Note: while I’ve attempted to fix up these instructions to account for changes to the software and how the data is published, there may still be some errors. If there are then please leave a comment or drop me an email and I’ll endeavour to fix them.

The British National Bibliography (BNB) is a bibliographic database that contains data on a wide range of books and serial publications published in the UK and Ireland since the 1950s. The database is published under a public domain license and is available for access online or as a bulk download.

This tutorial provides developers with guidance on how to download the BNB data and load it into an RDF database, or “triple store” for local processing. The tutorial covers:

  • An overview of the different formats available
  • How to download the BNB data
  • Instructions for loading the data into two different open source triple stores

The instructions given in this tutorial are for users of Ubuntu. Where necessary pointers to instructions for other operating systems are provided. It is assumed that the reader is confident in downloading and installing software packages and working with the command-line.

Bulk Access to the BNB

While the BNB is available for online access as Linked Data and via a SPARQL endpoint, there are a number of reasons why working with the dataset locally might be useful, e.g.:

  • Analysis of the data might require custom indexing or processing
  • Using a local triple store might offer more performance or functionality
  • Re-publishing the dataset as part of aggregating data from a number of data providers
  • The full dataset provides additional data which is not included in the Linked Data.

To support these and other use cases the BNB is available for bulk download, allowing developers the flexibility to process the data in a variety of ways.

The BNB is actually available in two different packages. Both provide exports of the data in RDF but differ in both the file formats used and the structure of the data.

BNB Basic

The BNB Basic dataset is provided as an export in RDF/XML format. The individual files are available for download from the BL website.

This version provides the most basic export of the BNB data. Each record is mapped to a simple RDF/XML description that uses terms from several schemas including Dublin Core, SKOS, and Bibliographic Ontology.

As it provides a fairly raw version of the data, BNB Basic is likely to be most useful when the data is going to undergo further local conversion or analysis.
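For example, once the files have been downloaded (see below) you could convert them into N-Triples for easier line-oriented processing. This is only a sketch: it assumes the Raptor command-line tools (the rapper utility) are installed, and the file name is illustrative.

#Convert a BNB Basic RDF/XML file into N-Triples using rapper
#(adjust the file name to match the downloaded data)
rapper -i rdfxml -o ntriples bnb_basic_example.rdf > bnb_basic_example.nt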

Linked Open BNB

The Linked Open BNB offers a much more structured view of the BNB data.

This version of the BNB has been modelled according to Linked Data principles:

  • Every resource, e.g. author, book, category, has been given a unique URI
  • Data has been modelled using a wider range of standard vocabularies, including the Bibliographic Ontology, Event Ontology and FOAF.
  • Where possible the data has been linked to other datasets, including LCSH and Geonames

It is this version of the data that is used to provide both the SPARQL endpoint and the Linked Data views, e.g. of The Hobbit.

This package provides the best option for mirroring or aggregating the BNB data because its contents match those of the online versions. The additional structure of the dataset may also make it easier to work with in some cases. For example, lists of unique authors or locations can be easily extracted from the data.

Downloading The Data

Both the BNB Basic and Linked Open BNB are available for download from the BL website.

Each dataset is split over multiple zipped files. The BNB Basic is published in RDF/XML format, while the Linked Open BNB is published as N-Triples. The individual data files can be downloaded from CKAN, although this can be time consuming to do manually.
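If you would rather script the download than fetch each file by hand, wget can read a list of URLs from a file. This is only a sketch: urls.txt stands in for a file containing the download links collected from the dataset pages.

#Create the target directory and fetch every URL listed in urls.txt
mkdir -p ~/data/bl
wget -i urls.txt -P ~/data/bl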

The rest of this tutorial will assume that the packages have been downloaded to ~/data/bl.

Unpacking the files is a simple matter of unzipping them:

cd ~/data/bl
unzip \*.zip
#Remove original zip files
rm *.zip
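Each line of an N-Triples file holds a single triple, so a quick line count gives a rough sense of how much data is about to be loaded (assuming, as in the loader commands below, that the Linked Open BNB files match the BNB* pattern):

#Rough triple count for the unpacked Linked Open BNB files
wc -l ~/data/bl/BNB*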

The rest of this tutorial provides guidance on how to load and index the BNB data in two different open source triple stores.

Using the BNB with Fuseki

Apache Jena is an Open Source project that provides access to a number of tools and Java libraries for working with RDF data. One component of the project is the Fuseki SPARQL server.

Fuseki provides support for indexing and querying RDF data using the SPARQL protocol and query language.

The Fuseki documentation provides a full guide for installing and administering a local Fuseki server. The following sections provide a short tutorial on using Fuseki to work with the BNB data.

Installation

Firstly, if Java is not already installed then download the correct version for your operating system.

Once Java has been installed, download the latest binary distribution of Fuseki. At the time of writing this is Jena Fuseki 1.1.0.

The steps to download and unzip Fuseki are as follows:

#Make directory
mkdir -p ~/tools
cd ~/tools

#Download latest version using wget (or manually download)
wget http://www.apache.org/dist/jena/binaries/jena-fuseki-1.1.0-distribution.zip

#Unzip
unzip jena-fuseki-1.1.0-distribution.zip

Change the download URL and local path as required. Then ensure that the fuseki-server script is executable:

cd jena-fuseki-1.1.0
chmod +x fuseki-server

To test whether Fuseki is installed correctly, run the following (on Windows systems use fuseki-server.bat):

./fuseki-server --mem /ds

This will start Fuseki with an empty read-only in-memory database. Visiting http://localhost:3030/ in your browser should show the basic Fuseki server page. Use Ctrl-C to shut down the server once the installation test is completed.

Loading the BNB Data into Fuseki

While Fuseki provides an API for loading RDF data into a running instance, for bulk loading it is more efficient to index the data separately. The manually created indexes can then be deployed by a Fuseki instance.

Fuseki is bundled with the TDB triple store. The TDB data loader can be run as follows:

java -cp fuseki-server.jar tdb.tdbloader --loc /path/to/indexes file.nt

This command would create TDB indexes in the /path/to/indexes directory and load the file.nt into it.

To index all of the Linked Open BNB run the following command, adjusting paths as required:

java -Xms1024M -cp fuseki-server.jar tdb.tdbloader --loc ~/data/indexes/bluk-bnb ~/data/bl/BNB*

This will process each of the data files and may take several hours to complete depending on the hardware being used.

Once the loader has completed, the final step is to generate a statistics file for the TDB optimiser. Without this file SPARQL queries will be very slow. The file should be generated into a temporary location and then copied into the index directory:

java -Xms1024M -cp fuseki-server.jar tdb.stats --loc ~/data/indexes/bluk-bnb >/tmp/stats.opt
mv /tmp/stats.opt ~/data/indexes/bluk-bnb

Running Fuseki

Once the data load has completed, Fuseki can be started and instructed to use the indexes as follows:

./fuseki-server --loc ~/data/indexes/bluk-bnb /bluk-bnb

The --loc parameter instructs Fuseki to use the TDB indexes from a specific directory. The second parameter tells Fuseki where to mount the index in the web application. Using a mount point of /bluk-bnb the SPARQL endpoint for the dataset would then be found at:

http://localhost:3030/bluk-bnb/query
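The endpoint can also be tested from the command line with curl. As a quick sketch, the following query lists the distinct RDF types used in the data, which is a simple way to confirm that the load worked:

curl http://localhost:3030/bluk-bnb/query \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT DISTINCT ?type WHERE { ?s a ?type } LIMIT 50'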

To select the dataset and work with it in the admin interface visit the Fuseki control panel:

http://localhost:3030/control-panel.tpl

Fuseki has a basic SPARQL interface for testing out SPARQL queries, e.g. the following will return 10 triples from the data:

SELECT ?s ?p ?o WHERE {
  ?s ?p ?o
}
LIMIT 10

For more information on using and administering the server read the Fuseki documentation.

Using the BNB with 4Store

Like Fuseki, 4Store is an Open Source project that provides a SPARQL-based server for managing RDF data. 4Store is written in C and has been proven to scale to very large datasets across multiple systems. It offers a similar level of SPARQL support to Fuseki, so is a good alternative for working with RDF in a production setting.

As the 4Store download page explains, the project has been packaged for a number of different operating systems.

Installation

As 4Store is available as an Ubuntu package, installation is quite simple:

sudo apt-get install 4store

This will install a number of command-line tools for working with the 4Store server. 4Store works differently to Fuseki in that there are separate server processes for managing the data and serving the SPARQL interface.

The following command will create a 4Store database called bluk_bnb:

#ensure /var/lib/4store exists
sudo mkdir -p /var/lib/4store

sudo 4s-backend-setup bluk_bnb

By default 4Store puts all of its indexes in /var/lib/4store. In order to have more control over where the indexes are kept it is currently necessary to build 4store manually. The build configuration can be altered to instruct 4Store to use an alternate location.

Once a database has been created, start a 4Store backend to manage it:

sudo 4s-backend bluk_bnb

This process must be running before data can be imported into, or queried from, the database.

Once the database is running a SPARQL interface can then be started to provide access to its contents. The following command will start a SPARQL server on port 8000:

sudo 4s-httpd -p 8000 bluk_bnb

To check whether the server is running correctly visit:

http://localhost:8000/status/

It is not possible to run a bulk import into 4Store while the SPARQL process is running. So after confirming that 4Store is running successfully, kill the httpd process before continuing:

sudo pkill '^4s-httpd'

Loading the Data

4Store ships with a command-line tool for importing data called 4s-import. It can be used to perform bulk imports of data once the database process has been started.

To bulk import the Linked Open BNB, run the following command adjusting paths as necessary:

4s-import bluk_bnb --format ntriples ~/data/bl/BNB*

Once the import is complete, restart the SPARQL server:

sudo 4s-httpd -p 8000 bluk_bnb

Testing the Data Load

4Store offers a simple SPARQL form for submitting queries against a dataset. Assuming that the SPARQL server is running on port 8000 this can be found at:

http://localhost:8000/test/

Alternatively 4Store provides a command-line tool for submitting queries:

 4s-query bluk_bnb 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'

Summary

The BNB dataset is not just available for use as Linked Data or via a SPARQL endpoint. The underlying data can be downloaded for local analysis or indexing.

To support this type of usage the British Library have made available two versions of the BNB: a “basic” version that uses a simple record-oriented data model, and the “Linked Open BNB”, which offers a more structured dataset.

This tutorial has reviewed how to access both of these datasets and how to download and index the data using two different open source triple stores: Fuseki and 4Store.

The BNB data could also be processed in other ways, e.g. to load into a standard relational database or into a document store like CouchDB.

The basic version of the BNB offers a raw version of the data that supports this type of usage, while the richer Linked Data version supports a variety of aggregation and mirroring use cases.

Interesting Papers from CIDR 2009

CIDR 2009 looks like it was an interesting conference: there were a lot of good papers covering a whole range of data management and retrieval issues. The full list of papers can be browsed online, or downloaded as a zip file. There’s plenty of good stuff in there, ranging from the energy costs of data management, through forms of query analysis and computation on “big data”, to discussions of managing inconsistency in distributed systems.

Below I’ve pulled out a few of the papers that particularly caught my eye. You can find some other picks and summaries on the Data Beta blog: part 1 and part 2.

Requirements for Science Databases and SciDB, from Michael Stonebraker et al, presents the results of a requirements analysis covering the data management needs of scientific researchers in a number of different fields. Interestingly, it seems that for none of the fields covered, which include astronomy, oceanography, biology, genomics and chemistry, is a relational structure a good fit for the underlying data models used in data capture or analysis. In most cases an array-based system is most suitable, while for biology, chemistry and genomics in particular a graph database would be best; semantic web folk take note. The paper goes on to discuss the design of SciDB, which will be an open source array-based database suitable for use in a range of disciplines.

The Case for RodentStore, an Adaptive, Declarative Storage System, from Cudre-Mauroux et al, introduces RodentStore, an adaptive storage system that can be used at the heart of a number of different data management solutions. The system provides a declarative storage algebra that allows a logical schema to be mapped to a specific physical disk layout. This is interesting as it enables greater experimentation within the storage engine, exploring how different layouts may be used to optimise performance for specific applications and datasets. The system supports a range of different structures, including multi-dimensional data, and the authors note that it can be used to manage RDF data.

Principles for Inconsistency proposes some approaches for cleanly managing inconsistency in distributed applications, providing some useful additional context and implementation experience for those wrapping their heads around the notion of eventual consistency. I’m not sure that I’d follow all of these principles, mainly due to the implementation and/or storage overheads, but there’s a lot of good common sense here.

Harnessing the Deep Web: Present and Future, from Madhavan et al, describes some recent work at Google to explore how to begin surfacing “Deep Web” information and data into search indexes. The Deep Web is defined by them as pages that are currently hidden behind search forms and are not accessible to crawlers through other means. The work essentially involved discovering web forms, analysing existing pages from the same site in order to find candidate values to fill in the fields in those forms, then automatically submitting the forms and indexing the results. The authors describe how this approach can be used to help answer factual queries, and it is already in production at Google. This probably explains the factual answers that are appearing on search results pages. The approach is clearly in line with Google’s mission to do as much as possible with statistical analysis of document corpora; there’s very little synergy with other efforts going on elsewhere, e.g. linked data. There is reference to how understanding the semantics of forms, in particular the valid range of values for a field (e.g. a zip code) and co-dependencies between fields, could improve the results, but the authors also note that they’ve achieved a high level of accuracy with automated approaches to identifying common fields such as zip codes. A proposed further avenue for research is exploring whether the contents of an underlying relational database can be reconstituted through automated form submission and scraping of structured data from the resulting pages. Personally I think there are easier ways to achieve greater data publishing on the web! The authors reference some work on a search engine specifically for data surfaced in this way, called Web Tables, which I’ve not looked at yet.

DBMSs Should Talk Back Too, from Yannis Ioannidis and Alkis Simitsis, describes some work to explore how database query results, and queries themselves, can be turned into human-readable text (i.e. the reverse of a typical natural-language query system), arguing that this provides a good foundation for building more accessible data access mechanisms, as well as allowing easier summarisation of what a query is going to do in order to validate it against the user’s expectations. The conversion of queries to text was less interesting to me than the exploration of how to walk a logical data model to generate text. I’ve very briefly explored summarising data in FOAF files in order to generate an audible report using a text-to-speech engine, so it was interesting to see that the authors were using a graph-based representation of the data model to drive their engine. Class and relation labelling, with textual templates, are a key part of the system, and it seems much of this would work well against RDF datasets.

SocialScope: Enabling Information Discovery on Social Content Sites, from Amer-Yahia et al, is a broad paper that introduces SocialScope, a logical architecture for managing, analysing and presenting information derived from social content graphs. The paper introduces a logical algebra for describing operations on the social graph, e.g. producing recommendations based on analysis of a user’s social network; introduces a categorisation of the types of content present in the social graph and means for managing it; and also discusses some ways to present the results of searches against the content graph (e.g. for travel recommendations) using different facets and explanations of how recommendations are derived.