Monthly Archives: May 2012

Open Data for (Big) Kids?

This afternoon Emma Mulqueeny asked on twitter if anyone had any ideas about fun, exciting datasets to inspire kids new to Open Data hacking. I asked whether she was interested in downloadable datasets or just APIs, or both. The answer was both.

So below you’ll find a few suggestions from me about datasets that kids might find fun and interesting. It’s the kind of stuff my kids are interested in anyway. It’s also the kind of Open Data that excites me, so even if I’m off the mark there may be something in here for you big kids too.

Disclaimer: I make a few references to Kasabi here, which is a service that I’m involved in developing. There’s a getting started guide here.

While I think Kasabi is a great service, I reference it here because I’ve already put some of these datasets online there. I’m not trying to promote it, and I’ve referenced sources where available. For educational purposes I think it’d be good for kids to try working with data directly, to understand working with files as well as with APIs, and indeed with services like ScraperWiki.

Presumably the kids will be given some support & guidance on working with the data and API, so I’ll focus this posting on just pointers.

Lego

My first and best suggestion is Lego data! Bricklink hosts a community-maintained dataset about Lego parts, sets, and their inventories. The data isn’t clearly licensed, but having checked with them, they simply ask that you attribute your sources. Attribution is another good thing to teach kids about Open Data.

The Bricklink data files can be downloaded as XML files. You can get individual files with details of all sets, parts, etc. There are also files with inventory information.
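
If the kids want to poke at the raw files before touching any APIs, Python’s standard library is enough. The sketch below assumes a downloaded file called sets.xml with ITEM, ITEMNAME and YEAR elements; the real file and field names may differ, so treat it as a starting point rather than a recipe.

```python
# A minimal sketch of reading a downloaded Bricklink catalogue file with
# Python's standard library. The file name and element names here
# ("sets.xml", "ITEM", "ITEMNAME", "YEAR") are assumptions -- check the
# actual download to see how the XML is really structured.
import xml.etree.ElementTree as ET

tree = ET.parse("sets.xml")          # file saved from the Bricklink download page
for item in tree.getroot().iter("ITEM"):
    name = item.findtext("ITEMNAME")
    year = item.findtext("YEAR")
    print(name, year)
```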

I’ve taken a version of that data, turned it into Linked Data (RDF) and published it on Kasabi, so it’s accessible via a few APIs there.

The code to download all of the data and convert it to RDF is available on github. If you just want to cache the files locally run: rake download. That will save some clicking in the UI.

NASA

Another dataset I published on Kasabi is the NASA launch data. This is a conversion of the NSSDC Master Catalog which contains data about all satellite launches dating back to the 1950s.

Pokemon

There’s an online pokedex containing data about Pokemon on the Veekun website. The code for the Pokedex and the source data is up on github.

There are instructions on how to build a local database for the code, which relies on Python and PostgreSQL. However the data is all in the github project as CSV files, so it can easily be downloaded and processed using whatever tools you like.
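
For example, here’s a minimal Python sketch that reads one of those CSV files. The filename and column names are my assumptions about the project layout, so check the repository for the real ones.

```python
# A minimal sketch of loading one of the pokedex CSV files with Python's
# csv module. The file name ("pokemon.csv") and column names ("identifier",
# "height", "weight") are assumptions -- check the csv directory of the
# repository for the actual layout.
import csv

with open("pokemon.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["identifier"], row.get("height"), row.get("weight"))
```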

Related to this is Bulbapedia, a Pokemon wiki. It runs on MediaWiki, which offers an API, so there may be scope to get data that way, or to mash up data across these two sources. I’ve heard the MediaWiki API isn’t especially user-friendly, though.
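
If you do want to try the API route, here’s a rough sketch of the standard MediaWiki search call, assuming Bulbapedia exposes the usual /w/api.php endpoint (worth verifying, along with their terms of use, before hitting it in anger).

```python
# A rough sketch of calling the standard MediaWiki search API. The
# /w/api.php path is the MediaWiki default and an assumption about how
# Bulbapedia is set up; verify it (and the site's usage policy) first.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "action": "query",
    "list": "search",
    "srsearch": "Pikachu",
    "format": "json",
})
url = "https://bulbapedia.bulbagarden.net/w/api.php?" + params
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for result in data["query"]["search"]:
    print(result["title"])
```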

Finally, the Ultimate Pokemon Centre also offers a Pokedex. This one is oriented towards the video games and includes references to sound files of Pokemon cries.

As this post describes, there are ways to export the data from the pokedex as text files. For example you can download the names of all Pokemon in Pokemon Red. To build an export start at this page and select your options.

Games

Video game data tends to be locked down to review sites, but there are a few places you can get some good data for non-commercial uses.

I don’t have any personal experience of those APIs so can’t offer any guidance. It would be good to teach the kids about the limitations of particular data and API licensing terms here: e.g. how might they be limited by upstream providers?

Board games more your thing? What about the XML API and database export offered by BoardGameGeek?
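
If you want a feel for what the XML API returns, here’s a rough Python sketch of a search call. I’ve written the endpoint and element names from memory, so double-check them against the BoardGameGeek API documentation before relying on them.

```python
# A hedged sketch of calling the BoardGameGeek XML API from Python. The
# endpoint path and response structure below are from memory and should
# be checked against the current BGG API documentation.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

query = urllib.parse.urlencode({"query": "Carcassonne", "type": "boardgame"})
url = "https://boardgamegeek.com/xmlapi2/search?" + query
with urllib.request.urlopen(url) as resp:
    root = ET.parse(resp).getroot()

for item in root.findall("item"):
    name = item.find("name")
    print(item.get("id"), name.get("value") if name is not None else "?")
```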

TV

TV and Film sources tend to be similarly hampered by licensing terms, but TVdb offers an API to grab XML extracts of its database, e.g. details on TV series, episodes, etc.
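
Here’s a similarly hedged sketch against the TVdb API, using what I believe is the keyless series-search call. Most of the rest of the API needs a (free) API key, and the endpoint and element names here should be checked against the documentation.

```python
# A rough sketch against the TVdb XML API. This uses what I believe is the
# keyless series-search call; most other calls need a (free) API key, and
# the endpoint and element names should be verified against the docs.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

query = urllib.parse.urlencode({"seriesname": "Doctor Who"})
url = "http://thetvdb.com/api/GetSeries.php?" + query
with urllib.request.urlopen(url) as resp:
    root = ET.parse(resp).getroot()

for series in root.findall("Series"):
    print(series.findtext("SeriesName"), series.findtext("FirstAired"))
```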

BBC Wildlife

The BBC Wildlife site is a Linked Data site, so you can grab RDF data from any of its URLs. The data includes descriptions of the animals, references to images, and links to BBC content and clips. For example here’s a page about Tigers and here’s that same page as RDF.
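
If you want to pull that RDF into code, rdflib makes it a one-liner. The URL in this sketch is just my guess at the site’s URL pattern, so start from a real animal page and add .rdf (or content-negotiate for application/rdf+xml).

```python
# A minimal sketch of pulling RDF from a BBC Wildlife page with rdflib
# (pip install rdflib). The URL below is an assumption about the site's
# URL pattern -- start from any animal page on the site and add .rdf.
from rdflib import Graph

g = Graph()
g.parse("http://www.bbc.co.uk/nature/life/Tiger.rdf", format="xml")

print(len(g), "triples")
for s, p, o in g:
    print(s, p, o)
```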

I crawled the site last year to gather up the data. That’s available in Kasabi but is clearly out of date now. (No dinosaurs!)


Dr Who

The Guardian Datablog is a trove of interesting data snippets, two of which are about Dr Who.

The first is a spreadsheet that includes every time-travel event made by the Doctors. The data includes the episode, the Doctor, when and where the travel occurred, etc. It’s available as a Google Spreadsheet which can also be downloaded.

There’s also another spreadsheet which contains the name of every Dr Who villain since the 1960s. Again it includes the name of the Doctor and the episode, so the two datasets are ripe for mashing up.
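
As a flavour of that mashup, here’s a rough Python sketch that joins the two sheets once they’ve been exported as CSV. The filenames and column headings are placeholders, so check them against the actual downloads.

```python
# A sketch of mashing up the two Datablog spreadsheets, assuming each has
# been exported from Google Spreadsheets as CSV. The filenames and the
# "Episode"/"Villain" column names are hypothetical placeholders -- check
# the real headings in the downloaded files.
import csv
from collections import defaultdict

def load(path, key):
    rows = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rows[row[key].strip().lower()].append(row)
    return rows

travels = load("time_travel.csv", "Episode")
villains = load("villains.csv", "Episode")

for episode in sorted(set(travels) & set(villains)):
    print(episode, "->", [v.get("Villain") for v in villains[episode]])
```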

There are numerous Dr Who fan sites and wikis so there is likely to be some scope for linking out to various websites for images, reviews, etc.

Hopefully that’s enough to get some enthusiastic kids started on open data hacking.

Layered Data: A Paper & Some Commentary

Two years ago I wrote a short paper about “layering” data but for various reasons never got round to putting it online. The paper tried to capture some of my thinking at the time about the opportunities and approaches for publishing and aggregating data on the web. I’ve finally got around to uploading it and you can read it here.

I’ve made a couple of minor tweaks in a few places but I think it stands up well, even given the recent pace of change around data publishing and re-use. I still think the abstraction that it describes is not only useful but necessary to take us forward on the next wave of data publishing.

Rather than edit the paper to bring it completely up to date with recent changes, I thought I’d publish it as is and then write some additional notes and commentary in this blog post.

You’re probably best off reading the paper, then coming back to the notes here. The illustration referenced in the paper is also now up on slideshare.

RDF & Layering

I see that the RDF Working Group, prompted by Dan Brickley, is now exploring the term “layer”. I should acknowledge that I also heard the term in conjunction with RDF from Dan, but I’ve tried to explore the concept from a number of perspectives.

The RDF Working Group may well end up using the term “layer” to mean a “named graph”. I’m using the term much more loosely in my paper. In my view an entire dataset could be a layer, as well as some easily identifiable sub-set of it. My usage might therefore be closer to Pat Hayes’s concept of a “Surface”, but I’m not sure.

I think that RDF is still an important factor in achieving the goal I outlined of allowing domain experts to quickly assemble aggregates through a layering metaphor. Or, if not RDF, then I think it would need to be based around a graph model, ideally one with a strong notion of identity. I also think that mechanisms to encourage sharing of both schemas and annotations are also useful. It’d be possible to build such a system without RDF, but I’m not sure why you’d go to the effort.

User Experience

One of the things that appeals to me about the concept of layering is that there are some nice ways to create visualisation and interfaces to support the creation, management and exploration of layers. It’s not hard to see how, given some descriptive metadata for a collection of layers, you could create:

  • A drag-and-drop tool for creating and managing new composite layers
  • An inspection tool that would let you explore how the dataset for an application or visualisation has been constructed, e.g. to explore provenance or to support sharing and customization. Think “view source” for data aggregation.
  • A recommendation engine that suggested new useful layers that could be added to a composite, including some indication of what additional query options might become available

There’s been some useful work done on describing datasets within the Linked Data community: VoID and DCAT, for example. However, there’s not yet enough data routinely available about the structure and relationships of individual datasets, nor enough research into how to provide useful summaries.
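
To give a flavour of the kind of descriptive metadata I mean, here’s a small illustrative sketch that builds a VoID description with rdflib. The dataset URI and the statistics are invented for the example; which VoID properties you actually populate is up to you.

```python
# A minimal, illustrative VoID description of a dataset, built with rdflib.
# The dataset URI and the numbers are invented for the example; the VoID
# and Dublin Core terms themselves are real.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

VOID = Namespace("http://rdfs.org/ns/void#")
DCT = Namespace("http://purl.org/dc/terms/")

g = Graph()
ds = URIRef("http://example.org/datasets/lego")   # hypothetical dataset URI
g.add((ds, RDF.type, VOID.Dataset))
g.add((ds, DCT.title, Literal("Lego sets and parts")))
g.add((ds, VOID.triples, Literal(250000)))
g.add((ds, VOID.vocabulary, URIRef("http://purl.org/dc/terms/")))
g.add((ds, VOID.sparqlEndpoint, URIRef("http://example.org/sparql")))

print(g.serialize(format="turtle"))
```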

This is what prompted my work on an RDF Report Card to try and move the conversation forward beyond simply counting triples.

To start working with layers, we need to understand what each layer contains and how they relate to and complement one another.

Linked Data & Layers

In the paper I suggest that RDF & Linked Data alone aren’t enough and that we need systems, tools and vocabularies for capturing the required descriptive data and enabling the kinds of aggregation I envisage.

I also think that the Linked Data community is spending far too much effort on creating new identifiers for the same things and worrying how best to define equivalences.

I think the leap of faith that’s required, and that people like the BBC have already taken, is that we just need to get much more comfortable re-using other people’s identifiers and publishing annotations. Yes, there will be times when identifiers diverge, but there’s a lot to be gained, especially in terms of efficiency around data curation, from focusing on the value-added data rather than re-publishing yet another copy of a core set of facts.
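
To make that concrete, here’s a small illustrative sketch in which a core reference layer and someone else’s annotation layer describe the same identifier. Everything in it (the URIs, properties and values) is invented, but it shows why re-using identifiers makes combining layers a simple union of triples.

```python
# An illustrative sketch of "layering" by re-using identifiers: a core
# reference layer and a separate annotation layer both describe the same
# (invented) URI, so combining them is just a union of triples.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/ns#")          # hypothetical vocabulary
thing = URIRef("http://example.org/id/set/6080")  # hypothetical identifier

core = Graph()                                    # the reference layer
core.add((thing, RDFS.label, Literal("King's Castle")))
core.add((thing, EX.pieces, Literal(674)))

annotations = Graph()                             # someone else's added layer
annotations.add((thing, EX.review, Literal("A classic castle set.")))

combined = Graph()
for layer in (core, annotations):
    for triple in layer:
        combined.add(triple)

print(combined.serialize(format="turtle"))
```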

There are efficiency gains to be had from existing businesses, as well as faster routes to market for startups, if they can reliably build on some existing data. I suspect that there are also businesses that currently compete with one another — because they’re having to compile or re-compile the same core data assets — that could actually complement one another if they could instead focus on the data curation or collection tasks at which they excel.

Types of Data

In the paper I set out seven different facets which I think cover the majority of types of data that we routinely capture and publish. The classification could be debated, but I think it’s a reasonable first attempt.

The intention is to illustrate that we can usefully group together different types of data, and that organisations may be particularly good at creating or collecting particular types. There’s scope for organisations to focus on being really good in a particular area, and by avoiding needless competition around collecting and re-collecting the same core facts, there are almost certainly efficiency gains and cost savings to be had.

I’m sure there must be some prior work in this space, particularly around the core categories, so if anyone has pointers please share them.

There are also other ways to usefully categorise data. One area that springs to mind is how the data itself is collected, i.e. its provenance. E.g. is it collected automatically by sensors, or as a side-effect of user activity, or entered by hand by a human curator? Are those curators trained or are they self-selected contributors? Is the data derived from some form of statistical analysis?

I had toyed with provenance as a distinct facet, but I think it’s an orthogonal concern.

Layering & Big Data

A lot has happened in the last two years and I winced a bit at all of the Web 2.0 references in the paper. Remember that? If I were writing this now then the obvious trend to discuss as context to this approach is Big Data.

Chatting with Matt Biddulph recently he characterised a typical Big Data analysis as being based on “Activity Data” and “Reference Data”. Matt described reference data as being the core facts and information on top of which the activity data — e.g. from users of an application — is added. The analysis then draws on the combination to create some new insight, i.e. more data.

I referenced Matt’s characterisation in my Strata talk (with acknowledgement!). Currently Linked Data does really well in the Reference category, but there’s not a great deal of Activity data. So while it’s potentially useful in a Big Data world, there’s a lot of value still not being captured.

I think Matt’s view of the world chimes well with both the layered data concept and the data classifications that I’ve proposed. Most of the facets in the paper really define different types of Reference data. The outcome of a typical Big Data analysis is usually a new facet, an obvious one being “Comparative” data, e.g. identifying the most popular, most connected, most referenced resources in a network.
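
As a toy illustration of that framing, here’s a tiny Python sketch that derives a comparative facet (popularity) from reference and activity data. All of the data in it is invented.

```python
# A toy sketch of the Reference + Activity framing: combine reference data
# (what the things are) with activity data (what users did) to derive a
# new comparative facet (which things are most popular). All invented data.
from collections import Counter

reference = {                       # reference layer: id -> label
    "ep1": "An Unearthly Child",
    "ep2": "Blink",
    "ep3": "The Eleventh Hour",
}
activity = ["ep2", "ep2", "ep1", "ep3", "ep2", "ep1"]   # e.g. page views

popularity = Counter(activity)      # the derived, comparative layer
for episode_id, views in popularity.most_common():
    print(reference[episode_id], views)
```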

However there’s clearly a difference in approach between typical Big Data processing and the graph models that I think underpin a layered view of the world.

MapReduce workflows seem to work best with more regular data; however, newer approaches like Pregel illustrate the potential for “graph-native” Big Data analysis. But setting that aside, there’s no real contention here: a layering approach to combining data doesn’t say anything about how the data must actually be used, and it can easily be projected out into structures that are amenable to indexing and processing in different ways.

Kasabi

Looking at the last section of the paper it should be obvious that much of the origin of this analysis was early preparation for Kasabi.

I still think that there’s a great deal of potential to create a marketplace around data layers and tools for interacting with them. But we’re not there yet, for several reasons. Firstly, it’s taken time to get the underlying platform in place to support that; we’ve done that now, and you can expect more information from more official sources shortly. Secondly, I underestimated how much effort is still required to move the market forward: there’s still a lot to be done to support organisations in opening up data before we can really explore more horizontal marketplaces. But that is a topic for another post.

This has been quite a ramble of a blog post but hopefully there are some useful thoughts here that chime with your own experience. Let me know what you think.
