What is Derived Data?

A while ago I asked the question: “What is a Dataset?“. The idea was to look at how different data communities were using the term to see if there were any common themes. This week I’ve been considering how UPRNs can be a part of open data, a question made more difficult due to complex licensing issues.

One aspect of the discussion is the idea of “derived data“. Anyone who has worked with open data in the UK will have come across this term in relation to licensing of Ordnance Survey and Royal Mail data. But, as we’ll see shortly, the term is actually in wider use. I’ve realised though that like “dataset”, this is another term which hasn’t been well defined. So I thought I’d explore what definitions are available and whether we can bring any clarity.

I think there are several reasons why a clearer definition and understanding of what constitutes “derived data” would be useful:

  1. When using data published under different licences it’s important to understand the implications of reusing and mixing together datasets. While open data licences create few issues, mixing together open and shared data can create additional complexities due to non-open licensing terms. For further reading here see: “IPR and licensing issues in Derived Data” (Korn et al, 2007) and “Data as IP and Data License Agreements” (Practical Law, 2013).
  2. Understanding how data is derived is useful in understanding the provenance of a dataset and ensuring that sources are correctly attributed.
  3. In the EU, at least, there are many open questions relating to the creation of services that use multiple data sources. As a community we should be trying to answer these questions to identify best practices, even if ultimately they might only be resolved through a legal process.

On that basis: what is derived data?

Definitions of derived data from the statistics community

The OECD Glossary of Statistical Terms defines “derived data element” as:

A derived data element is a data element derived from other data elements using a mathematical, logical, or other type of transformation, e.g. arithmetic formula, composition, aggregation.

This same definition is used in the data.gov.uk glossary, which has some comments.

The OECD definition of “derived statistics” also provides some examples of derivation, e.g. creating population-per-square-mile statistics from primary observations (e.g. population counts, geographical areas).
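As a minimal sketch, the OECD’s population-per-square-mile example boils down to a simple function of two primary observations. The function name and figures below are illustrative, not taken from the OECD glossary:

```python
def population_density(population: int, area_sq_miles: float) -> float:
    """Derive a population-per-square-mile statistic from two primary
    observations: a population count and a geographical area."""
    return population / area_sq_miles

# e.g. an invented region of 500,000 people covering 250 square miles
density = population_density(500_000, 250.0)  # 2000.0 people per square mile
```

Note that the derived figure can’t, on its own, be reversed to recover either primary observation.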

Staying in the statistical domain, this britannica.com article on censuses explains that (emphasis added):

there are two broad types of resulting data: direct data, the answers to specific questions on the schedule; and derived data, the facts discovered by classifying and interrelating the answers to various questions. Direct information, in turn, is of two sorts: items such as name, address, and the like, used primarily to guide the enumeration process itself; and items such as birthplace, marital status, and occupation, used directly for the compilation of census tables. From the second class of direct data, derived information is obtained, such as total population, rural-urban distribution, and family composition

I think this clearly indicates the basic idea that derived data is obtained when you apply a process or transformation to one or more source datasets.

What this basic definition doesn’t address is whether there are any important differences between categories of data processing, e.g. does validating some data against a dataset yield derived data, or does the process have to be more transformative? We’ll come back to this later.

Legal definitions of derived data

The Open Database Licence (ODbL), which is now used by OpenStreetMap, defines a “Derivative Database” as:

…a database based upon the Database, and includes any translation, adaptation, arrangement, modification, or any other alteration of the Database or of a Substantial part of the Contents. This includes, but is not limited to, Extracting or Re-utilising the whole or a Substantial part of the Contents in a new Database.

This itemises some additional types of process, namely that extracting portions of a dataset also creates a derivative and not just transformation or statistical calculations.

However, as noted in the legal summary for the Creative Commons No Derivatives licence, simply changing the format of a work doesn’t create a derivative. So, in their opinion at least, this type of transformation doesn’t yield a derived work. The full legal code doesn’t use the term “derived data”, largely because the licences can be applied to a wide range of different types of work; instead it defines “Adapted Material”:

…material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor.

The Ordnance Survey User Derived Dataset Contract (copy provided by Owen Boswarva), which allows others to create products using OS data, defines “User Derived Datasets” as:

datasets which you have created or obtained containing in part only or utilising in whole or in part Licensed Data in their creation together with additional information not obtained from any Licensed Data which is a fundamental component of the purpose of your Product and/or Service.

The definition stresses that the datasets consist of some geographical data, e.g. points or polygons, plus some additional data elements.

The Ordnance Survey derived data exemptions documentation has this to say about derived data:

data and or hard copy information created by you using (to a greater or lesser degree) data products supplied and licensed by OS, see our Intellectual Property (IP) policy.

For the avoidance of doubt, if you make a direct copy of a product supplied by OS – that copy is not derived data.

Their licensing jargon page just defines the term as:

…any data that you create using Ordnance Survey mapping data as a source

Unfortunately none of these definitions really provide any useful detail, which is no doubt part of the problem that everyone has with understanding OS policy and licensing terms. As my recent post highlights, the OS do have some pretty clear ideas of when and how derived data is created.

The practice note on “Data as IP and Data License Agreements” published by Practical Law provides a great summary of a range of IP issues relating to data and includes a discussion of derived data. Interestingly they highlight that it may be useful to consider not just data generated by processing a dataset but other data that may be generated through the interactions of a data publisher (or service provider) and a data consumer. (See “Original versus Derived Data“, page 7).

This leads them to define the following cases for when derived data might be generated:

  • Processing the licensed data to create new data that is either:
    • sufficiently different from the original data that the original data cannot be identified from analysis, processing or reverse engineering the derived data; or
    • “a modification, enhancement, translation or other derivation of the original data but from which the original data may be traced.”
  • Monitoring the licensee’s use of a provider’s service (commonly referred to as usage data).

From a general intellectual property stance I can see why usage data should be included here, but I would suggest that this category of derived data is quite different to what is understood by the (open) data community.

What I find helpful about this summary is that it starts to bring some clarity around the different types of processes that yield derived data.

The best existing approach to this that I’ve seen can be found in: “Discussion draft: IPR, liability and other issues in regard to Derived Data“. The document aims to clarify, or at least start a discussion around, what is considered to be derived data in the geographical and spatial data domain. They identify a number of different examples, including:

  • Transforming the spatial projection of a dataset, e.g. to/from Mercator
  • Aggregating data about a region to summarise to an administrative area
  • Layering together different datasets
  • Inferring new geographical entities from existing features, e.g. road centre lines derived from road edges

In my opinion these types of illustrative examples are a much better way of trying to identify when and how derived data is created. For most re-users it’s easier to relate to an example than to a legal definition.

Another nice example is the OpenStreetMap guidance on what they consider to be “trivial transformations” which don’t trigger the creation of derived works.

An expanded definition of derived data

With the above in mind, can we create a better definition of derived data by focusing on the types of processes and transformations that are carried out?

Firstly I’d suggest that the following types of process do not create derived data:

  1. Using a dataset – stating the obvious really, but simply using a dataset doesn’t trigger creating a derivative. OpenStreetMap calls the result a “Produced Work”.
  2. Copying – again, I think this should be well understood, but I mention it for completeness. This is distribution, not derivation.
  3. Changing the format – E.g. converting a JSON file to XML. The information content remains the same, only the format is changed. This is supported by the Creative Commons definitions of remixing/reuse.
  4. Packaging (or repackaging) – E.g. taking a CSV file and re-publishing it as a data package. This would also include taking several CSV files from different publishers and creating a single data package from them. I believe this is best understood as a “Collected Work” or “Compilation” as the original datasets remain intact.
  5. Validation – checking whether field(s) in dataset A are correct according to field(s) in dataset B, so long as dataset A is not corrected as a result. This is a stance that OpenStreetMap seem to agree with.
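Item 3 in the list above can be illustrated with a short sketch: converting records between JSON and CSV changes the serialisation but not the information content. The sample records and UPRN values here are invented:

```python
import csv
import io
import json

# A small invented dataset serialised as JSON.
records_json = '[{"name": "Bath Abbey", "uprn": "10001"}, {"name": "Guildhall", "uprn": "10002"}]'

# Changing the format: re-serialise the same records as CSV.
records = json.loads(records_json)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "uprn"])
writer.writeheader()
writer.writerows(records)
csv_output = buffer.getvalue()

# The information content is identical; round-tripping back from CSV
# recovers the original records, so no derivative has been created.
round_tripped = list(csv.DictReader(io.StringIO(csv_output)))
assert round_tripped == records
```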


This leaves us with a number of other processes which do create derived data:

  1. Extracting – extracting portions of a dataset, e.g. extracting some fields from a CSV file.
  2. Restructuring – changing the schema or internal layout of a database, e.g. parsing out existing data to create new fields such as breaking down an address into its constituent parts
  3. Annotation – enhancing an existing dataset to include new fields, e.g. adding UPRNs to a dataset that contains addresses
  4. Summarising or Analysing – e.g. creating statistical summaries of fields in a dataset, such as the population statistics examples given by the OECD. Whether the original dataset can be reconstructed from the derived data will depend on the type of analysis being carried out, and how much of the original dataset is also included in the derived data.
  5. Correcting – validating dataset A against dataset B, and then correcting dataset A with data from dataset B where there are discrepancies
  6. Inferencing – applying reasoning, heuristics, etc. to generate entirely new data based on one or more datasets as input.
  7. Model Generation – I couldn’t think of a better name for this, but I’m thinking of scenarios such as sharing a neural network that has used some datasets as a training set. I think this is different to inferencing.
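To make a couple of these categories concrete, here is a minimal sketch (with invented property records) of extracting (1) and summarising (4):

```python
# Invented source dataset: hypothetical property records.
properties = [
    {"uprn": "1", "street": "High St", "band": "A", "occupants": 3},
    {"uprn": "2", "street": "High St", "band": "B", "occupants": 2},
    {"uprn": "3", "street": "Mill Ln", "band": "A", "occupants": 5},
]

# Extracting: keeping a subset of fields yields a derived dataset.
extract = [{"uprn": p["uprn"], "band": p["band"]} for p in properties]

# Summarising: aggregating occupants per street. The original records
# cannot be reconstructed from these totals alone.
summary = {}
for p in properties:
    summary[p["street"]] = summary.get(p["street"], 0) + p["occupants"]
# summary == {"High St": 5, "Mill Ln": 5}
```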

What do you think of this? Does it capture the main categories of deriving data? If you have comments on this then please let me know by leaving a comment here or pinging me on twitter.


How and when can UPRNs be a part of open data?

I’m trying to understand when and how UPRNs can be a part of open data, whether published by councils or other organisations. I’m writing down what I understand in the hope that others might find this useful or might be able to correct any misunderstandings by leaving a comment. It’d be great to get some official confirmation from Ordnance Survey and others too.

On the 16th February 2015 there was an announcement from Ordnance Survey that said:

Supporting the local government transparency and government open data agendas, Ordnance Survey, GeoPlace and the Improvement Service are enabling AddressBase internal business use customers to release Unique Property Reference Numbers (UPRNs) on a royalty free and open basis. The move will facilitate the release and sharing of public and private sector addressing databases.

The announcement notes that this brings UPRNs into line with the terms that apply to reuse of the TOID identifier.

The relevant background documents are the following. Use these as your primary guidance:

The first policy statement is the significant document. It’s been updated to clarify some elements and has some example permitted and non-permitted uses which we’ll explore below.

What follows is my understanding of those documents and consideration of some additional scenarios of how and when UPRNs can be released as open data.

If any of the below is wrong, please leave a comment!


A log of updates and clarifications to this post:

  • 3/9/2015 – Added notes and comments about the ONS National Statistics Address Lookup dataset

Who can publish UPRNs in open data?

Current AddressBase licensees (specifically “Internal Business Use customers”) can publish data containing UPRNs without the need to place any restrictions on their downstream use. This only applies to the UPRN identifiers themselves as there are some provisos around what data can be published.

Anyone who obtains a UPRN from an open dataset can also use those identifiers to publish additional open data, e.g. by annotating a dataset with additional information, so long as they obey any licensing requirements from their source datasets.

This is perhaps best understood as distribution rather than publication though because the UPRNs must have previously been available in, and obtained from, an open dataset.

If you’re not an AddressBase customer, then you can’t publish new datasets containing previously unpublished UPRNs.

Note: the policy document refers only to AddressBase licensees. It doesn’t state a specific AddressBase product that must be licensed (there are several). It also doesn’t really define customers beyond that although the public sector presumption to publish does.

Who can use UPRNs in open data?

If you obtain a UPRN from an open dataset, then you can use it without incurring any additional licensing restrictions beyond what is stated in the licence for that dataset.

So, if a local authority publishes some data containing UPRNs under the OGL, then you can use it for commercial and non-commercial purposes so long as you attribute your sources.

What licence can be used when publishing UPRNs in open data?

Any open licence can be used to publish the UPRN identifiers.

However there may be additional licensing restrictions that must be applied to the dataset if:

  • the dataset contains additional OS data
  • the dataset was constructed by using or referencing the geographical co-ordinates of the UPRNs

The first restriction seems obvious: if you include non-open data then this will impact your licensing options. The second is less clear: depending on how you constructed the dataset, you may not be able to publish it openly.

The specific wording is that UPRNs can only be published on a royalty-free basis, and with the option to sub-licence if:

licensees have not extracted UPRNs by using or making reference to the coordinates within AddressBase products data

Let’s refer to that restriction as the “spatial reference restriction”. Examples are essential to help clarify where and when it applies.

Additional public sector permissions

It’s also worth highlighting that the presumption to publish document notes that for public sector customers of AddressBase the OS

…will permit the release of the OS x,y co-ordinates for your public sector assets, together with the UPRN, such that members are able to release datasets required to meet the requirements of the Local Government Transparency Code.

This means that as long as the presumption to publish process is followed, it should be possible for local authorities to publish both UPRNs and their co-ordinates in derived datasets that aren’t substantial extracts of the source data.

However this is expanded on in the OS licensing guidance which emphasises that the permission applies to public sector assets only, and specifically those datasets within the Local Government Transparency Code. There’s also a note that:

This is subject to the member having permission from Royal Mail in relation to the release of any data derived from PAF

Which piles caveats upon caveats.

It’s not immediately clear to me if the spatial reference restriction is also intended to apply here or whether this is separate special dispensation for public sector customers. I’m assuming the latter, but it would be useful to have some confirmation.

Worked Examples

Let’s work through a couple of examples of where UPRNs might be published as part of an open data release. The UPRN policy statement includes several examples which we’ll build on here.

Companies House address matching

This is the first permitted example in the policy statement:

A third party takes an open address dataset, such as the Free Company Data Product from Companies House, and matches the data contained within against one of the AddressBase products using non-spatial methods. It then appends the UPRN from the AddressBase products to this address data.

Emphasis is mine.

In this example a local authority could take the list of registered companies in its local area, match the addresses against AddressBase and publish a local extract that has been annotated with the UPRN. This could be published under the OGL (we’ll ignore the unclear licensing of Companies House data for now!).

The matching of addresses has to be done using non-spatial methods which means using text matching of the address components.
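As a rough sketch, a non-spatial match compares normalised address text and never touches co-ordinates. The normalisation rules, sample address, and UPRN value below are all invented for illustration, not taken from AddressBase or the OS terms:

```python
def normalise(address: str) -> str:
    """Crude address normalisation for text matching: replace punctuation
    with spaces, lower-case, and collapse whitespace."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in address)
    return " ".join(cleaned.lower().split())

# Hypothetical lookup table from normalised address text to UPRN.
address_to_uprn = {
    normalise("1 High Street, Bath, BA1 1AA"): "100000000001",
}

def match_uprn(address: str):
    """Non-spatial matching: compare address text only, never co-ordinates."""
    return address_to_uprn.get(normalise(address))

match_uprn("1 HIGH STREET, BATH BA1 1AA")  # matches despite formatting differences
```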

Food hygiene rating location matching

The FSA food hygiene rating open data includes the addresses and X,Y co-ordinates of places that have been assigned a food hygiene rating. Could a local authority do the same thing to this dataset as in the previous example? E.g. publishing a local subset enriched with the UPRN?

The answer seems to be:

  1. No – if the data is matched based on comparing the X,Y co-ordinates in the hygiene data to AddressBase, e.g. to find the nearest property. The spatial reference restriction doesn’t allow this.
  2. Yes – if the data is matched using the address fields only.

The end result will be exactly the same dataset, but only one approach seems to be valid, as using the X,Y co-ordinates is a spatial method.
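By contrast, a spatial method uses the co-ordinates themselves. A minimal sketch of the prohibited nearest-point approach (with invented easting/northing values and UPRNs) might look like this:

```python
import math

# Hypothetical co-ordinate-to-UPRN records standing in for AddressBase data.
coords_to_uprn = {
    (375100.0, 164200.0): "100000000001",
    (375250.0, 164350.0): "100000000002",
}

def nearest_uprn(easting: float, northing: float) -> str:
    """Spatial matching: find the UPRN whose point is closest to the given
    co-ordinates. Using co-ordinates like this appears to be exactly what
    the spatial reference restriction prohibits."""
    point = (easting, northing)
    nearest = min(coords_to_uprn, key=lambda pt: math.dist(pt, point))
    return coords_to_uprn[nearest]

nearest_uprn(375110.0, 164210.0)  # "100000000001"
```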

Unfortunately the OS terms don’t define what constitutes a spatial (or non-spatial) method. Using a distance calculation as suggested in this example seems like it’s definitely a spatial method. But it’s not clear, for example, whether finding addresses within a location, e.g. a postcode or administrative area, counts as a spatial method.

In fact, given that AddressBase is essentially just a list of addresses and locations, it’s hard to think of examples other than just address matching where it would be possible to extract UPRNs.

Local authority land and building assets

The local government transparency code (p15-16) requires local authorities to publish a list of their land and building assets. This includes the UPRN and full address of all properties.

This is expressly allowed by the “presumption to publish” process, so the authority can do this without requiring additional permission. The authority could use a spatial query in AddressBase to find and extract all of the necessary data and publish it under the OGL.

Note: if AddressBase contained an indicator of whether a property was owned by the public sector, it wouldn’t be permissible for a non-public sector licensee to publish exactly the same dataset as above along with the co-ordinates. The spatial reference restriction would apply, so using a spatial query to extract the data would not be allowed.

Local government incentive scheme

The local government incentive scheme datasets include planning applications, public toilets, and premises licences. Many of the local authorities in the UK are publishing these datasets against a standard schema. All of the schemas have been defined to include addresses, co-ordinates and UPRNs.

To meet the terms of the incentive scheme the datasets are published as open data. Currently the UPRNs are often not populated, except for public toilets, which have been given an exemption by the OS. This is mentioned in the schema guidance but I’ve not found a better link for it and it’s not listed here.

So, can a local authority update the planning and licensing datasets to include UPRNs? Yes, I think so.

Assuming that each planning and licensing application is matched to its UPRN via the address then everything should be fine. This is a “non-spatial” method and is essentially the same as the Companies House example.

However because these datasets are not part of the transparency code, I don’t think the local authority could include the X,Y co-ordinates of the UPRN without permission from the OS.

Bin and recycling collection routes

This is the example that triggered me looking into this issue again. I wanted to know: could we publish a list of UPRNs in Bath along with the identifier of the bin collection route they are on and which day of the week the bins are collected.

In order to tell someone when their bin or recycling will be collected you need to know what bin collection route they are on. And different sides of the street may be covered by different routes, so you can’t just publish a list of which roads are covered by which routes; you need to know which addresses each route covers.
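To see why this is inherently spatial: assigning a property to a route effectively means testing which route area its co-ordinates fall inside. A minimal point-in-polygon sketch (with invented co-ordinates) illustrates the kind of query involved:

```python
def point_in_polygon(x: float, y: float, polygon: list) -> bool:
    """Ray-casting test: is the point (x, y) inside the polygon? Assigning
    each property to the route polygon containing it is a spatial query
    on its co-ordinates."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Count edge crossings of a horizontal ray from (x, y).
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

# Invented route boundary and property co-ordinates for illustration.
route_area = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
point_in_polygon(5.0, 5.0, route_area)   # True: this property is on the route
point_in_polygon(15.0, 5.0, route_area)  # False: outside the route area
```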

Unfortunately because you need a spatial query to do this, the spatial reference restriction applies. This means you can’t publish that dataset with UPRNs. I also don’t think you can publish it by substituting UPRNs for the textual addresses as that would amount to publishing a significant extract of PAF, basically all properties in the local area.

So this type of service data can’t be published as open data currently. Only local authorities can build services that know when and where recycling and bin collection services are available.

How do the UPRN terms compare with TOIDs?

The basic terms of use for UPRNs and TOIDs are broadly similar. However the key difference is that for TOIDs there is no equivalent of the spatial reference restriction: if you were licensed to use the data, or have access to them as open data, then there are no additional restrictions.

Although the “OS OpenData™ TOID look-up service” mentioned in the terms, originally available at http://opentoids.ordnancesurvey.co.uk/toidservice/, no longer exists, TOIDs can be found in the various OS open data products, so it’s easy to look them up.

That’s not true for addresses and UPRNs.

Does the ONS NSAL dataset make UPRNs open data?

Commenting on the first version of this post, Owen Boswarva wondered whether the ONS National Statistics Address Lookup (NSAL) means that UPRNs are now open data.

The NSAL dataset is described in this blog post. It’s a list of UPRNs mapped to various administrative regions. This allows for easy reporting and recasting of statistics by different geographies. The blog post explains that the changes to the UPRN policy were encouraged to help support the release of this dataset, which is published under the OGL. This means that there is already a complete list of UPRNs published under an open licence.
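The recasting the NSAL enables can be sketched as a simple join-and-aggregate: statistics recorded against UPRNs are rolled up to administrative areas via the lookup. The ward names and counts below are invented:

```python
from collections import Counter

# Invented fragment of an NSAL-style lookup: UPRN -> administrative area.
uprn_to_ward = {"1": "Abbey", "2": "Abbey", "3": "Kingsmead"}

# A statistic recorded against UPRNs, e.g. households receiving a service.
uprn_counts = {"1": 2, "2": 1, "3": 4}

# Recast the UPRN-level statistic by ward using the lookup.
by_ward = Counter()
for uprn, count in uprn_counts.items():
    by_ward[uprn_to_ward[uprn]] += count

dict(by_ward)  # {"Abbey": 3, "Kingsmead": 4}
```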

So does this mean that UPRNs are open data? Clearly the full list of UPRN identifiers is now available under an open licence from the ONS. So the answer could be a qualified yes. However as the ONS explain in their version notes (copy here), the dataset may be out of date with the authoritative copy in AddressBase, so isn’t necessarily definitive.

There’s also none of the accompanying metadata that I’d expect to see if the UPRN identifier scheme was fully published as open data, e.g. administrative metadata around when UPRNs are added or removed, relationships between UPRNs, and perhaps the address data.

While the NSAL dataset itself is excellent, helping to solve problems with mapping between the various local geographies, it doesn’t provide any additional utility beyond giving us a reasonably up-to-date count of how many UPRNs there are. It doesn’t help us publish more open data that includes UPRNs, or help us annotate existing datasets with UPRNs; for that you still need the address and co-ordinate information held in AddressBase.


UPRNs are not open data, but they can be included in some open datasets. There are some very specific cases where UPRNs could usefully be added to both existing and new open data sets.

However there are some subtleties in understanding what is allowed, covering both who is publishing the data and how the dataset is constructed.

Hopefully this post has shed some light onto the issues that might help open data publishers and, importantly, local authorities in understanding what can and can’t be done.

I’ll update this post to make corrections as and when necessary. Please leave a comment if you have an issue with any of my reasoning. Also, please comment if you have additional examples of permitted or non-permitted publication.

Data and information in the city

For a while now I’ve been in the habit of looking for data as I travel to work or around Bath. You can’t really work with data and information systems for any length of time without becoming a little bit obsessive about numbers or becoming tuned into interesting little dashboards:

My eye gets drawn to gauges and displays on devices as I’m curious about not just what they’re showing but also for whom the information is intended.

I can also tell you that for at least ten years, perhaps longer, the electronic signs on some of the buses running the Number 10 route in Bath have been buggy. Instead of displaying “10 Southdown” they read “(ode1fsOs1ss1 10sit2 Southdown)” with a flashing “s” in “sit”.

Yes. I wrote it down. I was curious about whether it was some misplaced control codes, but I couldn’t find a reference.

Having spent so long working on data integration and with technologies like Linked Data, I’m also curious about how people assign identifiers to things. A lot of what I’ve learnt about that went into writing this paper, which is a piece of work of which I’m very proud. It’s become an ingrained habit to look out for identifiers wherever I can find them.  It’s not escaped me that this is pretty close to train spotting, btw!

I’ve also recently started contributing to Bath: Hacked, which is Bath’s community-led open data project. It’s led me to pay even closer attention to the information around me in Bath, as it might turn up some useful data that could be published or indicate the potential for a useful digital service.

So to try and direct my “data magpie” habits into a more productive direction, I’ve started on a small project to photograph some of the information and data I find as I walk around the city. There are signs, information and data all around us but we don’t often really notice it or we just take the information for granted. I decided to try to catalogue some of the ways in which we might encounter data around Bath and, by extension, in other cities.

The entire set of photos is available on Flickr if you care to take a look. Think of it as a natural history study of data.

In the rest of this post I wanted to explore a few things that have occurred to me along the way. Areas where we can glimpse the digital environment and data infrastructure that is increasingly supporting the physical environment. And the ways in which data might be intentionally or incidentally shared with others.

Data as dark matter

For most people data is ephemeral stuff. It’s not something they tend to think about even though it’s being collected and recorded all around us. While there’s increasing awareness of how our personal data is collected and used by social networks and other services, there’s often little understanding of what data might be available about the built environment.

But you can see evidence of that data all around us. Data is a bit like dark matter: we often only know it exists based on its effects on other things which we more clearly understand. Once you start looking you can see identifiers everywhere:

Bridge identifiers

If something has an identifier then there will be data associated with it, creating a record that describes that object. As there is very likely to be a collection of those things then we can infer that there’s a database containing many similar records.

Once you start looking you can see databases everywhere: of lampposts, parking spaces, bins, and the monoliths that sit in our streets but which we rarely think about:

Traffic light control box

Once you realise all of these databases exist it’s natural to start asking questions such as how that information is collected, who is responsible for it, and when might it be useful?  There are databases everywhere and people are employed to look after them.

The bus driver’s role in data governance

Live bus times

I was looking forward to the installation of the Real Time Information sign at the bus stop (0180BAC30294) near my house. For a few years now I’ve been regularly taking a photo of the paper sign on the stop. Looking at that on my phone is still much quicker than using any of the online services or apps. A real time data feed was going to solve that. Only it didn’t. It’s made things worse:

My morning bus, the one that begins my commute to the Open Data Institute, is often not listed. I’ve had several morning conversations with Travelwest about it. Although, evoking Hello Lamppost, it feels like I’ve been arguing with the bus sign itself and would like to leave a note to others to say that, actually yes the Number 10 is really on its way.

I’m suddenly concerned that they may do away with that helpful paper sign. The real-time information feed exposes problems with data management that wouldn’t otherwise be evident. Real-time doesn’t necessarily always mean better.

Interestingly Travelwest have an FAQ that lists a number of reasons why some buses won’t appear on the RTI system. This includes the expected range of data and hardware problems, but also: “The bus driver has logged on to the ETM incorrectly, preventing the journey operated being ‘matched’ by the central RTI system“.

So it turns out that bus drivers have a key role in the data governance of this particular dataset. They’re not just responsible for getting the bus from A to B but also for ensuring that passengers know that it’s on its way. I wonder if that’s part of their induction?

The paperless city

There are more obvious signs of business processes that we can see around a city. These are stages in processes that require some public notice or engagement, such as planning applications or other “rights to object” to planned works:

Pole Objection Notice

In other cases the information is presented as an indication that a process has been completed successfully, such as gaining a premises licence, liability insurance or an energy rating certificate. If this information is being put on physical display then it’s natural to wonder whether there are digital versions that could or should be made available.

Also, in the majority of cases, making this information available digitally would probably be much better. There are certainly opportunities to create better digital services to help engage people in these processes. But in order to be inclusive I suspect paper-based approaches are going to be around for a while.

What would a digital public service look like that provided this type of city information, both on-demand and as notifications, to residents? The information might already be available on council websites, but you have to know that it’s there and then how to find it.

Visible to the public, but not for the public

Interestingly, not all of the information we can find around the city is intended for wider public consumption. It may be published into a public space but it might only be intended for a particular group of people, or useful at a particular point in time, e.g. during an emergency such as this map of fire sensors.

Fire hydrant

Most of the identifier examples I referred to above fall into this category. Only a small number of people need to know the identifier for a specific bin, traffic light control box, or bridge.

It also means that information may often be provided without context, because the intended audience knows how to read it or has the tools required to use it to unlock more information. To properly interpret it you have to be able to understand the visual code used in these organisational hobo signs.

The importance of notice boards

For me there’s something powerful in the juxtaposition of these two examples:

Community notice board

Dynamic display board

The first is a community notice board. Anyone can come along and not only read it but also add to the available information. It’s a piece of community owned and operated information infrastructure. This manually updated map of the local farmers market is another nice example, as are the walls of flyers and event notices at the local library.

The second example is a sealed unit. It’s owned and operated by a single organisation who gets to choose what information is displayed. Community annotations aren’t possible. There’s no scope to add notices or graffiti to appropriate the structure for other purposes – something that you see everywhere else in the city. This is increasingly hard to do with digital infrastructures.

In my opinion a truly open city will include both types of digital and physical infrastructure. I dislike the top-down view of the smart city and prefer the vision of creating an open, annotatable data infrastructure for residents and local businesses to share information.

Useful perspective

In this rambling post I’ve tried to capture some of the thoughts that have occurred to me whilst taking a more critical look at how data and information is published in our cities. I’ve really only scratched the surface, but it’s been fun to take a step back and look at Bath with a slightly more critical eye.

I think it’s interesting to see how data leaks into the physical environment, either intentionally or otherwise. Using environments that people are familiar with might also be a useful way to get a wider audience thinking about the data that helps our society function, and how it is owned and operated.

It’s also interesting to consider how a world of increasingly connected devices and real-time information is going to impact this environment. Will all of this information move onto our phones, watches or glasses and out of the physical infrastructure? Or are we going to end up with lots more cryptic icons and identifiers on all kinds of bits of infrastructure?


“The scribe and the djinn’s agreement”, an open data parable

In a time long past, in a land far away, there was once a great city. It was the greatest city in the land and the vast marketplace at its centre was the busiest, liveliest marketplace in the world. People of all nations could be found there buying and selling their wares. Indeed, the marketplace was so large that people would spend days, even weeks, exploring its length and breadth and would still discover new stalls selling a myriad of items.

A frequent visitor to the marketplace was a woman known only as the Scribe. While the Scribe was often found roaming the marketplace even she did not know of all of the merchants to be found within its confines. Yet she spent many a day helping others to find their way to the stalls they were seeking, and was happy to do so.

One day, as a gift for providing useful guidance, a mysterious stranger gave the Scribe a gift: a small magical lamp. Upon rubbing the lamp a djinn appeared before the surprised Scribe and offered her a single wish.

“Oh venerable djinn” cried the Scribe, “grant me the power to help anyone that comes to this marketplace. I wish to help anyone who needs it to find their way to whatever they desire”.

With a sneer the djinn replied: “I will grant your wish. But know this: your new found power shall come with limits. For I am a capricious spirit who resents his confinement in this lamp”. And with a flash and a roll of thunder, the magic was completed. And in the hands of the Scribe appeared the Book.

The Book contained the name and location of every merchant in the marketplace. From that day forward, by reading from the Book, the Scribe was able to help anyone who needed assistance to find whatever they needed.

After several weeks of wandering the market, happily helping those in need, the Scribe was alarmed to discover that she was confronted by a long, long line of people.

“What is happening?” she asked of the person at the head of the queue.

“It is now widely known that no-one should come to the Market without consulting the Scribe” said the man, bowing. “Could you direct me to the nearest merchant selling the finest silks and tapestries?”

And from that point forward the Scribe was faced with a never-ending stream of people asking for help. Tired and worn and no longer able to enjoy wandering the marketplace as had been her whim, she was now confined to its gates. Directing all who entered, night and day.

After some time, a young man took pity on the Scribe, pushing his way to the front of the queue. “Tell me where all of the spice merchants are to be found in the market, and then I shall share this with others!”

But no sooner had he said this than the djinn appeared in a puff of smoke: “NO! I forbid it!”. With a wave of its arm the Scribe was struck dumb until the young man departed. With a smirk the djinn disappeared.

Several days passed and a group of people arrived at the head of the queue of petitioners.

“We too are scribes.” they said. “We come from a neighbouring town having heard of your plight. Our plan is to copy out your Book so that we might share your burden and help these people”.

But whilst a spark of hope was still flaring in the heart of the Scribe, the djinn appeared once again. “NO! I forbid this too! Begone!” And with a scream and a flash of light the scribes vanished. Looking smug, the djinn disappeared.

Some time passed before a troupe of performers approached the Scribe. As a chorus they cried: “Look yonder at our stage, and the many people gathered before it. By taking turns reading from the Book, in front of a wide audience, we can easily share your burden”.

But shaking her head the Scribe could only turn away whilst the djinn visited ruin upon the troupe. “No more” she whispered sadly.

And so, for many years the Scribe remained as she had been, imprisoned within the subtle trap of the djinn of the lamp. Until, one day, a traveller appeared in the market. Upon reaching the head of the endless line of petitioners, the man asked of the Scribe:

“Where should you go to rid yourself of the evil djinn?”.

Surprised, and with sudden hope, the Scribe turned the pages of her Book…

Open data and diabetes

In December my daughter was diagnosed with Type 1 diabetes. It was a pretty rough time. Symptoms can start and escalate very quickly. Hyperglycaemia and ketoacidosis are no joke.

But luckily we have one of the best health services in the world. We’ve had amazing care, help and support. And, while we’re only 4 months into dealing with a life-long condition, we’re all doing well.

Diabetes sucks though.

I’m writing this post to reflect a little on the journey we’ve been on over the last few months from a professional rather than a personal perspective. Basically, the first weeks of becoming a diabetic, or the parent of a diabetic, are a crash course in physiology, nutrition, and medical monitoring. You have to adapt to new routines for blood glucose monitoring, learn to give injections (and teach your child to do them), become good at book-keeping, plan for exercise, and remember to keep needles, lancets, monitors, emergency glucose and insulin with you at all times, whilst ensuring prescriptions are regularly filled.

Oh, and there’s a stupid amount of maths, because you’ll need to start calculating how many carbohydrates are in all of your meals and inject accordingly. No meal unless you do your sums.
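As a rough illustration of the sums involved (with entirely made-up numbers: an insulin-to-carb ratio is personal and set by a diabetes team, so none of this is medical advice):

```ruby
# Illustrative only: insulin-to-carb ratios are personal and are set by a
# diabetes team. A ratio of 1:10 means 1 unit of insulin covers 10g of carbs.
def insulin_dose(carbs_in_grams, grams_per_unit)
  # Round to the nearest half unit, the smallest step on many insulin pens
  (carbs_in_grams / grams_per_unit * 2).round / 2.0
end

# A meal containing 45g of carbohydrate, with a hypothetical 1:10 ratio
puts insulin_dose(45.0, 10.0)
```

Which is why knowing the carbohydrate content of everything on the plate matters so much.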

Good job we had that really great health service to support us (there’s data to prove it). And an amazing daughter who has taken it all in her stride.

Diabetics live a quantified life. Tightly regulating blood glucose levels means knowing exactly what you’re eating, and learning how your body reacts to different foods and levels of exercise. For example, we’ve learnt the different ways that a regular school day versus school holidays affects my daughter’s metabolism. That we need to treat ahead for the hypoglycaemia that follows a few hours after some fun on the trampoline. And that certain foods (cereals, risotto) seem to affect insulin uptake.

So to manage the condition we need to know how many carbohydrates are in:

  • any pre-packaged food my daughter eats
  • any ingredients we use when cooking, so we can calculate a total portion size
  • in any snack or meal that we eat out

Food labelling is pretty good these days so the basic information is generally available. But it’s not always available on menus or in an easy to use format.

The book and app that diabetes teams recommend is called Carbs and Cals. I was a little horrified by it initially as it’s just a big picture book of different portion sizes of food. You’re encouraged to judge everything by eye or weight. It seemed imprecise to me, but with hindsight it’s perfectly suited to those early stages of learning to live with diabetes. No hunting over packets to get the data you need: just look at a picture, a useful visualisation. Simple is best when you’re overwhelmed with so many other things.

Having tried calorie counting I wanted to try an app to more easily track foods and calculate recipes. My Fitness Pal, for example, is pretty easy to use and does bar-code scanning of many foods. There are others that are more directly targeted at diabetics.

The problem is that, as I’ve learnt from my calorie counting experiments, the data isn’t always accurate. Many apps fill their databases through crowd-sourcing. But recipes and portion sizes change continually. And people make mistakes when they enter data, or enter just the bits they’re interested in. Look up any food on My Fitness Pal and you’ll find many duplicate entries. It makes me distrust the data because I’m concerned it’s not reliable. So for now we’re still reading packets.

Eating out is another adventure. There have been recent legislative changes to require restaurants to make more nutritional information available. If you search you may find information on a company website and can plan ahead. Sometimes it’s only available if you contact customer support. If you ask in a (chain) restaurant they may have it available in a ring-binder you can consult with the menu. This doesn’t make a great experience for anyone. Recently we’ve been told in a restaurant to just check online for the data (when we know it doesn’t exist), because they didn’t want to risk any liability by providing information directly. On another occasion we found that certain dishes – items from the children’s menu – weren’t included on the nutritional charts.

Basically, the information we want is:

  • often not available at all
  • available, but only if you know where to look or who to ask
  • potentially out of date, as it comes from non-authoritative sources
  • incomplete or inaccurate, even from the authoritative sources
  • not regularly updated
  • not in easy to use formats
  • available electronically, e.g. in an app, but without any clear provenance

The reality is that this type of nutritional and ingredient data is basically in the same state as government data was 6-7 years ago. It’s something that really needs to change.

Legislation can help encourage supermarkets and restaurants to make data available, but really it’s time for them to recognise that this is essential information for many people. All supermarkets, manufacturers and major chains will have this data already, so there should be little effort required in making it public.

I’ve wondered whether this type of data ought to be considered as part of the UK National Information Infrastructure. It could be collected as part of the remit of the Food Standards Agency. Having a national source would help remove ambiguity around how data has been aggregated.

Whether you’re calorie or carb counting, open data can make an important difference. It’s about giving people the information they need to live healthy lives.

Creating an Application Using the British National Bibliography

This is the fourth and final post in a series (1, 2, 3, 4) providing background and tutorial material about the British National Bibliography. The tutorials were written as part of some freelance work I did for the British Library at the end of 2012. The material was used as input to creating the new documentation for their Linked Data platform but hasn’t been otherwise published. They are now published here with permission of the BL.

The British National Bibliography (BNB) is a bibliographic database that contains data on a wide range of books and serial publications published in the UK and Ireland since the 1950s. The database is available under a public domain license and can be accessed via an online API which supports the SPARQL query language.

This tutorial provides an example of building a simple web application against the BNB SPARQL endpoint using Ruby and various open source libraries. The tutorial includes:

  • a description of the application and its intended behaviour
  • a summary of the various open source components used to build the application
  • a description of how SPARQL is used to implement the application functionality

The example is written in a mixture of Ruby and Javascript. The code is well documented to support readers more familiar with other languages.

The “Find Me A Book!” Application

The Find Me a Book! demonstration application illustrates how to use the data in the BNB to recommend books to readers. The following design brief describes the intended behaviour.

The application will allow a user to provide an ISBN which is used to query the BNB in order to find other books that the user might potentially want to read. The application will also confirm the book title to the user to ensure that it has found the right information.

Book recommendations will be made in two ways:

  1. More By The Author: will provide a list of 10 other books by the same author(s)
  2. More From Reading Lists: will attempt to suggest 10 books based on series or categories in the BNB data

The first use case is quite straight-forward and should generate some “safe” recommendations: it’s likely that the user will like other works by the author.

The second approach uses the BNB data a little more creatively and so the suggestions are likely to be a little more varied.

Related books will be found by looking to see if the user’s book is in a series. If it is then the application will recommend other books from that series. If the book is not included in any series, then recommendations will be driven off the standard subject classifications. The idea is that series present ready made reading lists that are a good source of suggestions. By falling back to a broader categorisation, the user should always be presented with some recommendations.

To explore the recommended books further, the user will be provided with links to LibraryThing.com.

The Application Code

The full source code of the application is available on github.com. The code has been placed into the Public Domain so can be freely reused or extended.

The application is written in Ruby and should run on Ruby 1.8.7 or higher. Several open source frameworks were used to build the application:

  • Sinatra — a light-weight Ruby web application framework
  • SPARQL Client — a client library for accessing SPARQL endpoints from Ruby
  • The JQuery javascript library for performing AJAX requests and HTML manipulation
  • The Bootstrap CSS framework is used to build the basic page layout

The application code is very straight-forward and can be separated into server-side and client-side components.

Server Side

The server side implementation can be found in app.rb. The Ruby application delivers the application assets (CSS, images, etc) and also exposes several web services that act as proxies for the BNB dataset. These services submit SPARQL queries to the BNB SPARQL endpoint and then process the results to generate a custom JSON output.

The three services, each of which accepts an isbn parameter, are /title, /by-author and /related.

Each of the services works in essentially the same way:

  • The isbn parameter is extracted from the request. If the parameter is not found then an error is returned to the client. The ISBN value is also normalised to remove any spaces or dashes
  • A SPARQL client object is created to provide a way to interact with the SPARQL endpoint
  • The ISBN parameter is injected into the SPARQL query that will be run against the BNB, using the add_parameters function
  • The final query is then submitted to the SPARQL endpoint and the results used to build the JSON response

The /related service may actually make two calls to the endpoint. If the first query doesn’t return any results then a fallback query is used instead.
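That fallback behaviour can be sketched as follows. This is a simplified illustration rather than the actual code from app.rb, and the query runner is stubbed so nothing touches the network:

```ruby
# Simplified sketch of the /related service's logic (not the real app.rb code).
# The query runner is passed in so that it can be stubbed for testing.

def normalise_isbn(isbn)
  # Remove any spaces or dashes, as the services do with the isbn parameter
  isbn.to_s.gsub(/[\s-]/, "")
end

def related_books(isbn, query_runner)
  isbn = normalise_isbn(isbn)
  # First look for other books in the same series...
  results = query_runner.call(:series, isbn)
  # ...then fall back to a subject-based query if nothing was found
  results = query_runner.call(:subject, isbn) if results.empty?
  results
end

# Stubbed runner: pretend the series query finds nothing for this ISBN
runner = lambda do |query_type, isbn|
  query_type == :series ? [] : ["A book matching the subjects of #{isbn}"]
end

puts related_books("0-261-10221-4", runner).first
```

Injecting the query runner keeps the fallback logic testable without a live SPARQL endpoint.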


Client Side

The client side Javascript code can all be found in find-me-a-book.js. It uses the JQuery library to trigger custom code to be executed when the user submits the search form with an ISBN.

The findTitle function calls the /title service to attempt to resolve the ISBN into the title of a book. This checks that the ISBN is in the BNB and provides useful feedback for the user.

If this initial call succeeds then the find function is called twice to submit parallel AJAX requests: one to the /by-author service, and one to the /related service. The function accepts two parameters: the first identifies the service to call, the second provides a name that is used to guide the processing of the results.

The HTML markup uses a naming convention to allow the find function to write the results of the request into the correct parts of the page, depending on its second parameter.

The ISBN and title information found in the results from the AJAX requests are used to build links to the LibraryThing website. But these could also be processed in other ways, e.g. to provide multiple links or invoke other APIs.

Installing and Running the Application

A live instance of the application has been deployed to allow the code to be tested without having to install and run it locally. The application can be found at:


For readers interested in customising the application code, this section provides instructions on how to access the source code and run the application.

The instructions have been tested on Ubuntu. Follow the relevant documentation links for help with installation of the various dependencies on other systems.

Source Code

The application source code is available on Github and is organised into several directories:

  • public — static files including CSS, Javascript and Images. The main client-side Javascript code can be found in find-me-a-book.js
  • views — the templates used in the application
  • src — the application source code, which is contained in app.rb

The additional files in the project directory provide support for deploying the application and installing the dependencies.

Running the Application

To run the application locally, ensure that Ruby, RubyGems and git are installed on the local machine.

To download all of the source code and assets, clone the git repository:

git clone https://github.com/ldodds/bnb-example-app.git

This will create a bnb-example-app directory. To simplify the installation of further dependencies, the project uses the Bundler dependency management tool. This must be installed first:

sudo gem install bundler

Bundler can then be run to install the additional Ruby Gems required by the project:

cd bnb-example-app
sudo bundle install

Once complete the application can be run as follows:

rackup
The rackup application will then start the application as defined in config.ru. By default the application will launch on port 9292 and should be accessible from:

http://localhost:9292/

This tutorial has introduced a simple demonstration application that illustrates one way of interacting with the BNB SPARQL endpoint. The application uses SPARQL queries to build a very simple book recommendation tool. The logic used to build the recommendations is deliberately simple to help illustrate the basic principles of working with the dataset and the API.

The source code for the application is available under a public domain license so can be customised or reused as necessary. A live instance provides a way to test the application against the real data.

Accessing the British National Bibliography Using SPARQL

This is the third in a series of posts (1, 2, 3, 4) providing background and tutorial material about the British National Bibliography. The tutorials were written as part of some freelance work I did for the British Library at the end of 2012. The material was used as input to creating the new documentation for their Linked Data platform but hasn’t been otherwise published. They are now published here with permission of the BL.

Note: while I’ve attempted to fix up these instructions to account for changes to the platform on which the data is published, there may still be some errors. If there are then please leave a comment or drop me an email and I’ll endeavour to fix them.

The British National Bibliography (BNB) is a bibliographic database that contains data on a wide range of books and serial publications published in the UK and Ireland since the 1950s. The database is available under a public domain license and can be accessed via an online API.

The tutorial introduces developers to the BNB API which supports querying of the dataset via the SPARQL query language and protocol. The tutorial provides:

  • Pointers to relevant background material and tutorials on SPARQL and the SPARQL Protocol
  • A collection of useful queries and query patterns for working with the BNB dataset

The queries described in this tutorial have been published as a collection of files that can be downloaded from github.

What is SPARQL?

SPARQL is a W3C standard which defines a query language for RDF databases. Roughly speaking, SPARQL is the equivalent of SQL for graph databases. SPARQL 1.0 was first published as an official W3C Recommendation in 2008. At the time of writing, SPARQL 1.1, which provides a number of new language features, will shortly be published as a final recommendation.

A SPARQL endpoint implements the SPARQL protocol allowing queries to be submitted over the web. Public SPARQL endpoints offer an API that allows application developers to query and extract data from web or mobile applications.

A complete SPARQL tutorial is outside the scope of this document, but there are a number of excellent resources available for developers wishing to learn more about the query language. Some recommended tutorials and reference guides include:

The BNB SPARQL Endpoint

The BNB public SPARQL endpoint is available from:


No authentication or API keys are required to use this API.

The BNB endpoint supports SPARQL 1.0 only. Queries can be submitted to the endpoint using either GET or POST requests. For POST requests the query is submitted as the body of the request, while for GET requests the query is URL encoded and provided in the query parameter, e.g:


Refer to the SPARQL protocol specification for additional background on submitting queries. Client libraries for interacting with SPARQL endpoints are available in a variety of languages, including python, ruby, nodejs, PHP and Java.
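As a sketch of the GET case, a request URL can be built with Ruby’s standard library. The endpoint address below is a placeholder, not the real BNB endpoint:

```ruby
require "uri"

# Build a GET request URL for a SPARQL endpoint by URL-encoding the query
# into the "query" parameter. The endpoint address here is a placeholder.
def sparql_get_url(endpoint, query, output = nil)
  params = { "query" => query }
  # The BNB endpoint also accepts an output parameter, e.g. output=json
  params["output"] = output if output
  "#{endpoint}?#{URI.encode_www_form(params)}"
end

puts sparql_get_url("https://example.org/sparql",
                    'SELECT ?uri WHERE { ?uri ?p ?o } LIMIT 1',
                    "json")
```

URI.encode_www_form takes care of escaping the braces, question marks and quotes that appear in SPARQL queries.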

Types of SPARQL Query and Result Formats

There are four different types of SPARQL query. Each of the different types supports a different use case:

  • ASK: returns a true or false response to test whether data is present in a dataset, e.g. to perform assertions or check for interesting data before submitting queries. Note these no longer seem to be supported by the BL SPARQL endpoint. All ASK queries now return an error.
  • SELECT: like the SQL SELECT statement this type of query returns a simple tabular result set. Useful for extracting values for processing in non-RDF systems
  • DESCRIBE: requests that the SPARQL endpoint provides a default description of the queried results in the form of an RDF graph
  • CONSTRUCT: builds a custom RDF graph based on data in the dataset

Query results can typically be serialized into multiple formats. ASK and SELECT queries have standard XML and JSON result formats. The graphs produced by DESCRIBE and CONSTRUCT queries can be serialized into any RDF format including Turtle and RDF/XML. The BNB endpoint also supports RDF/JSON output from these types of query. Alternate formats can be selected using the output URL parameter, e.g. output=json:


General Patterns

The following sections provide a number of useful query patterns that illustrate some basic ways to query the BNB.

Discovering URIs

One very common use case when working with a SPARQL endpoint is the need to discover the URI for a resource. For example, the ISBN of a book or the ISSN of a serial is likely to be found in a wide variety of databases. It would be useful to be able to use those identifiers to look up the corresponding resource in the BNB.

Here’s a simple SELECT query that looks up a book based on its ISBN-10:

#Declare a prefix for the bibo schema
PREFIX bibo: <http://purl.org/ontology/bibo/>

SELECT ?uri WHERE {
  #Match any resource that has the specific property and value
  ?uri bibo:isbn10 "0261102214".
}

As can be seen from executing this query there are actually 4 different editions that have been published using this ISBN.

Here is a variation of the same query that identifies the resource with an ISSN of 1356-0069:

PREFIX bibo: <http://purl.org/ontology/bibo/>

SELECT ?uri WHERE {
  ?uri bibo:issn "1356-0069".
}

The basic query pattern is the same in each case. Resources are matched based on the value of a literal property. To find different resources just substitute in a different value or match on a different property. The results can be used in further queries or used to access the BNB Linked Data by performing a GET request on the URI.
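For example, the description of a matched resource can be fetched by dereferencing its URI. Here is a sketch using Ruby’s standard library; the request is only built, not sent, and it assumes the Linked Data pages support content negotiation:

```ruby
require "net/http"
require "uri"

# Build (but don't send) a GET request for a resource URI, asking for Turtle
# via content negotiation. Sending it would just be http.request(request).
def linked_data_request(resource_uri, format = "text/turtle")
  uri = URI(resource_uri)
  request = Net::HTTP::Get.new(uri)
  request["Accept"] = format
  request
end

req = linked_data_request("http://bnb.data.bl.uk/id/resource/009910399")
puts req["Accept"]
```

The same pattern works for any of the URIs returned by the queries above.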

In some cases it may just be useful to know whether there is a resource that has a matching identifier in the dataset. An ASK query supports this use case. The following query should return true as there is a resource in the BNB with the given ISSN:

PREFIX bibo: <http://purl.org/ontology/bibo/>

ASK WHERE {
  ?uri bibo:issn "1356-0069".
}

Note ASK queries no longer seem to be supported by the BL SPARQL endpoint. All ASK queries now return an error

Extracting Data Using Identifiers

Rather than just request a URI or list of URIs it would be useful to extract some additional attributes of the resources. This is easily done by extending the query pattern to include more properties.

The following example extracts the URI, title and BNB number for all books with a given ISBN:

#Declare some additional prefixes
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX blterms: <http://www.bl.uk/schemas/bibliographic/blterms#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?uri ?bnb ?title WHERE {
  #Match the books by ISBN
  ?uri bibo:isbn10 "0261102214";
       #bind some variables to their other attributes
       blterms:bnb ?bnb;
       dct:title ?title.
}

This pattern extends the previous examples in several ways. Firstly, some additional prefixes are declared because the properties of interest come from several different schemas. Secondly, the query pattern is extended to match the additional attributes of the resources. The values of those attributes are bound to variables. Finally, the SELECT clause is extended to list all the variables that should be returned.

If the URI for a resource is already known then this can be used to directly identify it. Its properties can then be matched and extracted. The following query returns the ISBN, title and BNB number for a specific book:

PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX blterms: <http://www.bl.uk/schemas/bibliographic/blterms#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?isbn ?title ?bnb WHERE {
  <http://bnb.data.bl.uk/id/resource/009910399> bibo:isbn10 ?isbn;
       blterms:bnb ?bnb;
       dct:title ?title.
}

Whereas the former query identified resources indirectly, via the value of an attribute, this query directly references a resource using its URI. The query pattern then matches the properties that are of interest. Matching resources by URI is usually much faster than matching based on a literal property.

Itemising all of the properties of a resource can be tiresome. Using SPARQL it is possible to ask the SPARQL endpoint to generate a useful summary of a resource (called a Bounded Description). The endpoint will typically return all attributes and relationships of the resource. This can be achieved using a simple DESCRIBE query:

DESCRIBE <http://bnb.data.bl.uk/id/resource/009910399>

The query doesn’t need to define any prefixes or match any properties: the endpoint will simply return what it knows about a resource as RDF. If RDF/XML isn’t useful then the same results can be retrieved as JSON.

Reverting back to the previous approach of indirectly identifying resources, it’s possible to ask the endpoint to generate descriptions of all books with a given ISBN:

PREFIX bibo: <http://purl.org/ontology/bibo/>

DESCRIBE ?uri WHERE {
  ?uri bibo:isbn10 "0261102214".
}

Matching By Relationship

Resources can also be matched based on their relationships, by traversing across the graph of data. For example it’s possible to look up the author for a given book:

PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?author WHERE {
  #Match the book
  ?uri bibo:isbn10 "0261102214";
       #Match its author
       dct:creator ?author.
}

As there are four books with this ISBN the query results return the URI for Tolkien four times. Adding a DISTINCT will remove any duplicates:

PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT DISTINCT ?author WHERE {
  #Match the book
  ?uri bibo:isbn10 "0261102214";
       #Match its author
       dct:creator ?author.
}

Type Specific Patterns

The following sections provide some additional example queries that illustrate some useful queries for working with some specific types of resource in the BNB dataset. Each query is accompanied by links to the SPARQL endpoint that show the results.

For clarity the PREFIX declarations in each query have been omitted. It should be assumed that each query is preceded with the following prefix declarations:

PREFIX bio: <http://purl.org/vocab/bio/0.1/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX blterms: <http://www.bl.uk/schemas/bibliographic/blterms#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX isbd: <http://iflastandards.info/ns/isbd/elements/>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rda: <http://RDVocab.info/ElementsGr2/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

Not all of these are required for all of the queries, but they declare all of the prefixes that are likely to be useful when querying the BNB.
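If queries are built programmatically, the declarations can be kept in a single string and prepended to each query body before submission. A minimal sketch (only a few of the prefixes above are repeated, for brevity):

```python
# A subset of the shared prefix declarations listed above.
PREFIXES = """\
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
"""

def with_prefixes(body):
    """Prepend the shared PREFIX block to a query body."""
    return PREFIXES + body

query = with_prefixes('SELECT ?title WHERE { ?b dct:title ?title } LIMIT 10')
```

Declaring unused prefixes is harmless, so keeping one shared block is simpler than tailoring the declarations to each query.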


Working with Authors

There are a number of interesting queries that can be used to interact with author data in the BNB.

List Books By An Author

The following query lists all published books written by C. S. Lewis, with the most recently published books returned first:

SELECT ?book ?isbn ?title ?year WHERE {
  #Match all books with Lewis as an author
  ?book dct:creator <http://bnb.data.bl.uk/id/person/LewisCS%28CliveStaples%291898-1963>;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        #match the publication event
        blterms:publication ?publication.

  #match the time of the publication event
  ?publication event:time ?time.
  #match the label of the year
  ?time rdfs:label ?year
}
#order by descending year, after casting year as an integer
ORDER BY DESC( xsd:int(?year) )

Identifying Genre of an Author

Books in the BNB are associated with one or more subject categories. By looking up the list of categories associated with an author's works it may be possible to get a sense of what type of books they have written. Here is a query that returns the list of categories associated with C. S. Lewis's works:

SELECT DISTINCT ?category ?label WHERE {
  #Match all books with Lewis as an author
  ?book dct:creator <http://bnb.data.bl.uk/id/person/LewisCS%28CliveStaples%291898-1963>;
     dct:subject ?category.

  ?category rdfs:label ?label.
}
ORDER BY ?label

Relationships Between Contributors

The following query extracts a list of all people who have contributed to one or more C. S. Lewis books:

SELECT DISTINCT ?name WHERE {
  ?book dct:creator <http://bnb.data.bl.uk/id/person/LewisCS%28CliveStaples%291898-1963>;
     dct:contributor ?author.

  ?author foaf:name ?name.

  FILTER (?author != <http://bnb.data.bl.uk/id/person/LewisCS%28CliveStaples%291898-1963>)
}
ORDER BY ?name

Going one step further, it's possible to identify people that serve as connections between different authors. For example, this query finds people that have contributed to books by both C. S. Lewis and J. R. R. Tolkien:

SELECT DISTINCT ?name WHERE {
  ?book dct:creator <http://bnb.data.bl.uk/id/person/LewisCS%28CliveStaples%291898-1963>;
     dct:contributor ?author.

  ?otherBook dct:creator <http://bnb.data.bl.uk/id/person/TolkienJRR%28JohnRonaldReuel%291892-1973>;
     dct:contributor ?author.

  ?author foaf:name ?name.
}
ORDER BY ?name

Authors Born in a Year

The basic biographical information in the BNB can also be used in queries. For example, many authors have a recorded year of birth and some a year of death. These are described as Birth or Death Events in the data. The following query illustrates how to find 50 authors born in 1944:

SELECT ?author ?name WHERE {
   ?event a bio:Birth;
      bio:date "1944"^^<http://www.w3.org/2001/XMLSchema#gYear>.

   ?author bio:event ?event;
      foaf:name ?name.
}
LIMIT 50

The years associated with Birth and Death events have an XML Schema datatype associated with them (xsd:gYear). It is important to specify this type in the query, otherwise the query will fail to match any data.
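The shape of that typed literal can be built up programmatically. A small sketch (the helper name is ours, not part of any library):

```python
XSD_GYEAR = "http://www.w3.org/2001/XMLSchema#gYear"

def typed_literal(value, datatype):
    """Render a SPARQL typed literal such as "1944"^^<...#gYear>."""
    return '"%s"^^<%s>' % (value, datatype)

# A plain, untyped "1944" would not match the gYear-typed values in the data.
literal = typed_literal("1944", XSD_GYEAR)
```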


Working with Books

There are a large number of published works in the BNB, so extracting useful subsets involves identifying dimensions in the data that can be used to filter the results. In addition to finding books by an author, there are other useful facets that relate to books, including:

  • Year of Publication
  • Location of Publication
  • Publisher

The following sections include queries that extract data along these dimensions. In each case the key step is to match the Publication Event associated with the book.

Books Published in a Year

Publication Events have a “time” relationship that refers to a resource for the year of publication. The following query extracts 50 books published in 2010:

SELECT ?book ?isbn ?title ?year WHERE {
  ?book dct:creator ?author;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        #match the publication event
        blterms:publication ?publication.

  #match the time of the publication event
  ?publication event:time ?time.
  #match the label of the year
  ?time rdfs:label "2010"
}
LIMIT 50

Books Published in a Location

Finding books based on their place of publication is a variation of the above query. Rather than matching the time relationship, the query instead looks for the location associated with the publication event. This query finds 50 books published in Bath:

SELECT ?book ?isbn ?title ?year WHERE {
  ?book dct:creator ?author;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        blterms:publication ?publication.

  ?publication event:place ?place.
  ?place rdfs:label "Bath"
}
LIMIT 50

Books From a Publisher

In addition to the time and place relationships, Publication Events are also related to a publisher via an “agent” relationship. The following query uses a combination of the time and agent relationships to find 50 books published by Allen & Unwin in 2011:

SELECT ?book ?isbn ?title ?year WHERE {
  ?book dct:creator ?author;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        blterms:publication ?publication.

  ?publication event:agent ?agent;
       event:time ?time.

  ?agent rdfs:label "Allen & Unwin".
  ?time rdfs:label "2011".
}
LIMIT 50

These query patterns can easily be extended and combined, e.g. to limit results by a combination of place, time and publisher, or to filter along different dimensions such as subject category.
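As a sketch of that kind of adaptation, the publisher-and-year pattern can be templated with plain string substitution (fine for trusted values; user-supplied input would need escaping first):

```python
# The publisher/year query pattern from above, with substitutable values.
TEMPLATE = """SELECT ?book ?isbn ?title WHERE {
  ?book dct:creator ?author;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        blterms:publication ?publication.

  ?publication event:agent ?agent;
       event:time ?time.

  ?agent rdfs:label "%(publisher)s".
  ?time rdfs:label "%(year)s".
}"""

def books_query(publisher, year):
    """Fill the publisher and year into the query pattern."""
    return TEMPLATE % {"publisher": publisher, "year": year}

query = books_query("Allen & Unwin", "2011")
```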


Working with Series

The BNB includes nearly 20,000 book series. The following queries illustrate some useful ways to interact with that data.

Books in a Series

Finding the books associated with a specific series is relatively straightforward. The following query is very similar to an earlier query to find books based on an author. However, in this case the list of books to be returned is identified by matching those that have a "has part" relationship with a series. The query finds books that are part of the "Pleasure In Reading" series:

SELECT ?book ?isbn ?title ?year WHERE {
  <http://bnb.data.bl.uk/id/series/Pleasureinreading> dct:hasPart ?book.

  ?book dct:creator ?author;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        blterms:publication ?publication.

  ?publication event:agent ?agent;
       event:time ?time.

  ?time rdfs:label ?year.
}

Categories for a Series

The BNB only includes minimal metadata about each series: just a name and a list of books. In order to get a little more insight into the type of book included in a series, the following query finds a list of the subject categories associated with a series:

SELECT DISTINCT ?label WHERE {
  <http://bnb.data.bl.uk/id/series/Pleasureinreading> dct:hasPart ?book.

  ?book dct:subject ?subject.

  ?subject rdfs:label ?label.
}

As with the previous query the “Pleasure in Reading” series is identified by its URI. As books in the series might share a category the query uses the DISTINCT keyword to filter the results.

Series Recommendation

A series could be considered as a reading list containing useful suggestions of books on particular topics. One way to find a reading list might be to find lists based on subject category, using a variation of the previous query.

Another approach would be to find lists that already contain works by a favourite author. For example the following query finds the URI and the label of all series that contain books by J. R. R. Tolkien:

SELECT DISTINCT ?series ?label WHERE {
  ?book dct:creator ?author.
  ?author foaf:name "J. R. R. Tolkien".

  ?series dct:hasPart ?book;
     rdfs:label ?label.
}



Working with Categories

The rich subject categories in the BNB data provide a number of useful ways to slice and dice the data. For example it is often useful to just fetch a list of books based on their category. The following query finds a list of American Detective and Mystery books:

SELECT ?book ?title ?name WHERE {

   ?book dct:title ?title;
         dct:creator ?author;
         dct:subject <http://bnb.data.bl.uk/id/concept/lcsh/DetectiveandmysterystoriesAmericanFiction>.

  ?author foaf:name ?name.
}
ORDER BY ?name ?title

For common or broad categories these lists can become very large, so filtering them further into more manageable chunks may be necessary.
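One common way to fetch a large result set in manageable chunks is LIMIT/OFFSET paging. A sketch of a helper that yields successive pages of a SELECT query:

```python
def paged(query, page_size=100):
    """Yield copies of a SELECT query with increasing OFFSETs,
    for fetching a large result set one page at a time."""
    offset = 0
    while True:
        yield "%s LIMIT %d OFFSET %d" % (query, page_size, offset)
        offset += page_size

pages = paged("SELECT ?book WHERE { ?book dct:title ?t }", page_size=50)
first = next(pages)   # ends with "LIMIT 50 OFFSET 0"
second = next(pages)  # ends with "LIMIT 50 OFFSET 50"
```

Note that paging is only stable if the query includes an ORDER BY clause; without one, the endpoint is free to return results in a different order for each request.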


Working with Serials

Many of the periodicals and newspapers published in the UK have a local or regional focus. This geographical relationship is recorded in the BNB via a “spatial” relationship of the serial resource. This relationship supports finding publications that are relevant to a particular location in the United Kingdom.

The following query finds serials that focus on the City of Bath:

SELECT ?title ?issn WHERE {

   ?serial dct:title ?title;
           bibo:issn ?issn;
           dct:spatial ?place.

   ?place rdfs:label "Bath (England)".
}

The exact name of the location is used in the match. While it would be possible to filter the results based on a regular expression, this can be very slow. The following query shows how to extract a list of locations referenced from the Dublin Core spatial relationship. This list could be used to populate a search form or application navigation to enable efficient filtering by place name:


SELECT DISTINCT ?label WHERE {
   ?serial dct:spatial ?place.
   ?place rdfs:label ?label.
}

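Assuming the results are requested in the standard SPARQL 1.1 JSON results format, the labels can be pulled out with a few lines of standard-library Python (the sample document below stands in for a real endpoint response):

```python
import json

# A miniature SPARQL JSON results document, in the standard W3C format.
sample = json.loads("""{
  "head": {"vars": ["label"]},
  "results": {"bindings": [
    {"label": {"type": "literal", "value": "Bath (England)"}},
    {"label": {"type": "literal", "value": "Bristol (England)"}}
  ]}
}""")

def labels(results):
    """Extract the ?label values from a SPARQL JSON results document."""
    return [b["label"]["value"] for b in results["results"]["bindings"]]

places = labels(sample)  # ["Bath (England)", "Bristol (England)"]
```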

Summary

This tutorial has provided an introduction to using SPARQL to extract data from the BNB dataset. When working with a SPARQL endpoint it is often useful to have example queries that can be customised to support particular use cases. The tutorial has included multiple examples and these are all available to download.

The tutorial has covered some useful general approaches for matching resources based on identifiers and relationships. Looking up URIs in a dataset is an important step in mapping from systems that contain non-URI identifiers, e.g. ISSNs or ISBNs. Once a URI has been discovered it can be used to directly access the BNB Linked Data or used as a parameter to drive further queries.

A number of example queries have also been included showing how to ask useful and interesting questions of the dataset. These queries relate to the main types of resources in the BNB and illustrate how to slice and dice the dataset along a number of different dimensions.

While the majority of the sample queries are simple SELECT queries, it is possible to create variants that use CONSTRUCT or DESCRIBE queries to extract data in other ways. Several good SPARQL tutorials have been referenced to provide further background reading for developers interested in digging into this further.