Category Archives: Semantic Web

Creating an Application Using the British National Bibliography

This is the fourth and final post in a series (1, 2, 3, 4) providing background and tutorial material about the British National Bibliography. The tutorials were written as part of some freelance work I did for the British Library at the end of 2012. The material was used as input to creating the new documentation for their Linked Data platform but hasn’t been otherwise published. They are now published here with permission of the BL.

The British National Bibliography (BNB) is a bibliographic database that contains data on a wide range of books and serial publications published in the UK and Ireland since the 1950s. The database is available under a public domain license and can be accessed via an online API which supports the SPARQL query language.

This tutorial provides an example of building a simple web application using the BNB SPARQL endpoint using Ruby and various open source libraries. The tutorial includes:

  • a description of the application and its intended behaviour
  • a summary of the various open source components used to build the application
  • a description of how SPARQL is used to implement the application functionality

The example is written in a mixture of Ruby and Javascript. The code is well documented to support readers more familiar with other languages.

The “Find Me A Book!” Application

The Find Me a Book! demonstration application illustrates how to use the data in the BNB to recommend books to readers. The following design brief describes the intended behaviour.

The application will allow a user to provide an ISBN, which is used to query the BNB in order to find other books that the user might potentially want to read. The application will also confirm the book title to the user to ensure that it has found the right information.

Book recommendations will be made in two ways:

  1. More By The Author: will provide a list of 10 other books by the same author(s)
  2. More From Reading Lists: will attempt to suggest 10 books based on series or categories in the BNB data

The first use case is quite straight-forward and should generate some “safe” recommendations: it’s likely that the user will like other works by the author.

The second approach uses the BNB data a little more creatively, so the suggestions are likely to be a little more varied.

Related books will be found by looking to see if the user’s book is in a series. If it is then the application will recommend other books from that series. If the book is not included in any series, then recommendations will be driven off the standard subject classifications. The idea is that series present ready made reading lists that are a good source of suggestions. By falling back to a broader categorisation, the user should always be presented with some recommendations.

To explore the recommended books further, the user will be provided with links to LibraryThing.com.

The Application Code

The full source code of the application is available on github.com. The code has been placed into the Public Domain so can be freely reused or extended.

The application is written in Ruby and should run on Ruby 1.8.7 or higher. Several open source frameworks were used to build the application:

  • Sinatra — a light-weight Ruby web application framework
  • SPARQL Client — a client library for accessing SPARQL endpoints from Ruby
  • The JQuery Javascript library, used for performing AJAX requests and HTML manipulation
  • The Bootstrap CSS framework, used to build the basic page layout

The application code is very straight-forward and can be separated into server-side and client-side components.

Server Side

The server side implementation can be found in app.rb. The Ruby application delivers the application assets (CSS, images, etc) and also exposes several web services that act as proxies for the BNB dataset. These services submit SPARQL queries to the BNB SPARQL endpoint and then process the results to generate a custom JSON output.

The three services, each of which accepts an isbn parameter, are:

  • /title — looks up the title of the book with the given ISBN
  • /by-author — finds other books by the same author(s)
  • /related — finds related books, based either on a series or on shared subject categories

Each of the services works in essentially the same way:

  • The isbn parameter is extracted from the request. If the parameter is not found then an error is returned to the client. The ISBN value is also normalised to remove any spaces or dashes
  • A SPARQL client object is created to provide a way to interact with the SPARQL endpoint
  • The ISBN parameter is injected into the SPARQL query that will be run against the BNB, using the add_parameters function
  • The final query is then submitted to the SPARQL endpoint and the results used to build the JSON response

The /related service may actually make two calls to the endpoint: if the first, series-based query doesn’t return any results then a fallback query based on subject categories is used instead.
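
The exact queries used by the services can be found in the project source on github. As a rough sketch only, and not the application’s actual query, the “More By The Author” suggestions could be generated along the following lines (using the example ISBN that appears elsewhere in this series; in the application the ISBN literal is injected from the isbn parameter rather than hard-coded):

PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT DISTINCT ?book ?isbn ?title WHERE {
  #find the author(s) of the book with the supplied ISBN
  ?original bibo:isbn10 "0261102214";
            dct:creator ?author.

  #find other books by the same author(s)
  ?book dct:creator ?author;
        bibo:isbn10 ?isbn;
        dct:title ?title.

  #exclude the original book from the suggestions
  FILTER (?book != ?original)
}
LIMIT 10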

Client-Side

The client side Javascript code can all be found in find-me-a-book.js. It uses the JQuery library to trigger custom code to be executed when the user submits the search form with an ISBN.

The findTitle function calls the /title service to attempt to resolve the ISBN into the title of a book. This checks that the ISBN is in the BNB and provides useful feedback for the user.

If this initial call succeeds then the find function is called twice to submit parallel AJAX requests: one to the /by-author service, and one to the /related service. The function accepts two parameters: the first identifies the service to call, while the second provides a name that is used to guide the processing of the results.

The HTML markup uses a naming convention to allow the find function to write the results of the request into the correct parts of the page, depending on its second parameter.

The ISBN and title information found in the results from the AJAX requests are used to build links to the LibraryThing website. But these could also be processed in other ways, e.g. to provide multiple links or invoke other APIs.

Installing and Running the Application

A live instance of the application has been deployed to allow the code to be tested without having to install and run it locally. The application can be found at:

http://findmeabook.herokuapp.com/

For readers interested in customising the application code, this section provides instructions on how to access the source code and run the application.

The instructions have been tested on Ubuntu. Follow the relevant documentation links for help with installation of the various dependencies on other systems.

Source Code

The application source code is available on Github and is organised into several directories:

  • public — static files including CSS, Javascript and Images. The main client-side Javascript code can be found in find-me-a-book.js
  • views — the templates used in the application
  • src — the application source code, which is contained in app.rb

The additional files in the project directory provide support for deploying the application and installing the dependencies.

Running the Application

To run the application locally, ensure that Ruby, RubyGems and git are installed on the local machine.

To download all of the source code and assets, clone the git repository:

git clone https://github.com/ldodds/bnb-example-app.git

This will create a bnb-example-app directory. To simplify the installation of further dependencies, the project uses the Bundler dependency management tool. This must be installed first:

sudo gem install bundler

Bundler can then be run to install the additional Ruby Gems required by the project:

cd bnb-example-app
sudo bundle install

Once complete the application can be run as follows:

rackup

The rackup application will then start the application as defined in config.ru. By default the application will launch on port 9292 and should be accessible from:

http://localhost:9292

Summary

This tutorial has introduced a simple demonstration application that illustrates one way of interacting with the BNB SPARQL endpoint. The application uses SPARQL queries to build a very simple book recommendation tool. The logic used to build the recommendations is deliberately simple to help illustrate the basic principles of working with the dataset and the API.

The source code for the application is available under a public domain license so can be customised or reused as necessary. A live instance provides a way to test the application against the real data.


Accessing the British National Bibliography Using SPARQL

This is the third in a series of posts (1, 2, 3, 4) providing background and tutorial material about the British National Bibliography. The tutorials were written as part of some freelance work I did for the British Library at the end of 2012. The material was used as input to creating the new documentation for their Linked Data platform but hasn’t been otherwise published. They are now published here with permission of the BL.

Note: while I’ve attempted to fix up these instructions to account for changes to the platform on which the data is published, there may still be some errors. If there are then please leave a comment or drop me an email and I’ll endeavour to fix them.

The British National Bibliography (BNB) is a bibliographic database that contains data on a wide range of books and serial publications published in the UK and Ireland since the 1950s. The database is available under a public domain license and can be accessed via an online API.

The tutorial introduces developers to the BNB API which supports querying of the dataset via the SPARQL query language and protocol. The tutorial provides:

  • Pointers to relevant background material and tutorials on SPARQL and the SPARQL Protocol
  • A collection of useful queries and query patterns for working with the BNB dataset

The queries described in this tutorial have been published as a collection of files that can be downloaded from github.

What is SPARQL?

SPARQL is a W3C standard which defines a query language for RDF databases. Roughly speaking SPARQL is the equivalent of SQL for graph databases. SPARQL 1.0 was first published as an official W3C Recommendation in 2008. At the time of writing SPARQL 1.1, which provides a number of new language features, will shortly be published as a final recommendation.

A SPARQL endpoint implements the SPARQL protocol allowing queries to be submitted over the web. Public SPARQL endpoints offer an API that allows application developers to query and extract data from web or mobile applications.

A complete SPARQL tutorial is outside the scope of this document, but there are a number of excellent resources available for developers wishing to learn more about the query language. Some recommended tutorials and reference guides include:

The BNB SPARQL Endpoint

The BNB public SPARQL endpoint is available from:

http://bnb.data.bl.uk/sparql

No authentication or API keys are required to use this API.

The BNB endpoint supports SPARQL 1.0 only. Queries can be submitted to the endpoint using either GET or POST requests. For POST requests the query is submitted as the body of the request, while for GET requests the query is URL encoded and provided in the query parameter, e.g.:

http://bnb.data.bl.uk/sparql?query=SELECT+%3Fs+%3Fp+%3Fo+WHERE+%7B%3Fs+%3Fp+%3Fo%7D+LIMIT+1
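
URL-decoded, the query parameter in this example is simply:

SELECT ?s ?p ?o WHERE {?s ?p ?o} LIMIT 1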

Refer to the SPARQL protocol specification for additional background on submitting queries. Client libraries for interacting with SPARQL endpoints are available in a variety of languages, including python, ruby, nodejs, PHP and Java.

Types of SPARQL Query and Result Formats

There are four different types of SPARQL query. Each of the different types supports a different use case:

  • ASK: returns a true or false response to test whether data is present in a dataset, e.g. to perform assertions or check for interesting data before submitting queries. Note these no longer seem to be supported by the BL SPARQL endpoint. All ASK queries now return an error.
  • SELECT: like the SQL SELECT statement this type of query returns a simple tabular result set. Useful for extracting values for processing in non-RDF systems
  • DESCRIBE: requests that the SPARQL endpoint provides a default description of the queried results in the form of an RDF graph
  • CONSTRUCT: builds a custom RDF graph based on data in the dataset

Query results can typically be serialized into multiple formats. ASK and SELECT queries have standard XML and JSON result formats. The graphs produced by DESCRIBE and CONSTRUCT queries can be serialized into any RDF format including Turtle and RDF/XML. The BNB endpoint also supports RDF/JSON output from these types of query. Alternate formats can be selected using the output URL parameter, e.g. output=json:

http://bnb.data.bl.uk/sparql?query=SELECT+%3Fs+%3Fp+%3Fo+WHERE+%7B%3Fs+%3Fp+%3Fo%7D+LIMIT+1&output=json
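
For illustration, here is a minimal sketch of a CONSTRUCT query. It uses two vocabulary terms introduced later in this tutorial to build a small custom graph of books, their ISBNs and titles:

PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>

CONSTRUCT {
  #template for the custom graph
  ?book dct:title ?title;
        bibo:isbn10 ?isbn.
}
WHERE {
  #match books that have both an ISBN and a title
  ?book bibo:isbn10 ?isbn;
        dct:title ?title.
}
LIMIT 100

The resulting graph can then be serialized into any of the supported RDF formats.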

General Patterns

The following sections provide a number of useful query patterns that illustrate some basic ways to query the BNB.

Discovering URIs

One very common use case when working with a SPARQL endpoint is the need to discover the URI for a resource. For example, the ISBN number for a book or an ISSN number of a serial is likely to be found in a wide variety of databases. It would be useful to be able to use those identifiers to look up the corresponding resource in the BNB.

Here’s a simple SELECT query that looks up a book based on its ISBN-10:


#Declare a prefix for the bibo schema
PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?uri WHERE {
  #Match any resource that has the specific property and value
  ?uri bibo:isbn10 "0261102214".
}

As can be seen from executing this query there are actually 4 different editions of The Hobbit that have been published using this ISBN.

Here is a variation of the same query that identifies the resource with an ISSN of 1356-0069:


PREFIX bibo: <http://purl.org/ontology/bibo/>
SELECT ?uri WHERE {
  ?uri bibo:issn "1356-0069".
}

The basic query pattern is the same in each case. Resources are matched based on the value of a literal property. To find different resources just substitute in a different value or match on a different property. The results can be used in further queries or used to access the BNB Linked Data by performing a GET request on the URI.

In some cases it may just be useful to know whether there is a resource that has a matching identifier in the dataset. An ASK query supports this use case. The following query should return true as there is a resource in the BNB with the given ISSN:


PREFIX bibo: <http://purl.org/ontology/bibo/>
ASK WHERE {
  ?uri bibo:issn "1356-0069".
}

Note ASK queries no longer seem to be supported by the BL SPARQL endpoint. All ASK queries now return an error.

Extracting Data Using Identifiers

Rather than just request a URI or list of URIs it would be useful to extract some additional attributes of the resources. This is easily done by extending the query pattern to include more properties.

The following example extracts the URI, title and BNB number for all books with a given ISBN:


#Declare some additional prefixes
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX blterms: <http://www.bl.uk/schemas/bibliographic/blterms#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?uri ?bnb ?title WHERE {
  #Match the books by ISBN
  ?uri bibo:isbn10 "0261102214";
       #bind some variables to their other attributes
       blterms:bnb ?bnb;
       dct:title ?title.
}

This pattern extends the previous examples in several ways. Firstly, some additional prefixes are declared because the properties of interest come from several different schemas. Secondly, the query pattern is extended to match the additional attributes of the resources, binding their values to variables. Finally, the SELECT clause is extended to list all of the variables that should be returned.

If the URI for a resource is already known then this can be used to directly identify the resource of interest. Its properties can then be matched and extracted. The following query returns the ISBN, title and BNB number for a specific book:


PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX blterms: <http://www.bl.uk/schemas/bibliographic/blterms#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?isbn ?title ?bnb WHERE {
  <http://bnb.data.bl.uk/id/resource/009910399> bibo:isbn10 ?isbn;
       blterms:bnb ?bnb;
       dct:title ?title.         
}

Whereas the former query identified resources indirectly, via the value of an attribute, this query directly references a resource using its URI. The query pattern then matches the properties that are of interest. Matching resources by URI is usually much faster than matching based on a literal property.

Itemising all of the properties of a resource can be tiresome. Using SPARQL it is possible to ask the SPARQL endpoint to generate a useful summary of a resource (called a Bounded Description). The endpoint will typically return all attributes and relationships of the resource. This can be achieved using a simple DESCRIBE query:


DESCRIBE <http://bnb.data.bl.uk/id/resource/009910399>

The query doesn’t need to define any prefixes or match any properties: the endpoint will simply return what it knows about a resource as RDF. If RDF/XML isn’t useful then the same results can be retrieved as JSON.

Returning to the previous approach of indirectly identifying resources, it’s possible to ask the endpoint to generate descriptions of all books with a given ISBN:


PREFIX bibo: <http://purl.org/ontology/bibo/>
DESCRIBE ?uri WHERE {
  ?uri bibo:isbn10 "0261102214".
}

Matching By Relationship

Resources can also be matched based on their relationships, by traversing across the graph of data. For example it’s possible to look up the author of a given book:


PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?author WHERE {
  #Match the book
  ?uri bibo:isbn10 "0261102214";
       #Match its author
       dct:creator ?author.
}

As there are four books with this ISBN the query results return the URI for Tolkien four times. Adding a DISTINCT will remove any duplicates:


PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT DISTINCT ?author WHERE {
  #Match the book
  ?uri bibo:isbn10 "0261102214";
       #Match its author
       dct:creator ?author.
}

Type Specific Patterns

The following sections provide some additional example queries that illustrate some useful queries for working with some specific types of resource in the BNB dataset. Each query is accompanied by links to the SPARQL endpoint that show the results.

For clarity the PREFIX declarations in each query have been omitted. It should be assumed that each query is preceded by the following prefix declarations:


PREFIX bio: <http://purl.org/vocab/bio/0.1/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX blterms: <http://www.bl.uk/schemas/bibliographic/blterms#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX isbd: <http://iflastandards.info/ns/isbd/elements/>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rda: <http://RDVocab.info/ElementsGr2/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

Not all of these are required for all of the queries, but they declare all of the prefixes that are likely to be useful when querying the BNB.

People

There are a number of interesting queries that can be used to interact with author data in the BNB.

List Books By An Author

The following query lists all published books written by C. S. Lewis, with the most recently published books returned first:


SELECT ?book ?isbn ?title ?year WHERE {
  #Match all books with Lewis as an author
  ?book dct:creator <http://bnb.data.bl.uk/id/person/LewisCS%28CliveStaples%291898-1963>;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        #match the publication event
        blterms:publication ?publication.

  #match the time of the publication event
  ?publication event:time ?time.
  #match the label of the year
  ?time rdfs:label ?year          
}
#order by descending year, after casting year as an integer
ORDER BY DESC( xsd:int(?year) )

Identifying Genre of an Author

Books in the BNB are associated with one or more subject categories. By looking up the list of categories associated with an author’s works it may be possible to get a sense of what types of books they have written. Here is a query that returns the list of categories associated with C. S. Lewis’s works:


SELECT DISTINCT ?category ?label WHERE {
  #Match all books with Lewis as an author
  ?book dct:creator <http://bnb.data.bl.uk/id/person/LewisCS%28CliveStaples%291898-1963>;
     dct:subject ?category.

  ?category rdfs:label ?label.     
}
ORDER BY ?label

Relationships Between Contributors

The following query extracts a list of all people who have contributed to one or more C. S. Lewis books:


SELECT DISTINCT ?author ?name WHERE {
  ?book dct:creator <http://bnb.data.bl.uk/id/person/LewisCS%28CliveStaples%291898-1963>;
     dct:contributor ?author.

  ?author foaf:name ?name.     

  FILTER (?author != <http://bnb.data.bl.uk/id/person/LewisCS%28CliveStaples%291898-1963>) 
}
ORDER BY ?name

Going one step further, it’s possible to identify people that serve as connections between different authors. For example this query finds people that have contributed to books by both C. S. Lewis and J. R. R. Tolkien:


SELECT DISTINCT ?author ?name WHERE {
  ?book dct:creator <http://bnb.data.bl.uk/id/person/LewisCS%28CliveStaples%291898-1963>;
     dct:contributor ?author.

  ?otherBook dct:creator <http://bnb.data.bl.uk/id/person/TolkienJRR%28JohnRonaldReuel%291892-1973>;
     dct:contributor ?author.

  ?author foaf:name ?name.     
}
ORDER BY ?name

Authors Born in a Year

The basic biographical information in the BNB can also be used in queries. For example many authors have a recorded year of birth and some a year of death. These are described as Birth or Death Events in the data. The following query illustrates how to find 50 authors born in 1944:


SELECT ?author ?name WHERE {
   ?event a bio:Birth;
      bio:date "1944"^^<http://www.w3.org/2001/XMLSchema#gYear>.

   ?author bio:event ?event;
      foaf:name ?name.
}
LIMIT 50

The years associated with Birth and Death events have an XML Schema datatype associated with them (xsd:gYear). It is important to specify this type in the query, otherwise the query will fail to match any data.

Books

There are a large number of published works in the BNB, so extracting useful subsets involves identifying some useful dimensions in the data that can be used to filter the results. In addition to finding books by an author there are some other useful facets that relate to books, including:

  • Year of Publication
  • Location of Publication
  • Publisher

The following sections include queries that extract data along these dimensions. In each case the key step is to match the Publication Event associated with the book.

Books Published in a Year

Publication Events have a “time” relationship that refers to a resource for the year of publication. The following query extracts 50 books published in 2010:


SELECT ?book ?isbn ?title WHERE {
  ?book dct:creator ?author;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        #match the publication event
        blterms:publication ?publication.

  #match the time of the publication event
  ?publication event:time ?time.
  #match the label of the year
  ?time rdfs:label "2010"      
}
LIMIT 50

Books Published in a Location

Finding books based on their place of publication is a variation of the above query. Rather than matching the time relationship, the query instead looks for the location associated with the publication event. This query finds 50 books published in Bath:


SELECT ?book ?isbn ?title WHERE {
  ?book dct:creator ?author;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        blterms:publication ?publication.

  ?publication event:place ?place.
  ?place rdfs:label "Bath"
}
LIMIT 50

Books From a Publisher

In addition to the time and place relationships, Publication Events are also related to a publisher via an “agent” relationship. The following query uses a combination of the time and agent relationships to find 50 books published by Allen & Unwin in 2011:


SELECT ?book ?isbn ?title WHERE {
  ?book dct:creator ?author;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        blterms:publication ?publication.

  ?publication event:agent ?agent;
       event:time ?time.

  ?agent rdfs:label "Allen & Unwin".
  ?time rdfs:label "2011".
}
LIMIT 50

These queries can easily be adapted to extend and combine the query patterns further, e.g. to limit results by a combination of place, time and publisher, or along different dimensions such as subject category.
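
For example, the following sketch combines all three dimensions by matching the place, agent and time of the publication event. The labels are simply reused from the earlier queries, so this particular combination illustrates the pattern rather than being a query guaranteed to return many results:

SELECT ?book ?isbn ?title WHERE {
  ?book dct:creator ?author;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        blterms:publication ?publication.

  #match all three facets of the publication event
  ?publication event:agent ?agent;
       event:place ?place;
       event:time ?time.

  ?agent rdfs:label "Allen & Unwin".
  ?place rdfs:label "Bath".
  ?time rdfs:label "2011".
}
LIMIT 50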

Series

The BNB includes nearly 20,000 book series. The following queries illustrate some useful ways to interact with that data.

Books in a Series

Finding the books associated with a specific series is relatively straight-forward. The following query is very similar to an earlier query to find books based on an author. However in this case the list of books to be returned is identified by matching those that have an “has part” relationship with a series. The query finds books that are part of the “Pleasure In Reading” series:


SELECT ?book ?isbn ?title ?year WHERE {
  <http://bnb.data.bl.uk/id/series/Pleasureinreading> dct:hasPart ?book.

  ?book dct:creator ?author;
        bibo:isbn10 ?isbn;
        dct:title ?title;
        blterms:publication ?publication.

  ?publication event:agent ?agent;
       event:time ?time.

  ?time rdfs:label ?year.
}

Categories for a Series

The BNB only includes minimal metadata about each series: just a name and a list of books. In order to get a little more insight into the type of book included in a series, the following query finds a list of the subject categories associated with a series:


SELECT DISTINCT ?label WHERE {
  <http://bnb.data.bl.uk/id/series/Pleasureinreading> dct:hasPart ?book.

  ?book dct:subject ?subject.

  ?subject rdfs:label ?label.
}

As with the previous query the “Pleasure in Reading” series is identified by its URI. As books in the series might share a category the query uses the DISTINCT keyword to filter the results.

Series Recommendation

A series could be considered as a reading list containing useful suggestions of books on particular topics. One way to find a reading list might be to find lists based on subject category, using a variation of the previous query.

Another approach would be to find lists that already contain works by a favourite author. For example the following query finds the URI and the label of all series that contain books by J. R. R. Tolkien:


SELECT DISTINCT ?series ?label WHERE {
  ?book dct:creator ?author.
  ?author foaf:name "J. R. R. Tolkien".

  ?series dct:hasPart ?book;
     rdfs:label ?label.

}

Categories

The rich subject categories in the BNB data provide a number of useful ways to slice and dice the data. For example it is often useful to just fetch a list of books based on their category. The following query finds a list of American Detective and Mystery books:


SELECT ?book ?title ?name WHERE {

   ?book dct:title ?title;
         dct:creator ?author;
         dct:subject <http://bnb.data.bl.uk/id/concept/lcsh/DetectiveandmysterystoriesAmericanFiction>.

  ?author foaf:name ?name.
}
ORDER BY ?name ?title

For common or broad categories these lists can become very large, so filtering them down further into more manageable chunks may be necessary.
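
As a sketch, one simple way to work through a large category in manageable chunks is to page through the results using LIMIT and OFFSET. This variation of the previous query fetches the second page of 50 results:

SELECT ?book ?title ?name WHERE {

   ?book dct:title ?title;
         dct:creator ?author;
         dct:subject <http://bnb.data.bl.uk/id/concept/lcsh/DetectiveandmysterystoriesAmericanFiction>.

   ?author foaf:name ?name.
}
ORDER BY ?name ?title
#skip the first 50 results and return the next 50
LIMIT 50
OFFSET 50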

Serials

Many of the periodicals and newspapers published in the UK have a local or regional focus. This geographical relationship is recorded in the BNB via a “spatial” relationship of the serial resource. This relationship supports finding publications that are relevant to a particular location in the United Kingdom.

The following query finds serials that focus on the City of Bath:


SELECT ?title ?issn WHERE {

   ?serial dct:title ?title;
           bibo:issn ?issn;
           dct:spatial ?place.

   ?place rdfs:label "Bath (England)".
}

The exact name of the location is used in the match. While it would be possible to filter the results based on a regular expression, this can be very slow. The following query shows how to extract a list of locations referenced from the Dublin Core spatial relationship. This list could be used to populate a search form or application navigation to enable efficient filtering by place name:


SELECT DISTINCT ?place ?label WHERE {

   ?serial dct:spatial ?place.
   ?place rdfs:label ?label.
}

Summary

This tutorial has provided an introduction to using SPARQL to extract data from the BNB dataset. When working with a SPARQL endpoint it is often useful to have example queries that can be customised to support particular use cases. The tutorial has included multiple examples and these are all available to download.

The tutorial has covered some useful general approaches for matching resources based on identifiers and relationships. Looking up URIs in a dataset is an important step in mapping from systems that contain non-URI identifiers, e.g. ISSNs or ISBNs. Once a URI has been discovered it can be used to directly access the BNB Linked Data or used as a parameter to drive further queries.

A number of example queries have also been included showing how to ask useful and interesting questions of the dataset. These queries relate to the main types of resources in the BNB and illustrate how to slice and dice the dataset along a number of different dimensions.

While the majority of the sample queries are simple SELECT queries, it is possible to create variants that use CONSTRUCT or DESCRIBE queries to extract data in other ways. Several good SPARQL tutorials have been referenced to provide further background reading for developers interested in digging into this further.


Loading the British National Bibliography into an RDF Database

This is the second in a series of posts (1, 2, 3, 4) providing background and tutorial material about the British National Bibliography. The tutorials were written as part of some freelance work I did for the British Library at the end of 2012. The material was used as input to creating the new documentation for their Linked Data platform but hasn’t been otherwise published. They are now published here with permission of the BL.

Note: while I’ve attempted to fix up these instructions to account for changes to the software and how the data is published, there may still be some errors. If there are then please leave a comment or drop me an email and I’ll endeavour to fix them.

The British National Bibliography (BNB) is a bibliographic database that contains data on a wide range of books and serial publications published in the UK and Ireland since the 1950s. The database is published under a public domain license and is available for access online or as a bulk download.

This tutorial provides developers with guidance on how to download the BNB data and load it into an RDF database, or “triple store” for local processing. The tutorial covers:

  • An overview of the different formats available
  • How to download the BNB data
  • Instructions for loading the data into two different open source triple stores

The instructions given in this tutorial are for users of Ubuntu. Where necessary pointers to instructions for other operating systems are provided. It is assumed that the reader is confident in downloading and installing software packages and working with the command-line.

Bulk Access to the BNB

While the BNB is available for online access as Linked Data and via a SPARQL endpoint there are a number of reasons why working with the dataset locally might be useful, e.g:

  • Analysis of the data might require custom indexing or processing
  • Using a local triple store might offer more performance or functionality
  • Re-publishing the dataset as part of aggregating data from a number of data providers
  • The full dataset provides additional data which is not included in the Linked Data.

To support these and other use cases the BNB is available for bulk download, allowing developers the flexibility to process the data in a variety of ways.

The BNB is actually available in two different packages. Both provide exports of the data in RDF but differ in both the file formats used and the structure of the data.

BNB Basic

The BNB Basic dataset is provided as an export in RDF/XML format. The individual files are available for download from the BL website.

This version provides the most basic export of the BNB data. Each record is mapped to a simple RDF/XML description that uses terms from several schemas including Dublin Core, SKOS, and Bibliographic Ontology.

As it provides a fairly raw version of the data, BNB Basic is likely to be most useful when the data is going to undergo further local conversion or analysis.

Linked Open BNB

The Linked Open BNB offers a much more structured view of the BNB data.

This version of the BNB has been modelled according to Linked Data principles:

  • Every resource, e.g. author, book, category, has been given a unique URI
  • Data has been modelled using a wider range of standard vocabularies, including the Bibliographic Ontology, Event Ontology and FOAF.
  • Where possible the data has been linked to other datasets, including LCSH and Geonames

It is this version of the data that is used to provide both the SPARQL endpoint and the Linked Data views, e.g. of The Hobbit.

This package provides the best option for mirroring or aggregating the BNB data because its contents match those of the online versions. The additional structure of the dataset may also make it easier to work with in some cases. For example, lists of unique authors or locations can easily be extracted from the data.

Downloading The Data

Both the BNB Basic and Linked Open BNB are available for download from the BL website.

Each dataset is split over multiple zipped files. The BNB Basic is published in RDF/XML format while the Linked Open BNB is published as ntriples. The individual data files can be downloaded from CKAN, although this can be time-consuming to do manually.

The rest of this tutorial will assume that the packages have been downloaded to ~/data/bl.

Unpacking the files is a simple matter of unzipping them:

cd ~/data/bl
unzip \*.zip
#Remove original zip files
rm *.zip

The rest of this tutorial provides guidance on how to load and index the BNB data in two different open source triple stores.

Using the BNB with Fuseki

Apache Jena is an Open Source project that provides access to a number of tools and Java libraries for working with RDF data. One component of the project is the Fuseki SPARQL server.

Fuseki provides support for indexing and querying RDF data using the SPARQL protocol and query language.

The Fuseki documentation provides a full guide for installing and administering a local Fuseki server. The following sections provide a short tutorial on using Fuseki to work with the BNB data.

Installation

Firstly, if Java is not already installed then download the correct version for your operating system.

Once Java has been installed, download the latest binary distribution of Fuseki. At the time of writing this is Jena Fuseki 1.1.0.

The steps to download and unzip Fuseki are as follows:

#Make directory
mkdir -p ~/tools
cd ~/tools

#Download latest version using wget (or manually download)
wget http://www.apache.org/dist/jena/binaries/jena-fuseki-1.1.0-distribution.zip

#Unzip
unzip jena-fuseki-1.1.0-distribution.zip

Change the download URL and local path as required. Then ensure that the fuseki-server script is executable:

cd jena-fuseki-1.1.0
chmod +x fuseki-server

To test whether Fuseki is installed correctly, run the following (on Windows systems use fuseki-server.bat):

./fuseki-server --mem /ds

This will start Fuseki with an empty, read-only in-memory database. Visiting http://localhost:3030/ in your browser should show the basic Fuseki server page. Use Ctrl-C to shut down the server once the installation test is completed.

Loading the BNB Data into Fuseki

While Fuseki provides an API for loading RDF data into a running instance, for bulk loading it is more efficient to index the data separately. The manually created indexes can then be deployed by a Fuseki instance.

Fuseki is bundled with the TDB triple store. The TDB data loader can be run as follows:

java -cp fuseki-server.jar tdb.tdbloader --loc /path/to/indexes file.nt

This command would create TDB indexes in the /path/to/indexes directory and load the file.nt into it.

To index all of the Linked Open BNB run the following command, adjusting paths as required:

java -Xms1024M -cp fuseki-server.jar tdb.tdbloader --loc ~/data/indexes/bluk-bnb ~/data/bl/BNB*

This will process each of the data files and may take several hours to complete depending on the hardware being used.

Once the loader has completed the final step is to generate a statistics file for the TDB optimiser. Without this file SPARQL queries will be very slow. The file should be generated into a temporary location and then copied into the index directory:

java -Xms1024M -cp fuseki-server.jar tdb.stats --loc ~/data/indexes/bluk-bnb >/tmp/stats.opt
mv /tmp/stats.opt ~/data/indexes/bluk-bnb

Running Fuseki

Once the data load has completed Fuseki can be started and instructed to use the indexes as follows:

./fuseki-server --loc ~/data/indexes/bluk-bnb /bluk-bnb

The --loc parameter instructs Fuseki to use the TDB indexes from a specific directory. The second parameter tells Fuseki where to mount the index in the web application. Using a mount point of /bluk-bnb the SPARQL endpoint for the dataset would then be found at:

http://localhost:3030/bluk-bnb/query

To select the dataset and work with it in the admin interface visit the Fuseki control panel:

http://localhost:3030/control-panel.tpl

Fuseki has a basic SPARQL interface for testing out SPARQL queries, e.g. the following will return 10 triples from the data:

SELECT ?s ?p ?o WHERE {
  ?s ?p ?o
}
LIMIT 10

For more information on using and administering the server read the Fuseki documentation.

Using the BNB with 4Store

Like Fuseki, 4Store is an Open Source project that provides a SPARQL-based server for managing RDF data. 4Store is written in C and has been proven to scale to very large datasets across multiple systems. It offers a similar level of SPARQL support to Fuseki, so is a good alternative for working with RDF in a production setting.

As the 4Store download page explains, the project has been packaged for a number of different operating systems.

Installation

As 4Store is available as an Ubuntu package installation is quite simple:

sudo apt-get install 4store

This will install a number of command-line tools for working with the 4Store server. 4Store works differently to Fuseki in that there are separate server processes for managing the data and serving the SPARQL interface.

The following command will create a 4Store database called bluk_bnb:

#ensure /var/lib/4store exists
sudo mkdir -p /var/lib/4store

sudo 4s-backend-setup bluk_bnb

By default 4Store puts all of its indexes in /var/lib/4store. In order to have more control over where the indexes are kept it is currently necessary to build 4store manually. The build configuration can be altered to instruct 4Store to use an alternate location.

Once a database has been created, start a 4Store backend to manage it:

sudo 4s-backend bluk_bnb

This process must be running before data can be imported into, or queried from, the database.

Once the database is running a SPARQL interface can then be started to provide access to its contents. The following command will start a SPARQL server on port 8000:

sudo 4s-httpd -p 8000 bluk_bnb

To check whether the server is running correctly visit:

http://localhost:8000/status/

It is not possible to run a bulk import into 4Store while the SPARQL process is running. So after confirming that 4Store is running successfully, kill the httpd process before continuing:

sudo pkill '^4s-httpd'

Loading the Data

4Store ships with a command-line tool for importing data called 4s-import. It can be used to perform bulk imports of data once the database process has been started.

To bulk import the Linked Open BNB, run the following command adjusting paths as necessary:

4s-import bluk_bnb --format ntriples ~/data/bl/BNB*

Once the import is complete, restart the SPARQL server:

sudo 4s-httpd -p 8000 bluk_bnb

Testing the Data Load

4Store offers a simple SPARQL form for submitting queries against a dataset. Assuming that the SPARQL server is running on port 8000 this can be found at:

http://localhost:8000/test/

Alternatively 4Store provides a command-line tool for submitting queries:

 4s-query bluk_bnb 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'

Summary

The BNB dataset is not just available for use as Linked Data or via a SPARQL endpoint. The underlying data can be downloaded for local analysis or indexing.

To support this type of usage the British Library have made available two versions of the BNB. A “basic” version that uses a simple record-oriented data model and the “Linked Open BNB” which offers a more structured dataset.

This tutorial has reviewed how to access both of these datasets and how to download and index the data using two different open source triple stores: Fuseki and 4Store.

The BNB data could also be processed in other ways, e.g. to load into a standard relational database or into a document store like CouchDB.

The basic version of the BNB offers a raw version of the data that supports this type of usage, while the richer Linked Data version supports a variety of aggregation and mirroring use cases.


An Introduction to the British National Bibliography

This is the first of a series of posts (1, 2, 3, 4) providing background and tutorial material about the British National Bibliography. The tutorials were written as part of some freelance work I did for the British Library at the end of 2012. The material was used as input to creating the new documentation for their Linked Data platform but hasn’t been otherwise published. They are now published here with permission of the BL.

This tutorial provides an introduction to the British National Bibliography (BNB) for developers interested in understanding how it could be used in their applications. The tutorial provides:

  • A short introduction to the BNB, including its scope and contents
  • A look at how the BNB can be accessed online
  • An introduction to the data model that underpins the BNB Linked Data

What is the British National Bibliography?

The British National Bibliography has been in development for over 60 years with the first record added in 1950. It contains data about virtually every book and journal title published or distributed in the UK and Ireland since that date. In its role as the national library of the United Kingdom the British Library is responsible for preserving a wide variety of works, and the majority of these are catalogued in the BNB. The exclusions largely relate to some official government publications that are catalogued elsewhere, or other locally published or ephemeral material. With an increasing number of works being published electronically, in 2003 the BNB was extended to include records for UK online electronic resources. The 2013 regulations extended the scope further to include non-print materials.

As well as being an historical archive the BNB also includes data about forthcoming publications, in some cases up to 16 weeks in advance of their actual publication dates. There are over 50,000 new titles published each year in the UK and Ireland, which gives an indication of how quickly the database is growing.

Traditionally the BNB has had a role in helping publishers share metadata with libraries, to reduce the costs of cataloguing works and to inform purchasing decisions. But with the publication of the BNB under an open license, it is now available for anyone to use in a variety of ways. For example the database could be used as:

  • A reliable source of book metadata, e.g. to drive reading list or personal cataloguing applications
  • Insight into the publication output of the UK over different years and genres
  • A means of accessing bibliographies of individual authors, ranging over 60 years

Accessing the BNB Open Data

There are several ways in which the BNB can be accessed. While the online search interface provides a handy way to explore the contents of the bibliography to understand its scope, for application development a machine-readable interface is required.

Originally the primary way to access the BNB was via Z39.50: a specialised protocol for searching remote databases that is used in many library systems. However the BNB is now available via several other routes that make it easier to use in other contexts. These include:

  • Bulk downloads of the database in RDF/XML format. This includes the full dataset and is intended to support local indexing and analysis
  • Online access as Linked Data, allowing individual records to be accessed in a variety of formats including XML and JSON. This is a subset that includes books and serials only
  • An API that allows the dataset to be queried using the SPARQL query language. This provides a number of ways of querying and extracting portions of the dataset

These different access methods support a range of use cases, e.g. allowing the BNB to be accessed in its entirety to support bulk processing and analysis, whilst also supporting online access from mobile or web applications.

Most importantly the BNB is available for use under an open license. The British Library have chosen to publish the data under a Creative Commons CC0 License which places the entire database into the public domain. Unlike some online book databases or APIs this means there are no restrictions whatsoever on how the BNB can be used.

The Linked Data version of the BNB is the most structured version of the BNB and is likely to be the best starting point for most applications. The same data and data model is used to power the SPARQL endpoint, so understanding its structure will help developers use the API. The Linked Data has also been cross-referenced with other data sources, offering additional sources of information. The rest of this tutorial therefore looks at this data model in more detail.

The BNB Linked Data

Linked Data is a technique for publishing data on the web. Every resource, e.g. every book, author, organisation or subject category, is assigned a URI which becomes its unique identifier. Accessing that URI in a web browser will result in the delivery of a web page that contains a useful summary of the specific resource including, for example, key attributes such as its name or title, and also relationships to other resources e.g. a link from a book to its author.

The same URI can also be accessed by an application. Instead of an HTML web page, the data can be retrieved in a variety of different machine-readable formats, including XML and JSON. With Linked Data a web site is also an API: information can be accessed by both humans and application code in whatever format is most useful.

A database published as Linked Data can be thought of as containing:

  • Resources, which are identified by a URI
  • Attributes of those resources, e.g. names, titles, labels, dates, etc
  • Relationships between resources, e.g. author, publisher, etc.

One special relationship identifies the type of a resource. Depending on the data model, a resource might have multiple types.

Standard Schemas

The data model, or schema, used to publish a dataset as Linked Data consists of a set of terms. Terms are either properties, e.g. attributes or relationships between resources, or types of things, e.g. Book, Person, etc.

Unique URIs are not only used to identify resources, they’re also used to identify terms in a schema. For example the unique identifier for the “title” attribute in the BNB is http://purl.org/dc/terms/title, whereas the unique identifier for the Person type is http://xmlns.com/foaf/0.1/Person.

By identifying terms with unique URIs it becomes possible to publish and share their definitions on the web. This supports re-use of schemas, allowing datasets from different organisations to be published using the same terms. This encourages convergence on standard ways to describe particular types of data, making it easier for consumers to use and integrate data from a number of different sources.

The BNB dataset makes use of a number of standard schemas. These are summarised in the following table along with their base URI and links to their individual documentation.

Name URI Role
Bibliographic Ontology http://purl.org/ontology/bibo/ A rich vocabulary for describing many types of publication and their related metadata
Bio http://purl.org/vocab/bio/0.1/ Contains terms for publishing biographical information
British Library Terms http://www.bl.uk/schemas/bibliographic/blterms A new schema published by the British Library which contains some terms not covered in the other vocabularies
Dublin Core http://purl.org/dc/terms/ Basic bibliographic metadata terms like title and creator
Event Ontology http://purl.org/NET/c4dm/event.owl# Properties for describing events and their participants
FOAF http://xmlns.com/foaf/0.1/ Contains terms for describing people, their names and relationships
ISBD http://iflastandards.info/ns/isbd/elements/ Terms from the International Standard Bibliographic Description standard
Org http://www.w3.org/ns/org# Contains terms for describing organisations
Web Ontology Language http://www.w3.org/2002/07/owl# A standard ontology for describing terms and equivalencies between resources
RDF Schema http://www.w3.org/2000/01/rdf-schema# The core RDF schema language which is used to publish new terms
Resource Description and Access http://rdvocab.info/ElementsGr2# Defines some standard library cataloguing terms
SKOS http://www.w3.org/2004/02/skos/core# Supports publication of subject classifications and taxonomies
WGS84 Geo Positioning http://www.w3.org/2003/01/geo/wgs84_pos# Geographic points, latitude and longitude

Returning to our earlier example we can now see that the title attribute in the BNB dataset (http://purl.org/dc/terms/title) is taken from the Dublin Core Schema (http://purl.org/dc/terms/) as they share a common base URI. Similarly the Person type is taken from the FOAF Schema.

It is common practice for datasets published as Linked Data to use terms from multiple schemas. But, while a dataset might mix together several schemas, it is very unlikely that it will use all of the terms from all of the schemas. More commonly only a few terms are taken from each schema.

So, while it is useful to know which schemas have been used in compiling a dataset, it is also important to understand how those schemas have been used to describe specific types of resource. This is covered in more detail in the next section.

The BNB Data Model

There are high-level overview diagrams that show the main types of resources and relationships in the BNB dataset. One diagram summarises the data model for books while another summarises the model for serials (e.g. periodicals and newspapers).

The following sections add some useful context to those diagrams, highlighting how the most important types of resources in the dataset are described. The descriptions include a list of the attributes and relationships that are commonly associated with each individual type.

The tables are not meant to be exhaustive documentation, but instead highlight the most common or more important properties. The goal is to help users understand the structure of the dataset and the relationships between resources. Further exploration is encouraged. With this in mind, links to example resources are included throughout.

It is also important to underline that not all resources will have all of the listed properties. The quality or availability of data might vary across different publications. Similarly a resource might have multiple instances of a given attribute or relationship, e.g. a book with multiple authors.

Finally, all resources will have an RDF type property (http://www.w3.org/1999/02/22-rdf-syntax-ns#type) and the values of this property are given in each section. As noted above, a resource may have multiple types.

Books

Unsurprisingly, books are one of the key types of resource in the BNB dataset. The important bibliographic metadata for a book is catalogued, including its title, language, number of pages, unique identifiers such as its ISBN and the British Bibliography Number, and references to its author, publisher, the event of its publication, and subject classifications.

Books are identified with an RDF type of http://purl.org/ontology/bibo/Book. The following table summarises the properties most commonly associated with those resources:

Property URI Notes
Abstract http://purl.org/dc/terms/abstract A short abstract or summary of the book
BNB http://www.bl.uk/schemas/bibliographic/blterms#bnb The British Bibliographic Number for the book
Creator http://purl.org/dc/terms/creator Reference to the author(s) of the book; a Person resource
Contributor http://purl.org/dc/terms/contributor Reference to other people involved in the creation of the work, e.g. an editor or illustrator.
Extent http://iflastandards.info/ns/isbd/elements/P1053 The “extent” of the book, i.e. the number of pages it contains
ISBN 10/13 http://purl.org/ontology/bibo/isbn10; http://purl.org/ontology/bibo/isbn13 10-digit and 13-digit ISBN numbers of the book (see http://en.wikipedia.org/wiki/International_Standard_Book_Number)
Language http://purl.org/dc/terms/language The language of the text
Publication Event http://www.bl.uk/schemas/bibliographic/blterms#publication Reference to an Event resource describing the year and location in which the book was published
Subject http://purl.org/dc/terms/subject Reference to concept resources that describe the subject category of the book
Table of Contents http://purl.org/dc/terms/tableOfContents Text from the table of contents page
Title http://purl.org/dc/terms/title The title of the book

In some cases there may be multiple instances of these properties. For example a book might have several creators or be associated with multiple subject categories.

The Hobbit makes a good example. There was an edition of the book published in 1993 by Harper Collins. The edition has an ISBN of 0261102214. If you visit the Linked Data page for the resource you can view a description of the book. To get access to the raw data, choose one of the alternate formats, e.g. JSON or XML.

People

The BNB database includes some basic biographical and bibliographic information about people, e.g. authors, illustrators, etc. These resources all have an RDF type of http://xmlns.com/foaf/0.1/Person. The description of a person will typically include both family and given names and, if available, reference to birth and death events.

A person will also be associated with one or more books or other works in the database. A person is either the creator or a contributor to a work. The creator relationship is used to identify a significant contribution, e.g. the author of a book. The contributor relationship covers other forms of contribution, e.g. editor, illustrator, etc.

The following table identifies the individual properties used to describe people:

Property URI Notes
Created http://www.bl.uk/schemas/bibliographic/blterms#hasCreated Reference to a work which the person created
Contributed To http://www.bl.uk/schemas/bibliographic/blterms#hasContributedTo Reference to a work to which the Person has contributed
Event http://purl.org/vocab/bio/0.1/event Reference to an Event resource involving the author. Usually a birth event and/or a death event
Family Name http://xmlns.com/foaf/0.1/familyName The surname of the author
Given Name http://xmlns.com/foaf/0.1/givenName The first name of the author
Name http://xmlns.com/foaf/0.1/name The full name of the author

C. S. Lewis was a prolific author. Visiting the description of Lewis in the BNB database provides a bibliography for Lewis, listing the various works that he authored or those to which he contributed. There are also references to his birth and death events.

Pauline Baynes was an illustrator who worked with a number of authors, including both Lewis and Tolkien. Baynes’s description in the BNB includes a list of all her contributions (many more than are mentioned on her Wikipedia page). Baynes provides one of many connections between Lewis and Tolkien in the BNB database, via the contributor relationships with their works.
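
As an illustration of how these relationships can be queried, here is a sketch that lists the works Baynes contributed to, matching her by name with a regular expression (the exact form of the foaf:name values in the data is an assumption, hence the loose match):

# Sketch: list works that a person has contributed to
PREFIX blterms: <http://www.bl.uk/schemas/bibliographic/blterms#>
PREFIX dct:     <http://purl.org/dc/terms/>
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>

SELECT ?work ?title WHERE {
  ?person foaf:name ?name ;
          blterms:hasContributedTo ?work .
  ?work dct:title ?title .
  FILTER(REGEX(?name, "Baynes"))
}
LIMIT 50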

Events, Places and Organizations

An event is something that happens at a particular point in time, involves one or more participants and usually occurs in a specific location. Book publications, births and deaths are all modelled as events in the BNB data. Each type of event has its own RDF type.

The publication of an edition of the Hobbit is an example of a Publication Event, whilst Tolkien’s birth and death illustrate the basic biographical detail associated with those events.

The following table summarises the key attributes of an event:

| Property | URI | Notes |
| --- | --- | --- |
| Agent | http://purl.org/NET/c4dm/event.owl#agent | Used to refer to an Organization involved in a Publication Event |
| Date | http://purl.org/vocab/bio/0.1/date | The year in which a birth or death took place, captured as a plain text value |
| Place | http://purl.org/NET/c4dm/event.owl#place | Reference to a Place resource which describes the location in which a Publication Event took place. This will either be a resource in the BNB or the GeoNames dataset (see “Links to Other Datasets”) |
| Time | http://purl.org/NET/c4dm/event.owl#time | The year in which a Publication Event took place. The value is a reference to a resource that describes the year; the URIs are taken from an official UK government dataset |

As noted above, Publication Events often refer to two other types of resource in the dataset: Places and Organizations (the publishers involved in the event).

While these resources provide minimal extra information in the BNB dataset, they have been created to allow other organisations to link their data to the BNB.
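
The following sketch pulls together the publication details for the 1993 edition of The Hobbit by following the publication event and its agent, place and time properties (the OPTIONAL clauses simply guard against any of them being absent):

# Sketch: fetch the publication event details for a book
PREFIX bibo:    <http://purl.org/ontology/bibo/>
PREFIX blterms: <http://www.bl.uk/schemas/bibliographic/blterms#>
PREFIX event:   <http://purl.org/NET/c4dm/event.owl#>

SELECT ?publication ?agent ?place ?time WHERE {
  ?book bibo:isbn10 "0261102214" ;
        blterms:publication ?publication .
  OPTIONAL { ?publication event:agent ?agent }
  OPTIONAL { ?publication event:place ?place }
  OPTIONAL { ?publication event:time  ?time }
}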

Concepts

The different types of bibliographic resource in the BNB can all be associated with subject categories that help to further describe them, e.g. to indicate their subject matter, theme or genre. The Dublin Core subject property (http://purl.org/dc/terms/subject) is used to associate a resource with one or more categories.

Individual subject categories are organised into a “scheme”. A scheme is a set of formally defined categories that has been published by a particular authority. The BNB data uses schemes from the Library of Congress and the Dewey Decimal Classification. These schemes provide a standard way to catalogue works that are in use in many different systems.

The BNB data uses several different RDF types to identify different types of category, e.g. to differentiate between categories used as topics, place labels, etc. The BNB data model diagram illustrates some of the variety of subject resources that can be found in the dataset.

However, while the categories may come from different sources or be of different types, the data about each is essentially the same:

| Property | URI | Notes |
| --- | --- | --- |
| Label | http://www.w3.org/2000/01/rdf-schema#label | A label for the category |
| Notation | http://www.w3.org/2004/02/skos/core#notation | A formal identifier for the category, e.g. the Dewey Decimal Number |
| Scheme | http://www.w3.org/2004/02/skos/core#inScheme | A reference to the scheme that the concept is associated with. This may be a resource defined in another dataset |

Examples of different types of category include:

  • Fiction in English and Children’s Stories. These are both taken from the Library of Congress Subject Headings (LCSH).
  • Dewey Decimal Classifications, e.g. 823.91 which is “English fiction–20th century” in the Dewey system.
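
Subject categories can be used directly in queries. The sketch below finds books categorised with the Dewey notation 823.91; it assumes the skos:notation values are plain literals, so if the endpoint stores them with a datatype the match will need adjusting:

# Sketch: find books in a given Dewey Decimal category
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?book ?title WHERE {
  ?concept skos:notation "823.91" .
  ?book a bibo:Book ;
        dct:subject ?concept ;
        dct:title ?title .
}
LIMIT 50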

Series

Some books in the BNB are not just organised into categories: they are also organised into collections or “series”. Series are collections of books that are usually based around a specific theme or topic. The BNB contains over 190,000 different series covering a wide range of different topics. Series are identified with an RDF type of http://purl.org/ontology/bibo/Series.

Examples include “Men and Machines”, “Science series” and “Pleasure In Reading”. Series provide ready made reading lists on a particular topic that could be used to drive recommendations of books to readers.

Series are essentially just lists of resources and so are described with just a few properties:

| Property | URI | Notes |
| --- | --- | --- |
| Label | http://www.w3.org/2000/01/rdf-schema#label | The name of the series |
| Part | http://purl.org/dc/terms/hasPart | A reference to a bibliographic resource included in the collection. A series will have many instances of this property, one for each work in the series |
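
Because a series is just a list, enumerating its contents is straightforward. The sketch below matches a series by label and returns the titles of its parts (the label is matched loosely, as the exact capitalisation in the data is an assumption):

# Sketch: list the books collected in a series
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?series ?title WHERE {
  ?series a bibo:Series ;
          rdfs:label ?seriesLabel ;
          dct:hasPart ?book .
  ?book dct:title ?title .
  FILTER(REGEX(?seriesLabel, "men and machines", "i"))
}
LIMIT 50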

Periodicals and Newspapers

The coverage of the BNB goes beyond just books: it also has data about a number of serial publications, such as periodicals and newspapers.

Descriptions of these resources share many similarities with books, e.g. title, BNB number, subject classifications, etc. However, there are several additional properties that are specific to serial publications, including notes on where and when they were first published, their regional focus, etc.

In total there are approximately 10,000 different periodicals in the BNB data. The periodicals may be related to one another, for example one publication can replace another.

The following table lists some of the key properties of periodicals and newspapers. The serial data model diagram further illustrates some of the key relationships.

| Property | URI | Notes |
| --- | --- | --- |
| Alternate Title | http://purl.org/dc/terms/alternative | Alternative title(s) for the publication |
| BNB | http://www.bl.uk/schemas/bibliographic/blterms#bnb | The British Bibliographic Number for the serial |
| Contributor | http://purl.org/dc/terms/contributor | A relationship to a Person or Organization that contributed to the publication |
| Frequency | http://iflastandards.info/ns/isbd/elements/P1065 | A note on the publication frequency, e.g. “Weekly” |
| ISSN | http://purl.org/ontology/bibo/issn | The official [ISSN](http://en.wikipedia.org/wiki/International_Standard_Serial_Number) number for the serial |
| Language | http://purl.org/dc/terms/language | The language of the text |
| Publication Event | http://www.bl.uk/schemas/bibliographic/blterms#publicationStart | Reference to an Event resource describing the year and location in which the serial was first published |
| Replaced By | http://purl.org/dc/terms/isReplacedBy | Reference to a periodical that replaces or supersedes this one |
| Replaces | http://purl.org/dc/terms/replaces | Reference to another periodical that this resource replaces |
| Resource Specific Note | http://iflastandards.info/ns/isbd/elements/P1038 | Typically contains notes on the start (and end) publication dates of the periodical |
| Spatial | http://purl.org/dc/terms/spatial | Reference to Place resources that indicate the geographical focus of a periodical |
| Subject | http://purl.org/dc/terms/subject | Reference to Concept resources that describe the subject category of the serial |
| Title | http://purl.org/dc/terms/title | The title of the serial |

Examples of periodicals include the Whitley Bay News Guardian, the Bath Chronicle and iSight.

The publication Coaching News was replaced by Cycle Coaching, providing an example of direct relationships between publications. As noted in the serial data model diagram, other relationships from the Dublin Core vocabulary are used to capture alternate formats and versions of publications.
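
Those replacement relationships are easy to surface in a query. The following sketch lists pairs of serials where one has been replaced by another:

# Sketch: find serials and the publications that replaced them
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?serial ?title ?replacement ?replacementTitle WHERE {
  ?serial dct:isReplacedBy ?replacement ;
          dct:title ?title .
  ?replacement dct:title ?replacementTitle .
}
LIMIT 50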

Links to Other Datasets

The final aspect of the BNB dataset to highlight is its relationships with other datasets. When publishing Linked Data it is best practice to include links or references to other datasets. There are two main forms that this cross-referencing can take:

  • Declaring equivalence links to indicate that a resource in the current dataset is the same as another resource (with a different identifier) in a different dataset. Publishing these equivalences helps to integrate datasets across the web. This is achieved by using the OWL “Same As” property (http://www.w3.org/2002/07/owl#sameAs) to relate the two resources.
  • In other cases resources are simply directly referenced as the value of a relationship. For example references to geographical places or subject categories may be made directly to resources in third-party datasets. This avoids creating new descriptions of the same resource.

In both cases the links between the datasets can be followed, by a user or an application, in order to discover additional useful data. For example BNB includes references to places in the GeoNames dataset. Following those links can help an application discover the latitude and longitude of the location.
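
The sketch below illustrates both linking styles: equivalence links expressed with owl:sameAs, and direct references from publication events to GeoNames places. The sws.geonames.org URI pattern is an assumption, so check a sample place resource to confirm it:

# Sketch: find cross-references to other datasets
PREFIX owl:   <http://www.w3.org/2002/07/owl#>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>

SELECT DISTINCT ?resource ?external WHERE {
  { ?resource owl:sameAs ?external }
  UNION
  { ?resource event:place ?external .
    FILTER(REGEX(STR(?external), "^http://sws\\.geonames\\.org/")) }
}
LIMIT 50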

The BNB uses both of these forms of cross-referencing to create links to a number of different datasets, including the GeoNames dataset and the Library of Congress Subject Headings mentioned earlier.

All of these datasets contain additional useful contextual data that might be useful for application developers to explore.

Summary

This tutorial has aimed to provide an introduction to the scope, contents and structure of the British National Bibliography.

After some introductory material that briefly covered the history of the dataset, the various means of accessing the data were summarised. The dataset contains over 60 years’ worth of data which has been placed into the public domain, making it freely available for developers to use as they see fit. The data can be downloaded for bulk processing or used online as Linked Data or via a SPARQL endpoint.

The main focus of the tutorial has been on providing an overview of the key types of resource in the dataset, complete with a summary of their key attributes and relationships. Example resources that highlight important features have been given throughout, to help provide useful pointers for further exploration.

The British National Bibliography is an important national resource that has the potential to be used as a key piece of infrastructure in a range of different types of application.


Building the new Ordnance Survey Linked Data platform

Disclaimer: the following is my own perspective on the build & design of the Ordnance Survey Linked Data platform. I don’t presume to speak for the OS and don’t have any inside knowledge of their long term plans.

Having said that I wanted to share some of the goals we (Julian Higman, Benjamin Nowack and myself) had when approaching the design of the platform. I will say that we had the full support and encouragement of the Ordnance Survey throughout the project, especially John Goodwin and others in the product management team.

Background & Goals

The original Ordnance Survey Linked Data site launched in April 2010. At the time it was a leading example of adoption of Linked Data by a public sector organisation. But time moves on and both the site and the data were due for a refresh. With Talis’ withdrawal from the data hosting business, the OS decided to bring the data hosting in-house and contracted Julian, Benjamin and myself to carry out the work.

While the migration from Talis was a key driver, the overall goal was to deliver a new Linked Data platform that would make a great showcase for the Ordnance Survey Linked Data. The beta of the new site was launched in April and went properly live at the beginning of June.

We had a number of high-level goals that we set out to achieve in the project:

  • Provide value for everyone, not just developers — the original site was very developer-centric, offering a very limited user experience with no easy way to browse the data. We wanted everyone to begin sharing links to the Ordnance Survey pages and that meant that the site needed a clean, user-friendly design. This meant we approached it from the point of view of building an application, not just a data portal
  • Deliver more than Linked Data — we wanted to offer a set of APIs that made the data accessible and useful for people who weren’t familiar with Linked Data or SPARQL. This meant offering some simpler tools to enable people to search and link to the data
  • Deliver a good developer user experience — this meant integrating API explorers, plenty of examples, and clear documentation. We wanted to shorten the “time to first JSON” to get developers into the data as fast as possible
  • Showcase the OS services and products — the OS offer a number of other web services and location products. The data should provide a way to show that value. Integrating mapping tools was the obvious first step
  • Support latest standards and best practices — where possible we wanted to make sure that the site offered standard APIs and formats, and conformed to the latest best practices around open data publishing
  • Support multiple datasets — the platform has been designed to support multiple datasets, allowing users to use just the data they need or the whole combined dataset. This provides more options for both publishing and consuming the data
  • Build a solid platform to support further innovation — we wanted to leave the OS with an extensible, scalable platform to allow them to further experiment with Linked Data

Best Practices & Standards

From a technical perspective we needed to refresh not just the data but the APIs used to access it. This meant replacing the SPARQL 1.0 endpoint and custom search interface offered in the original with more standard APIs.

We also wanted to make the data and APIs discoverable and adopted a “completionist” approach to try and tick all the boxes for publishing and exposing dataset metadata, including basic versioning and licensing information.

As a result we ended up with:

  • SPARQL 1.1 query endpoints for every dataset, which expose a basic SPARQL 1.1 Service Description as well as the newer CSV and TSV response formats
  • Well populated VoID descriptions for each dataset, including all of the key metadata items including publication dates, licensing, coverage, and some initial dataset statistics
  • Autodiscovery support for datasets, APIs, and for underlying data about individual Linked Data resources
  • OpenSearch 1.1 compliant search APIs that support keyword and geo search over the data. The Atom and RSS response formats include the relevance and geo extensions
  • Licensing metadata is clearly labelled not just on the datasets, but as a Link HTTP header in every Linked Data or API result, so you can probe resources to learn more
  • Basic support for the OpenRefine Reconciliation API as a means to offer a simple linking API that can be used in a variety of applications but also, importantly, with people curating and publishing small datasets using OpenRefine
  • Support for CORS, allowing cross-browser requests to be made to the Linked Data and all of the APIs
  • Caching support through the use of ETags and Last-Modified headers. If you’re using the APIs then you can optimise your requests and cache data by making Conditional GET requests
  • Linked Data pages that offer more than just a data dump, the integrated mapping and links to other products and services makes the data more engaging.
  • Custom ontology pages that allow you to explore terms and classes within individual ontologies, e.g. the definition of “London Borough”
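
As a small illustration of the kind of metadata exposed in those VoID descriptions, the sketch below reads basic publication details and size statistics for a dataset. It is a generic VoID query rather than anything specific to the Ordnance Survey platform, and the exact properties present in any given description may differ:

# Sketch: read key VoID metadata from a dataset description
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dct:  <http://purl.org/dc/terms/>

SELECT ?dataset ?title ?license ?modified ?triples WHERE {
  ?dataset a void:Dataset ;
           dct:title ?title .
  OPTIONAL { ?dataset dct:license  ?license }
  OPTIONAL { ?dataset dct:modified ?modified }
  OPTIONAL { ?dataset void:triples ?triples }
}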

Clearly there’s more that could potentially be done. Tools can always be improved, but the best way for that to happen is through user feedback. I’d love to know what you think of the platform.

Overall I think we’ve achieved our goal of making a site that, while clearly developer oriented, offers a good user experience for non-developers. I’ll be interested to see what people do with the data over the coming months.

How Do We Attribute Data?

This post is another in my ongoing series of “basic questions about open data”, which includes “What is a Dataset?” and “What does a dataset contain?“. In this post I want to focus on dataset attribution and in particular questions such as:

  • Why should we attribute data?
  • How are data publishers asking to be attributed?
  • What are some of the issues with attribution?
  • Can we identify some common conventions around attribution?
  • Can we monitor or track attribution?

I started to think about this because I’ve encountered a number of data publishers recently that have published Open Data but are now struggling to highlight how and where that data has been used or consumed. If data is published for anonymous download, or accessible through an open API then a data publisher only has usage logs to draw on.

I had thought that attribution might help here: if we can find links back to sources, then perhaps we can help data publishers mine the web for links and help them build evidence of usage. But it quickly became clear, as we’ll see in a moment, that there really aren’t any conventions around attribution, making it difficult to achieve this.

So let’s explore the topic from first principles and tick off my questions individually.

Why Attribute?

The obvious answer here is simply that if we are building on the work of others, then it’s only fair that those efforts should be acknowledged. This helps the creator of the data (or work, or code) be recognised for their creativity and effort, which is the very least we can do if we’re not exchanging hard cash.

There are also legal reasons why the source of some data might need to be acknowledged. Some licenses require attribution, and copyright may need to be acknowledged. As a consumer I might also want to (or need to) clearly indicate that I am not the originator of some data in case it is found to be false or misleading.

Acknowledging my sources may also help guarantee that the data I’m using continues to be available: a data publisher might be collecting evidence of successful re-use in order to justify ongoing budget for data collection, curation and publishing. This is especially true when the data publisher is not directly benefiting from the data supply; and I think it’s almost always true for public sector data. If I’m reusing some data I should make it as clear as possible that I’m doing so.

There’s some additional useful background on attribution from a public sector perspective in a document called “Supporting attribution, protecting reputation, and preserving integrity“.

It might also be useful to distinguish between:

  • Attribution — highlighting the creator/publisher of some data to acknowledge their efforts, conferring reputation
  • Citation — providing a link or reference to the data itself, in order to communicate provenance or drive discovery

While these two cases clearly overlap, the intention is often slightly different. As a user of an application, or the reader of an academic paper, I might want a clear citation to the underlying dataset so I can re-use it myself, or do some fact checking. The important use case there is tracking facts and figures back to their sources. Attribution is more about crediting the effort involved in collecting that information.

It may be possible to achieve both goals with a simple link, but I think recognising the different use cases is important.

How are data publishers asking to be attributed?

So how are data publishers asking for attribution? What follows isn’t an exhaustive survey but should hopefully illustrate some of the variety.

Let’s look first at some of the suggested wordings in some common Open Data licenses, then poke around in some terms and conditions to see how these are being applied in practice.

Attribution Statements in Common Open Data Licenses

The Open Data Commons Attribution license includes some recommended text (Section 4.3a – Example Notice):

Contains information from DATABASE NAME which is made available under the ODC Attribution License.

Where DATABASE NAME is the name of the dataset and is linked to the dataset homepage. Notice there is no mention of the originator, just the database. The license notes that in plain text the links should be included as text. The Open Data Commons Database License has the same text (again, section 4.3a).

The UK Open Government License notes that re-users should:

…acknowledge the source of the Information by including any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence

Where no attribution is provided, or multiple sources must be attributed, then the suggested default text, which should include a link to the license, is:

Contains public sector information licensed under the Open Government Licence v1.0.

So again, there is no reference to the publisher, but also no reference to the dataset either. The National Archives have some guidance on attribution which includes some other variations. These variants do suggest including more detail, such as the name of the department, date of publication, etc. These look more like typical bibliographic citations.

As another data point we can look at the Ordnance Survey Open Data License. This is a variation of the Open Government License but carries some additional requirements, specifically around attribution. The basic attribution statement is:

Contains Ordnance Survey data © Crown copyright and database right [year]

However the Code Point Open dataset has some additional attribution requirements, which also acknowledge copyright of the Royal Mail and National Statistics. All of these statements acknowledge the originators and there’s no requirement to cite the dataset itself.

Interestingly, while the previous licenses state that re-publication of data should be under a compatible license, only the OS Open Data license explicitly notes that the attribution statements must also be preserved. So both the license and attribution have viral qualities.

Attribution Statements in Terms and Conditions

Now lets look at some specific Open Data services to see what attribution provisions they include.

Freebase is an interesting example. It draws on multiple datasets which are supplemented by contributions of its user community. Some of that data is under different licenses. As you can see from their attribution page, there are variants in attribution statements depending on whether the data is about one or several resources and whether it includes Wikipedia content, which must be specially acknowledged.

They provide a handy HTML snippet for you to include in your webpage to make sure you get the attribution exactly right. Ironically at the time of writing this service is broken (“User Rate Limit Exceeded”). If you want a slightly different attribution, then you’re asked to contact them.

Now, while Freebase might not meet everyone’s definition of Open Data, it’s an interesting data point, particularly as they ask for deep links back to the dataset, as well as having a clear expectation of where and how the attribution will be surfaced.

OpenCorporates is another illustrative example. Their legal/license info page explains that their dataset is licensed under the Open Data Commons Database License and states that:

Use of any data must be accompanied by a hyperlink reading “from OpenCorporates” and linking to either the OpenCorporates homepage or the page referring to the information in question

There are also clear expectations around the visibility of that attribution:

The attribution must be no smaller than 70% of the size of the largest bit of information used, or 7px, whichever is larger. If you are making the information available via your own API you need to make sure your users comply with all these conditions.

So there is a clear expectation that the attribution should be displayed alongside any data. Like the OS license these attribution requirements are also viral as they must be passed on by aggregators.

My intention isn’t to criticise either OpenCorporates or Freebase, but merely to highlight some real world examples.

What are some of the issues with data attribution?

Clearly we could undertake a much more thorough review than I have done here. But this is sufficient to highlight what I think are some of the key issues. Put yourself in the position of a developer consuming some Open Data under any or all of these conditions. How do you responsibly provide attribution?

The questions that occur to me, at least are:

  • Do I need to put attribution on every page of my application, or can I simply add it to a colophon? (Aside: lanyrd has a great colophon page). In some cases it seems like I might have some freedom of choice, in others I don’t
  • If I do have to put a link or some text on a page, then do I have any flexibility around its size, positioning, visibility, etc? Again, in some cases I may do, but in others I have some clear guidance to follow. This might be challenging if I’m creating a mobile application with limited screen space. Or creating a voice or SMS application.
  • What if I just re-use the data as part of some back-end analysis, but none of that data is actually surfaced to the user? How do I attribute in this scenario?
  • Do I need to acknowledge the publisher, or provide a link to the source page(s)?
  • What if I need to address multiple requirements, e.g. if I mashed up data from data.gov.uk, the Ordnance Survey, Freebase and OpenCorporates? That might get awkward.

There are no clear answers to these questions. For individual datasets I might be able to get guidance, but it requires me to read the detailed terms and conditions for the dataset or API I’m using. Isn’t the whole purpose of having off-the-shelf licenses like the OGL or ODbL to help us streamline data sharing? Attribution, or rather unclear or overly detailed attribution requirements, is a clear source of friction, especially if there are legal consequences for getting it wrong.

And that’s just when we’re considering integrating data sources by hand. What about if we want to automatically combine data? How is a machine going to understand these conditions? I suspect that every Linked Data browser and application fails to comply with the attribution requirements of the data it’s consuming.

Of course these issues have been explored already. The Science Commons Protocol encourages publishing data into the public domain — so no legal requirement for attribution at all. It also acknowledges the “Attribution Stacking” problem (section 5.3) which occurs when trying to attribute large numbers of datasets, each with their own requirements. Too much friction discourages use, whether it’s research or commercial.

Unfortunately the recently published Amsterdam Manifesto on data citation seems to overlook these issues, requiring all authors/contributors to be attributed.

The scientific community may be more comfortable with a public domain licensing approach and a best-effort attribution model because it is supported by strong social norms: citation and attribution are essential to scientific discourse. We don’t have anything like that in the broader open data community. Maybe it’s not achievable, but it seems like clear guidance would be very useful.

There’s some useful background on problems with attribution and marking requirements on the Creative Commons wiki that also references some possible amendments and clarifications.

Can we converge on some common conventions?

So would it be possible to converge on a simple set of conventions or norms around data re-use? Ideally to the extent that attribution can be simplified and, as far as possible, automated.

How about the following:

  • Publishers should clearly describe their attribution requirements. Ideally this should be a short simple statement (similar to the Open Government License) which includes their name and a link to their homepage. This attribution could be included anywhere on the web site or application that consumes the data.
  • Publishers should be aware that their data will be consumed in a variety of applications and on a variety of platforms. This means allowing a good deal of flexibility around how/where attribution is displayed.
  • Publishers should clearly indicate whether attribution must be passed on to down-stream users
  • Publishers should separately document their citation requirements. If they want to encourage users to link to the dataset, or an individual page on their site, to allow users to find the original context, then they should publish instructions on how to do it. However this kind of linking is for citation, so consumers shouldn’t be bound to include it
  • Consumers should comply with publishers’ wishes and include an about page on their site or within their application that attributes the originators of the data they use. Where feasible they should also provide citations to specific resources or datasets from within their applications. This provides their users with clear citations to sources of data
  • Both sides should collaborate on structured markup to support publication of these attribution and citation requirements, as well as harvesting of links

Whether attribution should be legally enforced is another discussion. Personally I’d be keen to see a common set of conventions regardless of the legal basis for doing it. Attribution should be a social norm that we encourage, strongly, in order to acknowledge the sources of our Open Data.

What Does Your Dataset Contain?

Having explored some ways that we might find related data and services, as well as different definitions of “dataset”, I wanted to look at the topic of dataset description and analysis. Specifically, how can we answer the following questions:

  • what kinds of information does this dataset contain?
  • what types of entity are described in this dataset?
  • how can I determine if this dataset will fulfil my requirements?

There’s been plenty of work done around trying to capture dataset metadata, e.g. VoiD and DCAT; there’s also the upcoming workshop on Open Data on the Web. Much of that work has focused on capturing the core metadata about a dataset, e.g. who published it, when was it last updated, where can I find the data files, etc. But there’s still plenty of work to be done here, to encourage broader adoption of best practices, and also to explore ways to expose more information about the internals of a dataset.

This is a topic I’ve touched on before, and which we experimented with in Kasabi. I wanted to move “beyond the triple count” and provide a “report card” that gave a little more insight into a dataset. A report card could usefully complement an ODI Open Data Certificate, for example. Understanding the composition of a dataset can also help support new ways of manipulating and combining datasets.

In this post I want to propose a conceptual framework for capturing metadata about datasets. It’s intended as a discussion point, so I’m interested in getting feedback. (I would have submitted this to the ODW workshop but ran out of time before the deadline).

At the top level I think there are five broad categories of dataset information: Descriptive Data; Access Information; Indicators; Compositional Data; and Relationships. Compositional data can be broken down into smaller categories — this is what I described as an “information spectrum” in the Beyond the Triple Count post.

While I’ve thought about this largely from the perspective of Linked Data, I think it’s applicable to any format/technology.

Descriptive Data

This kind of information helps us understand a dataset as a “work”: its name, a human-readable description or summary, its license, and pointers to other relevant documentation such as quality control or feedback processes. This information is typically created and maintained directly by the data publisher, whereas the other categories of data I describe here can potentially be derived automatically by data analysis.

Examples:

  • Title
  • Description
  • License
  • Publisher
  • Subject Categories

Access Information

Basically, where do I get the data?

  • Where do I download the latest data?
  • Where can I download archived or previous versions of the data?
  • Are there mirrors for the dataset?
  • Are there APIs that use this data?
  • How do I obtain access to the data or API?

Indicators

This is statistical information that can help provide some insight into the data set, for example its size. But indicators can also build confidence in re-users by highlighting useful statistics such as the timeliness of releases, speed of responding to data fixes, etc.

While a data publisher might publish some of these indicators as targets that they are aiming to achieve, many of these figures could be derived automatically from an underlying publishing platform or service.

Examples of indicators:

  • Size
  • Rate of Growth
  • Date of Last Update
  • Frequency of Updates
  • Number of Re-users (e.g. size of user community, or number of apps that use it)
  • Number of Contributors
  • Frequency of Use
  • Turn-around time for data fixes
  • Number of known errors
  • Availability (for API based access)
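
As a sketch of how the simplest of these indicators might be derived automatically, the following query computes a triple count and the number of distinct subjects directly from a SPARQL 1.1 endpoint (on a large dataset a query like this may well time out, so a publishing platform would normally compute it offline):

# Sketch: derive basic size indicators from an endpoint
SELECT (COUNT(*) AS ?triples) (COUNT(DISTINCT ?s) AS ?subjects)
WHERE {
  ?s ?p ?o .
}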

Relationships

Relationship data primarily drives discovery use cases: to which other datasets does this dataset relate? For example the dataset might re-use identifiers or directly link to resources in other datasets. Knowing the source of that information can help us build trust in the reliability of the combined data, as well as give us sign-posts to other useful context. This is where Linked Data excels.

Annotation Datasets provide context to, and enrich other reference datasets. Annotations might be limited to linking information (“Link Sets”) or they may add new facts/properties about existing resources. Independently sourced quality control information could be published as annotations.

Provenance is also a form of relationship information. Derived datasets, e.g. created through analysis or data conversions, should refer to their original input datasets, and ideally also the algorithms and/or code that were applied.

Again, much of this information can be derived from data analysis. Recommendations for relevant related datasets might be created based on existing links between datasets or by analysing usage patterns. Set algebra on URIs in datasets can be used to do analysis on their overlap, to discover linkages and to determine whether one dataset contains annotations of another.

Examples:

  • List of dataset(s) that this dataset draws on (e.g. re-uses identifiers, controlled vocabulary, etc)
  • List of datasets that this dataset references, e.g. via links
  • List of source datasets used to compile or create this dataset
  • List of datasets that link to this dataset (“back links”)
  • Which datasets are often used in conjunction with this dataset?
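
As a sketch of the set algebra idea mentioned above, the following query counts the subjects shared by two datasets, assuming each dataset is stored in its own named graph (the graph URIs here are placeholders):

# Sketch: measure the overlap between two datasets held as named graphs
SELECT (COUNT(DISTINCT ?s) AS ?sharedSubjects) WHERE {
  GRAPH <http://example.org/dataset/a> { ?s ?p1 ?o1 }
  GRAPH <http://example.org/dataset/b> { ?s ?p2 ?o2 }
}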

Compositional Data

This is information about the internals of a dataset: e.g. what kind of data does it contain, how is that data organized, and what kinds of things are being described?

This is the most complex area as there are potentially a number of different audiences and abilities to cater for. At one end of the spectrum we want to provide high level summaries of the contents of a dataset, while at the other end we want to provide detailed schema information to support developers. I’ve previously advocated a “progressive disclosure” approach to allow re-users to quickly find the data they need; a product manager looking for data to support a new feature will be looking for different information to a developer constructing queries over a dataset.

I think there are three broad ways that we can decompose Compositional Data further. There are particular questions and types of information that relate to each of them:

  • Scope or Coverage 
    • What kinds of things does this dataset describe? Is it people, places, or other objects?
    • How many of these things are in the dataset?
    • Is there a geographical focus to the dataset, e.g. a county, region, country or is it global?
    • Is the data confined to a particular time period (archival data) or does it contain recent information?
  • Structure
    • What are some typical example records from the dataset?
    • What schema does it conform to?
    • What graph patterns (e.g. combinations of vocabularies) are commonly found in the data?
    • How are various types of resource related to one another?
    • What is the logical data model for the data?
  • Internals
    • What RDF terms and vocabularies are used in the data?
    • What formats are used for capturing dates, times, or other structured values?
    • Are there custom validation rules for particular fields or properties?
    • Are there caveats or qualifiers to individual schema elements or data items?
    • What is the physical data model?
    • How is the dataset laid out in a particular database schema, across a collection of files, or named graphs?
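
A simple way to start answering the scope questions is to summarise a dataset by type. The sketch below counts resources per RDF type, assuming a SPARQL 1.1 endpoint; the results could feed a “report card” style visualisation:

# Sketch: summarise dataset scope by counting resources per type
SELECT ?type (COUNT(DISTINCT ?resource) AS ?count) WHERE {
  ?resource a ?type .
}
GROUP BY ?type
ORDER BY DESC(?count)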

The experiments we did in Kasabi around the report card (see the last slides for examples) were exploring ways to help visualise the scope of a dataset. It was based on identifying broad categories of entity in a dataset. I’m not sure we got the implementation quite right, but I think it was a useful visual indicator to help understand a dataset.

This is a project I plan to revive when I get some free time. Related to this is the work I did to map the Schema.org Types to the Noun Project Icons.

Summary

I’ve tried to present a framework that captures most, if not all of the kinds of questions that I’ve seen people ask when trying to get to grips with a new dataset. If we can understand the types of information people need and the questions they want to answer, then we can create a better set of data publishing and analysis tools.

To date, I think there’s been a tendency to focus on the Descriptive Data and Access Information — because we want to be able to discover data — and its Internals — so we know how to use it.

But for data to become more accessible to a non-technical audience we need to think about a broader range of information and how this might be surfaced by data publishing platforms.

If you have feedback on the framework, particularly if you think I’ve missed a category of information, then please leave a comment. The next step is to explore ways to automatically derive and surface some of this information.

What is a Dataset?

As my last post highlighted, I’ve been thinking about how we can find and discover datasets and their related APIs and services. I’m thinking of putting together some simple tools to help explore and encourage the kind of linking that my diagram illustrated.

There’s some related work going on in a few areas which is also worth mentioning:

  • Within the UK Government Linked Data group there’s some work progressing around the notion of a “registry” for Linked Data that could be used to collect dataset metadata as well as supporting dataset discovery. There’s a draft specification which is open for comment. I’d recommend you ignore the term “registry” and see it more as a modular approach for supporting dataset discovery, lightweight Linked Data publishing, and “namespace management” (aka URL redirection). A registry function is really just one aspect of the model.
  • There’s an Open Data on the Web workshop in April which will cover a range of topics including dataset discovery. My current thoughts are partly preparation for that event (and I’m on the Programme Committee)
  • There’s been some discussion and a draft proposal for adding the Dataset type to Schema.org. This could result in the publication of more embedded metadata about datasets. I’m interested in tools that can extract that information and do something useful with it.

Thinking about these topics I realised that there are many definitions of “dataset”. Unsurprisingly it means different things in different contexts. If we’re defining models, registries and markup for describing datasets we may need to get a sense of what these different definitions actually are.

As a result, I ended up looking around for a series of definitions and I thought I’d write them down here.

Definitions of Dataset

Let’s start with the most basic: for example, Dictionary.com has the following definition:

“a collection of data records for computer processing”

Which is pretty vague. Wikipedia has a definition which derives from the term’s use in a mainframe environment:

“A dataset (or data set) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the dataset in question. It lists values for each of the variables, such as height and weight of an object. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows.

Nontabular datasets can take the form of marked up strings of characters, such as an XML file.”

The W3C Data Catalog Vocabulary defines a dataset as:

“A collection of data, published or curated by a single source, and available for access or download in one or more formats.”

The JISC “Data Information Specialists Committee” have a definition of dataset as:

“…a group of data files–usually numeric or encoded–along with the documentation files (such as a codebook, technical or methodology report, data dictionary) which explain their production or use. Generally a dataset is un-usable for sound analysis by a second party unless it is well documented.”

Which is a good definition as it highlights that the dataset is more than just the individual data files or facts: it also consists of some documentation that supports its use or analysis. I also came across a document called “A guide to data development” (2007) from the National Data Development and Standards Unit in Australia which describes a dataset as

“A data set is a set of data that is collected for a specific purpose. There are many ways in which data can be collected—for example, as part of service delivery, one-off surveys, interviews, observations, and so on. In order to ensure that the meaning of data in the data set is clearly understood and data can be consistently collected and used, data are defined using metadata…”

This too has the notion of context and clear definitions to support usage, but also notes that the data may be collected in a variety of ways.

A Legal Definition

As it happens, there’s also a legal definition of a dataset in the UK, at least as far as it relates to Freedom of Information. The “Protection of Freedoms Act 2012 Part 6, (102) c” includes the following definition:

In this Act “dataset” means information comprising a collection of information held in electronic form where all or most of the information in the collection—

  • (a)has been obtained or recorded for the purpose of providing a public authority with information in connection with the provision of a service by the authority or the carrying out of any other function of the authority,
  • (b)is factual information which—
    • (i)is not the product of analysis or interpretation other than calculation, and
    • (ii)is not an official statistic (within the meaning given by section 6(1) of the Statistics and Registration Service Act 2007), and
  • (c)remains presented in a way that (except for the purpose of forming part of the collection) has not been organised, adapted or otherwise materially altered since it was obtained or recorded.”

This definition is useful as it defines the boundaries for what type of data is covered by Freedom of Information requests. It clearly states that the data is collected as part of the normal business of the public body and also that the data is essentially “raw”, i.e. not the result of analysis or has not been adapted or altered.

Raw data (as defined here!) is more useful as it supports more downstream usage. Raw data has more potential.

Statistical Datasets

The statistical community has also worked towards having a clear definition of dataset. The OECD Glossary defines a Dataset as “any organised collection of data”, but then includes context that describes that further. For example that a dataset is a set of values that have a common structure and are usually thematically related. However there’s also this note that suggests that a dataset may also be made up of derived data:

“A data set is any permanently stored collection of information usually containing either case level data, aggregation of case level data, or statistical manipulations of either the case level or aggregated survey data, for multiple survey instances”

Privacy is one key reason why a dataset may contain derived information only.

The RDF Data Cube vocabulary, which borrows heavily from SDMX — a key standard in the statistical community — defines a dataset as being made up of several parts:

  1. “Observations - This is the actual data, the measured numbers. In a statistical table, the observations would be the numbers in the table cells.
  2. Organizational structure - To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation…
  3. Internal metadata - Having located an observation, we need certain metadata in order to be able to interpret it. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated?…
  4. External metadata – This is metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, and a SPARQL endpoint where it can be accessed.”

The SDMX implementors guide has a long definition of dataset (page 7) which also focuses on the organisation of the data and specifically how individual observations are qualified along different dimensions and measures.

Scientific and Research Datasets

Over the last few years the scientific and research community have been working towards making their datasets more open, discoverable and accessible. Organisations like the Wellcome Trust have published guidance for researchers on data sharing; services like CrossRef and DataCite provide the means for giving datasets stable identifiers; and platforms like FigShare support the publishing and sharing process.

While I couldn’t find a definition of dataset from that community (happy to take pointers!) it’s clear that the definition is extremely broad. It could cover anything from raw results, e.g. output from sensors or equipment, through to more analysed results. The boundaries are hard to define.

Given the broad range of data formats and standards, services like FigShare accept any or all data formats. But as the Wellcome Trust note:

“Data should be shared in accordance with recognised data standards where these exist, and in a way that maximises opportunities for data linkage and interoperability. Sufficient metadata must be provided to enable the dataset to be used by others. Agreed best practice standards for metadata provision should be adopted where these are in place.”

This echoes the earlier definitions that included supporting materials as being part of the dataset.

RDF Datasets

I’ve mentioned a couple of RDF vocabularies already, but within the RDF and Linked Data community there are a couple of other definitions of dataset to be found. The Vocabulary of Interlinked Datasets (VoiD) is similar to, but predates, DCAT. Whereas DCAT focuses on describing a broad class of different datasets, VoiD describes a dataset as:

“…a set of RDF triples that are published, maintained or aggregated by a single provider…the term dataset has a social dimension: we think of a dataset as a meaningful collection of triples, that deal with a certain topic, originate from a certain source or process, are hosted on a certain server, or are aggregated by a certain custodian. Also, typically a dataset is accessible on the Web, for example through resolvable HTTP URIs or through a SPARQL endpoint, and it contains sufficiently many triples that there is benefit in providing a concise summary.”

Like the more general definitions this includes the notion that the data may relate to a specific topic or be curated by a single organisation. But this definition also makes some assumption about the technical aspects of how the data is organised and published. VoiD also includes support for linking to the services that relate to a dataset.

Along the same lines, SPARQL also has a definition of a Dataset:

“A SPARQL query is executed against an RDF Dataset which represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI…”

Unsurprisingly for a technical specification this is a very narrow definition of dataset. It also differs from the VoiD definition. While both assume RDF as the means for organising the data, the VoiD term is more general, e.g. it glosses over details of the internal organisation of the dataset into named graphs. This results in some awkwardness when attempting to navigate between a VoiD description and a SPARQL Service Description.
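
The narrow SPARQL sense of “dataset” can at least be inspected directly. The sketch below enumerates the named graphs in an endpoint’s dataset along with a triple count for each (SPARQL 1.1 aggregates assumed; the default graph is not listed by this pattern):

# Sketch: list the named graphs in a SPARQL dataset
SELECT ?g (COUNT(*) AS ?triples) WHERE {
  GRAPH ?g { ?s ?p ?o }
}
GROUP BY ?g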

Summary

If you’ve gotten this far, then well done :)

I think there are a couple of things we can draw out from these definitions which might help us when discussing “datasets”:

  • There’s a clear sense that a dataset relates to a specific topic and is collected for a particular purpose.
  • The means by which a dataset is collected and the definitions of its contents are important for supporting proper re-use
  • Whether a dataset consists of “raw data” or more analysed results can vary across communities. Both forms of dataset might be available, but in some circumstances (e.g. for privacy reasons) only derived data might be published
  • Depending on your perspective and your immediate use case the dataset may be just the data items, perhaps expressed in a particular way (e.g. as RDF).  But in a broader sense, the dataset also includes the supporting documentation, definitions, licensing statements, etc.

While there’s a common core to these definitions, different communities do have slightly different outlooks that are likely to affect how they expect to publish, describe and share data on the web.

Dataset and API Discovery in Linked Data

I’ve been recently thinking about how applications can discover additional data and relevant APIs in Linked Data. While there’s been lots of research done on finding and using (semantic) web services I’m initially interested in supporting the kind of bootstrapping use cases covered by Autodiscovery.

We can characterise that use case as helping to answer the following kinds of questions:

  • Given a resource URI, how can I find out which dataset it is from?
  • Given a dataset URI, how can I find out which resources it contains and which APIs might let me interact with it?
  • Given a domain on the web, how can I find out whether it exposes some machine-readable data?
  • Where is the SPARQL endpoint for this dataset?

More succinctly: can we follow our nose to find all related data and APIs?

I decided to try and draw a diagram to illustrate the different resources involved and their connections. I’ve included a small version below:

Data and API Discovery with Linked Data

Lets run through the links between different types of resources:

  • From Dataset to Sparql Endpoint (and Item Lookup, and Open Search Description): this is covered by VoiD which provides simple predicates for linking a dataset to three types of resources. I’m not aware of other types of linking yet, but it might be nice to support reconciliation APIs.
  • From Well-Known VoiD Description (background) to Dataset. This well known URL allows a client to find the “top-level” VoiD description for a domain. It’s not clear what that entails, but I suspect the default option will be to serve a basic description of a single dataset, with reference to sub-sets (void:subset) where appropriate. There might also just be rdfs:seeAlso links.
  • From a Dataset to a Resource. A VoiD description can include example resources, this blesses a few resources in the dataset with direct links. Ideally these resources ought to be good representative examples of resources in the dataset, but they might also be good starting points for further browsing or crawling.
  • From a Resource to a Resource Description. If you’re using “slash” URIs in your data, then URIs will usually redirect to a resource description that contains the actual data. The resource description might be available in multiple formats and clients can use content negotiation or follow Link headers to find alternative representations.
  • From a Resource Description to a Resource. A description will typically have a single primary topic, i.e. the resource its describing. It might also reference other related resources, either as direct relationships between those resources or via rdfs:seeAlso type links (“more data over here”).
  • From a Resource Description to a Dataset. This is where we might use a dct:source relationship to state that the current description has been extracted from a specific dataset.
  • From a SPARQL Endpoint (Service Description) to a Dataset. Here we run into some differences between definitions of dataset, but essentially we can describe in some detail the structure of the SPARQL dataset that is used in an endpoint and tie that back to the VoiD description. I found myself looking for a simple predicate that linked to a void:Dataset rather than describing the default and named graphs, but couldn’t find one.
  • I couldn’t find any way to relate a Graph Store to a Dataset or SPARQL endpoint. Early versions of the SPARQL Graph Store protocol had some notes on autodiscovery of descriptions, but these aren’t in the latest versions.

These links are expressed, for the most part, in the data but could also be present as Link headers in HTTP responses or in HTML (perhaps with embedded RDFa).
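
Where the links are published in the data itself, a client that has fetched a VoiD description can extract the advertised access points with a simple query. This sketch assumes the description has been loaded somewhere queryable and only uses the standard VoiD properties:

# Sketch: extract access points from a VoiD description
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?dataset ?endpoint ?lookup ?example WHERE {
  ?dataset a void:Dataset .
  OPTIONAL { ?dataset void:sparqlEndpoint    ?endpoint }
  OPTIONAL { ?dataset void:uriLookupEndpoint ?lookup }
  OPTIONAL { ?dataset void:exampleResource   ?example }
}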

I’ve also not covered sitemaps at all, which provide a way to exhaustively list the key resources in a website or dataset to support mirroring and crawling. But I thought this diagram might be useful.

I’m not sure that the community has yet standardised on best practices for all of these cases and across all formats. That’s one area of discussion I’m keen to explore further.

Publishing SPARQL queries and documentation using github

Yesterday I released an early version of sparql-doc a SPARQL documentation generator. I’ve got plans for improving the functionality of the tool, but I wanted to briefly document how to use github and sparql-doc to publish SPARQL queries and their documentation.

Create a github project for the queries

First, create a new github project to hold your queries. If you’re new to github then you can follow their guide to get a basic repository set up with a README file.

The simplest way to manage queries in github is to publish each query as a separate SPARQL query file (with the extension “.rq”). You should also add a note somewhere specifying the license associated with your work.

As an example, here is how I’ve published a collection of SPARQL queries for the British National Bibliography.
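
For illustration, a query file in such a project might look something like the sketch below: a commented header followed by the query itself. The comment layout here is only indicative; see the sparql-doc documentation for the exact conventions it recognises.

# Books by J. R. R. Tolkien in the BNB
#
# Lists the titles of books whose creator name matches "Tolkien".
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?book ?title WHERE {
  ?book dct:creator ?creator ;
        dct:title ?title .
  ?creator foaf:name ?name .
  FILTER(REGEX(?name, "Tolkien"))
}
LIMIT 50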

When you create a new query be sure to add it to your repository and regularly commit and push the changes:

git add my-new-query.rq
git commit my-new-query.rq -m "Added another example"
git push origin master

The benefit of using github is that users can report bugs or submit pull requests to contribute improvements, optimisations, or new queries.

Use sparql-doc to document your queries

When you’re writing your queries follow the commenting conventions encouraged by sparql-doc. You should also install sparql-doc so you can use it from the command-line. (Note: the installation guide and notes still need some work!)

As a test you can try generating the documentation from your queries as follows. Assuming your queries are checked out in ~/projects/examples, run the following to generate the docs into a new ~/projects/docs directory:

sparql-doc ~/projects/examples ~/projects/docs

You should then be able to open up the index.html document in ~/projects/docs to see how the documentation looks.

If you have an existing web server somewhere then you can just zip up those docs and put them somewhere public to share them with others. However you can also publish them via Github Pages. This means you don’t have to set up any web hosting at all.

Use github pages to publish your docs

Github Pages allows github users to host public, static websites directly from github projects. It can be used to publish blogs or other project documentation. But using github pages can seem a little odd if you’re not familiar with git.

Effectively what we’re going to do is create a separate collection of files — the documentation — that sits in parallel to the actual queries. In git terms this is done by creating a separate “orphan” branch. The documentation lives in the branch, which must be called gh-pages, while the queries will remain in the master branch.

Again, github have a guide for manually adding and pushing files for hosting as pages. The steps below follow the same process, but using sparql-doc to create the files.

Github recommend starting with a fresh separate checkout of your project. You then create a new branch in that checkout, remove the existing files and replace them with your documentation.

As a convention I suggest that when you check out the project a second time, you give it a separate name, e.g. by adding a “-pages” suffix.

So for my bnb-queries project, I will have two separate checkouts:

The original checkout I did when setting up the project for the BNB queries was:

git clone git@github.com:ldodds/bnb-queries.git

This gave me a local ~/projects/bnb-queries directory containing the main project code. So to create the required github pages branch, I would do this in the ~/projects directory:

#specify different directory name on clone
git clone git@github.com:ldodds/bnb-queries.git bnb-queries-pages
#then follow the steps as per the github guide to create the pages branch
cd bnb-queries-pages
git checkout --orphan gh-pages
#then remove existing files
git rm -rf .

This gives me two parallel directories: one containing the master branch and the other the documentation branch. Make sure you really are working with a separate checkout before deleting and replacing all of the files!

To generate the documentation I then run sparql-doc telling it to read the queries from the project directory containing the master branch, and then use the directory containing the gh-pages branch as the output, e.g.:

sparql-doc ~/projects/bnb-queries ~/projects/bnb-queries-pages

Once that is done, I then add, commit and push the documentation, as per the final step in the github guide:

cd ~/projects/bnb-queries-pages
git add *
git commit -am "Initial commit"
git push origin gh-pages

The first time you do this, it’ll take about 10 minutes for the pages to become active. They will appear at the following URL:

http://USER.github.com/PROJECT

E.g. http://ldodds.github.com/sparql-doc

If you add new queries to the project, be sure to re-run sparql-doc and add/commit the updated files.

Hopefully that’s relatively clear.

The main thing to understand is that locally you’ll have your github project checked out twice: once for the main line of code changes, and once for the output of sparql-doc.

These will need to be separately updated to add, commit and push files. In practice this is very straight-forward and means that you can publicly share queries and their documentation without the need for web hosting.
