Category Archives: Open Data

Data and information in the city

For a while now I’ve been in the habit of looking for data as I travel to work or around Bath. You can’t really work with data and information systems for any length of time without becoming a little bit obsessive about numbers, or without becoming tuned into interesting little dashboards.

My eye gets drawn to gauges and displays on devices as I’m curious about not just what they’re showing but also for whom the information is intended.

I can also tell you that for at least ten years, perhaps longer, the electronic signs on some of the buses running the Number 10 route in Bath have been buggy. Instead of displaying “10 Southdown” they read “(ode1fsOs1ss1 10sit2 Southdown)” with a flashing “s” in “sit”.

Yes. I wrote it down. I was curious about whether it was some misplaced control codes, but I couldn’t find a reference.

Having spent so long working on data integration and with technologies like Linked Data, I’m also curious about how people assign identifiers to things. A lot of what I’ve learnt about that went into writing this paper, which is a piece of work of which I’m very proud. It’s become an ingrained habit to look out for identifiers wherever I can find them. It hasn’t escaped me that this is pretty close to trainspotting, by the way!

I’ve also recently started contributing to Bath: Hacked, which is Bath’s community-led open data project. It’s led me to pay even closer attention to the information around me in Bath, as it might turn up some useful data that could be published or indicate the potential for a useful digital service.

So, to channel my “data magpie” habits in a more productive direction, I’ve started on a small project to photograph some of the information and data I find as I walk around the city. There are signs, information and data all around us, but we often don’t really notice them, or we take the information for granted. I decided to try to catalogue some of the ways in which we might encounter data around Bath and, by extension, in other cities.

The entire set of photos is available on Flickr if you care to take a look. Think of it as a natural history study of data.

In the rest of this post I want to explore a few things that have occurred to me along the way: areas where we can glimpse the digital environment and data infrastructure that increasingly supports the physical environment, and the ways in which data might be intentionally or incidentally shared with others.

Data as dark matter

For most people data is ephemeral stuff. It’s not something they tend to think about, even though it’s being collected and recorded all around us. While there’s increasing awareness of how our personal data is collected and used by social networks and other services, there’s often little understanding of what data might be available about the built environment.

But you can see evidence of that data all around us. Data is a bit like dark matter: we often only know it exists from its effects on other things that we understand more clearly. Once you start looking you can see identifiers everywhere:

Bridge identifiers

If something has an identifier then there will be data associated with it: a record that describes that object. And as there is very likely to be a whole collection of those things, we can infer that there’s a database containing many similar records.

Once you start looking you can see databases everywhere: of lampposts, parking spaces, bins, and the monoliths that sit in our streets but which we rarely think about:

Traffic light control box

Once you realise all of these databases exist, it’s natural to start asking questions: how is that information collected, who is responsible for it, and when might it be useful? There are databases everywhere, and people are employed to look after them.

The bus driver’s role in data governance

Live bus times

I was looking forward to the installation of the Real Time Information sign at the bus stop (0180BAC30294) near my house. For a few years now I’ve been regularly taking a photo of the paper sign on the stop; looking at that on my phone is still much quicker than using any of the online services or apps. A real-time data feed was going to solve that. Only it didn’t. It’s made things worse:

My morning bus, the one that begins my commute to the Open Data Institute, is often not listed. I’ve had several morning conversations with Travelwest about it. Although, evoking Hello Lamppost, it sometimes feels like I’ve been arguing with the bus sign itself, and that I’d like to leave a note telling others that, actually, yes, the Number 10 really is on its way.

I’m suddenly concerned that they may do away with that helpful paper sign. The real-time information feed exposes problems with data management that wouldn’t otherwise be evident. Real-time doesn’t always mean better.

Interestingly, Travelwest have an FAQ that lists a number of reasons why some buses won’t appear on the RTI system. This includes the expected range of data and hardware problems, but also: “The bus driver has logged on to the ETM incorrectly, preventing the journey operated being ‘matched’ by the central RTI system”.

So it turns out that bus drivers have a key role in the data governance of this particular dataset. They’re not just responsible for getting the bus from A to B, but also for ensuring that passengers know it’s on its way. I wonder if that’s part of their induction?

The paperless city

There are more obvious signs of business processes that we can see around a city. These are stages in processes that require some public notice or engagement, such as planning applications or other “rights to object” to planned works:

Pole Objection Notice

In other cases the information is presented as an indication that a process has been completed successfully, such as gaining a premises licence, liability insurance or an energy rating certificate. If this information is being put on physical display then it’s natural to wonder whether there are digital versions that could or should be made available.

Also, in the majority of cases, making this information available digitally would probably be much better. There are certainly opportunities to create better digital services to help engage people in these processes. But in order to be inclusive, I suspect paper-based approaches are going to be around for a while.

What would a digital public service look like that provided this type of city information, both on-demand and as notifications, to residents? The information might already be available on council websites, but you have to know that it’s there and then how to find it.

Visible to the public, but not for the public

Interestingly, not all of the information we can find around the city is intended for wider public consumption. It may be published into a public space but only be intended for a particular group of people, or only be useful at a particular point in time, e.g. during an emergency, such as this map of fire sensors.

Fire hydrant

Most of the identifier examples I referred to above fall into this category. Only a small number of people need to know the identifier for a specific bin, traffic light control box, or bridge.

It also means that information is often provided without context, because the intended audience knows how to read it or has the tools required to unlock more information. To interpret it properly you have to be able to understand the visual code used in these organisational hobo signs.

The importance of notice boards

For me there’s something powerful in the juxtaposition of these two examples:

Community notice board

Dynamic display board

The first is a community notice board. Anyone can come along and not only read it but also add to the available information. It’s a piece of community owned and operated information infrastructure. This manually updated map of the local farmers market is another nice example, as are the walls of flyers and event notices at the local library.

The second example is a sealed unit. It’s owned and operated by a single organisation who gets to choose what information is displayed. Community annotations aren’t possible. There’s no scope to add notices or graffiti to appropriate the structure for other purposes – something that you see everywhere else in the city. This is increasingly hard to do with digital infrastructures.

In my opinion a truly open city will include both types of digital and physical infrastructure. I dislike the top-down view of the smart city and prefer the vision of creating an open, annotatable data infrastructure for residents and local businesses to share information.

Useful perspective

In this rambling post I’ve tried to capture some of the thoughts that have occurred to me whilst taking a more critical look at how data and information are published in our cities. I’ve really only scratched the surface, but it’s been fun to take a step back and look at Bath with fresh eyes.

I think it’s interesting to see how data leaks into the physical environment, either intentionally or otherwise. Using environments that people are familiar with might also be a useful way to get a wider audience thinking about the data that helps our society function, and how it is owned and operated.

It’s also interesting to consider how a world of increasingly connected devices and real-time information is going to impact this environment. Will all of this information move onto our phones, watches or glasses and out of the physical infrastructure? Or are we going to end up with lots more cryptic icons and identifiers on all kinds of bits of infrastructure?

 

“The scribe and the djinn’s agreement”, an open data parable

In a time long past, in a land far away, there was once a great city. It was the greatest city in the land and the vast marketplace at its centre was the busiest, liveliest marketplace in the world. People of all nations could be found there buying and selling their wares. Indeed, the marketplace was so large that people could spend days, even weeks, exploring its length and breadth and still discover new stalls selling a myriad of items.

A frequent visitor to the marketplace was a woman known only as the Scribe. While the Scribe was often found roaming the marketplace, even she did not know all of the merchants to be found within its confines. Yet she spent many a day helping others find their way to the stalls they were seeking, and was happy to do so.

One day, in return for providing useful guidance, a mysterious stranger gave the Scribe a gift: a small magical lamp. Upon rubbing the lamp a djinn appeared before the surprised Scribe and offered her a single wish.

“Oh venerable djinn” cried the Scribe, “grant me the power to help anyone that comes to this marketplace. I wish to help anyone who needs it to find their way to whatever they desire”.

With a sneer the djinn replied: “I will grant your wish. But know this: your new found power shall come with limits. For I am a capricious spirit who resents his confinement in this lamp”. And with a flash and a roll of thunder, the magic was completed. And in the hands of the Scribe appeared the Book.

The Book contained the name and location of every merchant in the marketplace. From that day forward, by reading from the Book, the Scribe was able to help anyone who needed assistance to find whatever they needed.

After several weeks of wandering the market, happily helping those in need, the Scribe was alarmed to discover that she was confronted by a long, long line of people.

“What is happening?” she asked of the person at the head of the queue.

“It is now widely known that no-one should come to the Market without consulting the Scribe” said the man, bowing. “Could you direct me to the nearest merchant selling the finest silks and tapestries?”

And from that point forward the Scribe was faced with a never-ending stream of people asking for help. Tired and worn and no longer able to enjoy wandering the marketplace as had been her whim, she was now confined to its gates. Directing all who entered, night and day.

After some time, a young man took pity on the Scribe, pushing his way to the front of the queue. “Tell me where all of the spice merchants are to be found in the market, and then I shall share this with others!”

But no sooner had he said this than the djinn appeared in a puff of smoke: “NO! I forbid it!”. With a wave of its arm the Scribe was struck dumb until the young man departed. With a smirk the djinn disappeared.

Several days passed and a group of people arrived at the head of the queue of petitioners.

“We too are scribes.” they said. “We come from a neighbouring town having heard of your plight. Our plan is to copy out your Book so that we might share your burden and help these people”.

But whilst a spark of hope was still flaring in the heart of the Scribe, the djinn appeared once again. “NO! I forbid this too! Begone!” And with a scream and a flash of light the scribes vanished. Looking smug, the djinn disappeared.

Some time passed before a troupe of performers approached the Scribe. As a chorus they cried: “Look yonder at our stage, and the many people gathered before it. By taking turns reading from the Book, in front of a wide audience, we can easily share your burden”.

But shaking her head the Scribe could only turn away whilst the djinn visited ruin upon the troupe. “No more” she whispered sadly.

And so, for many years the Scribe remained as she had been, imprisoned within the subtle trap of the djinn of the lamp. Until, one day, a traveller appeared in the market. Upon reaching the head of the endless line of petitioners, the man asked of the Scribe:

“Where should you go to rid yourself of the evil djinn?”

Surprised, and with sudden hope, the Scribe turned the pages of her Book…

Open data and diabetes

In December my daughter was diagnosed with Type 1 diabetes. It was a pretty rough time. Symptoms can start and escalate very quickly. Hyperglycaemia and ketoacidosis are no joke.

But luckily we have one of the best health services in the world. We’ve had amazing care, help and support. And, while we’re only 4 months into dealing with a life-long condition, we’re all doing well.

Diabetes sucks though.

I’m writing this post to reflect a little on the journey we’ve been on over the last few months, from a professional rather than a personal perspective. Basically, the first weeks of becoming a diabetic, or the parent of a diabetic, are a crash course in physiology, nutrition, and medical monitoring. You have to adapt to new routines for blood glucose monitoring, learn to give injections (and teach your child to do them), become good at book-keeping, plan for exercise, and remember to keep needles, lancets, monitors, emergency glucose and insulin with you at all times, whilst ensuring prescriptions are regularly filled.

Oh, and there’s a stupid amount of maths, because you’ll need to start calculating how much carbohydrate is in all of your meals and inject accordingly. No meal unless you do your sums.
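To make that maths concrete, here’s a toy sketch of the kind of sum involved. It is purely illustrative, not medical guidance: insulin-to-carb ratios are set individually by a diabetes team, and all the figures below are invented.

```python
# Toy illustration of carb counting -- NOT medical guidance.
# The 1 unit per 10g ratio below is invented; real ratios are set
# individually by a diabetes team and can vary by time of day.
GRAMS_PER_UNIT = 10

meal = {
    "pasta (75g dry weight)": 52.0,  # grams of carbohydrate, from the packet
    "tomato sauce": 8.5,
    "apple juice": 14.0,
}

total_carbs = sum(meal.values())
dose = total_carbs / GRAMS_PER_UNIT
print(f"{total_carbs:.1f}g of carbohydrate -> {dose:.1f} units of insulin")
```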

Good job we had that really great health service to support us (there’s data to prove it). And an amazing daughter who has taken it all in her stride.

Diabetics live a quantified life. Tightly regulating blood glucose levels means knowing exactly what you’re eating, and learning how your body reacts to different foods and levels of exercise. For example, we’ve learnt the different ways that a regular school day versus the school holidays affects my daughter’s metabolism. That we need to treat ahead for the hypoglycaemia that follows a few hours after some fun on the trampoline. And that certain foods (cereals, risotto) seem to affect insulin uptake.

So to manage the condition we need to know how many carbohydrates are in:

  • any pre-packaged food my daughter eats
  • any ingredients we use when cooking, so we can calculate a total portion size
  • any snack or meal that we eat out

Food labelling is pretty good these days, so the basic information is generally available. But it’s not always available on menus or in an easy-to-use format.

The book and app that diabetic teams recommend is called Carbs and Cals. I was a little horrified by it initially, as it’s just a big picture book of different portion sizes of food. You’re encouraged to judge everything by eye or weight. It seemed imprecise to me, but with hindsight it’s perfectly suited to those early stages of learning to live with diabetes. No hunting over packets to get the data you need: just look at a picture, a useful visualisation. Simple is best when you’re overwhelmed with so many other things.

Having tried calorie counting I wanted to try an app to more easily track foods and calculate recipes. My Fitness Pal, for example, is pretty easy to use and does bar-code scanning of many foods. There are others that are more directly targeted at diabetics.

The problem is that, as I’ve learnt from my calorie counting experiments, the data isn’t always accurate. Many apps fill their databases through crowd-sourcing. But recipes and portion sizes change continually. And people make mistakes when they enter data, or enter just the bits they’re interested in. Look up any food on My Fitness Pal and you’ll find many duplicate entries. It makes me distrust the data, because I’m concerned it’s not reliable. So for now we’re still reading packets.

Eating out is another adventure. There have been recent legislative changes to require restaurants to make more nutritional information available. If you search you may find information on a company website and can plan ahead. Sometimes it’s only available if you contact customer support. If you ask in a (chain) restaurant they may have it available in a ring-binder you can consult alongside the menu. This doesn’t make for a great experience for anyone. Recently we’ve been told in a restaurant to just check online for the data (when we know it doesn’t exist), because they didn’t want to risk any liability by providing information directly. On another occasion we found that certain dishes – items from the children’s menu – weren’t included on the nutritional charts.

Basically, the information we want is:

  • often not available at all
  • available, but only if you know where to look or who to ask
  • potentially out of date, as it comes from non-authoritative sources
  • incomplete or inaccurate, even from the authoritative sources
  • not regularly updated
  • not in easy to use formats
  • available electronically, e.g. in an app, but without any clear provenance

The reality is that this type of nutritional and ingredient data is basically in the same state as government data was 6-7 years ago. It’s something that really needs to change.

Legislation can help encourage supermarkets and restaurants to make data available, but really it’s time for them to recognise that this is essential information for many people. All supermarkets, manufacturers and major chains will have this data already, so there should be little effort required in making it public.

I’ve wondered whether this type of data ought to be considered as part of the UK National Information Infrastructure. It could be collected as part of the remit of the Food Standards Agency. Having a national source would help remove ambiguity around how data has been aggregated.

Whether you’re calorie or carb counting, open data can make an important difference. It’s about giving people the information they need to live healthy lives.

What is an Open API?

I was reading a document this week that referred to an “Open API”. It occurred to me that I hadn’t really thought about what that term was supposed to mean before. Having looked at the API in question, it turned out it did not mean what I thought it meant. The definition of Open API on Wikipedia and the associated list of Open APIs are also both a bit lacklustre.

We could probably do with being more precise about what we mean by that term, particularly in how it relates to Open Source and Open Data. So far I’ve seen it used in several different ways:

  1. An API that is free for anyone to use — I think it would be clearer to refer to these as “Public APIs”. Some may require authentication, some may only have a limited free tier of usage, but the API is accessible to anyone that wants to use it
  2. An API that is backed by open data — the data served by the API is covered by an open licence. A Public API isn’t necessarily backed by Open Data. While it might be free for me to use an API, I may be limited in how I can use the data by the API terms and/or a non-open data licence that applies to the data
  3. An API that is based on an open standard — the data available via an API might not be open, but the means of accessing and querying the data is covered by a specification that has been created by a standards body or has otherwise been openly published, e.g. the specification of the API is covered by an open licence. The important thing here is that the API could be (re-)implemented in an open source or commercial product without infringing on anyone’s rights or intellectual property. The specifications of APIs that serve open data aren’t necessarily open. A commercial vendor may provide a data publishing service whose API is entirely proprietary.

Personally I think an Open API is one that meets that final definition.

These are important distinctions and I’d encourage you to look at the APIs you’re using, or the APIs you’re publishing, and consider into which category they fall. APIs built on open source software typically fall into the third category: a reference implementation and API documentation are already in the open. It’s easy to create alternate versions, improve an existing code base, or run a copy of a service.

While the data in a platform may be open, lock-in (whether planned or otherwise) can happen when APIs are proprietary. This limits competition and the ability of both data publishers and consumers to choose other vendors. This is also one reason why APIs shouldn’t be the default for open government data: at some level the raw data should be portable and useful outside of whatever platform the organisation may choose to deploy. Ideally, platforms aimed at supporting open government data publishing should be open source or should, at the very least, openly license their API documentation.

Building the new Ordnance Survey Linked Data platform

Disclaimer: the following is my own perspective on the build & design of the Ordnance Survey Linked Data platform. I don’t presume to speak for the OS and don’t have any inside knowledge of their long term plans.

Having said that I wanted to share some of the goals we (Julian Higman, Benjamin Nowack and myself) had when approaching the design of the platform. I will say that we had the full support and encouragement of the Ordnance Survey throughout the project, especially John Goodwin and others in the product management team.

Background & Goals

The original Ordnance Survey Linked Data site launched in April 2010. At the time it was a leading example of adoption of Linked Data by a public sector organisation. But time moves on and both the site and the data were due for a refresh. With Talis’ withdrawal from the data hosting business, the OS decided to bring the data hosting in-house and contracted Julian, Benjamin and myself to carry out the work.

While the migration from Talis was a key driver, the overall goal was to deliver a new Linked Data platform that would make a great showcase for the Ordnance Survey Linked Data. The beta of the new site was launched in April and went properly live at the beginning of June.

We had a number of high-level goals that we set out to achieve in the project:

  • Provide value for everyone, not just developers — the original site was very developer-centric, offering a very limited user experience with no easy way to browse the data. We wanted everyone to begin sharing links to the Ordnance Survey pages, and that meant the site needed a clean, user-friendly design. This meant we approached it from the point of view of building an application, not just a data portal
  • Deliver more than Linked Data — we wanted to offer a set of APIs that made the data accessible and useful for people who weren’t familiar with Linked Data or SPARQL. This meant offering some simpler tools to enable people to search and link to the data
  • Deliver a good developer user experience — this meant integrating API explorers, plenty of examples, and clear documentation. We wanted to shorten the “time to first JSON” and get developers into the data as fast as possible
  • Showcase the OS services and products — the OS offer a number of other web services and location products. The data should provide a way to show that value. Integrating mapping tools was the obvious first step
  • Support latest standards and best practices — where possible we wanted to make sure that the site offered standard APIs and formats, and conformed to the latest best practices around open data publishing
  • Support multiple datasets — the platform has been designed to support multiple datasets, allowing users to use just the data they need or the whole combined dataset. This provides more options for both publishing and consuming the data
  • Build a solid platform to support further innovation — we wanted to leave the OS with an extensible, scalable platform to allow them to further experiment with Linked Data

Best Practices & Standards

From a technical perspective we needed to refresh not just the data but the APIs used to access it. This meant replacing the SPARQL 1.0 endpoint and custom search interface offered in the original with more standard APIs.

We also wanted to make the data and APIs discoverable and adopted a “completionist” approach to try and tick all the boxes for publishing and exposing dataset metadata, including basic versioning and licensing information.

As a result we ended up with:

  • SPARQL 1.1 query endpoints for every dataset, which expose a basic SPARQL 1.1 Service Description as well as the newer CSV and TSV response formats
  • Well populated VoID descriptions for each dataset, including all of the key metadata items including publication dates, licensing, coverage, and some initial dataset statistics
  • Autodiscovery support for datasets, APIs, and for underlying data about individual Linked Data resources
  • OpenSearch 1.1 compliant search APIs that support keyword and geo search over the data. The Atom and RSS response formats include the relevance and geo extensions
  • Licensing metadata is clearly labelled not just on the datasets, but as a Link HTTP header in every Linked Data or API result, so you can probe resources to learn more
  • Basic support for the OpenRefine Reconciliation API as a means to offer a simple linking API that can be used in a variety of applications but also, importantly, with people curating and publishing small datasets using OpenRefine
  • Support for CORS, allowing cross-origin requests to be made to the Linked Data and all of the APIs
  • Caching support through the use of ETags and Last-Modified headers. If you’re using the APIs then you can optimise your requests and cache data by making Conditional GET requests (see the sketch after this list)
  • Linked Data pages that offer more than just a data dump: the integrated mapping and links to other products and services make the data more engaging
  • Custom ontology pages that allow you to explore terms and classes within individual ontologies, e.g. the definition of “London Borough”
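To illustrate the caching point above, here’s a minimal sketch of a Conditional GET using Python’s requests library. The resource URL is a placeholder rather than a real path on the site:

```python
# Minimal sketch of a Conditional GET against the platform.
# The URL below is a placeholder; substitute any Linked Data
# resource or API result from the site.
import requests

url = "http://data.ordnancesurvey.co.uk/..."  # placeholder resource URL

first = requests.get(url, headers={"Accept": "application/json"})
etag = first.headers.get("ETag")

# Repeat the request, telling the server which version we already hold.
second = requests.get(url, headers={"If-None-Match": etag})

if second.status_code == 304:
    print("Not Modified: our cached copy is still valid, no body was re-sent.")
else:
    print("Resource has changed; process the new response.")
```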

Clearly there’s more that could potentially be done. Tools can always be improved, but the best way for that to happen is through user feedback. I’d love to know what you think of the platform.

Overall I think we’ve achieved our goal of making a site that, while clearly developer oriented, offers a good user experience for non-developers. I’ll be interested to see what people do with the data over the coming months.

Summarising Geographic Coverage of Dbpedia (and Wikipedia)

In “What Does Your Dataset Contain?” I outlined a conceptual framework for thinking about how we might want to describe datasets, e.g. how they’re produced, what they contain, etc. I’ve been reading with interest the series on dataset summaries in Scraperwiki which is exploring similar ideas.

I finally found the time to do some quick practical exploration of my own. One area that interests me is understanding the geographic coverage of a dataset. There are lots of ways to approach that, mainly because datasets can vary widely in how they include geographical data. Some might include direct references to regions, whilst others might have more fine-grained latitude/longitude data.

I recently discovered local-geocoder, which allows bulk reverse geocoding of lat/lng data to country names. I decided to apply this to the Dbpedia data to see if I could get a sense of its overall coverage.

The result is a simple shell script that:

  1. Downloads the geographic data from the English version of Dbpedia 3.8
  2. Extracts the georss:point predicates and runs them through the local_geocode command-line tool
  3. Runs the results through some command-line tools to sort and summarise the data to create a simple CSV file

I created a gist that contains the script and the output as formatted text and CSV.
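For the curious, here’s a rough Python equivalent of steps 2 and 3. It’s a sketch only: local-geocoder is a Ruby gem, so the reverse_geocode function below is a stand-in for whatever offline reverse geocoder you have available.

```python
# Sketch of the extract/geocode/summarise steps in Python.
# reverse_geocode() is a stand-in: the original pipeline used the
# local-geocoder Ruby gem via its command-line tool.
import csv
import re
from collections import Counter

def reverse_geocode(lat, lng):
    # Placeholder: should return a country name, or None when the
    # point falls outside all known boundaries.
    return None

counts = Counter()
point = re.compile(r'georss/point>\s*"([-\d.]+) ([-\d.]+)"')

with open("geo_coordinates_en.nt") as triples:  # the Dbpedia 3.8 geo dump
    for line in triples:
        match = point.search(line)
        if match:
            country = reverse_geocode(float(match.group(1)), float(match.group(2)))
            counts[country or "nil"] += 1  # "nil" marks unmatched points

with open("summary.csv", "w", newline="") as out:
    csv.writer(out).writerows(counts.most_common())  # country, point count
```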

Quick summary of the results:

  • 475,001 geographic points in Dbpedia 3.8
  • 26,763 (recorded as “nil” in the results) were unmatched, giving 448,238 points that can be geocoded to a country
  • 122,230 points were from the US (25.7% of the full set)
  • The US, Poland (46,316; 9.75%) and the United Kingdom (45,917; 9.67%) are the three most represented countries
  • 178 countries are referenced in total

From a quick inspection, I think the results that can’t be geocoded are simply those that are outside country boundaries, e.g. the location for Apollo 8 is in the middle of the Pacific.

The main caveat with the results (other than potential bugs) is that the boundary data used in local-geocoder is of unclear provenance; it’s intended for quick prototyping only. However, I’ve had a pull request accepted to local-geocoder that makes it easier to use alternative data sources.

Most online geocoders are rate-limited or have specific terms and conditions that limit re-use of the resulting data. It would be interesting to create a good reference set of open boundary data for countries and administrative regions for use in open source geocoding tools.

I’ve been exploring how the Ordnance Survey data could be converted to GeoJSON for use with the tool. This would give more fine-grained data for England, Scotland and Wales.
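As a sketch of what that conversion might produce, here’s the shape of a GeoJSON Feature built from a boundary record. The input record and its field names are hypothetical; the output structure follows the GeoJSON conventions.

```python
# Sketch: wrapping a boundary as a GeoJSON Feature for use in a geocoder.
# The input record is hypothetical; the output follows GeoJSON conventions.
import json

record = {
    "name": "Bath and North East Somerset",  # hypothetical boundary record
    "ring": [(-2.71, 51.27), (-2.28, 51.27), (-2.28, 51.44),
             (-2.71, 51.44), (-2.71, 51.27)],
}

feature = {
    "type": "Feature",
    "properties": {"name": record["name"]},
    "geometry": {
        "type": "Polygon",
        # GeoJSON coordinates are [longitude, latitude], and a ring
        # must end with the same position it starts with.
        "coordinates": [[list(point) for point in record["ring"]]],
    },
}

print(json.dumps(feature, indent=2))
```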

 

How Do We Attribute Data?

This post is another in my ongoing series of “basic questions about open data”, which includes “What is a Dataset?” and “What does a dataset contain?”. In this post I want to focus on dataset attribution, and in particular questions such as:

  • Why should we attribute data?
  • How are data publishers asking to be attributed?
  • What are some of the issues with attribution?
  • Can we identify some common conventions around attribution?
  • Can we monitor or track attribution?

I started to think about this because I’ve encountered a number of data publishers recently that have published Open Data but are now struggling to highlight how and where that data has been used or consumed. If data is published for anonymous download, or is accessible through an open API, then a data publisher only has usage logs to draw on.

I had thought that attribution might help here: if we can find links back to sources, then perhaps we can help data publishers mine the web for links and help them build evidence of usage. But it quickly became clear, as we’ll see in a moment, that there really aren’t any conventions around attribution, making it difficult to achieve this.

So let’s explore the topic from first principles and tick off my questions individually.

Why Attribute?

The obvious answer here is simply that if we are building on the work of others, then it’s only fair that those efforts should be acknowledged. This helps the creator of the data (or work, or code) be recognised for their creativity and effort, which is the very least we can do if we’re not exchanging hard cash.

There are also legal reasons why the source of some data might need to be acknowledged. Some licenses require attribution; copyright may need to be acknowledged. As a consumer I might also want to (or need to) clearly indicate that I am not the originator of some data, in case it is found to be false, or misleading, etc.

Acknowledging my sources may also help guarantee that the data I’m using continues to be available: a data publisher might be collecting evidence of successful re-use in order to justify ongoing budget for data collection, curation and publishing. This is especially true when the data publisher is not directly benefiting from the data supply; and I think it’s almost always true for public sector data. If I’m reusing some data I should make it as clear as possible that I’m doing so.

There’s some additional useful background on attribution from a public sector perspective in a document called “Supporting attribution, protecting reputation, and preserving integrity”.

It might also be useful to distinguish between:

  • Attribution — highlighting the creator/publisher of some data to acknowledge their efforts, conferring reputation
  • Citation — providing a link or reference to the data itself, in order to communicate provenance or drive discovery

While these two cases clearly overlap, the intention is often slightly different. As a user of an application, or the reader of an academic paper, I might want a clear citation to the underlying dataset so I can re-use it myself, or do some fact checking. The important use case there is tracking facts and figures back to their sources. Attribution is more about crediting the effort involved in collecting that information.

It may be possible to achieve both goals with a simple link, but I think recognising the different use cases is important.

How are data publishers asking to be attributed?

So how are data publishers asking for attribution? What follows isn’t an exhaustive survey but should hopefully illustrate some of the variety.

Let’s look first at some of the suggested wordings in some common Open Data licenses, then poke around in some terms and conditions to see how these are being applied in practice.

Attribution Statements in Common Open Data Licenses

The Open Data Commons Attribution license includes some recommended text (Section 4.3a – Example Notice):

Contains information from DATABASE NAME which is made available under the ODC Attribution License.

Where DATABASE NAME is the name of the dataset and is linked to the dataset homepage. Notice there’s no mention of the originator, just the database. The license notes that in plain text the links should be included as text. The Open Data Commons Database License has the same text (again, section 4.3a).

The UK Open Government License notes that re-users should:

…acknowledge the source of the Information by including any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence

Where no attribution statement is specified, or multiple sources must be attributed, the suggested default text, which should include a link to the license, is:

Contains public sector information licensed under the Open Government Licence v1.0.

So again, no reference to the publisher, but also no reference to the dataset either. The National Archives have some guidance on attribution which includes some other variations. These variants do suggest including more detail, such as the name of the department, date of publication, etc. These look more like typical bibliographic citations.

As another data point we can look at the Ordnance Survey Open Data License. This is a variation of the Open Government License but carries some additional requirements, specifically around attribution. The basic attribution statement is:

Contains Ordnance Survey data © Crown copyright and database right [year]

However the Code Point Open dataset has some additional attribution requirements, which also acknowledge copyright of the Royal Mail and National Statistics. All of these statements acknowledge the originators and there’s no requirement to cite the dataset itself.

Interestingly, while the previous licenses state that re-publication of data should be under a compatible license, only the OS Open Data license explicitly notes that the attribution statements must also be preserved. So both the license and attribution have viral qualities.

Attribution Statements in Terms and Conditions

Now let’s look at some specific Open Data services to see what attribution provisions they include.

Freebase is an interesting example. It draws on multiple datasets which are supplemented by contributions of its user community. Some of that data is under different licenses. As you can see from their attribution page, there are variants in attribution statements depending on whether the data is about one or several resources and whether it includes Wikipedia content, which must be specially acknowledged.

They provide a handy HTML snippet for you to include in your webpage to make sure you get the attribution exactly right. Ironically at the time of writing this service is broken (“User Rate Limit Exceeded”). If you want a slightly different attribution, then you’re asked to contact them.

Now, while Freebase might not meet everyone’s definition of Open Data, it’s an interesting data point. Particularly as they ask for deep links back to the dataset, as well as having a clear expectation of where/how the attribution will be surfaced.

OpenCorporates is another illustrative example. Their legal/license info page explains that their dataset is licensed under the Open Data Commons Database License and notes that:

Use of any data must be accompanied by a hyperlink reading “from OpenCorporates” and linking to either the OpenCorporates homepage or the page referring to the information in question

There are also clear expectations around the visibility of that attribution:

The attribution must be no smaller than 70% of the size of the largest bit of information used, or 7px, whichever is larger. If you are making the information available via your own API you need to make sure your users comply with all these conditions.

So there is a clear expectation that the attribution should be displayed alongside any data. Like the OS license these attribution requirements are also viral as they must be passed on by aggregators.

My intention isn’t to criticise either OpenCorporates or Freebase, but merely to highlight some real world examples.

What are some of the issues with data attribution?

Clearly we could undertake a much more thorough review than I have done here. But this is sufficient to highlight what I think are some of the key issues. Put yourself in the position of a developer consuming some Open Data under any or all of these conditions. How do you responsibly provide attribution?

The questions that occur to me, at least, are:

  • Do I need to put attribution on every page of my application, or can I simply add it to a colophon? (Aside: lanyrd has a great colophon page). In some cases it seems like I might have some freedom of choice, in others I don’t
  • If I do have to put a link or some text on a page, then do I have any flexibility around its size, positioning, visibility, etc? Again, in some cases I may do, but in others I have some clear guidance to follow. This might be challenging if I’m creating a mobile application with limited screen space. Or creating a voice or SMS application.
  • What if I just re-use the data as part of some back-end analysis, but none of that data is actually surfaced to the user? How do I attribute in this scenario?
  • Do I need to acknowledge the publisher, or link to the source page(s)?
  • What if I need to address multiple requirements, e.g. if I mashed up data from data.gov.uk, the Ordnance Survey, Freebase and OpenCorporates? That might get awkward.

There are no clear answers to these questions. For individual datasets I might be able to get guidance, but that requires me to read the detailed terms and conditions for the dataset or API I’m using. Isn’t the whole purpose of having off-the-shelf licenses like the OGL or ODbL to help us streamline data sharing? Attribution, or rather unclear or overly detailed attribution requirements, are a clear source of friction. Especially if there are legal consequences for getting it wrong.

And that’s just when we’re considering integrating data sources by hand. What about if we want to automatically combine data? How is a machine going to understand these conditions? I suspect that every Linked Data browser and application fails to comply with the attribution requirements of the data it’s consuming.

Of course these issues have been explored already. The Science Commons Protocol encourages publishing data into the public domain — so no legal requirement for attribution at all. It also acknowledges the “Attribution Stacking” problem (section 5.3), which occurs when trying to attribute large numbers of datasets, each with their own requirements. Too much friction discourages use, whether it’s research or commercial.

Unfortunately the recently published Amsterdam Manifesto on data citation seems to overlook these issues, requiring all authors/contributors to be attributed.

The scientific community may be more comfortable with a public domain licensing approach and a best-effort attribution model because it is supported by strong social norms: citation and attribution are essential to scientific discourse. We don’t have anything like that in the broader open data community. Maybe it’s not achievable, but it seems like clear guidance would be very useful.

There’s some useful background on problems with attribution and marking requirements on the Creative Commons wiki that also references some possible amendments and clarifications.

Can we converge on some common conventions?

So would it be possible to converge on a simple set of conventions or norms around data re-use? Ideally to the extent that attribution can be simplified and automated as far as possible.

How about the following:

  • Publishers should clearly describe their attribution requirements. Ideally this should be a short simple statement (similar to the Open Government License) which includes their name and a link to their homepage. This attribution could be included anywhere on the web site or application that consumes the data.
  • Publishers should be aware that the consumers of their data will be doing so in a variety of applications and on a variety of platforms. This means allowing a deal of flexibility around how/where attribution is displayed.
  • Publishers should clearly indicate whether attribution must be passed on to down-stream users
  • Publishers should separately document their citation requirements. If they want to encourage users to link to the dataset, or an individual page on their site, to allow users to find the original context, then they should publish instructions on how to do it. However, this kind of linking is for citation, so consumers shouldn’t be bound to include it
  • Consumers should comply with publishers’ wishes and include an about page on their site or within their application that attributes the originators of the data they use. Where feasible they should also provide citations to specific resources or datasets from within their applications. This provides their users with clear citations to sources of data
  • Both sides should collaborate on structured markup to support publication of these attribution and citation requirements, as well as harvesting of links

Whether attribution should be legally enforced is another discussion. Personally I’d be keen to see a common set of conventions regardless of the legal basis for doing it. Attribution should be a social norm that we encourage, strongly, in order to acknowledge the sources of our Open Data.

What Does Your Dataset Contain?

Having explored some ways that we might find related data and services, as well as different definitions of “dataset”, I wanted to look at the topic of dataset description and analysis. Specifically, how can we answer the following questions:

  • what kinds of information does this dataset contain?
  • what types of entity are described in this dataset?
  • how can I determine if this dataset will fulfil my requirements?

There’s been plenty of work done around trying to capture dataset metadata, e.g. VoiD and DCAT; there’s also the upcoming workshop on Open Data on the Web. Much of that work has focused on capturing the core metadata about a dataset, e.g. who published it, when was it last updated, where can I find the data files, etc. But there’s still plenty of work to be done here, to encourage broader adoption of best practices, and also to explore ways to expose more information about the internals of a dataset.

This is a topic I’ve touched on before, and which we experimented with in Kasabi. I wanted to move “beyond the triple count” and provide a “report card” that gave a little more insight into a dataset. A report card could usefully complement an ODI Open Data Certificate, for example. Understanding the composition of a dataset can also help support new ways of manipulating and combining datasets.

In this post I want to propose a conceptual framework for capturing metadata about datasets. It’s intended as a discussion point, so I’m interested in getting feedback. (I would have submitted this to the ODW workshop but ran out of time before the deadline.)

At the top level I think there are five broad categories of dataset information: Descriptive Data; Access Information; Indicators; Compositional Data; and Relationships. Compositional data can be broken down into smaller categories — this is what I described as an “information spectrum” in the Beyond the Triple Count post.

While I’ve thought about this largely from the perspective of Linked Data, I think it’s applicable to any format/technology.

Descriptive Data

This kind of information helps us understand a dataset as a “work”: its name, a human-readable description or summary, its license, and pointers to other relevant documentation such as quality control or feedback processes. This information is typically created and maintained directly by the data publisher, whereas the other categories of data I describe here can potentially be derived automatically by data analysis.

Examples:

  • Title
  • Description
  • License
  • Publisher
  • Subject Categories

Access Information

Basically, where do I get the data?

  • Where do I download the latest data?
  • Where can I download archived or previous versions of the data?
  • Are there mirrors for the dataset?
  • Are there APIs that use this data?
  • How do I obtain access to the data or API?

Indicators

This is statistical information that can help provide some insight into the data set, for example its size. But indicators can also build confidence in re-users by highlighting useful statistics such as the timeliness of releases, speed of responding to data fixes, etc.

While a data publisher might publish some of these indicators as targets that they are aiming to achieve, many of these figures could be derived automatically from an underlying publishing platform or service.

Examples of indicators:

  • Size
  • Rate of Growth
  • Date of Last Update
  • Frequency of Updates
  • Number of Re-users (e.g. size of user community, or number of apps that use it)
  • Number of Contributors
  • Frequency of Use
  • Turn-around time for data fixes
  • Number of known errors
  • Availability (for API based access)

Relationships

Relationship data primarily drives discovery use cases: to which other datasets does this dataset relate? For example the dataset might re-use identifiers or directly link to resources in other datasets. Knowing the source of that information can help us build trust in the reliability of the combined data, as well as give us sign-posts to other useful context. This is where Linked Data excels.

Annotation Datasets provide context to, and enrich other reference datasets. Annotations might be limited to linking information (“Link Sets”) or they may add new facts/properties about existing resources. Independently sourced quality control information could be published as annotations.

Provenance is also a form of relationship information. Derived datasets, e.g. created through analysis or data conversions, should refer to their original input datasets, and ideally also the algorithms and/or code that were applied.

Again, much of this information can be derived from data analysis. Recommendations for relevant related datasets might be created based on existing links between datasets or by analysing usage patterns. Set algebra on URIs in datasets can be used to do analysis on their overlap, to discover linkages and to determine whether one dataset contains annotations of another.
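As a sketch of that last idea: if we can extract the set of subject URIs from two datasets, then simple set operations expose their overlap. The parsing here is deliberately naive, just enough to show the technique; a SPARQL query would work equally well as the source of URIs.

```python
# Sketch: set algebra over subject URIs to compare two datasets.
# The parsing is naive (N-Triples, subject is the first token).

def subjects(dump_path):
    """Return the set of subject URIs found in an N-Triples dump."""
    with open(dump_path) as f:
        return {line.split(" ", 1)[0] for line in f if line.startswith("<")}

a = subjects("dataset-a.nt")
b = subjects("dataset-b.nt")

overlap = a & b
print(f"{len(overlap)} resources appear in both datasets")

# If nearly every subject in B already appears in A, then B looks
# like an annotation dataset: it adds facts about A's resources.
if b and len(overlap) / len(b) > 0.9:
    print("dataset-b appears to annotate dataset-a")
```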

Examples:

  • List of dataset(s) that this dataset draws on (e.g. re-uses identifiers, controlled vocabulary, etc)
  • List of datasets that this dataset references, e.g. via links
  • List of source datasets used to compile or create this dataset
  • List of datasets that link to this dataset (“back links”)
  • Which datasets are often used in conjunction with this dataset?

Compositional Data

This is information about the internals of a dataset: e.g. what kind of data does it contain, how is that data organized, and what kinds of things are being described?

This is the most complex area as there are potentially a number of different audiences and abilities to cater for. At one end of the spectrum we want to provide high level summaries of the contents of a dataset, while at the other end we want to provide detailed schema information to support developers. I’ve previously advocated a “progressive disclosure” approach to allow re-users to quickly find the data they need; a product manager looking for data to support a new feature will be looking for different information to a developer constructing queries over a dataset.

I think there are three broad ways that we can decompose Compositional Data further. There are particular questions and types of information that relate to each of them:

  • Scope or Coverage 
    • What kinds of things does this dataset describe? Is it people, places, or other objects?
    • How many of these things are in the dataset?
    • Is there a geographical focus to the dataset, e.g. a county, region, country or is it global?
    • Is the data confined to a particular time period (archival data) or does it contain recent information?
  • Structure
    • What are some typical example records from the dataset?
    • What schema does it conform to?
    • What graph patterns (e.g. combinations of vocabularies) are commonly found in the data?
    • How are various types of resource related to one another?
    • What is the logical data model for the data?
  • Internals
    • What RDF terms and vocabularies are used in the data?
    • What formats are used for capturing dates, times, or other structured values?
    • Are there custom validation rules for particular fields or properties?
    • Are there caveats or qualifiers to individual schema elements or data items?
    • What is the physical data model?
    • How is the dataset laid out in a particular database schema, across a collection of files, or named graphs?

The experiments we did in Kasabi around the report card (see the last slides for examples) were exploring ways to help visualise the scope of a dataset. It was based on identifying broad categories of entity in a dataset. I’m not sure we got the implementation quite right, but I think it was a useful visual indicator to help understand a dataset.
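To give a flavour of what such a report card might boil down to, here’s an illustrative sketch. The fields loosely follow the categories above and all of the values are invented.

```python
# Illustrative only: a "report card" summarising a dataset,
# loosely following the categories above. All values are invented.
report_card = {
    "descriptive": {
        "title": "Example Gazetteer",
        "license": "http://example.org/licence",
    },
    "indicators": {
        "size": 1250000,              # e.g. number of triples or records
        "last_updated": "2013-05-01",
        "update_frequency": "monthly",
    },
    "composition": {
        "entity_types": {"Place": 52000, "Postcode": 440000},
        "geographic_focus": "Great Britain",
    },
    "relationships": {
        "links_to": ["dbpedia.org"],
    },
}
```

Even a coarse summary like this answers the product manager’s questions before any developer-level detail is needed.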

This is a project I plan to revive when I get some free time. Related to this is the work I did to map the Schema.org Types to the Noun Project Icons.

Summary

I’ve tried to present a framework that captures most, if not all of the kinds of questions that I’ve seen people ask when trying to get to grips with a new dataset. If we can understand the types of information people need and the questions they want to answer, then we can create a better set of data publishing and analysis tools.

To date, I think there’s been a tendency to focus on the Descriptive Data and Access Information — because we want to be able to discover data — and its Internals — so we know how to use it.

But for data to become more accessible to a non-technical audience we need to think about a broader range of information and how this might be surfaced by data publishing platforms.

If you have feedback on the framework, particularly if you think I’ve missed a category of information, then please leave a comment. The next step is to explore ways to automatically derive and surface some of this information.

What is a Dataset?

As my last post highlighted, I’ve been thinking about how we can find and discover datasets and their related APIs and services. I’m thinking of putting together some simple tools to help explore and encourage the kind of linking that my diagram illustrated.

There’s some related work going on in a few areas which is also worth mentioning:

  • Within the UK Government Linked Data group there’s some work progressing around the notion of a “registry” for Linked Data that could be used to collect dataset metadata as well as supporting dataset discovery. There’s a draft specification which is open for comment. I’d recommend you ignore the term “registry” and see it more as a modular approach for supporting dataset discovery, lightweight Linked Data publishing, and “namespace management” (aka URL redirection). A registry function is really just one aspect of the model.
  • There’s an Open Data on the Web workshop in April which will cover a range of topics including dataset discovery. My current thoughts are partly preparation for that event (and I’m on the Programme Committee)
  • There’s been some discussion and a draft proposal for adding the Dataset type to Schema.org. This could result in the publication of more embedded metadata about datasets. I’m interested in tools that can extract that information and do something useful with it.

Thinking about these topics I realised that there are many definitions of “dataset”. Unsurprisingly it means different things in different contexts. If we’re defining models, registries and markup for describing datasets we may need to get a sense of what these different definitions actually are.

As a result, I ended up looking around for a series of definitions and I thought I’d write them down here.

Definitions of Dataset

Let’s start with the most basic; for example, Dictionary.com has the following definition:

“a collection of data records for computer processing”

Which is pretty vague. Wikipedia has a definition which derives from the term’s use in a mainframe environment:

“A dataset (or data set) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the dataset in question. It lists values for each of the variables, such as height and weight of an object. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows.

Nontabular datasets can take the form of marked up strings of characters, such as an XML file.”

The W3C Data Catalog Vocabulary defines a dataset as:

“A collection of data, published or curated by a single source, and available for access or download in one or more formats.”

The JISC “Data Information Specialists Committee” have a definition of dataset as:

“…a group of data files–usually numeric or encoded–along with the documentation files (such as a codebook, technical or methodology report, data dictionary) which explain their production or use. Generally a dataset is un-usable for sound analysis by a second party unless it is well documented.”

Which is a good definition, as it highlights that the dataset is more than just the individual data files or facts: it also consists of some documentation that supports its use or analysis. I also came across a document called “A guide to data development” (2007) from the National Data Development and Standards Unit in Australia, which describes a dataset as:

“A data set is a set of data that is collected for a specific purpose. There are many ways in which data can be collected—for example, as part of service delivery, one-off surveys, interviews, observations, and so on. In order to ensure that the meaning of data in the data set is clearly understood and data can be consistently collected and used, data are defined using metadata…”

This too has the notion of context and clear definitions to support usage, but also notes that the data may be collected in a variety of ways.

A Legal Definition

As it happens, there's also a legal definition of a dataset in the UK, at least as far as it relates to Freedom of Information. The Protection of Freedoms Act 2012, Part 6, Section 102 includes the following definition:

In this Act "dataset" means information comprising a collection of information held in electronic form where all or most of the information in the collection—

  • (a) has been obtained or recorded for the purpose of providing a public authority with information in connection with the provision of a service by the authority or the carrying out of any other function of the authority,
  • (b) is factual information which—
    • (i) is not the product of analysis or interpretation other than calculation, and
    • (ii) is not an official statistic (within the meaning given by section 6(1) of the Statistics and Registration Service Act 2007), and
  • (c) remains presented in a way that (except for the purpose of forming part of the collection) has not been organised, adapted or otherwise materially altered since it was obtained or recorded.

This definition is useful as it defines the boundaries for what type of data is covered by Freedom of Information requests. It clearly states that the data is collected as part of the normal business of the public body, and that the data is essentially "raw", i.e. it is not the result of analysis and has not been adapted or altered since it was obtained or recorded.

Raw data (as defined here!) is more useful as it supports more downstream usage. Raw data has more potential.

Statistical Datasets

The statistical community has also worked towards a clear definition of dataset. The OECD Glossary defines a Dataset as "any organised collection of data", but then includes context that describes it further: for example, that a dataset is a set of values that have a common structure and are usually thematically related. There's also a note that suggests a dataset may be made up of derived data:

“A data set is any permanently stored collection of information usually containing either case level data, aggregation of case level data, or statistical manipulations of either the case level or aggregated survey data, for multiple survey instances”

Privacy is one key reason why a dataset may contain derived information only.

The RDF Data Cube vocabulary, which borrows heavily from SDMX — a key standard in the statistical community — defines a dataset as being made up of several parts:

  1. “Observations – This is the actual data, the measured numbers. In a statistical table, the observations would be the numbers in the table cells.
  2. Organizational structure – To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation…
  3. Internal metadata – Having located an observation, we need certain metadata in order to be able to interpret it. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated?…
  4. External metadata — This is metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, and a SPARQL endpoint where it can be accessed.”

The SDMX implementors guide has a long definition of dataset (page 7) which also focuses on the organisation of the data and specifically how individual observations are qualified along different dimensions and measures.
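To make the first three parts concrete, here's a sketch of a single observation built with rdflib. The dimension and measure properties under example.org are invented; a real cube would define them in its data structure definition:

```python
# Sketch: one Data Cube observation with its dimensions, measure and unit.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/def/")  # invented dimension/measure properties

g = Graph()
obs = URIRef("http://example.org/data/population/obs1")

g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, URIRef("http://example.org/data/population")))
# 2. Organisational structure: the value of each dimension.
g.add((obs, EX.refArea, URIRef("http://example.org/id/area/bath")))
g.add((obs, EX.refPeriod, Literal("2011")))
# 1. The observation itself: the measured number (an invented figure).
g.add((obs, EX.population, Literal(88859)))
# 3. Internal metadata: the unit of measurement.
g.add((obs, EX.unitMeasure, URIRef("http://example.org/def/unit/persons")))

print(g.serialize(format="turtle"))
```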

Scientific and Research Datasets

Over the last few years the scientific and research community has been working towards making its datasets more open, discoverable and accessible. Organisations like the Wellcome Trust have published guidance for researchers on data sharing; services like CrossRef and DataCite provide the means for giving datasets stable identifiers; and platforms like FigShare support the publishing and sharing process.

While I couldn't find a definition of dataset from that community (happy to take pointers!), it's clear that the definition of dataset is extremely broad. It could cover anything from raw results, e.g. output from sensors or equipment, through to more analysed results. The boundaries are hard to define.

Given the broad range of data formats and standards, services like FigShare accept any and all data formats. But as the Wellcome Trust notes:

“Data should be shared in accordance with recognised data standards where these exist, and in a way that maximises opportunities for data linkage and interoperability. Sufficient metadata must be provided to enable the dataset to be used by others. Agreed best practice standards for metadata provision should be adopted where these are in place.”

This echoes the earlier definitions that included supporting materials as being part of the dataset.

RDF Datasets

I've mentioned a couple of RDF vocabularies already, but within the RDF and Linked Data community there are a couple of other definitions of dataset to be found. The Vocabulary of Interlinked Datasets (VoiD) is similar to, but predates, DCAT. Whereas DCAT focuses on describing a broad class of different datasets, VoiD describes a dataset as:

“…a set of RDF triples that are published, maintained or aggregated by a single provider…the term dataset has a social dimension: we think of a dataset as a meaningful collection of triples, that deal with a certain topic, originate from a certain source or process, are hosted on a certain server, or are aggregated by a certain custodian. Also, typically a dataset is accessible on the Web, for example through resolvable HTTP URIs or through a SPARQL endpoint, and it contains sufficiently many triples that there is benefit in providing a concise summary.”

Like the more general definitions this includes the notion that the data may relate to a specific topic or be curated by a single organisation. But this definition also makes some assumptions about the technical aspects of how the data is organised and published. VoiD also includes support for linking to the services that relate to a dataset.
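For illustration, here's a sketch of a small VoiD description using rdflib. The URIs are placeholders, but void:sparqlEndpoint, void:uriLookupEndpoint and void:exampleResource are the service links just mentioned:

```python
# Sketch: a VoiD description linking a dataset to its services.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

VOID = Namespace("http://rdfs.org/ns/void#")

g = Graph()
ds = URIRef("http://data.example.org/dataset")

g.add((ds, RDF.type, VOID.Dataset))
g.add((ds, DCTERMS.title, Literal("An example dataset")))
g.add((ds, VOID.sparqlEndpoint, URIRef("http://data.example.org/sparql")))
g.add((ds, VOID.uriLookupEndpoint, URIRef("http://data.example.org/lookup?uri=")))
g.add((ds, VOID.exampleResource, URIRef("http://data.example.org/id/thing/1")))

print(g.serialize(format="turtle"))
```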

Along the same lines, SPARQL also has a definition of a Dataset:

“A SPARQL query is executed against an RDF Dataset which represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI…”

Unsurprisingly for a technical specification this is a very narrow definition of dataset. It also differs from the VoiD definition. While both assume RDF as the means for organising the data, the VoiD term is more general, e.g. it glosses over details of the internal organisation of the dataset into named graphs. This results in some awkwardness when attempting to navigate between a VoiD description and a SPARQL Service Description.
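rdflib happens to implement this narrower model directly, so a quick sketch can show the shape of a SPARQL dataset: a default graph with no name, plus named graphs identified by IRIs (the IRIs below are placeholders):

```python
# Sketch: a SPARQL-style dataset of one default graph plus a named graph.
from rdflib import Dataset, Literal, URIRef
from rdflib.namespace import RDFS

ds = Dataset()

# Triples added to the dataset itself land in the unnamed default graph.
ds.add((URIRef("http://example.org/a"), RDFS.label, Literal("in the default graph")))

# Each named graph is identified by an IRI.
g = ds.graph(URIRef("http://example.org/graph/1"))
g.add((URIRef("http://example.org/b"), RDFS.label, Literal("in a named graph")))

# TriG shows the structure: default graph at the top level, named graphs in blocks.
print(ds.serialize(format="trig"))
```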

Summary

If you’ve gotten this far, then well done :)

I think there are a couple of things we can draw out from these definitions which might help us when discussing "datasets":

  • There's a clear sense that a dataset relates to a specific topic and is collected for a particular purpose.
  • The means by which a dataset is collected, and the definitions of its contents, are important for supporting proper re-use.
  • Whether a dataset consists of "raw data" or more analysed results can vary across communities. Both forms of dataset might be available, but in some circumstances (e.g. for privacy reasons) only derived data might be published.
  • Depending on your perspective and your immediate use case, the dataset may be just the data items, perhaps expressed in a particular way (e.g. as RDF). But in a broader sense, the dataset also includes the supporting documentation, definitions, licensing statements, etc.

While there’s a common core to these definitions, different communities do have slightly different outlooks that are likely to affect how they expect to publish, describe and share data on the web.

Dataset and API Discovery in Linked Data

I've recently been thinking about how applications can discover additional data and relevant APIs in Linked Data. While there's been lots of research done on finding and using (semantic) web services, I'm initially interested in supporting the kind of bootstrapping use cases covered by Autodiscovery.

We can characterise that use case as helping to answer the following kinds of questions:

  • Given a resource URI, how can I find out which dataset it is from?
  • Given a dataset URI, how can I find out which resources it contains and which APIs might let me interact with it?
  • Given a domain on the web, how can I find out whether it exposes some machine-readable data?
  • Where is the SPARQL endpoint for this dataset?

More succinctly: can we follow our nose to find all related data and APIs?

I decided to try and draw a diagram to illustrate the different resources involved and their connections. I’ve included a small version below:

Data and API Discovery with Linked Data

Let's run through the links between different types of resources:

  • From Dataset to SPARQL Endpoint (and Item Lookup, and Open Search Description): this is covered by VoiD, which provides simple predicates for linking a dataset to these three types of resources. I'm not aware of other types of linking yet, but it might be nice to support reconciliation APIs.
  • From Well-Known VoiD Description (background) to Dataset. This well-known URL allows a client to find the "top-level" VoiD description for a domain (there's a sketch of this lookup below). It's not clear what that entails, but I suspect the default option will be to serve a basic description of a single dataset, with references to sub-sets (void:subset) where appropriate. There might also just be rdfs:seeAlso links.
  • From a Dataset to a Resource. A VoiD description can include example resources; this blesses a few resources in the dataset with direct links. Ideally these resources ought to be good representative examples of resources in the dataset, but they might also be good starting points for further browsing or crawling.
  • From a Resource to a Resource Description. If you're using "slash" URIs in your data, then URIs will usually redirect to a resource description that contains the actual data. The resource description might be available in multiple formats, and clients can use content negotiation or follow Link headers to find alternative representations.
  • From a Resource Description to a Resource. A description will typically have a single primary topic, i.e. the resource it's describing. It might also reference other related resources, either as direct relationships between those resources or via rdfs:seeAlso type links ("more data over here").
  • From a Resource Description to a Dataset. This is where we might use a dct:source relationship to state that the current description has been extracted from a specific dataset.
  • From a SPARQL Endpoint (Service Description) to a Dataset. Here we run into some differences between definitions of dataset, but essentially we can describe in some detail the structure of the SPARQL dataset that is used in an endpoint and tie that back to the VoiD description. I found myself looking for a simple predicate that linked to a void:Dataset rather than describing the default and named graphs, but couldn’t find one.
  • I couldn’t find any way to relate a Graph Store to a Dataset or SPARQL endpoint. Early versions of the SPARQL Graph Store protocol had some notes on autodiscovery of descriptions, but these aren’t in the latest versions.

These links are expressed, for the most part, in the data but could also be present as Link headers in HTTP responses or in HTML (perhaps with embedded RDFa).
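Here's a sketch of the bootstrapping step itself, assuming a domain serves a VoiD description at the well-known URL. Given a domain name, a client can fetch that description and follow its nose to the endpoints (the domain is a placeholder):

```python
# Sketch: follow your nose from a domain to its datasets, endpoints
# and example resources via the well-known VoiD description.
from rdflib import Graph, Namespace

VOID = Namespace("http://rdfs.org/ns/void#")

def discover(domain):
    g = Graph()
    # rdflib fetches the URL and negotiates for an RDF serialisation.
    g.parse("http://%s/.well-known/void" % domain)
    for ds, endpoint in g.subject_objects(VOID.sparqlEndpoint):
        print("dataset %s has SPARQL endpoint %s" % (ds, endpoint))
    for ds, resource in g.subject_objects(VOID.exampleResource):
        print("dataset %s has example resource %s" % (ds, resource))

discover("data.example.org")
```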

I’ve also not covered sitemaps at all, which provide a way to exhaustively list the key resources in a website or dataset to support mirroring and crawling. But I thought this diagram might be useful.

I’m not sure that the community has yet standardised on best practices for all of these cases and across all formats. That’s one area of discussion I’m keen to explore further.
