In this post I want to highlight what I think are some fairly large gaps in the standards we have for publishing and consuming data on the web. My purpose for writing these down is to try to fill in gaps in my own knowledge, so leave a comment if you think I’m missing something (there’s probably loads!).
To define the scope of those standards, let’s try to answer two questions.
Question 1: What are the various activities that we might want to carry out around an open dataset?
- A. Discover the metadata and documentation about a dataset
- B. Download or otherwise extract the contents of a dataset
- C. Manage a dataset within a platform, e.g. create and publish it, update or delete it
- D. Monitor a dataset for updates
- E. Extract metrics about a dataset, e.g. a description of its contents or quality metrics
- F. Mirror a dataset to another location, e.g. exporting its metadata and contents
- G. Link or reconcile some data against a dataset or register
Question 2: What are the various activities that we might want to carry out around an open data catalogue?
- W. Find whether a dataset exists, e.g. via a search or similar interface
- X. List the contents of the platform, e.g. its datasets or other published assets
- Y. Manage user accounts, e.g. to create accounts, or grant or remove rights from specific accounts
- Z. Extract usage statistics, e.g. metrics on use of the platform and the datasets it contains
Now, based on that quick review: which of these areas of functionality are covered by existing standards?
- DCAT and its extensions give us a way to describe a dataset (A) and can be used to find download links, which addresses part of (B). But it doesn’t say how that metadata is to be discovered by clients.
- The draft Data Quality Vocabulary starts to address parts of (E), but also doesn’t address discovery of the published metrics.
- OData provides a means for querying and manipulating data via a RESTful interface (B, C), although I don’t think it recognises a dataset as such, just resources exposed over the web.
- SPARQL (query, update, etc) also provides a means for similar operations (B, C), but on RDF data.
- The Linked Data Platform specification also offers a similar set of functionality (B, C).
- If a platform exposes its catalogue using DCAT then a client could use that to list its contents (X).
- The draft Linked Data Notifications specification covers monitoring and synchronising of data (D).
- Data Packages provide a means for packaging the metadata and contents of a dataset for download and mirroring (B, F).
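To make the DCAT point concrete, here’s a minimal sketch of a catalogue and dataset description in Turtle. All of the example.org URIs, titles and dates are invented, and note the gap I mentioned: nothing in the standard tells a client where to fetch this document from in the first place.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# A hypothetical catalogue listing one dataset (X)
<http://example.org/catalogue> a dcat:Catalog ;
    dct:title "Example Data Catalogue" ;
    dcat:dataset <http://example.org/dataset/bathing-water-quality> .

# Dataset metadata (A) with a download link (part of B)
<http://example.org/dataset/bathing-water-quality> a dcat:Dataset ;
    dct:title "Bathing Water Quality" ;
    dct:modified "2017-03-01"^^xsd:date ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:downloadURL <http://example.org/data/bathing-water-quality.csv> ;
        dcat:mediaType "text/csv"
    ] .
```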
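Data Packages take a different tack: the metadata travels with the data as a `datapackage.json` file at the root of the package, which is what makes download and mirroring (B, F) straightforward. A minimal sketch, with illustrative names and paths:

```json
{
  "name": "bathing-water-quality",
  "title": "Bathing Water Quality",
  "resources": [
    {
      "name": "readings",
      "path": "data/readings.csv",
      "format": "csv"
    }
  ]
}
```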
I think there are a number of obvious gaps around discovery and platform (portal) functionality. API and metadata discovery is also something that could usefully be addressed.
If you’re managing and publishing data as RDF and Linked Data then you’re slightly better covered at least in terms of standards, if not in actual platform and tool support. The majority of current portals don’t manage data as RDF or Linked Data. They’re focused on either tabular or maybe geographic datasets.
This means that portability among the current crop of portals is actually pretty low. Moving between platforms means moving between entirely different sets of APIs and workflows. I’m not sure that’s ideal. I don’t feel like we’ve yet created a very coherent set of standards.
What do you think? What am I missing?