In this post I want to highlight what I think are some fairly large gaps in the standards we have for publishing and consuming data on the web. My purpose for writing these down is to try and fill in gaps in my own knowledge, so leave a comment if you think I’m missing something (there’s probably loads!)
To define the scope of those standards, let’s try to answer two questions.
Question 1: What are the various activities that we might want to carry out around an open dataset?
- A. Discover the metadata and documentation about a dataset
- B. Download or otherwise extract the contents of a dataset
- C. Manage a dataset within a platform, e.g. create and publish it, update or delete it
- D. Monitor a dataset for updates
- E. Extract metrics about a dataset, e.g. a description of its contents or quality metrics
- F. Mirror a dataset to another location, e.g. exporting its metadata and contents
- G. Link or reconcile some data against a dataset or register
Question 2: What are the various activities that we might want to carry out around an open data catalogue?
- V. Find whether a dataset exists, e.g. via a search or similar interface
- X. List the contents of the platform, e.g. its datasets or other published assets
- Y. Manage user accounts, e.g. to create accounts, or grant or remove rights from specific accounts
- Z. Extract usage statistics, e.g. metrics on use of the platform and the datasets it contains
Now, based on that quick review: which of these areas of functionality are covered by existing standards?
- DCAT and its extensions give us a way to describe a dataset (A) and can be used to find download links, which addresses part of (B). But it doesn’t say how the metadata is to be discovered by clients.
- The draft Data Quality Vocabulary starts to address parts of (E) but also doesn’t address discovery of published metrics
- OData provides a means for querying and manipulating data via a RESTful interface (B, C), although I don’t think it recognises a dataset as such, just resources exposed over the web
- SPARQL (query, update, etc) also provides a means for similar operations (B, C), but on RDF data.
- The Linked Data Platform specification also offers a similar set of functionality (B, C)
- If a platform exposes its catalogue using DCAT then a client could use that to list its contents (X)
- The draft Linked Data Notifications specification covers monitoring and synchronising of data (D)
- Data Packages provide a means for packaging the metadata and contents of a dataset for download and mirroring (B, F)
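To make the last item concrete, here is a minimal sketch of how a mirroring client might work out which files to copy from a Data Package descriptor. The descriptor contents, the example.org base URL and the `files_to_mirror` helper are all made up for illustration, and I’m assuming a resource’s `path` is either a single string or a list of strings.

```python
import json
from urllib.parse import urljoin

# Hypothetical datapackage.json, trimmed to the fields used below.
DESCRIPTOR = json.loads("""
{
  "name": "example-dataset",
  "title": "Example dataset",
  "resources": [
    {"name": "observations", "path": "data/observations.csv", "format": "csv"},
    {"name": "sites", "path": ["data/sites-1.csv", "data/sites-2.csv"], "format": "csv"}
  ]
}
""")


def files_to_mirror(descriptor, base_url):
    """Return (resource name, absolute URL) pairs for every file in the package."""
    files = []
    for resource in descriptor.get("resources", []):
        paths = resource.get("path", [])
        if isinstance(paths, str):
            # Assumption: a single path is given as a plain string.
            paths = [paths]
        for path in paths:
            # Relative paths are resolved against the descriptor's location.
            files.append((resource.get("name"), urljoin(base_url, path)))
    return files


mirror_list = files_to_mirror(
    DESCRIPTOR, "https://example.org/packages/example-dataset/"
)
for name, url in mirror_list:
    print(name, url)
```

Note that the client still has to be told where the descriptor lives in the first place, which is the discovery gap again.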
I think there are a number of obvious gaps around discovery and platform (portal) functionality. API and metadata discovery is also something that could usefully be addressed.
If you’re managing and publishing data as RDF and Linked Data then you’re slightly better covered at least in terms of standards, if not in actual platform and tool support. The majority of current portals don’t manage data as RDF or Linked Data. They’re focused on either tabular or maybe geographic datasets.
This means that portability among the current crop of portals is actually pretty low. Moving between platforms means moving between entirely different sets of APIs and workflows. I’m not sure that’s ideal. I don’t feel like we’ve yet created a very coherent set of standards.
What do you think? What am I missing?
One thought on “Current gaps in the open data standards framework”
A. (Discover the metadata about a dataset) The US government recommends publishing metadata about datasets on a domain at /data.json (https://project-open-data.cio.gov/v1.1/schema/). This has been adopted by Socrata and others. RFC 5785 (https://tools.ietf.org/html/rfc5785), which defines well-known URIs, and many others provide more generic ways of assisting discovery. If you have a specific dataset URL, you can sometimes access the metadata by adding .json or .rdf to the URL. More generically, HTML link elements can be used to provide rel="alternate" links. See also http://rdf-vocabulary.ddialliance.org/discovery.html
A. (Discover the documentation about a dataset) Isn’t this covered by existing DCAT metadata? (description, dataDictionary, etc.)
B. (Download or otherwise extract the contents of a dataset) In what way isn’t this entirely covered by DCAT? Do you mean extracting a subset of a dataset?
D. (Monitor a dataset for updates) You mention a ‘push’ solution, but, for most use cases, ‘pull’ is fine, for which monitoring a dataset’s modified property (if its metadata is machine-readable) seems sufficient. http://dat-data.com/ is another solution.
E. There are several projects extracting metrics from a dataset, but the metrics are typically specific to a use case. I’m not sure that there’s a real generic use case for ‘dataset metrics’ writ large.
F. Data federation is essentially solved but not widely adopted. I list the solutions on page 11 of http://public.opennorth.ca.s3.amazonaws.com/reports/ODWG-Standards-1st-deliverable-Gaps-and-opportunities-for-standardization.pdf#page=11
G. Domain-specific OpenRefine-based tools exist. The Metaweb Query Language developed for Freebase also provides a good base for building reconciliation tools.
I’m not sure that C, E, Y or Z are actually important large gaps. They certainly don’t seem to be important barriers to achieving impact with open data.
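As a rough illustration of points A and D together, the /data.json convention and the ‘pull’ approach could be combined in a small polling client along these lines. This is only a sketch: the two catalogue snapshots are invented sample data standing in for what would be fetched on successive polls, the helper names are hypothetical, and the only fields assumed from the Project Open Data schema are `dataset`, `identifier`, `modified`, `distribution` and `downloadURL`.

```python
from urllib.parse import urljoin


def catalogue_url(site):
    """Project Open Data convention: the catalogue sits at /data.json on the domain."""
    return urljoin(site, "/data.json")


def index_by_identifier(catalogue):
    """Map each dataset identifier to its 'modified' value and download URLs."""
    index = {}
    for dataset in catalogue.get("dataset", []):
        downloads = [
            dist["downloadURL"]
            for dist in dataset.get("distribution", [])
            if "downloadURL" in dist
        ]
        index[dataset["identifier"]] = {
            "modified": dataset.get("modified"),
            "downloads": downloads,
        }
    return index


def changed_since(previous, current):
    """Identifiers that are new, or whose 'modified' value differs from the last poll."""
    return sorted(
        identifier
        for identifier, entry in current.items()
        if identifier not in previous
        or previous[identifier]["modified"] != entry["modified"]
    )


# Invented snapshots of the same catalogue, as parsed from two successive polls.
LAST_POLL = {"dataset": [
    {"identifier": "budget-2016", "modified": "2016-01-01",
     "distribution": [{"downloadURL": "https://example.gov/budget.csv"}]},
    {"identifier": "parks", "modified": "2016-01-05", "distribution": []},
]}
THIS_POLL = {"dataset": [
    {"identifier": "budget-2016", "modified": "2016-02-01",
     "distribution": [{"downloadURL": "https://example.gov/budget.csv"}]},
    {"identifier": "parks", "modified": "2016-01-05", "distribution": []},
    {"identifier": "trees", "modified": "2016-02-03", "distribution": []},
]}

changed = changed_since(index_by_identifier(LAST_POLL), index_by_identifier(THIS_POLL))
print(catalogue_url("https://example.gov/some/page"))
print(changed)
```

This only works when publishers keep the modified property up to date, and it cannot notice datasets that have been removed without an extra check in the other direction.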