This post is part of my ongoing series of basic questions about data, this time prompted by a tweet by Andy Dickinson asking the same question.
There are lots of open data portals. OpenDataMonitor lists 161 in the EU alone. The numbers have grown rapidly over the last few years. Encouraged by exemplars such as data.gov.uk they’re usually the first item on the roadmap for any open data initiative.
But what is a data portal and what role does it play?
A Basic Definition
I’d suggest that the most basic definition of an open data portal is:
A list of datasets with pointers to how those datasets can be accessed.
A web page on an existing website meets this criterion. It’s the minimum viable open data portal. And, quite rightly, this is still where many projects begin.
Once you have more than a handful of datasets then you’re likely to need something more sophisticated to help users discover datasets that are of interest to them. A more sophisticated portal will provide the means to capture metadata about each dataset and then use that to provide the ability to search and browse through the list, e.g. by theme, licence, or other facets.
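To make that concrete, here is a minimal sketch of a catalogue in Python: a list of dataset records, each with some metadata and a pointer to the data, plus simple browse and search functions. The field names, example datasets and example.org URLs are all made up for illustration, and real portals such as CKAN use much richer metadata models and proper search indexes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dataset:
    """One catalogue entry: descriptive metadata plus a pointer to the data."""
    title: str
    url: str  # where the data can be accessed (hypothetical URLs below)
    theme: str
    licence: str
    keywords: List[str] = field(default_factory=list)


def browse(catalogue, **facets):
    """Return the datasets whose metadata matches every given facet value."""
    return [
        d for d in catalogue
        if all(getattr(d, name) == value for name, value in facets.items())
    ]


def search(catalogue, term):
    """Case-insensitive match of a term against titles and keywords."""
    term = term.lower()
    return [
        d for d in catalogue
        if term in d.title.lower() or any(term in k.lower() for k in d.keywords)
    ]


# Illustrative entries only; none of these datasets or URLs are real.
catalogue = [
    Dataset("Bus stops", "https://example.org/bus-stops.csv",
            "transport", "OGL", ["bus", "stops"]),
    Dataset("Cycle counts", "https://example.org/cycle-counts.csv",
            "transport", "CC-BY", ["cycling", "traffic"]),
    Dataset("Air quality readings", "https://example.org/air-quality.csv",
            "environment", "OGL", ["pollution", "environment"]),
]

transport = browse(catalogue, theme="transport")
open_transport = browse(catalogue, theme="transport", licence="OGL")
air = search(catalogue, "air")
```

Browsing by a facet is just filtering on a metadata field, which is why capturing consistent metadata up front matters so much for discovery.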
Portals rarely place any restrictions on the type of data that is catalogued or the means by which data is accessed. However more sophisticated portals offer additional capabilities for both the end user and the publisher.
Publisher features include:
- File storage to make it easier to get data made available online
- Additional curation tools, e.g. addition of custom metadata, creation of collections, and promotion of datasets
- Integrated data stores, e.g. to allow data files to be uploaded into a database that will allow data to be queried and accessed by users in more sophisticated ways
User features include:
- Notification tools to alert to the publication of new or updated datasets
- Integrated and embeddable visualisations to support manipulation and use of data directly in the portal or in other websites
- Automatically generated APIs to allow for more sophisticated online querying and interaction with datasets
- Engagement tools such as rating, discussions and publisher feedback channels
There are a number of open source and commercial data portal platforms, including CKAN, Socrata and OpenDataSoft. All of these offer a mixture of the features outlined above.
Who uses data portals?
Right now the target customer for a data portal is likely to be a public sector organisation, e.g. a local authority, city administration or government department that is looking to publish a number of datasets.
But the users of a data portal are a mixture of all the different parts of the open data community: individual citizens, developers or civic hackers, data journalists, public sector officials, commercial developers, etc.
Balancing the needs of these different constituents is difficult:
- The customer wants to see some results from publishing their data as soon as possible, so instant access to visualisations and exploration tools gives immediate utility and benefit
- Data analysts or designers will likely just want to download the data so they can make more sophisticated use of it
- Web and mobile developers often want an API to allow them to quickly build an application, without setting up infrastructure and a custom data processing pipeline
- A citizen, assuming they wander in at all, is likely to want some fairly simple data exploration tools, ideally wrapped up in some narrative that puts the data into context and helps tell a story
Depending on where you sit in the community you may think that current data portals are either fantastic or are under-serving your needs.
The business model and target market of the portal developer are also likely to affect how well they serve different communities. APIs, for example, support the creation of platforms that help embed the portal into an ecosystem.
There are enterprise data portals too. Large enterprises have exactly the same problems as exist in the wider open data community: it’s often not clear what data is available or how to access it.
For example Microsoft has the Azure Data Catalog. This has been around for quite a few years now in various incarnations. There are also tools like Tamr Catalog.
They both have similar capabilities – collaborative cataloguing of datasets within an enterprise – and both are tied into a wider ecosystem of data processing and analytics tools.
How might data portals evolve in the future?
I think there’s still plenty of room to develop new features to better serve different audiences.
For example, none of the existing catalogues really help me publish some data and then tell a story with it. A story is likely to consist of a mixture of narrative and visualisations, perhaps spanning multiple datasets. This might best be served by making it easier to embed different views of data into blog posts rather than building additional content management features into the catalogue itself. But for a certain audience, e.g. data journalists and media organisations, this might be a useful package.
Better developer tooling, e.g. data syndication and schema validation, would help serve data scientists that are building custom workflows against data that is downloaded or harvested from data portals. This is a way to explore a platform approach that doesn’t necessarily require downstream users to use the portal APIs to query the data – just syndication of updates and notifications of changes.
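As a rough illustration of the kind of tooling I mean, the sketch below shows two checks a downstream workflow might run on a harvested CSV file: a content fingerprint to notice that a dataset has changed, and a comparison of the header against the columns the workflow expects. The function names and sample data are my own assumptions, not features of any existing portal, and a real schema check would also cover types and constraints.

```python
import csv
import hashlib
import io


def fingerprint(text):
    """Hash of the file contents, used to detect that a dataset has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_columns(text, expected):
    """Compare the CSV header row against the expected column names."""
    header = next(csv.reader(io.StringIO(text)), [])
    return {
        "missing": [c for c in expected if c not in header],
        "unexpected": [c for c in header if c not in expected],
    }


# Two made-up versions of the same dataset: a column was renamed upstream.
previous = "id,name\n1,Central Library\n"
latest = "id,label\n1,Central Library\n"

changed = fingerprint(previous) != fingerprint(latest)
problems = check_columns(latest, expected=["id", "name"])
```

If a portal syndicated this kind of information alongside each update, a workflow could fail early on a renamed column instead of silently producing bad results.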
Another area is curation and data management tools, e.g. features to support multiple people in creating and managing a dataset directly in the portal itself. This might be useful for small-scale enterprise uses as well as supporting collaboration around open datasets.
Automated analysis of hosted data is another area in which data portals could develop features that would support both the publishers and developers. Some metadata about a dataset, e.g. to help describe its contents, could be derived by summarising features of the data rather than requiring manual data entry.
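Here is a small sketch of what deriving metadata from the data itself could look like, using only the Python standard library. It reads a CSV file and reports the row count plus, for each column, a guessed type, the number of missing values and the number of distinct values. The type-guessing rules and the sample data are simplifying assumptions for illustration; a production tool would need to handle dates, encodings and much larger files.

```python
import csv
import io


def guess_type(values):
    """Guess a column type by trying progressively looser conversions."""
    if not values:
        return "empty"
    for label, cast in (("integer", int), ("number", float)):
        try:
            for v in values:
                cast(v)
            return label
        except ValueError:
            continue
    return "string"


def summarise_csv(text):
    """Derive basic descriptive metadata from the contents of a CSV file."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    summary = {"row_count": len(rows), "columns": {}}
    for name in reader.fieldnames or []:
        # Treat empty cells and short rows as missing values.
        values = [r[name] for r in rows if r[name] not in (None, "")]
        summary["columns"][name] = {
            "type": guess_type(values),
            "missing": len(rows) - len(values),
            "distinct": len(set(values)),
        }
    return summary


# Made-up sample data for illustration.
sample = """station,reading,date
A1,12.5,2016-01-01
A2,,2016-01-01
A1,14,2016-01-02
"""

summary = summarise_csv(sample)
```

Even a summary this basic would let a portal pre-fill parts of a dataset description and flag columns with lots of missing values to the publisher.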
Regardless of how they evolve in terms of features, data portals are likely to remain a key part of open data infrastructure. However as Google and others begin doing more to index the contents of datasets, it may be that the users of portals increasingly become machines rather than humans.