Why are we still building portals?

The Geospatial Commission have recently published some guidance on Designing Geospatial Data Portals. There’s a useful overview in the accompanying blog post.

It’s good, clear guidance that should help anyone building a data portal. It has tips for designing search interfaces and for presenting results and dataset metadata.

There’s very little advice that is specifically relevant to geospatial data and little in the way of new insights in general. The recommendations echo lots of existing research, guidance and practice. But it’s always helpful to see best practices presented in an accessible way.

For guidance that is ostensibly about geospatial data portals, I would have liked to have seen more of a focus on geospatial data. This aspect is largely limited to recommending the inclusion of a geospatial search, spatial filtering and use of spatial data formats.

It would have been nice to see some suggestions around the useful boundaries to include in search interfaces, recommendations around specific GIS formats and APIs, or some exploration of how to communicate the geographic extents of individual datasets to users.

Fixing a broken journey

The guidance presents a typical user journey that involves someone using a search engine, finding a portal rather than the data they need, and then repeating their search in a portal.

Improving that user journey is best done at the first step. A portal is just getting in the way.

Data publishers should be encouraged to improve the SEO of their datasets if they really want them to be found and used.

Data publishers should be encouraged to improve the documentation and metadata on their “dataset landing pages” to help put that data in context.
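
As a rough illustration of what richer landing page metadata could look like, here is a minimal sketch that builds a schema.org Dataset description and wraps it in a JSON-LD script tag for embedding in a page. The function names, the example dataset and the example.org URLs are all hypothetical. The bounding box is written as "south west north east", which is my reading of how the schema.org GeoShape box property is commonly used; check the consuming search engine's guidance before relying on it.

```python
import json


def dataset_jsonld(name, description, landing_page, license_url, bbox,
                   download_url, media_type):
    """Build a schema.org Dataset description as a plain dict.

    bbox is (south, west, north, east) in decimal degrees (assumed order).
    """
    south, west, north, east = bbox
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": landing_page,
        "license": license_url,
        # Geographic extent of the dataset, expressed as a bounding box.
        "spatialCoverage": {
            "@type": "Place",
            "geo": {
                "@type": "GeoShape",
                "box": f"{south} {west} {north} {east}",
            },
        },
        # One entry per downloadable file or API endpoint.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": media_type,
                "contentUrl": download_url,
            }
        ],
    }


def as_script_tag(doc):
    """Serialise the description for embedding in an HTML landing page."""
    # Escape "</" so the JSON cannot close the script element early.
    body = json.dumps(doc, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical dataset, purely for illustration.
example = dataset_jsonld(
    name="Example cycle parking locations",
    description="Point locations of public cycle parking, updated monthly.",
    landing_page="https://example.org/data/cycle-parking",
    license_url="https://example.org/licence",
    bbox=(51.28, -0.51, 51.69, 0.33),
    download_url="https://example.org/data/cycle-parking.geojson",
    media_type="application/geo+json",
)
print(as_script_tag(example))
```

Markup like this sits on the publisher's own page, so it helps a general-purpose search engine find the dataset without a portal acting as the go-between.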

If we can improve this then we don’t have to support users in discovering a portal, checking whether it is relevant, teaching them to navigate it, etc.

We don’t really need more portals to improve discovery or use of data. We should be thinking about this differently.

There are many portals, but this one is mine

Portals are created for all kinds of purposes.

Many are just a fancy CMS for datasets that are run by individual organisations.

Others are there to act as hosts for data to help others make it more accessible. Some provide a directory of datasets across a sector.

Looking more broadly, portals support consumption of data by providing a single point of integration with a range of tools and platforms. They work as shared spaces for teams, enabling collaborative maintenance and sharing of research outputs. They also support data governance processes: you need to know what data you have in order to ensure you’re managing it correctly.

If we want to build better portals, then we really ought to have a clearer idea of what is being built, for whom and why.

This new guidance rightly encourages user research, but presumes building a portal as the eventual outcome.

I don’t mean that to be dismissive. There are definitely cases where it is useful to bring together collections of data to help users. But that doesn’t necessarily mean that we need to create a traditional portal interface.

Librarians exist

For example, in order to tackle specific challenges it can be useful to identify a set of relevant related data. This implies a level of curation — a librarian function — which is so far missing from the majority of portals.

Curated collections of data (& code & models & documentation & support) might drive innovation whilst helping ensure that data is used in ways that are mindful of the context of its collection. I’ve suggested recipes as one approach to that. But there are others.

Curation and maintenance of collections are less popular because they’re not easily automated. You need to employ people with an understanding of an issue, the relevant data, and how it might be used or not. To me this approach is fundamental to “publishing with purpose”.

Data agencies

Jeni has previously proposed the idea of “data agencies” as a means of improving discovery. The idea is briefly mentioned in this document.

I won’t attempt to capture the nuance of her idea, but it involves providing a service to support people in finding data via an expert help desk. The ONS already have something similar for their own datasets, but an agency could cover a whole sector or domain. It could also publish curated lists of useful data.

This approach would help broker relationships between data users and data publishers. This would not only help improve discovery, but also build trust and confidence in how data is being accessed, used and shared.

Actually linking to data?

I have a working hypothesis that, setting aside those that need to aggregate lots of small datasets from different sources, most data-enabled analyses, products and services typically only use a small number of related datasets. Maybe a dozen?

The same foundational datasets are used repeatedly in many different ways. The same combination of datasets might also be analysed for different purposes. It would be helpful to surface the most useful datasets and their combinations.

We have very little insight into this because dataset citation, linking and attribution practices are poor.

We could improve data search if this type of information was more readily available. Link analysis isn’t a substitute for good metadata, but it’s part of the overall puzzle in creating good discovery tools.

Actually linking to data when it’s referenced would also be helpful.
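
To make the link analysis idea a little more concrete, here is a minimal sketch of what it could look like if citations were available. It assumes we already have a mapping from each analysis, product or service to the datasets it references, which is exactly the information that is hard to get today. The work and dataset names below are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def usage_counts(citations):
    """Count how often each dataset, and each pair of datasets, is cited.

    citations maps a work (report, app, study) to the datasets it references.
    """
    datasets = Counter()
    pairs = Counter()
    for cited in citations.values():
        # De-duplicate and sort so each pair has one canonical ordering.
        unique = sorted(set(cited))
        datasets.update(unique)
        pairs.update(combinations(unique, 2))
    return datasets, pairs


# Hypothetical citation data, purely for illustration.
citations = {
    "flood-risk-report": ["postcodes", "boundaries", "river-levels"],
    "school-travel-app": ["postcodes", "boundaries", "bus-stops"],
    "air-quality-study": ["boundaries", "sensor-readings"],
}

datasets, pairs = usage_counts(citations)
print(datasets.most_common(1))  # [('boundaries', 3)]
print(pairs.most_common(1))     # [(('boundaries', 'postcodes'), 2)]
```

Even counts this simple would start to show which foundational datasets, and which combinations, are doing most of the work.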

Developing shared infrastructure

Portals often provide an opportunity to standardise how data is being published. As an intermediary they inevitably shape how data is published and used. This is another area where existing portals do little to improve their overall ecosystem.

But those activities aren’t necessarily tied to the creation and operation of a portal. Provision of shared platforms, open source tools, guidance, quality checkers, linking and aggregation tools, and driving development and adoption of standards can all be done in other ways.

It doesn’t matter how well designed your portal interface is if a user ends up at an out-of-date, poor quality or inaccessible dataset. Or if the costs of using it are too high. Or a lack of context contributes to it being misused or misinterpreted.

This type of shared infrastructure development doesn’t get funded because it’s not easy to automate. And it rarely produces something you can point at and say “we launched this”.

But it is vital to actually achieving good outcomes.

Portals as service failures

The need for a data portal is an indicator of service failure.

Addressing that failure might involve creating a new service. But we shouldn’t rule out reviewing existing services to see where data can be made more discoverable.

If a new service is required then it doesn’t necessarily have to be a conventional search engine.