The limitations of the open banking licence

The Open Banking initiative recently began publishing specifications, guidance and data through its website. If you’re not already aware of the initiative, it was created as a direct result of government reforms that aim to encourage the banking sector to be more open and innovative. The CMA undertook a lengthy consultation period during which the ODI coordinated work on the Open Banking Standard report.

The recommendations from that report and the CMA ruling were clear. Banks have to:

  • publish open data about their products, branches and locations, and
  • develop and provide open APIs to support access to other data, e.g. the transaction history on your account.

Unfortunately, while the banks are moving in that direction, the data they are publishing is not open data.

The Open Definition is the definitive description of what makes content and data open. It describes certain freedoms that are essential to maximise the value of publishing data under an open licence.

I think publishing open data is what the CMA and others really intended. It’s also clearly spelt out in the Open Banking report. But unfortunately something has been lost in translation. The Open Banking Licence does not conform to the Open Definition.

Owen Boswarva has given a detailed review on his blog. For a review of the impacts of non-open licences you can read the ODI guidance which I helped to draft.

Rather than recap that guidance here, I thought it might be useful to try to spell out where the limitations in the Open Banking licence will impact reuse of the data. This is based on my early explorations with the public data.

Exploring the limitations of the open banking licence

The Open Banking API dashboard provides direct access to the currently available data. It includes data on the ATMs provided by each of the participating banks, their branches and products.

The data is published as JSON, a commonly used data format that is easy for developers to work with.

I can’t freely distribute the data

The first thing I did was to build a public map of all of the ATM data. To do this I had to convert the data from JSON to CSV which I could then load into an online mapping tool (Carto).
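For illustration, the conversion itself only takes a few lines of scripting. The sketch below assumes a simplified, flat JSON structure with hypothetical field names; the real Open Banking ATM feed is more deeply nested and varies a little between banks.

```python
# A minimal sketch of the JSON-to-CSV conversion. The "data" array and the
# field names here are hypothetical; the actual Open Banking ATM payload is
# more deeply nested, so the extraction logic would need adjusting.
import csv
import json

with open("atms.json", encoding="utf-8") as f:
    payload = json.load(f)

with open("atms.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "brand", "latitude", "longitude"])
    for atm in payload.get("data", []):
        coords = atm.get("coordinates", {})
        writer.writerow([
            atm.get("id"),
            atm.get("brand"),
            coords.get("latitude"),
            coords.get("longitude"),
        ])
```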

This is a permitted use under the Open Banking licence. The conversion of the data from JSON to CSV, and the creation of a map, are explicitly allowed in the licence. Section 2.1(c) says that I am allowed “to adapt the Open Data into different formats for the purposes of data mapping (or otherwise for display or presentational purposes)”.

But that clause means that:

  • I can’t share the CSV version of the data. Data in CSV format is useful to many more potential reusers. Many analytics tools support CSV but won’t support custom JSON documents. Because I can’t distribute the alternative version, fewer people can immediately use the data.
  • I had to keep the dataset private in my Carto account. I’m lucky enough to have a personal account that lets me keep data private. Most freely available online tools allow people to use their services for free, so long as they’re using open data. If I was allowed to share the data with other Carto users, anyone could use it in their own maps. People without a paid Carto account can’t use this data. The result is, again, that fewer people can get immediate benefit from it.

The ability to freely convert and distribute data is a key part of the open definition. It allows data re-users to support each other in using the data by making it available in alternative formats and on all available platforms.

At the moment we are only allowed to copy, re-use, publish and distribute data so long as we don’t change it.

I’m limited in using the data to enrich other services and products

Because I can’t distribute the data, I can’t take what has been provided and use it to improve an existing system. For example, I don’t believe I can use the data to add missing ATM locations to OpenStreetMap.

The terms of the Open Banking licence are not compatible with the OpenStreetMap licence. Because it is a custom licence, rather than an existing standard open licence, resolving that issue will require legal advice.

OSM requires contributors to be extra cautious when adding data from other sources. They suggest getting explicit written agreement. This takes time and effort. That doesn’t seem to be achieving the desired outcome of a more open banking sector.

The licence is also revocable. At any time the banks can revoke my ability to use the data. Open licences, like the Creative Commons licences, are not revocable. This means I’m exposing myself to legal and commercial risks if I build the data into a product or service. I would need to take legal advice on that.

I can’t improve the data

After creating a basic map of ATM locations, I wanted to link the data with other sources. Data becomes more valuable once it’s linked together.

I opened my CSV version of the data in a free, open source desktop GIS tool called QGIS. Using the standard features of that tool I was able to match the geographic coordinates in the ATM data against openly licensed geographic data from the Office for National Statistics.

This generated an enriched dataset in which every ATM was now linked to an LSOA. An LSOA (Lower layer Super Output Area) is a statistical area used by the ONS and others to help publish statistics about the UK. There are many statistical datasets that are reported against these areas.
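The enrichment step can also be scripted rather than done interactively. Here’s a rough sketch of the same point-in-polygon matching using geopandas instead of QGIS; the file names and column names are assumptions, and the LSOA boundary file would need to be downloaded from the ONS separately.

```python
# A rough sketch of matching ATM coordinates to LSOAs with geopandas
# (recent versions, which support the "predicate" argument to sjoin).
# File names and the "LSOA11CD" column name are assumptions based on how
# ONS boundary files are typically published.
import geopandas as gpd
import pandas as pd

atms = pd.read_csv("atms.csv")
atm_points = gpd.GeoDataFrame(
    atms,
    geometry=gpd.points_from_xy(atms["longitude"], atms["latitude"]),
    crs="EPSG:4326",
)

# ONS boundary files are usually in British National Grid (EPSG:27700),
# so reproject the points to match before joining
lsoas = gpd.read_file("lsoa_boundaries.geojson")
atm_points = atm_points.to_crs(lsoas.crs)

# Point-in-polygon join: each ATM picks up the code of the LSOA containing it
enriched = gpd.sjoin(atm_points, lsoas[["LSOA11CD", "geometry"]],
                     how="left", predicate="within")
enriched.drop(columns="geometry").to_csv("atms_with_lsoa.csv", index=False)
```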

Having completed that enrichment process I could now start to explore the data in the context of official statistics on demographics. There are many interesting questions that I can now ask of the data. But other people might also have interesting uses for that enriched dataset.

The process of doing the enriching is quite technical. I’m comfortable with teaching myself how to do that. But it would be great if I could help other people unlock value by letting them explore the enriched data.

Unfortunately I can’t share my enriched version with them. I’m not allowed to change any of the content of the data, or distribute it in alternate forms. The best I could do is tweet out a few interesting insights.

I am discouraged from using the data

One way I could use the enriched data is to explore how ATM and branch locations might relate to deprivation or other demographic statistics. This might highlight patterns in how individual banks have chosen to site their branches.

I could also monitor the data over time and build up a picture of where ATMs and branches are opening and closing around the country. Or explore the changing mix of products available from individual banks.

Unfortunately I don’t think I can do that. Clause 3.1(b) of the licence states that I must not “use or present the Open Data or any analysis of it in a way that is unfair or misleading, for example comparisons must be based on objective criteria and not be prejudiced by commercial interests”.

It’s not clear to me what unfair or misleading means. Unfair to the banks? Unfair to consumers? What type of objective criteria are acceptable?

If I were working for a fintech startup, I could perhaps use the data to identify new financial products that could be offered to consumers. I think that’s the type of innovation that the CMA wanted to encourage?

But if I do that and share my analysis with others, then am I “prejudiced by commercial interests”? The licence says I can use the data commercially, but seems to discourage certain types of commercial usage.

These types of broad, under-defined clauses in licences discourage reuse. They create uncertainty around what is actually permitted under the terms of the licence. This reduces the likelihood of people using the data, unless they can cover the cost of the legal advice needed to remove the uncertainty.

I have probably already broken the terms of the licence

I think I may have already broken the terms of the licence. As a bit of fun I’ve created a Twitter account called @allthebarclays. Every day it tweets out a picture of a branch of Barclays along with its name and unique identifier.

I’m probably not allowed to do that. The photos in the data don’t have a licence attached to them, so I’m hoping that if challenged, I can justify it under fair use.

The account is clearly a joke. It’s of no real use to anyone. But it gave me a focus for my explorations with the data.

It was also a deliberate attempt to show how the data could be used to create something which is far from its original intended use. Encouraging unexpected uses of the data is one of the primary goals of publishing open data. It’s the unexpected uses that are most likely to hit the types of limitations that I’ve outlined above.

How does this get resolved?

There are several ways in which these issues could begin to be addressed. There are measures that the initiative could take that would address some specific limitations, or it could take steps to address all of them. For example, the Open Banking Initiative could:

  1. Publish data in other formats, e.g. by providing a CSV download. This would explicitly address one part of the first issue I highlighted, but none of the real concerns
  2. Publish some guidance for reusers that clarifies some of the terms of its existing licence. This might avoid discouraging some uses of the data but again, it doesn’t address the primary issues. The data would still not be open data
  3. Revise its licence to remove the problematic clauses and create an open data licence. This would ideally go through the licence approval process. This would address all of the concerns
  4. Drop the licence completely in favour of the Creative Commons Attribution licence (CC-BY 4.0). This would address all of the concerns with the added benefit that it would be explicitly clear to all users that the data could be freely and easily mixed with other open data

Only the last two options would be substantial progress.

What’s needed is for someone at the Open Banking initiative (or perhaps the CMA?) to step up and take responsibility for addressing the issues. Unfortunately, until that happens, the initiative is just another example of open washing.

What is data asymmetry?

You’ve just parked your car. Google Maps offers to record your current location so you can find where you parked your car. It also lets you note how much parking time you have available.

Sharing this data allows Google Maps to provide you with a small but valuable service: you can quickly find your car and avoid having to pay a fine.

For you that data has a limited shelf-life. It’s useful to know where you are currently parked, but much less useful to know where you were parked last week.

But that data has much more value to Google because it can be combined with the data from everyone else who uses the same feature. When aggregated that data will tell them:

  • The location of all the parking spaces around the world
  • Which parking spaces are most popular
  • Whether those parking spaces are metered (or otherwise time-limited)
  • Which parking spaces will become available in the next few hours
  • When combined with other geographic data, it can tell them the places where people usually park when they visit other locations, e.g. specific shops or venues
  • …etc

That value only arises when many data points are aggregated together. And that data remains valuable for a much longer period.

With access to just your individual data point Google can offer you a simple parking reminder service. But with access to the aggregate data points they can extract further value. For example by:

  • Improving their maps, using the data points to add parking spaces or validate those that they may already know about
  • Suggesting a place to park as people plan a trip into the same city
  • Creating an analytics solution that provides insight into where and when people park in a city
  • …etc

The term data asymmetry refers to any occasion when there is a disparity in access to data. In all cases this results in the data steward being able to unlock more value than a contributor.

A simple illustration using personal data

When does data asymmetry occur?

Broadly, data asymmetry occurs in almost every single digital service or application. Anyone running an application automatically has access to more information than its users. In almost all cases there will be a database of users, content or transaction histories.

Data asymmetry, and the resulting imbalances of power and value, are most often raised in the context of personal data. Social networks mining user information to target adverts, for example. This prompts discussion around how to put people back in control of their own data as well as encouraging individuals to be more aware of their data.

Apart from social networks, other examples of data asymmetry that relate to personal data include:

  • Smart meters that provide you with a personal view of your energy consumption, whilst providing energy companies with an aggregated view of consumption patterns across all consumers
  • Health devices that track and report on fitness and diet, whilst developing aggregated views of health across their population of users
  • Activity loggers like Strava that allow you to record your individual rides, whilst developing an understanding of mobility and usage of transport networks across a larger population

But because asymmetry is so prevalent it occurs in many other areas; it’s not an issue that is specific to personal data. Any organisation that offers a B2B digital service will also be involved in data asymmetry. Examples include:

  • Accounting packages that allow better access to business information, whilst simultaneously creating a rich set of benchmarking data on organisations across an industry
  • Open data portals that will have metrics and usage data on how users of the service are finding and consuming data
  • “Sharing economy” platforms that can turn individual transactions into analytics products

Data asymmetry is as important an issue in these areas as it is for personal data. These asymmetries can create power imbalances in sharing economy platforms like Uber. The impact of information asymmetry on markets has been understood since the 1970s.

How can data asymmetry be reduced?

There are many ways that data asymmetry can be reduced. Broadly, the solutions either involve reducing disparity in access to data, or in reducing disparities in the ability to extract value from that data.

Reducing the amount of data available to an application or service provider is where data protection legislation has a role to play. For example, data protection law places limits on what personal data companies can collect, store and share. Other examples of reducing disparities in access to data include:

  • Allowing users to opt-out of providing certain information
  • Allowing users to remove their data from a service
  • Creating data retention policies to reduce accumulation of data

Practising Datensparsamkeit (data minimisation) reduces the risks and imbalances associated with unfettered collection of data.

Reducing disparities in the ability to extract value from data can include:

  • Giving users more insight and input into when and where their data is used or shared
  • Giving users or businesses access to all of their data, e.g. a complete transaction history or a set of usage statistics, so they can attempt to draw additional value from it
  • Publishing some or all of the aggregated data as open data

Different applications and services will adopt a different mix of strategies. This will require balancing the interests of everyone involved in the value exchange. Policy makers and regulators also have a role to play in creating a level playing field.

Open data can reduce asymmetry by allowing value to spread through a wider network

Update: the diagrams in this post were made with a service called LOOPY. You can customise the diagrams and play with the systems yourself. Here’s the first diagram visualising data asymmetry and here is the revised version which shows how open data reduces asymmetry by allowing value to spread further.

This post is part of a series called “basic questions about data”.

Fearful about personal data, a personal example

I was recently at a workshop on making better use of (personal) data for the benefit of specific communities. The discussion, perhaps inevitably, ended up focusing on many of the attendees’ concerns around how data about them was being used.

The group was asked to share what made them afraid or fearful about how personal data might be misused. The examples were mainly about use of the data by Facebook, by advertisers, as surveillance, etc. There was a view that being in control of that data would remove the fear and put the individual back in control. This same argument pervades a lot of the discussion around personal data. The narrative is that if I own my data then I can decide how and where it is used.

But this overlooks the fact that data ownership is not a clear-cut thing. Multiple people might reasonably claim to have ownership over some data. For example, bank transactions between individuals. Or data about cats. Multiple people might need to have a say in how and when that data is used.

But setting aside that aspect of the discussion for now, I wanted to share what made me fearful about how some personal data might be misused.

As I’ve written here before my daughter has Type-1 diabetes. People with Type-1 diabetes live a quantified life. Blood glucose testing and carbohydrate counting are a fact of life. Using sensors makes this easier and produces better data.

We have access to my daughter’s data because we are a family. By sharing it we can help her manage her condition. The data is shared with her diabetes nurses through an online system that allows us to upload and view the data.

What makes me fearful isn’t that this data might be misused by that system or the NHS staff.

What makes me fearful is that we might not be using the data as effectively as we could be.

We are fully in control of the data, but that doesn’t automatically give us the tools, expertise or insight to use it. There may be other ways to use that data that might help my daughter manage her condition better. Is there more that we could be doing? Is there more data we could be collecting?

I’m technically proficient enough to do things with that data. I can download, chart and analyse it. Not everyone can do that. What I don’t have are the skills, the medical knowledge, to really use it effectively.

We have access to some online reporting tools as a consequence of sharing the data with the NHS. I’m glad that’s available to us. It does a better job than I can do.

I also fear that there might be insights that researchers could extract from that data, by aggregating it with data shared by other people with diabetes. But that isn’t happening, because we have no way to really allow that. And even so I’m not sure we would be qualified to judge the quality of a research project to know where it might best be shared.

My aim here is not to be melodramatic. We are managing very well thank you. And yes there are clearly areas where unfettered access to personal data is problematic. There’s no denying that. My point is to highlight that ownership and control doesn’t automatically address concerns or create value.

We are not empowered by the data, we are empowered when it is being used effectively. We are empowered when it is shared.

Some tips for open data ecosystem mapping

At Open Data Camp last month I pitched to run a session on mapping open data ecosystems. Happily quite a few people were interested in the topic, so we got together to try out the process and discuss the ideas. We ended up running the session according to my outline and a handout I’d prepared to help people.

There’s a nice writeup with a fantastic drawnalism summary on the Open Data Camp blog. I had a lot of good feedback from people afterwards to say that they’d found the process useful.

I’ve explored the idea a bit further with some of the ODI team, which has prompted some useful discussion. It also turns out that the Food Standards Agency are working through a similar exercise at the moment to better understand their value networks.

This blog post just gathers together those links along with a couple more examples and a quick brain dump of some hints and tips for applying the tool.

Some example maps

After the session at Open Data Camp I shared a few example maps I’d created:

That example starts to present some of the information covered in my case study on Discogs.

I also tried doing a map to illustrate aspects of the Energy Sparks project:

Neither of those are fully developed, but hopefully provide useful reference points.

I’ve been using Draw.io to do those maps as it saves to Google Drive which makes it easier to collaborate.

Some notes

  • The maps don’t have to focus on just the external value, e.g. what happens after data is published. You could map value networks internal to an organisation as well
  • I’ve found that the maps can get very busy, very quickly. My suggestion is to focus on the key value exchanges rather than trying to be completely comprehensive (at least at first)
  • Try to focus on real, rather than potential, exchanges of value. So, rather than brainstorm ways that sharing some data might prove useful, as a rule of thumb check whether you can point to some evidence of a tangible or intangible value exchange. For example:
    • Tangible value: Is someone signing up to a service, or is there a documented API or data access route?
    • Intangible value: is there an event, contact point or feedback form which allows this value to actually be shared?
  • “Follow the data”. Start with the data exchanges and then add applications and related services.
  • While one of the goals is to identify the different roles that organisations play in data ecosystems (e.g. “Aggregator”), it’s often easier to start with the individual organisations and their specific exchanges, rather than with the roles. Organisations may end up playing several roles, and that’s fine. The map will help evidence that
  • Map the current state, not the future. There’s no time aspect to these maps, so I’d recommend drawing a different map to show how you hope things might be, rather than how they are.
  • There was a good suggestion to label data exchanges in some way to add a bit more context, e.g. by using thicker lines for key data exchanges, or a marker to indicate open (versus shared or closed) data sharing
  • Don’t forget that for almost all exchanges where a service is being delivered (e.g. an application, hosting arrangement, etc) there will also be an implicit, reciprocal data exchange. As a user of a service I am contributing data back to the service provider in the form of usage statistics, transactional data, etc. Identifying where that data is accruing (but not being shared) is a good way to identify future open data releases
  • A value network is not a process diagram. The value exchanges are between people and organisations, not systems. If you’ve got a named application on the diagram it should only be as the name of tangible value (“provision of application X”) not as a node in the diagram
  • Sometimes you’re better off drawing a process or data flow diagram. If you want to follow how the data gets exchanged between systems, e.g. to understand its provenance or how it is processed, then you may be better off drawing a data flow diagram. I think as practitioners we may need to draw different views of our data ecosystems. Similar to how systems architects have different ways to document software architecture
  • The process of drawing a map is as important as the output itself. From the open data camp workshop and some subsequent discussions, I’ve found that the diagrams quickly generate useful insights and talking points. I’m keen to try the process out in a workshop setting again to explore this further

I’m keen to get more feedback on this. So if you’ve tried out the approach then let me know how it works for you. I’d be really interested to see some more maps!

If you’re not sure how to get started then also let me know how I can help, for example what resources would be useful? This is one of several tools I’m hoping to write-up in my book.

The British Hypertextual Society (1905-2017)

With their globe-spanning satellite network nearing completion, Peter Linkage reports on some of the key milestones in the history of the British Hypertextual Society.

The British Hypertextual Society was founded in 1905 with a parliamentary grant from the Royal Society of London. At the time there was growing international interest in finding better ways to manage information, particularly scientific research. Undoubtedly the decision to invest in the creation of a British centre of expertise for knowledge organisation was also influenced by the rapid progress being made in Europe.

Paul Otlet‘s Universal Bibliographic Repertory and his ground-breaking postal search engine were rapidly demonstrating their usefulness to scholars. Otlet’s team began publishing the first version of their Universal Decimal Classification only the year before. Letters between Royal Society members during that period demonstrate concern that Britain was losing the lead in knowledge science.

As you might expect, the launch of the British Hypertextual Society (BHS) was a grand affair. The centrepiece of the opening ceremony was the Babbage Bookwheel Engine, which remains on show (and in good working order!) in their headquarters to this day. The Engine was commissioned from Henry Prevost Babbage, who refined a number of his father’s ideas to automate and improve on Ramelli’s Bookwheel concept.

While it might originally have been intended as only a centrepiece, it was the creation of this Engine that laid the groundwork for many of the Society’s later successes. Competition between the BHS members and Otlet’s team in Belgium encouraged the rapid development of new tools. This included refinements to the Bookwheel Engine, prompting its switch from index cards to microfilm. Ultimately it was also instrumental in the creation of the United Kingdom’s national grid and the early success of the BBC.

In the 1920s, in an effort to improve on the Belgian Postal Search Service, the British Government decided to invest in its own solution. This involved reproducing decks of index cards and microfilm sheets that could be easily interchanged between Bookwheel Engines. The new, standardised electric engines were dubbed “Card Wheels”.

The task of distributing the decks and the machines to schools, universities and libraries was given to the recently launched BBC as part of its mission to inform, educate and entertain. Their microfilm version of the Domesday book was the headline grabbing release, but the BBC also freely distributed a number of scholarly and encyclopedic works.

Problems with the reliable supply of electricity to parts of the UK hampered the roll out of the Card Wheels. This led to the Electricity (Supply) Act of 1926 and the creation of the Central Electricity Board. This simultaneously laid the foundations for a significant cabling infrastructure that would later carry information to the nation in digital forms.

These data infrastructural improvements were mirrored by a number of theoretical breakthroughs. Drawing on Ada Lovelace’s work and algorithms for the Difference Engine, British Hypertextual Society scholars were able to make rapid advances in the area of graph theory and analysis.

These major advances in the distribution of knowledge across the United Kingdom led to Otlet moving to Britain in the early 1930s. A major scandal at the time, this triggered the end of many of the projects underway in Belgium and beyond. Awarded a senior position in the BHS, Otlet transferred his work on the Mundaneum to London. Close ties between the BHS members and key government officials meant that the London we know today is truly the “World City” envisioned by Otlet. It’s interesting to walk through London and consider how so much of the skyline and our familiar landmarks are influenced by the history of hypertext.

The development of the Memex in the 1940s laid the foundations for the development of both home and personal hypertext devices. Combining the latest mechanical and theoretical achievements of the BHS with some American entrepreneurship led to devices rapidly spreading into people’s homes. However the device was the source of some consternation within the BHS, as it was felt that British ideas hadn’t been properly credited in the development of that commercial product.

Of course we shouldn’t overlook the importance of the InterGraph in ensuring easy access to information around the globe. Designed to resist nuclear attack, the InterGraph used graph theory concepts developed by the BHS to create a world-wide mesh network between hypertext devices and sensors. All of our homes, cars and devices are part of this truly distributed network.

Tim Berners-Lee‘s development of the Hypertext Resource Locator was initially seen as a minor breakthrough. But it actually laid the foundations for the replacement of Otlet’s classification scheme and accelerated the creation of the World Hypertext Engine (WHE) and the global information commons. Today the WHE is ubiquitous. It’s something we all use and contribute to on a daily basis.

But, while we all contribute to the WHE, it’s the tireless work of the “Controllers of The Graph” in London that ensures that the entire knowledge base remains coherent and reliable. How else would we distinguish between reliable, authoritative sources and information published by any random source? Their work to fact-check information, manage link integrity and ensure maintenance of core assets is a key feature of the WHE as a system.

Some have wondered what an alternate hypertext system might look like. Scholars have pointed to ideas such as Ted Nelson’s “Xanadu” as one example of an alternative system. Indeed it is one of many that grew out of the counter-culture movement in the 1960s. Xanadu retained many of the features of the WHE as we know it today, e.g. transclusion and micro-transactions, but removed the notion of a centralised index and register of content. This not only removed the ability to have reliable, bi-directional links, but would have allowed anyone to contribute anything, regardless of its veracity.

For many it’s hard to imagine how such a chaotic system would actually work. Xanadu has been dismissed as “a foam of ever-popping bubbles”. And a heavily commercialised and unreliable system of information is a vision to which few would subscribe.

Who would want to give up the thrill of seeing their first contributions accepted into the global graph? It’s a rite of passage that many reflect on fondly. What would the British economy look like if it were not based on providing access to the world’s information? Would we want to use a system that was not fundamentally based on the “Inform, Educate and Entertain” ideal?

This brings us to the present day. The launch of a final batch of satellites will allow the British Hypertextual Society to deliver on a long-standing goal whilst also enabling its next step into the future.

Launched from the British space centre at Goonhilly, each of the standardised CardSat satellites carries both a high-resolution camera and an InterGraph mesh network node. The camera will be used to image the globe in unprecedented detail. This will be used to ensure that every key geographical feature, including every tree and many large animals, can be assigned a unique identifier, bringing them into the global graph. And, by extending the mesh network into space, the BHS will ensure that the InterGraph has complete global coverage, whilst also improving connectivity between the fleet of British space drones.

It’s an exciting time for the future of information sharing. Let’s keep sharing what we know!

Designing CSV files

A couple of the projects I’m involved with at the moment are at a stage where there’s some thinking going on around how to best provide CSV files for users. This has left me thinking about what options we actually have when it comes to designing a CSV file format.

CSV is a very useful, but pretty mundane format. I suspect many of us don’t really think very much about how to organise our CSV files. It’s just a table, right? What decisions do we need to make?

But there are actually quite a few different options we have that might make a specific CSV format more or less suited for specific audiences. So I thought I’d write down some of the options that occurred to me. It might be useful input into both my current projects as well as future work on standard formats.

Starting from the “outside in”, we have decisions to make about all of the following:

File naming

How are you going to name your CSV file? A good file naming convention can help ensure that a data file has an unambiguous name within a data package or after a user has downloaded it.

Including a name, timestamp or other version indicator will avoid clobbering existing files if a user is archiving or regularly collecting data.

Adopting a similar policy to generating URL slugs can help generate readable file names that work across different platforms.

Tabular Data Packages recommends using a .csv file name extension, which seems sensible!
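As a sketch of the kind of convention I have in mind, here’s one way to build a slugged, dated file name. The dataset name and a daily snapshot cycle are just examples:

```python
# A sketch of a simple file naming convention: a lowercase slug plus an
# ISO date, e.g. "atm-locations-2017-03-31.csv". The dataset name and the
# daily cadence are illustrative only.
import re
from datetime import date

def csv_filename(title: str, snapshot: date) -> str:
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}-{snapshot.isoformat()}.csv"

print(csv_filename("ATM Locations", date(2017, 3, 31)))
# atm-locations-2017-03-31.csv
```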

CSV Dialect

CSV is a loosely defined format of which there are several potential dialects. Variants can use different delimiters, line-endings and quoting policies. Content encoding is another variable. CSV files may or may not have headers.

The CSV on the Web standard defines a best practice CSV file dialect. Unless there’s a good reason, this ought to be your default dialect when defining new formats. But note that the recommended UTF-8 encoding may cause some issues with Excel.

CSV on the Web doesn’t say how many header rows a CSV file should have, but does define how, when parsing a CSV file, multiple header rows can be skipped. Multiple header rows are often used as a way to add metadata or comments, but I’d recommend using a CSV on the Web metadata file instead as it provides more options.
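For what it’s worth, the default dialect of Python’s csv module already lines up reasonably well with this: comma delimiters, double-quote quoting and CRLF line endings. A minimal example, with the usual caveat that older versions of Excel may want a BOM (the "utf-8-sig" encoding) to read UTF-8 correctly; the example rows are illustrative:

```python
# Writing a CSV in the "best practice" style dialect: UTF-8, comma-delimited,
# double-quoted where needed, CRLF line endings, single header row.
import csv

rows = [
    {"region": "South West", "customer": "ACME Ltd", "product": "Widget", "total": 120},
    {"region": "Wales", "customer": "Example & Co", "product": "Widget", "total": 45},
]

with open("sales.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["region", "customer", "product", "total"])
    writer.writeheader()
    writer.writerows(rows)
```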

Column Naming

What naming convention to use for columns? One option is to use an all-lowercase convention similar to a URL slug. This might make it marginally easier to access columns by name in an application. But if there are expectations that a CSV file will be opened in a spreadsheet application, having readable column names (including spaces) will make the data more user friendly.

CSV on the Web has a few other notes about column and row labelling.

Also, what language will you use in the column headings?

Column Ordering

How are you going to order the columns in your CSV? The ordering of columns in a CSV file can enhance readability. But there are likely to be several different orderings, some of them more “natural” than others.

A common convention is to start with an identifier and the other properties (dimensions) that describe what is being reported, followed by the actual observed values. So for example in a sales report we might have:

region, customer, product, total

Or in a statistical dataset

dimension1, dimension2, dimension3, value

Or

dimension1, dimension2, dimension3, value, qualifier

This has the advantage of giving the table a more natural reading order, particularly if the columns have fewer distinct values as you move from left to right. Adding qualifiers and notes at the end also ensures that they sit naturally next to the value they are annotating.

Row Ordering

Is your CSV sorted by default? Sorting may be less relevant if a CSV is being automatically processed, and not worrying about order might reduce overheads when generating a data dump.

But if the CSV is going to be inspected or manipulated in a spreadsheet, then defining a default order can help a reader make sense of it.

If the CSV isn’t ordered, then document this somewhere.

Table Layout

How is the data in your table organised?

The Tidy Data guidance recommends having variables in columns, observations in rows, and only a single type of measure/value per table.

In addition to this, I’d also recommend that where there are qualifiers for reported values (as there often are for statistical data) these are always provided in a separate column, rather than within the main value column. This has the advantage of letting your value column be numeric, rather than a mix of numbers and symbols or other codes. Missing and suppressed values can also then just be omitted and accompanied by an explanation in an adjacent column.

Another pattern I’ve seen with table layouts is to include an element of redundancy to include both labels and identifiers for something referenced in a row. Going back to the sales report example, we might structure this as follows:

region_id, region_name, customer_id, customer_name, product, total

This allows an identifier (which might be a URI) to be provided alongside a human-readable name. This makes the data more readable, at the cost of increasing file size. But it does avoid the need to publish a separate lookup table of identifiers.

You might also sometimes find a need for repeated values. This is sometimes handled by adding additional redundant columns, e.g. “SICCode1”…“SICCode4” as used in the Companies House data. This works reasonably well and should be handled by most tools, at the potential cost of having lots of extra columns and a sparsely populated table. The alternative is to use a delimiter to pack all the values into a single column. Again, CSV on the Web defines ways to process this.
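As a sketch of the trade-off: the repeated-column approach just works in most tools, while a delimited column needs an extra processing step (and the separator documenting, which CSV on the Web lets you declare in metadata). The column names and separator below are illustrative:

```python
# Reading a multi-valued column that uses a secondary delimiter, e.g. a
# hypothetical "sic_codes" column with values separated by ";". The column
# names, file name and separator are assumptions for illustration.
import csv

with open("companies.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        sic_codes = [c.strip() for c in row["sic_codes"].split(";") if c.strip()]
        print(row["company_name"], sic_codes)
```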

Data Formats

And finally we have to decide how to include values in the individual cells. In its section on parsing cells, CSV on the Web recommends XML Schema datatypes and date formats as a default, but also allows formats to be defined in an accompanying metadata file.

Other things to think about are more application specific issues, such as how to specify co-ordinates, e.g. lat/lng or lng/lat?

Again, you should think about likely uses of the data and how, for example, data and date formats might be interpreted by spreadsheet applications, as well as other internationalisation issues.

This is just an initial list of thoughts. CSV on the Web clearly provides a lot of useful guidance that we can now build on, but there are still reasonable questions and trade-offs to be made. I think I’d also now recommend always producing a CSV on the Web metadata file along with any CSV file to help document its structure and any of the decisions made around its design. It would be nice to see the Tabular Data Package specification begin to align itself with that standard.
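As an illustration of what I mean, here’s one way to generate a minimal CSV on the Web metadata file alongside a CSV. It’s a sketch rather than a complete, validated CSVW description, and the column list simply mirrors the sales report example above:

```python
# Writing a minimal CSV on the Web metadata file ("sales.csv-metadata.json")
# alongside the CSV it describes. A sketch only: a fuller CSVW description
# would usually carry more annotations (titles, descriptions, formats, ...).
import json

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "sales.csv",
    "tableSchema": {
        "columns": [
            {"name": "region", "datatype": "string"},
            {"name": "customer", "datatype": "string"},
            {"name": "product", "datatype": "string"},
            {"name": "total", "datatype": "decimal"},
        ]
    },
}

with open("sales.csv-metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```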

I suspect there are a number of useful tips and guidance which could be added to what I’ve drawn up here. If you have any comments or thoughts then let me know.

Open Data Camp Pitch: Mapping data ecosystems

I’m going to Open Data Camp #4 this weekend. I’m really looking forward to catching up with people and seeing what sessions will be running. I’ve been toying with a few session proposals of my own and thought I’d share an outline for this one to gauge interest and get some feedback.

I’m calling the session: “Mapping open data ecosystems“.

Problem statement

I’m very interested in understanding how people and organisations create and share value through open data. One of the key questions that the community wrestles with is demonstrating that value, and we often turn to case studies to attempt to describe it. We also develop arguments to use to convince both publishers and consumers of data that “open” is a positive.

But, as I’ve written about before, the open data ecosystem consists of more than just publishers and consumers. There are a number of different roles. Value is created and shared between those roles. This creates a value network including both tangible (e.g. data, applications) and intangible (knowledge, insight, experience) value.

I think if we map these networks we can get more insight into what roles people play, what makes a stable ecosystem, and better understand the needs of different types of user. For example we can compare open data ecosystems with more closed marketplaces.

The goal

Get together a group of people to:

  • map some ecosystems using a suggested set of roles, e.g. those we are individually involved with
  • discuss whether the suggested roles need to be refined
  • share the maps with each other, to look for overlaps, draw out insights, validate the approach, etc

Format

I know Open Data Camp sessions are self-organising, but I was going to propose a structure to give everyone a chance to contribute, whilst also generating some output. Assuming an hour session, we could organise it as follows:

  • 5 mins review of the background, the roles and approach
  • 20 mins group activity to do a mapping exercise
  • 20 mins discussion to share maps, thoughts, etc
  • 15 mins discussion on whether the approach is useful, refine the roles, etc

The intention here being to try to generate some outputs that we can take away. Most of the session will be group activity and discussion.

Obviously I’m open to other approaches.

And if no-one is interested in the session then that’s fine. I might just wander round with bits of paper and ask people to draw their own networks over the weekend.

Let me know if you’re interested!