Today I am 101100.

That’s XLIV in Roman.

44 is also the square root of 1936. 1936 was a leap year starting on a Wednesday.

The Year 44 was also a leap year starting on a Wednesday.

It was also known as the Year of the Consulship of Crispus and Taurus. Which is another coincidence because I like crisps and I’m also a Taurus.

And while we’re on Wikipedia, we can use the API to find out that page id 101100 is Sydney Harbour National Park which opened when I was 3.
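For anyone wanting to repeat that lookup, here is a minimal Python sketch. It assumes the standard MediaWiki `action=query` endpoint with its `pageids` parameter. The helper names `page_query_url` and `title_from_response` are made up for illustration, and the sample response is hand-written to show the general shape rather than copied from a live call.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def page_query_url(page_id: int) -> str:
    """Build a MediaWiki query URL that looks a page up by its numeric id."""
    params = {"action": "query", "pageids": page_id, "format": "json"}
    return f"{API_ENDPOINT}?{urlencode(params)}"


def title_from_response(response: dict, page_id: int) -> str:
    """Pull the page title out of a decoded JSON response."""
    return response["query"]["pages"][str(page_id)]["title"]


# Hand-written sample in the shape the API returns (not a live response).
sample = {
    "query": {
        "pages": {
            "101100": {
                "pageid": 101100,
                "ns": 0,
                "title": "Sydney Harbour National Park",
            }
        }
    }
}

print(page_query_url(101100))
print(title_from_response(sample, 101100))
```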

Wolfram Alpha reminds me that 44 is the ASCII code for a comma.

Whichever way you look at it #101100 is a disappointing colour.
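For anyone who wants to check the arithmetic so far, here is a small Python sketch. The `to_roman` helper and the variable names are my own. The Year 44 is left out because Python's `datetime` uses the proleptic Gregorian calendar, which would not match the Julian calendar in use at the time.

```python
import calendar
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def to_roman(n: int) -> str:
    """Convert a positive integer to a Roman numeral, greedily."""
    numerals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


age = int("101100", 2)                  # binary to decimal
roman = to_roman(age)                   # Roman numeral form
square = age ** 2                       # the year whose square root is the age
leap_1936 = calendar.isleap(1936)       # was 1936 a leap year?
first_day_1936 = WEEKDAYS[date(1936, 1, 1).weekday()]  # weekday() counts Monday as 0
ascii_char = chr(age)                   # character with that ASCII code
rgb = tuple(bytes.fromhex("101100"))    # the colour #101100 as (red, green, blue)

print(age, roman, square, leap_1936, first_day_1936, repr(ascii_char), rgb)
```

With every channel at 17 or below out of 255, #101100 is very nearly black, which goes some way towards explaining the disappointment.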

But if we use the random art generator then we can make a more colourful image from the number. But actually the image with that identifier is more interesting. Glitchy!

The binary number is also a car multimedia entertainment system. But £200 feels a bit steep, even if it is my birthday.

A 12-year-old boy once bid £101,100 for a flooded Seat Toledo on eBay. Because reasons.

101100, or tubulin tyrosine ligase-like family, member 3 to its friends, also seems to do important things for mice.

I didn’t really enjoy Jamendo album 101100, the Jamez Anthony story.

Care of Cell Block 101100 was a bit better in my opinion. But only a bit.

Discogs release 101100 is The Sun’s Running Out by Perfume Tree. Of which the most notable thing is that track six includes a sample from a Dr Who episode.

I’m not really sure what the tag 101100 on flickr means.

IMDB entry 101100 is “Flesh ‘n’ Blood”.

The Board Game Geek identifier 101100 is for an XBox 360 version of 1 vs 100. That’s not even a board game!

Whereas Drive Thru RPG catalogue product 101100 is Battlemage. Which sounds much more interesting.

If I search for “101100 coordinates” on google, then it tells me that it’s somewhere in China. I should probably know why.

There are 26 results for 101100 on data.gov.uk. But none on data.gov. Which explains why the UK is #1 in the world for open data.

But HD 101100 is also a star.

And a minor planet discovered on 14th September 1998.

CAS 101-10-0 is 2-(3-Chlorophenoxy)propionic acid. I think it’s a herbicide. Anyway, this is what it looks like.

It’s also a marine worm.

And an insect.

In the database of useful biological numbers, we discover that entry 101100 is the maximal emission wavelength for Venus fluorophore. Which is, of course, 528 nm.

I think the main thing I’ve learnt in my 44 years is that the web is an amazing place.

Data marketplaces, we hardly knew ye

I’m on a panel at the ODI lunchtime lecture this week, where I’m hoping to help answer the question of “what does a good data market look like?”.

As many of you know I was previously the product manager/lead for a data marketplace called Kasabi. That meant that I spent quite a bit of time exploring options for building both free and commercial services around data, business models for data supply, etc. At the time data marketplaces seemed to be “a thing”. See also this piece from 2011. There were suddenly a number of data marketplaces springing up from a variety of organisations.

The idea of data marketplaces, perhaps as an evolution of current data portals is one that seems to be resurfacing. I’ve already written about why I think “data marketplace” isn’t the right framing for encouraging more collaboration around data, particularly in cities.

I’m not going to rehash that here, but, as preparation for Friday, I thought I’d take a look at how the various data marketplaces are faring. Here’s a quick rundown.

If you think I’ve misrepresented anything then leave a comment and I’ll correct the post.

  • Data Market was originally focused on delivering data to businesses, and offered sophisticated charting and APIs. Drew largely on national and international statistics. Great platform and a really nice team (disclaimer: have previously done some freelance work with them). They were acquired by Qlik. My understanding is that this rounded out their product offering by having an off-the-shelf platform for visualising on-demand data. This is no longer what I’d consider a marketplace, more a curated set of data feeds.
  • Azure Data Marketplace is still around but seems to be largely offering only Microsoft’s own data and APIs. Seems to be in the middle of a revamp and refocus on cloud apps and more general APIs rather than a marketplace. In its early stages Microsoft explored iterating this into an enterprise data portal as well as deeper integration with some of their products like SQL Server.
  • Kasabi. Shut down. Sob.
  • BuzzData. Shut down.
  • FreeBase. Acquired by Google, continued as a free service for a while and shut down in 2015. The data is now part of Wikidata.
  • Infochimps. Originally a data marketplace, the team spent a lot of time building out a data processing pipeline using Big Data technologies. They were acquired for this technology.
  • Timetric started out as a data platform focusing on statistical and time series data, now seems to have evolved in a slightly different direction.
  • Factual continue to focus on location data. I was always intrigued by their approach which (at least originally) included businesses pooling their data together to create a richer resource, which was then used to drive additional revenue and sales. While there were suggestions they may expand into other sectors, that hasn’t happened.
  • Gnip and Datasift are still around, both still focusing on services and data analysis around social media data.

There are others that could be included in the list. There’s one interesting new contender that shares a lot of similarity with some things that we were building in Kasabi, but they’re currently in stealth mode so I won’t share more.

I also don’t include Amazon Public Datasets or Google Public Data as they’re not really marketplaces. They’re collections of large datasets that Amazon or Google are providing as an enabler or encouragement to use some of their cloud services. Difficult to demonstrate big data analysis unless there’s a nice collection of demo datasets.

So, really only the Microsoft offering is still around in its original form of a data marketplace, and it’s clear that the emphasis is shifting elsewhere. The other services that are still around are all focused on a specific vertical or business sector rather than offering a general purpose (“horizontal”) platform for the supply and selling of data.

This matches what we can see elsewhere: there are lots of businesses that have been selling data for some time. While the original emphasis was on the data, the move now is to sell services on top of it. But they’re all focused on a specific sector or vertical. I think cities are neither.


On accessibility of data

My third open data “parable”. You can read the first and second ones here. With apologies to Borges.

. . . In that Empire, the Art of Information attained such Perfection that the data of a single City occupied the entirety of a Spreadsheet, and the datasets of the Empire, the entirety of a Portal. In time, those Unconscionable Datasets no longer satisfied, and the Governance Guilds struck a Register of the Empire whose coverage was that of the Empire, and which coincided identifier for identifier with it. The following Governments, who were not so fond of the Openness of Data as their Forebears had been, saw that that vast register was Valuable, and not without some Pitilessness was it, that they delivered it up to the Voraciousness of Privatisation and Monopolies. In the Repositories of the Net, still today, there are Stale Copies of that Data, crowd-sourced by Startups and Citizens; in all the Commons there is no other Relic of the Disciplines of Transparency.

Sharon More, The data roads less travelled. London, 2058.


Caution: data, use responsibly

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/caution-data-use-responsibly

In December 2015, Ben Goldacre and Anna Powell-Smith launched the beta of Open Prescribing. The site, which was swiftly celebrated in the open data community and beyond, provides insight into the prescribing practices of GPs around the UK. Its visualisations and reports give an entirely new perspective on some of the bulk open datasets available from the NHS.

Open Prescribing is a fantastic demonstration of how openly publishing data can unlock new, creative uses.

There is a particular feature of the site which piqued my interest: a page entitled, ‘Caution: how to use the data responsibly’. Goldacre and Powell-Smith have included some clear guidance that helps users to properly interpret their findings, including:

  • guidance on how to interpret high and low values for the measurements, encouraging thought into what patterns they may or may not demonstrate – because of differences in population around a practice, for example
  • notes on how the individual measures were decided upon
  • insight into the importance of specific drugs and measures for a non-specialist audience
  • links to useful background information from the original data publishers

The ‘About’ page for the site also attributes all of the datasets that were used as input to the analysis.

Clear attribution, provenance reporting and guidance on limits to the analysis might be expected from authors with a background in evidence-based medicine. It’s not yet normal practice within the open data community. But it should be.

As a society, we are making an increasing number of decisions based on data, about our health, economy and businesses. So it’s becoming more and more important that we know the limits of what that data can reliably tell us. Data enables informed decisions. Knowing the limits of data also makes us more informed.

In my opinion all data analysis should have an equivalent of the Open Prescribing “/caution” URL.

To achieve this, data users need to know more about how data is collected and processed before it is published. This is why the higher levels of Open Data Certificates require publishers to:

  • document any known quality issues or limitations with the data
  • publish details of their quality control processes, including how to report errors
  • describe the provenance of the data, e.g. how it was collected and analysed

That information provides the necessary foundation for re-users to properly interpret and apply data. This information can then be cited, as it is on Open Prescribing, to help downstream users understand the impacts on any analysis.

Documenting the datasets used in an analysis is another norm that’s common in the medical and scientific communities. Linking to source datasets is the basis for citation analysis in academic research. These links power many types of discovery tools, and help improve reproducibility and transparency in research.

Use of machine-readable attributions could do the same for more general uses of data online. In the early days of the web, developers would “view source” to view the markup behind a webpage to learn how it was put together. The ability to “view sources” to discover the data underlying an application or data analysis would be a useful feature for the data web.
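One way to picture a “view sources” feature is a small machine-readable attribution file published alongside an analysis. The Python sketch below is purely illustrative: the field names (`title`, `publisher`, `url`, `licence`, `caveats`) and the example entry are hypothetical placeholders and do not follow any particular standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Source:
    """One dataset used as input to an analysis (hypothetical fields)."""
    title: str
    publisher: str
    url: str
    licence: str
    caveats: List[str] = field(default_factory=list)


def attribution_document(analysis_title: str, sources: List[Source]) -> str:
    """Serialise the sources behind an analysis as a JSON document."""
    doc = {"analysis": analysis_title,
           "sources": [asdict(s) for s in sources]}
    return json.dumps(doc, indent=2)


# Placeholder values, not a real dataset or publisher.
example = Source(
    title="Example prescribing dataset",
    publisher="Example Health Agency",
    url="https://example.org/data/prescribing",
    licence="https://example.org/licences/open",
    caveats=["Practice populations differ, so compare rates with care."],
)

print(attribution_document("Example analysis", [example]))
```

Keeping known caveats next to each source means the limits of the data travel with the attribution, rather than living only on the original publisher's site.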

So, if you’re doing some data analysis, follow the best practices embodied by Open Prescribing and help users and other developers to understand how you’ve achieved your results.

Take your first steps with Open Data Pathway

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/take-your-first-steps-with-open-data-pathway

We’re launching a new tool today called Open Data Pathway. It’s a self-assessment tool that will help you assess how well your organisation publishes and consumes open data, and identify actions for improvement.

The tool is based on the Open Data Maturity Model we have been developing in partnership with the Department for Environment, Food & Rural Affairs.

The maturity model is based around five themes and maturity levels. Each theme represents a broad area of operations within an organisation, and is broken down into areas of activity which can then be used to assess progress.

We’ve previously published the maturity model as a public draft. We would like to thank everyone from across central and local government, agencies and other organisations who have given feedback on the draft documents. Your contributions and ideas were extremely valuable. We’re pleased to announce that the final, first edition of the model is now available.

Open Data Pathway supports open data practitioners in carrying out a maturity assessment. Completing an assessment will create a report that scores your organisation against each activity. The report also includes practical recommendations that suggest how scores can be improved for each activity. Combined with the ability to set targets for improvement, Open Data Pathway provides a complete self-assessment tool to enable practitioners to successfully apply the maturity model to their organisation.

Open Data Pathway offers a useful complement to the Open Data Certificates. The certificates measure how effectively someone is sharing a dataset for ease of reuse. Open Data Pathway helps organisations assess how well they publish and consume open data, helping build a roadmap for their open data journey.

We are initially launching the tool as an alpha release to help us gain valuable user feedback. The beta version will launch at the end of April, 2015, and will have the functionality to support results sharing and organisation benchmarking.

Please sign up and explore the tool and let us know what you think.

5 ways to be a better open data reuser

Originally published on the Open Data Institute blog. Original URL: https://theodi.org/blog/5-ways-better-open-data-reuse

Open data is still in its infancy. The focus so far has been on encouraging and supporting owners of data to publish it openly. A lot has been written about why opening up data is valuable, how to build business cases for open data sharing, and how to publish data in order to make it easy for people to reuse.

But, while it’s great there is so much advice for data publishers, we don’t often talk about how to be a good reuser of data. One of the few resources that give users advice is the Open Data Commons Attribution-Sharealike Community Norms.

I want to build on those points and offer some more tips and insights on how to use open data better.

1. Take time to understand the data

It almost goes without saying that in order to use data you need to understand it first. But effective reuse involves more than just understanding the structure and format of some data. We are asking publishers to be clear about how their data was collected, processed and licensed. So it’s important for reusers to use this valuable information and make informed decisions about using data.

It may mean that data is not fit for the purpose you intend, or perhaps you just need to be aware of caveats that impact its interpretation. These caveats should be shared when you are presenting your own analysis or conclusions, based on the data.

2. Be open about your sources

Attribution is a requirement of many open licences and reusers should be sure they are correctly attributing their sources. But citation of sources should be a community norm, not just a provision in a licence. Within research communities the norm is to publish data under a CC0 licence, because attribution and citation of data is already well-embedded as a best-practice: every scientific paper has a list of references.

The same principles should apply to the wider open data community. Acknowledging sources not only helps credit the work of data publishers, it also helps to identify widely-used, high-quality datasets.

Consider adding a page to your application that lists both the open source software and open data sources that you’ve used in developing it. The Lanyrd colophon page provides one example of how this might look.

3. Engage with the publisher

If you’re using someone’s data, tell them! Every open data publisher is keen to understand who is using their data and how. It’s by identifying the value that comes from reuse of their data that publishers can justify continual (and additional) investment in open data publishing.

Engage with publishers when they ask for examples of how their data is being reused. Provide constructive feedback on the data itself and identify quality issues if you find them. Point to improvements in how the data is published that might help you and others consume it more easily.

If it was hard for you to get in touch with the publisher, encourage them to provide clearer contact details on their website. Getting them to complete an Open Data Certificate will help make this point: you can’t get a Pilot rating unless you provide this information.

If open data is a benefit to your business, then share your story. Evidence of open data benefits provides a positive feedback loop that can help people to unlock more data.

4. Share what you know

In some cases it’s not easy or possible to provide feedback directly to publishers, so share what you learn about working with open data with the wider community.

Do you have some tips about how to consume a dataset? Consider writing a blog to share them. Maybe you can even share some open source code to help work with the data.

Have you identified some issues with a dataset? Those issues may well affect others, so share your observations with the wider community, not just the data publisher.

5. Help build the commons

The open data commons consists of all of the openly licensed and inter-connected datasets that are published to the web. The commons can grow and become more stable if we all contribute to it. There are various ways to achieve this beyond attribution and knowledge-sharing.

For example, if you’ve made improvements to a dataset, perhaps to enrich it against other sources, consider sharing that new dataset under an open licence. This might be the start of a more collaborative relationship with the original publisher or open up new business opportunities.

Some datasets are built and maintained collaboratively. Consider contributing some resources to help maintain the dataset, contributing your fixes or improvements. The more people do this, the more valuable the whole dataset becomes.

Direct financial contributions might also be an option, especially if you’re a commercial organisation making large-scale use of an open dataset. This is a direct way to support open data as a public good.

What do you think?

A mature open data commons will consist of a network of datasets published and reused by a variety of organisations. All organisations will be both publishers and consumers of open data. As we move forward with developing open data culture we need to think about how to encourage and support good practice in both roles.

The suggestions in this blog should prompt further discussion. We’d like to develop this further into some guidance for open data practitioners.

Comparing the 5-star scheme with Open Data Certificates

Originally published on the Open Data Institute blog.

I’ve been asked several times recently about the differences between the 5-star scheme for open data and the Open Data Certificates. How do the two ratings relate to one another, if at all? In this blog post I aim to answer that question.

The 5-star scheme

The 5-star deployment scheme was originally proposed by (our President) Tim Berners-Lee in his linked data design principles. The scheme is neatly summarised in this reference, which also identifies the costs and benefits associated with each stage.

Essentially, the scheme measures how well data is integrated into the web. “1-star” data is published in proprietary formats that users must download and process. “5-star” data can be accessed online, uses URIs to identify the resources in the data, and contains links to other sources.

The scheme is primarily focused on how data is published: the formats and technologies being used. Assessing whether a dataset is published at 2, 3 or 4 stars requires some insight into how the data has been published, which can be difficult for a non-technical person to assess.

The scheme is therefore arguably best used as a technical roadmap and a short-hand assessment of the technical aspects of data publishing.
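As a rough illustration, the star levels can be written as a cumulative checklist. This Python sketch follows the commonly cited summary of the scheme (open licence, structured data, non-proprietary format, URIs, links to other data) rather than anything spelled out in this post, and the function and parameter names are made up for the example.

```python
def star_rating(open_licence: bool, structured: bool, non_proprietary: bool,
                uses_uris: bool, links_to_other_data: bool) -> int:
    """Return 0 to 5 stars; each star requires all of the earlier ones."""
    stars = 0
    for criterion in (open_licence, structured, non_proprietary,
                      uses_uris, links_to_other_data):
        if not criterion:
            break
        stars += 1
    return stars


# An openly licensed PDF: on the web, but not structured data.
pdf_stars = star_rating(True, False, False, False, False)

# An openly licensed CSV file: structured and non-proprietary, but no URIs.
csv_stars = star_rating(True, True, True, False, False)

print(pdf_stars, csv_stars)
```

Note what the checklist leaves out: nothing in it covers documentation, maintenance or guarantees about availability.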

Open Data Certificates

The Open Data Certificates process takes an alternative but complementary view. A certificate measures how effectively someone is sharing a dataset for ease of reuse. The scope covers more than just technical issues including rights and licensing, documentation, and guarantees about availability. A certificate therefore offers a more rounded assessment of the quality of publication of a dataset.

For data publishers the process of assessing a dataset provides insight into how they might improve their publishing process. The assessment process is therefore valuable in itself, but the certificate that is produced is also of value to reusers.

An Open Data Certificate acts as a reference sheet containing information of interest to reusers of a dataset. This saves time and effort digging through a publisher’s website to find out whether a dataset can meet their needs. The ability to search and browse for certified datasets may eventually make it easier to find useful data.

Despite these differences, the certificates and the 5-star scheme are in broad alignment. Both aim to improve the quality and accessibility of published data. And both require that data is published under open licences using standard formats. We would expect a dataset published to Expert level on the certificates to be well-integrated into the web, for example.

However, it doesn’t necessarily follow that all “5-star” data would automatically gain an Expert rating: a dataset may be well integrated into the web but still be poorly maintained or documented.

In our view the Open Data Certificates provide clearer guidance for data publishers to consider when planning and improving their publishing efforts. They help publishers look at the bigger picture of data-user needs, many of which are not about the data format or whether the data contains URIs. This bigger picture can help inform data publishing roadmaps, procurement of data publishing services and policy development.

The certificates also provide a clear quality mark for reusers looking for assurances around how well data is published.

The 5-star scheme has been very effective at moving publishers away from Excel and closed licences and towards CSV and open licences. But for sustained and sustainable open data, reusers need the publishers of open data to consider more than licences and data formats. The Open Data Certificates help publishers do that.