The significant package was called “left-pad”. It’s a simple piece of utility code which has become a direct (and indirect) dependency for many, many other packages and software applications. Removing left-pad from NPM meant that every dependent piece of software would no longer build or install.
If you’re interested in digging into the story in more detail, I’d suggest reading The Register post I referenced above, the summary of events from the perspective of Kik and this retrospective from NPM.
But I did want to share a few observations that might help the open data community learn from this issue.
Firstly, all modern software development involves a massive graph of dependencies. As engineers we rarely take the time to think about how big and complex that graph can become. Building on dependencies is how we build up more complex systems, and it is why we have packaging systems in the first place. Network effects inevitably create “hubs” in those graphs, which become points of failure if removed. And this is why we have open licensing: to allow the community to ensure continuity if those key dependencies become unmaintained.
Secondly, I was hugely surprised at the number of comments I saw from developers who were angry and shocked that NPM might choose to reverse the un-publishing of the left-pad library. I’ve seen suggestions of copyright infringement, theft and more. I’m stunned at how many people seem to misunderstand that the ability to fork and take over an unmaintained code base is an important characteristic of open source licensing. Cameron Westland, the developer who stepped in to take over the code, should be applauded, regardless of how “trivial” the code itself might be.
NPM should also be applauded for having some policies in place to handle disputes and for publishing a clear analysis of the impact. NPM’s decision to reinstate the code to stop the breakage was the right thing to do.
This is how the commons is meant to work after all. Collaborative maintenance of code and principled management of shared infrastructure is how we build resilience for the benefit of everyone. It’s unsettling to me that so many people seem to misunderstand that and treat these actions with suspicion.
I think so many people were surprised because things have largely “just worked”. This is how you know infrastructure is successful: it’s invisible.
So what does this experience in the open source community mean for the open data commons? I think there are several points to consider:
- Permissive open licensing is critical
  - If left-pad hadn’t been published under an open licence then it couldn’t have been quickly adopted by a new developer.
  - In open data terms this means that a dataset can be adopted by a new maintainer who could at least provide hosting, if not updates.
- Ongoing education around the importance of clear licensing is essential
  - We need to continue to educate people about the importance of clear, open licensing.
  - We need to continue to push back against complex and custom licences.
- Access to data dumps is important
  - The left-pad source code was readily available, allowing it to be quickly forked and then provided back to the community.
  - In open data terms this highlights the importance of regular data dumps and not just the provision of APIs.
- We need to plan for resilience
  - Should all open datasets automatically be mirrored, e.g. to the Internet Archive, to ensure that there is some continuity if a specific portal goes down, either accidentally or for unannounced “scheduled maintenance”? One example is statistics.data.gov.uk, which has been offline for a while now. Mirroring could be a basic feature of all data portals.
  - Data portals have other roles to play. Every data portal I have used allows instant unpublishing of a dataset regardless of consequences. As NPM note, “Unrestricted un-publishing caused a lot of pain”. In the data marketplace I worked on we didn’t allow this: a publisher was required to give notice. While there are reasons to allow quick un-publishing (e.g. a data leak), there are other options (e.g. rollback).
- Our dependency graphs should be open
  - Open source package managers allow dependency graphs to be easily analysed. This is a key feature that services like libraries.io rely upon to add value to the ecosystem. What’s the equivalent for open data?
  - We are very far from that type of service in the open data community, but it’s something which we should be aiming for. While infrastructure might be invisible, it should still be mapped.
  - Open dependencies would also help sharing of workflows, identifying impacts, etc.
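To make the idea of mapping dependencies and identifying impacts a little more concrete, here is a minimal Python sketch. It assumes we already had an open, machine-readable record of which datasets and applications build on which others. No such registry exists today as far as I know, so the dataset names and the `DEPENDS_ON` structure below are entirely hypothetical. Given that record, finding the “hubs” and the impact of un-publishing one of them is a simple graph traversal.

```python
from collections import defaultdict, deque

# Hypothetical dependency record: each key builds on the items in its list.
# The names are made up purely for illustration.
DEPENDS_ON = {
    "transport-dashboard": ["bus-stops", "postcode-lookup"],
    "school-catchments": ["postcode-lookup", "school-locations"],
    "health-atlas": ["school-catchments", "population-estimates"],
}


def reverse_graph(depends_on):
    """Map each item to the set of things that depend on it directly."""
    dependents = defaultdict(set)
    for item, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(item)
    return dependents


def impacted_by(removed, depends_on):
    """Return everything that directly or indirectly depends on `removed`."""
    dependents = reverse_graph(depends_on)
    seen = set()
    queue = deque([removed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    seen.discard(removed)  # only relevant if the graph contains a cycle
    return seen


def rank_hubs(depends_on):
    """Rank every item by how many others would break if it disappeared."""
    names = set(depends_on) | {d for deps in depends_on.values() for d in deps}
    counts = {name: len(impacted_by(name, depends_on)) for name in names}
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for name, count in rank_hubs(DEPENDS_ON):
        print(f"{name}: {count} dependents affected")
```

In this made-up graph the postcode lookup is the hub: removing it affects three other items, including one that never references it directly. That is the left-pad situation in miniature. The hard part for open data is not the traversal but getting the dependency record published in the first place.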
The more fundamental problem to consider is how easy it would really be for someone to adopt an open dataset. The means of collection, curation and publishing are not readily available to all.
This is why I think that key elements of the open data commons should be collaboratively maintained from the beginning. But that is a topic for another day.