According to Wikipedia, due diligence is “the effort a party makes to avoid harm to another party”. It goes on to note that, within a business context, a “due diligence report” is often prepared to discover all risks and implications regarding a decision to be made.
I think this concept should be embraced by the open data movement. In short, when you publish a public collection of data, there’s some due diligence that should take place. My reasoning plays off both of these definitions.
Firstly: avoiding harm. This one is relatively straightforward. Don’t publish any data about a user unless they’ve expressly allowed it. Or, perhaps more realistically, don’t publish any sensitive data (e.g. email addresses) without permission. I’m not aware of any sites that publish such data without permission, but the rule should probably be set in stone somewhere to reinforce the convention. Privacy issues are only going to get worse as data becomes more easily available.
Secondly: understanding the risks and implications of the decision. There are several aspects to explore here.
The business implications of releasing open data can be manifold: what are you gaining and losing as a result of increased data sharing? From my perspective, opening up your database is at least a tacit acknowledgement that you’re happy for part, perhaps all, of your business model to shift towards exploiting second-order effects. For example, you no longer charge for or hide data, with the intent that increased traffic or usage resulting from social content hacking will indirectly affect revenues.
There are also copyright and licensing issues. Very few of the social content sites that I’ve explored clearly license their data and APIs. MusicBrainz is streets ahead here. You have to think beyond simple usage licensing (personal/academic/commercial) to issues like aggregation:
- Can I freely aggregate all your data for a non-commercial application?
- Is all of your data consistently licensed? For example, Flickr allows Creative Commons licensing of photos, but what about the photo and personal metadata?
- Can I redistribute your metadata? And how can I relicense it?
- How much provenance tracking must be done?
This area is well-worn ground in DRM circles but the issues are not incompatible with open licensing.
Which brings me to my third point: relationships between your data and that already out there “in the wild”.
I’ve spent a fair amount of time looking through collections of data on rdfdata.org and very little of it, even where the data is from a common domain, is interlinked. For example, there are several geo data sets which could easily be interrelated.
The beauty of RDF is that I can of course begin to publish these interconnections myself, but I think this should become part of the due diligence undertaken by data providers. It helps to avoid perpetuating data islands and makes free mixing of data much easier.
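To make the idea of publishing interconnections concrete, here is a minimal sketch of what a set of links between two geo data sets might look like when written out as N-Triples. The two place URIs are hypothetical, and the choice of `owl:sameAs` as the linking property is my assumption; a weaker property such as `rdfs:seeAlso` may be more appropriate when the two resources are only related rather than identical.

```python
# Property used to assert that two URIs identify the same thing.
# Choosing owl:sameAs here is an assumption made for illustration.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(links):
    """Turn (subject_uri, object_uri) pairs into N-Triples statements."""
    return [f"<{s}> <{OWL_SAME_AS}> <{o}> ." for s, o in links]


# Hypothetical URIs for the same town in two independently published data sets.
links = [
    ("http://example.org/geo/places/bath", "http://example.com/gazetteer/id/1234"),
]

for line in same_as_triples(links):
    print(line)
```

A data provider could ship a small file of statements like this alongside the main data set, so that consumers don’t each have to rediscover the same correspondences.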
The diligence applies not only to data but also to schemas. If you’re publishing an RDF schema, it’s your job to make some effort to relate your terms to existing vocabularies where possible. Again, third parties can easily annotate your schema to include missing or additional relationships. However, to ensure we have not only a web of documents but also a web of schemas (allowing agents to explore ontology relationships), schema authors must include relevant links. There are some other best practices they should follow too.
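As a rough illustration of relating schema terms to existing vocabularies, the sketch below writes out two mapping statements as N-Triples. The `http://example.org/schema#` namespace and its `Member` and `headline` terms are hypothetical; FOAF and Dublin Core are used only as examples of well-known vocabularies a schema might link to, and the particular mappings are assumptions, not recommendations.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/schema#"  # hypothetical namespace for the new schema

# Each entry relates a term in the new schema to a term in an existing vocabulary.
MAPPINGS = [
    (EX + "Member", RDFS + "subClassOf", FOAF + "Person"),
    (EX + "headline", RDFS + "subPropertyOf", DC + "title"),
]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples as N-Triples text."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


print(to_ntriples(MAPPINGS))
```

With statements like these published as part of the schema itself, an agent that only understands FOAF or Dublin Core can still make some sense of data described with the new terms.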