There have been a number of discussions about “Enterprise Linked Data” recently, and I took part in a panel on precisely that topic at ESTC 2009. Unfortunately the panel was cut short due to time pressures, so I didn’t get the chance to say everything I’d hoped. In lieu of that debate, here’s a blog post containing a few thoughts on the subject.
When we refer to enterprise use of Linked Data, there are a number of different facets to that discussion which are worth highlighting. In my opinion the issues and justifications relating to each of them are quite different. So different, in fact, that we’re in danger of having a confused debate unless we tease out these different aspects.
Aspects of the Debate
In my view there are three facets to the discussion:
- Publishing Linked Data, the key question here being: What does an Enterprise stand to gain by publishing Linked Data?
- Consuming Linked Data: What does an Enterprise stand to gain from consuming Linked Data?
- Adopting Linked Data: What benefits can an Enterprise gain by deploying Linked Data technologies internally?
I think these facets, whilst obviously closely related, are largely orthogonal. For example, I could see a scenario in which an organization consumed Linked Data but didn’t store or use it as RDF, instead feeding it into existing applications. Similarly, businesses could clearly adopt Linked Data as a technology without publishing any data to the web at all.
These issues are also largely orthogonal to the Open Data discussion: an enterprise might use, consume and publish Linked Data, but this might not be completely open for others to reuse. The data may only be available behind the firewall, amongst authorised business partners, or to licensed third-parties. So, while the question of whether to publish open data is a very important aspect of the discussion, it’s not a defining one.
Here’s a few thoughts on each of these different facets.
Publishing Linked Data
So why might an enterprise publish Linked Data? And if that is a worthwhile goal, is it clear how to achieve it? Let’s tackle the second question first, as it’s the simpler of the two.
There is an increasingly large amount of good advice available online, as well as tools and applications, to support the publishing of Linked Data. We’re making good strides in the important transition of moving Linked Data out of the research arena and into the hands of actual practitioners. The How to Publish Linked Data on the Web tutorial is a great resource, but to my mind Jeni Tennison’s recent series on publishing Linked Data is an excellent end-to-end guide full of great practical advice.
We can declare victory when someone writes the O’Reilly book on the subject and does for Linked Data what RESTful Web Services did for REST. (And the two would make great companion pieces.)
But technology issues aside, what are the benefits to an organization in publishing Linked Data? There are several ways to approach answering that question but I think in most discussions Linked Data tends to get compared with Web APIs. The value of creating an API is now reasonably well understood, and many of the benefits that come from opening data through an API also apply to Linked Data.
However the argument that Linked Data married with a SPARQL endpoint is as easy for developers to use as a Web API is still a little weak at this stage. SPARQL can be off-putting for developers used to simpler, more tightly defined APIs. As a community we ought to treat it as a power tool and look for ways to make it easier to get started with. It’s also worth recognising that a search API is a useful addition to a SPARQL endpoint as part of a Linked Data deployment.
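To make that comparison concrete, here’s a minimal sketch of what calling a SPARQL endpoint over plain HTTP looks like: the SPARQL Protocol is just a GET request with the query as a parameter, and results can come back as JSON. The endpoint URL and the data behind it are hypothetical.

```python
# Sketch: querying a SPARQL endpoint over HTTP (SPARQL Protocol).
# The endpoint URL below is hypothetical; any conformant endpoint
# accepts a query via a GET parameter and can return JSON results.
from urllib.parse import urlencode

endpoint = "http://example.org/sparql"  # hypothetical endpoint
query = """
SELECT ?name WHERE {
  ?person <http://xmlns.com/foaf/0.1/name> ?name .
}
LIMIT 10
"""

# The request is just a URL; asking for JSON results keeps parsing
# as simple as any other Web API response.
url = endpoint + "?" + urlencode({"query": query})
headers = {"Accept": "application/sparql-results+json"}

print(url)
```

Fetching that URL (e.g. with `urllib.request`) and that Accept header returns a plain JSON structure of variable bindings, which is no harder to consume than a typical REST response; the unfamiliar part for most developers is the query language, not the plumbing.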
But publishing Linked Data can’t be directly compared to just creating an API, because it’s also largely a pattern for web publishing in general. It’s increasingly easy to instrument existing content management systems to expose RDF(a) and Linked Data. So rather than create a custom API, which involves significant development costs, particularly if it’s going to scale, it’s possible to simply expose Linked Data as part of an existing website.
By following the Linked Data pattern for web publishing, in particular the use of strong identifiers, an enterprise can end up with a single point of presence on the web for publishing all of its human and machine-readable data, resulting in a website that is strongly Search Engine Optimised. Search engines can better crawl and index well structured websites and are increasingly ingesting embedded RDFa to improve search results and rankings. That, by itself, is a strong incentive to publish Linked Data.
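As a sketch of what that pattern looks like in practice, here’s a minimal fragment of RDFa embedded in ordinary HTML, generated from Python for illustration; the organization, its URI, and the use of FOAF terms are all assumptions made up for this example.

```python
# Sketch: ordinary page content annotated with RDFa, so people and
# crawlers read the same page. The URI and FOAF terms describe a
# hypothetical organization; the HTML remains a normal web page.
company_uri = "http://example.org/company/acme"  # hypothetical strong identifier

html = f"""
<div xmlns:foaf="http://xmlns.com/foaf/0.1/"
     about="{company_uri}" typeof="foaf:Organization">
  <span property="foaf:name">Acme Ltd</span>
  <a rel="foaf:homepage" href="http://example.org/">Home</a>
</div>
""".strip()

print(html)
```

A crawler that understands RDFa can extract the statements (name, homepage) about the identifier, while a browser just renders the page: the website itself becomes the machine-readable interface, with no separate API to build or maintain.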
Adopting Linked Data, particularly as part of a reorganization of an existing web presence, could deliver improved search engine rankings and exposure of content whilst saving on the costs of developing and running a custom API. The longer term benefits of being part of the growing web of data can be the icing on the cake.
Consuming Linked Data
Next we can consider why an enterprise might want to consume Linked Data.
To my knowledge organizations are currently only publishing Linked Open Data (albeit with some wide variations in licensing terms), so we’ll set aside for the present the question of whether enterprises have the option of consuming non-open Linked Data, e.g. as part of a privately licensed dataset.
The LOD Cloud is still growing and provides a great resource of highly interlinked data. The main issues that face an organization consuming this data are ones of quantity (there’s still a lot more data that could be available); quality (how good is the data, and how well is it modelled); and trust (picking and choosing reliable sources).
To some extent these issues face any organization that begins relying on a third-party API or dataset. However at present a lot of the data in the LOD cloud is still from secondary sources. The same can’t be said for the majority of web APIs, which tend to be published by the original curators of the data.
These issues should resolve themselves over time as more primary sources join the LOD cloud. Because Linked Data is all based on the same data model, bulk loading and merging data from external sources is very simple. This gives enterprises the option of creating their own mirrors of LOD data sources, which provides some additional reassurance around stability and longevity.
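A sketch of why that merging is so simple: because everything in RDF reduces to subject–predicate–object triples, combining two sources is just a set union, with duplicate statements collapsing automatically. The triples below model this with plain tuples; the URIs and values are illustrative, not real DBpedia data.

```python
# Sketch: merging RDF from two sources is a set union of triples.
# Triples are modelled as (subject, predicate, object) tuples; the
# URIs and values are illustrative only.
london = "http://dbpedia.org/resource/London"

remote_source = {
    (london, "label", "London"),
    (london, "population", "7556900"),
}
local_mirror = {
    (london, "label", "London"),  # duplicate statement
    (london, "country", "http://dbpedia.org/resource/United_Kingdom"),
}

merged = remote_source | local_mirror  # the duplicate collapses
print(len(merged))  # → 3
```

Contrast this with merging two arbitrary JSON or XML APIs, where each pair of schemas needs bespoke mapping code before their records can even be compared.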
Linked Data, with its reliance on strong identifiers, is much easier to navigate and process than other sources, even if you’re not storing the results of that processing as RDF. There’s also a much greater chance of serendipity, resulting in the discovery of new data sources and new data items, whereas there is virtually no serendipity with a Web API, as each API needs to be explicitly integrated.
But this benefit will only become evident if we continue to put effort into helping (enterprise) developers understand how to consume Linked Data, e.g. as part of existing frameworks or through new data integration patterns; this is another area that needs more attention. The Consuming Linked Data tutorial at ISWC 2009 was a good step in that direction, although the message needs to be circulated more widely, outside of the core semantic web community.
In my opinion it will be easier for enterprises to consume Linked Data if they first begin to publish it. By publishing data they are putting their identifiers out into the wild. These identifiers become points for annotation and reuse by the community, creating liminal zones from which the enterprise can harvest and filter useful data. This is a benefit that I think is unique to Linked Data: with a Web API the end results are typically mashups or widgets displayed in a third-party application, and these are just new silos one step removed from the data publisher.
Adopting Linked Data
Finally, what value could be gained if an organization adopts Linked Data internally as a means to manage and integrate data behind the firewall?
The issues and potential benefits here are largely a mixture of the above, except that there are few or no issues with trust, as all of the data comes from known sources. In a typical enterprise environment Linked Data as an integration technology will be compared against a wider range of systems, from integrated developer tools through to middleware. There’s a reason why SOAP-based systems are still widely used in enterprise IT: most organizations aren’t (yet?) internally organized as if they were true microcosms of the web.
It’s interesting to see that Linked Data can potentially provide a means for solving many of the issues that Master Data Management is trying to address. Linked Data encourages strong identifiers; clean modelling; and linking to, rather than replicating, data. These are core issues for data consolidation within the enterprise. Coupled with the ability to link out to data that is part of the LOD Cloud, or published by business partners, Linked Data has the potential to provide a unifying infrastructure for managing both internal and external data sources.
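As a sketch of that “link, don’t replicate” idea: an internal identifier can be connected to an external one with a single sameAs statement, and facts can then be resolved across both at query time. The URIs, the simplified predicate names, and the `facts_about` helper are all hypothetical, chosen to keep the example short.

```python
# Sketch: identifier-led integration. Internal data links out to an
# external (LOD) identifier via sameAs rather than copying the data.
# URIs and predicate names are simplified and hypothetical.
internal = {
    ("http://data.example.com/product/99", "sku", "P-99"),
    ("http://data.example.com/product/99", "sameAs",
     "http://dbpedia.org/resource/Widget"),
}
external = {
    ("http://dbpedia.org/resource/Widget", "label", "Widget"),
}

def facts_about(uri, graph):
    """All (predicate, object) pairs for uri, following one sameAs hop."""
    same = {uri} | {o for (s, p, o) in graph if s == uri and p == "sameAs"}
    return {(p, o) for (s, p, o) in graph if s in same and p != "sameAs"}

merged = internal | external
print(sorted(facts_about("http://data.example.com/product/99", merged)))
# → [('label', 'Widget'), ('sku', 'P-99')]
```

The one link statement does the work that a Master Data Management system would otherwise do with bespoke reconciliation rules: both systems keep their own records, and the shared identifier stitches them together on demand.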
It’s worth noting, however, that semantic technologies in general, e.g. document analysis, entity extraction, reasoning and ontologies, seem to be much more widely deployed in enterprise systems than Linked Data. This is no doubt in large part because the advantages of those technologies are currently more easily articulated, as they’re more easily packaged into a product.
In this post I wanted to tease out some of the questions that underpin the discussions about enterprise adoption of Linked Data. I’ve presented a few thoughts on those questions and I’d love to hear your opinions.
Along the way I’ve attempted to highlight some areas where we need to focus to help the transition from a researcher-led to a practitioner-led community. More data, more documentation, and more tools are the key themes.