Interesting Papers from CIDR 2009

CIDR 2009 looks like it was an interesting conference, there were a lot of very interesting papers covering a whole range of data management and retrieval issues. The full list of papers can be browsed online, or downloaded as a zip file. There’s plenty of good stuff in there ranging from the energy costs of data management, forms of query analysis and computation on “big data”, and discussions on managing inconsistency in distributed systems.
Below I’ve pulled out a few of the papers that particularly caught my eye. You can find some other picks and summary on the Data Beta blog: part 1, and part 2.
Requirements for Science Databases and SciDB from Michael Stonebraker et al, presents the results of a requirement analysis covering the data management needs of scientific researchers in a number of different fields. Interestingly it seems that for none of the fields covered, which includes astronomy, oceanography, biologic, genomics and chemistry, is a relational structure a good fit for the underlying data models used in the data capture or analysis. In most cases an array based system is most suitable, while for biology, chemistry and genomics in particular a graph database would be best; semantic web folk take note. The paper goes on to discuss the design of SciDB which will be an open source array-based database suitable for use in a range of disciplines.
The Case for RodentStore, an Adaptive, Declarative Storage System, Cudre-Mauroux et al, introduces RodentStore an adaptive storage system that can be used at the heart of a number of different data management solutions. The system provides a declarative storage algebra that allows a logical schema to be mapped to a specific physical disk layout. This is interesting as it allows greater experimentation within the storage engine, allowing exploration of how different layouts may be used to optimise performance for specific applications and datasets. The system supports a range of different structures, including multi-dimensional data, and the authors note that the system can be used to manage RDF data.
Principles for Inconsistency, proposes some approaches for cleanly managing inconsistency in distributed applications, providing some useful additional context and implementation experience for those wrapping their heads around the notion of eventual consistency. I’m not sure that’d I’d follow all of these principles, mainly due to the implementation and/or storage overheads, but there’s a lot of good common sense here.
Harnessing the Deep Web: Present and Future, Madhavan et al, describes some recent work at Google to explore how to begin surfacing “Deep Web” information and data into search indexes. The Deep Web is defined by them as pages that are currently hidden behind search forms and that are not currently accessible to crawlers through other means. The work essentially involved discovering web forms, analysing existing pages from the same site in order to find candidate values to fill in fields in those forms, then automatically submitting the forms and indexing the results. The authors describe how this approach can be used to help answer factual queries, and is already in production on Google. This probably explains the factual answers that are appearing on search results pages. The approach is clearly in-line with Google’s mission to do as much as possible with statistical analysis of document corpora as possible, there’s very little synergy with other efforts going on elsewhere, e.g. linked data. There is reference to how understanding the semantics of forms, in particular the valid range of values for a field (e.g. a zip code) and co-dependencies between fields, could improve the results, but the authors also note that they’ve achieved a high level of accuracy in automated approaches to identifying common fields such as zip code, etc. A proposed further avenue for research is exploration of whether the contents of an underlying relational database can be reconsistuted through automated form submission and scraping of structured data from the resulting pages. Personally I think there are easier ways to achieve greater data publishing on the web! The authors reference some work on a search engine specifically for data surfaced in this way, called Web Tables which I’ve not looked at yet.
DBMSs Should Talk Back Too, Yannis Ioannidis and Alkis Simitsis, describes some work to explore how database query results and queries themselves can be turned into human-readable text (i.e. the reverse of a typical natural-language query system), arguing that this provides a good foundation for building more accessible data access mechanisms, as well as allowing easier summarisation of what a query is going to do, in order to validate it against the users expectations. The conversion of queries to text was less interesting to me than the exploration of how to walk a logical datamodel to generate text. I’ve very briefly explored summarising data in FOAF files, in order to generate an audible report using a text-to-speech engine, and so it was interesting to me to see that the authors were using a graph based representation of the data model to drive their engine. Class and relation labelling, with textual templates, are a key part of the system, and it seems much of this would work well against RDF datasets.
SocialScope: Enabling Information Discovery on Social Content Sites, Amer-Yahia et al, is a broad paper that introduces SocialScope a logical architecture for managing, analysing and presentation information derived from social content graphs. The paper introduces a logical algebra for describing operations on the social graph, e.g. producing recommendations based on analysis of a users social network; introduces a categorisation for types of content present in the social graph and means for managing it; and also discusses some ways to present results of searches against the content graph (e.g. for travel recommendations) using different facets and explanations of how recommendations are derived.

Enabling the Linked Data Ecosystem

This post was originally published in the Talis “Nodalities” blog and in “Nodalities” magazine issue 5.

he Linked Data web might usefully be viewed as an incremental evolution beyond Web 2.0. Instead of disconnected silos of data accessible only through disconnected custom APIs, we have datasets that are deeply connected to one another using simple web links, allowing applications to “follow their nose” to find additional relevant data about a specific resource. Custom protocols and data formats are the realm of the early web; the future of the web is in an increased emphasis on standards like HTTP, URIs and RDF that ironically have been in use for many years.

Describing this as a “back to basics” approach wouldn’t be far wrong. Many might dispute that RDF is far from simple, but this overlooks the elegance of its core model. Working within the constraints of standard technologies and the web architecture allows for a greater focus on the real drivers behind data publishing: what information do we want to share, and how is it modelled?

Answering those questions should be relatively easy for any organisation. All businesses have useful datasets that their customers and business partners might usefully access; and they have the domain expertise required to structure that data for online reuse. And, should any organisation want some additional creative input, the Linked Data community has also put together a shopping list [1] to highlight some specific datasets of interest. This list is worth reviewing alongside the Linked Data graph [2], to explore both the current state of the Linked Data web and the directions in which it is potentially going to grow.

Beyond the first questions of what and how to share data, there are other issues that need to be considered. These range from internal issues that organisations face in attempting to justify the sharing of data online, through to larger concerns that may impact the Linked Data ecosystem. For the purposes of this of article, this ecosystem can be divided up into two main categories: data publishers, who publish and share information online; and data consumers, who make use of these rich datasets.

There is obvious overlap between these two categories: many organisations will fall into both camps, as do we all through our personal contributions to the web. However, for this paper I want to focus primarily on business and organisational participants, and attempt to illustrate the different issues that are relevant to these  roles.

Data Publishers Perspective

The first issue facing any organisation is how to justify both the initial and ongoing effort required to support the publishing of Linked Data. Depending on existing infrastructure this may range from a relatively small effort to a major engineering task—particularly true if content has to be converted from other formats or new workflows introduced. In “A Call to Arms” in the last issue of Nodalities [3], John Sheridan and Jeni Tennison provided some insight into how to address the technology hurdle by using technologies like RDFa.

But can this effort be made sustainable? Can the initial investment and ongoing costs be recouped? And, if a dataset becomes popular and grows to become very heavily used, can the infrastructure supporting the data publishing scale to match?

The general aim with enabling access to data is that it will foster network effects, and drive increasing traffic and usage towards existing products and services. There are success stories aplenty (Amazon, Ebay, Salesforce, etc) that illustrate that there is real and not imagined potential.

But this justification overlooks some important distinctions. Firstly for some organisations, e.g. charities and non-governmental organisations information dissemination is part of their mission and there may not be other chargeable services to which additional traffic may be driven. In this scenario everything must be sustainable from the outset. Secondly, it also overlooks the fact that the data being shared may itself be an asset that can be commoditised. The value of access to raw data, stripped of any bundling application, has never been clearer, or been easier to achieve. New business models are likely to arise around direct access to quality data sources. Simple usage-based models are already prevalent on a number of Web 2.0 services and APIs—the free basic access fosters network effects, while the tiered pricing provides more reliable revenue for the data publisher.

Software as a service and cloud computing models undoubtedly have a role to play in addressing the sustainability and scaling issue, allowing data publishers to build out a publishing infrastructure that will support these operations without significant capital investments. But few of the existing services are really firmly targeted at this particular niche: while computing power and storage are increasingly readily available, support for Linked Data publishing or metered access to resources are not yet common-place.

This is where Talis and the Talis Platform have a distinct offering: by supporting organisations in their initial exploration of Linked Data publishing, with a minimum of initial investment, and a scaleable, standards based infrastructure, it becomes much easier to justify dipping a toe into the “Blue Ocean” (see Nodalities issue 2 [5]).

Data Consumers Perspective

Let’s turn now to another aspect of the Linked Data ecosystem, and consider the data consumers perspective.

One issue that quickly becomes apparent when integrating an application with a web service or Linked Dataset is the need to move beyond simple “on the fly” data requests,  e.g. to compose (“mash-up”) and view data sources in the browser, towards polling and harvesting increasingly large chunks of a Linked Dataset.

What drives this requirement? In part it is a natural consequence and benefit of the close linking of resources: links can be mined to find additional relevant metadata that can be used to enrich an application. The way that the data is exposed, e.g. as inter-related resources, is unlikely to always match the needs of the application developer who must harvest the data in order to index, process and analyse it so that it best fits the use cases of her application.

Creating an efficient web-crawling infrastructure is not an easy task, particularly as the growth of the Linked Data web continues and the pool of available data grows. Technologies like SPARQL do go some way towards mitigating these issues, as a query language allows for more flexibility in extracting data. However provision of a stable SPARQL endpoint may be beyond the reach of smaller data publishers, particularly those who are adopting the RDFa approach of instrumenting existing applications with embedded data.  SPARQL also doesn’t help address the need to analyse datasets, e.g. to mine the graph in order to generate recommendations, analyse social networks, etc.

Just as few applications carry out large scale crawling of the web, instead relying on services from a small number of large search engines, it seems reasonable to assume that the Linked Data web will similarly organise around some “true” semantic web search engines that provide data harvesting and acquisition services to machines rather than human users. Issues of trust will also need to be addressed within this community as the Linked Data web matures and becomes an increasing target for spam and other malicious uses. Inaccuracies and inconsistencies are already showing up.

The Talis Platform aims to address these issues by ultimately providing application developers with ready access to Linked Datasets, avoiding the need for individual users and organisations to repeatedly crawl the web. Value-added services can then be offered across these data sources, allowing features, such as graph analysis (e.g. recommendations), to become commodity services available to all. The intention is not to try and mirror or aggregate the whole Linked Data web, this would be unfeasible, but rather collate those datasets that are of most value and use to the community, as well as shepherding the publishing of new datasets by working closely with data publishers.

As an intermediary, the Talis Platform can also address another issue: that of scaling service infrastructure to meet the requirements of data consumers without requiring data publishers to do likewise. It seems likely that data publishers may ultimately choose to “multi-home” their datasets, e.g. publishing directly onto the Linked Data web and also within environments such as the Talis Platform in order to allow consumers more choice in the method of data access.


The bootstrapping phase of the Linked Data web is now behind us. As a community, we need to begin considering the next steps, especially as the available data continues to grow.  This article has attempted to illustrate a few from a wide range of different issues that we face. While technology development, particularly around key standards like SPARQL, rules and inferencing, and the creation of core vocabularies, will always underpin the growth of the semantic web, increasingly it will be issues such as serviceable infrastructure and sustainable business models that will come to the fore.

At Talis we are thinking carefully about the role we might play in addressing those issues and playing our part in enabling the Linked Data ecosystem to flourish.