Data integration is easy with Semantic Web technologies, right?
We’ve all said it, but has anyone actually sat down and tried to elucidate the ways in which technologies like RDF and OWL actually help with data integration? I don’t remember ever seeing such a write-up, so here’s a first attempt.
Each of the sections below tries to work through a potential data integration scenario, attempting to demonstrate how RDF and/or OWL enable easier integration. Along the way I’ve tried to tease out a few common misconceptions.
Before we wade into our first example, if you’re not already familiar with N3 then you may want to review the first few sections of the N3 Primer. Actually, read the whole thing. But you’ll only need the syntax covered in the first few sections to understand the examples shown below.
The One Where We Share Identifiers
This is the most trivial case, but we have to start somewhere. In this scenario you and I have both published some data onto the Semantic Web. The RDF syntax we’ve used is irrelevant, as it’s the assertions in the data that are significant.
As the example data shown below illustrates, we’ve both published metadata about the same resource: we’ve used the same URI as the subject of our RDF statements.
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://www.example.org/ns/> .

# My Data
<http://www.example.org/some/resource> dc:title "About Squirrels".

# Your Data
<http://www.example.org/some/resource> ex:comment "Should discuss red squirrels more".
This kind of data integration is automatic for RDF tools. If you put this data into an RDF triple store then the end result is that the store will hold two facts about the resource http://www.example.org/some/resource: its title, and a comment suggesting that it may be biased towards grey squirrels.
To make a rough analogy, this would be like your relational database automatically adding columns to your existing tables so that there’s always a home for any data you attempt to store. There’s no equivalent for XML documents: if you throw the data into an XML repository you’ll still just have two documents, unless you write some code.
The ability to simply throw statements into a triple store and then manipulate them with common tools (e.g. SPARQL) is extremely useful, but as it doesn’t directly relate to data integration I won’t belabour it here.
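To make the “automatic” claim concrete, here’s a minimal sketch in Python, treating a triple store as nothing more than a set of (subject, predicate, object) tuples (a real triple store does all of this for you); merging the two sources is just set union:

```python
# A triple store is conceptually a set of (subject, predicate, object)
# tuples, so merging two RDF sources is set union. The data below mirrors
# the squirrels example above.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
EX_COMMENT = "http://www.example.org/ns/comment"
RESOURCE = "http://www.example.org/some/resource"

my_data = {(RESOURCE, DC_TITLE, "About Squirrels")}
your_data = {(RESOURCE, EX_COMMENT, "Should discuss red squirrels more")}

# Integration is nothing more than set union.
store = my_data | your_data

# All facts about the shared resource, regardless of who published them:
facts = {(p, o) for (s, p, o) in store if s == RESOURCE}
```

Notice there’s no schema migration and no mapping step: because both sources used the same subject URI, the union is the integration.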
One reaction to this kind of integration scenario is this: “this is all well and good but in practice it’ll never happen, how do we agree to use the same identifiers? Who is going to co-ordinate that?”
This kind of argument obviously glosses over all the examples of communities that do get together and agree on shared identifiers: the ISSN, ISBN and DOI, for example. Agreement on shared identifiers usually happens when there’s a demonstrable benefit in doing so; you know, benefits like being able to integrate data more easily.
The “it’ll never happen” argument, when applied to identifiers, also glosses over the serendipitous case, where users happen to have hit on the same identifier for something, and we are still able to seamlessly merge the data. This happens too: for example, how many times have you referred to a concept by linking to its Wikipedia entry?
The One Where We’re Describing the Same Thing
But what about the case where we’ve not used the same identifier, or our community hasn’t gotten its act together and standardized how we name things? Does the Semantic Web help there?
Well, in this scenario we’ve published information about what we think are separate resources but which are in fact the same thing, i.e. we’re referring to the same thing by different names.
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# My Data
<http://www.example.org/user/ldodds>
    foaf:name "Leigh Dodds";
    foaf:mbox <mailto:firstname.lastname@example.org>;
    foaf:weblog <http://www.ldodds.com/blog>.

# Your Data
<http://www.example.com/person/ldodds>
    foaf:name "Leigh Dodds";
    foaf:mbox <mailto:firstname.lastname@example.org>;
    foaf:depiction <http://www.ldodds.com/img/ldodds-corner.jpg>.
In this case, two social networking sites may have published information about me but assigned me two different URIs when they’ve exposed some FOAF data. So, if someone aggregates that data, instead of a single resource which has properties for my name, email address, homepage and a link to a photo, we still have two separate resources.
How do we go about integrating in this scenario? Well, like the first example, this too can be achieved automatically. All the information we need to drive the disambiguation is available from the FOAF schema. In this instance, the key bit of metadata is this:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

foaf:mbox a owl:InverseFunctionalProperty.
What this says is that the foaf:mbox property has a special significance: in OWL terms it’s an Inverse Functional Property.
An Inverse Functional Property is one whose value uniquely identifies a resource. Roughly, in relational database terms, it’s a “primary key”. To state this more explicitly: if we find two resources that have different URIs but the same value for an Inverse Functional Property, then we know that those resources are actually equivalent, and we can merge the statements we have about them.
If your repository or RDF system is OWL-aware, and you provide it with the above fact, then it will automatically treat the resources as equivalent and you need do nothing more. But what if you’re not using OWL? Well, if you have a rules engine, then it’s possible to restate the fact as a set of rules that achieve the same end without the (rather hefty) overhead of a full OWL reasoner.
Don’t have a rules engine to hand? Well, implementing this kind of merging, known as “smushing” in the Semantic Web community, is very straightforward. It basically involves shuffling properties around in an RDF graph, and is no more complex than shuffling nodes around in a DOM tree (less so, in fact). But however you approach it, the important thing is that the whole process relies on being able to attach a specific data processing rule (“treat this like a primary key”) to a property in an RDF schema.
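Here’s one way that shuffling might look in plain Python, with triples as tuples and a hypothetical shared mailbox standing in for a real one (a real RDF library would distinguish literals from resources, and handle blank nodes, for you):

```python
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

def smush(triples, ifp_predicates):
    """Merge resources that share a value for an Inverse Functional
    Property by rewriting their URIs to a single canonical one."""
    canonical = {}  # (ifp predicate, value) -> first URI seen with that value
    alias = {}      # duplicate URI -> canonical URI
    for s, p, o in triples:
        if p in ifp_predicates:
            if (p, o) in canonical and canonical[(p, o)] != s:
                alias[s] = canonical[(p, o)]
            else:
                canonical.setdefault((p, o), s)
    rewrite = lambda node: alias.get(node, node)
    return {(rewrite(s), p, rewrite(o)) for s, p, o in triples}

# Two descriptions of the same person under different URIs, sharing a
# mailbox (a placeholder address, standing in for a real shared mbox):
data = {
    ("http://www.example.org/user/ldodds", FOAF_NAME, "Leigh Dodds"),
    ("http://www.example.org/user/ldodds", FOAF_MBOX, "mailto:firstname.lastname@example.org"),
    ("http://www.example.com/person/ldodds", FOAF_NAME, "Leigh Dodds"),
    ("http://www.example.com/person/ldodds", FOAF_MBOX, "mailto:firstname.lastname@example.org"),
}
merged = smush(data, {FOAF_MBOX})
```

After smushing, all four statements hang off a single resource; the duplicate name and mailbox triples collapse away in the set.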
And those rules can be discovered, automatically, by dereferencing the schema URI in order to obtain the RDF or OWL schema associated with a particular vocabulary. It’s here, with this very simple principle, that the “semantics” and “web” in the Semantic Web show themselves. RDF is grounded in the web by its very use of URIs, and by traversing those URIs we can find not only more data, but also new ways of interpreting and processing that data. There’s nothing equivalent to this in the XML technology stack. And while the world of ontologies can get very complex, it’s all predicated on some simple principles like Inverse Functional Properties.
Before moving on we need to look at another useful OWL property that is helpful in the current scenario: owl:sameAs. This property simply states that two URIs identify the same resource.
Based on what we know about Inverse Functional Properties, we can now state the rule: if ResourceA and ResourceB have the same value for an Inverse Functional Property, then ResourceA owl:sameAs ResourceB.
How is this useful? Well it allows us to re-publish the new fact that we’ve discovered. For example I could adjust my data as follows:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# My Data
<http://www.example.org/user/ldodds>
    foaf:name "Leigh Dodds";
    foaf:mbox <mailto:firstname.lastname@example.org>;
    foaf:weblog <http://www.ldodds.com/blog>;
    owl:sameAs <http://www.example.com/person/ldodds>.
Not only that: a third party might have gone to the trouble of discovering and publishing the fact that those two URIs are equivalent, in which case we could have circumvented the whole smushing exercise and acted on the owl:sameAs statement directly.
And what’s interesting, especially in the case where we’re merging data across social networks, is that it could be me who publishes the data that connects my data across two different networks, i.e. I could be the third party that provides the additional fact that helps to disambiguate the data.
The One Where We’re Speaking Different Languages
Let’s move on from trying to identify common resources, to focus on how RDF and OWL support mapping between different vocabularies.
For this example let’s consider three sources of data: yours, mine, and Bob’s:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.net/picture/vocab/> .

# My Data
<http://www.example.org/paper/1> dc:creator </user/ldodds>.

# Your Data
<http://www.example.com/page/ldodds> foaf:maker </person/ldodds>.

# Bob's Photo Data
<http://www.example.net/pictures/3> ex:photographer </person/ldodds>.
Let’s say that I’m publishing information about documents in a publishing system, and have indicated the author of a particular academic paper using dc:creator. And let’s also say that you’re doing a similar thing for wiki pages, but decided to use foaf:maker. In other words, we’ve used two different vocabularies to publish essentially the same information. And just to compound the problem, Bob decided to publish the metadata about a photo using a vocabulary he cooked up in 10 minutes while rushing to meet a deadline.
How does RDF help us here? If we were, say, creating a simple web site that lists people and all of the creative works they were involved with, how can we merge all of the data in order to present a nice consistent view of it? And, additionally, can we avoid having to add support for each new vocabulary as we aggregate data that uses one? To make this even more concrete, let’s say that our application has been built to operate on another property entirely: work:creativeAuthor. How do we equate these other properties with this term?
Well, just like with the Inverse Functional Property example, we can look to the RDF and/or OWL schema that describes a specific vocabulary and see if it provides some guidance on how to disambiguate and merge the data.
There are several forms this annotation can take. The first is rdfs:subPropertyOf, which indicates that one RDF property is a specialization of another existing term. The second is owl:equivalentProperty, which indicates that two properties, usually from different vocabularies, are in fact equivalent. In terms of data integration, if we see either of these properties we can (broadly) consider the terms to be equivalent. If our RDF processing applications can discover these relationships, and ultimately relate the terms back to our preferred term (work:creativeAuthor), then we need do nothing more.
However, given the early state of vocabulary development on the Semantic Web, not all schemas include these kinds of property relationships. Vocabulary authors should be encouraged to relate their terms to existing terms in other vocabularies. (And equivalence isn’t the only relationship; see the OWL specification for more options.) Actually, people should just be encouraged to reuse vocabularies where they can, but until the benefits of doing so are clearer, declared equivalences seem a useful compromise.
And, happily, we don’t have to rely on the vocabulary owners to do this; we can publish RDF statements ourselves that state these equivalences. So just as we can use owl:sameAs to equate resources, we can publish statements that use owl:equivalentProperty to equate terms.
So you see the Semantic Web includes User Contributed Schemas as well as User Generated Content.
It’s this kind of gradual, loosely co-ordinated integration of vocabularies that Semantic Web critics often gloss over. The argument typically runs that the Semantic Web requires globally agreed and standardized ontologies, whereas in actual fact, while this might be an ideal situation, it’s just that: an ideal. Nothing about the Semantic Web requires this at all. Instead the Semantic Web encourages the publishing of community-owned schemas and the gradual evolution of schemas and identifiers within communities of practice. Small vocabularies, loosely joined is the order of the day.
Returning to our earlier example, here’s how we’d configure our application to treat each of the three terms as equivalent:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.net/picture/vocab/> .
@prefix work: <http://example.org/works/> .

work:creativeAuthor
    owl:equivalentProperty dc:creator;
    owl:equivalentProperty foaf:maker;
    owl:equivalentProperty ex:photographer.
Just as with owl:InverseFunctionalProperty, if we have an OWL reasoner or a rules language then we’re done. Otherwise it’s a small piece of work to write some code that shuffles the RDF triples so that our application can interpret and act on these declarations.
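That triple-shuffling might look something like this in Python (plain tuples again; the work:creativeAuthor URI is the hypothetical one from our example, and the mapping table is what we’d derive from the owl:equivalentProperty declarations):

```python
# The application's preferred term (a hypothetical vocabulary):
WORK_CREATIVE_AUTHOR = "http://example.org/works/creativeAuthor"

# Derived from owl:equivalentProperty declarations, whether discovered
# by dereferencing schema URIs or configured by hand:
EQUIVALENTS = {
    "http://purl.org/dc/elements/1.1/creator": WORK_CREATIVE_AUTHOR,
    "http://xmlns.com/foaf/0.1/maker": WORK_CREATIVE_AUTHOR,
    "http://example.net/picture/vocab/photographer": WORK_CREATIVE_AUTHOR,
}

def normalise(triples, equivalents):
    """Rewrite each predicate to its preferred equivalent, if one is known."""
    return {(s, equivalents.get(p, p), o) for s, p, o in triples}

data = {
    ("http://www.example.org/paper/1",
     "http://purl.org/dc/elements/1.1/creator", "/user/ldodds"),
    ("http://www.example.com/page/ldodds",
     "http://xmlns.com/foaf/0.1/maker", "/person/ldodds"),
    ("http://www.example.net/pictures/3",
     "http://example.net/picture/vocab/photographer", "/person/ldodds"),
}
merged = normalise(data, EQUIVALENTS)
```

After normalisation the application only ever sees work:creativeAuthor, however the source data was originally expressed.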
The One Where We’re Using Different Units
One of the grinding chores of data integration is dealing with mapping between different data types: numeric type conversions, date formats, etc. Can we look for any help here? Yes, some. RDF allows literal values to be annotated with a reference to their data type, which provides some opportunity for applying automated mappings.
The standard allows use of the existing XML Schema datatypes, as well as custom datatypes. The semantics of mapping between the XML Schema types are well defined, but this isn’t the case for custom datatypes, and currently there’s no means to express such mappings in RDF.
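A sketch of what acting on datatype annotations might look like, mapping a literal’s lexical form plus its XML Schema datatype URI to a native Python value (only a handful of types shown here; a real RDF library ships with a much fuller mapping):

```python
from datetime import date

XSD = "http://www.w3.org/2001/XMLSchema#"

# Converters from XML Schema datatype URIs to native Python values;
# unknown or absent datatypes fall back to the plain lexical form.
CONVERTERS = {
    XSD + "integer": int,
    XSD + "double": float,
    XSD + "boolean": lambda v: v in ("true", "1"),
    XSD + "date": date.fromisoformat,
}

def to_python(lexical, datatype=None):
    """Turn a typed literal's lexical form into a comparable native value."""
    return CONVERTERS.get(datatype, str)(lexical)
```

With values converted like this, comparing or merging data from two sources that used, say, xsd:integer and a plain string becomes a matter of ordinary equality checks rather than string matching.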
Adding datatypes to RDF literals is something that’s often glossed over in discussions of how to publish data as RDF. But think about it from the perspective of publishing facts: if you know the type of your data, why not publish that too? It’ll almost certainly help someone downstream.
The One Where We’re Speaking At Different Levels of Abstraction
OK, this is the hard one. What if the data we’re trying to integrate is based on completely different models, at different levels of abstraction, such that we can’t use any of the previous techniques?
The most trivial example is where one set of data uses a Resource and the other a Literal:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

# My Data
<http://www.example.org/paper/1> dc:creator "Leigh Dodds".

# Your Data
<http://www.example.com/page/ldodds> foaf:maker </person/ldodds>.
<http://www.example.com/person/ldodds> foaf:name "Leigh Dodds".
We can use reasoning to help with this kind of scenario. For example, you could infer that whenever I refer to the name of an author using the dc:creator property, there exists a foaf:Person with that name as its foaf:name. But without further disambiguating data, e.g. an email address, it’s not possible to infer that we’re actually talking about the same person.
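The literal-to-resource half of that inference might be sketched like this (plain tuples again; the startswith check is a crude stand-in for a proper literal/resource distinction, which a real RDF library would give you directly):

```python
import itertools

DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

def lift_creators(triples):
    """Replace each literal-valued dc:creator with a freshly minted blank
    node typed as foaf:Person, carrying the literal as its foaf:name."""
    fresh = itertools.count()
    out = set()
    for s, p, o in triples:
        # Crude heuristic: treat anything that isn't a URI reference,
        # relative path, or blank node label as a literal.
        if p == DC_CREATOR and not o.startswith(("http://", "/", "_:")):
            person = "_:creator%d" % next(fresh)
            out.add((s, DC_CREATOR, person))
            out.add((person, RDF_TYPE, FOAF_PERSON))
            out.add((person, FOAF_NAME, o))
        else:
            out.add((s, p, o))
    return out

lifted = lift_creators({("http://www.example.org/paper/1",
                         DC_CREATOR, "Leigh Dodds")})
```

This gets both datasets to the same level of abstraction; deciding whether the two foaf:Person resources are the same person still needs the disambiguating evidence discussed above.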
You could obviously just infer this from the name alone, but you’d have to be pretty sure of the scope and reliability of your data sources to rely on that, particularly on the open Internet. In closed contexts this may not seem so rash: for example, we may just need to infer equality based on a part number or other reliable indicator.
OWL offers a great deal of power for inferencing over data like this, which provides numerous ways to tackle this kind of problem. And, importantly (I think), it’s a declarative approach, so it doesn’t necessarily mean grinding out lots of code. But that power comes at a price, and OWL is certainly complex for the uninitiated.
Stefano Mazzocchi has an interesting essay that digs further into the problems this scenario presents; a problem he describes as “under modelling”. In that essay Stefano correctly argues that equivalences of the kind I’ve described above don’t address all integration problems, and that there’s often a need for something more. I agree, and what I’ve attempted to do here is outline exactly the contexts in which these techniques might be useful.
There’s obviously a whole lot more ground that could be covered, ranging from more in-depth discussion of ontology-based approaches to data integration, through to vocabulary versioning and evolution. But hopefully this essay is a useful reference for those interested in learning more about data integration on the Semantic Web. If you’ve found it useful, or think anything is just plain wrong, then drop me a line.