Yesterday Jeni Tennison published a blog post outlining some frustrations with RDF datatyping. In particular, she pointed to the lack of an appropriate datatype on RDF literals, and the difficulties this adds when processing the data, e.g. for visualisation.
I left a comment on Jeni’s blog, partly because I was responsible for generating some of the data she was unhappy with, and partly to address the question about the inability to cast literal values in SPARQL (you can, in some circumstances). To clarify one point in my comment, the main reason I didn’t add a datatype to certain literals was that the source data had a variable format, including some embedded comments. As I was rushing to put the data together for the recent Guardian Hack Day, I thought it sufficient to leave the data and comments intact, rather than strip the comments off and run the risk of publishing data that was either incorrect or incomplete.
I think this ought to be generalised into a further addition, or perhaps refinement, to Jeni’s list of suggestions at the end of her blog post. Jeni’s point is that one should always use a datatype and/or language code where possible. My refinement is that you should only do that when you’ve taken the trouble to ensure that the literal is lexically valid according to the datatype you’ve specified, or that you’re using a valid language code. It’s just as bad to publish a literal with an incorrect datatype as it is to publish one without.
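As a rough illustration of that refinement, here is a minimal Python sketch that only attaches a datatype once the lexical form has been checked, and otherwise falls back to a plain literal. The names `to_literal` and `VALIDATORS` are hypothetical, the output is an N-Triples style string, and the date check is deliberately simplified (four-digit years only, no validation of the timezone range), so treat it as a sketch under those assumptions rather than a complete validator.

```python
import re
from datetime import date

XSD = "http://www.w3.org/2001/XMLSchema#"


def is_valid_integer(lexical: str) -> bool:
    """xsd:integer: an optional sign followed by one or more digits."""
    return re.fullmatch(r"[+-]?[0-9]+", lexical) is not None


def is_valid_date(lexical: str) -> bool:
    """Simplified xsd:date check: YYYY-MM-DD with an optional timezone."""
    match = re.fullmatch(
        r"([0-9]{4})-([0-9]{2})-([0-9]{2})(Z|[+-][0-9]{2}:[0-9]{2})?", lexical
    )
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups()[:3])
    try:
        date(year, month, day)  # rejects impossible dates such as 30 February
    except ValueError:
        return False
    return True


# Hypothetical registry: only the datatypes we know how to check.
VALIDATORS = {
    XSD + "integer": is_valid_integer,
    XSD + "date": is_valid_date,
}


def to_literal(lexical: str, datatype: str) -> str:
    """Return a typed literal only if the lexical form was verified as valid."""
    escaped = lexical.replace("\\", "\\\\").replace('"', '\\"')
    check = VALIDATORS.get(datatype)
    if check is not None and check(lexical):
        return f'"{escaped}"^^<{datatype}>'
    # Unverified or invalid: publish a plain literal instead of a wrong type.
    return f'"{escaped}"'


print(to_literal("1500", XSD + "integer"))
print(to_literal("1500 (estimated)", XSD + "integer"))
```

With this approach a value such as `1500 (estimated)`, with its embedded comment, is published untyped rather than with a datatype it does not conform to.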
This may seem obvious, but from some work I’ve been doing recently, it seems that this isn’t always being followed when people publish linked open data. There are plenty of examples of invalid datatypes in DBpedia, for example. And it’s pretty easy to flush them out, e.g. by attempting to parse the data using the tdbcheck command-line application shipped with TDB.
One reason I think the issue hasn’t been more obvious is that the RDF specifications (and here I refer to the core specifications as well as the various alternate syntaxes) are not clear on how an RDF processor should handle incorrect data values or invalid language codes. The result is that different RDF parsers apply different rules. In my opinion this is an interoperability issue that needs to be addressed.
For example, the Jena parsers are generally quite strict, and will emit an error if a literal doesn’t conform to its stated type, whereas rapper doesn’t complain about typing errors, even if placed into strict mode. It would be a useful exercise to test out a range of parsers and RDF triplestores to see how they behave in this regard.
In the section on Typed Literals, the RDF primer notes that:
RDF does not define any datatypes, the actual interpretation of a typed literal appearing in an RDF graph (e.g., determining the value it denotes) must be performed by software that is written to correctly process not only RDF, but the typed literal’s datatype as well. Effectively, this software must be written to process an extended language that includes not only RDF, but also the datatype, as part of its built-in vocabulary. This raises the issue of which datatypes will be generally available in RDF software
The section goes on to note that, as arbitrary URIs can be used to identify datatypes, an RDF processor may well encounter types it doesn’t know anything about. In these circumstances I think it’s acceptable for the processor to simply store or report the type, but attempt no validation.
However, for a set of well-defined and well-specified types, such as those taken from XML Schema, parsers ought to go the extra mile and attempt to validate the data, producing an error, or at the very least a warning, if the data is not valid. By my reading, this goes beyond what is currently required in the RDF semantics, but I would argue that this is a useful and practical step towards ensuring interoperability in RDF data exchange.
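To make the suggested behaviour concrete, the following Python sketch shows one way a processor might treat typed literals: unknown datatypes are passed through unchecked, while a few well-known XML Schema types are validated, with a warning by default or an error in a strict mode. The function name `check_typed_literal`, the `strict` flag and the three status strings are assumptions for illustration, not the behaviour of any particular parser, and the sketch skips the whitespace normalisation that XML Schema applies before validation.

```python
import re
import warnings

XSD = "http://www.w3.org/2001/XMLSchema#"

# A small, illustrative subset of XML Schema types and their lexical forms.
KNOWN_TYPES = {
    XSD + "boolean": re.compile(r"true|false|1|0"),
    XSD + "integer": re.compile(r"[+-]?[0-9]+"),
    XSD + "decimal": re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)"),
}


def check_typed_literal(lexical: str, datatype: str, strict: bool = False) -> str:
    """Classify a typed literal as 'valid', 'invalid' or 'unchecked'."""
    pattern = KNOWN_TYPES.get(datatype)
    if pattern is None:
        # Unknown datatype URI: store or report it, but attempt no validation.
        return "unchecked"
    if pattern.fullmatch(lexical):
        return "valid"
    message = f"{lexical!r} is not a valid lexical form for <{datatype}>"
    if strict:
        raise ValueError(message)
    # Lenient mode: keep processing, but never stay silent about bad data.
    warnings.warn(message)
    return "invalid"


print(check_typed_literal("3.14", XSD + "decimal"))
print(check_typed_literal("three", XSD + "integer"))
print(check_typed_literal("P1D", "http://example.org/types#span"))
```

The design choice here is simply that invalid values for a known type are always surfaced, either as a warning or as an error, and are never accepted quietly.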
The RDF semantics even define a subset of the XML Schema types that are suitable for use in RDF. So in one sense my suggestion is simply taking this recommendation further and suggesting support for those types as a minimum for any processor. Silently processing data with invalid values won’t help flush out problems.
There’s also a list of “unsuitable” datatypes which includes xsd:duration. This is currently in use in the BBC programmes data to capture the duration of broadcast episodes. The specification notes that:
…this may be corrected in later revisions of XML Schema datatypes, in which case the revised datatype would be suitable for use in RDF datatyping…
Which leaves me unclear about the status of xsd:duration as a useful datatype. Is it or isn’t it?
I suspect the same interoperability issue may affect language tags. Jena, again, is fairly draconian in its parsing of xml:lang attributes. But that only applies to its RDF/XML parser. The alternate parsers, e.g. for N-Triples, behave differently and will happily accept values that the RDF/XML parser will reject. This is undoubtedly because the Turtle, N3, and N-Triples specifications have little or nothing to say about language codes associated with literals, simply defining their lexical rules, whereas the XML format builds on the xml:lang attribute rules, and those rules are defined in terms of BCP 47. However, the RDF Concepts specification references the now obsolete RFC 3066.
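For comparison, the syntax that RFC 3066 defines for language tags is small enough to check with a single regular expression: a primary subtag of one to eight letters, followed by any number of hyphen-separated subtags of one to eight letters or digits. The Python sketch below (the function name is hypothetical) tests well-formedness only. It does not consult any registry of language codes, and it does not implement the more detailed BCP 47 grammar.

```python
import re

# RFC 3066 grammar:
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_well_formed_language_tag(tag: str) -> bool:
    """Check the syntax of a language tag, not whether its codes are registered."""
    return RFC3066_TAG.fullmatch(tag) is not None


for candidate in ["en", "en-GB", "en_GB", "", "x-klingon"]:
    print(repr(candidate), is_well_formed_language_tag(candidate))
```

A check like this would reject `en_GB` or an empty tag regardless of which RDF syntax the literal arrived in.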
It feels to me like there’s a need to clarify the correct approaches not only to publishing but also to processing typed RDF data. If there’s no scope to do this within the W3C RDF activity, then perhaps the community could work together to clarify best practices?