Lee Feigenbaum has put together a really nice posting discussing different ways of modelling statistical data using RDF. I wanted to contribute to that discussion and add in a few comments about how I’ve been modelling some of the OECD’s statistical publications using RDF.
Note the emphasis: what I’ve been doing is capturing metadata about individual statistical tables and graphs: their association with specific publications, their classifications, and so on. I’ve not attempted to capture the detail of the statistics themselves, but I do have a few relevant comments on that below.
The background to this is that I’m currently the technical lead on a project to build the latest version of the OECD’s electronic library. All of the metadata is stored in RDF, with content available as HTML, PDFs, Excel spreadsheets, or as views into OECD.stat, the application that the OECD have developed as a power tool for housing and delivering their statistical data.
As Lee discovered in the EuroStat data, regions and countries are core concepts. All of the OECD’s statistical output can be classified by country and region, and these are types defined within our schema. We assign URIs to countries using the ISO 3166-1 alpha-2 country code or, when classifying data that refers to countries that no longer exist as a distinct entity (e.g. Yugoslavia), the ISO 3166-3 four-letter country code.
A country may be associated with zero or more Regions, using an Is Part Of relationship. A region may be the European Union, the OECD member states, or some other arbitrary grouping. I suspect the same basic requirements will apply to other statistical datasets.
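To make that a little more concrete, here’s a minimal sketch of the country and region modelling using rdflib. The namespaces, and the choice of dcterms:isPartOf for the Is Part Of relationship, are my own illustration of the approach rather than the actual OECD URIs or schema terms.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import DCTERMS, RDFS

# Hypothetical namespaces -- the real OECD URI scheme isn't shown here.
OECD = Namespace("http://example.oecd.org/def/")
COUNTRY = Namespace("http://example.oecd.org/id/country/")
REGION = Namespace("http://example.oecd.org/id/region/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("oecd", OECD)

# A current country, identified by its ISO 3166-1 alpha-2 code.
france = COUNTRY["FR"]
g.add((france, RDF.type, OECD.Country))
g.add((france, RDFS.label, Literal("France", lang="en")))

# A former country, identified by its ISO 3166-3 four-letter code.
yugoslavia = COUNTRY["YUCS"]
g.add((yugoslavia, RDF.type, OECD.Country))
g.add((yugoslavia, RDFS.label, Literal("Yugoslavia", lang="en")))

# Region membership expressed as an "is part of" relationship.
eu = REGION["european-union"]
g.add((eu, RDF.type, OECD.Region))
g.add((eu, RDFS.label, Literal("European Union", lang="en")))
g.add((france, DCTERMS.isPartOf, eu))

print(g.serialize(format="turtle"))
```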
There are some other types of classification that we associate with the tables (there’s a rough sketch of these after the list):
- An indicator of whether the table is a “comparative table”, i.e. does it include data from multiple countries?
- An association between the table and a “Table Series”, which constitutes a collection of tables published over time
- The statistical Variables that the table contains, e.g. GDP
- A summary of the time range that the table covers, e.g. “2007”, “2005-2007”, “2000, 2002-2005”, etc. These are captured as simple literals for now, as we need to do little or no processing on them at this level.
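The sketch below shows roughly how those classifications might hang off a table resource. As before, the namespaces and property names are made up for illustration rather than being the real schema terms.

```python
from rdflib import Graph, Literal, Namespace, RDF

# Hypothetical terms -- property and class names are illustrative only.
OECD = Namespace("http://example.oecd.org/def/")
TABLE = Namespace("http://example.oecd.org/id/table/")

g = Graph()
g.bind("oecd", OECD)

t = TABLE["gdp-2007"]
g.add((t, RDF.type, OECD.Table))

# Is this a comparative table, i.e. does it include data from multiple countries?
g.add((t, OECD.comparativeTable, Literal(True)))

# Membership of a Table Series published over time.
g.add((t, OECD.tableSeries, TABLE["series/gdp"]))

# The statistical variable(s) the table contains.
g.add((t, OECD.variable, OECD["variable/GDP"]))

# Time coverage kept as a simple literal, since we do little processing on it.
g.add((t, OECD.timeCoverage, Literal("2000, 2002-2005")))

print(g.serialize(format="turtle"))
```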
And then there’s the usual collection of title, description, etc. all as multi-lingual literals. All tables are also assigned a DOI to provide a stable link that can be cited in publications. If the table was originally published in a specific Book or journal Article then that relationship is also captured.
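And a final sketch covering that descriptive layer: language-tagged literals for the titles, the DOI, and the link back to the parent publication. Again, the namespaces, property choices and the DOI value are illustrative assumptions, not the actual metadata.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS

# Hypothetical namespaces and identifiers, for illustration only.
OECD = Namespace("http://example.oecd.org/def/")
TABLE = Namespace("http://example.oecd.org/id/table/")
BOOK = Namespace("http://example.oecd.org/id/book/")

g = Graph()
g.bind("dcterms", DCTERMS)

t = TABLE["gdp-2007"]

# Multi-lingual title as language-tagged literals.
g.add((t, DCTERMS.title, Literal("Gross domestic product", lang="en")))
g.add((t, DCTERMS.title, Literal("Produit intérieur brut", lang="fr")))

# A DOI provides a stable, citable identifier for the table (made-up value).
g.add((t, DCTERMS.identifier, Literal("doi:10.1787/example-doi")))

# Link back to the book or journal article the table was first published in.
g.add((t, DCTERMS.isPartOf, BOOK["example-factbook-2008"]))

print(g.serialize(format="turtle"))
```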
Obviously this metadata is, largely, at a level above that which Lee has been exploring, but I thought this might provide some useful context. For anyone looking at capturing statistical data in RDF, there are some other useful places to look at for defining terms and drawing on prior experience.
Firstly, the Journal of Economic Literature Classification provides some terms that can be associated with statistical publications to help categorize them. The OECD’s statistical glossary fills a similar role.
Secondly, the Statistical Data and Metadata EXchange (SDMX) initiative is also worthy of a look. It’s not RDF but, as well as defining XML Schemas and web services for exchanging statistical data, the guidelines include lists of cross-domain concepts and their mappings to those in use by EuroStat, OECD, IMF, etc. So there’s plenty of scope for grounding RDF vocabularies for statistical data in a lot of prior art.
Finally, the OECD have some public documentation about the design and implementation of their “MetaStore” database that supports OECD.stat (it’s a different beast to the Ingenta MetaStore, I should point out). For example, the document “Management of Statistical Metadata at the OECD” (PDF) has some interesting detail about the different types of metadata (structural, technical, publishing) that are stored in these multi-dimensional data cubes.