As I’ve alluded to in the past, we’ve been exploring moving our content repository over to an RDF triple store.
It’s turning out to be pretty massive. We’ve learnt a few things along the way, and no doubt have much more to learn as we continue with the project, so it seemed worthwhile submitting a conference paper to share the experience. Here’s what Priya, Katie, and I have just submitted to ISWC2005:
The IngentaConnect website contains metadata from 17 million articles sourced from 20 thousand publications. The aim of the Metastore project is to build a flexible and scalable repository for the storage of this bibliographic metadata. The repository will replace several existing data stores and will act as a focal point for integration of a number of existing applications and future projects. Scalability, replication and robustness were important considerations in the repository design.
After introducing the benefits of using RDF as the data model for this repository, the paper will describe the practical challenges involved in creating and managing a very large triple store. The repository contains over 200 million triples from a range of vocabularies including Dublin Core and PRISM. To our knowledge, this is the largest triple store of its type in existence.
The challenges faced include schema design, initial data loading, query performance, and integration of the repository into existing applications.
The paper will introduce the solutions developed to meet these challenges with the goal of helping others looking to deploy a triple store within a commercial environment. The paper will also suggest some avenues for further research and development.
Won’t hear whether the submission will be accepted until July, but expect to read more here over the coming months.