I’ve been meaning to have a play with the POI API for some time now. So, when a colleague mentioned how easy it is to work with, I decided it was high time I had a look. Whilst thinking of a suitable utility it occurred to me that Office documents have metadata stored in them (see the File -> Properties dialog), and so I wondered whether it would be possible to extract this data as RDF.
The result is MORE (Microsoft Office RDF Extractor).
The tool is a simple command-line utility that generates an RDF document from one or more Office documents. Access to the embedded properties is made possible by the POI HPSF API, while the RDF manipulations are performed by Jena. So you’ll need these classes in your CLASSPATH before running the application.
Download MORE 0.1
The command-line is simple:
java com.ldodds.more.MORE -help
…will get you a usage message describing the available options. To summarise, it’s possible to extract RDF from several documents in one go, add RDF statements to an existing RDF document, and dump the results to a file rather than the console, which is the default.
The key part of MORE is the “mapping schema”. This is a concept that I’ve borrowed (read: “stolen”) from Norman Walsh’s jpegrdf utility, which I’ve also been tinkering with lately. A mapping schema is basically just an RDF Schema that contains a number of rdf:Property elements. Each of these properties is annotated by a more:pidString property as follows:
<rdf:Property rdf:about="http://purl.org/dc/elements/1.1/title">
<rdfs:label>Title</rdfs:label>
<more:pidString>PID_TITLE</more:pidString>
</rdf:Property>
Here’s a complete example schema.
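To make the shape of a mapping schema concrete, here is a minimal sketch of reading one into a lookup table from pidString to RDF property URIs. It is not MORE’s actual code: MORE uses Jena for its RDF handling, whereas this sketch treats the schema as plain XML using only the JDK, which only works when the schema is written in the element form shown above. The class name, the method name and the more: namespace URI in the sample are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of MORE itself.
class MappingSchemaSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Tiny sample schema; the more: namespace URI below is made up.
    static final String SAMPLE_SCHEMA =
        "<rdf:RDF xmlns:rdf='" + RDF_NS + "'"
      + " xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#'"
      + " xmlns:more='http://example.org/more#'>"
      + "<rdf:Property rdf:about='http://purl.org/dc/elements/1.1/title'>"
      + "<rdfs:label>Title</rdfs:label>"
      + "<more:pidString>PID_TITLE</more:pidString>"
      + "</rdf:Property>"
      + "</rdf:RDF>";

    /** Returns pidString -> list of RDF property URIs, in document order. */
    static Map<String, List<String>> readMapping(String schemaXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(schemaXml)));

            Map<String, List<String>> mapping = new LinkedHashMap<>();
            NodeList properties = doc.getElementsByTagNameNS(RDF_NS, "Property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String uri = property.getAttributeNS(RDF_NS, "about");
                // Match pidString by local name only, because the real
                // more: namespace URI is not known here.
                NodeList pids = property.getElementsByTagNameNS("*", "pidString");
                for (int j = 0; j < pids.getLength(); j++) {
                    String pid = pids.item(j).getTextContent().trim();
                    mapping.computeIfAbsent(pid, k -> new ArrayList<>()).add(uri);
                }
            }
            return mapping;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read mapping schema", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readMapping(SAMPLE_SCHEMA));
    }
}
```

The table maps each pidString to a list rather than a single URI so that one item of metadata can feed more than one RDF property.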
Office documents store their metadata as name-value pairs. The property names are either “built-in”, in which case they all start with the prefix “PID_”, or are defined by the user in the Custom tab of the File -> Properties dialog in the application (actually I’m glossing over a lot of details here, see the HPSF internals document for the ugly truth; HPSF makes things easy to handle). The pidString properties in the mapping schema are therefore just the names of metadata elements stored in a Word, Excel or PowerPoint document.
Upon encountering an item of metadata, MORE examines its mapping schema to determine which RDF properties it should add to the resulting RDF. The example mapping schema in the download shows how to create both Dublin Core and custom RDF properties. If an item of metadata doesn’t have an entry in the mapping schema then it’s just discarded, making it very easy to customise the tool to produce the output you desire. Also, if a property value starts with “http” or “mailto” then the value is written as a resource reference (an rdf:resource attribute) rather than a literal.
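The rules in the previous paragraph can be sketched in a few lines of plain Java. This is an illustration of the logic as described, not the code that ships with MORE: the class, the Statement type and the “Homepage” custom property are hypothetical, and the real tool builds a Jena model rather than a list of simple objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the mapping rules; not MORE's own classes.
class MetadataMapperSketch {

    /** One property of the document; 'resource' marks URI-valued objects. */
    static final class Statement {
        final String property;
        final String value;
        final boolean resource;

        Statement(String property, String value, boolean resource) {
            this.property = property;
            this.value = value;
            this.resource = resource;
        }
    }

    static List<Statement> map(Map<String, String> metadata,
                               Map<String, List<String>> mapping) {
        List<Statement> statements = new ArrayList<>();
        for (Map.Entry<String, String> item : metadata.entrySet()) {
            List<String> properties = mapping.get(item.getKey());
            if (properties == null) {
                continue; // no entry in the mapping schema: discard
            }
            String value = item.getValue();
            // Same prefix test as described in the text.
            boolean resource = value.startsWith("http") || value.startsWith("mailto");
            for (String property : properties) {
                statements.add(new Statement(property, value, resource));
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping = new LinkedHashMap<>();
        mapping.put("PID_TITLE", Arrays.asList("http://purl.org/dc/elements/1.1/title"));
        // "Homepage" stands in for a user-defined custom property.
        mapping.put("Homepage", Arrays.asList("http://xmlns.com/foaf/0.1/homepage"));

        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("PID_TITLE", "Quarterly report");
        metadata.put("Homepage", "http://example.org/");
        metadata.put("PID_COMMENTS", "unmapped, so dropped");

        for (Statement s : map(metadata, mapping)) {
            System.out.println(s.property + " -> "
                    + (s.resource ? "<" + s.value + ">" : "\"" + s.value + "\""));
        }
    }
}
```

Because anything without a mapping entry is skipped, trimming the schema is all it takes to trim the output.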
Feedback is very welcome, particularly if it doesn’t work for you or there are bugs! (One thing I’m not sure about is how best to assign a URI to each document resource. I’ve defaulted to just using the file name, because that’s what jpegrdf does, and if it’s good enough for Norm…)
While I’ve no firm plans to extend this tool further (for me it’s just another step down the road in learning various RDF tools and technologies), I may add sensible new features if suggested. However, I consider the code to be Public Domain (it’s pretty trivial after all), so feel free to do with it what you will.
Distantly related: http://www.computerbytesman.com/privacy/blair.htm
“Microsoft Word documents are notorious for containing private information in file headers which people would sometimes rather not share. The British government of Tony Blair just learned this lesson the hard way.”
…can you extract revision history metadata with MORE?
Revision history information, other than “date last modified”, doesn’t seem to be available through POI. Or at least I don’t see anything in the javadoc or documentation anyway.
What use case were you thinking of?
MORE: Microsoft Office RDF Extractor
Leigh Dodds [1] has developed a small Java program [2] that can extract the metadata from Office documents (Word, Excel, …). For this it uses…