This post was originally published on the Talis “Nodalities” blog.
The second pre-conference session I attended at ISWC 2008 was a tutorial session on “Knowledge Representation and Extraction for Business Intelligence”.
I attended the session as I was curious to learn more about applied uses of Semantic Web technology, particularly in financial and business contexts. In terms of content, the tutorial veered from overview material through to some quite detailed looks at the linguistic and semantic analysis used to extract information from business reports. For that reason I’m not going to attempt to summarize the full content of the tutorial, but will pick out a few areas of interest.
Some time was spent looking at XBRL, the standard business reporting language, which is increasingly being adopted around the world as a standard means to publish and share business reports. The initiative, which began in 1999, was extended this year to include a European XBRL consortium. The broad goal of the project is to standardize the means and structure of publishing business financial reports, making it easier to compare and collate reports for regulatory and other purposes. The current financial crisis was referenced as an illustration of the need for greater transparency in business reporting, and is an obvious driver for adoption of the technology.
XBRL draws on many of the same concepts as the Semantic Web, in particular the use of “taxonomies” that can be customized by specific businesses, sectors and regulatory areas, but uses XML technologies like XML Schema rather than RDF. There is growing interest in being able to capture this information using RDF and in mapping XBRL taxonomies into Semantic Web ontologies. For example there has been some early work on an XBRL ontology, as well as some independent exploration, and signs that a W3C incubator or interest group might be formed. The speaker at the tutorial also suggested that before long some standard GRDDL connectors would be available to automate the transformation of XBRL documents into RDF.
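To make the XBRL-to-RDF idea a little more concrete, here is a minimal sketch of lifting a fact from an XBRL-like document into RDF triples. Note this is plain Python rather than the GRDDL/XSLT machinery the speaker had in mind, and the taxonomy namespace, element names and report URI are all invented for the example:

```python
# Illustrative only: extract leaf "facts" from an XBRL-like XML snippet
# and emit them as N-Triples. The namespace and element names below are
# made up; real XBRL taxonomies and ontology mappings are far richer.
import xml.etree.ElementTree as ET

XBRL_SNIPPET = """
<xbrl xmlns:ex="http://example.org/taxonomy#">
  <ex:Revenue contextRef="FY2008" unitRef="EUR" decimals="0">1500000</ex:Revenue>
</xbrl>
"""

def facts_to_ntriples(xml_text, report_uri="http://example.org/report/1"):
    """Turn each child element of the root into one subject/predicate/object triple."""
    root = ET.fromstring(xml_text)
    triples = []
    for fact in root:
        # ElementTree expands namespaced tags to "{namespace}localname"
        ns, local = fact.tag[1:].split("}")
        predicate = ns + local
        triples.append('<%s> <%s> "%s" .' % (report_uri, predicate, fact.text.strip()))
    return triples

for line in facts_to_ntriples(XBRL_SNIPPET):
    print(line)
```

A real GRDDL connector would instead link an XSLT transformation from the XBRL document itself, so that any GRDDL-aware agent could recover the RDF automatically, but the underlying mapping step looks much like this.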
Much of the tutorial was given over to discussion of applied uses of RDF data and ontologies within the context of the Musing Project, an EU-funded project exploring “next-generation business intelligence” in the areas of financial risk management, internationalisation and IT operational risk. Applications explored so far include collecting company information from a range of multilingual sources; attempting to assess the chances of success of a business in a specific region; semi-automated form filling, e.g. for returns; identifying appropriate business partners; and reputation tracking and opinion mining.
Many of the issues faced in the Musing project concern how to assemble this data with a historical context: while XBRL data may be available for current or recent years, text mining is required to extract the same data from historical reports. The last part of the tutorial was a general introduction to Information Extraction using the GATE toolkit (starting from around slide 75 in the PowerPoint slides). This was a good overview of the capabilities of the toolkit and showed some nice use cases. OpenCalais certainly isn’t the only game in town; while GATE requires more effort to set up, it looks like it could provide a great many more customisation options for businesses that really need the extra power.
One of the telling things about the overall process was the need to collate useful data from a number of different sources in order to drive the information extraction process. To do Named Entity Extraction a good set of reference material is required, e.g. gazetteers of place names, or lists of people’s names. While much of this data is already available — in Musing they drew on Wikipedia and the CIA World Factbook, for example — a lot more information was available only by crawling the web or from commercial resources. This suggests to me that there’s still some groundwork to be done in unlocking more data sets that can help drive the business intelligence use cases. There’s essentially a domino effect here: exposing often small, focused datasets can end up unlocking huge potential value further down the line.
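The gazetteer lookup step at the heart of this can be sketched in a few lines. The entries and labels below are toy stand-ins for the real resources (Wikipedia, the CIA World Factbook) mentioned above, and this is a simplification of what a toolkit like GATE actually does:

```python
# A toy gazetteer-based named-entity matcher: scan a token stream and tag
# any span that appears in the reference lists, preferring longer matches.
# The gazetteer contents here are invented for illustration.
GAZETTEER = {
    ("New", "York"): "Location",
    ("Karlsruhe",): "Location",
    ("CIA", "World", "Factbook"): "Source",
}

def annotate(tokens, gazetteer=GAZETTEER):
    """Greedy longest-match lookup of token spans against the gazetteer."""
    longest = max(len(entry) for entry in gazetteer)
    entities, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + size])
            if span in gazetteer:
                entities.append((" ".join(span), gazetteer[span]))
                i += size
                break
        else:
            i += 1
    return entities

print(annotate("The report covers New York and Karlsruhe".split()))
```

This also makes clear why the breadth and quality of the underlying reference data matters so much: the matcher can only ever find what its gazetteers already name.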