Next Generation Business Intelligence at ISWC 2008

This post was originally published on the Talis “Nodalities” blog.

The second pre-conference session I attended at ISWC 2008 was a tutorial session on “Knowledge Representation and Extraction for Business Intelligence”.

I attended the session as I was curious to learn about more applied uses of Semantic Web technology particularly in the financial and business context. In terms of content the tutorial veered wildly from overview material through to some quite detailed looks at linguistic and semantic analysis to extract information from business reports. To that end I’m not going to attempt to summarize the full content of the tutorial but will pick out a few areas of interest.

Some time was spent looking at XBRL, the standard business reporting language which is being increasingly adopted around the world as a standard means to publish and share business reports. The initiative, which began in 1999, was extended this year to include a European XBRL consortium. The broad goal of the project is to standardize the means and structure of publishing business financial reports, making it easier to compare and collate reports for regulatory and other purposes. The current financial crisis was referenced as an illustration of the need for greater transparency in business reporting, and is an obvious driver for adoption of the technology.

XBRL draws on many of the same concepts as the Semantic Web, in particular the use of “taxonomies” that can be customized by specific businesses, sectors and regulatory areas, but uses XML technologies like XML Schema rather than RDF. There is growing interest in being able to capture this information using RDF, and in mapping XBRL taxonomies into Semantic Web ontologies. For example there has been some early work on an XBRL ontology, as well as some independent exploration and signs that a W3C incubator or interest group might be formed. The speaker at the tutorial also suggested that before long some standard GRDDL connectors would be available to automate the transformation of XBRL documents into RDF.
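To make the XBRL-to-RDF idea a little more concrete, here’s a rough sketch of the kind of transformation such a GRDDL connector might perform. The element names, namespace and output shape below are all invented for illustration; real XBRL instances and taxonomies are considerably richer (contexts, units and dimensions are ignored here):

```python
import xml.etree.ElementTree as ET

# A toy XBRL instance fragment. The taxonomy namespace and fact names
# are made up for this sketch; they are not from any real XBRL taxonomy.
XBRL = """<xbrl xmlns:us="http://example.org/taxonomy#">
  <us:Revenue contextRef="FY2008" unitRef="USD">1500000</us:Revenue>
  <us:NetIncome contextRef="FY2008" unitRef="USD">250000</us:NetIncome>
</xbrl>"""

def xbrl_facts_to_ntriples(xml_text, company_uri):
    """Turn each reported fact into a simple N-Triples line.

    The taxonomy concept becomes the predicate; contextRef/unitRef
    are ignored in this sketch.
    """
    triples = []
    for fact in ET.fromstring(xml_text):
        # ElementTree tags look like "{namespace}LocalName";
        # reuse the pair directly as a predicate URI.
        ns, local = fact.tag[1:].split("}")
        triples.append(f'<{company_uri}> <{ns}{local}> "{fact.text.strip()}" .')
    return triples

for line in xbrl_facts_to_ntriples(XBRL, "http://example.org/company/acme"):
    print(line)
```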

Much of the tutorial discussed applied uses of RDF data and ontologies within the context of the Musing Project, an EU-funded project exploring “next-generation business intelligence” in the areas of financial risk management, internationalisation and IT operational risk. Applications explored so far include: collecting company information from a range of multilingual sources; attempting to assess the chances of success of a business in a specific region; semi-automated form filling, e.g. for returns; identifying appropriate business partners; and reputation tracking and opinion mining.

Many of the issues faced in the Musing project deal with how to assemble this data with a historical context: while XBRL data may be available for current or recent years, text mining is required to extract the same data from historical reports. The last part of the tutorial was a general introduction to Information Extraction using the GATE toolkit (this starts from around slide 75 in the PowerPoint slides). This was a good overview of the capabilities of the toolkit and showed some nice use cases. OpenCalais certainly isn’t the only game in town: while GATE requires more effort to set up, it looks like it could provide a great deal more customisation for businesses that really need the extra power.

One of the telling things about the overall process was the need to collate useful data from a number of different sources in order to drive the information extraction process. In order to do Named Entity Extraction a good set of reference material is required, e.g. gazetteers of place names, or lists of people’s names. While much of this data is already available — in Musing they drew on Wikipedia and the CIA World Factbook, for example — a lot more information was available only by crawling the web or from commercial resources. This suggests to me that there’s still some groundwork to be done in unlocking more datasets that can help drive the business intelligence use cases. There’s essentially a domino effect here: exposing often small, focused datasets can end up unlocking huge potential value further down the line.
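As a toy illustration of why gazetteers matter, here’s the kind of dictionary-lookup tagging that sits at the bottom of a Named Entity Extraction pipeline. The gazetteer entries below are stand-ins for the Wikipedia and CIA World Factbook data mentioned above; a real system would handle overlaps, tokenisation and ambiguity:

```python
# A minimal gazetteer: surface form -> entity type.
# These entries are invented examples, not a real dataset.
GAZETTEER = {
    "Germany": "PLACE",
    "Karlsruhe": "PLACE",
    "Jim Hendler": "PERSON",
}

def tag_entities(text):
    """Return (entity, type, offset) for every gazetteer term found in text."""
    hits = []
    for term, etype in GAZETTEER.items():
        start = text.find(term)
        while start != -1:
            hits.append((term, etype, start))
            start = text.find(term, start + 1)
    # Report matches in document order.
    return sorted(hits, key=lambda h: h[2])

print(tag_entities("Jim Hendler spoke at ISWC 2008 in Karlsruhe, Germany."))
```

Even something this naive shows the dependency: without the reference lists there is nothing to match against, which is why unlocking those datasets matters.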

Cross Pollination

I binged on TED talks whilst travelling over to the ISWC 2008 conference. One of those that I enjoyed the most was “Design and the elastic mind” by Paola Antonelli. Who doesn’t get a kick out of seeing some great design concepts?
One item that caught my attention was Antonelli’s reference to a regular “salon” that brought together designers and scientists in order to explore common ground and share ideas.
As the power of what is possible on the web increases, it strikes me that we need a bit more of this kind of cross-pollination between development and design, in order to encourage some more lateral thinking, a fuller exploration of the potential, and maybe to kick us all in some new directions.
Looks like I’m not the only one thinking this: Tim Bray is encouraging folk to branch out and Ian Dickinson wants to be a “devsigner” when he grows up.
I think this is particularly true in the Semantic Web space. I’ve yet to see a really striking semantic web application that isn’t essentially a clone of an existing service, or one that really does justice to the data. Are there exciting, challenging, or innovative user interfaces that I’ve missed? Parallax is great, but what else is there? What needs to happen to encourage more innovation?
I can remember a couple of years back when all of a sudden there were information architects and interaction designers at conferences like XTech, when it became clear that there were a lot of synergies between open data publishing and good (website) design. How long before this happens at Semantic Web conferences? There are a couple of papers on this topic at ISWC, and a workshop next year. But what else can we do? How do we foster some good cross-pollination?

Jim Hendler at the INSEMTIVE 2008 Workshop

This post was originally published on the Talis “Nodalities” blog.

Along with a number of my colleagues, I’m currently attending the ISWC 2008 conference in Karlsruhe, Germany. Yesterday I attended the INSEMTIVE workshop (“Incentives for the Semantic Web”), which aimed to explore incentives for the creation of semantic web content, i.e. to encourage the creation of more structured metadata. The workshop papers are available to browse online or you can download the complete proceedings. There was a real mix of papers, covering specific issues such as extraction of semantics from tagging, and identifying the information needs of a community by analysing search patterns, through to position papers that attempted to highlight shortcomings in current semantic web applications that deter people from creating metadata.

I found the position papers most interesting if only because they provided confirmation of something that I’ve been thinking for a while now: that people will (and do) create metadata when there are obvious and immediate benefits in them doing so. No-one really consciously sits down to share or create metadata: they sit down to do a specific task and metadata drops out as a side-effect. For me this makes much of the problem highlighted by the workshop one of interaction design: how do we build good task-oriented user interfaces that encourage the creation of semantic web metadata, and how can we illustrate the benefits of semantic web technologies in an incremental fashion? In my opinion solving this will require close collaboration between semantic web researchers and developers, and interaction designers.

The end of the workshop was a discussion session chaired by Jim Hendler. Hendler chose to do a retrospective of some older presentations to explore how thinking has evolved (or not!) with respect to drivers towards the development of the semantic web.

Starting in 1999, Hendler showed some slides from DAML strategy talks that emphasised the need for a number of different areas to align before a real marketplace can be created for semantic web content and applications. These areas were tools, users, and languages (e.g. OWL, etc). Hendler noted that the Semantic Web community had mistakenly focused too heavily on languages and not enough on the other areas. He also thought that “Web 2.0” had focused primarily on the users, to a lesser extent on the tools, and very little on the language aspects. Hendler thought that this alignment was now taking place.

Moving forward in time to some slides from 2001-2002, Hendler introduced the idea that the development of the web itself will “force” the evolution of the semantic web, i.e. that internal pressures, such as the need to better manage and extract value from the massive amounts of online information, will require the semantic web to solve specific problems. Hendler observed that the web has demonstrated that people will do more work to share information with others than they will to help themselves; left purely to their own needs, people are lazy. When people want to, need to, or are rewarded for sharing information and content, they will work much harder than they would to manage and organize information purely for their own use. Hendler noted that there is a tendency to say “we’ll solve the data creation problem at the individual level, as solving it at the group level is harder to manage”, but a look at web history illustrates that the opposite is in fact the case.

Hendler also shared what he thought was the best piece of advice he’d been given by Tim Berners-Lee: start small but viral and you can change many things. Hendler’s slides characterized this as: “My friend sees it, wants one; My competitor sees it, needs one”.

Looking at slides from 2002, Hendler introduced the “Value proposition” supporting the creation of semantic web data & content, i.e. that there has to be some immediate return on the investment in creating metadata.

Hendler finished his retrospective with a slide from a 2008 talk showing the range of commercial companies, government projects and vertical sectors that are now heavily engaged in the Semantic Web (I was happy to see Talis mentioned in the list!). In Hendler’s opinion there is a growing excitement that the “next big thing” is going to come from the Semantic Web; not a “Google Killer”, but the next big revolutionary idea or service. The incentive here is the obvious one: money.

Hendler noted that there is a huge amount of data out there and that finding anything in the mess can be a win. So even a little semantics can make a difference here and could provide some competitive advantages. We don’t need perfect answers or solutions, just incremental improvements on what we have now.

I was also happy to see Hendler encourage researchers to “compete in the real world”. He noted that they have to work within the context of a real world that is moving very fast, and that they can’t really compete with the resources of commercial firms in creating semantic web applications and demonstrators; instead they should try to work within that context to demonstrate real value from the technology. Hendler encouraged them to focus on issues of scalability. Does the fundamental technology scale? Do the concepts and ideas scale to a real user base? As an illustration, Hendler noted that he was working with a number of companies that were using some simple OWL constructs to add semantics to applications, but that none of them were using a formal reasoner: just “little pieces of procedural code that scale really well”.
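To illustrate what those “little pieces of procedural code” might look like in place of a formal reasoner, here’s a sketch that computes subclass entailments with a simple loop. The class hierarchy is invented for the example, and this obviously handles only a single-parent rdfs:subClassOf chain, nothing more:

```python
# An invented class hierarchy: subclass -> direct superclass.
SUBCLASS_OF = {
    "SavingsAccount": "Account",
    "Account": "FinancialProduct",
    "Mortgage": "FinancialProduct",
}

def ancestors(cls):
    """All superclasses of cls, following subClassOf links transitively."""
    result = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        result.append(cls)
    return result

def is_a(cls, super_cls):
    """Procedural stand-in for a reasoner's subsumption check."""
    return cls == super_cls or super_cls in ancestors(cls)

print(ancestors("SavingsAccount"))
print(is_a("SavingsAccount", "FinancialProduct"))
```

A few lines like this cover the entailments an application actually needs, which is presumably why the companies Hendler mentioned found they scaled so well.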

Overall, an interesting workshop!

Paul Miller did a podcast with Jim Hendler back in March if you want to hear more about his thoughts on the Semantic Web.

Explaining REST and Hypertext: Spam-E the Spam Cleaning Robot

I’m going to add to Sam Ruby’s amusement and throw in my attempt to explicate some of Roy Fielding’s recent discussion of what makes an API RESTful. If you’ve not read the post and all the comments then I encourage you to do so: there’s some great tidbits in there that have certainly given me pause for thought.

The following attempts to illustrate my understanding of REST. Perhaps bizarrely, I’ve chosen to focus more on the client than on the design of the server, e.g. what resources it exposes, etc. This is because I don’t think enough focus has been placed on the client, particularly when it comes to the hypermedia constraint. And I think that often, when we focus on how to design an “API”, we’re glossing over some important aspects of the REST architecture, which, after all, encompasses other types of actors: both clients and intermediaries.

I’ve also deliberately chosen not to draw much on existing specifications; again, it’s too easy to muddy the waters with irrelevant details.

Anyway, I’m well prepared to stand corrected on any or all of the below. I’ll be interested to hear if anyone has any comments.

Let’s imagine there are two MIME types.

The first is called application/x-wiki-description. It defines a JSON format that describes the basic structure of a wiki website. The format includes a mixture of simple data items, URIs and URI templates that collectively describe:

  • the name of the wiki
  • the email address of the administrator
  • a link to the Recent Changes resource
  • a link to the Main page
  • a link to the license statement
  • a link to the search page (as a URI template, that may include a search term)
  • a link to parameterized RSS feed (as a URI template that may include a date)
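To make that concrete, here’s one way such a description document might look, along with a naive expansion of its URI templates. Every field name and URL below is invented for illustration; this is not a real format:

```python
import json
import re

# One possible application/x-wiki-description document.
# All field names and URLs are made up for this sketch.
DESCRIPTION = json.loads("""{
  "name": "Example Wiki",
  "admin": "mailto:admin@example.org",
  "recentChanges": "http://wiki.example.org/recent",
  "mainPage": "http://wiki.example.org/Main_Page",
  "license": "http://wiki.example.org/license",
  "search": "http://wiki.example.org/search?q={term}",
  "feed": "http://wiki.example.org/feed?since={date}"
}""")

def expand(template, **params):
    """Naive URI-template expansion: replace each {name} with its value."""
    return re.sub(r"\{(\w+)\}", lambda m: params[m.group(1)], template)

print(expand(DESCRIPTION["search"], term="spam"))
```

The point of the templates is that the client constructs URLs only from what the document hands it, never from baked-in knowledge of the site’s URL structure.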

Another MIME type is application/x-wiki-page-versions. This is another JSON-based format, describing the version history of a wiki page as an ordered collection of links. Each resource in the list is a prior version of the wiki page; the most recent version is first.

Spam-E is a little web robot that has been programmed with the smarts to understand several MIME types:

  • application/x-wiki-description
  • application/x-wiki-page-versions
  • RSS and Atom

Spam-E also understands a profile of XHTML that defines two elements: one that points to a resource capable of serving wiki descriptions, and another that points to a resource that can return wiki page version descriptions.

Spam-E has internal logic designed to detect spam in XHTML pages. It also has a fully functioning HTTP client, and has been programmed with logic appropriate to processing those specific media types.

Initially, when started, Spam-E does nothing. It waits to receive a link, e.g. via a simple user interface. It’s in a steady state, waiting for input.

Spam-E then receives a link. The robot immediately dereferences it, submitting a GET request to the URL with an Accept header:

Accept: application/x-wiki-description;q=1.0, application/x-wiki-page-versions;q=0.9, application/xhtml+xml;q=0.8, application/atom+xml;q=0.5, application/rss+xml;q=0.4

This clearly states Spam-E’s preference for receiving specific MIME types.
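On the server side, content negotiation against that header boils down to picking the highest-q type that is actually available. A rough sketch (ignoring wildcards, media type parameters and tie-breaking rules that real Accept handling would need):

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        q = 1.0  # per HTTP, a missing q parameter defaults to 1.0
        for f in fields[1:]:
            if f.startswith("q="):
                q = float(f[2:])
        prefs.append((fields[0], q))
    return sorted(prefs, key=lambda p: -p[1])

def best_match(accept_header, available):
    """Pick the available representation the client prefers most."""
    for media_type, _ in parse_accept(accept_header):
        if media_type in available:
            return media_type
    return None

ACCEPT = ("application/x-wiki-description;q=1.0, "
          "application/x-wiki-page-versions;q=0.9, "
          "application/xhtml+xml;q=0.8, "
          "application/atom+xml;q=0.5, "
          "application/rss+xml;q=0.4")

# A server with only XHTML and RSS on offer serves XHTML,
# which is exactly what happens to Spam-E next.
print(best_match(ACCEPT, {"application/xhtml+xml", "application/rss+xml"}))
```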

In this instance it receives an XHTML document in return. Not ideal, but Spam-E knows how to handle it. After parsing, it turns out that this is not a profile of XHTML that Spam-E understands, so it simply extracts all the anchor elements from the file and uses them to widen its search for wiki spam. Another way to say this is that Spam-E has changed its state to one of searching. This state transition has been triggered by following a link, then receiving and processing a specific MIME type. This is “hypermedia as the engine of application state” in action.

Spam-E performs this dereference-parse-traverse operation several times before finding an XHTML document that conforms to the profile it understands. That document contains a link to a resource that should be capable of serving a wiki description representation.

Spam-E is now in discovery mode. Spam-E uses an Accept header of application/x-wiki-description when following the link and is returned a matching representation. Spam-E parses the JSON and now has additional information at its disposal: it knows how to search the wiki, how to find the RSS feed, how to contact the wiki administrator, etc.

Spam-E now enters Spam Detection mode. It requests, with a suitable Accept header, the recent changes resource, stating a preference for Atom documents. It instead gets an RSS feed, but that’s fine because Spam-E still knows how to process that. For each entry in the feed, Spam-E requests the wiki page, using an Accept header of application/xhtml+xml.

Spam-E now checks whether there is spam on the page by applying its local spam-detection logic. In this instance Spam-E discovers some spam on the page. It checks the XHTML document it was returned and discovers that it conforms to a known profile, and that embedded in a link element is a reference to the “versions” resource. Spam-E dereferences this link using an Accept header of application/x-wiki-page-versions.

Spam-E, now in Spam Cleaning mode, fetches each version in turn and performs spam detection on it. If spam is found, Spam-E performs a DELETE request on the version’s URI, removing that version of the wiki page from the wiki. Someone browsing the original URI of the page will then see an earlier, spam-free version.
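The Spam Cleaning loop can be sketched as follows, with the HTTP operations injected as functions so the control flow is visible without a network. The URIs, page bodies and spam test below are toy stand-ins:

```python
def clean_page_versions(version_uris, fetch, is_spam, delete):
    """DELETE every version whose representation looks like spam.

    version_uris -- most-recent-first list from application/x-wiki-page-versions
    fetch(uri)   -- returns the XHTML body of a version (GET)
    is_spam(body)-- the robot's local spam-detection logic
    delete(uri)  -- issues an HTTP DELETE against the version's URI
    """
    removed = []
    for uri in version_uris:
        if is_spam(fetch(uri)):
            delete(uri)
            removed.append(uri)
    return removed

# Toy run: the two newest versions contain spam, the oldest is clean.
pages = {"/page?v=3": "buy pills", "/page?v=2": "cheap pills", "/page?v=1": "real content"}
deleted = []
print(clean_page_versions(
    list(pages), fetch=pages.get,
    is_spam=lambda body: "pills" in body,
    delete=deleted.append))
```

Note that the robot only ever DELETEs URIs it was handed by the versions document; it never constructs them itself.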

Once it has finished its cycle of spam detection and cleaning, Spam-E reverts to search mode until it runs out of new URIs.

There are several important points to underline here:

Firstly, at no point did the authors of Spam-E need any prior knowledge of the URL structure of any site the robot might visit. Spam-E was programmed only with logic relating to some defined media types (or extension points of a media type, in the case of the XHTML profiles) and the basic semantics of HTTP.

Secondly, no one had to publish any service description documents, or define any API end points. No one had to define what operations could be carried out on specific resources, or what response codes would be returned. All information was found by traversing links and by following the semantics of HTTP.

Thirdly, the Spam-E application basically went through a series of state transitions triggered by what media types it received when requesting certain URIs. The application is basically a simple state machine.
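That state machine can be sketched directly: the next state is a function of the media type just received, never of URL structure. The handler names and transitions below are my simplification of the story above:

```python
# Media type of the response just received -> the mode Spam-E moves into.
# These names and transitions are a simplification for illustration.
HANDLERS = {
    "application/x-wiki-description": "discovery",
    "application/x-wiki-page-versions": "spam-cleaning",
    "application/xhtml+xml": "searching",
    "application/rss+xml": "spam-detection",
    "application/atom+xml": "spam-detection",
}

def next_state(current_state, content_type):
    """Transition on the Content-Type of the response; unknown types leave the state unchanged."""
    return HANDLERS.get(content_type, current_state)

# Replay the walkthrough: XHTML, then a wiki description,
# then a feed, then a versions document.
state = "waiting"
for content_type in ["application/xhtml+xml", "application/x-wiki-description",
                     "application/rss+xml", "application/x-wiki-page-versions"]:
    state = next_state(state, content_type)
    print(state)
```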

Anyway, hopefully that is a useful example. Again, I’m very happy to take feedback. Comments are disabled on this blog, but feel free to drop me a mail (see the Feedback link).