Bayesian Agents

Classifier4J is a Java text classification library that includes a text summariser and a Bayesian classifier. It was my interest in the latter that led me to play with the API recently, as I wanted to demonstrate to some colleagues the ease with which one can use Bayesian classification to create a content filter/recommender. Well, it’s easy if all the hard work is done for you in a library!

The Classifier4J API is very easy to use, and you can plug a Bayesian classifier into an application with very few lines of code.

One of the things that intrigued me about the API design was that it separates out the Classifier from the storage of the words and their probabilities. The API comes with a simple in-memory implementation and a JDBC Words Data Source which stores the data in a database table.
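To make that separation concrete, here is a minimal sketch of the idea. The names `WordStore`, `InMemoryWordStore` and `SketchClassifier` are made up for illustration and are not Classifier4J's actual API, and the scoring is a plain naive Bayes combination that I am assuming rather than copying from the library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical storage interface: the classifier only ever talks to this.
interface WordStore {
    void addMatch(String word);
    void addNonMatch(String word);
    int matchCount(String word);
    int nonMatchCount(String word);
}

// Simplest possible backing store: counts held in a map.
class InMemoryWordStore implements WordStore {
    // word -> {matchCount, nonMatchCount}
    private final Map<String, int[]> counts = new HashMap<>();

    private int[] entry(String word) {
        return counts.computeIfAbsent(word, w -> new int[2]);
    }

    public void addMatch(String word) { entry(word)[0]++; }
    public void addNonMatch(String word) { entry(word)[1]++; }
    public int matchCount(String word) {
        return counts.containsKey(word) ? counts.get(word)[0] : 0;
    }
    public int nonMatchCount(String word) {
        return counts.containsKey(word) ? counts.get(word)[1] : 0;
    }
}

class SketchClassifier {
    private final WordStore store;

    SketchClassifier(WordStore store) { this.store = store; }

    void teachMatch(String text) {
        for (String w : tokens(text)) store.addMatch(w);
    }

    void teachNonMatch(String text) {
        for (String w : tokens(text)) store.addNonMatch(w);
    }

    // Probability that one word indicates a match, clamped away from 0 and 1.
    double wordProbability(String word) {
        int m = store.matchCount(word);
        int n = store.nonMatchCount(word);
        if (m + n == 0) return 0.5; // unseen word: neutral
        return Math.min(0.99, Math.max(0.01, (double) m / (m + n)));
    }

    // Naive Bayes combination prod(p) / (prod(p) + prod(1 - p)), in log space.
    double classify(String text) {
        double logP = 0.0;
        double logQ = 0.0;
        for (String w : tokens(text)) {
            double p = wordProbability(w);
            logP += Math.log(p);
            logQ += Math.log(1.0 - p);
        }
        return 1.0 / (1.0 + Math.exp(logQ - logP));
    }

    // Lower-case the text and split on runs of non-word characters.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        SketchClassifier c = new SketchClassifier(new InMemoryWordStore());
        c.teachMatch("semantic web rdf");
        c.teachNonMatch("cheap pills offer");
        System.out.println(c.classify("rdf on the web"));
    }
}
```

Because the classifier depends only on the interface, swapping the in-memory map for a database table (or an RDF store) should not require touching the classification code.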

It occurred to me that it’d be an interesting experiment to create an implementation of the data source interface that stored the data as RDF.

Why RDF? Because then we’d have the ability to share and aggregate the results of training classifiers.

For example, I could export and share a classifier trained to spot spam, semantic web topics, or any number of other categories. The classifiers could be imported into desktop applications (e.g. Thunderbird) as well as web applications. I might, say, train a classifier to spot articles that I’m interested in, then upload that configuration into a content management system and have it mine its data for material I may be interested in: hence “Bayesian agents”.

By tying my exported Bayesian probabilities to my FOAF file, an aggregator may merge my data with that of others known to share similar interests. Trust is another aspect that may affect whether my data is shared.
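As a rough sketch of what such a merge might look like, assuming the aggregator simply sums the per-word match and non-match counts from each person and ignores trust entirely (`CountMerger` and the map layout are hypothetical, chosen only for illustration):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CountMerger {
    // Each source maps word -> {matchCount, nonMatchCount}.
    // Assumption: merging is a plain sum; no trust weighting is applied.
    static Map<String, int[]> merge(List<Map<String, int[]>> sources) {
        Map<String, int[]> merged = new TreeMap<>();
        for (Map<String, int[]> source : sources) {
            for (Map.Entry<String, int[]> e : source.entrySet()) {
                // Fresh array per word, so the input maps are never modified.
                int[] total = merged.computeIfAbsent(e.getKey(), k -> new int[2]);
                total[0] += e.getValue()[0];
                total[1] += e.getValue()[1];
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, int[]> mine = new HashMap<>();
        mine.put("rdf", new int[] {5, 1});
        Map<String, int[]> theirs = new HashMap<>();
        theirs.put("rdf", new int[] {2, 0});
        theirs.put("foaf", new int[] {3, 1});

        for (Map.Entry<String, int[]> e : merge(Arrays.asList(mine, theirs)).entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(e.getValue()));
        }
    }
}
```

A trust-aware aggregator could scale each source's counts by some weight before summing, but how that weight would be derived is an open question here.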

Anyone have any comments on this? Is anyone doing anything similar already? (They must be…)

I’ll try and hack something up when I get a few minutes.

For the RDF I was thinking of something like the following:


<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdfs:Class rdf:ID="WordProbability"/>
  <rdf:Property rdf:ID="classifier">
    <rdfs:domain rdf:resource="#WordProbability"/>
    <rdfs:range rdf:resource="http://xmlns.com/foaf/0.1/Agent"/>
  </rdf:Property>
  <rdf:Property rdf:ID="word">
    <rdfs:domain rdf:resource="#WordProbability"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
  </rdf:Property>
  <!-- classifier4j uses strings for categories, but URIs seem better -->
  <rdf:Property rdf:ID="category">
    <rdfs:domain rdf:resource="#WordProbability"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Resource"/>
  </rdf:Property>
  <!-- need to type these two... -->
  <rdf:Property rdf:ID="matchCount">
    <rdfs:domain rdf:resource="#WordProbability"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
  </rdf:Property>
  <rdf:Property rdf:ID="nonMatchCount">
    <rdfs:domain rdf:resource="#WordProbability"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
  </rdf:Property>
</rdf:RDF>
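
To give a feel for the data an exporter might emit against that vocabulary, here is a small sketch that writes one `WordProbability` as RDF/XML. The namespace `http://example.org/bayes#`, the `b:` prefix, the class name and the example URIs are all placeholders of mine, since the schema has no published home yet, and the counts are left as plain literals.

```java
class WordProbabilityRdf {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    // Placeholder namespace: the schema does not say where it would be published.
    static final String VOCAB_NS = "http://example.org/bayes#";

    // Escape the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Serialise a single word's counts as one WordProbability node.
    static String toRdfXml(String classifierUri, String categoryUri,
                           String word, int matchCount, int nonMatchCount) {
        StringBuilder sb = new StringBuilder();
        sb.append("<rdf:RDF xmlns:rdf=\"").append(RDF_NS).append("\"\n");
        sb.append("         xmlns:b=\"").append(VOCAB_NS).append("\">\n");
        sb.append("  <b:WordProbability>\n");
        sb.append("    <b:classifier rdf:resource=\"")
          .append(escape(classifierUri)).append("\"/>\n");
        sb.append("    <b:category rdf:resource=\"")
          .append(escape(categoryUri)).append("\"/>\n");
        sb.append("    <b:word>").append(escape(word)).append("</b:word>\n");
        sb.append("    <b:matchCount>").append(matchCount).append("</b:matchCount>\n");
        sb.append("    <b:nonMatchCount>").append(nonMatchCount).append("</b:nonMatchCount>\n");
        sb.append("  </b:WordProbability>\n");
        sb.append("</rdf:RDF>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical URIs for the person who trained the classifier and the category.
        System.out.print(toRdfXml(
                "http://example.org/people#me",
                "http://example.org/categories#semweb",
                "rdf", 12, 3));
    }
}
```

A real implementation would more likely build the graph with an RDF toolkit than concatenate strings; this is only meant to show the shape of the output.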

2 thoughts on “Bayesian Agents”

  1. Heh, I have a paper that never got written, and a domain that has yet to be filled: semtext.org. The general idea was very like you suggest, to make latent semantics blatant. Starting point I had in mind was a service that you gave a URI and it returned:
    <s:wordCount>234</s:wordCount>
    Still on the to-do list, time permitting etc etc. A related idea I’ve been wanting to play with for *years* is automatic classification using Kohonen’s self-organising maps, with the service returning things like similarity measures (again, everything in explicit RDF). But anyhow I’ll be watching closely what you get up to with Classifier4j…
    btw, you may also want to check out the data mining stuff “Weka”.
    http://dannyayers.com/archives/2003/09/25/proposal-for-etcon-2004/
