Classifier4J is a Java text classification library that includes a text summariser and a Bayesian classifier. It was my interest in the latter that led me to play with the API recently, as I wanted to demonstrate to some colleagues the ease with which one can use Bayesian classification to create a content filter/recommender. Well, it’s easy if all the hard work is done for you in a library!
The Classifier4J API is very easy to use, and you can plug a Bayesian classifier into an application with very few lines of code.
One of the things that intrigued me about the API design was that it separates out the Classifier from the storage of the words and their probabilities. The API comes with a simple in-memory implementation and a JDBC Words Data Source which stores the data in a database table.
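To make that separation concrete, here is a minimal sketch of a classifier that knows nothing about how its word counts are stored. The names (WordStore, InMemoryWordStore, SketchClassifier) are hypothetical and are not Classifier4J's actual interfaces; the clamping and the combining formula are common naive Bayes simplifications rather than a copy of what the library does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical storage abstraction (not Classifier4J's real interface). */
interface WordStore {
    void addMatch(String word);
    void addNonMatch(String word);
    /** Estimated probability that a text containing this word is a match. */
    double probability(String word);
}

/** In-memory store: per word, counts[0] = matches, counts[1] = non-matches. */
class InMemoryWordStore implements WordStore {
    private final Map<String, int[]> counts = new HashMap<>();

    @Override
    public void addMatch(String word) {
        counts.computeIfAbsent(word, w -> new int[2])[0]++;
    }

    @Override
    public void addNonMatch(String word) {
        counts.computeIfAbsent(word, w -> new int[2])[1]++;
    }

    @Override
    public double probability(String word) {
        int[] c = counts.get(word);
        if (c == null) {
            return 0.5; // unseen word: treat as neutral
        }
        double p = (double) c[0] / (c[0] + c[1]);
        // Clamp so a single word can never force the result to exactly 0 or 1.
        return Math.max(0.01, Math.min(0.99, p));
    }
}

/** The classifier only talks to the WordStore interface, never to a concrete store. */
class SketchClassifier {
    private final WordStore store;

    SketchClassifier(WordStore store) {
        this.store = store;
    }

    void teachMatch(String text) {
        tokens(text).forEach(store::addMatch);
    }

    void teachNonMatch(String text) {
        tokens(text).forEach(store::addNonMatch);
    }

    /** Combines word probabilities as prod(p) / (prod(p) + prod(1 - p)). */
    double classify(String text) {
        double match = 1.0;
        double nonMatch = 1.0;
        for (String word : tokens(text)) {
            double p = store.probability(word);
            match *= p;
            nonMatch *= 1.0 - p;
        }
        // Note: very long texts could underflow here; a real implementation
        // would limit the number of words considered or work with logarithms.
        return match / (match + nonMatch);
    }

    private static List<String> tokens(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }
}
```

Swapping in a database-backed or RDF-backed store would then only mean writing another WordStore implementation; the classifier itself would not change.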
It occurred to me that it’d be an interesting experiment to create an implementation of the data source interface that stored the data as RDF.
Why RDF? Because then we’d have the ability to share and aggregate the results of training classifiers.
For example, I could export and share a classifier trained to spot spam, semantic web topics, or any number of other categories. The classifiers could be imported into desktop applications (e.g. Thunderbird) as well as web applications. For example, I might train a classifier to spot articles that I’m interested in, and then upload that configuration into a content management system and have it mine that data for material I may be interested in, hence “Bayesian agents”.
By tying my exported Bayesian probabilities to my FOAF file, an aggregator may merge my data with that of others known to share similar interests. Trust is another aspect that may affect whether my data is shared.
Anyone have any comments on this? Is anyone doing anything similar already? (They must be…)
I’ll try and hack something up when I get a few minutes.
For the RDF I was thinking of something like the following:
Via Gavin (via the chumpologica): An application architecture that should yield superior productivity.
Interesting stuff. I’ve been pondering something similar myself, mainly because I have a slice of an application I’m working on that I want to replace with an RDF data model and storage. To achieve this successfully I need to make sure that the data nicely dovetails with the JSP 2.0/JSTL templating environment we’ve built on top. However I don’t want to model everything as objects if I can help it, because by doing so I’m going to sacrifice some of the flexibility I gain from using RDF.
Ideally I want to gut the current Data Access Objects and replace them with code that navigates the underlying RDF graph, perhaps using an RDF query language, and then returns a subset of that graph in a form that is suitable for traversing with JSTL. There’s not a great deal of business logic in that slice of the application so there’s little else to change.
I had been wondering whether the technique used in RDF Twig could be generalized to creation of simple object hierarchies (Lists and Maps). Rx4RDF might be another useful place to mine for ideas.
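As a rough illustration of the kind of result I have in mind, the sketch below turns the statements about one resource into nested Maps and Lists. It is a sketch under assumptions, not a finished design: Triple and GraphView are hypothetical stand-ins for a real RDF API, objects are plain strings, and nesting is cut off at a fixed depth so that cycles in the graph cannot recurse forever.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One RDF statement with a plain-string object; a stand-in for a real RDF API. */
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

final class GraphView {
    private GraphView() {
    }

    /**
     * Builds a Map of property -> list of values for the given subject.
     * A value that is itself a subject in the graph is expanded into a
     * nested Map while depth is greater than zero; otherwise it is left
     * as the identifier string.
     */
    static Map<String, List<Object>> describe(List<Triple> graph, String subject, int depth) {
        Map<String, List<Object>> view = new LinkedHashMap<>();
        for (Triple t : graph) {
            if (!t.subject.equals(subject)) {
                continue;
            }
            Object value = t.object;
            if (depth > 0 && isSubject(graph, t.object)) {
                value = describe(graph, t.object, depth - 1);
            }
            view.computeIfAbsent(t.predicate, p -> new ArrayList<>()).add(value);
        }
        return view;
    }

    /** True if any statement in the graph has the given node as its subject. */
    private static boolean isSubject(List<Triple> graph, String node) {
        for (Triple t : graph) {
            if (t.subject.equals(node)) {
                return true;
            }
        }
        return false;
    }
}
```

A Map built this way could be placed in request scope and walked from a template with the usual bracket and index syntax for Maps and Lists, so no purpose-built bean classes would be needed.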
Suggestions for other useful APIs or techniques to explore will be gratefully received.
btw, if you find that you start extending your object model to allow arbitrary property annotation, and some of those properties are actually pointers to other objects in your graph, then that’s probably a sign that you may be better off using an RDF based model. And possibly Python too but I’ve not explored that angle yet.
Via Cafe con Leche I notice that Saxon 7.9 has been released. The interesting thing is that Mike Kay has founded Saxonica Limited which will offer professional services and additional modules, including a schema-aware processor as a commercial offering.
I’ve used Saxon for a long time now. It’s my XSLT processor of choice. I’ve never bothered with Xalan or other processors as Saxon has always Just Worked.
Like any good tool Saxon is adjustable enough to help you solve any particular problem. Just recently I’ve benefited from both the saxon:preview extension, which helped me deal with a large transform, and the very easy extension mechanism that allowed me to invoke some Java code during a transformation (generating a SHA1 sum for an email address).
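The Java side of such an extension can be very small. The following is a minimal sketch rather than the code I actually used: the class and method names (Sha1Extension, sha1Hex) are hypothetical, and it simply hashes whatever string it is given. If the sum is destined for FOAF, note that the convention there is to hash the full mailto: URI rather than the bare address.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper class; a static method like this is the sort of thing a stylesheet extension can call. */
final class Sha1Extension {
    private Sha1Extension() {
    }

    /** Returns the SHA-1 digest of the UTF-8 bytes of the input as 40 lowercase hex digits. */
    static String sha1Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a standard JDK algorithm, so this should not happen in practice.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

Calling Sha1Extension.sha1Hex("mailto:someone@example.org") would give the value to use for a hashed mailbox; the address here is only a placeholder.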
I think it’s good news that Mike is intending to continue offering the basic product for free and wish him well in the commercial venture.
In response to a feature request from L. M. Orchard I’ve just spent a couple of hours packaging up the FOAF-a-Matic Mark 2 as a Java Web Start application.
Actually creating the requisite JNLP file was straightforward; the specification is clear and the format simple. I very quickly had the application launching from a web page link. What took a bit longer was working out how to sign the jar files so that I could request permission to access the file system, open local ports and remote connections. Actually, with the current version of JNLP you have to request all permissions; there’s no granularity in what you can request or grant access to. Surprising really, as you’d expect this to be relatively easy to implement given that the underlying security manager and permissions model is all in place.
Anyway, the JNLP and jarsigner documentation just refer you to a certificate authority to get a certificate to sign your jar files. This is frustrating as I’m not about to fork out for a certificate when I’m giving the code away for free. A quick bit of googling dug up this excellent document from Richard Dallaway, “Java Web Start and Code Signing”. Dallaway had met exactly this problem and documented how to sign up for a free certificate from Thawte.
Completing the requisite application forms and waiting for email confirmations ate up the rest of the time required to get FM Mark 2 running under Web Start. Happily Ant already has tasks for signing jars so it was quite straightforward to add a new target to my build file to create the Web Start distribution.
The lesson to be learned here is to take the time to write up any non-trivial problems you resolve, because you’re going to save someone (and probably many people) from floundering around. Doing so will bring good karma. Guaranteed.
The Web Start enabled FM Mark 2, plus a couple of bug fixes, will be beta-2.1 arriving at a browser near you shortly.
I’m very pleased to say that my latest tutorial for IBM developerWorks is now up on their site:
Entity Management in XML applications
It covers the XML catalog specification and using the Apache XML Resolver classes to add catalog support to your XML applications. Why would you do that? Read the tutorial and find out…
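For a flavour of what catalog support buys you, here is a deliberately simplified sketch using only the standard SAX EntityResolver interface. It is not the Apache XML Resolver API and it does not read a real catalog file: MapEntityResolver is a hypothetical class that maps public identifiers to local copies from an in-memory Map, which is roughly the job a catalog-aware resolver does for you.

```java
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical, much-simplified stand-in for a catalog-backed resolver. */
final class MapEntityResolver implements EntityResolver {
    private final Map<String, String> publicIdToLocalUri;

    MapEntityResolver(Map<String, String> publicIdToLocalUri) {
        this.publicIdToLocalUri = publicIdToLocalUri;
    }

    /**
     * Returns an InputSource pointing at the local copy when the public
     * identifier is known, or null so that the parser falls back to
     * fetching the original system identifier itself.
     */
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String local = (publicId == null) ? null : publicIdToLocalUri.get(publicId);
        if (local == null) {
            return null;
        }
        InputSource source = new InputSource(local);
        source.setPublicId(publicId);
        return source;
    }
}
```

An instance can be handed to a SAX XMLReader or a DocumentBuilder through their setEntityResolver methods, after which a DTD referenced by a known public identifier would be read from the local copy instead of over the network.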
I’ve just uploaded the beta-1 of my shiny new Java API for the MusicBrainz RDF web service.
If you’re not familiar with MusicBrainz, it’s similar to CDDB: it stores lists of artists, albums and tracks that can be used to add metadata to your music collection. Aaron Swartz wrote a nice article on it a while ago: “MusicBrainz: A Semantic Web Service” (warning: PDF).
There’s been a C/C++ API for some time now with bindings for other languages, but no Java API. And as I want to hook some Java code up to the server I went ahead and wrote one.
It’s not complete yet. It’s read-only at the moment so doesn’t support the query methods used to authenticate and submit data to the service. However this is enough for me at present and I thought I’d release it in case anyone else finds it useful.
The API is built on the spangly new Jena 2 API, and provides “raw” access to the RDF responses from the server or a simple bean interface for those of you not interested in the RDF.
You can download the API and read the package documentation online. The latter contains a few code fragments and enough information to get you started. The unit tests are pretty comprehensive too, so look there for additional examples.
This API is released under the Creative Commons Attribution-ShareAlike License.
A while ago I bemoaned the fact that there wasn’t an independent JBoss documentation project. I was pleasantly surprised to discover via a comment left under that posting that there is now such a beast:
JBoss Documentation Wiki
There’s even some initial content in there. If this gets some serious attention from you Java bloggers out there, who knows what we’ll end up with?
In fact if all the people who have spent the last few days raving about the JBoss project management, and the fact that some guy called Gavin has recently changed his job, instead spent their time doing a bit of Wiki gardening, we’d probably have a very useful resource indeed.