Semantic Web != Text Analysis; Semantic Web != Controlled Vocabularies

Stefano says the future of the semantic web is LSI. While I agree that LSI is definitely a cool technology, and an interesting alternative to Bayesian techniques, I don’t agree that it’s the future of the semantic web. The semantic web isn’t about text analysis. It’s not about data mining a document corpus. Those are merely possible applications of the technology.
There are vast amounts of high quality data being produced with semantic web technologies all the time. No need to apply LSI there.
Danny has already deconstructed this piece, so I won’t comment a great deal further beyond saying that, despite the title, the article has nothing to say about what works or doesn’t work about ontologies; it merely describes some of the issues a search engine vendor faces when indexing text.
The article also glosses over a lot of interesting activity elsewhere. For example Norvig says that:
Essentially what we’re doing here is using the power of masses of untrained people who you aren’t paying to do all your work for you, as opposed to trying to get trained people to use a well-defined formalism and write text in that formalism and let’s just use the stuff that’s already out there. I’m all for this idea of harvesting this “unskilled labor” and trying to put it to use using statistical techniques over masses of large data and filtering through that yourself, rather than trying to closely define it on your own.
What about all the buzz about folksonomies? What’s that if it’s not harnessing “unskilled labor” to generate structured metadata? Just because it’s a few simple tags doesn’t mean it’s not metadata; just because it’s categorization without using a formal ontology doesn’t mean it’s not generating useful machine-readable metadata. And should we ignore it because it’s free-form and user-generated? Technorati apparently don’t think so.
Over that metadata I can start making additional statements, drawing together related tags, drilling down to extract rich metadata about the photos, etc. This is what the semantic web is about: building a machine-readable infrastructure over and above what we already have.
But while the new Technorati service is undoubtedly cool, before one gets too excited over it, read through this paper on folksonomies which compares tagging practices in flickr and The different media and communities of the two sites lead to some quite different results. The Technorati service can provide some good illustrations of that and other issues raised in the paper.
We don’t have to throw everything out and start again. We don’t have to restrict ourselves to mining what’s out there, and we don’t have to wait for “ontology astronauts” to deliver us a set of fixed ontologies before we can begin doing useful work. While I agree with Shirky that comparisons of controlled vocabularies and folksonomies should account for economic factors, I disagree with the implication that it’s an either/or choice.
What about a folksonomy created by a designated community of experts, e.g. researchers in the field of interest? It’s not a controlled vocabulary in the usual sense, has the economic benefits of a folksonomy, but needn’t suffer from some of its problems. What about applying an editorial layer on top of a folksonomy to draw together related tags, etc? It’d address some of the issues outlined in the folksonomy review paper I referenced above. There’s plenty of room in between the two extremes of controlled vocabularies and user-generated tagging.
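To make the editorial-layer idea a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the photo identifiers, the tags, and the `editorial_map` table are invented for illustration, and a real service would presumably curate the mapping with far more care. The point is only that the raw, user-generated tags are left untouched while a thin layer on top draws related tags together.

```python
from collections import defaultdict

# Hypothetical raw tags exactly as users typed them: (item, tag) pairs.
raw_tags = [
    ("photo1", "nyc"),
    ("photo2", "newyork"),
    ("photo3", "New York"),
    ("photo4", "cat"),
    ("photo5", "cats"),
]

# The editorial layer: a small hand-maintained table mapping free-form
# tags to a preferred label. Tags not listed pass through unchanged,
# so nothing the users contributed is thrown away.
editorial_map = {
    "nyc": "new-york",
    "newyork": "new-york",
    "new york": "new-york",
    "cats": "cat",
}


def normalize(tag):
    """Return the preferred label for a tag, or the cleaned tag itself."""
    key = tag.strip().lower()
    return editorial_map.get(key, key)


def group_by_concept(pairs):
    """Group items under the preferred label of each of their tags."""
    groups = defaultdict(set)
    for item, tag in pairs:
        groups[normalize(tag)].add(item)
    return dict(groups)


grouped = group_by_concept(raw_tags)
```

Because the mapping lives outside the tag data, it can be revised, extended, or replaced by a different community's view without asking anyone to re-tag anything.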
It’s a requirement of the key technologies underlying the semantic web that one can create vocabularies in just such a distributed fashion, and relate them together when we need to.
Boot-strapping machine readable data from existing sources. Late binding of data to application schemas.
That’s the future and benefit of the semantic web IMO.
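As a rough illustration of relating independently created vocabularies and binding data to an application's view late, here is a small Python sketch. It is not real RDF or OWL tooling: statements are plain (subject, predicate, object) tuples, all identifiers (`tags:label`, `bm:keyword`, `photo:42` and so on) are made up, and `equivalentTo` is an invented stand-in for the sort of mapping that OWL expresses with `owl:equivalentProperty`.

```python
# Two sources that each invented their own vocabulary, independently.
photo_site = [
    ("photo:42", "tags:label", "sunset"),
    ("photo:42", "tags:taggedBy", "user:alice"),
]
bookmark_site = [
    ("link:7", "bm:keyword", "sunset"),
]

# Added afterwards, possibly by a third party: a statement that relates
# the two vocabularies. Neither source has to change its data.
mappings = [
    ("bm:keyword", "equivalentTo", "tags:label"),
]


def equivalents(predicate, mapping_triples):
    """Return the predicate plus everything declared equivalent to it,
    following the mappings in both directions and transitively."""
    seen = {predicate}
    frontier = [predicate]
    while frontier:
        current = frontier.pop()
        for s, p, o in mapping_triples:
            if p != "equivalentTo":
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


def subjects_with(predicate, value, data, mapping_triples):
    """Find subjects carrying the value under the predicate or any
    equivalent one. The mapping is applied at query time (late binding)."""
    preds = equivalents(predicate, mapping_triples)
    return sorted(s for s, p, o in data if p in preds and o == value)


merged = photo_site + bookmark_site
result = subjects_with("tags:label", "sunset", merged, mappings)
```

With the mapping in place, the query for `tags:label` also finds the bookmark; with an empty mapping list it finds only the photo. The data itself is the same in both cases, which is the sense in which the binding to an application schema happens late.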
