This week I co-chaired a plenary session at the ALPSP International Conference.
The goal of the session, titled “The Web’s Rich Tapestry” (abstract), was to discuss the continuing evolution of the web from a document-centric view of the world to one that is more data- and link-centric.
The first half of the session was presented by my friend and former colleague Geoff Bilder, Director of Strategic Initiatives at CrossRef. Geoff focused his discussion on the nature of the link and its implementation both on the web and in early hypertext systems. He covered some of the power that was evident in those hypertext environments and the growing need for, and awareness of, features like stable, persistent links and multi-directional links, not just in scholarly communication (where they’re already very common) but more widely on the web.
I’ve explored this theme myself. It seems to me that what we’re doing is slowly rebuilding many of the features of early hypertext environments but in a more distributed, open and scalable fashion.
In my half of the talk I focused on the evolution towards the Semantic Web. I’ve included my notes below. I don’t normally write up talks in this way, but it proved a useful way to organize my thoughts on this occasion. They’re reproduced below without much editing. The accompanying slides are on Slideshare.
(Note: this was a presentation for a non-technical audience, so there may not be much new content here for Planet RDF readers.)
The House of Leaves (cover)
I want to start today with a digression and tell you about a book I once read. It’s called the House of Leaves by Mark Danielewski.
It’s basically a tale of a haunted house, the people who live in it and investigate it, and what happens to them. It’s one of those slow-burning horror stories that sticks with you for some time. If I had to sum it up, I’d call it “The Blair Witch” of haunted house tales.
But the reason this book stuck in my mind is because of how it’s put together. It’s quite a challenging read, made up of overlapping narratives, documentary evidence, etc. As the reader you’re assembling a story out of the bits and pieces of text that the author presents you with.
The House of Leaves (sample page)
The author has even played with the printed form. The pages are put together with overlapping bits of text. There are footnotes that run for pages. And footnotes to footnotes, and so on. Certain words are coloured differently. There are even blocks of text that run DOWN through the pages, so you have to read a small block over a few pages before returning to where you started.
You can see one on the slide. But as this is on the left-hand side of the page it’s printed in reverse, as if we were looking back UP the text. The idea is to have the labyrinth of the haunted house reflected in the structure of the book.
The book is basically a hypertext novel.
And it stuck with me because it challenged my notions of the printed medium.
I’m telling you this today because what I’d like to do is challenge your notion of the medium of the web, or at least try to get you to start thinking about things slightly differently.
The Medium of the Web
So what is the medium of the web? Well, as Geoff has explained, it’s all about the links.
At the most fundamental level we have a growing number of internetworked hardware devices. Not just computers, but phones, games consoles, set-top boxes and all manner of home electronics and peripherals.
Layered on top of that is the Web: currently a collection of interlinked web pages, music, videos, and other media. It has basically been islands of documents sparsely connected by links.
It’s hard to escape the importance of the simple link. A key web design principle is that users should not be faced with “dead ends”: there should always be some links or navigation options. We still have a common perception of links as a mechanism for humans, but that’s not strictly true.
The first applications that created and followed links were search engines. And Google’s PageRank algorithm has demonstrated the power that can be gained from performing computation on links; in that case, analyzing the structure of the web in order to recommend information.
But the trend is continuing. A lot of Web 2.0 has been about discovering the power that comes from being able to assemble and process information by having machines follow links.
Blurring of Boundaries
Mashups and all the excitement around Web 2.0 are part of a general trend that is blurring the notions of “web site” and “web page”. In fact these are becoming increasingly antiquated.
Web sites are being decomposed into distributed, commodity chunks of functionality. Again, search was the first of these: it’s now trivial to create a site-specific search engine using a number of different services.
But we’re seeing this in other areas too, e.g. authentication. Authentication and user data are becoming federated and shared between web sites. Within the library sector there are standards like Shibboleth, but on the wider web there is OpenID, a rapidly adopted standard for authentication. So this is another chunk of application functionality that is moving out onto the network. And it’s all tied together by links. Web site usage statistics are another example.
There are smaller pieces of functionality too: hosting and embedding of media (photos, videos), visualization tools, etc. This is more at the web page level. When we’re looking at a page now, it’s much more likely to be composed of services and data from a number of different sites.
And then there’s the shift from the “producer-consumer” model, to one where users are more actively contributing to the content and structure of sites.
Even the user experience is no longer completely uniform. Many people now routinely have a number of browser extensions running which are changing how they view and interact with sites.
To sum it all up, what we’re experiencing is a shift from a small number of websites pushing content out towards a true network of peers: of different tools, services and data. Basically a move from broadcast models suitable for radio and TV towards a network model more suited to a medium that is based on linking.
The end goal here is what Tim Berners-Lee calls the Semantic Web. An environment which is not document-centric, but one that is data-centric. An environment where data is easily available for reuse and recombination to create new services and visualizations. And one that is easier for machines to understand so that we can have better tools to interact with that richer environment.
The Semantic Web is not a replacement for the current web. It’s an extension of it, a layer on top. It too is based, at a very fundamental level, on links between resources. Dealing with issues such as the trust and provenance of information is also a key goal of the Semantic Web project.
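To make the data-centric idea a little more concrete: the Semantic Web holds information as simple (subject, predicate, object) statements, where the subjects and objects are themselves linkable resources. Here’s a minimal sketch in plain Python; the identifiers and data are invented for illustration, not real URIs or a real vocabulary.

```python
# Information as "triples": each statement links a resource to a value
# or to another resource. All identifiers below are made up.
triples = [
    ("journal/1", "title", "Journal of Examples"),
    ("journal/1", "publishes", "article/42"),
    ("article/42", "title", "An Example Article"),
    ("article/42", "author", "person/7"),
    ("person/7", "name", "A. Researcher"),
]

def describe(resource):
    """Collect every statement made about a single resource."""
    return {pred: obj for subj, pred, obj in triples if subj == resource}

# A machine can follow the links: from an article to its author's name.
author = describe("article/42")["author"]
print(describe(author)["name"])
```

The point is that nothing here is a “page”: the article, the journal and the person are all first-class, linkable resources, so new services can recombine them in ways the original publisher never anticipated.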
The Semantic Web Illustrated
It’s a little hard to introduce the Semantic Web without getting technical. So I’m going to try to illustrate it for you so that you can hopefully start to understand the basic idea. A few of you may have seen this before, so apologies if it’s familiar.
[demonstration, using “animated” circles and arcs to build up a “semantic web”]
Linking Open Data
This process, the creation of a large scale semantic web, is not the task of any one company. Like the development of the web itself it will happen over time, as individual people, communities and businesses begin to share data.
And it is happening now. Already there is a large and growing number of data sets available. Some of these are shown on the slide. The links indicate which data sets are related to one another, and the relative sizes give some idea of the amount of data. All of Wikipedia is in there. There’s census and government data from the US, and signs that the same may be coming from the UK. There’s (some) bibliographic data, music metadata, personal information, and TV and radio information from the BBC.
There’s also a growing body of other data sets which are not yet linked in.
Like the web itself, the more there is, the more value there is for everyone taking part.
What Does This Buy Us?
OK. So much for the evangelism. What are the practical benefits if you were to open up your bibliographic metadata to the Semantic Web and link it into other data sets?
More Links; More Traffic
The first is quite simple: the more data and resources that are exposed, the more links there can be to your information and content. And more links equals more potential traffic to content and services.
A sizeable portion, if not the majority, of the traffic to each of your websites is coming from Google. That’s an incredibly fragile situation and puts a lot of power in their hands. The size of this traffic is obviously related to the popularity of the Google search engine, but it’s ENABLED by the fact that Google can link to any bit of your website. Their crawler gets into all sorts of places, including some it probably shouldn’t. There’s a large surface area they can link to.
If you’re publishing semantic data, then that surface area is going to be much, much larger, because there will be more stuff, more resources to link to. Think back to the demo: every blob on there could be a linkable resource. I’m certain that none of you are exposing as much data as you might.
So the situation with Google is a good illustration of the potential, but is, IMO, one that need only be temporary.
Better Research Tools
I think that the main benefit though is the ability to create better research tools. Research and learning environments that can make better use of all this rich data and provide all kinds of better discovery tools and productivity improvements.
Geoff coined a phrase a while ago, the “Hegemony of Search”: search is currently the predominant metaphor for how people do research and find information. But search is an awful way to do research. Search is what you do BEFORE the real work begins. Research, learning and analysis are what happen AFTER we’ve (hopefully) found some useful information. And yet we focus too much on search and not enough on what happens afterwards.
With good tools we shouldn’t need to search at all. It should be a starting point or a fallback position.
I wanted to point out a couple of interesting, and fairly recent developments in this space which I think show some promise.
The first is based around a site called FreeBase. They describe themselves as “the world’s database”. You can think of it as a Wikipedia for structured information. Instead of entering text, you enter data using structured forms. Their model is very similar to that of the Semantic Web: inter-linked resources and data.
If you haven’t looked at it, then it’s worth exploring to see an alternative model for capturing data in a wiki style. There are a number of businesses in this space at the moment. Another is Twine, which is due to launch shortly.
Parallax is a new user interface for exploring the FreeBase data. It’s built on their API by a guy who has been doing a lot of interesting research work on visualizing linked data.
There’s a webcast which I’m not going to show now, as I’m not a fan of showing videos in presentations. So I urge you to copy the tinyurl on the screen and take a look at his introduction. It will convey the power of the tool better than I can over the course of a few slides. And if you’re underwhelmed, then blame my poor presentation and still take a look. It’s one of the coolest things I’ve seen in a long time.
The insight on which Parallax is based is that some research tasks are not well supported by current search engines or information tools. The example he uses is attempting to find information about US presidents. What if you wanted to find out the birth places of all US presidents, or the schools attended by the children of all US presidents?
Conventionally you’d have to work pretty hard to achieve that. Even if you struck lucky and found all the data on Wikipedia, you would have to laboriously click through a number of different pages, e.g. on each of the presidents, and on each of their children in turn, in order to find that data, collate it yourself and THEN do something useful with it. Better user interfaces onto structured data can help with that.
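The kind of query this enables can be sketched in a few lines of plain Python. This is my own illustration, not how Parallax or FreeBase actually work internally, and the data is a tiny invented sample: the idea is simply that once the facts are held as links between resources, a tool can follow a whole chain of links in one step instead of making you click through pages.

```python
# A toy linked-data set: (subject, link-type, target) statements.
triples = [
    ("us", "president", "washington"),
    ("us", "president", "lincoln"),
    ("washington", "birthplace", "Westmoreland County, Virginia"),
    ("lincoln", "birthplace", "Hodgenville, Kentucky"),
    ("lincoln", "child", "robert_lincoln"),
    ("robert_lincoln", "school", "Harvard College"),
]

def follow(subjects, predicate):
    """Follow one type of link from a set of resources to all targets."""
    return [o for s, p, o in triples if s in subjects and p == predicate]

# "Birth places of all US presidents" in one hop...
presidents = follow(["us"], "president")
print(follow(presidents, "birthplace"))

# ...and "schools attended by the children of all US presidents"
# by chaining hops: presidents -> children -> schools.
print(follow(follow(presidents, "child"), "school"))
```

Each `follow` call is the machine equivalent of clicking through one layer of pages; chaining them is what turns a tedious manual collation exercise into a single query.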
Mozilla Ubiquity is basically a productivity tool. It explores adding a command line to the browser, so you can quickly select content and ask the tool to do something useful with it, using a “natural language” style interface: e.g. select a date and add it to your calendar, email something, translate text, create a map from an address or collection of addresses, etc.
The interesting thing is that you can share commands with one another and so build up a little library of productivity tools with trusted colleagues.
It’s very early days, but an interesting development.
Exploring the Potential
How can you explore the potential for yourselves?
Well, if you’re on, or moving to, the Ingenta pub2web platform, then you’re well on your way. Pub2web is semantic web enabled from the ground up. Internally it holds data in the right form for sharing on the semantic web.
I will briefly mention a couple of things that we’re doing at Talis, projects that are currently underway.
First there’s the Talis Platform: a more general purpose Semantic Web platform, aimed at supporting development of semantic web applications, and the sharing of open data.
But more immediately relevant to this community is a pilot project that we’re conducting with TBI Communications. One of the Talis divisions has been exploring the potential of taking semantic web data and using the connections between data, resources and people to create new ways of finding, processing and sharing ideas.
An initial prototyping exercise has illustrated the potential of taking bibliographic data from publishers and then building social networking and learning tools around it. It’s a great illustration of how data created for one purpose (i.e. publishing) can be reused in another context.
There’s a free pilot project which is about to get underway and the team is inviting society publishers to apply to join. Ideally they’re looking for several societies within a specific discipline to get a good coverage of content and members from a number of different societies. The goal is to open up this prototype environment for a limited period to explore its potential.
If you’re interested, then let me know afterwards, or look at the TBI Communications site where there’s an application form.
OK. I’d best come to some closing points.
Hopefully I’ve given you some food for thought today and started getting you thinking with a slightly different perspective about the web and how it is evolving.
I’d encourage you all to look closely at how you can explore the potential that these trends are creating, and consider how you fit into the web. And I’d also encourage you to try to think beyond search and consider what other kinds of research, discovery and, importantly, productivity tools researchers and students need.
And if I didn’t manage to get you thinking differently about the web, then at least you’ve got a book recommendation out of me!