Digital public institutions for the information commons?

I’ve been thinking a bit about “the commons” recently. Specifically, the global information commons that is enabled and supported by Creative Commons (CC) licences. This covers an increasingly wide variety of content as you can see in their recent annual review.

The review unfortunately doesn’t mention data, although there’s an increasing amount of it being published under CC (or compatible) licences. Hopefully they’ll cover that in more detail next year.

I’ve also been following with interest Tom Steinberg’s exploration of Digital Public Institutions (Part 1, Part 2). As a result of my pondering about the information and data commons, I think there are a couple of other types of institution we might add to Tom’s list.

My proposed examples of digital public institutions are deliberately broad. They’re intended to serve the citizens of the internet, not just any one country.

Commons curators

Everyone has seen interesting facts and figures about the rapidly growing volume of activity on the web. These are often used as examples of dizzying growth, and as a jumping-off point for imagining the next future shocks that are only just over the horizon. The world is changing at an ever-increasing rate.

But all of that activity is also an archival challenge. The majority of that material will never be listened to, read or watched. Data will remain unanalysed. And in all likelihood much of it may disappear before anyone has had a chance to unlock its potential. Sometimes media needs time to find its audience.

This is why projects like the Internet Archive are so important. I think the Internet Archive is one of the greatest achievements of the web. If you need convincing, watch this talk by Brewster Kahle. If, like me, you’re of a certain age, then these two things alone should be enough to win you over.

I think we might see, and arguably need, more digital public institutions that are not just archiving great chunks of the web, but also working with that material to help present it to a wider audience.

I see other signals that this might be a useful thing to do. Think about all of the classic film, radio and TV material that is never likely to see the light of day again. Not just for rights reasons, but also because it’s not HD quality or hasn’t been cut and edited to reflect modern tastes. I think this is at least partly the reason why we see so many reboots and remakes.

Archival organisations often worry about how to preserve digital information. One tactic is to migrate between formats to ensure information remains accessible. What if we treated media the same way, e.g. by re-editing or remastering it to make it engaging to a modern audience? Here’s an example of modernising classic scientific texts, and another that is remixing Victorian jokes as memes.

Maybe someone could spin a successful commercial venture out of this type of activity. But I’m wondering whether you could build a “public service broadcasting” organisation that presented refined, edited, curated views of the commons? I think there’s certainly enough raw materials.

Global data infrastructure projects

The ODI have spent some time this year trying to bring into focus the fact that data is now infrastructure. In my view the best exemplar of a truly open piece of global data infrastructure is OpenStreetMap (OSM). A collaboratively maintained map of our world. Anyone can contribute. Anyone can use it.

OSM was set up to try to solve the issue that the UK’s mapping and location infrastructure was, and largely still is, tied up with complex licensing and commercial models. Rather than knocking at the door of existing data holders to convince them to release their data, OSM shows what you can deliver with the participation of a crowd of motivated people using modern technology.

It’s a shining example of the networked age we live in.

There’s no reason to think that this couldn’t be done for other types of data, creating more publicly owned infrastructure. There are now many more ways in which people could contribute data to such projects, whether that information is about themselves or the world around them.

Getting good coverage and depth of data could also potentially be achieved very quickly. Costs to host and serve data are dropping too, so sustainability becomes more achievable.

And I also feel (hope?) there is a growing unease with so much data infrastructure being owned by commercial organisations. So perhaps there’s a movement towards wanting more of this type of collaboratively owned infrastructure.

Data infrastructure incubators

If you buy into the idea that we need more projects like OSM, then it’s natural to start thinking about the common features of such projects: those that make them successful and sustainable. There are likely to be some common organisational patterns that can be used as a framework for designing these organisations. While focused on scholarly research, I think this is the best attempt at capturing those patterns that I’ve seen so far.

Given a common framework, it becomes possible to create incubators whose job is to launch these projects and coach, guide and mentor them towards success.

So that is my third and final addition to Steinberg’s list: incubators that are focused not on the creation of the next start-up “unicorn” but on generating successful, global collaborative data infrastructure projects. Institutions whose goal is the creation of the next OpenStreetMap.

These types of project have a huge potential impact as they’re not focused on a specific sector. OSM is relevant to many different types of application, and its data is used in many different ways. I think there’s a lot more foundational data of this type which could, and should, be publicly owned.

I may be displaying my naivety, but I think this would be a nice thing to work towards.

Thoughts on Coursera and Online Courses

I recently completed my first online course (or “MOOC”) on Coursera. It was an interesting experience and I wanted to share some thoughts here.

I decided to take an online course for several reasons. Firstly the topic, Astrobiology, was fun and I thought the short course might make an interesting alternative to watching BBC documentaries and US TV box sets. I certainly wasn’t disappointed as the course content was accessible and well-presented. As a biology graduate I found much of the content was fairly entry-level, but it was nevertheless a good refresher in a number of areas. The mix of biology, astronomy, chemistry and geology was really interesting. The course was very well attended, with around 40,000 registrants and 16,000 active students.

The second reason I wanted to try a course was that MOOCs are so popular at the moment. I was curious how well an online course would work, in terms of both content delivery and the social aspects of learning. Many courses are longer and more rigorously marked and assessed, but the short Astrobiology course looked like it would still offer some useful insights into online learning.

Clearly some of my experiences will be specific to the particular course and Coursera, but I think some of the comments below will generalise to other platforms.

Firstly, the positives:

  • The course material was clear and well presented
  • The course tutors appeared to be engaged and actively participated in discussions
  • The ability to download the video lectures, allowing me to (re)view content whilst travelling, was really appreciated. Flexibility around consuming course content seems like an essential feature to me. While the online experience will undoubtedly be richer, I’m guessing that many people are doing these courses in spare time around other activities. With this in mind, video content needs to be available in easily downloadable chunks.
  • The Coursera site itself was on the whole well constructed. It was easy to navigate to the content, tests and discussions. The service offered timely notifications that new content and assessments had been published.
  • Although I didn’t use it myself, the site offered good integration with services like Meetup, allowing students to start their own local groups. This seemed like a good feature, particularly for longer running courses.

However there were a number of areas in which I thought things could be greatly improved:

  • The online discussion forums very quickly became unmanageable. With so many people contributing, across many different threads, it’s hard to separate the signal from the noise. The community had some interesting extremes: from people associated with the early NASA programme, through to alien-contact and conspiracy-theory nut-cases. While those particular extremes are peculiar to this course, I expect other courses may experience similar challenges.
  • Related to the above point, the ability to post anonymously in forums led to trolling on a number of occasions. I’m sensitive to privacy, but perhaps pseudonyms would be better than full anonymity?
  • The discussions are divorced from the content; e.g. I can’t comment directly on a video, I have to create a new thread for it in a discussion group. I wanted to see something more sophisticated, maybe SoundCloud-style annotations on the videos or per-video discussion threads (see the sketch after this list).
  • No integration with wider social networks: there were discussions also happening on Twitter, G+ and Facebook. Maybe it’s better to just integrate those, rather than offer a separate discussion forum?
  • Students consumed content at different rates which meant that some discussions contained “spoilers” for material I hadn’t yet watched. This is largely a side-effect of the discussions happening independently from the content.
  • Coursera offered a course wiki but this seemed useless
  • It wasn’t clear to me during the course what would happen to the discussions after the course ended. Would they be wiped out, preserved, or would later students build on what was there already? Now that it’s finished, it looks like each course is instanced and discussions are preserved as an archive. I’m not sure what the right option is there. Starting with a clean slate seems like a good default, but could particularly useful discussions be highlighted in later courses? Seems like the course discussions would be an interesting thing to mine for links and topics, especially for lecturers.
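
On the annotations idea mentioned above, here’s a minimal sketch of what per-video, time-anchored discussion might look like as a data model. The names here are entirely my own invention, not anything Coursera actually offers:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Annotation:
        """A comment anchored to a moment in a lecture video."""
        author: str
        timestamp_s: float  # offset into the video, in seconds
        text: str

    @dataclass
    class VideoThread:
        """All annotations for one lecture video, SoundCloud-style."""
        video_id: str
        annotations: List[Annotation] = field(default_factory=list)

        def annotations_between(self, start_s: float, end_s: float) -> List[Annotation]:
            """Annotations anchored within a time window, ordered by offset."""
            return sorted(
                (a for a in self.annotations if start_s <= a.timestamp_s < end_s),
                key=lambda a: a.timestamp_s,
            )

    # Usage: surface the discussion alongside minute three of a lecture.
    thread = VideoThread(video_id="astrobiology-week1-lecture2")
    thread.annotations.append(Annotation("alice", 185.0, "Is this true of archaea too?"))
    print(thread.annotations_between(180.0, 240.0))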

There are some interesting challenges with designing this kind of product. Unlike almost every other social application, the communities for these courses don’t ramp up over time: they arrive en masse at a particular date and then more or less evaporate overnight.

As a member of that community this makes it very hard to identify which people are worth listening to and who to ignore: all of a sudden I’m surrounded by 16,000 people all talking at once. When things ramp up more slowly, I can build out my social network more easily. Coursera doesn’t have any notion of study groups.

I expect the lecturers must have similar challenges, as very quickly they’re faced with a lot of material that they might have to read, review and respond to. This must make engaging with each new intake difficult.

While a traditional discussion forum might provide the infrastructure for the necessary basic communication, MOOC platforms need more nuanced social features — for both students and lecturers — to support the community. Features that are sensitive to the sudden growth of the community. I found myself wanting to find out things like:

  • Who is posting on which topics and how frequently?
  • Which commentators are getting up-voted (or down-voted) the most?
  • Which community members are at the same stage in the course as me?
  • Which community members have something to offer on a particular topic, e.g. because of their specific background?
  • What links are people sharing in discussions? Perhaps filtered by users.
  • What courses are my fellow students undertaking next? Are there shared journeys?
  • Is there anyone watching this material at the same time?

Answering all of these requires more than just mining discussions, but it feels like some useful metrics could nevertheless be derived. For example, one common use of the forums was to share additional material, e.g. recent news reports, scientific papers, YouTube videos, etc. That kind of content could either be collected in other ways, e.g. via a shared reading list, or as a list that is automatically surfaced from discussions. I ended up sifting through the forums and creating a reading list on Readlists, as well as a YouTube playlist, just to see whether others would find them useful (they did).
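
As a rough illustration, here’s a minimal sketch of how such a reading list might be surfaced automatically from discussions. The forum-export format is an assumption on my part; a real platform would expose posts via an API or data dump:

    import re
    from collections import Counter

    URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")

    def shared_links(posts):
        """Count every URL shared across a collection of (author, text) posts."""
        counts = Counter()
        for _author, text in posts:
            for url in URL_PATTERN.findall(text):
                counts[url.rstrip(".,")] += 1  # trim trailing punctuation
        return counts

    # Placeholder posts, standing in for a real forum export.
    posts = [
        ("alice", "Great overview of extremophiles: https://example.org/paper.pdf"),
        ("bob", "See https://example.org/paper.pdf and https://example.org/lecture"),
    ]

    # Most-shared links first: a crude, automatically generated reading list.
    for url, count in shared_links(posts).most_common():
        print(count, url)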

We can see all of these challenges playing out in wider social media, but with a MOOC they’re often compressed into relatively short time spans.

(Perhaps inevitably) I also kept thinking that much of the process of creating, delivering and consuming the content could be improved with better linking and annotation tools. Indeed, do we even need specialised MOOC platforms at all? Why not just place all of the content on services like YouTube, Readlists, etc.? Isn’t the web our learning infrastructure?

Well, I think there is a role for these platforms. Their role in certification — these people have taken this course — is clearly going to become more important, for example.

However I think their real value is in marking out a space within which the learning experience takes place: these people are taking this content during this period. The community needs a focal point, even if it’s short-lived.

If everything was just out on the web, with no real definition to the course, then the community experience would completely dissolve. Concentrating the community into time-boxed, instanced courses creates a focus that can enrich the experience. The challenge is balancing unwieldy MOOC “flashmobs” against a more diffuse internet community.

The Science of Alien

I’ve been digging through some old files and papers recently, partly prompted by sorting out the loft and also various hard disks with backups of documents and photos.

Amongst the papers I found this fun piece that I wrote back in 1994: A Speculative Paper on Xenomorph Biology.

I wrote it whilst watching a re-run of Alien shortly after finishing my degree. I got to wondering: if we took the events in the films at face value, what could we then guess about the Alien’s biology and origin? Reading it back now has made me wince quite a bit. Younger me needed an editor. I think I was trying for the feel of an academic paper or report, but it’s also obviously part science-fiction story.

Despite it being a bit sketchy — and clear evidence as to why I never built a career as a writer! — I think it’s stood up pretty well. Even against the revelations in Prometheus. My fictional scientist even guessed that the “Space Jockey” (as it’s now called) was there as part of a terra-forming team, and that they were over-run by their own engineered, bio-mechanical servants.

For some better-informed attempts at applying science to scifi/fantasy, you might want to look at “Godzilla from a Zoological Perspective” (why isn’t it free?!) or “The pyrophysiology and sexuality of Dragons”. The former is a semi-serious paper, while the latter was published on 1st April 2002. Also, check the lead author’s name.

Anyway, thought I’d post that as a bit of fun for a Friday evening. Have a good weekend.

Ants, Overlays and Open Data

Whilst standing behind the yellow line on the platform this morning, waiting for a train to Oxford, I noticed an ant on the floor, wending its way along the tarmac within the bounds of the thick yellow paint. The little black speck stood out quite sharply against the bright yellow. Obviously the ant wasn’t following the line, but neither was it moving randomly. It was clearly following its own little invisible marker, an ant scent trail, that just happened to coincide with the platform markings.

Last night BBC 1 showed Britain from Above, an aerial view of Britain during a 24-hour period. The show had some great information visualisations, including traffic patterns for taxis, garbage collection, commuters, shipping and aircraft, as well as more static landmarks such as railway lines, electricity cables, water courses, and telephone and network cabling. If you didn’t catch it, the programme is definitely worth a watch.

It was this bird’s-eye view of the world that led me to reflect on that ant and its invisible trail. I wonder how many other layers of information could have been added to the human-centric views shown in the programme? Animal migratory paths are an obvious one. Paths of dispersal, ranges and colonization are some others. It doesn’t take long to come up with many, many more.

The combinations of different paths and layers are also interesting to explore. Are many of these chance overlaps, like the ant on the paint, or are there dependencies and inter-relations? For example, how are migratory routes affected by no-fly zones or shipping lanes? Do migratory pathways begin to align with man-made features like roads and railways? And where have features like fish ladders and toad tunnels been introduced to avoid clashes between competing uses for the same space?

It’s doubtful that these kinds of questions will be answered in the rest of the series. Judging by the trailer for next week’s episode, there seems to be more of a “pop geography” focus. (I’ll be tuning in regardless.)

The truly exciting thing is that we can do this kind of exploration of layered information sources through map-based visualizations ourselves, using a huge, and growing, range of commodity tools and data sets.
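
As a small, hedged example of what I mean: overlaying two such layers on a map is only a few lines of Python with a library like folium. The coordinates below are invented placeholders, not real migration or railway data:

    import folium

    # Base map roughly centred on Britain.
    m = folium.Map(location=[54.0, -2.0], zoom_start=6)

    # Layer 1: a (placeholder) migratory route.
    migration = folium.FeatureGroup(name="Migratory route (placeholder)")
    folium.PolyLine([(50.5, -1.0), (52.0, -1.5), (54.5, -2.5)], color="blue").add_to(migration)
    migration.add_to(m)

    # Layer 2: a (placeholder) railway line, to eyeball where the two align.
    railway = folium.FeatureGroup(name="Railway line (placeholder)")
    folium.PolyLine([(50.4, -1.1), (52.1, -1.4), (54.4, -2.4)], color="red").add_to(railway)
    railway.add_to(m)

    # A control to toggle each layer on and off.
    folium.LayerControl().add_to(m)
    m.save("overlays.html")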

Whilst watching the programme, what intrigued me more than the admittedly beautiful animations were questions such as: how did they approach the information holders in order to get permission to use it? What steps were taken towards privacy and anonymity? For the BBC it’s going to be very easy to get access to all kinds of data. Not least because they have resources to spend, but also because their reputation precedes them and the result of sharing the data is immediate: “don’t you want to be on the telly?”

Open data advocates may do well to band together to form an organization that can become a focal point for activism and, importantly, trust. Such an organization could recommend best practices, including auditing of data for privacy. It could also put together a showcase of the end results: creative visualizations of published data. It may be easier to approach data owners as a member or representative of such a collective, open, distributed, collegial organization than as an independent interested hacker.

But creating a compelling presentation is about more than just having the right technology and data. A good visualization tells a story. It’s through stories that data really comes alive. The open data movement needs the involvement of strongly creative people as much as (and perhaps more than) technology people.

You need to be able to do more than animate a little black speck against a yellow band: where was that little ant going?

The Modern Palimpsest

The following is a brief summary of a talk I gave recently at the Ingenta Publisher Forum on the 28th November. The slides are available as a PowerPoint presentation.

In the presentation I tried to highlight some of the possibilities that could become available if academic publishers begin to share more metadata about the content they publish, ideally by engaging with the scientific community to expose “raw” data and results.

Nature Quote

There’s a short article in Nature (subscribers only, I’m afraid) this week about Google Base and its potential impacts on the science community; in particular, whether it might galvanise greater data sharing between scientists.

I’ve been corresponding with Declan Butler, the author of the piece, on this and some related topics recently, and he ended up quoting me:

WebCite

Alf Eaton posts today to point to the new WebCite service. This is going to be very useful. Don’t think so? Well, there’s plenty of research to show that link atrophy is a big problem in scientific literature:
Persistence of Web References in Scientific Research
See also: “A study of missing Web-cites in scholarly articles: towards an evaluation framework”, which reports that “[a]fter evaluating 2162 bibliographic references it was found that 48.1% (1041) of all citations used in the papers referred to a Web-located resource. A significant number of references to URLs were found to be missing (45.8%)…”
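
The basic check behind studies like these is easy to reproduce. Here’s a minimal sketch (the URLs are placeholders; a real run would pull references from a bibliography) that flags web citations which no longer resolve, which is exactly the rot that WebCite’s archived snapshots are meant to guard against:

    import requests

    # Placeholder citations; a real check would read these from a reference list.
    references = [
        "https://example.org/cited-paper",
        "https://example.org/vanished-resource",
    ]

    for url in references:
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
            status = resp.status_code
        except requests.RequestException:
            status = None  # DNS failure, timeout, etc.
        if status != 200:
            print(f"possibly rotten: {url} (status={status})")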