The phrase “fictional data” popped into my head recently, largely because of odd connections between a couple of projects I’ve been working on.
It’s stuck with me because, if you set aside the literal meaning of “data that doesn’t actually exist”, there are some interesting aspects to it. For example, the phrase could apply to:
- data that is deliberately wrong or inaccurate in order to mislead – lies or spam
- data that is deliberately wrong as a proof of origin or claim of ownership – e.g. inaccuracies introduced into maps to identify their sources, or copyright easter eggs
- data that is deliberately wrong, but intended as a prank – e.g. the original entry for Uqbar on Wikipedia. Uqbar is actually a doubly fictional place.
- data that is fictionalised (but still realistic) in order to support testing of some data analysis – e.g. a set of anonymised and obfuscated bank transactions
- data that is fictionalised in order to avoid being a nuisance, causing confusion, or creating accidental linkage – like 555-prefix telephone numbers or perhaps social media account names
- data that is drawn from a work of fiction or a virtual world – such as the Marvel Universe social graph, the Elite: Dangerous trading economy (context), or the data and algorithms relating to Pokémon capture.
I find all of these fascinating, for a variety of reasons:
- How do we identify and exclude deliberately fictional data when harvesting, aggregating and analysing data from the web? Credit to Ian Davis for some early thinking about attack vectors for spam in Linked Data. While I’d expect copyright easter eggs to become less frequent, they’re unlikely to disappear completely. But we can definitely expect more and more deliberate spam and attacks on authoritative data. (Categories 1, 2, 3)
- How do we generate useful synthetic datasets that can be used for testing systems? Could we generate data based on some rules and a better understanding of real-world data, as a safer alternative to obfuscating data that is shared for research purposes? It turns out that some fictional data is a good proxy for real-world social networks. And analysis of videogame economics is useful for creating viable long-term communities. (Categories 4, 6)
- Some of the most enthusiastic collectors and curators of data are those who are documenting fictional environments. Wikia is a small universe of mini-wikipedias, complete with infoboxes and structured data. What can we learn from those communities, and what better tools could we build for them? (Category 6)
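
To make the synthetic data question above a little more concrete, here is a minimal sketch of generating fictional bank transactions from a few rules, rather than obfuscating real ones. Everything in it is an assumption made up for illustration: the `RULES` table, the category names, the amount ranges, the `TXN-`/`ACC-` identifier formats and the `synthetic_transactions` function itself. A realistic generator would need rules derived from a proper understanding of the real data.

```python
import random
from datetime import date, timedelta

# Hypothetical rules: category -> (relative weight, min amount, max amount).
# These values are invented for illustration, not taken from real data.
RULES = {
    "groceries": (5, 5.00, 120.00),
    "transport": (3, 2.00, 60.00),
    "rent": (1, 600.00, 1500.00),
}


def synthetic_transactions(n, seed=0, start=date(2020, 1, 1)):
    """Return n fictional transactions drawn from RULES.

    A fixed seed makes the output reproducible, which is handy for tests.
    """
    rng = random.Random(seed)
    categories = list(RULES)
    weights = [RULES[c][0] for c in categories]
    rows = []
    for i in range(n):
        # Pick a category in proportion to its weight.
        category = rng.choices(categories, weights=weights)[0]
        _, low, high = RULES[category]
        rows.append({
            "id": f"TXN-{i:06d}",
            "account": f"ACC-{rng.randrange(1000):04d}",
            # Spread dates over the 365 days following `start`.
            "date": (start + timedelta(days=rng.randrange(365))).isoformat(),
            "category": category,
            "amount": round(rng.uniform(low, high), 2),
        })
    return rows


if __name__ == "__main__":
    for row in synthetic_transactions(3, seed=42):
        print(row)
```

Because no real account ever feeds into the output, there is nothing to de-anonymise. The trade-off is that the data is only as realistic as the rules behind it.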