Fearful about personal data, a personal example

I was recently at a workshop on making better use of (personal) data for the benefit of specific communities. The discussion, perhaps inevitably, ended up focusing on many of the attendees’ concerns about how data about them was being used.

The group was asked to share what made them afraid or fearful about how personal data might be misused. The examples were mainly about use of the data by Facebook, by advertisers, for surveillance, and so on. There was a view that being in control of that data would remove the fear and put the individual back in charge. This same argument pervades a lot of the discussion around personal data. The narrative is that if I own my data then I can decide how and where it is used.

But this overlooks the fact that data ownership is not a clear-cut thing. Multiple people might reasonably claim ownership over some data. For example, bank transactions between individuals. Or data about cats. Multiple people might need to have a say in how and when that data is used.

But setting aside that aspect of the discussion for now, I wanted to share what made me fearful about how some personal data might be misused.

As I’ve written here before, my daughter has Type-1 diabetes. People with Type-1 diabetes live a quantified life. Blood glucose testing and carbohydrate counting are a fact of life. Using sensors makes this easier and produces better data.

We have access to my daughter’s data because we are a family. By sharing it we can help her manage her condition. The data is shared with her diabetes nurses through an online system that allows us to upload and view the data.

What makes me fearful isn’t that this data might be misused by that system or the NHS staff.

What makes me fearful is that we might not be using the data as effectively as we could be.

We are fully in control of the data, but that doesn’t automatically give us the tools, expertise or insight to use it. There may be other ways to use that data that might help my daughter manage her condition better. Is there more that we could be doing? Is there more data we could be collecting?

I’m technically proficient enough to do things with that data. I can download, chart and analyse it. Not everyone can do that. What I don’t have are the skills, the medical knowledge, to really use it effectively.

We have access to some online reporting tools as a consequence of sharing the data with the NHS. I’m glad that’s available to us. It does a better job than I can do.

I also fear that there might be insights that researchers could extract from that data, by aggregating it with data shared by other people with diabetes. But that isn’t happening, because we have no way to really allow that. And even if we could, I’m not sure we would be qualified to judge the quality of a research project, or to know where the data might best be shared.

My aim here is not to be melodramatic. We are managing very well, thank you. And yes, there are clearly areas where unfettered access to personal data is problematic. There’s no denying that. My point is to highlight that ownership and control doesn’t automatically address concerns or create value.

We are not empowered by the data; we are empowered when it is being used effectively. We are empowered when it is shared.

Some tips for open data ecosystem mapping

At Open Data Camp last month I pitched to run a session on mapping open data ecosystems. Happily quite a few people were interested in the topic, so we got together to try out the process and discuss the ideas. We ended up running the session according to my outline and a handout I’d prepared to help people.

There’s a nice writeup with a fantastic drawnalism summary on the Open Data Camp blog. I had a lot of good feedback from people afterwards to say that they’d found the process useful.

I’ve explored the idea a bit further with some of the ODI team, which has prompted some useful discussion. It also turns out that the Food Standards Agency are working through a similar exercise at the moment to better understand their value networks.

This blog post just gathers together those links, along with a couple more examples and a quick brain dump of hints and tips for applying the tool.

Some example maps

After the session at Open Data Camp I shared a few example maps I’d created:

That example starts to present some of the information covered in my case study on Discogs.

I also tried doing a map to illustrate aspects of the Energy Sparks project:

Neither of those are fully developed, but hopefully provide useful reference points.

I’ve been using Draw.io to draw those maps, as it saves to Google Drive, which makes it easier to collaborate.

Some notes

  • The maps don’t have to focus on just the external value, e.g. what happens after data is published. You could map value networks internal to an organisation as well
  • I’ve found that the maps can get very busy, very quickly. My suggestion is to focus on the key value exchanges rather than trying to be completely comprehensive (at least at first)
  • Try to focus on real, rather than potential, exchanges of value. So, rather than brainstorming ways that sharing some data might prove useful, as a rule of thumb check whether you can point to some evidence of a tangible or intangible value exchange. For example:
    • Tangible value: Is someone signing up to a service, or is there a documented API or data access route?
    • Intangible value: Is there an event, contact point or feedback form which allows this value to actually be shared?
  • “Follow the data”. Start with the data exchanges and then add applications and related services.
  • While one of the goals is to identify the different roles that organisations play in data ecosystems (e.g. “Aggregator”), it’s often easier to start with the individual organisations and their specific exchanges, rather than with the roles themselves. Organisations may end up playing several roles, and that’s fine. The map will help evidence that.
  • Map the current state, not the future. There’s no time aspect to these maps, so I’d recommend drawing a separate map to show how you hope things might be, rather than how they are.
  • There was a good suggestion to label data exchanges in some way to add a bit more context, e.g. by using thicker lines for key data exchanges, or a marker to indicate open (versus shared or closed) data
  • Don’t forget that for almost all exchanges where a service is being delivered (e.g. an application, hosting arrangement, etc) there will also be an implicit, reciprocal data exchange. As a user of a service I am contributing data back to the service provider in the form of usage statistics, transactional data, etc. Identifying where that data is accruing (but not being shared) is a good way to identify future open data releases
  • A value network is not a process diagram. The value exchanges are between people and organisations, not systems. If you’ve got a named application on the diagram it should only appear as the label of a tangible value exchange (“provision of application X”), not as a node in the diagram
  • Sometimes you’re better off drawing a process or data flow diagram. If you want to follow how the data gets exchanged between systems, e.g. to understand its provenance or how it is processed, then a data flow diagram may serve you better. I think as practitioners we may need to draw different views of our data ecosystems, similar to how systems architects have different views for documenting software architecture.
  • The process of drawing a map is as important as the output itself. From the open data camp workshop and some subsequent discussions, I’ve found that the diagrams quickly generate useful insights and talking points. I’m keen to try the process out in a workshop setting again to explore this further
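
If it’s useful, a map can also be captured in code so that it can be queried as well as drawn. The sketch below is purely illustrative: the organisations, roles and exchanges are made-up examples, and it assumes the networkx Python library rather than anything we used in the workshop.

```python
# A minimal sketch of recording a value network as a graph.
# All node and edge names here are hypothetical examples.
import networkx as nx

g = nx.MultiDiGraph()

# Nodes are people or organisations, annotated with the role(s) they play.
g.add_node("Environment Agency", roles=["Publisher"])
g.add_node("Flood App Ltd", roles=["Aggregator", "Enabler"])
g.add_node("Local residents", roles=["User"])

# Edges are value exchanges, labelled as tangible or intangible,
# with an optional openness marker as suggested in the notes above.
g.add_edge("Environment Agency", "Flood App Ltd",
           value="flood monitoring data", kind="tangible", openness="open")
g.add_edge("Flood App Ltd", "Local residents",
           value="provision of flood alert app", kind="tangible")
g.add_edge("Local residents", "Flood App Ltd",
           value="usage statistics", kind="tangible", openness="closed")
g.add_edge("Local residents", "Environment Agency",
           value="feedback on data quality", kind="intangible")

# "Follow the data": list every exchange in the network.
for source, target, exchange in g.edges(data=True):
    print(f"{source} -> {target}: {exchange['value']} ({exchange['kind']})")
```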

I’m keen to get more feedback on this. So if you’ve tried out the approach then let me know how it works for you. I’d be really interested to see some more maps!

If you’re not sure how to get started then also let me know how I can help, for example, what resources would be useful? This is one of several tools I’m hoping to write up in my book.

The British Hypertextual Society (1905-2017)

With the Society’s globe-spanning satellite network nearing completion, Peter Linkage reports on some of the key milestones in the history of the British Hypertextual Society.

The British Hypertextual Society was founded in 1905 with a parliamentary grant from the Royal Society of London. At the time there was growing international interest in finding better ways to manage information, particularly scientific research. Undoubtedly the decision to invest in the creation of a British centre of expertise for knowledge organisation was also influenced by the rapid progress being made in Europe.

Paul Otlet’s Universal Bibliographic Repertory and his ground-breaking postal search engine were rapidly demonstrating their usefulness to scholars. Otlet’s team had begun publishing the first version of their Universal Decimal Classification only the year before. Letters between Royal Society members during that period demonstrate concern that Britain was losing the lead in knowledge science.

As you might expect, the launch of the British Hypertextual Society (BHS) was a grand affair. The centrepiece of the opening ceremony was the Babbage Bookwheel Engine, which remains on show (and in good working order!) in their headquarters to this day. The Engine was commissioned from Henry Prevost Babbage, who refined a number of his father’s ideas to automate and improve on Ramelli’s Bookwheel concept.

While it might originally have been intended only as a centrepiece, it was the creation of this Engine that laid the groundwork for many of the Society’s later successes. Competition between the BHS members and Otlet’s team in Belgium encouraged the rapid development of new tools. This included refinements to the Bookwheel Engine, prompting its switch from index cards to microfilm. Ultimately it was also instrumental in the creation of the United Kingdom’s national grid and the early success of the BBC.

In the 1920s, in an effort to improve on the Belgian Postal Search Service, the British Government decided to invest in its own solution. This involved reproducing decks of index cards and microfilm sheets that could be easily interchanged between Bookwheel Engines. The new, standardised electric engines were dubbed “Card Wheels”.

The task of distributing the decks and the machines to schools, universities and libraries was given to the recently launched BBC as part of its mission to inform, educate and entertain. Their microfilm version of the Domesday book was the headline grabbing release, but the BBC also freely distributed a number of scholarly and encyclopedic works.

Problems with the reliable supply of electricity to parts of the UK hampered the roll-out of the Card Wheels. This led to the Electricity (Supply) Act of 1926 and the creation of the Central Electricity Board. This simultaneously laid the foundations for a significant cabling infrastructure that would later carry information to the nation in digital form.

These data infrastructural improvements were mirrored by a number of theoretical breakthroughs. Drawing on Ada Lovelace’s work and algorithms for the Difference Engine, British Hypertextual Society scholars were able to make rapid advances in the area of graph theory and analysis.

These major advances in the distribution of knowledge across the United Kingdom led to Otlet moving to Britain in the early 1930s. A major scandal at the time, this triggered the end of many of the projects underway in Belgium and beyond. Awarded a senior position in the BHS, Otlet transferred his work on the Mundaneum to London. Close ties between the BHS members and key government officials meant that the London we know today is truly the “World City” envisioned by Otlet. It’s interesting to walk through London and consider how much of the skyline and how many of our familiar landmarks are influenced by the history of hypertext.

The development of the Memex in the 1940s laid the foundations for both home and personal hypertext devices. Combining the latest mechanical and theoretical achievements of the BHS with some American entrepreneurship led to devices rapidly spreading into people’s homes. However, the device was the source of some consternation within the BHS, as it was felt that British ideas hadn’t been properly credited in the development of that commercial product.

Of course we shouldn’t overlook the importance of the InterGraph in ensuring easy access to information around the globe. Designed to resist nuclear attack, the InterGraph used graph theory concepts developed by the BHS to create a world-wide mesh network between hypertext devices and sensors. All of our homes, cars and devices are part of this truly distributed network.

Tim Berners-Lee’s development of the Hypertext Resource Locator was initially seen as a minor breakthrough. But it actually laid the foundations for the replacement of Otlet’s classification scheme and accelerated the creation of the World Hypertext Engine (WHE) and the global information commons. Today the WHE is ubiquitous. It’s something we all use and contribute to on a daily basis.

But, while we all contribute to the WHE, it’s the tireless work of the “Controllers of The Graph” in London that ensures that the entire knowledge base remains coherent and reliable. How else would we distinguish between reliable, authoritative sources and information published by any random source? Their work to fact-check information, manage link integrity and maintain core assets is a key feature of the WHE as a system.

Some have wondered what an alternate hypertext system might look like. Scholars have pointed to ideas such as Ted Nelson’s “Xanadu” as one example of an alternative system. Indeed it is one of many that grew out of the counter-culture movement in the 1960s. Xanadu retained many of the features of the WHE as we know it today, e.g. transclusion and micro-transactions, but removed the notion of a centralised index and register of content. This would not only have removed the ability to have reliable, bi-directional links, but would also have allowed anyone to contribute anything, regardless of its veracity.

For many it’s hard to imagine how such a chaotic system would actually work. Xanadu has been dismissed as “a foam of ever-popping bubbles”. And a heavily commercialised and unreliable system of information is a vision to which few would subscribe.

Who would want to give up the thrill of seeing their first contributions accepted into the global graph? It’s a rite of passage that many reflect on fondly. What would the British economy look like if it were not based on providing access to the world’s information? Would we want to use a system that was not fundamentally based on the “Inform, Educate and Entertain” ideal?

This brings us to the present day. The launch of a final batch of satellites will allow the British Hypertextual Society to deliver on a long-standing goal whilst also enabling its next step into the future.

Launched from the British space centre at Goonhilly, each of the standardised CardSat satellites carries both a high-resolution camera and an InterGraph mesh network node. The camera will be used to image the globe in unprecedented detail. This will ensure that every key geographical feature, including every tree and many large animals, can be assigned a unique identifier, bringing them into the global graph. And, by extending the mesh network into space, the BHS will ensure that the InterGraph has complete global coverage, whilst also improving connectivity between the fleet of British space drones.

It’s an exciting time for the future of information sharing. Let’s keep sharing what we know!

Designing CSV files

A couple of the projects I’m involved with at the moment are at a stage where there’s some thinking going on around how to best provide CSV files for users. This has left me thinking about what options we actually have when it comes to designing a CSV file format.

CSV is a very useful, but pretty mundane format. I suspect many of us don’t really think very much about how to organise our CSV files. It’s just a table, right? What decisions do we need to make?

But there are actually quite a few options that might make a specific CSV format more or less suited to specific audiences. So I thought I’d write down some of the options that occurred to me. It might be useful input into both my current projects and future work on standard formats.

Starting from the “outside in”, we have decisions to make about all of the following:

File naming

How are you going to name your CSV file? A good file naming convention can help ensure that a data file has an unambiguous name within a data package or after a user has downloaded it.

Including a name, timestamp or other version indicator will avoid clobbering existing files if a user is archiving or regularly collecting data.

Adopting a similar policy to generating URL slugs can help generate readable file names that work across different platforms.

The Tabular Data Package specification recommends using a .csv file name extension, which seems sensible!
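
As a quick sketch of the above (my own convention, not something required by any of these specs), a slugged, versioned file name might be generated like this:

```python
# A minimal sketch of building a slug-style, versioned CSV file name.
# The naming convention itself is just an example, not a standard.
import re
from datetime import date

def csv_filename(title: str, released: date) -> str:
    # Lower-case the title, replace anything that isn't alphanumeric with
    # hyphens, then append an ISO date as a simple version indicator.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}-{released.isoformat()}.csv"

print(csv_filename("Monthly Sales Report", date(2017, 2, 1)))
# -> monthly-sales-report-2017-02-01.csv
```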

CSV Dialect

CSV is a loosely defined format of which there are several potential dialects. Variants can use different delimiters, line endings and quoting policies. Content encoding is another variable. CSV files may or may not have headers.

The CSV on the Web standard defines a best-practice CSV dialect. Unless there’s a good reason, this ought to be your default when defining new formats. But note that the recommended UTF-8 encoding may cause some issues with Excel.

CSV on the Web doesn’t say how many header rows a CSV file should have, but it does define how multiple header rows can be skipped when parsing. Multiple header rows are often used as a way to add metadata or comments, but I’d recommend putting that information in a CSV on the Web metadata file instead, as it provides more options.
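
Purely as an illustration (the columns are made up), here’s how a file in that default dialect might be written from Python:

```python
# A minimal sketch of writing a CSV in the CSV on the Web default dialect:
# comma-separated, double quotes where needed, a single header row,
# CRLF line endings and UTF-8 encoding. Column names are hypothetical.
import csv

rows = [
    {"region": "south-west", "customer": "ACME Ltd", "product": "widgets", "total": 1200},
    {"region": "wales", "customer": "Smith & Co", "product": "gadgets", "total": 340},
]

# Tip: encoding="utf-8-sig" writes a byte-order mark instead, which can help
# Excel recognise UTF-8, at the cost of confusing some other tools.
with open("sales.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["region", "customer", "product", "total"])
    writer.writeheader()    # a single header row
    writer.writerows(rows)  # the csv module defaults to ',' and '"' with CRLF
```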

Column Naming

What naming convention should you use for columns? One option is an all-lower-case convention similar to a URL slug, which might make it marginally easier to access columns by name in an application. But if the expectation is that the CSV file will be opened in a spreadsheet application, readable column names (including spaces) will make the data more user friendly.

CSV on the Web has a few other notes about column and row labelling.

Also, what language will you use in the column headings?

Column Ordering

How are you going to order the columns in your CSV? The ordering of columns in a CSV file can enhance readability. But there are likely to be several different orderings, some of them more “natural” than others.

A common convention is to start with an identifier and the other properties (dimensions) that describe what is being reported, followed by the actual observed values. So, for example, in a sales report we might have:

region, customer, product, total

Or in a statistical dataset:

dimension1, dimension2, dimension3, value

Or:

dimension1, dimension2, dimension3, value, qualifier

This has the advantage of giving the table a more natural reading order, particularly if, as you move from left to right, the columns have fewer distinct values. Adding qualifiers and notes at the end also ensures that they sit naturally next to the value they are annotating.

Row Ordering

Is your CSV sorted by default? Sorting may be less relevant if a CSV is being automatically processed, and not worrying about order might reduce the overhead of generating a data dump.

But if the CSV is going to be inspected or manipulated in a spreadsheet, then defining a default order can help a reader make sense of it.

If the CSV isn’t ordered, then document this somewhere.
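
If you do want a default order, applying it just before the file is written is usually enough. A tiny sketch, with hypothetical columns:

```python
# A minimal sketch of applying a default row order before writing:
# sort by the descriptive dimension columns, left to right.
rows = [
    {"region": "wales", "customer": "Smith & Co", "total": 340},
    {"region": "south-west", "customer": "ACME Ltd", "total": 1200},
]
rows.sort(key=lambda r: (r["region"], r["customer"]))
```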

Table Layout

How is the data in your table organised?

The Tidy Data guidance recommends having variables in columns, observations in rows, and only a single type of measure/value per table.

In addition to this, I’d also recommend that where there are qualifiers for reported values (as there often are for statistical data), these are always provided in a separate column, rather than within the main value column. This has the advantage of letting your value column be purely numeric, rather than a mix of numbers and symbols or other codes. Missing and suppressed values can then simply be omitted and accompanied by an explanation in an adjacent column, as in the sketch below.
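
For example, something like this (entirely hypothetical data) keeps the value column numeric:

```python
# A minimal sketch of keeping values numeric and putting qualifiers in a
# separate column, rather than mixing symbols or codes into the values.
import csv
import sys

rows = [
    {"area": "area-1", "year": "2016", "value": "1523", "qualifier": ""},
    {"area": "area-2", "year": "2016", "value": "", "qualifier": "suppressed"},
    {"area": "area-3", "year": "2016", "value": "", "qualifier": "not collected"},
]

writer = csv.DictWriter(sys.stdout, fieldnames=["area", "year", "value", "qualifier"])
writer.writeheader()
writer.writerows(rows)
```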

Another pattern I’ve seen in table layouts is to include an element of redundancy, providing both identifiers and labels for something referenced in a row. Going back to the sales report example, we might structure this as follows:

region_id, region_name, customer_id, customer_name, product, total

This allows an identifier (which might be a URI) to be provided alongside a human-readable name. This makes the data more readable, at the cost of increasing file size. But it does avoid the need to publish a separate lookup table of identifiers.

You might also sometimes find a need for repeated values. This is sometimes handled by adding additional redundant columns, e.g. “SICCode1”…“SICCode4” as used in the Companies House data. This works reasonably well and should be handled by most tools, at the potential cost of having lots of extra columns and a sparsely populated table. The alternative is to use a delimiter to put all of the values into a single column. Again, CSV on the Web defines ways to process this; both options are sketched below.
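
A rough sketch of the two layouts (the company and codes are made-up examples):

```python
# A minimal sketch of the two options for repeated values.

# Option 1: one column per value, sparsely populated, as in the
# Companies House "SICCode1"..."SICCode4" pattern.
wide_row = {"company": "12345678", "sic_code_1": "62012",
            "sic_code_2": "62020", "sic_code_3": "", "sic_code_4": ""}

# Option 2: a single column holding a delimited list. A CSV on the Web
# metadata file can declare the delimiter via the column's "separator"
# property, so that parsers know how to split it.
narrow_row = {"company": "12345678", "sic_codes": "62012;62020"}

print(wide_row)
print(narrow_row)
```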

Data Formats

And finally we have to decide how to include values in the individual cells. In its section on parsing, CSV on the Web recommends XML Schema datatypes and date formats as a default, but also allows formats to be defined in an accompanying metadata file.

Other things to think about are more application specific issues, such as how to specify co-ordinates, e.g. lat/lng or lng/lat?

Again, you should think about likely uses of the data and how, for example, data and date formats might be interpreted by spreadsheet applications, as well as other internationalisation issues.

This is just an initial list of thoughts. CSV on the Web clearly provides a lot of useful guidance that we can now build on, but there are still reasonable questions and trade-offs to be made. I think I’d also now recommend always producing a CSV on the Web metadata file along with any CSV file to help document its structure and any of the decisions made around its design. It would be nice to see the Tabular Data Package specification begin to align itself with that standard.
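
To illustrate that last recommendation, here’s a minimal sketch of generating a metadata file for the earlier sales example. The file name and column details are hypothetical; see the W3C specification for the full vocabulary.

```python
# A minimal sketch of writing a CSV on the Web metadata file alongside a CSV.
import json

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "sales.csv",
    "tableSchema": {
        "columns": [
            {"name": "region", "titles": "Region", "datatype": "string"},
            {"name": "customer", "titles": "Customer", "datatype": "string"},
            {"name": "product", "titles": "Product", "datatype": "string"},
            {"name": "total", "titles": "Total", "datatype": "decimal"},
        ]
    },
}

# The "-metadata.json" suffix follows the standard's convention for
# locating metadata alongside a CSV file.
with open("sales.csv-metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```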

I suspect there are a number of useful tips and guidance which could be added to what I’ve drawn up here. If you have any comments or thoughts then let me know.

Open Data Camp Pitch: Mapping data ecosystems

I’m going to Open Data Camp #4 this weekend. I’m really looking forward to catching up with people and seeing what sessions will be running. I’ve been toying with a few session proposals of my own and thought I’d share an outline for this one to gauge interest and get some feedback.

I’m calling the session: “Mapping open data ecosystems”.

Problem statement

I’m very interested in understanding how people and organisations create and share value through open data. One of the key questions that the community wrestles with is demonstrating that value, and we often turn to case studies to attempt to describe it. We also develop arguments to use to convince both publishers and consumers of data that “open” is a positive.

But, as I’ve written about before, the open data ecosystem consists of more than just publishers and consumers. There are a number of different roles. Value is created and shared between those roles. This creates a value network including both tangible (e.g. data, applications) and intangible (knowledge, insight, experience) value.

I think if we map these networks we can get more insight into what roles people play, what makes a stable ecosystem, and better understand the needs of different types of user. For example we can compare open data ecosystems with more closed marketplaces.

The goal

Get together a group of people to:

  • map some ecosystems using a suggested set of roles, e.g. those we are individually involved with
  • discuss whether the suggested roles need to be refined
  • share the maps with each other, to look for overlaps, draw out insights, validate the approach, etc

Format

I know Open Data Camp sessions are self-organising, but I was going to propose a structure to give everyone a chance to contribute, whilst also generating some output. Assuming an hour session, we could organise it as follows:

  • 5 mins review of the background, the roles and approach
  • 20 mins group activity to do a mapping exercise
  • 20 mins discussion to share maps, thoughts, etc
  • 15 mins discussion on whether the approach is useful, refine the roles, etc

The intention here being to try to generate some outputs that we can take away. Most of the session will be group activity and discussion.

Obviously I’m open to other approaches.

And if no-one is interested in the session then that’s fine. I might just wander round with bits of paper and ask people to draw their own networks over the weekend.

Let me know if you’re interested!


Mega-City One: Smart City

“A smart city is an urban development vision to integrate multiple information and communication technology (ICT) and Internet of Things (IoT) solutions in a secure fashion to manage a city’s assets – the city’s assets include, but are not limited to, local departments’ information systems, schools, libraries, transportation systems, hospitals, power plants, water supply networks, waste management, law enforcement, and other community services…ICT allows city officials to interact directly with the community and the city infrastructure and to monitor what is happening in the city, how the city is evolving, and how to enable a better quality of life. Through the use of sensors integrated with real-time monitoring systems, data are collected from citizens and devices – then processed and analyzed. The information and knowledge gathered are keys to tackling inefficiency.” – Smart City, Wikipedia

We’d like to thank the Fforde Foundation for grant funding this project. We’re also grateful to the Fictional Cities Catapult for ongoing advice and support.

In this post we share some insights from early work by our lead researcher Thursday Next. Thursday has recently been leading a team carrying out an assessment of Mega-City One against our smart city maturity model.

Housing

Homelessness is rare among the official citizenry of Mega-City One. Considerable investment has been made in building homes for its rapidly growing population. Self-contained city blocks encourage close-knit communities who identify very strongly with their individual blocks.

Citizens enjoy the ability to live, shop and socialise together. Some even choose to spend their entire lives within the secure environment provided by their home block, each of which can house up to 50,000 people. Each block provides immediate access to hospitals, gyms, leisure activities, schools and shopping districts. Everything a citizen needs is available on their doorstep.

Transport

Mega-City One boasts a huge variety of transportation systems, covering every form of travel. Pedestrians are able to use Eeziglide and Pedway systems, whilst mass transit is provided by Sky-Rail and other public transit systems.

Roads are adequately sized and are home to a range of autonomous vehicles. Indeed these vehicles are so spacious and reliable that many citizens choose to live in them permanently.

Transport in Mega-City One is reliable, efficient and typically only faces issues during large-scale emergencies (e.g. the Apocalypse War, robot uprising and dark judge visitations).

Education and training

While education is freely available to all citizens, there is little need for many to follow a formal education pathway. Ready access to robot butlers and high levels of automation mean that citizens rarely need to work. Many citizens choose to embrace hobbies and follow vocational training, e.g. in human taxidermy or training as professional gluttons.

But, for those citizens that display a strong aptitude, there are always opportunities in the Justice Department. A rigorous programme of physical and educational training is available. Individualised learning pathways mean that citizens can find employment in a variety of public sector roles.

Leisure

Leisure is the primary pursuit of many citizens and there are many opportunities and means of participating. A culture of innovation surrounds the leisure sector which includes a range of new sports including Sky surfing, Batgliding and PowerBoarding.

Citizens are able to quickly learn of new opportunities meaning that crazes often sweep the city (see, for example, Boinging).

Health

Mega-City One is almost completely self-sufficient. Food is primarily created from artificial or synthetic sources. Popular brands like Grot Pot provide a low-cost, balanced diet. These are supplemented with imported produce such as Munce, which is sourced from artisan-led Cursed Earth communities.

Environmental Services

Weather data and control infrastructure in Mega-City One are highly developed. The Justice Department has long had control over local weather and climate conditions, allowing it to provide optimum conditions for citizens. Weather has also factored into policing, e.g. during large-scale rioting and other disasters.

There is a strong culture of recycling in Mega-City One and there have been citizen-led movements encouraging greater environmental awareness. The city’s Resyk centres ensure that nothing (and nobody) goes to waste.

Policing and Emergencies

Little needs to be said about Mega-City One’s crime and justice department. It is an exemplar of integrated and optimised policing solutions. The Justice Department are able to react rapidly to issues and are glad to offer a personalised service for citizens.

While data from homes, public areas and “eye in the sky” cameras are fed into central systems, the actual delivery of justice is federated. Sector Houses provide local justice services across the city. This is supplemented by Citi-Def forces that handle community policing and enforcement activities in individual city blocks. Mega-City One has also embraced predictive policing through its small but effective Psi Division.

We hope this post has helped to highlight a number of important smart city innovations. Exploring how these have been operationalised and optimised to deliver services to citizens will be covered in future research. Please get in touch if you’d like us to undertake a maturity assessment of your fictional city!


A river of research, not news

I already hate the phrase “fake news”. We have better words to describe lies, disinformation, propaganda and slander, so let’s just use those.

While the phrase “fake news” might originally have been used to refer to hoaxes and disinformation, it’s rapidly becoming a meaningless term used to refer to anything you disagree with. Trump’s recent remarks are a case in point: unverified news is something very different.

Of course this is all on a sliding scale. Many news outlets breathlessly report on scientific research. This can make for fun, if eye-rolling, reading. Advances in AI and the discovery of alien mega-structures are two examples that spring to mind.

And then there’s the way in which statistics and research are given a spin by newspapers or politicians. This often glosses over key details in favour of getting across a political message or point scoring. Today I was getting cross about Theresa May’s blaming of GPs for the NHS crisis. Her remarks are based on a report recently published by the National Audit Office. I haven’t seen a single piece of coverage link to the NAO press release or the high-level summary (PDF), so you’ll either have to accept their remarks or search for it yourself.

Organisations like Full Fact do an excellent job of digging into these claims. They link the commentary to the underlying research or statistics alongside a clear explanation. In the same vein, NHS Choices’ Behind the Headlines fills a similar role, but focuses on the reporting of medical and health issues.

There’s also a lot of attention focused on helping to surface this type of fact checking and explanation via search results. Fact checking, properly digging into statistics and presenting them clearly, is, I suspect, a time-consuming exercise. Especially if you’re hoping to present a neutral point of view.

What I think I’d like though is a service that brings all those different services together. To literally give me the missing links between research, news and commentary.

But rather than aggregating news articles or fact-checking reports to give me a feed, or what we used to call a “river of news”, why not present a river of research instead? Let me see the statistics or reports that are being debated and then let me jump off to see the variety of commentary and fact checking associated with them.

That way I could choose to read the research or a summary of it, and then decide to look at the commentary. Or, more realistically, I could at least see the variety of ways in which a specific report is being presented, described and debated. That would be a useful perspective I think. It would shift the focus away from individual outlets and help us find alternative viewpoints.

I doubt that this would become anyone’s primary way to consume the news. But it could be interesting to those of us who like to dig behind the headlines. It would also be useful as a research tool in its own right. In the face of a consistent lack of interest from news outlets in linking to primary sources, this might be something that could be crowd-sourced.

Does this type of service already exist? I suspect there are similar efforts around academic research, but I don’t recall seeing anything that covers a wider set of outputs including national and government statistics.