In “What Does Your Dataset Contain?” I outlined a conceptual framework for thinking about how we might want to describe datasets, e.g. how they’re produced, what they contain, etc. I’ve been reading with interest the series on dataset summaries in Scraperwiki which is exploring similar ideas.
I finally found the time to do some quick practical exploration of my own. One area that interests me is understanding the geographic coverage of a dataset. There are lots of ways to approach that, mainly because datasets can vary widely in how they include geographical data. Some might include direct references to regions, whilst others might have more fine-grained latitude/longitude data.
I recently discovered local-geocoder, which allows bulk reverse geocoding of lat/lng data to country names. I decided to apply it to data from DBpedia to see if I could get a sense of its overall coverage.
The result is a simple shell script that:
- Downloads the geographic data from the English version of Dbpedia 3.8
- Extracts the georss:point predicates and runs them through the local_geocode command-line tool
- Runs the results through some command-line tools to sort and summarise the data to create a simple CSV file
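The extraction and summary steps can be sketched in a few lines of Python. This is not the script from the gist, just a minimal illustration. The function names `extract_points` and `summarise` are hypothetical, and the triple layout in the regular expression (a `"lat lng"` literal on the `http://www.georss.org/georss/point` predicate) is my assumption about how the dump is serialised. The geocoding step itself is left out, since I don't want to guess at the tool's interface here; `summarise` simply takes whatever country names come back.

```python
import csv
import io
import re
from collections import Counter

# Assumed shape of a georss:point triple in an N-Triples dump:
#   <subject> <http://www.georss.org/georss/point> "lat lng"@en .
POINT_RE = re.compile(
    r'<http://www\.georss\.org/georss/point>\s+"([^"\s]+)\s+([^"\s]+)"'
)


def extract_points(lines):
    """Yield (lat, lng) floats from lines that carry a georss:point literal."""
    for line in lines:
        match = POINT_RE.search(line)
        if match:
            yield float(match.group(1)), float(match.group(2))


def summarise(countries):
    """Return CSV text with one country,count row per name, largest first."""
    counts = Counter(countries)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["country", "count"])
    # Sort by descending count, then by name so ties come out in a fixed order.
    for name, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([name, n])
    return out.getvalue()


# Illustrative input only: one point triple and one unrelated triple.
sample = [
    '<http://dbpedia.org/resource/Example> '
    '<http://www.georss.org/georss/point> "51.5 -0.12"@en .',
    '<http://dbpedia.org/resource/Example> '
    '<http://www.w3.org/2003/01/geo/wgs84_pos#lat> '
    '"51.5"^^<http://www.w3.org/2001/XMLSchema#float> .',
]

if __name__ == "__main__":
    print(list(extract_points(sample)))
    # Country names here stand in for the geocoder's output.
    print(summarise(["US", "nil", "US", "Poland"]))
```

Matching on the predicate with a regular expression keeps the sketch close to the grep-and-sort spirit of a shell pipeline; a proper RDF parser would be the safer choice for anything beyond a quick look.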
I created a gist that contains the script and the output as formatted text and CSV.
Quick summary of the results:
- 475,001 geographic points in DBpedia 3.8
- 26,763 (recorded as “nil” in the results) were unmatched, giving 448,238 points that can be geocoded to a country
- 122,230 points were in the US (25.7% of the full set)
- US, Poland (46,316; 9.75%), and United Kingdom (45,917; 9.67%) are the three most represented countries
- 178 countries referenced in total
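The percentages above are shares of the full set of points, not of the matched subset. As a quick sanity check, the arithmetic can be reproduced like this (the variable and function names are just for illustration; the counts are the ones quoted above):

```python
# Counts taken from the summary above.
total = 475_001
unmatched = 26_763
by_country = {"US": 122_230, "Poland": 46_316, "United Kingdom": 45_917}

# Points that could be assigned to a country.
matched = total - unmatched


def share(count, whole):
    """Percentage of `whole` represented by `count`, to two decimal places."""
    return round(100 * count / whole, 2)


if __name__ == "__main__":
    print("matched:", matched)
    for name, count in by_country.items():
        print(name, share(count, total))
```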
From a quick inspection, I think the results that can’t be geocoded are simply those that are outside country boundaries (e.g. the location for Apollo 8 is in the middle of the Pacific).
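To see why a point in open ocean comes back unmatched, it helps to think of reverse geocoding as a point-in-polygon lookup against boundary shapes. I haven't checked how local-geocoder implements this internally, so the following is only a sketch of the general technique (ray casting), with hypothetical function names and two made-up square "countries" standing in for real boundary data:

```python
def point_in_ring(lng, lat, ring):
    """Ray-casting test: is (lng, lat) inside the polygon given by `ring`?

    `ring` is a list of (lng, lat) vertices; the last vertex is implicitly
    joined back to the first.
    """
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Only edges that straddle the point's latitude can be crossed.
        if (yi > lat) != (yj > lat):
            # Longitude at which the edge crosses that latitude.
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lng < x_cross:
                inside = not inside
        j = i
    return inside


def reverse_geocode(lat, lng, boundaries):
    """Return the first boundary name containing the point, or None."""
    for name, ring in boundaries.items():
        if point_in_ring(lng, lat, ring):
            return name
    return None


# Toy boundaries, NOT real country shapes: two squares in (lng, lat) space.
toy_boundaries = {
    "Squareland": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Boxia": [(20, 0), (30, 0), (30, 10), (20, 10)],
}

if __name__ == "__main__":
    print(reverse_geocode(5, 5, toy_boundaries))
    print(reverse_geocode(50, 50, toy_boundaries))
```

A point that falls inside none of the shapes gets no country, which is the situation the "nil" rows in the results describe.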
The main caveat with the results (other than potential bugs) is that the boundary data used in local-geocoder is of unclear provenance. It’s intended for quick prototyping only. However, I’ve had a pull request accepted to local-geocoder that makes it easier to swap in alternative boundary data, so other sources can now be used.
Most online geocoders are rate-limited or have specific terms and conditions that limit re-use of the resulting data. It would be interesting to create a good reference set of open boundary data for countries and administrative regions for use in open source geocoding tools.
I’ve been exploring how the Ordnance Survey data could be converted to GeoJSON for use with the tool. This would give more fine-grained data for England, Scotland and Wales.