I started looking into the realities of uploading SEPA's waste (and related) data into a triplestore/semantic graph database, with a view to turning it into linked data. My thinking was: tackling the challenges of semantically linking such bulk data will be a useful learning exercise and will spark interesting ideas…
-
Fetch SEPA’s data file about waste tonnes, parse it, put it into a simple data structure, and write it to the sepa-waste-data CSV file. (The details of this step are in the parse-sepa-waste-data executable notebook.)
-
Fetch the National Records of Scotland’s data file about populations, parse it, put it into a simple data structure, and write it to the nrs-population-data CSV file. (The details of this step are in the parse-nrs-population-data notebook.)
-
Take the datasets output from steps 1 and 2, and add these into a triplestore to make them available as linked data. (The details of this step are in the populate-triplestore notebook.)
-
Query the linked data in our triplestore to plot some infographics. Here is what a query over the triplestore’s facts – for the waste tonnage generated per council citizen per year – looks like:
(The details of this step are in the plot-info-graphics notebook.)
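As a rough illustration of what such a query has to do, here is a minimal sketch in plain Python rather than Datomic's own query language. It joins waste facts to population facts on council and year, then divides tonnes by population. The attribute names (`waste/tonnes`, `population/count` and so on) and the sample values are assumptions made up for this example; they are not taken from the notebooks.

```python
# Hypothetical subject-predicate-object facts; the values are invented.
facts = [
    ("w1", "waste/council", "Stirling"),
    ("w1", "waste/year", 2014),
    ("w1", "waste/tonnes", 45000.0),
    ("p1", "population/council", "Stirling"),
    ("p1", "population/year", 2014),
    ("p1", "population/count", 90000),
]


def entities(facts, namespace):
    """Group facts into {subject: {attribute: value}} for one namespace."""
    grouped = {}
    for subject, predicate, obj in facts:
        ns, _, attribute = predicate.partition("/")
        if ns == namespace:
            grouped.setdefault(subject, {})[attribute] = obj
    return grouped


def tonnes_per_citizen(facts):
    """Join waste and population entities on (council, year)."""
    waste = entities(facts, "waste")
    population = entities(facts, "population")
    population_by_key = {
        (p["council"], p["year"]): p["count"] for p in population.values()
    }
    result = {}
    for w in waste.values():
        key = (w["council"], w["year"])
        if key in population_by_key:
            # Sum in case a council has several waste records for one year.
            result[key] = result.get(key, 0.0) + w["tonnes"] / population_by_key[key]
    return result


per_citizen = tonnes_per_citizen(facts)
```

A real triplestore query would express the same join declaratively; the point here is only that both datasets need to share the council and year values for the join to work.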
-
The public triplestores Wikidata and DBpedia are impressive, but their mechanisms for ingesting the kind of bulk data that we are interested in are piecemeal, indirect and slow, and therefore unsuitable for our purpose of iterative experimentation with bulk data. Instead, I used the Datomic triplestore because it is simple and I have had some experience with it. (Last weekend, Ian and Bruce coordinated an exercise that involved Wikidata and bulk data; its result will be very relevant for this work.)
-
The data re-structuring in steps 1, 2 & 3 – to make its information representable as subject-predicate-object facts in a triplestore – was not trivial. Some aspects of it could be supported by automation but, I suspect, other aspects require creative input/design decisions.
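To give a feel for the mechanical part of that re-structuring, here is a minimal sketch that turns each row of a CSV into subject-predicate-object facts. The column names, the sample rows and the `rows_to_triples` helper are hypothetical, made up for illustration. The design decisions (what counts as an entity, which values should link to other entities) are exactly the part this sketch does not address.

```python
import csv
import io

# Hypothetical sample in the spirit of the sepa-waste-data CSV file;
# the column names and values are invented for illustration.
SAMPLE_CSV = """council,year,tonnes
Stirling,2014,1000
Fife,2014,2500
"""


def rows_to_triples(csv_text, namespace):
    """Turn each CSV row into (subject, predicate, object) facts."""
    triples = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        subject = f"{namespace}/{index}"  # one entity per row
        for column, value in row.items():
            triples.append((subject, f"{namespace}/{column}", value))
    return triples


triples = rows_to_triples(SAMPLE_CSV, "waste")
```

Note that every object is still a plain string here. Turning "Stirling" into a reference to a shared council entity, so that waste and population facts link up, is one of the steps that needs human judgement.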
Could we develop a platform to support “community curated linked open data about waste in Scotland”?
-
Store all data as linked data to increase its utility.
-
Support fast and easy handling of bulk data (i.e. datasets), unlike Wikidata.
-
Data re-structuring and linking might require creative input from humans but the platform could help steer and automate many of the steps.
-
-
Calculate trust/recommender metrics.
-
The idea is that such metrics can be used to help resolve conflicts and converge the data. For example: Sue uploads data that says that in 2014 23% of the waste generated in the Stirling area was recycled. Later, Bob uploads data that says that the percentage was 21%. Which data entry “wins”? We could configure the triplestore to present both data entries, but that will probably lead to data divergence and less utility. Instead, we could configure the triplestore to present the more trusted data entry… say, Sue’s if her data entry has a higher “commonshare” than Bob’s.
-
-
Use community-curated linked open data as the substrate on which to build a community-driven website focussed on Scottish waste.
I think that there is some novelty/a “USP” in that combination.
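Returning to the trust-metric idea above, the Sue-and-Bob conflict could be resolved along these lines. This is a minimal sketch under stated assumptions: the `Entry` class, the `resolve` function and the numeric "commonshare" scores are all hypothetical, and how such a score would actually be calculated is left open.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    contributor: str
    value: float  # e.g. percentage of waste recycled


def resolve(entries, commonshare):
    """Return the entry whose contributor has the highest commonshare.

    Contributors without a score are treated as having a score of 0.0.
    """
    return max(entries, key=lambda entry: commonshare.get(entry.contributor, 0.0))


# Conflicting entries for Stirling, 2014, as in the example above.
entries = [Entry("Sue", 23.0), Entry("Bob", 21.0)]
commonshare = {"Sue": 0.8, "Bob": 0.6}  # made-up scores

winner = resolve(entries, commonshare)
```

Both entries would still be stored; only the presentation picks the more trusted one, so the choice can change if the scores do.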