Review WikiData aggregation to check the format count is accurate #13

Closed
anjackson opened this issue Sep 2, 2022 · 6 comments
@anjackson

The WikiData aggregation appears to generate a denormalised listing, i.e. if a given format has multiple values for something (extensions? signatures?) then there is a separate record for each value, all sharing the same ID. If you look at the query in question:

https://query.wikidata.org/#%23%20Return%20all%20file%20format%20records%20from%20Wikidata.%0A%23%0Aselect%20distinct%20%3Furi%20%3FuriLabel%20%3Fpuid%20%3Fextension%20%3Fmimetype%20%3FencodingLabel%20%3FreferenceLabel%20%3Fdate%20%3FrelativityLabel%20%3Foffset%20%3Fsig%0Awhere%0A%7B%0A%20%20%3Furi%20wdt%3AP31%2Fwdt%3AP279%2a%20wd%3AQ235557.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Return%20records%20of%20type%20File%20Format.%0A%20%20optional%20%7B%20%3Furi%20wdt%3AP2748%20%3Fpuid.%20%20%20%20%20%20%7D%20%20%20%20%20%20%20%20%20%20%23%20PUID%20is%20used%20to%20map%20to%20PRONOM%20signatures%20proper.%0A%20%20optional%20%7B%20%3Furi%20wdt%3AP1195%20%3Fextension.%20%7D%0A%20%20optional%20%7B%20%3Furi%20wdt%3AP1163%20%3Fmimetype.%20%20%7D%0A%20%20optional%20%7B%20%3Furi%20p%3AP4152%20%3Fobject%3B%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Format%20identification%20pattern%20statement.%0A%20%20%20%20optional%20%7B%20%3Fobject%20pq%3AP3294%20%3Fencoding.%20%20%20%7D%20%20%20%20%20%23%20We%20don%27t%20always%20have%20an%20encoding.%0A%20%20%20%20optional%20%7B%20%3Fobject%20ps%3AP4152%20%3Fsig.%20%20%20%20%20%20%20%20%7D%20%20%20%20%20%23%20We%20always%20have%20a%20signature.%0A%20%20%20%20optional%20%7B%20%3Fobject%20pq%3AP2210%20%3Frelativity.%20%7D%20%20%20%20%20%23%20Relativity%20to%20beginning%20or%20end%20of%20file.%0A%20%20%20%20optional%20%7B%20%3Fobject%20pq%3AP4153%20%3Foffset.%20%20%20%20%20%7D%20%20%20%20%20%23%20Offset%20relatve%20to%20the%20relativity.%0A%20%20%20%20optional%20%7B%20%3Fobject%20prov%3AwasDerivedFrom%20%3Fprovenance%3B%0A%20%20%20%20%20%20%20optional%20%7B%20%3Fprovenance%20pr%3AP248%20%3Freference%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20pr%3AP813%20%3Fdate.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%7D%0A%20%20service%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2C%20%3C%3Clang%3E%
3E%22.%20%7D%0A%7D%0Aorder%20by%20%3Furi

Then the same Q######## identifiers appear on multiple lines. The current importer may not be handling this correctly. It should gather records by ID and assemble a list of extensions/MIME types for each ID.
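A minimal sketch of that gathering step, assuming the rows arrive in the standard SPARQL JSON results shape (a list of bindings, each mapping a variable name to a `{"value": ...}` object). The function name `aggregate_by_id`, the output field names, and the sample rows are hypothetical and are not taken from the actual importer:

```python
# Query variables that can repeat for one format, mapped to the
# (hypothetical) list field they are collected into.
MULTI_VALUED = {
    "puid": "puids",
    "extension": "extensions",
    "mimetype": "mimetypes",
}


def aggregate_by_id(bindings):
    """Collapse denormalised rows into one record per Wikidata URI."""
    records = {}
    for row in bindings:
        uri = row["uri"]["value"]
        record = records.setdefault(
            uri,
            {"label": None, "puids": set(), "extensions": set(), "mimetypes": set()},
        )
        if "uriLabel" in row:
            record["label"] = row["uriLabel"]["value"]
        # Optional variables are simply absent from a row when unbound.
        for variable, field in MULTI_VALUED.items():
            if variable in row:
                record[field].add(row[variable]["value"])
    # Convert the sets to sorted lists so the output is stable.
    return {
        uri: {**record, **{f: sorted(record[f]) for f in MULTI_VALUED.values()}}
        for uri, record in records.items()
    }


# Illustrative rows only: two lines for the same (made-up) identifier.
example_uri = "http://www.wikidata.org/entity/Q_EXAMPLE"
rows = [
    {
        "uri": {"value": example_uri},
        "uriLabel": {"value": "Example format"},
        "extension": {"value": "ex1"},
    },
    {
        "uri": {"value": example_uri},
        "uriLabel": {"value": "Example format"},
        "extension": {"value": "ex2"},
        "mimetype": {"value": "application/x-example"},
    },
]
formats = aggregate_by_id(rows)
```

Using sets while accumulating means repeated values (which the denormalised output produces whenever another column varies) are counted once per format.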

@anjackson anjackson self-assigned this Sep 2, 2022
@anjackson anjackson added the bug label Sep 2, 2022
@anjackson

I think this is actually manifesting as file extensions (and possibly MIME types) getting dropped, because we end up with one record per ID.

anjackson added a commit to digipres/digipres.github.io that referenced this issue Sep 2, 2022
@anjackson

Seems to be more accurate now, with minor discrepancies where file extensions are malformed.

@ross-spencer

I was worried this was the SPARQL query for a moment! 😉

Testing with Q100243790 and Q1023647 in Siegfried, extensions and MIME types look okay.

@anjackson

I stole the query from Siegfried so it should work!

The issue is in my post-processing. I should perhaps use roy directly instead of having my own fetcher/normaliser, but it's not trivial to switch (see #15).

@ross-spencer

I know, I wrote the query (hence the concern!).

The issue certainly looks complicated. There are a handful of reasons I don't think you're going in the wrong direction working with the WDQS output in Python, but perhaps this would be a useful feature in Siegfried. Keep an eye out for linting getting in the way of stats: https://github.com/richardlehane/siegfried/wiki/Wikidata-identifier#linting and perhaps inspect the Wikidata module in more detail: https://pkg.go.dev/github.com/richardlehane/[email protected]/pkg/wikidata (it can theoretically be used independently, or more functions/structures can be exposed to any potential callers, since it has done most of the work). Also, Fido have it on their roadmap, so something in Python is going to appear at some point.

Happy to talk more next week if you're interested.

One concern I have with your issue 15 is the item "modify the wikidata.sig build so the Archivematica extensions can be omitted (like -pronom)": those extensions should be omitted anyway, so is that a bug in Siegfried that we need to correct?

@anjackson anjackson mentioned this issue Sep 2, 2022
3 tasks
@anjackson

anjackson commented Sep 2, 2022

Ah right, thanks for that. I have just been filtering out results that don't have at least a file extension or MIME type, rather than doing proper linting of records. I'd appreciate talking over some of this with you next week if we get the chance!
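As a rough illustration of that filter (not the actual importer code), assuming each record is a dict carrying `extensions` and `mimetypes` lists; the function names, field names, and sample data here are hypothetical:

```python
def has_extension_or_mimetype(record):
    """True if the record has at least one extension or MIME type."""
    return bool(record.get("extensions") or record.get("mimetypes"))


def filter_records(records):
    """Keep only records that pass the extension/MIME type check."""
    return {
        key: record
        for key, record in records.items()
        if has_extension_or_mimetype(record)
    }


# Illustrative data only; the keys are made-up identifiers.
sample = {
    "Q_A": {"extensions": ["ex1"], "mimetypes": []},
    "Q_B": {"extensions": [], "mimetypes": ["application/x-example"]},
    "Q_C": {"extensions": [], "mimetypes": []},
}
kept = filter_records(sample)
```

This only checks for presence; it does not validate the values, so a malformed extension would still pass, which is where proper linting would differ.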

I don't know if that's a Siegfried bug, but it seems weird so I guess I'll raise it.

EDIT: Also, I added some notes on possible importer improvements in #16.
