Review WikiData aggregation to check the format count is accurate #13
Comments
I think this is actually manifesting as file extensions (and possibly MIME types) getting dropped, because we end up with one record per ID.
Seems to be more accurate now, with minor discrepancies where file extensions are malformed.
I was worried this was the SPARQL query for a moment! 😉 Testing with
I stole the query from Siegfried so it should work! The issue is in my post-processing. I should perhaps use
I know, I wrote the query (hence the concern!). The issue certainly looks complicated. There are a handful of reasons I don't think you're going in the wrong direction working with the WDQS output in Python, but perhaps it's a useful feature in Siegfried. Keep an eye out for linting getting in the way of stats: https://github.com/richardlehane/siegfried/wiki/Wikidata-identifier#linting and perhaps inspect the Wikidata module in more detail: https://pkg.go.dev/github.com/richardlehane/[email protected]/pkg/wikidata (it can theoretically be used independently, or more functions/structures can be exposed to any potential callers since it has done most of the work). Also, Fido have it on their roadmap, so, something Python is going to appear at some point. Happy to talk more next week if you're interested. One concern I have in your issue 15 is:
Ah right, thanks for that. I have just been filtering out results that don't have at least a file extension or MIME type, rather than doing proper linting of records. I'd appreciate talking over some of this with you next week if we get the chance! I don't know if that's a Siegfried bug, but it seems weird so I guess I'll raise it. EDIT: Also, I added some notes on possible importer improvements in #16
The WikiData aggregation appears to generate a denormalised listing, i.e. if a given format has multiple values for something (extensions? signatures?) then there are multiple records for the same ID. For example, if you look at the query in question:
https://query.wikidata.org/#%23%20Return%20all%20file%20format%20records%20from%20Wikidata.%0A%23%0Aselect%20distinct%20%3Furi%20%3FuriLabel%20%3Fpuid%20%3Fextension%20%3Fmimetype%20%3FencodingLabel%20%3FreferenceLabel%20%3Fdate%20%3FrelativityLabel%20%3Foffset%20%3Fsig%0Awhere%0A%7B%0A%20%20%3Furi%20wdt%3AP31%2Fwdt%3AP279%2a%20wd%3AQ235557.%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Return%20records%20of%20type%20File%20Format.%0A%20%20optional%20%7B%20%3Furi%20wdt%3AP2748%20%3Fpuid.%20%20%20%20%20%20%7D%20%20%20%20%20%20%20%20%20%20%23%20PUID%20is%20used%20to%20map%20to%20PRONOM%20signatures%20proper.%0A%20%20optional%20%7B%20%3Furi%20wdt%3AP1195%20%3Fextension.%20%7D%0A%20%20optional%20%7B%20%3Furi%20wdt%3AP1163%20%3Fmimetype.%20%20%7D%0A%20%20optional%20%7B%20%3Furi%20p%3AP4152%20%3Fobject%3B%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Format%20identification%20pattern%20statement.%0A%20%20%20%20optional%20%7B%20%3Fobject%20pq%3AP3294%20%3Fencoding.%20%20%20%7D%20%20%20%20%20%23%20We%20don%27t%20always%20have%20an%20encoding.%0A%20%20%20%20optional%20%7B%20%3Fobject%20ps%3AP4152%20%3Fsig.%20%20%20%20%20%20%20%20%7D%20%20%20%20%20%23%20We%20always%20have%20a%20signature.%0A%20%20%20%20optional%20%7B%20%3Fobject%20pq%3AP2210%20%3Frelativity.%20%7D%20%20%20%20%20%23%20Relativity%20to%20beginning%20or%20end%20of%20file.%0A%20%20%20%20optional%20%7B%20%3Fobject%20pq%3AP4153%20%3Foffset.%20%20%20%20%20%7D%20%20%20%20%20%23%20Offset%20relatve%20to%20the%20relativity.%0A%20%20%20%20optional%20%7B%20%3Fobject%20prov%3AwasDerivedFrom%20%3Fprovenance%3B%0A%20%20%20%20%20%20%20optional%20%7B%20%3Fprovenance%20pr%3AP248%20%3Freference%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20pr%3AP813%20%3Fdate.%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%7D%0A%20%20%7D%0A%20%20service%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2C%20%3C%3Clang%3E%3E%22.%20%7D%0A%7D%0Aorder%20by%20%3Furi
Then the same Q######## identifiers appear in multiple lines. The current importer may not be handling this correctly. It should gather records by ID and assemble a list of extensions/MIME types for each ID.
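For what it's worth, here is a rough sketch of the "gather records by ID" step in Python. It is not the project's importer, just an illustration. It assumes each result row has already been flattened into a plain dict keyed by the query's variable names (uri, uriLabel, extension, mimetype); the function name aggregate_by_id and the sample rows with their Q-numbers are made up for the example.

```python
def aggregate_by_id(rows):
    """Collapse denormalised query rows into one record per Wikidata URI.

    Assumes each row is a flat dict keyed by the SPARQL variable names;
    optional variables may be missing or empty.
    """
    formats = {}
    for row in rows:
        uri = row["uri"]
        # Create the record the first time this ID is seen, then reuse it.
        record = formats.setdefault(
            uri,
            {"label": row.get("uriLabel"), "extensions": set(), "mimetypes": set()},
        )
        # Sets deduplicate values repeated across rows for the same ID.
        for key, field in (("extension", "extensions"), ("mimetype", "mimetypes")):
            value = row.get(key)
            if value:
                record[field].add(value)
    return formats


# Made-up rows: the first ID appears on several lines, as in the query output.
rows = [
    {"uri": "http://www.wikidata.org/entity/Q0000001", "uriLabel": "Example A",
     "extension": "abc", "mimetype": "application/x-abc"},
    {"uri": "http://www.wikidata.org/entity/Q0000001", "uriLabel": "Example A",
     "extension": "abcd", "mimetype": "application/x-abc"},
    {"uri": "http://www.wikidata.org/entity/Q0000002", "uriLabel": "Example B"},
]

formats = aggregate_by_id(rows)
```

With something like this, the format count is just len(formats), and records with no extension or MIME type can still be filtered out afterwards by checking for empty sets.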