Review WikiData aggregation to check the format count is accurate #13

anjackson · 2022-09-02T08:56:54Z

The WikiData aggregation appears to generate a denormalised listing, i.e. if a given format has multiple values for some property (extensions? signatures?) then there is a separate record for each value, all sharing the same ID. If you look at the query in question:

Then the same Q######## identifiers appear on multiple lines. The current importer may not be handling this correctly. It should gather records by ID and assemble a list of extensions/MIME types for each ID.
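
A minimal sketch of that gathering step, for illustration only. It assumes each result row has already been flattened into a dict with the hypothetical keys `id`, `name`, `extension` and `mimetype`; the real WDQS output and the importer's field names may differ, and the IDs below are made-up placeholders rather than real Wikidata items.

```python
def gather_by_id(rows):
    """Collapse denormalised rows into one record per ID.

    The row keys ("id", "name", "extension", "mimetype") are assumptions
    made for this sketch, not the importer's actual field names.
    """
    formats = {}
    for row in rows:
        qid = row["id"]
        # Create the record the first time an ID is seen, reuse it afterwards.
        record = formats.setdefault(
            qid,
            {"id": qid, "name": row.get("name"), "extensions": [], "mimetypes": []},
        )
        # Append each non-empty value once, preserving first-seen order.
        for source_key, target_key in (("extension", "extensions"), ("mimetype", "mimetypes")):
            value = row.get(source_key)
            if value and value not in record[target_key]:
                record[target_key].append(value)
    return list(formats.values())


# Placeholder rows: two lines for the same ID, one line for another.
rows = [
    {"id": "Q-A", "name": "Format A", "extension": "abc", "mimetype": "application/x-abc"},
    {"id": "Q-A", "name": "Format A", "extension": "abx", "mimetype": "application/x-abc"},
    {"id": "Q-B", "name": "Format B", "extension": "xyz", "mimetype": None},
]

for record in gather_by_id(rows):
    print(record["id"], record["extensions"], record["mimetypes"])
```

Counting the gathered records rather than the raw rows is what would give the format count; keeping only the last row seen per ID is the failure mode where extensions get dropped.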


anjackson · 2022-09-02T09:01:59Z

I think this is actually manifesting as file extensions (and possibly MIME types) getting dropped, because we end up with one record per ID.

anjackson · 2022-09-02T10:29:54Z

Seems to be more accurate now, with only minor discrepancies where file extensions are malformed.

ross-spencer · 2022-09-02T10:41:58Z

I was worried this was the SPARQL query for a moment! 😉

Testing with Q100243790 Q1023647 in Siegfried, extensions and mimes look okay.

anjackson · 2022-09-02T11:48:47Z

I stole the query from Siegfried so it should work!

The issue is in my post-processing. I should perhaps use roy directly instead of having my own fetcher/normaliser, but it's not trivial to switch (see #15).

ross-spencer · 2022-09-02T12:37:04Z

I know, I wrote the query (hence the concern!).

The issue certainly looks complicated. There are a handful of reasons I don't think you're going in the wrong direction working with the WDQS output in Python, but perhaps it's a useful feature in Siegfried. Keep an eye out for linting getting in the way of stats: https://github.com/richardlehane/siegfried/wiki/Wikidata-identifier#linting and perhaps inspect the Wikidata module in more detail: https://pkg.go.dev/github.com/richardlehane/siegfried/pkg/wikidata (it can theoretically be used independently, or more functions/structures can be exposed to any potential callers since it has done most of the work). Also, Fido have it on their roadmap, so something in Python is going to appear at some point.

Happy to talk more next week if you're interested.

One concern I have in your issue 15 is: "modify the wikidata.sig build so the Archivematica extensions can be omitted (like -pronom)". Those extensions should be omitted anyway, so is that a bug with Siegfried we need to correct?

anjackson · 2022-09-02T16:10:37Z

Ah right, thanks for that. I have just been filtering out results that don't have at least a file extension or MIME type, rather than doing proper linting of records. I'd appreciate talking over some of this with you next week if we get chance!
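
Roughly, the filter amounts to the following sketch. It assumes records have already been gathered into dicts with hypothetical `extensions` and `mimetypes` list fields, and it is not a substitute for the linting Siegfried applies.

```python
def has_extension_or_mimetype(record):
    """True if the record carries at least one extension or MIME type.

    The "extensions" and "mimetypes" keys are assumptions for this sketch.
    """
    return bool(record.get("extensions")) or bool(record.get("mimetypes"))


def filter_records(records):
    """Keep only records that have some extension or MIME type."""
    return [record for record in records if has_extension_or_mimetype(record)]


# Placeholder records, not real Wikidata items.
records = [
    {"id": "Q-A", "extensions": ["abc"], "mimetypes": []},
    {"id": "Q-B", "extensions": [], "mimetypes": ["application/x-xyz"]},
    {"id": "Q-C", "extensions": [], "mimetypes": []},
]

kept = filter_records(records)
print([record["id"] for record in kept])
```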

I don't know if that's a Siegfried bug, but it seems weird so I guess I'll raise it.

EDIT: Also, I added some notes on possible importer improvements in #16.

anjackson self-assigned this Sep 2, 2022

anjackson added the bug label Sep 2, 2022

anjackson added the high priority label Sep 2, 2022

anjackson added a commit to digipres/digipres.github.io that referenced this issue Sep 2, 2022

Improved WikiData Aggregation results, from digipres/sentinel#13.

df3c32d

anjackson added a commit that referenced this issue Sep 2, 2022

Gather WikiData information by ID, for #13.

70584c1

anjackson closed this as completed Sep 2, 2022

anjackson mentioned this issue Sep 2, 2022

Improve WikiData import #16

Open

3 tasks
