This repository has been archived by the owner on May 8, 2024. It is now read-only.

Make metadata IDs persistant #269

Closed
MansMeg opened this issue Apr 8, 2023 · 37 comments

@MansMeg
Collaborator

MansMeg commented Apr 8, 2023

The Wikidata people need to refer to our corpus (from version 1.0) as a reference for their data. Hence we should make our IDs persistent.

  • This would include creating UUIDs for all CSV files.
  • Create a CSV file that maps Wikidata IDs to person IDs.
@ninpnin
Collaborator

ninpnin commented Apr 11, 2023

I suggest we use firstname_lastname_yyyymmdd (birthdate). It is static given that the primary name of the person and the birthdate don't change, and for the most part they shouldn't. I have also checked that there are no conflicts. On the other hand, only using birthyear leads to a handful of conflicting IDs.

If the birthday isn't available, we would use firstname_lastname_yyyymmXX or firstname_lastname_yyyyXXXX.
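
A minimal sketch of the naming scheme described above, for illustration only. The function name `readable_person_id` and the name normalisation (lowercasing, spaces to underscores) are assumptions and are not taken from the corpus code.

```python
from typing import Optional


def readable_person_id(first: str, last: str, year: int,
                       month: Optional[int] = None,
                       day: Optional[int] = None) -> str:
    """Build firstname_lastname_yyyymmdd, padding unknown date parts with X."""
    mm = f"{month:02d}" if month is not None else "XX"
    # A day is only kept when the month is known as well.
    dd = f"{day:02d}" if (month is not None and day is not None) else "XX"
    # Assumption: names are lowercased and spaces become underscores.
    name = f"{first}_{last}".lower().replace(" ", "_")
    return f"{name}_{year:04d}{mm}{dd}"
```

For a made-up person, `readable_person_id("Anna", "Svensson", 1901)` gives an ID ending in `1901XXXX`, matching the fallback format proposed above.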

@MansMeg
Collaborator Author

MansMeg commented Apr 11, 2023

People change names, so this might be confusing in the long term. Maybe just use a UUID? That we know will be persistent.

@salgo60
Contributor

salgo60 commented Apr 11, 2023

I would say you should have IDs for everything (parties, members of parliament, departments, electoral districts, subjects, ...) and do as Wikidata does: use an ID with no meaning (the Q comes from the name of Denny's wife, Qamarniso, Q61768970)

redirect

Another lesson learned is to support redirects. When e.g. Riksdagen makes a mistake (#88) and adds two IDs for the same person (and never fixes it 😢), it is easy to end up with "two people" on your side as well. They should be merged on your side, and if an end user still has the "old ID", they should find the merged target --> owl:sameAs
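
The redirect idea can be sketched as a small lookup, assuming merges are recorded as a mapping from the retired ID to the ID it was merged into. The function name `resolve` and the mapping layout are hypothetical.

```python
from typing import Dict


def resolve(pid: str, redirects: Dict[str, str]) -> str:
    """Follow merge redirects until an ID with no further redirect is reached."""
    seen = {pid}
    while pid in redirects:
        pid = redirects[pid]
        if pid in seen:
            # A loop in the redirect table is a data error, not a valid merge.
            raise ValueError(f"redirect cycle at {pid!r}")
        seen.add(pid)
    return pid
```

A user holding an old ID is then led to the merged target, while current IDs resolve to themselves.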

image

@MansMeg
Collaborator Author

MansMeg commented Apr 11, 2023

That sounds like a good idea. Best of both worlds. =)

@BobBorges
Collaborator

Why are the wiki_ids not persistent? It seems like the least expensive solution (for us, since we used the QIDs in protocol documents) would be to convince wikidata to make the QIDs persistent.

@MansMeg
Collaborator Author

MansMeg commented Apr 15, 2023

@salgo60 knows this better than I do, but I think the core problem is that anyone can create a new person (and hence a new ID), which can then be merged. So it is a "flaw" of the Wikidata structure.

In addition, Wikidata would like us to have persistent IDs that they can reference, i.e. our corpus will (after 1.0) be a reference for the quality control of Wikidata.

I hope this explains why.

@salgo60
Contributor

salgo60 commented Apr 24, 2023

@MansMeg @ninpnin maybe it's time to start the process of getting persistent, unique Welfare State Analytics IDs #269

See how Nobelprize.org redesigned its data with an API and then @miroli proposed a Wikidata id P8024 --> we can now access the WD object using the Nobelprize unique id...

@salgo60
Contributor

salgo60 commented Apr 24, 2023

@salgo60 knows this better than I do, but I think the core problem is that anyone can create a new person (and hence a new ID), which can then be merged. So it is a "flaw" of the Wikidata structure.

In addition, Wikidata would like us to have persistent IDs that they can reference, i.e. our corpus will (after 1.0) be a reference for the quality control of Wikidata.

I hope this explains why.

I would say that Wikidata is not designed to be the source. As I describe above, it is better that you have your own unique persistent ID: the update frequency in WD is crazy, and it is an open system with its strengths and weaknesses... Supporting > 200 languages also makes this equation nearly impossible, and we merge a lot - see the real-time stream

image

The design, as I understand it, is not about the truth but about what other sources claim --> Wikidata can also store contradicting facts...

image

  1. possibility to have more facts with contradicting values
  2. rank the preferred one
    1. see how we can track facts from Riksarkivet SBL #33 and how we also track the reason why we don't trust what Riksarkivet SBL presents, like "contemporary constraint issue Q74557669" / "not confirmed by birth records Q111149276"

@ninpnin changed the title from "Make our ids persistant" to "Make metadata IDs persistant" on May 3, 2023
@BobBorges
Collaborator

@MansMeg @ninpnin @fredrik1984 @liamtabib

We discussed persistent IDs this morning. There's already an open issue, so I didn't want to start a new one. Regardless of the format we use for the IDs, it seems like we need to obtain/create a property item on Wikidata, something like SWERIK_MP_ID. According to this, such a property needs to be proposed and discussed "for some time" before it can be approved. @salgo60, do we know whether it has already been proposed, and how long "some time" is? Maybe we should decide on the property name and propose it ASAP if it hasn't been done already.

There has been discussion about whether to use name/birth date or a UUID. I see the sense in using a UUID, but also sense in having a deterministic ID -- I suggest that we create a UUID deterministically, using the primary name/surname and birth date as a seed (we can use pyriksdagen.utils.get_formatted_uuid as a starting point) -- best of both worlds?
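
As a rough illustration of the deterministic-UUID idea, here is a sketch using only the standard library's `uuid.uuid5`. It is not `pyriksdagen.utils.get_formatted_uuid`, whose signature is not shown in this thread; the namespace URL, the seed layout, and the function name `person_uuid` are all assumptions.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
PERSON_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/person")


def person_uuid(first: str, last: str, birthdate: str) -> str:
    """Derive a name-based (version 5) UUID from name and birth date."""
    # Assumed seed layout; the separator keeps the fields unambiguous.
    seed = f"{first}|{last}|{birthdate}"
    return str(uuid.uuid5(PERSON_NAMESPACE, seed))
```

The same seed always yields the same UUID. Once an ID has been assigned it would have to be stored rather than recomputed, since a later name correction would otherwise produce a different value.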

What do you all say?

@liamtabib
Contributor

Good idea!

@MansMeg
Collaborator Author

MansMeg commented Jun 16, 2023

That works for me. The only important thing is that the IDs are persistent, i.e. we need to commit to the IDs, and they must never change after they are assigned to an individual. How we create them is less important, as long as they are UUIDs.

I think the discussions on Wikidata will be less of a problem if we set up a persistent ID, since these IDs will probably be the only persistent IDs for MPs going far back in time.

@salgo60
Contributor

salgo60 commented Jun 16, 2023

WD needs a formatter string and some examples

See what a proposal looks like, e.g. this one that I created at 11:39, 21 September 2016

https://www.wikidata.org/wiki/Wikidata:Property_proposal/SBL

Anyone can create a proposal and everyone can comment and vote on it.... my experience is that it takes some weeks to get it approved...

I am out kayaking this week and can help you when I am back, but it is not rocket science, so give it a try...

One thought I had is whether we could use a LIBRIS URI or the one Riksdagen has, depending on where you will store your data

Landing pages

Would be nice if you had landing pages --> we could link you from Swedish Wikipedia

objects like

  • Swedish PM
  • parties
  • electoral districts
  • ...

It's easy to extract text and pictures from Swedish Wikipedia; see the examples I did for people building an app about Swedish cemeteries

OT there is a WD conference

It would be interesting if you shared your experience, as researchers, of working with Wikidata (see tweet): what is missing and what could be better...

UPDATE: Wikidata modelling days 2023: it looks like a researcher, Daniel Mietchen, is taking part; he is also involved in designing Scholia (see video)

image

@fredrik1984
Collaborator

#237

@BobBorges
Collaborator

I'll draft a text for the Motivation part of the wikidata proposal in the next couple of days and post it here for commentary before submitting it. I think there's one unsettled issue, though. There's some consensus on using a UUID solution, but do we want to add some kind of human readable segment so it's clear that these are our UUIDs? E.g.: "SWERIK-6a28a4b0-8f46-4134-a88e-2645b704c9fc" or similar? @salgo60 @ljo any thoughts or best-practices around this?

@salgo60
Contributor

salgo60 commented Nov 1, 2023

  1. Uniqueness is the key; a human-readable string may add value, or just complexity 😃

Extra bonuses that can be added once the property is approved:
a) a regular expression Property:P1793 --> we can easily catch wrong edits

  • SWERIK-6a28a4b0-8f46-4134-a88e-2645b704c9fc --> if we ask chatGPT

^SWERIK-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$

b) URL match pattern Property:P8966 we have tools using the URL to understand what Wikidata property it relates to eg. ^http?://(?:www.)?fossilworks.org/cgi-bin/bridge.pl?a=taxonInfo&taxon_no=(([1-9]\d{0,5})) relates to Property:P842
c) stability of property value Property:P2668
d) formatter URI for RDF resource Property:P1921
e) Property constraints: Wikidata can enforce rules such as uniqueness (see Help:Property_constraints_portal)
f) will this PID also support lexemes? Wikidata has > 41000 Swedish lexemes, see example riksdagen
g) owned by Property:P127
h) issue tracker URL Property:P1401
i) user manual URL Property:P2078
j) it is always nice to understand how it's used, see used by Property:P1535. I hope these PIDs will be used by Riksarkivet, Riksarkivet SBL, RAÄ, LIBRIS, Europeana, Riksdagens open data.....
k) API endpoint URL Property:P6269
l) SPARQL endpoint Property:P5305
.....
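
The regular expression from item a) can be tried out locally before it is proposed as a format constraint. This sketch uses Python's `re` module; the helper name `is_valid_swerik_id` is made up for illustration, and the pattern is copied from the comment above.

```python
import re

# Pattern from item a): the "SWERIK-" prefix followed by a hyphenated UUID.
SWERIK_ID = re.compile(
    r"^SWERIK-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def is_valid_swerik_id(value: str) -> bool:
    """Return True only if the whole string matches the proposed format."""
    return SWERIK_ID.fullmatch(value) is not None
```

If the prefix is dropped in favour of a bare UUID, as discussed later in the thread, the `SWERIK-` part of the pattern would simply be removed.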

@salgo60
Contributor

salgo60 commented Nov 1, 2023

It would be cool if we could publish your Push release tests as linked data; we have the Software_quality_assurance property = Property:P2992

  • that could maybe be used for adding all the tests you do -->
    • we then create Q numbers for each test, e.g. check in Wikidata that Swedish MPs do not have position held "member of the First Chamber" and "member of the Second Chamber" at the same time
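
The example test mentioned above could look roughly like this. It is a sketch under assumptions: terms are given as `(chamber, start, end)` tuples with the end date treated as exclusive, and the function name `overlapping_chambers` is hypothetical rather than part of the project's test suite.

```python
from datetime import date
from typing import List, Tuple

# Assumed record layout: (chamber, start, end), with end treated as exclusive.
Term = Tuple[str, date, date]


def overlapping_chambers(terms: List[Term]) -> List[Tuple[Term, Term]]:
    """Return pairs of terms in different chambers that overlap in time."""
    clashes = []
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            different_chamber = a[0] != b[0]
            # Two half-open intervals overlap if each starts before the other ends.
            overlap = a[1] < b[2] and b[1] < a[2]
            if different_chamber and overlap:
                clashes.append((a, b))
    return clashes
```

An empty result for a person means the check passes. A person who leaves one chamber on the same day they enter the other is not flagged, because the end date is exclusive.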

    OT: Wikidata has started to release Wikifunctions (see video), and on 2023-10-25 it was released: Running on WebAssembly

@salgo60
Contributor

salgo60 commented Nov 2, 2023

Good document about persistent identifiers. See also my "The Magnus list", created in 2021: "One way to design a system to be a good external identifier in Wikidata". This list was mentioned by David Shorthouse at 27:50 in the Stanford video - slides "Keepin 'N Sync... with wikidata ... and ORCID...and GBIF"

image image

A Persistent Identifier (PID) policy for the European Open Science Cloud (EOSC)

image

Good design pattern: use tombstone pages

image

How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data

image

@salgo60
Contributor

salgo60 commented Nov 2, 2023

I have also tried to get Riksarkivet to support archived documents and PIDs --> status: work in progress 😭. Maybe your project can explain that PID support in archives is very important for researchers

Today, I perceive that there is no one else on the line when it comes to discussing persistent identifiers and how they should be supported in archives. DIGG's project does not seem to firmly decide that the National Archives and the Royal Library (KB) should handle this.

@salgo60
Contributor

salgo60 commented Nov 2, 2023

@salgo60
Contributor

salgo60 commented Nov 3, 2023

but do we want to add some kind of human readable segment so it's clear that these are our UUIDs

@BobBorges doi.org/10.1101/117812 states in Lesson 3. Opt for simple, durable web resolution

Trailing characters after the local ID
are discouraged as they unnecessarily increase the variability with which the identifier is represented
and also complicate straightforward appending of the local ID 

@MansMeg
Copy link
Collaborator Author

MansMeg commented Nov 4, 2023

I think going with a pure UUID is probably the simplest. I don't see the value of adding SWERIK as a slug. Ideally the PID will live on with the corpus, longer than the SWERIK project name.

@salgo60
Contributor

salgo60 commented Nov 4, 2023

@MansMeg
Isn't SWERIK used for every PID? I feel that is not a problem; it may make it easier to understand the context of the PID ... The problem I see is when you do as Riksdagen does: then you get problems not knowing whether you have found the same PID...

I hope we in Sweden will move in the direction of creating our own resolving service, something like a Swedish DOI, maybe SWEDOI


Maybe related I read this paper Introducing Innovative Indicators to Track Sweden's Open Research Data Objective: How to Measure Progress? Defining Indicators to Track Open Research Data Across Swedish Universities

image

Observer pattern

I think loosely coupled systems should implement the observer pattern, so that you can maybe more easily show citation graphs - see my suggestion to the DIGG people, "Best practice needed for understanding who is referencing my PID", and "#17 Vem använder en identifierare" ("Who is using an identifier")

image

image

@MansMeg
Collaborator Author

MansMeg commented Nov 4, 2023

I see that point. But I doubt the SWERIK name will live long enough. Whatever slug we use, we will have this or similar problems. Just going with a UUID is probably the easiest minimal viable ID and would carry the least long-term risk, I think.

@BobBorges
Collaborator

There's some motivation for a persistent SWERIK person ID here: https://docs.google.com/document/d/10_SEVNI7dF46hhnucTps242ntSr1nm_R3EHC7_9Mkjk/edit?usp=sharing

Modeled on @salgo60's example in scope/length/level of detail. Feel free to add any commentary directly to that google document.

@MansMeg
Collaborator Author

MansMeg commented Nov 9, 2023

This is excellent @BobBorges !

I will read and comment. I think this is an issue we can discuss now, and then discuss with the TAB next Friday as a last pair of eyes before we go forward and implement.

@salgo60
Contributor

salgo60 commented Nov 9, 2023

I think one good motivation is that with your own persistent identifier you can VERY easily start using SKOS and explain how you differ from Wikidata, Riksdagens Oppna data, Riksarkivet SBL, the book "Tvåkammar Riksdagen".....

  • the party we call xxx is a broader term than WD yyy - skos:broader
image

Wikidata merges a lot - maybe too much....

@salgo60
Contributor

salgo60 commented Nov 10, 2023

There's some motivation for a persistent SWERIK person ID here:

@BobBorges The best motivation, I feel, is FAIRDATA F1: as you produce research data, it should be FAIRDATA.

Principle F1 is arguably the most important because it will be hard to achieve other aspects of FAIR without globally unique and persistent identifiers

see also DOI 10.1101/117812

image

Other good resources

image image image

@BobBorges
Collaborator

Thanks @salgo60! FAIR is a good thing to mention in the motivation. As someone with a research background, the R in FAIR seems the most problematic in our case now without persistent IDs -- How can we reuse and verify research findings when the primary keys of our database change regularly?

@salgo60
Contributor

salgo60 commented Nov 10, 2023

@BobBorges as a Wikidata addict, I would also like to see the provenance (PROV) of every single data point, i.e. something like a more advanced version history combined with the role of whoever made the change.... I.e. what trust does the agent have, and what data is the change based on... I feel we see that problem with "party" vilde #139 and chatGPT using PROV

image

image

one Wikidata anti-pattern

One anti-pattern I see in Wikidata is that "every" source is expected to confirm the birth of Selma Lagerlöf, Q44519#P569: right now there are 23 references

image

The Wikidata model lacks a trust dimension. I asked Denny, the WD designer, for his point of view and wrote a blog post about it: WikidataCon 2019: We need a better model communicating quality/relevance of sources in Wikidata / Provenance

@salgo60
Contributor

salgo60 commented Nov 11, 2023

I did a small test using PROV with chatGPT, and I also show how good the change tracking in SPA (Svensk Porträttarkiv) is when you use the API: link 139#issuecomment-1806804671

@BobBorges
Collaborator

https://www.wikidata.org/wiki/Wikidata:Property_proposal/Person#SWERIK_Person_ID

@salgo60
Contributor

salgo60 commented Nov 17, 2023

If you have a wiki account, don't hesitate to support it (see syntax)

image

https://www.wikidata.org/wiki/Wikidata:Property_proposal/SWERIK_Person_ID

@salgo60
Contributor

salgo60 commented Nov 18, 2023

@BobBorges I heard comments on your statement:

Wikidata IDs, however, are dynamic, and with each update, a handful of errors occur due to mismatched IDs in the dynamic database and static quality control files

As I have said several times before: should I show you WD? What can happen is that two IDs are merged…

A merge leaves a redirect from the old ID to the new one… and, semantically speaking, that is a SKOS exactMatch

The problem with Wikidata is that most people are not domain experts, and as it's an open system we also get anonymous edits and vandalism…

@BobBorges
Collaborator

I understand the reason for changes -- our issue is that part of our work involves static files, e.g. manually curated, theoretically correct data with sources, that we want to check against info extracted with new queries to wikidata.

image
Do I need to do something more with this, or is your edit enough?

@salgo60
Contributor

salgo60 commented Nov 18, 2023

@BobBorges wait and see. I guess we now have enough people to get this approved… The next step is to get the attention of a wiki admin, which could take one minute or several weeks

@salgo60
Contributor

salgo60 commented Dec 4, 2023

FYI: I added P12192 to Template:Sweden_properties / diff and Template:Politician_properties / diff

image image

It feels like it's set up wrong: I guess you will have persistent identifiers for everything, not just people as P31 indicates

image

@ninpnin
Collaborator

ninpnin commented Jan 15, 2024

@BobBorges can we close this?
