Extended notes on PID-use in NFDI4Cat
dalito committed Feb 26, 2024
1 parent eea699b commit 1f34b80
Showing 1 changed file with 171 additions and 47 deletions.
218 changes: 171 additions & 47 deletions nfdi4cat_details.md
@@ -3,52 +3,37 @@

*David Linke, LIKAT, 2024-02-23 (first draft)*

## Permanent IRIs for terminology or ontology terms

For linked-data applications permanent IRIs are required.
Several services exist that provide a stable base URL and
a redirect service that can be configured to redirect
to the actual resource location.

Such services should support serving different data to humans vs. machines.
One way to do this is via content-negotiation. Depending on the request that
is sent to the redirect service, the client is redirected to a human-oriented representation of the resource or to a machine-oriented one.

NFDI4Cat has selected w3id.org for this service. Comparable services are purl.org or pida.org which work in exactly the same way.

In principle, it would also be possible to use handle-based PIDs for the terms in ontologies.
However, the additional features and the more reliable redundant network of the handle-system are of minor importance for this application.
Persistent identifiers (PIDs) play an important role in achieving FAIR data.
This text is about the different use cases for PIDs in NFDI4Cat and how each case is solved technically.

## PIDs for Data Portals

NFDI4Cat has selected Dataverse as the software for the data portal(s).
Dataverse supports two types of PIDs: DOIs and handles. It is not possible to
use handles and DOIs at the same time.
use handles and DOIs at the same time in a single Dataverse installation.

According to the architecture document, different repositories are considered
for the NFDI4Cat data infrastructure.
According to the architecture document, different repositories are considered for the NFDI4Cat data infrastructure.
Since researchers expect DOIs for the data publications that they make,
the Dataverse for publishing final data should use DOIs.
However, DOIs are not a good fit for earlier stages of research due to
the metadata requirements and associated costs.
Therefore, another Dataverse is offered for data shared in an early phase of research
which uses handle-PIDs. This Dataverse is especially targeted at helping
researchers to collaboratively work with data in joint projects.
Therefore, another Dataverse is offered for data shared in an early phase of research which uses handle-PIDs.
This Dataverse is especially targeted at helping researchers to collaboratively work with data in joint projects.

To summarize, HLRS offers two central Dataverse installations:

- "Data4Cat" for publishing final datasets using DataCite DOIs
- "Colab4Cat" - A Dataverse that uses PID4Cat-Handles. Its purpose is to allow early sharing of data and collaborating on data prior to publishing.
- **Data4Cat** for publishing final datasets using DataCite DOIs
- **Colab4Cat** - A Dataverse that uses PID4Cat-Handles. Its purpose is to allow early sharing of data and collaborating on data prior to publishing.

In addition, there may be other local Dataverse installations, such as the one at BasCat,
which are also interested in using the PID4Cat handle server to mint PIDs.

Things to think about:

- If a data set is published it will get a DOI in addition to the handle.
Could this be a problem? For example, can the handle record be updated
to point to the same landing page as the DOI? This is probably not
a standard functionality in Dataverse, so it would require a custom solution.
Could this be a problem?
For example, can the handle record be updated to point to the same landing page as the DOI?
This is probably not a standard functionality in Dataverse, so it would require a custom solution.
- more?

For the handle-PIDs in Dataverse it is suggested:
@@ -59,22 +44,27 @@ For the handle-PIDs in Dataverse it is suggested:
- [:DataFilePIDFormat](https://guides.dataverse.org/en/latest/installation/config.html#datafilepidformat) = independent

- It is suggested to generate the ID-part via a procedure stored in the DB.
What scheme is used does not matter much,
but clashes with PID4Cat-handles or other Dataverse (BasCat) must be avoided,
for example by adding a prefix to the numeric ID.
The start ID should perhaps not be set to 1 but to 1001 to make the numeric part stick out.
Numeric-IDs may be encoded to shorten the ID length. For example:
- d4c-3yS (prefix and base58 encoded integer number)
- "3yS" is 10001 in [Base58 encoding](https://de.wikipedia.org/wiki/Base58) using alphabet [1-9A-HJ-NP-Za-km-z]
What scheme is used does not matter much, but clashes with PID4Cat-handles or other Dataverse installations (e.g. BasCat) must be avoided.
Via a stored procedure, UUID4 (or UUID7) identifiers could be created.
An alternative is to add a prefix to the numeric ID used by default.
The numeric ID may be base32 encoded to shorten the ID length.
In this case the start ID should perhaps not be set to 1 but to 1001 to make the numeric part stick out.
For example:
  - d4c-9OH (prefix and base32 encoded integer number)
    - "9OH" is 10001 written with the digits 0-9 and A-V (the "extended hex" base-32 alphabet of RFC 4648), see [base-32 alphabet](https://en.wikipedia.org/wiki/Base32#base-32)
  - d4c-10001 (prefix & integer number)

Preferred is the first scheme, because it is shorter.
The advantage of the base58 encoding (vs. base 36 or base 64) is that it avoids reading problems by excluding O, 0, I, l.
Alphabets other than the standard RFC 4648 one (letters A to Z and digits 2 to 7) could also be used for base32.
For example, the z-base-32 encoding has the advantage of a more human-friendly alphabet that avoids reading problems by excluding 0, l, v and 2: the number 0 can be confused with the letter o, the letter l with the number 1 or the letter i, the letter v is close to u or r (especially in handwriting), and the number 2 is close to z in handwriting.
Another alternative is crockford32 which uses 0-9 and uppercase letters without the letters I, L, O, U.

Current preference: z-base-32 alphabet as it's optimized for lowercase letters
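The encoding of a numeric ID can be sketched in a few lines of Python. This is only an illustration, not the stored procedure itself: the function names `encode_int` and `make_identifier` are hypothetical, and the digits are produced by plain base conversion of the integer (not by the byte-oriented, padded encoding that RFC 4648 defines for binary data). The three alphabets are the ones discussed above.

```python
# Alphabets for the digit values 0..31 (index in the string = digit value).
ALPHABETS = {
    "base32hex": "0123456789ABCDEFGHIJKLMNOPQRSTUV",  # RFC 4648 "extended hex"
    "z-base-32": "ybndrfg8ejkmcpqxot1uwisza345h769",  # no 0, l, v, 2
    "crockford": "0123456789ABCDEFGHJKMNPQRSTVWXYZ",  # no I, L, O, U
}


def encode_int(number: int, alphabet: str) -> str:
    """Write a non-negative integer with base-32 digits, most significant first."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if len(alphabet) != 32:
        raise ValueError("alphabet must contain exactly 32 characters")
    digits = []
    while True:
        number, remainder = divmod(number, 32)
        digits.append(alphabet[remainder])
        if number == 0:
            break
    return "".join(reversed(digits))


def make_identifier(number: int, prefix: str = "d4c",
                    alphabet_name: str = "z-base-32") -> str:
    """Combine the installation prefix and the encoded numeric ID (hypothetical helper)."""
    return f"{prefix}-{encode_int(number, ALPHABETS[alphabet_name])}"


if __name__ == "__main__":
    for name in ALPHABETS:
        print(name, make_identifier(10001, alphabet_name=name))
```

With the extended-hex alphabet this reproduces the `d4c-9OH` example; with z-base-32 the same number becomes `d4c-jat`.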

Configuration:
- [:IdentifierGenerationStyle](https://guides.dataverse.org/en/latest/installation/config.html#identifiergenerationstyle) = storedProcGenerated

Related documentation:
Related Dataverse documentation:

- [General Dataverse PID documentation](https://guides.dataverse.org/en/latest/installation/config.html#persistent-identifiers-and-publishing-datasets)
- [Dataverse handle configuration](https://guides.dataverse.org/en/latest/installation/config.html#configuring-your-dataverse-installation-for-handles)
@@ -94,20 +84,154 @@ This has the advantage that these metadata are available directly from the resol

PID4Cat comes with a well-defined schema for how and which metadata to store in the handle record. The schema is defined in [LinkML](https://linkml.io/linkml/) and developed here ([NFDI4Cat/PID4Cat](https://github.com/dalito/pid4cat-model/)).

### Mapping of PID4Cat schema to handle record

The PID4Cat schema is mapped to the handle record as follows:

**Table** Example record for a PID4Cat handle with the suffix "lik-dfi345", URL: https://hdl.handle.net/20.1000/lik-dfi345 <BR> The *PID4CatRecord* column contains the slot name of the LinkML model that is written to the respective line of the handle record.
**Table** Example record for a PID4Cat handle with the suffix "lik-dfi345", URL: https://hdl.handle.net/20.1000/lik-dfi345 <BR> The *PID4CatRecord* column contains the slot name of the LinkML model that is written to the respective line of the handle record; this column is not part of the handle record.

| Index | Type | Timestamp | Data | *PID4CatRecord* |
|-------|------|-----------|------|-----------|
| 1 | URL | ... | https://pid4cat.example.org/lik-dfi345 | *landing pageURL* |
| 2 | | ... | REGISTERED | *status* |
| 3 | | ... | 20240219v0 | *record_version* |
| 4 | | ... | 1.0.0 | *pid_schema_version* |
| 5 | | ... | CC0-1.0 | *dc_rights* |
| 6 | | ... | [email protected] | *curation_contact* |
| 7 | | ... | {json} | *resource_info* |
| 8 | | ... | {json} | *related_identifiers* |
| 9 | | ... | {json} | *change_log* |

In a future version the Type-column could reference type declarations in a datatype registry (DTR). Such DTRs are still under development and not yet open for general use.
| 1 | URL | 2024-01-01 10:47:38Z | https://pid4cat.example.org/lik-dfi345 | *landing pageURL* |
| 2 | DESC | 2024-02-19 13:40:02Z | REGISTERED | *status* |
| 3 | DESC | 2024-02-19 13:40:02Z | 20240219v0 | *record_version* |
| 4 | DESC | 2024-01-01 10:47:38Z | 1.0.0 | *pid_schema_version* |
| 5 | DESC | 2024-01-01 10:47:38Z | CC0-1.0 | *dc_rights* |
| 6 | EMAIL | 2024-01-01 10:47:38Z | [email protected] | *curation_contact* |
| 7 | DESC | 2024-01-01 10:47:38Z | {json} | *resource_info* |
| 8 | DESC | 2024-02-19 13:40:02Z | {json} | *related_identifiers* |
| 9 | DESC | 2024-02-19 13:40:02Z | {json} | *change_log* |

In a future version, some "DESC" Types could be refined to reference type declarations in a datatype registry (DTR). Such DTRs are still under development and not yet open for general use.
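As a rough sketch of this mapping, the following Python snippet turns a PID4Cat record (given as a plain dict) into a list of handle values with the indexes and types from the table. Several things here are assumptions for illustration only: the slot spelling `landing_page_url`, the placeholder e-mail address, and the shape of each value (`index`, `type`, `data` with `format` and `value`), which is modelled on the JSON representation used by handle server REST interfaces and should be checked against the actual server. Timestamps are left out because they are set by the handle server.

```python
import json

# (index, handle value type, PID4Cat slot) following the table above.
# The slot spellings are assumptions, in particular "landing_page_url".
RECORD_LAYOUT = [
    (1, "URL", "landing_page_url"),
    (2, "DESC", "status"),
    (3, "DESC", "record_version"),
    (4, "DESC", "pid_schema_version"),
    (5, "DESC", "dc_rights"),
    (6, "EMAIL", "curation_contact"),
    (7, "DESC", "resource_info"),
    (8, "DESC", "related_identifiers"),
    (9, "DESC", "change_log"),
]


def to_handle_values(record: dict) -> list:
    """Map a PID4Cat record to handle values; nested objects are stored as JSON text."""
    values = []
    for index, value_type, slot in RECORD_LAYOUT:
        raw = record[slot]
        text = raw if isinstance(raw, str) else json.dumps(raw, sort_keys=True)
        values.append(
            {"index": index, "type": value_type,
             "data": {"format": "string", "value": text}}
        )
    return values


example_record = {
    "landing_page_url": "https://pid4cat.example.org/lik-dfi345",
    "status": "REGISTERED",
    "record_version": "20240219v0",
    "pid_schema_version": "1.0.0",
    "dc_rights": "CC0-1.0",
    "curation_contact": "curator@example.org",  # placeholder address
    "resource_info": {"label": "Resource label", "resource_category": "SAMPLE"},
    "related_identifiers": [],
    "change_log": [],
}

handle_values = to_handle_values(example_record)

if __name__ == "__main__":
    print(json.dumps(handle_values, indent=2))
```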

Since PID4Cat is a LinkML model, we have all the tools at hand to create records or an API. For example, the pydantic model generated from the PID4Cat schema can be used to create the JSON objects for the PID record above, such as the *resource_info* JSON object:

```python
from linkml_runtime.dumpers import json_dumper
from pid4cat_model.datamodel import pid4cat_model_pydantic as p4c

# Build the resource_info object with the pydantic classes generated from the schema.
pid1_resource_info = p4c.ResourceInfo(
    label="Resource label",
    description="Resource description",
    resource_category=p4c.ResourceCategory.SAMPLE,
    rdf_url="https://example.org/resource/pid1",
    rdf_type="TURTLE",
    schema_url="https://example.org/resource_schema",
    schema_type="XSD",
)

# Serialize to JSON; inject_type=False suppresses the extra "@type" key.
print(json_dumper.dumps(pid1_resource_info, inject_type=False))
```

which will print the json-object to be stored under index 7 in the handle-record:

```json
{
  "label": "Resource label",
  "description": "Resource description",
  "resource_category": "SAMPLE",
  "rdf_url": "https://example.org/resource/pid1",
  "rdf_type": "TURTLE",
  "schema_url": "https://example.org/resource_schema",
  "schema_type": "XSD"
}
```

The pydantic model can also be used to create web APIs with other Python-based tools like django-ninja or FastAPI.

### Handle-suffix design

*This text is based on the proposal "PID-suffix, Version 2" made in NFDI4Cat-openproject on 2023-12-07*

Two handle-suffix schemes are supported:

- **UUID4** (_uuid pid4cat_)
  - Can be either supplied or generated by the gateway (both ways should be possible).
    If supplied, UUID7 should also be accepted.
- **namespace\_of\_agent/local_ID** (_name-spaced pid4cat_)
  - The handle-gateway should only validate the uniqueness of `namespace_of_agent/local_ID`-combinations.
    It is suggested to use 3 characters from a human-friendly base32-alphabet as namespaces (e.g. **z-base-32**, see above). This gives 32**3 = 32768 possible namespaces, and if ever necessary we can switch to 4 characters.
  - The uniqueness of the `local_ID` has to be taken care of by the namespace owner.
  - The `local_ID` can be freely chosen by the namespace owner but is limited to ASCII letters [a-zA-Z], numbers [0-9] and symbols `.-`. - *OK or too restrictive?*
  - The request for becoming the owner of a new namespace should be made by email.

Examples for PID4Cat handles (non-resolvable):

- **UUID4 pid4cat**: `https://hdl.handle.net/21.T11978/pid4cat/7e82d892-6acf-41a8-9c91-df826f67a806`
- **name-spaced pid4cat** in the example-namespace "`k3a`":
- `https://hdl.handle.net/21.T11978/pid4cat/k3a/123-456`
- The corresponding regex in Python syntax is: 
`re.compile(r"21\.T11978/pid4cat/(?P<namespace>k3a)/(?P<local_ID>\d{3}-\d{3})")` 
...which extracts "k3a" as group "namespace" and "123-456" as group "local_ID".
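A more general check that covers both suffix schemes could look like the following sketch. It is not the gateway's actual validation: `parse_suffix` is a hypothetical helper, the prefix is the non-resolvable example prefix from above, and the namespace is assumed to consist of exactly 3 characters from the z-base-32 alphabet. The `local_ID` pattern follows the character restriction proposed in the list above.

```python
import re
import uuid

HANDLE_PREFIX = "21.T11978/pid4cat/"  # example prefix from the text (non-resolvable)
ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"

# namespace: 3 z-base-32 characters; local_ID: ASCII letters, digits, "." and "-"
NAMESPACED = re.compile(
    rf"(?P<namespace>[{ZBASE32}]{{3}})/(?P<local_ID>[a-zA-Z0-9.-]+)"
)


def parse_suffix(handle: str) -> dict:
    """Classify a PID4Cat handle as name-spaced or UUID-based, or raise ValueError."""
    if not handle.startswith(HANDLE_PREFIX):
        raise ValueError(f"not a PID4Cat handle: {handle!r}")
    rest = handle[len(HANDLE_PREFIX):]

    match = NAMESPACED.fullmatch(rest)
    if match:
        return {"kind": "name-spaced", **match.groupdict()}

    try:
        parsed = uuid.UUID(rest)
    except ValueError:
        raise ValueError(f"not a valid PID4Cat suffix: {rest!r}") from None
    # uuid.UUID also accepts braces, "urn:uuid:" and missing hyphens; insist on
    # the canonical hyphenated form here.
    if str(parsed) != rest.lower():
        raise ValueError(f"UUID is not in canonical form: {rest!r}")
    return {"kind": "uuid", "uuid": str(parsed), "version": parsed.version}


if __name__ == "__main__":
    print(parse_suffix("21.T11978/pid4cat/k3a/123-456"))
    print(parse_suffix("21.T11978/pid4cat/7e82d892-6acf-41a8-9c91-df826f67a806"))
```

The gateway would additionally have to check that the namespace is registered and that the full suffix is not yet in use.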

> Note, the `pid4cat`-part in the above URLs/suffixes has no functional meaning.
> It is only present to make pid4cat-handles easily recognizable.

### API of handle server gateway

The role of the handle-server-gateway (HSG) is to restrict and manage write access to the handle-server and to add PID4Cat-specific validation for the handles. The HSG provides an API only.
Creating and updating PID4Cat-handles will be managed exclusively via the HSG.
Suggested URLs for the HSG are https://pid4cat.nfdi4cat.org or https://pid.nfdi4cat.org

> Note: Dataverse will not use the HSG but will interface with the handle-server directly.

The HSG controls which users are allowed to create and update handle records.
Except for sysadmins, HSG users will only be allowed to create and update handles for name-spaced PIDs.
All PID name-spaces will have at least one name-space admin user.
Institutions that would like to manage a PID4Cat-PID-namespace need to register a PID-namespace admin with NFDI4Cat.

Suggested minimal API of the HSG:

[unedited copy from open project]

Here is a tentative minimal API (to be discussed). API access is limited to special users "namespace-owners" (read-write), and "viewers" (read-only). Anonymous users have no access.

* _Regarding permissions and the minimal API we would like to get your feedback. Is this OK? Do you see gaps or need more functionality?_

#### List all PID records of owner \[GET\] or create new PID \[PUT\]:

`https://pid.nfdi4cat.org/<api-version>/<name-space> [GET,PUT]`

- `https://pid.nfdi4cat.org/` is the base url of the gateway server.
- `api-version` (e.g. `v1.0.0`) is a version specifier for the API (it is recommended to version APIs)
- `name-space` is the first part of the suffix of name-spaced PID.
- Special notes on create [PUT]:
  - HSG should set "record_version" automatically on create
  - *...and some more validation & consistency checks*

#### Show PID record \[GET\] or update PID record \[PUT\]

`https://pid.nfdi4cat.org/<api-version>/<name-space>/<local-ID-of-pid> [GET,PUT]`

where `<name-space>/<local-ID-of-pid>` form the suffix of the handle to get or update. For the examples above it is `k3a/123-456` and `7e82d892-6acf-41a8-9c91-df826f67a806`, respectively.

- Special notes on updates [PUT]:
  - The data are supplied as PID4Cat-schema-conform json object in the body of the request.
  - Each handle record update needs an update of the changelog (append)
  - HSG should set/increment "record_version" automatically on updates
  - *...and some more validation & consistency checks*

We could allow [DELETE] but we should not perform any true delete of a handle. Instead the record should be modified to state "OBSOLETE" (see PID4CatStatus enum in schema).
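The update and delete behaviour described above can be sketched independently of any web framework. Everything in this snippet is an assumption made for illustration: the helper names, the shape of a change-log entry, and in particular the rule for `record_version`, which is guessed from the example value "20240219v0" as a date stamp followed by a counter that restarts each day.

```python
from datetime import date
from typing import Optional


def next_record_version(current: Optional[str], today: date) -> str:
    """Assumed scheme: <YYYYMMDD>v<counter>, counter restarts at 0 on a new day."""
    stamp = today.strftime("%Y%m%d")
    if current and current.startswith(stamp + "v"):
        counter = int(current[len(stamp) + 1:])
        return f"{stamp}v{counter + 1}"
    return f"{stamp}v0"


def update_record(record: dict, changes: dict, description: str, today: date) -> dict:
    """Return an updated copy: apply changes, append to change_log, bump version."""
    if {"record_version", "change_log"} & changes.keys():
        raise ValueError("record_version and change_log are managed by the gateway")
    updated = {**record, **changes}
    updated["record_version"] = next_record_version(record.get("record_version"), today)
    updated["change_log"] = [
        *record.get("change_log", []),
        {"date": today.isoformat(), "description": description},  # hypothetical entry shape
    ]
    return updated


def obsolete_record(record: dict, reason: str, today: date) -> dict:
    """Handle [DELETE] without deleting: only the status is set to OBSOLETE."""
    return update_record(record, {"status": "OBSOLETE"}, reason, today)


if __name__ == "__main__":
    rec = {"status": "REGISTERED", "record_version": "20240219v0", "change_log": []}
    rec = update_record(rec, {"dc_rights": "CC0-1.0"}, "license added", date(2024, 2, 19))
    rec = obsolete_record(rec, "resource withdrawn", date(2024, 2, 26))
    print(rec)
```

Returning a new dict instead of mutating the input keeps the previous state available for validation or logging before the handle record is actually written.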

#### Possible API Extensions

The API may be extended to make the information in the handle record more accessible, for example

- Routes to retrieve all PIDs by category [GET]
- Routes to retrieve all PIDs by status [GET]


## Permanent IRIs for terminology or ontology terms

For linked-data applications permanent IRIs are required.
Several services exist that provide a stable base URL and
a redirect service that can be configured to redirect
to the actual resource location.

Such services should support serving different data to humans and to machines.
One way to do this is via content-negotiation. Depending on the request that
is sent to the redirect service, the client is redirected to a human-oriented representation of the resource or to a machine-oriented one.

NFDI4Cat has selected [w3id.org](https://w3id.org/) for this service. Comparable services are [purl.org]() or [pida.org]() which work in exactly the same way.
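From the client side, content negotiation only differs in the `Accept` header that is sent with the request. The sketch below builds (but does not send) two requests for the same term IRI with Python's standard library. The IRI is a made-up placeholder, and which media types lead to which redirect depends on how the redirect rules for the vocabulary are configured.

```python
from urllib.request import Request

# Hypothetical term IRI; a real one would live under the vocabulary's w3id.org path.
TERM_IRI = "https://w3id.org/example-vocabulary/term123"


def build_request(iri: str, want_rdf: bool) -> Request:
    """Prepare a GET request asking for RDF (Turtle) or for an HTML page."""
    accept = "text/turtle" if want_rdf else "text/html"
    return Request(iri, headers={"Accept": accept})


machine_request = build_request(TERM_IRI, want_rdf=True)
human_request = build_request(TERM_IRI, want_rdf=False)

if __name__ == "__main__":
    for req in (machine_request, human_request):
        print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing such a request to `urllib.request.urlopen` would follow the redirect issued by the service; a browser corresponds to the `text/html` case.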

In principle, it would also be possible to use handle-based PIDs for the terms in ontologies.
However, the additional features and the more reliable redundant network of the handle-system are of minor importance for this application.
