
Add specification for how to extend the schema #27

Open
briri opened this issue Mar 10, 2020 · 18 comments
Labels
decision Decision to be taken that aligns the approach


@briri

briri commented Mar 10, 2020

We are currently converting our API over to use this Common Standard metadata schema. We have a few scenarios where we also need to convey information that is required by our system but outside the scope of this schema.

It would be good if the schema provided guidance on how best to include this type of information, so that systems adopting the Common Standard schema follow similar patterns.

For example, the DMPTool API requires that a DMP template identifier be specified along with some other information specific to the caller's system (called 'abc' below) when creating a new DMP.

We will be using the following structure to accomplish this:

{
  "dmp": {
    "title": "My new DMP",
    ...
    // the rest of the common standard attributes
    ...
    "extended_attributes": [
      "dmptool": { "template_id": "123" },
      "abc": { "reserve_id": { "type": "doi", "identifier": "https://dx.doi.org/10.9999/999xyz" } }
    ]
  }
} 

Apologies if this has already been discussed and I just missed it in the documentation somewhere.

@hmpf

hmpf commented Jun 2, 2020

Would it be relevant to have extensions elsewhere than at the top level as well? For instance, extra information for host/distribution.

@briri
Author

briri commented Jun 2, 2020

I can see value in allowing for extensions at the dataset, distribution and host levels (perhaps project as well). For us (so far) the use case for using extensions has been the import (creation) of a DMP via our API, but it could be useful in other places as well.

We ended up using the following during the hackathon:

{
  "dmp": {
    "extension": [
      {
        "dmptool": {
          "template": {
            "id": 946,
            "title": "Environmental Resilience Institute Data Management Plan"
          }
        }
      }
    ]
  }
}

Related issue: RDA-DMP-Common/hackathon-2020#3

@TomMiksa
Contributor

Do you have more examples of extensions needed? This could help us find the best strategy for including them.

What about doing it in a slightly different way: using a field within the dmp section to declare which extensions are used? This would indicate at the beginning which specific extensions are in play and hence what additional fields are to be expected. Each extension would be identified by a URL to a JSON schema. For example:

{
  "dmp": {
     ...
    "extensions": [
      "http://json-schema.org/dmptool",
      "http://json-schema.org/funderX"
    ],
    ...
    "dataset": [
      {
        "title": "My Dataset",
        "dmptool-specific-field": "generated by DMPTool"
       ....
      }
    ]
  }
}
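
A minimal sketch (purely illustrative; the $id simply reuses the example URL above) of what such a referenced extension schema might contain:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://json-schema.org/dmptool",
  "title": "DMPTool extension",
  "type": "object",
  "properties": {
    "dmptool-specific-field": { "type": "string" }
  }
}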

@TomMiksa TomMiksa added the decision Decision to be taken that aligns the approach label Aug 28, 2020
@briri
Author

briri commented Aug 28, 2020

I think that could be a useful approach.

We are currently working through an integration that uses the common standard as the method of communication. We are still in the early stages of the project, though, and have not finished defining what additional information we would like to pass along. Much of the information is at the project/dmp level, for example (a rough sketch of how these might be carried as extensions follows the list):

  • the DOI of the research field station where the research will be performed
  • a yes|no|unknown value along with some descriptive text (like the ethical issues section) indicating whether the research involves endangered species
  • a yes|no|unknown value along with some descriptive text indicating whether culturally sensitive information for native populations is a factor
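
A rough sketch (the field names below are hypothetical placeholders, not agreed extension terms), reusing the extension block from the earlier comments:

{
  "dmp": {
    "extension": [
      {
        "dmptool": {
          "field_station": { "type": "doi", "identifier": "https://doi.org/10.9999/example-station" },
          "endangered_species": { "exist": "unknown", "description": "..." },
          "culturally_sensitive_data": { "exist": "no", "description": "..." }
        }
      }
    ]
  }
}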

@cpina

cpina commented Jan 19, 2021

I'm new here - sorry if I misinterpreted something in this issue. I was on the call earlier and thought I'd add some of my thoughts here.

{
  "dmp": {
     ...
    "extensions": [
      "http://json-schema.org/dmptool",
      "http://json-schema.org/funderX"
    ],
    ...
    "dataset": [
      {
        "title": "My Dataset",
        "dmptool-specific-field": "generated by DMPTool"
       ....
      }
    ]
  }
}

I like it. In the Frictionless Data community we had a similar discussion: frictionlessdata/datapackage#663

In that case we were looking at adding specific fields. E.g. at the Swiss Polar Institute we are prefixing them with x_spi_:
https://github.com/Swiss-Polar-Institute/frictionless-data-packages/blob/master/10.5281_zenodo.2616605/datapackage.json#L146
It makes it clear that these fields are extensions from SPI (the approach in this issue also makes it clear).
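
For illustration only (these field names are made up, not taken from the linked datapackage), a prefixed field inside a dataset entry looks something like:

{
  "dataset": [
    {
      "title": "My Dataset",
      "x_spi_expedition": "Expedition X",
      "x_spi_internal_id": "12345"
    }
  ]
}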

One possible (so far hypothetical) problem with the current suggestion: could two institutions come up with extensions that have the same name, with some fields that are the same? I can think of two possible solutions:

  • Prefix the extensions with an institution name (similar to the Frictionless Data approach, but importing extensions instead of adding fields)
  • In the "import" step, give each extension an alias (like Python imports), such as:
{
  "dmp": {
     ...
    "extensions": [
      {"uri": "http://university/citations", name: "university-citations"},
      {"uri": "http://school/citations", name:"school-citations"},
    ],
    ...
    "dataset": [
      {
        "title": "My Dataset",
        "school-citations-specific-field": "generated by DMPTool",
        "university-citations-specific-field": "something else",
       ....
      }
    ]
  }
}

@froggypaule

Hello ... also following this morning's call. Thanks to @cpina: this is the reservation I was trying to convey at the call:

  1. If fields coming from two different extensions share the same name and the same meaning, then all is well: they are simply mapped one onto the other.
  2. If fields coming from two different extensions share the same name BUT not the same meaning, then the solution proposed by @cpina would work.
  3. If fields coming from two different extensions do NOT share the same name but do share the same meaning: again, a mapping would do the trick.
  4. If fields coming from two different extensions do NOT share the same name NOR the same meaning, then all is well also.

Sorry if I misunderstand the question.

@cpina

cpina commented Jan 20, 2021

> Hello ... also following this morning's call. Thanks to @cpina: this is the reservation I was trying to convey at the call:
>
> 1. If fields coming from two different extensions share the same name and the same meaning, then all is well: they are simply mapped one onto the other.
>
> 2. If fields coming from two different extensions share the same name BUT not the same meaning, then the solution proposed by @cpina would work.
>
> 3. If fields coming from two different extensions do NOT share the same name but do share the same meaning: again, a mapping would do the trick.
>
> 4. If fields coming from two different extensions do NOT share the same name NOR the same meaning, then all is well also.

This is a perfect summary, thanks!

> Sorry if I misunderstand the question.

My thoughts are: should we make case 2 work (two different extensions sharing a field name but not its meaning)? If this is a concern and should work, what's the best way to go: a "name" (alias) or a "prefix"?

@briri
Author

briri commented Jan 29, 2021

We are going to begin work on the schema extensions for DMPRoadmap in late March or early April.

We plan to follow the pattern described by @cpina and @froggypaule above by using a tool/codebase-specific prefix like: dmproadmap-[x].

Any early suggestions or feedback (once we start work on it) would be welcome. :)

@froggypaule

Hello! A quick one: why the name 'dmproadmap'? I am asking because DMPRoadmap is the common code base for DMPTool and DMPonline. Is the name intentional?

@briri
Author

briri commented Feb 1, 2021

Yes. Any changes we'd be making would benefit the entire codebase (DMPTool, DMPonline, DMPOPIDoR, DMPAssistant, etc.).

For example, the DMPRoadmap system is driven in part by specific templates (e.g. Horizon 2020, NSF, USGS, etc.). We have an API endpoint that allows users to create a DMP by passing in this metadata standard. To help facilitate the use of specific templates, we would add a dmproadmap_template_id or something similar to convey that information to the system.
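
As a rough sketch (the exact field name is still to be decided, as noted above), the creation payload would simply carry the prefixed field alongside the standard attributes:

{
  "dmp": {
    "title": "My new DMP",
    "dmproadmap_template_id": "123"
  }
}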

@froggypaule

ok thanks.... I was just commenting :)

@paulwalk
Contributor

Hi - I've been reading this thread, and I'm concerned that the consensus seems to be to invent a mechanism for handling namespaces in JSON.

I would strongly recommend not doing this.

At the start of this work, we decided to limit our focus and ambition with the standard, so that it was developed and managed as an information exchange format. More formally, it could be described as a metadata application profile. However, the interest in this work has grown and, as such, we are now faced with a decision. Do we accept that there is demand for a more expansive standard - essentially an ontology within which new concepts can be added? Or do we continue to limit our scope, while recognising that there is demand to include extra information in, or alongside, the information exchange?

As I understand it, there are two viable options available to us:

Option 1: Widen our scope, and become an ontology

It could be argued that this is inevitable. In any case, there is already work underway to formally describe the standard as an OWL ontology, so there does appear to be demand for this. If this is the direction of travel for the DMP Common Standard, then I would recommend that we act sooner rather than later, and move from supporting plain JSON to supporting JSON-LD.

Pros:

  • JSON-LD allows us to extend by adding contexts (namespaces) which are easily and robustly implemented
  • JSON-LD allows us to describe DMPs in a manner which is not just machine-readable, but which is more machine-understandable
  • JSON-LD is increasingly well supported in software libraries

Cons:

  • this may be disruptive to the current implementations

Option 2: Continue as before, with a new section for arbitrary extensions

We had certainly been considering how to handle extensions from the beginning of this work, and this was our original idea. With this approach, the scope of the DMP Common Standard is unchanged, but a place is added for third-parties to add arbitrary data. With this approach, the DMP Common Standard has nothing to say about how these extensions are encoded. If implementers add extensions which cause name collisions, then they will need to sort this out (typically by agreeing conventions).

Pros:

  • potentially less disruptive to current implementations (although this needs some verification)

Cons:

  • risk to the DMP Common Standard that it becomes gradually marginalised as demand increases for the extensions to be more broadly interoperable.

My recommendation:

  1. Absolutely do not invent a new mechanism for name-spacing JSON properties as part of the DMP Common Standard
  2. Consider the implications of moving to JSON-LD. In many cases, it may simply involve adding a context to the JSON and changing to a JSON-LD software library for parsing (a minimal sketch follows this list). However, there may be other issues for the software that has implemented the standard. It would be good to find out: how disruptive would this actually be?
  3. If not moving to JSON-LD, then define the place for extensions (as already suggested above) and then say no more. Make it clear that all further definition is out of scope for this standard. However, we could consider providing a place for implementers to document "community conventions" for using these.
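
As a minimal sketch of point 2 (the context URIs below are illustrative placeholders, not agreed namespaces), the document could keep its current structure while a context maps bare terms and extension prefixes to URIs:

{
  "@context": {
    "@vocab": "https://example.org/dcs/terms/",
    "dmptool": "https://example.org/dmptool/terms/"
  },
  "dmp": {
    "title": "My new DMP",
    "dmptool:template_id": "123"
  }
}

A plain JSON parser can simply ignore the @context key.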

Of these two options, I think that the JSON-LD option is the more future-proof at this point.

@froggypaule

Thanks @paulwalk for clarifying this: having come to the CS quite late, this helps a lot.
And yes, I agree with you on JSON-LD and option 1 (not that I am particularly well versed in these matters...)

@cpina

cpina commented Mar 18, 2021

Thanks @paulwalk. Sadly I'm not very familiar with JSON-LD and I need to do some refreshing on it. I 100% agree we should avoid reinventing the wheel. If any of the ideas in my suggestions already exist in a standard, I would say go with the standard unless there is a very good reason specific to this use case.

@fekaputra
Collaborator

fekaputra commented Mar 18, 2021

Hi @paulwalk, in case it is decided that the community will go with the first option, we (mainly me, @JoaoMFCardoso, @ljgarcia and Marie-Christine) have been working on the ontology version of the DMP Common Standard (DMP Common Standard Ontology - DCSO), which is already committed as a part of this repository (https://github.com/RDA-DMP-Common/RDA-DMP-Common-Standard/tree/master/ontologies). This was a result of the DCS hackathon last year.

The goal of the ontology is to have a 1-to-1 mapping to the current DCS, to ensure compatibility between the DCSO and the original DCS standard.

We will be very happy to discuss the ontology development (which you can later serialise as JSON-LD) to include the latest changes since the hackathon if you wish.

As a note, we are currently working on an (invited) journal paper to showcase the DCSO and its features. So in case the community decides to go with JSON-LD, we can report this development in the paper as well.

@MarekSuchanek
Collaborator

MarekSuchanek commented Mar 20, 2021

Hi, I would vote for the JSON-LD way.

  • There is already DCSO; (re)using it would be great... It would be one unified definition of the DMP Common Standard.
  • Working with JSON-LD will definitely be more convenient and flexible.
  • The documentation (of the specification, i.e., the ontology) could be generated... no duplication of information, fewer inconsistency issues.
  • Easier to refer to specific parts of the standard.
  • Could also link directly to concepts/ontologies that the standard (re)uses, e.g., DCAT.

@paulwalk It should be possible to remain backwards compatible (when someone ignores @context, @type, etc., the structure can stay the same as it is now), right? The question is whether that is a good idea or whether it would be better to work directly on some redesign (again, sooner rather than later)...

One might also ask why JSON-LD and not directly RDF.

@paulwalk
Contributor

> @paulwalk It should be possible to remain backwards compatible (when someone ignores @context, @type, etc., the structure can stay the same as it is now), right? The question is whether that is a good idea or whether it would be better to work directly on some redesign (again, sooner rather than later)...

I think it would remain backwards-compatible for people parsing the document as JSON rather than JSON-LD. As far as I can see, the main thing that would be lost would be the namespace URI mapping - but the namespace prefixes would still be in the JSON.

> One might also ask why JSON-LD and not directly RDF.

This is really just about tooling. The DMP system APIs are already handling JSON. Developers mostly prefer it to RDF because they get native programming language support etc. JSON-LD seems to hit the "sweet-spot" for many.

@nicolasfranck

nicolasfranck commented Aug 2, 2022

I think the use of JSON-LD would only break existing usage if you decided to use a different way of expressing your attributes.
JSON-LD allows for short attribute names or expanded names (name vs http://schema.org/name), compacted results or not, and allows values to be expressed as regular strings, arrays of strings, arrays of objects, etc.
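
A minimal sketch of the two forms (using schema.org purely as an example vocabulary):

// compacted form, as an author would typically write it
{ "@context": "http://schema.org/", "name": "My Dataset" }

// expanded form, as produced by a JSON-LD processor
[ { "http://schema.org/name": [ { "@value": "My Dataset" } ] } ]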

A small side note: IIIF v2 uses JSON-LD, but implementers rapidly started to realise that attribute values could be anything (a reference URL? a regular string? an array of reference URLs?). IIIF v3 therefore decided to be far more strict.

And that is probably what one should do to make other developers' lives easier. Let's not forget that most JSON parsers are just JSON parsers, and are not like XML parsers that can handle namespaces.
