Metadata keys incongruency #939

Open
nitrosx opened this issue Dec 7, 2023 · 31 comments

@nitrosx
Contributor

nitrosx commented Dec 7, 2023

Metadata keys Incongruency

Summary

Currently the backend accepts any string as a metadata key. This is not an issue per se, but it becomes one when users access the metadata through both the frontend and the backend.
The frontend performs some changes to the metadata keys when it renders them. Here is an example:

  • BE: Data Type -> FE: Data Type
  • BE: data type -> FE: Data Type
  • BE: data_type -> FE: Data Type

This behavior is by design to render the metadata keys more human readable, but it can be confusing to data users.
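
A minimal sketch of why this is confusing, assuming the rendering simply turns underscores into spaces and capitalizes each word. The function name `render_key` is hypothetical and the real frontend code may differ in detail:

```python
import re

def render_key(key: str) -> str:
    """Approximate the frontend label: split on underscores/spaces, capitalize each word."""
    words = re.split(r"[_\s]+", key.strip())
    # Only the first letter of each word is upper-cased; the rest is left as stored.
    return " ".join(w[:1].upper() + w[1:] for w in words if w)

# Three distinct backend keys (taken from the example above).
backend_keys = ["Data Type", "data type", "data_type"]
labels = {key: render_key(key) for key in backend_keys}

for key, label in labels.items():
    print(f"BE: {key!r} -> FE: {label!r}")
```

All three backend keys end up with the same label, so a user who only sees the frontend cannot tell which key to use when querying the backend.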

Proposed Solution

We propose that the backend store metadata entries as follows:

'key_name' : { 
  '#value' : 'metadata_entry_value',
  '#unit' : 'metadata_entry_unit',
  '#type' : 'type of this field',
  '#human_readable_name' : 'Key Name',
  'sub_key_name' : {
   ....
  }
}

key_name is expected to always be lower case with no spaces; only underscores are allowed as separators. If #human_readable_name is not specified, the frontend will default back to the current behavior, which is to take the key, replace the underscores with spaces and capitalize every word.
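
The key rule and the fallback could be sketched as follows. This is only an illustration: the names `KEY_PATTERN`, `is_valid_key` and `display_name` are hypothetical, and allowing digits in keys is an assumption, since the proposal only mentions lower case, no spaces and underscores.

```python
import re

# Assumption: lower-case letters, digits and underscores, starting with a letter.
# The proposal itself only says "lower case, no spaces, only underscores".
KEY_PATTERN = re.compile(r"[a-z][a-z0-9_]*")

def is_valid_key(key: str) -> bool:
    """Check a metadata key against the proposed naming rule."""
    return KEY_PATTERN.fullmatch(key) is not None

def display_name(key: str, entry: dict) -> str:
    """Use '#human_readable_name' when present, else fall back to the current behavior."""
    explicit = entry.get("#human_readable_name")
    if explicit:
        return explicit
    # Fallback: underscores become spaces and every word is capitalized.
    return " ".join(word.capitalize() for word in key.split("_") if word)

print(is_valid_key("data_type"), is_valid_key("Data Type"))
print(display_name("sample_temperature", {"#value": 10}))
print(display_name("sample_temperature", {"#human_readable_name": "Temperature"}))
```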

We also suggest adding the type (such as string, number, quantity, datetime, etc.) to reduce ambiguity of interpretation in the frontend and across different languages when the data is used. If type is not specified, the system will default back to the current behavior.

Also, the system fields value, unit, type, and human_readable_name are prefixed with #, so it is clear that they are system fields and we reduce the probability of collisions with user fields when nested metadata are used.

@nitrosx added the enhancement (New feature or request) and meeting (Discuss in meeting before merge) labels Dec 7, 2023
@nitrosx
Contributor Author

nitrosx commented Dec 7, 2023

This issue also covers the frontend. If accepted, it will generate two PRs, one for FE and one for BE.

@bpedersen2
Contributor

Sounds like a solid proposal.

@dylanmcreynolds
Contributor

What backend change are you proposing here? Currently there is no validation or other type of processing on the scientific metadata dictionary. Are you proposing that we start? I would be very worried about that, since we have almost 100k datasets that don't comply with that spec (merely because when we set up the ingestor we didn't know that the frontend did special handling like this).

@bpedersen2
Contributor

I think we would need:

  • a migration script that migrates the db entries into the new format
  • handling in the ingest API for old-style entries that does the transformation for older clients (at least for 1-2 years until clients are upgraded)
  • support in the SDK libs (pyscicat, scitacean, etc.)

It would also be a chance to implement nested entries better (by using a new type key for the parent entries).

@nitrosx
Contributor Author

nitrosx commented Dec 8, 2023

@dylanmcreynolds I'm proposing to adopt this individual schema for each scientific metadata entry. The intent is to make it easier to manage, interpret and visualize the information that we store as metadata. Also notice that it is not a schema on the overall metadata, but just on the single metadata entry.
I do not want to force this individual schema on existing datasets; it can be optional. Both FE and BE will default back to the current behavior if they do not find this schema and if they are configured to opt out.
If we decide to go this route, in the future we should also include a script to update the data to the new format.

Mr @bpedersen2 just beat me to it, and he also read my mind.

@dylanmcreynolds
Contributor

I support the concept. I definitely think it needs to be optional for a long time. Migration scripts will be a little tricky, given that they might involve adding information that does not exist in existing scientific_metadata fields.

@dylanmcreynolds
Contributor

A couple of other random thoughts about this.

You are using the # in your examples, I believe that it's to indicate an "attribute". I have seen examples of using @, especially in cases where there is an attempt to represent XML data in JSON. One example is the JSON-LD spec.

This leads me to propose looking at JSON-LD to define fields here. If I understand the protocol, it provides a standards-based way of defining the meanings of fields.

@nitrosx
Contributor Author

nitrosx commented Dec 11, 2023

@dylanmcreynolds I have no objection to adopting JSON-LD.
From what I have read so far, we could implement the above idea as follows:

'key_name' : { 
  '@value' : 'metadata_entry_value',
  '@unit' : 'metadata_entry_unit',
  '@type' : 'type of this field',
  '@human_readable_name' : 'Key Name',
  'sub_key_name' : {
   ....
  }
}

Then I assume that we need to add the @context key and define the structure somewhere. Am I correct?

@dylanmcreynolds
Contributor

Well, my suggestion was kind of in two parts. The first, yes, that looks better to me as I have seen @ used a lot for attributes in json.

Let's forget the second part of my suggestion for now. :)

@nitrosx
Contributor Author

nitrosx commented Dec 11, 2023

Do you mean the migration script?

@dylanmcreynolds
Contributor

No, the reference to json-ld. I think there's something there (describing metadata in standard ways) but the json-ld standard is more about describing links and not metadata fields themselves.

@sbliven
Contributor

sbliven commented Dec 13, 2023

Do you really want to include human_readable_name in the data representation? This seems like something that could be done with just frontend changes (eg show the real key on hover, or just show the real key without capitalization, or add an option to view the raw json metadata).

I guess the bigger issue is that the frontend does expect a fixed metadata format for some sites. We should decide on and document standard formats supported by the frontend. We should also add options for validating scientificMetadata (needs a separate issue).

JSON-LD seems like a reasonable way to annotate scientificMetadata with type information and could be used by the frontend to determine whether it can be displayed as a table, a tree, etc. But from what I see this would require much more than spelling some fields with an @ prefix. We would want to publish a context describing valid data. We would also want to integrate standard ontologies (eg using qudt units) if we want the json-ld to be parsable by semantic tools.

Also, the system fields value, unit, type, and human_readable_name are prefixed with #, so it is clear that they are system fields and we reduce the probability of collisions with user fields when nested metadata are used.

If we don't want to actually enforce json-ld for metadata then it might actually be better not to use @ in the spelling. I don't think any prefix is necessary, since 'system fields' should alternate with 'user fields' on different levels of the hierarchy. Your example above should rather be:

{
  "key_name": {
    "unit": "metadata_entry_unit",
    "type": "type of this field",
    "human_readable_name": "Key Name",
    "value": {
      "sub_key_name": {
        "value": ...
      }
    }
  }
}

@nitrosx
Contributor Author

nitrosx commented Dec 13, 2023

Currently you can have a metadata key with a value and sub-fields. If that's the case, your example above will not be able to capture them.

I have seen examples where the fields that are required to interpret the entry are clearly marked with a suffix, while the fields defined by the user are not.

I'm looking for suggestions to find a solution that allows both human and machine to recognize if the field is user defined or is required by the system to understand how to interpret the information stored in it. All of it should be self-contained.

Going back to the # prefix, I can see it being really useful (IMHO) in examples like the following:

{
  "key1": {
    "#unit": "m",
    "#value": 10,
    "#type": "quantity",
    "#human_readable_name": "Key 1",
    "key1_1": {
      "#unit": "mm",
      "#value": 10,
      "#type": "quantity",
      "#human_readable_name": "Key 1.1"
    }
  }
}

This scientific metadata encodes the following structure:

key1 = 10 m
key1.key1_1 = 10 mm

which we can present in the frontend without any issue as follows:

Key 1 = 10m
Key 1 -> Key 1.1 = 10mm
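
As a rough illustration of how the # prefix lets a reader separate system fields from nested user keys, the structure above can be walked recursively. This is a sketch only; the function name `flatten` and the output format are hypothetical, not existing SciCat code:

```python
def flatten(metadata: dict, parent: str = "") -> dict:
    """Return {dotted.path: "value unit"} for every entry that carries a '#value'."""
    result = {}
    for key, entry in metadata.items():
        # Keys starting with '#' are system fields; bare values are ignored in this sketch.
        if key.startswith("#") or not isinstance(entry, dict):
            continue
        path = f"{parent}.{key}" if parent else key
        if "#value" in entry:
            unit = entry.get("#unit", "")
            result[path] = f"{entry['#value']} {unit}".strip()
        # Any remaining non-'#' keys are sub-entries and are visited recursively.
        result.update(flatten(entry, path))
    return result

metadata = {
    "key1": {
        "#unit": "m",
        "#value": 10,
        "#type": "quantity",
        "#human_readable_name": "Key 1",
        "key1_1": {
            "#unit": "mm",
            "#value": 10,
            "#type": "quantity",
            "#human_readable_name": "Key 1.1",
        },
    }
}

flat = flatten(metadata)
for path, text in flat.items():
    print(f"{path} = {text}")
```

An entry can therefore carry its own value and sub-entries at the same time, without a separate wrapper level.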

@nitrosx
Contributor Author

nitrosx commented Dec 13, 2023

An additional example is the following:
We have a property named sample_temperature with a value of 10C that is retrieved from the data file dat_file.nxs under the path nexus/path/to/sample/temperature. The instrument scientist would like it to be shown as Temperature, while the user would like it to be called Sample Condition.

With the # prefixed notation, we could create the following metadata entry:

"sample_temperature": { 
 "#value": 10,
 "#unit": "C",
 "#valueSI": 283.15,
 "#unitSI": "Kelvin",
 "#human_readable_name": "Temperature",
 "nexus_path" : "nexus/path/to/sample/temperature",
 "user_alias" : "Sample Condition"
}
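
To illustrate how #valueSI and #unitSI relate to #value and #unit, here is a small sketch. The conversion table `TO_SI` and the function `add_si_fields` are hypothetical stand-ins; the conversion actually used by the backend is not shown in this thread, and the sketch writes the SI unit as "K" rather than "Kelvin".

```python
# Hypothetical conversion table: unit -> (SI unit, function converting a value to SI).
TO_SI = {
    "C": ("K", lambda v: v + 273.15),
    "K": ("K", lambda v: v),
    "mm": ("m", lambda v: v / 1000),
    "m": ("m", lambda v: v),
}

def add_si_fields(entry: dict) -> dict:
    """Return a copy of the entry with '#valueSI' and '#unitSI' added when the unit is known."""
    unit = entry.get("#unit")
    if unit not in TO_SI or "#value" not in entry:
        return dict(entry)  # unknown unit or no value: leave the entry unchanged
    si_unit, convert = TO_SI[unit]
    return {**entry, "#valueSI": convert(entry["#value"]), "#unitSI": si_unit}

entry = {
    "#value": 10,
    "#unit": "C",
    "#human_readable_name": "Temperature",
    "nexus_path": "nexus/path/to/sample/temperature",
    "user_alias": "Sample Condition",
}
converted = add_si_fields(entry)
print(converted["#valueSI"], converted["#unitSI"])
```

User fields such as nexus_path and user_alias pass through untouched, because only the # prefixed fields are read or written.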

@nitrosx
Contributor Author

nitrosx commented Dec 13, 2023

Please comment!!! I would like to be able to do something like that, but I'm not 100% convinced and I'm unsure of the scope.

@dylanmcreynolds
Contributor

Do you really want to store all possible renamings/mappings of the data in scicat's scientific metadata? Or, in your example, do you want to leave the key as "Temperature" and let downstream systems that read nexus interpret the nexus file as it needs to? It seems to me like SciCat is the front end search and display tool, and has stayed clear of analysis.

@nitrosx
Contributor Author

nitrosx commented Dec 14, 2023

@dylanmcreynolds I'm just brainstorming and trying to collect all the examples that I come across. I would like to keep SciCat simple but allow maximum flexibility and solve the issue of metadata representation in the frontend.

@sbliven
Contributor

sbliven commented Dec 14, 2023

key1 = 10 m
key1.key1_1 = 10 mm

I guess if this is currently supported then we can argue for including it for backwards compatibility. However I think this is a bad design because the type of key1 is unclear. Is it an object or a literal? How would this be represented as a class?

@minottic
Contributor

minottic commented Dec 15, 2023

I feel we are touching on a core concept of scicat, which is the flexibility in the metadata structure. And I am a little afraid of that, as it would require circulating the information to existing adopters, convincing them, and making sure they keep using the data catalogue, which is already a challenge as they often don't see its value.

That said, in general, I like the idea of having some sort of high-level structure for the scientific metadata, but at the same time I think it should be customisable at least by every facility, and maybe, at a lower level, by every instrument or experiment.

So, as a first step: why don't we allow defining a scientific metadata structure when deploying the backend, as part of its configuration, and build the frontend functionality dynamically? The FE would need to fetch the defined structure schema first and then know what to do. From there one could expand the concept and add a feature to the FE which enables e.g. some members (e.g. the principal investigator of an experiment) to define a schema that all the members of that experiment should comply with. And from there, we could expose all these "user defined" or "developer defined" schemas and make them then machine readable (simply by having another set of endpoints that exposes the schemas).

To wrap it up, I think the idea of having a schema is ok (which is still up for discussion from what I see), but I would strongly prefer to be able to opt out of its enforcement and leave freedom for customisation.

@bpedersen2
Contributor

I think we really need to be clear that this 'schema' only applies to the leaves of the meta tree, but does not enforce anything on the overall tree.

Currently we already have two different schemas on the leaves:

  • 'key': 'value' (=opaque string)
  • 'key': {'value': value, 'unit': unit} ( + SIUnit if returned from the BE)

Adding a richer version here seems helpful in a number of cases.
E.g. allowing to specify a range as value and making it searchable.

For keeping maximum backwards compatibility, we should keep the opaque type as it is, and provide migrations on types where we can infer richer information automatically.
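
A small sketch of the point that the schema only concerns the leaves: the walk below descends through an arbitrary tree and only classifies what it finds at the bottom. The function name `leaf_kinds`, the kind labels and the sample tree are made up for illustration.

```python
def leaf_kinds(tree: dict, parent: str = "") -> dict:
    """Map each leaf path to the leaf schema it follows; inner nodes are not constrained."""
    kinds = {}
    for key, node in tree.items():
        path = f"{parent}.{key}" if parent else key
        if isinstance(node, dict):
            if "value" in node and "unit" in node:
                kinds[path] = "value_with_unit"  # {'value': ..., 'unit': ...} leaf
            else:
                kinds.update(leaf_kinds(node, path))  # inner node: just descend
        else:
            kinds[path] = "opaque"  # plain 'key': 'value' leaf
    return kinds

tree = {
    "sample": {
        "name": "quartz",
        "temperature": {"value": 10, "unit": "C"},
    },
    "comment": "test run",
}
kinds = leaf_kinds(tree)
for path, kind in kinds.items():
    print(f"{path}: {kind}")
```

A migration could then target only the leaves whose kind allows richer information to be inferred, and leave opaque leaves as they are.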

@minottic
Contributor

But who will manage the changes on the leaves in the future? Is it something that we will be able to do automatically, or does it require users' intervention? This latter part is what I fear a little, namely needing to convince all users to change their scripts; on the other side, I struggle to see how we could automate this fully. E.g., how can the BE understand whether the user is storing a temperature or something else if the user gives it a non-trivial name?

@sbliven
Contributor

sbliven commented Dec 20, 2023

E.g. allowing to specify a range as value and making it searchable.

Search is a good example for how structured scientificMetadata could be important for other BE functionality, not just the FE visualization. By default SciCat should not enforce any particular structure.

@minottic I think searching by temperature is already supported for leaves with a unit, with conversions provided via mathjs (eg #926)? Or is this still in development?

Finally, I created an issue for the validation feature (#966). Let's move the general conversation there and focus this more narrowly on @nitrosx's issue.

@minottic
Contributor

Thanks. That's better I think, and with that distinction in mind, I would suggest that we leave the "aliases" part as something that could be tackled by custom schemas (as well as the JSON-LD part, and my previous comment). If I understand right, we are asking ourselves:

  1. do we want to add type and human_readable_name fields?
  2. do we want to prefix "machine" fields with #?
  3. are there any other fields that we want to add?

IMHO:

  1. I think it's a good idea to add the type and human_readable_name, as long as the system defaults to the current behaviour if they do not exist (as pointed out by @nitrosx already).
  2. I don't fully understand why the "prefix" is needed, as @sbliven mentioned. Aren't we simply reserving some fields for machines? Also this, together with the default from 1, would allow us to run no migration script.
  3. For the moment, I would only add the fields which we deem useful for the FE existing use case, so I would not add anything else.

Lastly, while brainstorming, I still see some overlap between this issue and #966, as one could expand the concept of type and encode the "aliases" and potentially the human_readable_name by defining a "custom" type. But this is probably for later, if at all.

@nitrosx
Contributor Author

nitrosx commented Jan 2, 2024

Dear All
thank you so much for all the contributions.
After reading everything, I feel confident in proposing the following:

  • by default, Scientific Metadata will not follow any schema.
  • introduce the concept of a metadata entry schema: a metadata entry is a field in the scientific metadata of the dataset that has a unique machine-readable name used as its key and is defined as an object which should contain the following fields: value, unit, type, and human_readable_name.
  • a fully defined scientific metadata entry should adhere to the metadata entry schema:
"my_key" : {
  "value": my_value,
  "unit": "my_unit",
  "type" : <type enumeration>,
  "human_readable_name": "my_readable_key"
}
  • if field type is set to quantity, field unit is required and fields valueSI and unitSI will be added automatically by the BE
  • if field human_readable_name is not provided, FE defaults back to the current behavior, which is to replace underscores with spaces and capitalize every word.
  • if field human_readable_name is provided, FE will use its value as the name of the metadata property
  • if field type is provided,
    • BE could use it to validate that the value is of the correct type and format.
    • FE will use it to provide the correct widget for visualization and editing
  • if field type is not provided,
    • BE will save the value as it is
    • FE will infer the data type and provide the matching widget
  • if field type is set to object (previously proposed name nested), the field value should be set to a sub object with sub metadata entries
  • the scientific metadata entry can be defined in short as:
"my_key": "my_value"

This syntax is backward compatible with older data already present in different instances.
It is equivalent to the following syntax:

"my_key" : {
  "value": "my_value",
  "unit" : "",
  "type": "string",
  "human_readable_name": "My Key"
}
  • a variation of the previous syntax is:
"my_key": my_value

This syntax is backward compatible with older data already present in different instances.
It is equivalent to the following syntax:

"my_key" : {
  "value": my_value,
  "unit" : "",
  "type": "number",
  "human_readable_name": "My Key"
}

This solution should address all the concerns about backward compatibility and schema-free metadata (except for the metadata entry schema and its system fields).
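
The equivalence between the short and the full syntax described above could be sketched like this. The function names `default_name` and `normalize_entry` are hypothetical, and the sketch only covers the string and number cases listed in the proposal:

```python
def default_name(key: str) -> str:
    """Current FE behavior: underscores become spaces and every word is capitalized."""
    return " ".join(word.capitalize() for word in key.split("_") if word)

def normalize_entry(key: str, entry) -> dict:
    """Expand a short-form entry into the full form; full-form entries keep their fields."""
    if isinstance(entry, dict):
        full = dict(entry)  # already an object: keep system and user fields as given
    elif isinstance(entry, str):
        full = {"value": entry, "unit": "", "type": "string"}
    elif isinstance(entry, (int, float)) and not isinstance(entry, bool):
        full = {"value": entry, "unit": "", "type": "number"}
    else:
        # The proposal does not define a short form for other value types.
        raise TypeError(f"no short-form rule for {type(entry).__name__}")
    full.setdefault("human_readable_name", default_name(key))
    return full

print(normalize_entry("my_key", "my_value"))
print(normalize_entry("my_key", 42))
print(normalize_entry("my_key", {"value": 10, "unit": "mm", "type": "quantity"}))
```

Because the expansion is purely additive, existing datasets would not need to be rewritten for the FE to treat both forms the same way.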

@nitrosx
Contributor Author

nitrosx commented Jan 2, 2024

The type enumeration is currently defined as:

  • number
  • string
  • quantity (with unit)

In issue #984, I'm proposing an expanded list to cover additional cases that we have seen here at ESS and in collaborators' metadata.

@nitrosx
Contributor Author

nitrosx commented Jan 2, 2024

I would not be opposed to allowing the following metadata entry schema alternatives:

"my_key" : {
 "v[alue]" : "my_value",
 "u[nit]" : "my_unit",
 "t[ype]" : "my_type",
 "hrn[ame]|human_readable_name" : "My Key"
}
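
One possible reading of this notation, which is an assumption on my part, is that the bracketed part is optional: v or value, u or unit, t or type, and hrn, hrname or human_readable_name. Under that reading, the aliases could be expanded to the canonical names as sketched below (`ALIASES` and `expand_aliases` are hypothetical names):

```python
# Assumed reading of the notation: "v[alue]" means either "v" or "value", and so on.
ALIASES = {
    "v": "value", "value": "value",
    "u": "unit", "unit": "unit",
    "t": "type", "type": "type",
    "hrn": "human_readable_name",
    "hrname": "human_readable_name",
    "human_readable_name": "human_readable_name",
}

def expand_aliases(entry: dict) -> dict:
    """Rename abbreviated system fields to their canonical names; other fields pass through."""
    expanded = {}
    for field, value in entry.items():
        canonical = ALIASES.get(field, field)
        if canonical in expanded:
            # e.g. both "v" and "value" were given in the same entry
            raise ValueError(f"field {canonical!r} given more than once")
        expanded[canonical] = value
    return expanded

short = {"v": 10, "u": "mm", "t": "quantity", "hrn": "My Key", "note": "user field"}
print(expand_aliases(short))
```

One thing to weigh with such aliases is that every short name becomes reserved too, so a user field called v or t would collide with a system field.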

@nitrosx
Contributor Author

nitrosx commented Jan 2, 2024

Users are allowed to add any additional fields to a metadata entry as they see fit, as long as those fields do not collide with the system fields.

@nitrosx
Contributor Author

nitrosx commented Jan 2, 2024

The proposed metadata entry schema will allow us to address the proposed metadata types highlighted in #984

@minottic
Contributor

minottic commented Jan 8, 2024

minor comment: I would use object rather than nested for nested values, as it is closer to JS data types.

@nitrosx
Contributor Author

nitrosx commented Jan 8, 2024

@minottic that sounds good to me. I just updated the post

@nitrosx
Contributor Author

nitrosx commented Jan 8, 2024

I also added the object type in the list of allowed types proposed in #924

@nitrosx removed the meeting (Discuss in meeting before merge) label Feb 21, 2024