Evaluation of voc4cat-tool for management of vocabularies #235

Open
oggioniale opened this issue Nov 26, 2024 · 27 comments
Labels
question A request for help

Comments

@oggioniale

Dear all,

I am writing to you because I am considering using your tool for managing some vocabularies as part of a European project.

Currently, the management is being done using an Excel sheet, but there is a complete lack of versioning and, above all, the ability to use GitHub as a platform for discussions, issue tracking, etc.

Following the instructions provided in the README of your repository, I created a clone on my GitHub account (https://github.com/oggioniale/elter-vocabularies/tree/main?tab=readme-ov-file). I made the necessary modifications to the file idranges.toml, but I have three questions to ask.

The first concerns the "Section of IDranges" in the file idranges.toml. What are ID ranges in the vocabulary?
The second is whether in the Excel file (e.g., voc4cat_template_043.xlsx) in the "Concept Scheme" sheet, I can add fields beyond those already present in blue. Fields such as language, license, acronym, label, etc. And whether I can define multiplicities for a concept (e.g., Creator).
The third, and final question, concerns the same Excel file, but in the "Concept" sheet. In this case, I would also like to know if I can add columns to enrich the concept, such as properties like prefLabel or definition in different languages, deprecated, created, modified, etc.

I apologize in advance for the many questions, but since what you have created seems to me a very interesting tool, I would like to understand if I can use it.

Best
Alessandro

@dalito
Member

dalito commented Nov 26, 2024

Hi Alessandro,

I am writing to you because I am considering using your tool for managing some vocabularies as part of a European project.

Great! You would be the first to build another vocabulary from the template. There may be some rough edges / missing notes. But our voc4cat shows that it is working well.

The first concerns the "Section of IDranges" in the file idranges.toml. What are ID ranges in the vocabulary?

IDranges give users pre-reserved ranges of IDs. So they already know the final concept-IRI when they submit new terms (given that the PR is accepted). The pipeline checks that every user creates new concepts only within their pre-reserved IDrange. See also iri-design.md.

The IDrange-support is only useful/needed if terms include a unique number in the IRI. Some other vocabularies use UUID4-type IDs in their IRIs. In this case the IDrange-based coordination is not necessary and the IDrange-based checks may be removed from the pipeline.

The second is whether in the Excel file (e.g., voc4cat_template_043.xlsx) in the "Concept Scheme" sheet, I can add fields beyond those already present in blue. Fields such as language, license, acronym, label, etc. And whether I can define multiplicities for a concept (e.g., Creator).

What do you mean by label? skos:label? (Not supported at the moment.)

Changing the excel sheet requires that the SHACL profile and the read/write of xlsx are changed accordingly. Adding more fields to the concept scheme sheet would be relatively easy. Some additional fields like license would be useful for us, too.

The third, and final question, concerns the same Excel file, but in the "Concept" sheet. In this case, I would also like to know if I can add columns to enrich the concept, such as properties like prefLabel or definition in different languages, deprecated, created, modified, etc.

Some features are supported: multi-language (each language is on a separate line), multiple creators can be given as comma-separated list.
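As a rough illustration of how such rows could be combined, the sketch below merges one row per language into a single concept record and splits a comma-separated creators cell. The row layout, the `ex:` IRI, the example.org creators and the helper names are assumptions for illustration. This is not the tool's actual reader.

```python
from collections import defaultdict

# Hypothetical rows: (concept IRI, preferred label, language code, creators cell)
rows = [
    ("ex:01", "raw data", "en", "https://example.org/alice, https://example.org/bob"),
    ("ex:01", "Rohdaten", "de", ""),
]


def split_creators(cell: str) -> list[str]:
    """Split a comma-separated cell into trimmed, non-empty entries."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def group_rows(rows):
    """Merge the per-language rows of each concept into one record."""
    concepts = defaultdict(lambda: {"labels": {}, "creators": []})
    for iri, label, lang, creators_cell in rows:
        record = concepts[iri]
        record["labels"][lang] = label
        for creator in split_creators(creators_cell):
            if creator not in record["creators"]:
                record["creators"].append(creator)
    return dict(concepts)


print(group_rows(rows))
```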

We have collected some ideas about possible future changes in #124. (addressing e.g. deprecation)

Modifying the concept sheet is more complex which is why we are collecting several possible changes in #124.

I apologize in advance for the many questions, but since what you have created seems to me a very interesting tool, I would like to understand if I can use it.

No problem. It would be great if the template is useful beyond the original project it was created in.

@oggioniale
Author

Hi Alessandro,

Hi,
thank you for your prompt reply.

I am writing to you because I am considering using your tool for managing some vocabularies as part of a European project.

Great! You would be the first to build another vocabulary from the template. There may be some rough edges / missing notes. But our voc4cat shows that it is working well.

Sounds good.

The first concerns the "Section of IDranges" in the file idranges.toml. What are ID ranges in the vocabulary?

IDranges give users pre-reserved ranges of IDs. So they already know the final concept-IRI when they submit new terms (given that the PR is accepted). The pipeline checks that every user creates new concepts only within their pre-reserved IDrange. See also iri-design.md.

The IDrange-support is only useful/needed if terms include a unique number in the IRI. Some other vocabularies use UUID4-type IDs in their IRIs. In this case the IDrange-based coordination is not necessary and the IDrange-based checks may be removed from the pipeline.

Ok, I think in this case it won't be useful to me, but I'll keep the information in mind in case it becomes useful in the future.

The second is whether in the Excel file (e.g., voc4cat_template_043.xlsx) in the "Concept Scheme" sheet, I can add fields beyond those already present in blue. Fields such as language, license, acronym, label, etc. And whether I can define multiplicities for a concept (e.g., Creator).

What do you mean by label? skos:label? (Not supported at the moment.)

I would like to add properties that describe the vocabulary, such as skos:prefLabel, omv:acronym, omv:resourceLocator, omv:knownUsage, dct:audience, doap:repository, dct:license, and dct:language. In this sense, I was asking if it was possible.

These are properties I am currently using, and I would like to keep them.

I also see that here 4th draft for next xlsx template (v1.0) 2024-02-11 there is indeed a row with Repository.

Changing the excel sheet requires that the SHACL profile and the read/write of xlsx are changed accordingly. Adding more fields to the concept scheme sheet would be relatively easy. Some additional fields like license would be useful for us, too.

I don't know SHACL in detail, but I see that the profile is quite generic.

The third, and final question, concerns the same Excel file, but in the "Concept" sheet. In this case, I would also like to know if I can add columns to enrich the concept, such as properties like prefLabel or definition in different languages, deprecated, created, modified, etc.

Some features are supported: multi-language (each language is on a separate line), multiple creators can be given as comma-separated list.

We have collected some ideas about possible future changes in #124. (addressing e.g. deprecation)

Modifying the concept sheet is more complex which is why we are collecting several possible changes in #124.

I'll take the next few days to check which properties I’m collecting to define the various concepts, so I can identify which ones are missing in your template.

The goal is not to lose the properties I have collected so far to describe the concepts in the vocabulary.

I apologize in advance for the many questions, but since what you have created seems to me a very interesting tool, I would like to understand if I can use it.

No problem. It would be great if the template is useful beyond the original project it was created in.

Best

@oggioniale
Author

Hi,
These are the properties that, among others, I would like to include.

dct:source (e.g. http://purl.obolibrary.org/obo/ENVO_01000155, http://purl.obolibrary.org/obo/ENVO_01000156)
dct:isReplacedBy (e.g. et:00003),
skos:note@en (e.g. [controlled by ] email or orcid on 2013-06-06),
skos:changeNote@en (e.g. changed by email or orcid on 2013-06-06),
skos:editorialNote@en (e.g. changed by email or orcid on 2013-06-06),
skos:broadMatch (In the voc4cat_template_043 file, there is an ‘Additional Concept Feature’ sheet. Does the ‘Broader Matches’ column return this property (skos:broadMatch) or the skos:broader property?),
owl:sameAs (e.g. http://dbpedia.org/resource.Reflectivity , http://dbpedia.org/resource.Reflectance),
dct:contributor (e.g. email or orcid),
dct:created^^xsd:date (e.g. 2013-07-01),
dct:creator (e.g. email or orcid),
dct:modified^^xsd:date (e.g. 2023-01-13),
skos:scopeNote@en (e.g. changed by email or orcid on 2013-06-06),
skos:example@en (e.g. a text about example),
owl:deprecated^^xsd:boolean (TRUE or FALSE)

PS: If I wanted to add a field in the Excel sheet and have its counterpart in the ttl, where should I make the change?

@dalito
Member

dalito commented Nov 27, 2024

Just a "FYI": Several properties are for recording changes. So far we have kept provenance info rather simple, as it was in the Australian vocpub profile which we re-used. Moreover, git has all the history details. With git blame, quite detailed provenance is easily accessible, e.g. https://github.com/nfdi4cat/voc4cat/blame/main/vocabularies/voc4cat/0000005.ttl (this could in principle be converted to a history expressed with properties in RDF).

Does the ‘Broader Matches’ column return this property (skos:broadMatch) or the skos:broader property?

The column in the concepts sheet results in skos:broader and skos:narrower relations (but not skos:broadMatch).
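A small sketch of what that means in terms of triples, assuming the hierarchy is available as a hypothetical mapping from each concept to its broader concepts (the `ex:` identifiers and the function name are made up, and the real conversion code may work differently):

```python
def hierarchy_triples(broader: dict[str, list[str]]) -> list[tuple[str, str, str]]:
    """Emit skos:broader plus the inverse skos:narrower for every link.

    Only hierarchical relations inside one scheme are produced here;
    mapping properties such as skos:broadMatch are never generated.
    """
    triples = set()
    for concept, parents in broader.items():
        for parent in parents:
            triples.add((concept, "skos:broader", parent))
            triples.add((parent, "skos:narrower", concept))
    return sorted(triples)


# Hypothetical hierarchy: ex:02 and ex:03 are both children of ex:01.
broader = {"ex:02": ["ex:01"], "ex:03": ["ex:01"]}
for triple in hierarchy_triples(broader):
    print(*triple)
```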

PS: If I wanted to add a field in the Excel sheet and have its counterpart in the ttl, where should I make the change?

By ttl, do you mean the SHACL profile? I would first write a test turtle file with a concept that has all the additional properties and then work on the SHACL profile until validation passes. With a local install of voc4cat-tool you can run validation (without conversion to Excel).
Once this works, I would look at the conversion of the turtle/RDF concept example to the Excel concept sheet. The final step is then adapting the Excel-to-turtle conversion.

Modifying the Excel sheet requires changing the conversion code, which means that you create a custom voc4cat-tool. To avoid this, it would be best if we could agree on a common set of properties (= columns in Excel). The profiles can be different, so you could mark properties that are mandatory for voc4cat as optional in your vocabulary profile. Hope it makes sense...
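The "same columns, different profiles" idea can be sketched in plain Python. This is only a stand-in for a real SHACL profile: the two profiles below, the choice of which properties are mandatory, and the function name are all hypothetical.

```python
def missing_mandatory(concept: dict[str, str], profile: dict[str, bool]) -> list[str]:
    """List properties the profile marks as mandatory but the concept lacks.

    profile maps a property name to True (mandatory) or False (optional).
    """
    return sorted(
        prop for prop, mandatory in profile.items()
        if mandatory and not concept.get(prop)
    )


# Hypothetical profiles sharing one property set (= the same Excel columns).
strict_profile = {"skos:prefLabel": True, "skos:definition": True, "dct:creator": True}
relaxed_profile = {**strict_profile, "dct:creator": False}

concept = {"skos:prefLabel": "raw data", "skos:definition": "Unprocessed measurements."}

print(missing_mandatory(concept, strict_profile))   # creator is required here
print(missing_mandatory(concept, relaxed_profile))  # creator is optional here
```

In an actual profile the same distinction would presumably be expressed with cardinality constraints on the property shapes rather than a boolean flag.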

@oggioniale
Author

I tried to make a pull request (oggioniale/elter-vocabularies#1) following the readme here. However, I encountered an error.

Run voc4cat --version
voc4cat 0.8.5
INFO    |Executing cmd: voc4cat check --config _main_branch/idranges.toml --logfile outbox/voc4cat.log --ci-pre inbox-excel-vocabs/ _main_branch/vocabularies
DEBUG   |Processing common options.
DEBUG   |Config loaded from: _main_branch/idranges.toml
ERROR   |Unexpected error.
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/voc4cat/cli.py", line 450, in run_cli_app
    main_cli(raw_args)
  File "/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/voc4cat/cli.py", line 441, in main_cli
    process_common_options(args, raw_args)
  File "/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/voc4cat/cli.py", line 47, in process_common_options
    config.load_config(config_file=Path(args.config))
  File "/opt/hostedtoolcache/Python/3.12.7/x64/lib/python3.12/site-packages/voc4cat/config.py", line 142, in load_config
    new_conf["IDRANGES"] = IDrangeConfig(**conf)
                           ^^^^^^^^^^^^^^^^^^^^^
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for IDrangeConfig
vocabs -> dataLevel -> id_range
  field required (type=value_error.missing)

The idranges.toml contains this:

single_vocab = false
# == eLTER Data Level code list derived from https://elter.atlassian.net/wiki/spaces/EC/pages/918749186/eLTER+Data+Levels ==
[vocabs.dataLevel]
id_length = 2
permanent_iri_part = "https://vocabs.lter-europe.net/dataLevel/" # may be to be changed with new URI of eLTER vocabularies!
[vocabs.dataLevel.checks]
allow_delete = false
[vocabs.dataLevel.prefix_map]
et = "https://vocabs.lter-europe.net/" # ???
# [[vocabs.dataLevel.idrange]]

@dalito
Member

dalito commented Nov 27, 2024

Hmm. Your case is special and fails because you don't have any user defined in the IDranges file. Therefore, id_range is empty => ValidationError. The validation is too strict because we have never thought about using the tools without setting IDranges for the users.

I have to look how to approach this best.

@dalito
Member

dalito commented Nov 28, 2024

From the URLs in your test, I saw that you use SKOSMOS to make the vocabulary available. We also have concrete plans to make the voc4cat vocabulary available over an NFDI SKOSMOS service in Q1-2025. Do you know if a SKOSMOS-compatible profile exists?

If that is the case I would be interested in making the current profile compatible.

@oggioniale
Author

Just a "FYI": Several properties are for recording changes. So far we have kept provenance info rather simple, as it was in the Australian vocpub profile which we re-used. Moreover, git has all the history details. With git blame, quite detailed provenance is easily accessible, e.g. https://github.com/nfdi4cat/voc4cat/blame/main/vocabularies/voc4cat/0000005.ttl (this could in principle be converted to a history expressed with properties in RDF).

I saw this and the provenance is very important for us.

Does the ‘Broader Matches’ column return this property (skos:broadMatch) or the skos:broader property?

The column in the concepts sheet results in skos:broader and skos:narrower relations (but not skos:broadMatch).

Perfect!

PS: If I wanted to add a field in the Excel sheet and have its counterpart in the ttl, where should I make the change?

By ttl, do you mean the SHACL profile? I would first write a test turtle file with a concept that has all the additional properties and then work on the SHACL profile until validation passes. With a local install of voc4cat-tool you can run validation (without conversion to Excel). Once this works, I would look at the conversion of the turtle/RDF concept example to the Excel concept sheet. The final step is then adapting the Excel-to-turtle conversion.

Ahh ok! I thought there was a direct correspondence between the excel sheet and the ttl file that was produced, without having to go through a validator.

Modifying the Excel sheet requires changing the conversion code, which means that you create a custom voc4cat-tool. To avoid this, it would be best if we could agree on a common set of properties (= columns in Excel). The profiles can be different, so you could mark properties that are mandatory for voc4cat as optional in your vocabulary profile. Hope it makes sense...

This makes a lot of sense. I already have the ttl corresponding to my excel sheet. But I don't have a shacl shape to validate it. But I think this will be a next step for me.

@oggioniale
Author

oggioniale commented Nov 28, 2024

Hmm. Your case is special and fails because you don't have any user defined in the IDranges file. Therefore, id_range is empty => ValidationError. The validation is too strict because we have never thought about using the tools without setting IDranges for the users.

I have to look how to approach this best.

Yesterday I tried entering a user (myself). Now id_range is filled in, but I still get an error.
Maybe I didn't understand what id_range is :-(
I now see the comments on the pull request. I will try ...

[[vocabs.dataLevel.id_range]]
first_id = 1
last_id = 10
gh_name = "oggioniale"
orcid = "0000-0002-7997-219X"
ror_id = "https://ror.org/02wxw4x45"

@dalito
Member

dalito commented Nov 28, 2024

Copied from PR-comments (to record all difficulties here)

One problem was that the Excel file for the vocabulary "dataLevel" must be named just dataLevel.xlsx, not vocab_dataLevel.xlsx. The error message is clear in principle, but one does not think of the Excel file name as the source of the problem. The Excel file name determines which vocabulary is looked up in IDranges. This should be better documented, and we should add a special check for this with a more meaningful error message.
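A check of that kind could look roughly like the sketch below. The function name and the error wording are hypothetical and do not exist in voc4cat-tool today. The only rule taken from above is that the file name without its extension must equal a vocabulary name from the config.

```python
from pathlib import Path


def vocab_name_for(xlsx_path: str, configured: set[str]) -> str:
    """Map an uploaded Excel file to its vocabulary via the file stem."""
    name = Path(xlsx_path).stem
    if name not in configured:
        raise ValueError(
            f"No vocabulary named {name!r} in the ID range config "
            f"(known: {sorted(configured)}). "
            "The Excel file must be named '<vocabulary>.xlsx'."
        )
    return name


configured = {"dataLevel"}
print(vocab_name_for("inbox-excel-vocabs/dataLevel.xlsx", configured))

try:
    vocab_name_for("inbox-excel-vocabs/vocab_dataLevel.xlsx", configured)
except ValueError as err:
    print(err)
```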

Another problem was that the Excel template version (on the Introduction sheet) had been changed to 0.0.1. It should stay at 0.4.3. It is the version of the Excel structure, not of the vocabulary contents.

@dalito
Member

dalito commented Nov 28, 2024

With a little Python experience it is not particularly difficult to run the commands of the gh-action on your local computer. Especially if you plan to change more, running locally is important to iterate faster.

I can give a short summary how to setup a local clone to do this, if you are interested.

@oggioniale
Author

Hello, the PR is now successful: oggioniale/elter-vocabularies#5

But now I would have expected a folder called dataLevel in https://github.com/oggioniale/elter-vocabularies/tree/main/vocabularies and generated documentation in the workflow artifacts.

I'm sorry for all these questions but I can't be autonomous.

@dalito
Member

dalito commented Nov 28, 2024

Don't worry. For us, your interest is a great opportunity to test and improve re-usability. We are happy to help.

@dalito
Member

dalito commented Nov 28, 2024

I created a PR in your test repository with a fixed Excel-file: oggioniale/elter-vocabularies#6

@dalito
Member

dalito commented Nov 29, 2024

Your current configuration is set up to manage multiple vocabularies together in a single git repository. In voc4cat we manage just a single vocabulary in the repository, so this variant is more thoroughly tested in real life. Both variants have pros and cons.

For example the cons of multiple vocabularies in one repository:

  • users see the issues of unrelated vocabularies
  • more complex to understand/debug if something goes wrong.
  • if something breaks all vocabularies are affected.
  • supporting e.g. different profiles or documentation generation requires customization (CI jobs must get vocabulary-specific)
  • release tags are always for the whole vocabulary suite. Individual vocabulary releases are not possible (with git/GitHub tools alone)

Typically, I would use a distinct repository for each vocabulary. Only for small closely related vocabularies I would manage multiple in one.

@oggioniale
Author

I created a PR in your test repository with a fixed Excel-file: oggioniale/elter-vocabularies#6

I saw your work and downloaded the Excel file. Thanks! I also understand the errors now (ROR and prefix); I would never have found them myself. The prefix for the collection was my mistake.

@oggioniale
Author

Your current configuration is set up to manage multiple vocabularies together in a single git repository. In voc4cat we manage just a single vocabulary in the repository, so this variant is more thoroughly tested in real life. Both variants have pros and cons.

For example the cons of multiple vocabularies in one repository:

  • users see the issues of unrelated vocabularies
  • more complex to understand/debug if something goes wrong.
  • if something breaks all vocabularies are affected.
  • supporting e.g. different profiles or documentation generation requires customization (CI jobs must get vocabulary-specific)
  • release tags are always for the whole vocabulary suite. Individual vocabulary releases are not possible (with git/GitHub tools alone)

Typically, I would use a distinct repository for each vocabulary. Only for small closely related vocabularies I would manage multiple in one.

Thank you. I understand what you say. In this case, too, I wanted to do a test.
In our case however there might be the possibility of having more than one vocabulary but we could manage it with different repositories. Let's see...

@dalito
Member

dalito commented Nov 29, 2024

BTW, the failure in the merge action is caused by a missing gh-pages branch. To use documentation hosting on gh-pages, gh-pages needs to be activated in the settings of the repository and an (initially empty) gh-pages branch must be present.

@dalito added the "question" (A request for help) label on Nov 29, 2024
@oggioniale
Author

BTW, the failure in the merge action is caused by a missing gh-pages branch. To use documentation hosting on gh-pages, gh-pages needs to be activated in the settings of the repository and an (initially empty) gh-pages branch must be present.

I was just wondering where I would find the documentation.
I guess now to set it up I have to choose GitHub Actions as the source (https://github.com/oggioniale/elter-vocabularies/settings/pages)?

@dalito
Member

dalito commented Nov 29, 2024

Here is a screenshot from the settings of voc4cat:

[Screenshot: gh-settings]

@dalito
Member

dalito commented Nov 29, 2024

If you want to host the docs on gh-pages, a redirect from a permanent URL service to gh-pages also needs to be configured. voc4cat's w3id.org config can serve as inspiration. It configures redirects to the correct anchor of the concepts in html and supports multiple vocabulary versions.

@oggioniale
Author

Ahhhhh ok.
In fact, I still have no branch called gh-pages.

@oggioniale
Author

BTW, the failure in the merge action is caused by a missing gh-pages branch. To use documentation hosting on gh-pages, gh-pages needs to be activated in the settings of the repository and an (initially empty) gh-pages branch must be present.

This is done as well! https://oggioniale.github.io/elter-vocabularies/dev/dataLevel/

@oggioniale
Author

Last test for today.
I wanted to test changing, deleting and adding concepts, so I made this new PR by uploading a new version of the Excel file.
The PR was successful and I merged the two branches, but nothing changed. And now I see the Excel file in the root of the project.
How do I change the content of a vocabulary? I thought it was sufficient to upload the Excel file with the changes.

@dalito
Member

dalito commented Nov 29, 2024

We are getting closer to the finish. 😄 🏁

So what made it fail?

Another (minor) issue: You apparently created the gh-pages branch from the main branch. However, the gh-pages branch should be empty at the beginning. The only file that should be added is an empty file named .nojekyll (and a custom 404.html if you like).

A note on deleting: Deleting a concept in Excel does not delete the concept from the concept scheme. To completely remove a concept (which should be very rare!), its turtle file in the vocabulary directory has to be deleted via a PR. Because of this, it is possible to edit only a reduced subset of a vocabulary via Excel uploads.
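The behaviour can be pictured as an "add or update, never delete" merge. The sketch below models the vocabulary directory as a dictionary keyed by concept ID. The function names and the sample data are purely illustrative and not taken from the tool.

```python
def apply_upload(repository: dict[str, dict], upload: dict[str, dict]) -> dict[str, dict]:
    """Add or update the uploaded concepts; concepts absent from the upload stay."""
    result = dict(repository)
    for concept_id, concept in upload.items():
        result[concept_id] = concept
    return result


def delete_concept(repository: dict[str, dict], concept_id: str) -> dict[str, dict]:
    """Explicit removal, standing in for deleting the concept's turtle file in a PR."""
    return {cid: c for cid, c in repository.items() if cid != concept_id}


# Hypothetical vocabulary with two concepts; the upload contains only one of them.
repository = {"01": {"label": "raw"}, "02": {"label": "processed"}}
upload = {"01": {"label": "raw data"}}

merged = apply_upload(repository, upload)
print(sorted(merged))                          # both concepts are still present
print(sorted(delete_concept(merged, "02")))    # removal needs the explicit step
```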

@oggioniale
Author

I have done all the tests to update the concepts. I think I now understand quite well how the process works. I think I'll do some tests with a denser vocabulary of concepts, I'll also compare with my colleagues and then I'll tell you.

Still, I think I will propose this solution for vocabulary management. The questions about adding certain properties and managing the id_range remain open, though.

On the other hand, I think it is good to be able to use only one vocabulary per repository.

I will get back to you soon.

@dalito
Member

dalito commented Dec 4, 2024

Thanks for the update. Let us know how your proposal is received by your colleagues!

If you would like us to present voc4cat or discuss details like how to approach the customization, we (@nmoust or me) would also be available for a zoom call.

We will not address #237 immediately. Until your decision it has low priority because it does not affect our vocabulary.
