Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for FAIR4RS metrics #478

Merged
merged 53 commits into from
Feb 28, 2024

Conversation

karacolada
Copy link
Contributor

@karacolada karacolada commented Jan 11, 2024

Description

This PR adds support for a new set of metrics, namely the FAIR4RS (research software) metrics as laid out in the deliverable available on Zenodo.

Specifically, the changes are as follows:

  • added two new metric YAML files, one with metric tests for general research software, and one with CESSDA-specific metric tests
  • added a new GitHub harvester that uses the GitHub API to collect the necessary info for the provided repository
    • added a use_github attribute to the API definition to make running this harvester optional (similar to use_datacite)
    • re-generated the swagger models (see notes below)
    • added a config/github.cfg file to define a GitHub API token (providing one increases the rate limit)
  • modified any regular expressions testing for the validity of metric (test) identifiers
  • added new tests to the license evaluator to implement the first FAIR4RS metric, FRSM-15
  • modified association between evaluator class and metric identifier (see below for details)
  • added documentation (see below for details)

All these changes are non-breaking. Evaluators can still be associated with a single metric, but can also be modified as described below to cater to metrics from different sources. The GitHub harvester is disabled by default, so the behaviour does not change unless use_github is set.

Details and notes

To allow an evaluator class (and its test functions) to be shared by metrics with different identifiers, the following steps were taken in the modified evaluator (FAIREvaluatorLicense):

  • pass list of metrics to self.set_metric() see code
  • introduced a self.metric_test_map dictionary mapping test functions to (potentially) a list of metric test identifiers see code
  • modified test definition verification in test functions see code example
    • note that this change is necessary for any test function of an evaluator with multiple metrics associated, not just for shared test functions

This required the following changes outside the modified evaluator class:

  • modified FAIREvaluator.set_metric() to handle lists of metric identifiers as well as single metric strings see code

The changes to the documentation are as follows:

  • added fuji_server/data/README.md with notes an what the data files are used for
    • this is incomplete as I couldn't figure it out for every file
  • added section about development to README.md, covering:
    • running the simpleclient for easy interaction with the tool
    • a walkthrough of how the components interact
    • what changes need to be made to add support for a new set of metrics
    • how to update the API definition and regenerate the swagger-generated models

Notes on the chanegs generated by Swagger:

  • disregarded harvester.py which after regenerating didn't have an auth_token property and crashed some tests
  • re-generating added from datetime import date, datetime # noqa: F401 to every single model
  • fuji_server/models/any_of_fair_results_items.py was re-generated as fuji_server/models/any_of_fair_results_results_items.py with no further changes
  • maturity was switched from type str to type int
  • in fuji_server/models/identifier_included_output.py, the property object_identifier_included was added

Related issue: #ISSUE_NUMBER

Motivation and context

The changes were necessary to

  • accomodate the evaluation of a new set of metrics (non-FAIRsFAIR) and
  • implement one FAIR4RS metric looking at the license of a software repository (FRSM-15).

This PR also serves as an opportunity for feedback on the way the license evaluator was modified to allow for different metric test configurations, so that we can keep it in mind for the addition of future metrics.

How has this been tested?

Running hatch run test resulted in 18 passed tests with 9 deprecation warnings.

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Documentation content changes
  • Other (please describe):

Checklist

  • I have read the contributor guide.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@karacolada karacolada marked this pull request as ready for review January 16, 2024 11:34
def setLicenseDataAndOutput(self):
self.license_info = []
specified_licenses = self.fuji.metadata_merged.get("license")
if specified_licenses is None: # try GitHub data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here it has to be tested first if a FRSM metric is used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I'll add that. Just a note, I've also been thinking about merging the github_data dictionary into the merged_metadata but had to focus on other aspects first - might want to consider this further down the line though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 732b48e.

@@ -60,6 +60,7 @@ def flip_dict(dict_to_flip):
"right_holder": {"label": "License", "sameAs": "http://purl.org/dc/terms/rightsHolder"},
"object_size": {"label": "Object Size", "sameAs": "http://purl.org/dc/terms/extent"},
"language": {"label": "Language", "sameAs": "http://purl.org/dc/terms/language"},
"license_path": {"label": "License Path", "sameAs": None},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between license and license_path, can this be merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

license contains the license name/identifier, so something like mit - we use this the same for the GitHub harvest as it's used by the metadata harvester. license_path on the other hand is the file path to the license file in the repository, e.g. ./LICENSE.TXT. It's relevant for one of the CESSDA-specific metric tests.

I'm not sure how they might be merged since we need both pieces of information for the tests. Happy to adjust this if you have any suggestions though!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, this is indeed different, so let's keep it

@@ -2,6 +2,10 @@
#
# SPDX-License-Identifier: MIT

# coding: utf-8

from datetime import date, datetime # noqa: F401
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think so, it's auto-generated by swagger though (and pre-commit didn't seem to mind it).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, this is indeed different, so let's keep it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ups wrong comment.. as its auto-generated and not complained we better keep it;)

Copy link

📋 Code Coverage

Code Coverage

Package Line Rate Branch Rate Health
. 58% 35%
config 100% 100%
controllers 85% 56%
evaluators 63% 37%
harvester 60% 51%
helper 66% 55%
models 80% 87%
Summary 70% (6807 / 9768) 60% (2855 / 4782)

Copy link

📋 Pytest Results

19 tests  +1   19 ✅ +1   25s ⏱️ -1s
 1 suites ±0    0 💤 ±0 
 1 files   ±0    0 ❌ ±0 

Results for commit 732b48e. ± Comparison against base commit 708fe43.

@huberrob huberrob merged commit b30b780 into pangaea-data-publisher:master Feb 28, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants