Merge pull request #478 from softwaresaved/software
Added support for FAIR4RS metrics
huberrob authored Feb 28, 2024
2 parents 64d0951 + 732b48e commit b30b780
Showing 79 changed files with 2,036 additions and 160 deletions.
9 changes: 9 additions & 0 deletions .gitignore
@@ -21,6 +21,15 @@
# vim
*.swp

# database
*.db

# local copies of Google database loading files
fuji_server/helper/catalogue_helper_google_datasearch_copy.py
fuji_server/helper/create_google_cache_db_copy.py

# private config
fuji_server/config/github.cfg

# Created by https://www.gitignore.io/api/python,linux,macos

127 changes: 126 additions & 1 deletion README.md
@@ -65,6 +65,8 @@ The F-UJI server can now be started with:
python -m fuji_server -c fuji_server/config/server.ini
```

The OpenAPI user interface is then available at <http://localhost:1071/fuji/api/v1/ui/>.

### Docker-based installation

```bash
@@ -76,7 +78,7 @@ To access the OpenAPI user interface, open the URL below in the browser:

Your OpenAPI definition lives here:

<http://localhost:1071/fuji/api/v1/swagger.json>
<http://localhost:1071/fuji/api/v1/openapi.json>

You can provide a different server config file this way:

@@ -100,6 +102,129 @@ If you receive the exception `urllib2.URLError: <urlopen error [SSL: CERTIFICATE

F-UJI uses [basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication), so a username and password have to be provided with each REST call. The credentials can be configured in `fuji_server/config/users.py`.
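
As a minimal sketch of what this means for a client, the snippet below builds (but does not send) an authenticated request using only the Python standard library. The `/evaluate` path, the `object_identifier` key, and the example credentials are assumptions for illustration; check the OpenAPI user interface for the actual endpoint and request schema.

```python
import base64
import json
import urllib.request

# Assumed values for illustration only; replace them with your own configuration.
API_BASE = "http://localhost:1071/fuji/api/v1"
USERNAME = "username"  # hypothetical, see fuji_server/config/users.py
PASSWORD = "password"  # hypothetical


def build_request(object_identifier: str) -> urllib.request.Request:
    """Build a POST request that carries a basic authentication header."""
    credentials = f"{USERNAME}:{PASSWORD}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    payload = {
        "object_identifier": object_identifier,  # assumed key name
        "use_datacite": True,
        "use_github": False,
    }
    return urllib.request.Request(
        f"{API_BASE}/evaluate",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request("https://doi.org/10.1234/example")
```

Once the server is running, the request could be sent with `urllib.request.urlopen(request)`.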

## Development

First, make sure to read the [contribution guidelines](./CONTRIBUTING.md).
They include instructions on how to set up your environment with `pre-commit` and how to run the tests.

The repository includes a [simple web client](./simpleclient/) suitable for interacting with the API during development.
One way to run it is with a LEMP stack (Linux, Nginx, MySQL, PHP), as described below.

First, install the necessary packages:

```bash
sudo apt-get update
sudo apt-get install nginx
sudo ufw allow 'Nginx HTTP'
sudo service mysql start # expects that mysql is already installed, if not run sudo apt install mysql-server
sudo service nginx start
sudo apt install php8.1-fpm php-mysql
sudo apt install php8.1-curl
sudo phpenmod curl
```

Next, configure the service by running `sudo vim /etc/nginx/sites-available/fuji-dev` and pasting the following:

```nginx
server {
    listen 9000;
    server_name fuji-dev;
    root /var/www/fuji-dev;

    index index.php;

    location / {
        try_files $uri $uri/ =404;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php8.1-fpm.sock;
    }

    location ~ /\.ht {
        deny all;
    }
}
```

Link `simpleclient/index.php` and `simpleclient/icons/` to `/var/www/fuji-dev` by running `sudo ln <path_to_fuji>/fuji/simpleclient/* /var/www/fuji-dev/`. You might need to adjust the file permissions to allow non-root writes.

Next, enable the site, test the Nginx configuration, and start the services:

```bash
sudo ln -s /etc/nginx/sites-available/fuji-dev /etc/nginx/sites-enabled/
sudo nginx -t
sudo service nginx reload
sudo service php8.1-fpm start
```

The web client should now be available at <http://localhost:9000/>. Make sure to adjust the username and password in [`simpleclient/index.php`](./simpleclient/index.php).

After a restart, it may be necessary to start the services again:

```bash
sudo service php8.1-fpm start
sudo service nginx start
python -m fuji_server -c fuji_server/config/server.ini
```

### Component interaction (walkthrough)

This walkthrough is meant to help you find your way around the extensive codebase.

A good starting point is [`fair_object_controller/assess_by_id`](fuji_server/controllers/fair_object_controller.py#L36).
Here, we create a [`FAIRCheck`](fuji_server/controllers/fair_check.py) object called `ft`.
This reads the metrics YAML file during initialisation and will provide all the `check` methods.

Next, several harvesting methods are called: first [`harvest_all_metadata`](fuji_server/controllers/fair_check.py#L329), followed by [`harvest_re3_data`](fuji_server/controllers/fair_check.py#L345) (DataCite) and [`harvest_github`](fuji_server/controllers/fair_check.py#L366), and finally [`harvest_all_data`](fuji_server/controllers/fair_check.py#L359).
The harvesters are implemented separately in [`harvester/`](./fuji_server/harvester/), and each of them collects different kinds of data.
The harvesters always run, regardless of the defined metrics.
- The metadata harvester looks through HTML markup following schema.org, Dublin Core, etc., and through signposting/typed links.
Ideally, it can find things like author information or license names that way.
- The data harvester is only run if the metadata harvester finds an `object_content_identifier` pointing at content files.
Then, the data harvester runs over the files and checks things like the file format.
- The GitHub harvester connects to the GitHub API to retrieve metadata and data from software repositories.
It relies on an access token being defined in [`config/github.cfg`](./fuji_server/config/github.cfg).
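
As a rough illustration of the token lookup, the sketch below reads a token with the standard-library `configparser`. It assumes the config file uses an `[ACCESS]` section with a `token` key, as in the `fuji_server/config/github.ini` template; the function name `read_github_token` and the placeholder token are hypothetical and not taken from the harvester code.

```python
import configparser
from typing import Optional

# Example file content with an [ACCESS] section; the token is a placeholder.
EXAMPLE_CONFIG = """
[ACCESS]
token = ghp_exampletoken
"""


def read_github_token(config_text: str) -> Optional[str]:
    """Return the configured token, or None if it is missing or empty."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # The fallback covers both a missing section and a missing key.
    token = parser.get("ACCESS", "token", fallback="").strip()
    return token or None


token = read_github_token(EXAMPLE_CONFIG)
empty = read_github_token("[ACCESS]\ntoken =\n")
```

Treating an empty value as "no token" mirrors the template, where the key is present but left blank by default.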

After harvesting, all evaluators are called.
Each specific evaluator, e.g. [`FAIREvaluatorLicense`](fuji_server/evaluators/fair_evaluator_license.py), is associated with a specific FsF and/or FAIR4RS metric.
Before the evaluator runs any checks on the harvested data, it asserts that its associated metric is listed in the metrics YAML file.
Only if it is does the evaluator run its tests and compute a local score.

In the end, all scores are aggregated into F, A, I, R scores.
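
The aggregation step can be pictured roughly as follows. This is a simplified sketch, not the actual implementation: it reuses the regular expression from `get_assessment_summary` in `fair_check.py` to derive the FAIR category from a metric identifier, while the `earned` and `total` keys and the example results are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Pattern as used in fair_check.py; group 2 captures the category letter.
METRIC_PATTERN = re.compile(r"^(?:FRSM-[0-9]+|FsF)-(([FAIR])[0-9](\.[0-9])?)")


def aggregate_by_category(results):
    """Sum earned and total scores per FAIR category (F, A, I, R)."""
    summary = defaultdict(lambda: {"earned": 0, "total": 0})
    for result in results:
        match = METRIC_PATTERN.search(str(result.get("metric_identifier")))
        if match is None:
            continue  # skip identifiers that follow neither naming scheme
        category = match.group(2)
        summary[category]["earned"] += result["earned"]
        summary[category]["total"] += result["total"]
    return dict(summary)


# Hypothetical results; the "earned" and "total" keys are assumptions.
example_results = [
    {"metric_identifier": "FsF-F1-01M", "earned": 1, "total": 1},
    {"metric_identifier": "FsF-R1.1-01M", "earned": 1, "total": 2},
    {"metric_identifier": "FRSM-15-R1.1", "earned": 2, "total": 3},
]
summary = aggregate_by_category(example_results)
```

Both FsF and FAIR4RS (FRSM) identifiers end up in the same category bucket, which is why metric names need to follow one of the two naming schemes.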

### Adding support for new metrics

Start by adding a new metrics YAML file in [`yaml/`](./fuji_server/yaml).
Its name has to match the following regular expression: `(metrics_v)?([0-9]+\.[0-9]+)(_[a-z]+)?(\.yaml)`,
and the content should be structured similarly to the existing metric files.
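
To check a candidate file name before adding it, a quick test along these lines may help. It is a sketch that assumes the whole file name must match the expression (hence `fullmatch`); the candidate names are made up for illustration.

```python
import re

# Regular expression quoted from the paragraph above.
FILE_NAME_PATTERN = re.compile(r"(metrics_v)?([0-9]+\.[0-9]+)(_[a-z]+)?(\.yaml)")


def is_valid_metrics_file_name(name: str) -> bool:
    """Return True if the whole file name matches the expected pattern."""
    return FILE_NAME_PATTERN.fullmatch(name) is not None


# Hypothetical candidate names.
candidates = [
    "metrics_v0.5.yaml",
    "metrics_v0.7_software.yaml",
    "metrics_v1.yaml",        # no minor version number
    "metrics_v0.7_Software.yaml",  # suffix must be lowercase letters only
]
validity = {name: is_valid_metrics_file_name(name) for name in candidates}
```

Note that the optional suffix only allows lowercase letters, so digits, uppercase letters, or further underscores in the suffix will not match.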

Metric names are tested for validity using regular expressions throughout the code.
If your metric names do not match those, not all components of the tool will execute as expected, so make sure to adjust the expressions.
Regular expression groups are also used for mapping to F, A, I, R categories for scoring, and debug messages are only displayed if they are associated with a valid metric.

Evaluators are mapped to metrics in their `__init__` methods, so adjust existing evaluators to associate with your metric as well or define new evaluators if needed.
The multiple test methods within an evaluator also check whether their specific test is defined.
[`FAIREvaluatorLicense`](fuji_server/evaluators/fair_evaluator_license.py) is an example of an evaluator corresponding to metrics from different sources.

For each metric, the maturity is determined as the maximum of the maturity associated with each passed test.
This means that if a test indicating maturity 3 is passed and one indicating maturity 2 is not passed, the metric will still be shown to be fulfilled with maturity 3.
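
The rule can be expressed in a few lines. The sketch below is illustrative only: the `(maturity, passed)` pair representation and the default of `0` when no test passes are assumptions, not the data structures used by the evaluators.

```python
def metric_maturity(test_outcomes):
    """Return the highest maturity among passed tests.

    test_outcomes is an iterable of (maturity, passed) pairs.
    Returns 0 if no test passed (an assumed default).
    """
    passed_levels = [maturity for maturity, passed in test_outcomes if passed]
    return max(passed_levels, default=0)


# The example from the text: maturity 3 passed, maturity 2 not passed.
example_maturity = metric_maturity([(3, True), (2, False)])
```

Because only the maximum counts, a failed lower-maturity test does not reduce the reported maturity.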

### Updates to the API

Making changes to the API requires re-generating parts of the code using Swagger.
First, edit [`fuji_server/yaml/openapi.yaml`](fuji_server/yaml/openapi.yaml).
Then, use the [Swagger Editor](https://editor.swagger.io/) to generate a python-flask server.
The zipped files should download automatically.
Unzip them.

Next:
1. Place the files in `swagger_server/models` into `fuji_server/models`, except `swagger_server/models/__init__.py`.
2. Rename all occurrences of `swagger_server` to `fuji_server`.
3. Add the content of `swagger_server/models/__init__.py` into `fuji_server/__init__.py`.

Unfortunately, the Swagger Editor doesn't always produce code that is compliant with PEP standards.
Run `pre-commit run` (or try to commit) and fix any errors that cannot be automatically fixed.

## License
This project is licensed under the MIT License; for more details, see the [LICENSE](https://github.com/pangaea-data-publisher/fuji/blob/master/LICENSE) file.
37 changes: 34 additions & 3 deletions fuji_server/__init__.py
@@ -1,38 +1,69 @@
# -*- coding: utf-8 -*-

# SPDX-FileCopyrightText: 2020 PANGAEA (https://www.pangaea.de/)
#
# SPDX-License-Identifier: MIT

# coding: utf-8

# flake8: noqa
from __future__ import absolute_import

# import models into model package
from fuji_server.models.any_of_fair_results_items import AnyOfFAIRResultsResultsItems
from fuji_server.models.any_of_fair_results_results_items import AnyOfFAIRResultsResultsItems
from fuji_server.models.body import Body
from fuji_server.models.community_endorsed_standard import CommunityEndorsedStandard
from fuji_server.models.community_endorsed_standard_output import CommunityEndorsedStandardOutput
from fuji_server.models.community_endorsed_standard_output_inner import CommunityEndorsedStandardOutputInner
from fuji_server.models.core_metadata import CoreMetadata
from fuji_server.models.core_metadata_output import CoreMetadataOutput
from fuji_server.models.data_access_level import DataAccessLevel
from fuji_server.models.data_access_output import DataAccessOutput
from fuji_server.models.data_content_metadata import DataContentMetadata
from fuji_server.models.data_content_metadata_output import DataContentMetadataOutput
from fuji_server.models.data_content_metadata_output_inner import DataContentMetadataOutputInner
from fuji_server.models.data_file_format import DataFileFormat
from fuji_server.models.data_file_format_output import DataFileFormatOutput
from fuji_server.models.data_file_format_output_inner import DataFileFormatOutputInner
from fuji_server.models.data_provenance import DataProvenance
from fuji_server.models.data_provenance_output import DataProvenanceOutput
from fuji_server.models.data_provenance_output_inner import DataProvenanceOutputInner
from fuji_server.models.debug import Debug
from fuji_server.models.fair_result_common import FAIRResultCommon
from fuji_server.models.fair_result_common_score import FAIRResultCommonScore
from fuji_server.models.fair_result_evaluation_criterium import FAIRResultEvaluationCriterium
from fuji_server.models.fair_results import FAIRResults
from fuji_server.models.formal_metadata import FormalMetadata
from fuji_server.models.formal_metadata_output import FormalMetadataOutput
from fuji_server.models.formal_metadata_output_inner import FormalMetadataOutputInner
from fuji_server.models.harvest import Harvest
from fuji_server.models.harvest_results import HarvestResults
from fuji_server.models.harvest_results_metadata import HarvestResultsMetadata
from fuji_server.models.identifier_included import IdentifierIncluded
from fuji_server.models.identifier_included_output import IdentifierIncludedOutput
from fuji_server.models.identifier_included_output_inner import IdentifierIncludedOutputInner
from fuji_server.models.license import License
from fuji_server.models.license_output import LicenseOutput
from fuji_server.models.license_output_inner import LicenseOutputInner
from fuji_server.models.metadata_preserved import MetadataPreserved
from fuji_server.models.metadata_preserved_output import MetadataPreservedOutput
from fuji_server.models.metric import Metric
from fuji_server.models.metrics import Metrics
from fuji_server.models.output_core_metadata_found import OutputCoreMetadataFound
from fuji_server.models.output_search_mechanisms import OutputSearchMechanisms
from fuji_server.models.persistence import Persistence
from fuji_server.models.persistence_output import PersistenceOutput
from fuji_server.models.persistence_output_inner import PersistenceOutputInner
from fuji_server.models.related_resource import RelatedResource
from fuji_server.models.related_resource_output import RelatedResourceOutput
from fuji_server.models.related_resource_output_inner import RelatedResourceOutputInner
from fuji_server.models.searchable import Searchable
from fuji_server.models.searchable_output import SearchableOutput
from fuji_server.models.semantic_vocabulary import SemanticVocabulary
from fuji_server.models.semantic_vocabulary_output import SemanticVocabularyOutput
from fuji_server.models.semantic_vocabulary_output_inner import SemanticVocabularyOutputInner
from fuji_server.models.standardised_protocol_data import StandardisedProtocolData
from fuji_server.models.standardised_protocol_data_output import StandardisedProtocolDataOutput
from fuji_server.models.standardised_protocol_metadata import StandardisedProtocolMetadata
from fuji_server.models.standardised_protocol_metadata_output import StandardisedProtocolMetadataOutput
from fuji_server.models.uniqueness import Uniqueness
from fuji_server.models.uniqueness_output import UniquenessOutput

3 changes: 3 additions & 0 deletions fuji_server/config/github.ini
@@ -0,0 +1,3 @@
[ACCESS]
# set equal to access token if available to increase rate limit (usually starts with 'ghp_')
token =
20 changes: 18 additions & 2 deletions fuji_server/controllers/fair_check.py
@@ -37,6 +37,7 @@
from fuji_server.evaluators.fair_evaluator_unique_identifier_data import FAIREvaluatorUniqueIdentifierData
from fuji_server.evaluators.fair_evaluator_unique_identifier_metadata import FAIREvaluatorUniqueIdentifierMetadata
from fuji_server.harvester.data_harvester import DataHarvester
from fuji_server.harvester.github_harvester import GithubHarvester
from fuji_server.harvester.metadata_harvester import MetadataHarvester
from fuji_server.helper.linked_vocab_helper import linked_vocab_helper
from fuji_server.helper.metadata_collector import MetadataOfferingMethods
@@ -80,6 +81,7 @@ def __init__(
metadata_service_url=None,
metadata_service_type=None,
use_datacite=True,
use_github=False,
verify_pids=True,
oaipmh_endpoint=None,
metric_version=None,
@@ -137,6 +139,7 @@ def __init__(

self.rdf_collector = None
self.use_datacite = use_datacite
self.use_github = use_github
self.repeat_pid_check = False
self.logger_message_stream = io.StringIO()
logging.addLevelName(self.LOG_SUCCESS, "SUCCESS")
@@ -347,6 +350,17 @@ def harvest_all_data(self):
data_harvester.retrieve_all_data()
self.content_identifier = data_harvester.data

    def harvest_github(self):
        if self.use_github:
            github_harvester = GithubHarvester(self.id)
            github_harvester.harvest()
            self.github_data = github_harvester.data
        else:
            self.github_data = {}
            # NOTE: Update list of metrics that are impacted by this as more are implemented.
            for m in ["FRSM-15-R1.1"]:
                self.logger.warning(f"{m} : Github support disabled, therefore skipping harvesting through Github API")

def retrieve_metadata_embedded(self):
self.metadata_harvester.retrieve_metadata_embedded()
self.metadata_unmerged.extend(self.metadata_harvester.metadata_unmerged)
@@ -512,7 +526,7 @@ def get_log_messages_dict(self):
logger_messages = {}
self.logger_message_stream.seek(0)
for log_message in self.logger_message_stream.readlines():
if log_message.startswith("FsF-"):
if log_message.startswith("FsF-") or log_message.startswith("FRSM-"):
m = log_message.split(":", 1)
metric = m[0].strip()
message_n_level = m[1].strip().split("|", 1)
@@ -541,7 +555,9 @@ def get_assessment_summary(self, results):
}
for res_k, res_v in enumerate(results):
if res_v.get("metric_identifier"):
metric_match = re.search(r"^FsF-(([FAIR])[0-9](\.[0-9])?)-", str(res_v.get("metric_identifier")))
metric_match = re.search(
r"^(?:FRSM-[0-9]+|FsF)-(([FAIR])[0-9](\.[0-9])?)", str(res_v.get("metric_identifier"))
) # match both FAIR4RS (FRSM) and FsF metrics
if metric_match.group(2) is not None:
fair_principle = metric_match[1]
fair_category = metric_match[2]
5 changes: 4 additions & 1 deletion fuji_server/controllers/fair_object_controller.py
@@ -39,6 +39,7 @@ async def assess_by_id(body):
oaipmh_endpoint = body.get("oaipmh_endpoint")
metadata_service_type = body.get("metadata_service_type")
usedatacite = body.get("use_datacite")
usegithub = body.get("use_github")
metric_version = body.get("metric_version")
print("BODY METRIC", metric_version)
auth_token = body.get("auth_token")
@@ -56,6 +57,7 @@ async def assess_by_id(body):
metadata_service_url=metadata_service_endpoint,
metadata_service_type=metadata_service_type,
use_datacite=usedatacite,
use_github=usegithub,
oaipmh_endpoint=oaipmh_endpoint,
metric_version=metric_version,
)
@@ -80,10 +82,11 @@ async def assess_by_id(body):
if ft.repeat_pid_check:
ft.retrieve_metadata_external(ft.pid_url, repeat_mode=True)
ft.harvest_re3_data()
ft.harvest_github()
core_metadata_result = ft.check_minimal_metatadata()
# print(ft.metadata_unmerged)
content_identifier_included_result = ft.check_data_identifier_included_in_metadata()
# print('F-UJI checks: accsee level')
# print('F-UJI checks: access level')
access_level_result = ft.check_data_access_level()
# print('F-UJI checks: license')
license_result = ft.check_license()
21 changes: 21 additions & 0 deletions fuji_server/data/README.md
@@ -0,0 +1,21 @@
# Data files


- [`linked_vocabs/*_ontologies.json`](./linked_vocabs)
- [`access_rights.json`](./access_rights.json): Lists COAR, EPRINTS, EU, OPENAIRE access rights. Used for evaluation of the data access level, FsF-A1-01M, which looks for metadata item `access_level`.
- [`bioschemastypes.txt`](./bioschemastypes.txt)
- [`creativeworktypes.txt`](./creativeworktypes.txt)
- [`default_namespaces.txt`](./default_namespaces.txt): Excluded during evaluation of the semantic vocabulary, FsF-I2-01M.
- [`file_formats.json`](./file_formats.json): Dictionary of scientific file formats. Used in evaluation of FsF-R1.3-02D to check the file format of the data.
- [`google_cache.db`](./google_cache.db): Used for evaluating FsF-F4-01M (searchability in major catalogues like DataCite registry, Google Dataset, Mendeley, ...). Google Data search is queried for a PID in column `google_links`. It's a dataset with metadata about datasets that have a DOI or persistent identifier from `identifiers.org`.
- [`identifiers_org_resolver_data.json`](./identifiers_org_resolver_data.json): Used in [`IdentifierHelper`](../helper/identifier_helper.py).
- [`jsonldcontext.json`](./jsonldcontext.json)
- [`licenses.json`](./licenses.json): Used to populate `Preprocessor.license_names`, a list of SPDX licences. Used in evaluation of licenses, FsF-R1.1-01M.
- [`linked_vocab.json`](./linked_vocab.json)
- [`longterm_formats.json`](./longterm_formats.json): This isn't used any more (code is commented out). Instead, the info should be pulled from [`file_formats.json`](./file_formats.json).
- [`metadata_standards_uris.json`](./metadata_standards_uris.json)
- [`metadata_standards.json`](./metadata_standards.json): Used in evaluation of community metadata, FsF-R1.3-01M.
- [`open_formats.json`](./open_formats.json): This isn't used any more (code is commented out). Instead, the info should be pulled from [`file_formats.json`](./file_formats.json).
- [`repodois.yaml`](./repodois.yaml): DOIs from re3data (DataCite).
- [`ResourceTypes.txt`](./ResourceTypes.txt)
- [`standard_uri_protocols.json`](./standard_uri_protocols.json): Used for evaluating access through standardised protocols (FsF-A1-03D). Maps each acronym to its long name (e.g. FTP, SFTP, HTTP).