Merge pull request #478 from softwaresaved/software

Added support for FAIR4RS metrics
pangaea-data-publisher · Feb 28, 2024 · b30b780
2 parents 64d0951 + 732b48e

Showing 79 changed files with 2,036 additions and 160 deletions.
diff --git a/.gitignore b/.gitignore
@@ -21,6 +21,15 @@
 # vim
 *.swp
 
+# database
+*.db
+
+# local copies of Google database loading files
+fuji_server/helper/catalogue_helper_google_datasearch_copy.py
+fuji_server/helper/create_google_cache_db_copy.py
+
+# private config
+fuji_server/config/github.cfg
 
 # Created by https://www.gitignore.io/api/python,linux,macos
 

diff --git a/README.md b/README.md
@@ -65,6 +65,8 @@ The F-UJI server can now be started with:
 python -m fuji_server -c fuji_server/config/server.ini
 ```
 
+The OpenAPI user interface is then available at <http://localhost:1071/fuji/api/v1/ui/>.
+
 ### Docker-based installation
 
 ```bash
@@ -76,7 +78,7 @@ To access the OpenAPI user interface, open the URL below in the browser:
 
 Your OpenAPI definition lives here:
 
-<http://localhost:1071/fuji/api/v1/swagger.json>
+<http://localhost:1071/fuji/api/v1/openapi.json>
 
 You can provide a different server config file this way:
 
@@ -100,6 +102,129 @@ If you receive the exception `urllib2.URLError: <urlopen error [SSL: CERTIFICATE
 
 F-UJI is using [basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication), so username and password have to be provided for each REST call which can be configured in `fuji_server/config/users.py`.
 
+## Development
+
+First, make sure to read the [contribution guidelines](./CONTRIBUTING.md).
+They include instructions on how to set up your environment with `pre-commit` and how to run the tests.
+
+The repository includes a [simple web client](./simpleclient/) suitable for interacting with the API during development.
+One way to run it would be with a LEMP stack (Linux, Nginx, MySQL, PHP), which is described in the following.
+
+First, install the necessary packages:
+
+```bash
+sudo apt-get update
+sudo apt-get install nginx
+sudo ufw allow 'Nginx HTTP'
+sudo service mysql start  # expects that mysql is already installed, if not run sudo apt install mysql-server
+sudo service nginx start
+sudo apt install php8.1-fpm php-mysql
+sudo apt install php8.1-curl
+sudo phpenmod curl
+```
+
+Next, configure the service by running `sudo vim /etc/nginx/sites-available/fuji-dev` and paste:
+
+```nginx
+server {
+    listen 9000;
+    server_name fuji-dev;
+    root /var/www/fuji-dev;
+
+    index index.php;
+
+    location / {
+        try_files $uri $uri/ =404;
+    }
+
+    location ~ \.php$ {
+        include snippets/fastcgi-php.conf;
+        fastcgi_pass unix:/var/run/php/php8.1-fpm.sock;
+     }
+
+    location ~ /\.ht {
+        deny all;
+    }
+}
+```
+
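+Link `simpleclient/index.php` and `simpleclient/icons/` to `/var/www/fuji-dev` by running `sudo ln -s <path_to_fuji>/fuji/simpleclient/* /var/www/fuji-dev/` (use an absolute path; a plain `ln` would attempt hard links, which cannot be created for directories such as `icons/`). You might need to adjust the file permissions to allow non-root writes.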
+
+Next, enable the site, check the configuration and reload the services:
+```bash
+sudo ln -s /etc/nginx/sites-available/fuji-dev /etc/nginx/sites-enabled/
+sudo nginx -t
+sudo service nginx reload
+sudo service php8.1-fpm start
+```
+
+The web client should now be available at <http://localhost:9000/>. Make sure to adjust the username and password in [`simpleclient/index.php`](./simpleclient/index.php).
+
+After a restart, it may be necessary to start the services again:
+
+```bash
+sudo service php8.1-fpm start
+sudo service nginx start
+python -m fuji_server -c fuji_server/config/server.ini
+```
+
+### Component interaction (walkthrough)
+
+This walkthrough can guide you through the comprehensive codebase.
+
+A good starting point is [`fair_object_controller/assess_by_id`](fuji_server/controllers/fair_object_controller.py#36).
+Here, we create a [`FAIRCheck`](fuji_server/controllers/fair_check.py) object called `ft`.
+This reads the metrics YAML file during initialisation and will provide all the `check` methods.
+
+Next, several harvesting methods are called, first [`harvest_all_metadata`](fuji_server/controllers/fair_check.py#329), followed by [`harvest_re3_data`](fuji_server/controllers/fair_check.py#345) (DataCite) and [`harvest_github`](fuji_server/controllers/fair_check.py#366) and finally [`harvest_all_data`](fuji_server/controllers/fair_check.py#359).
+The harvesters are implemented separately in [`harvester/`](./fuji_server/harvester/), and each of them collects different kinds of data.
+The harvesters always run, regardless of the defined metrics.
+- The metadata harvester looks through HTML markup following schema.org, Dublin Core, etc., as well as through signposting/typed links.
+Ideally, it can find things like author information or license names that way.
+- The data harvester is only run if the metadata harvester finds an `object_content_identifier` pointing at content files.
+Then, the data harvester runs over the files and checks things like the file format.
+- The GitHub harvester connects with the GitHub API to retrieve metadata and data from software repositories.
+It relies on an access token being defined in [`config/github.cfg`](./fuji_server/config/github.cfg).
+
+After harvesting, all evaluators are called.
+Each specific evaluator, e.g. [`FAIREvaluatorLicense`](fuji_server/evaluators/fair_evaluator_license.py), is associated with a specific FsF and/or FAIR4RS metric.
+Before the evaluator runs any checks on the harvested data, it asserts that its associated metric is listed in the metrics YAML file.
+Only if it is listed does the evaluator run its checks and compute a local score.
+
+In the end, all scores are aggregated into F, A, I, R scores.
+
+### Adding support for new metrics
+
+Start by adding a new metrics YAML file in [`yaml/`](./fuji_server/yaml).
+Its name has to match the following regular expression: `(metrics_v)?([0-9]+\.[0-9]+)(_[a-z]+)?(\.yaml)`,
+and the content should be structured similarly to the existing metric files.
+
+Metric names are tested for validity using regular expressions throughout the code.
+If your metric names do not match those, not all components of the tool will execute as expected, so make sure to adjust the expressions.
+Regular expression groups are also used for mapping to F, A, I, R categories for scoring, and debug messages are only displayed if they are associated with a valid metric.
+
+Evaluators are mapped to metrics in their `__init__` methods, so adjust existing evaluators to associate with your metric as well or define new evaluators if needed.
+The multiple test methods within an evaluator also check whether their specific test is defined.
+[`FAIREvaluatorLicense`](fuji_server/evaluators/fair_evaluator_license.py) is an example of an evaluator corresponding to metrics from different sources.
+
+For each metric, the maturity is determined as the maximum of the maturity associated with each passed test.
+This means that if a test indicating maturity 3 is passed and one indicating maturity 2 is not passed, the metric will still be shown to be fulfilled with maturity 3.
+
+### Updates to the API
+
+Making changes to the API requires re-generating parts of the code using Swagger.
+First, edit [`fuji_server/yaml/openapi.yaml`](fuji_server/yaml/openapi.yaml).
+Then, use the [Swagger Editor](https://editor.swagger.io/) to generate a python-flask server.
+The generated code should be downloaded automatically as a ZIP archive.
+Unzip the archive.
+
+Next:
+1. Place the files in `swagger_server/models` into `fuji_server/models`, except `swagger_server/models/__init__.py`.
+2. Rename all occurrences of `swagger_server` to `fuji_server`.
+3. Add the content of `swagger_server/models/__init__.py` into `fuji_server/__init__.py`.
+
+Unfortunately, the Swagger Editor doesn't always produce code that is compliant with PEP standards.
+Run `pre-commit run` (or try to commit) and fix any errors that cannot be automatically fixed.
 
 ## License
 This project is licensed under the MIT License; for more details, see the [LICENSE](https://github.com/pangaea-data-publisher/fuji/blob/master/LICENSE) file.
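
Note on the metric file naming rule in the README changes above: the following is a minimal sketch, assuming the whole file name has to match the quoted expression (hence `fullmatch`; how F-UJI itself applies the pattern is not shown in this diff). The helper names and the sample file names are hypothetical.

```python
import re

# Pattern quoted in the README section "Adding support for new metrics".
METRIC_FILE_PATTERN = re.compile(r"(metrics_v)?([0-9]+\.[0-9]+)(_[a-z]+)?(\.yaml)")


def is_valid_metric_filename(name: str) -> bool:
    """Return True if the complete file name matches the documented pattern."""
    return METRIC_FILE_PATTERN.fullmatch(name) is not None


def metric_version(name: str):
    """Return the version part (second group), or None if the name is invalid."""
    match = METRIC_FILE_PATTERN.fullmatch(name)
    return match.group(2) if match else None


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the pattern.
    for candidate in ["metrics_v0.5.yaml", "metrics_v0.7_software.yaml", "metrics_v1.yaml"]:
        print(candidate, is_valid_metric_filename(candidate), metric_version(candidate))
```

A name without a minor version (such as `metrics_v1.yaml`) or with a `.yml` extension is rejected by this pattern.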

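Note on the maturity rule described in the README changes above: the sketch below illustrates "maximum maturity over passed tests" on plain tuples. It is not F-UJI's implementation; the function name, the `(maturity, passed)` representation and the fallback value `0` for "no test passed" are assumptions made for illustration.

```python
def metric_maturity(tests):
    """Return the highest maturity among passed tests.

    `tests` is an iterable of (maturity, passed) pairs; this representation
    is hypothetical. Returns 0 if no test passed (an assumed fallback).
    """
    passed_levels = [maturity for maturity, passed in tests if passed]
    return max(passed_levels, default=0)


if __name__ == "__main__":
    # Maturity-3 test passed, maturity-2 test failed: the metric still reports 3.
    print(metric_maturity([(3, True), (2, False)]))
```

This mirrors the README's own example: a failed lower-maturity test does not reduce the reported maturity.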
diff --git a/fuji_server/__init__.py b/fuji_server/__init__.py
@@ -1,38 +1,69 @@
-# -*- coding: utf-8 -*-
-
 # SPDX-FileCopyrightText: 2020 PANGAEA (https://www.pangaea.de/)
 #
 # SPDX-License-Identifier: MIT
 
+# coding: utf-8
+
 # flake8: noqa
 from __future__ import absolute_import
 
 # import models into model package
-from fuji_server.models.any_of_fair_results_items import AnyOfFAIRResultsResultsItems
+from fuji_server.models.any_of_fair_results_results_items import AnyOfFAIRResultsResultsItems
 from fuji_server.models.body import Body
+from fuji_server.models.community_endorsed_standard import CommunityEndorsedStandard
+from fuji_server.models.community_endorsed_standard_output import CommunityEndorsedStandardOutput
+from fuji_server.models.community_endorsed_standard_output_inner import CommunityEndorsedStandardOutputInner
 from fuji_server.models.core_metadata import CoreMetadata
 from fuji_server.models.core_metadata_output import CoreMetadataOutput
+from fuji_server.models.data_access_level import DataAccessLevel
+from fuji_server.models.data_access_output import DataAccessOutput
+from fuji_server.models.data_content_metadata import DataContentMetadata
+from fuji_server.models.data_content_metadata_output import DataContentMetadataOutput
+from fuji_server.models.data_content_metadata_output_inner import DataContentMetadataOutputInner
+from fuji_server.models.data_file_format import DataFileFormat
+from fuji_server.models.data_file_format_output import DataFileFormatOutput
+from fuji_server.models.data_file_format_output_inner import DataFileFormatOutputInner
+from fuji_server.models.data_provenance import DataProvenance
+from fuji_server.models.data_provenance_output import DataProvenanceOutput
+from fuji_server.models.data_provenance_output_inner import DataProvenanceOutputInner
 from fuji_server.models.debug import Debug
 from fuji_server.models.fair_result_common import FAIRResultCommon
 from fuji_server.models.fair_result_common_score import FAIRResultCommonScore
+from fuji_server.models.fair_result_evaluation_criterium import FAIRResultEvaluationCriterium
 from fuji_server.models.fair_results import FAIRResults
+from fuji_server.models.formal_metadata import FormalMetadata
+from fuji_server.models.formal_metadata_output import FormalMetadataOutput
+from fuji_server.models.formal_metadata_output_inner import FormalMetadataOutputInner
+from fuji_server.models.harvest import Harvest
+from fuji_server.models.harvest_results import HarvestResults
+from fuji_server.models.harvest_results_metadata import HarvestResultsMetadata
 from fuji_server.models.identifier_included import IdentifierIncluded
 from fuji_server.models.identifier_included_output import IdentifierIncludedOutput
 from fuji_server.models.identifier_included_output_inner import IdentifierIncludedOutputInner
 from fuji_server.models.license import License
 from fuji_server.models.license_output import LicenseOutput
 from fuji_server.models.license_output_inner import LicenseOutputInner
+from fuji_server.models.metadata_preserved import MetadataPreserved
+from fuji_server.models.metadata_preserved_output import MetadataPreservedOutput
 from fuji_server.models.metric import Metric
 from fuji_server.models.metrics import Metrics
 from fuji_server.models.output_core_metadata_found import OutputCoreMetadataFound
 from fuji_server.models.output_search_mechanisms import OutputSearchMechanisms
 from fuji_server.models.persistence import Persistence
 from fuji_server.models.persistence_output import PersistenceOutput
+from fuji_server.models.persistence_output_inner import PersistenceOutputInner
 from fuji_server.models.related_resource import RelatedResource
 from fuji_server.models.related_resource_output import RelatedResourceOutput
 from fuji_server.models.related_resource_output_inner import RelatedResourceOutputInner
 from fuji_server.models.searchable import Searchable
 from fuji_server.models.searchable_output import SearchableOutput
+from fuji_server.models.semantic_vocabulary import SemanticVocabulary
+from fuji_server.models.semantic_vocabulary_output import SemanticVocabularyOutput
+from fuji_server.models.semantic_vocabulary_output_inner import SemanticVocabularyOutputInner
+from fuji_server.models.standardised_protocol_data import StandardisedProtocolData
+from fuji_server.models.standardised_protocol_data_output import StandardisedProtocolDataOutput
+from fuji_server.models.standardised_protocol_metadata import StandardisedProtocolMetadata
+from fuji_server.models.standardised_protocol_metadata_output import StandardisedProtocolMetadataOutput
 from fuji_server.models.uniqueness import Uniqueness
 from fuji_server.models.uniqueness_output import UniquenessOutput
 

diff --git a/fuji_server/config/github.ini b/fuji_server/config/github.ini
@@ -0,0 +1,3 @@
+[ACCESS]
+# set equal to access token if available to increase rate limit (usually starts with 'ghp_')
+token =
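
Note on the config file above: a minimal sketch of reading the optional token with the standard-library `configparser`. The section and option names are taken from the file in this diff; the function name and the sample token value are hypothetical, and how the harvester actually loads its configuration is not shown in this excerpt.

```python
import configparser

# Same layout as the file added in this diff, with an empty token.
EXAMPLE_INI = """\
[ACCESS]
# set equal to access token if available to increase rate limit
token =
"""


def read_github_token(ini_text: str):
    """Return the configured token, or None if it is missing or empty."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    token = parser.get("ACCESS", "token", fallback="").strip()
    return token or None


if __name__ == "__main__":
    print(read_github_token(EXAMPLE_INI))  # empty token
    print(read_github_token("[ACCESS]\ntoken = ghp_example\n"))  # placeholder value
```

Treating an empty value as "no token" lets callers fall back to unauthenticated requests with the lower rate limit.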
diff --git a/fuji_server/controllers/fair_check.py b/fuji_server/controllers/fair_check.py
@@ -37,6 +37,7 @@
 from fuji_server.evaluators.fair_evaluator_unique_identifier_data import FAIREvaluatorUniqueIdentifierData
 from fuji_server.evaluators.fair_evaluator_unique_identifier_metadata import FAIREvaluatorUniqueIdentifierMetadata
 from fuji_server.harvester.data_harvester import DataHarvester
+from fuji_server.harvester.github_harvester import GithubHarvester
 from fuji_server.harvester.metadata_harvester import MetadataHarvester
 from fuji_server.helper.linked_vocab_helper import linked_vocab_helper
 from fuji_server.helper.metadata_collector import MetadataOfferingMethods
@@ -80,6 +81,7 @@ def __init__(
         metadata_service_url=None,
         metadata_service_type=None,
         use_datacite=True,
+        use_github=False,
         verify_pids=True,
         oaipmh_endpoint=None,
         metric_version=None,
@@ -137,6 +139,7 @@ def __init__(
 
         self.rdf_collector = None
         self.use_datacite = use_datacite
+        self.use_github = use_github
         self.repeat_pid_check = False
         self.logger_message_stream = io.StringIO()
         logging.addLevelName(self.LOG_SUCCESS, "SUCCESS")
@@ -347,6 +350,17 @@ def harvest_all_data(self):
             data_harvester.retrieve_all_data()
             self.content_identifier = data_harvester.data
 
+    def harvest_github(self):
+        if self.use_github:
+            github_harvester = GithubHarvester(self.id)
+            github_harvester.harvest()
+            self.github_data = github_harvester.data
+        else:
+            self.github_data = {}
+            # NOTE: Update list of metrics that are impacted by this as more are implemented.
+            for m in ["FRSM-15-R1.1"]:
+                self.logger.warning(f"{m} : Github support disabled, therefore skipping harvesting through Github API")
+
     def retrieve_metadata_embedded(self):
         self.metadata_harvester.retrieve_metadata_embedded()
         self.metadata_unmerged.extend(self.metadata_harvester.metadata_unmerged)
@@ -512,7 +526,7 @@ def get_log_messages_dict(self):
         logger_messages = {}
         self.logger_message_stream.seek(0)
         for log_message in self.logger_message_stream.readlines():
-            if log_message.startswith("FsF-"):
+            if log_message.startswith("FsF-") or log_message.startswith("FRSM-"):
                 m = log_message.split(":", 1)
                 metric = m[0].strip()
                 message_n_level = m[1].strip().split("|", 1)
@@ -541,7 +555,9 @@ def get_assessment_summary(self, results):
         }
         for res_k, res_v in enumerate(results):
             if res_v.get("metric_identifier"):
-                metric_match = re.search(r"^FsF-(([FAIR])[0-9](\.[0-9])?)-", str(res_v.get("metric_identifier")))
+                metric_match = re.search(
+                    r"^(?:FRSM-[0-9]+|FsF)-(([FAIR])[0-9](\.[0-9])?)", str(res_v.get("metric_identifier"))
+                )  # match both FRSM (FAIR4RS) and FsF metrics
                 if metric_match.group(2) is not None:
                     fair_principle = metric_match[1]
                     fair_category = metric_match[2]
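
Note on the updated expression in `get_assessment_summary`: the sketch below applies the same pattern to identifiers that appear elsewhere in this diff. The helper name is hypothetical, and the explicit `None` check is an addition for illustration; the code in the diff calls `metric_match.group(2)` directly.

```python
import re

# Pattern from the updated get_assessment_summary hunk.
METRIC_ID_PATTERN = re.compile(r"^(?:FRSM-[0-9]+|FsF)-(([FAIR])[0-9](\.[0-9])?)")


def principle_and_category(metric_identifier):
    """Return (fair_principle, fair_category), or None for unknown identifiers."""
    match = METRIC_ID_PATTERN.search(str(metric_identifier))
    if match is None:
        return None
    return match[1], match[2]


if __name__ == "__main__":
    for identifier in ["FsF-R1.1-01M", "FRSM-15-R1.1", "not-a-metric"]:
        print(identifier, principle_and_category(identifier))
```

Both naming schemes map to the same principle and category, which is what allows FRSM results to be aggregated into the F, A, I, R scores alongside FsF results.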

diff --git a/fuji_server/controllers/fair_object_controller.py b/fuji_server/controllers/fair_object_controller.py
@@ -39,6 +39,7 @@ async def assess_by_id(body):
         oaipmh_endpoint = body.get("oaipmh_endpoint")
         metadata_service_type = body.get("metadata_service_type")
         usedatacite = body.get("use_datacite")
+        usegithub = body.get("use_github")
         metric_version = body.get("metric_version")
         print("BODY METRIC", metric_version)
         auth_token = body.get("auth_token")
@@ -56,6 +57,7 @@ async def assess_by_id(body):
             metadata_service_url=metadata_service_endpoint,
             metadata_service_type=metadata_service_type,
             use_datacite=usedatacite,
+            use_github=usegithub,
             oaipmh_endpoint=oaipmh_endpoint,
             metric_version=metric_version,
         )
@@ -80,10 +82,11 @@ async def assess_by_id(body):
         if ft.repeat_pid_check:
             ft.retrieve_metadata_external(ft.pid_url, repeat_mode=True)
         ft.harvest_re3_data()
+        ft.harvest_github()
         core_metadata_result = ft.check_minimal_metatadata()
         # print(ft.metadata_unmerged)
         content_identifier_included_result = ft.check_data_identifier_included_in_metadata()
-        # print('F-UJI checks: accsee level')
+        # print('F-UJI checks: access level')
         access_level_result = ft.check_data_access_level()
         # print('F-UJI checks: license')
         license_result = ft.check_license()

diff --git a/fuji_server/data/README.md b/fuji_server/data/README.md
@@ -0,0 +1,21 @@
+# Data files
+
+
+- [`linked_vocabs/*_ontologies.json`](./linked_vocabs)
+- [`access_rights.json`](./access_rights.json): Lists COAR, EPRINTS, EU, OPENAIRE access rights. Used for evaluation of the data access level, FsF-A1-01M, which looks for metadata item `access_level`.
+- [`bioschemastypes.txt`](./bioschemastypes.txt)
+- [`creativeworktypes.txt`](./creativeworktypes.txt)
+- [`default_namespaces.txt`](./default_namespaces.txt): Excluded during evaluation of the semantic vocabulary, FsF-I2-01M.
+- [`file_formats.json`](./file_formats.json): Dictionary of scientific file formats. Used in evaluation of FsF-R1.3-02D to check the file format of the data.
+- [`google_cache.db`](./google_cache.db): Used for evaluating FsF-F4-01M (searchability in major catalogues like DataCite registry, Google Dataset Search, Mendeley, ...). The PID is looked up in column `google_links`. The database holds metadata about datasets that have a DOI or a persistent identifier from `identifiers.org`.
+- [`identifiers_org_resolver_data.json`](./identifiers_org_resolver_data.json): Used in [`IdentifierHelper`](fuji_server/helper/identifier_helper.py).
+- [`jsonldcontext.json`](./jsonldcontext.json)
+- [`licenses.json`](./licenses.json): Used to populate `Preprocessor.license_names`, a list of SPDX licences. Used in evaluation of licenses, FsF-R1.1-01M.
+- [`linked_vocab.json`](./linked_vocab.json)
+- [`longterm_formats.json`](./longterm_formats.json): This isn't used any more (code is commented out). Instead, the info should be pulled from [`file_formats.json`](./file_formats.json).
+- [`metadata_standards_uris.json`](./metadata_standards_uris.json)
+- [`metadata_standards.json`](./metadata_standards.json): Used in evaluation of community metadata, FsF-R1.3-01M.
+- [`open_formats.json`](./open_formats.json): This isn't used any more (code is commented out). Instead, the info should be pulled from [`file_formats.json`](./file_formats.json).
+- [`repodois.yaml`](./repodois.yaml): DOIs from re3data (DataCite).
+- [`ResourceTypes.txt`](./ResourceTypes.txt)
+- [`standard_uri_protocols.json`](./standard_uri_protocols.json): Used for evaluating access through standardised protocols (FsF-A1-03D). Maps each acronym to its long name (e.g. FTP, SFTP, HTTP).