Damiano Piovesan edited this page Jan 24, 2024 · 32 revisions

The words aspect, namespace, and sub-ontology are used interchangeably throughout this documentation.

Workflow

[Workflow diagram]

Parsing

Ontology file - Only the OBO format is accepted. The following rules are applied:

  • Obsolete terms are always excluded.
  • Only "is_a" and "part_of" relationships are considered. You can modify this behaviour by calling the obo_parser function with the valid_rel argument.
  • Cross-aspect (cross-namespace) relationships are always discarded.
  • Alternative term identifiers are automatically mapped to canonical identifiers both in the prediction and ground truth inputs.
  • When information accretion is provided, terms which are not available in the accretion file are removed from the ontology.
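The parsing rules above can be sketched in stand-alone code. This is a hypothetical illustration, not the actual cafaeval implementation: it drops obsolete terms, keeps only "is_a"/"part_of" edges (mimicking the valid_rel argument), and discards cross-namespace relationships. Alternative-identifier mapping is omitted for brevity.

```python
from collections import defaultdict

VALID_REL = {"is_a", "part_of"}  # override to mimic the valid_rel argument

def parse_obo(text, valid_rel=VALID_REL):
    """Parse OBO text into {term: namespace} and {term: [(parent, rel)]}."""
    terms, edges = {}, defaultdict(list)
    for stanza in text.split("[Term]")[1:]:
        fields = defaultdict(list)
        for line in stanza.strip().splitlines():
            if ": " in line:
                key, val = line.split(": ", 1)
                fields[key].append(val)
        if fields.get("is_obsolete") == ["true"]:
            continue  # obsolete terms are always excluded
        tid = fields["id"][0]
        terms[tid] = fields["namespace"][0]
        for parent in fields.get("is_a", []):
            edges[tid].append((parent.split(" ! ")[0], "is_a"))
        for rel in fields.get("relationship", []):
            rel_type, parent = rel.split()[:2]
            if rel_type in valid_rel:  # only valid relationship types
                edges[tid].append((parent, rel_type))
    # cross-aspect (cross-namespace) relationships are always discarded
    edges = {t: [(p, r) for p, r in ps if terms.get(p) == terms[t]]
             for t, ps in edges.items()}
    return terms, edges

obo = """
[Term]
id: GO:0000001
namespace: molecular_function

[Term]
id: GO:0000002
namespace: molecular_function
is_a: GO:0000001 ! root

[Term]
id: GO:0000003
namespace: biological_process
relationship: part_of GO:0000001 ! crosses namespaces, dropped

[Term]
id: GO:0000004
namespace: molecular_function
is_obsolete: true
"""
terms, edges = parse_obo(obo)
```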

Prediction folder - Prediction files inside the prediction folder are filtered, keeping only targets included in the ground truth and only terms included in the ontology file. If the ground truth contains annotations from only one aspect (e.g. "molecular function"), the evaluation is provided only for that aspect.

Internal representation and memory usage

  • The algorithm stores in memory a Numpy boolean N x M array (N = number of ground truth targets; M = number of ontology terms of a single aspect) for each aspect in the ground truth file.

  • An array of the same size (rows ≤ N), but containing floats (the prediction scores) instead of booleans, is stored for each prediction file. Prediction files are processed one at a time, and the matrix is reassigned for each file.
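The two matrix types can be illustrated with a toy example (variable names here are hypothetical, not those of the actual implementation). Note how the boolean ground truth matrix is much smaller than the float prediction matrix of the same shape:

```python
import numpy as np

# Rows = ground truth targets (N), columns = ontology terms of one aspect (M)
targets = ["T1", "T2", "T3"]              # N = 3
terms = ["GO:a", "GO:b", "GO:c", "GO:d"]  # M = 4
target_idx = {t: i for i, t in enumerate(targets)}
term_idx = {t: i for i, t in enumerate(terms)}

# Ground truth: one boolean N x M matrix per aspect
gt = np.zeros((len(targets), len(terms)), dtype=bool)
gt[target_idx["T1"], term_idx["GO:a"]] = True
gt[target_idx["T1"], term_idx["GO:b"]] = True

# Predictions: a float matrix of the same shape holding scores;
# it is reassigned for every prediction file processed
pred = np.zeros((len(targets), len(terms)), dtype=np.float64)
pred[target_idx["T1"], term_idx["GO:a"]] = 0.9
pred[target_idx["T1"], term_idx["GO:c"]] = 0.4

print(gt.nbytes, pred.nbytes)  # 12 bytes (bool) vs 96 bytes (float64)
```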

Propagation and topological sorting

Both the predictions and the ground truth annotations are always propagated up to the ontology root(s). The topologically sorted list of nodes optimizes the propagation process: the prediction and ground truth matrices are scanned only once, following the indices provided in the sorting vector.

Two propagation strategies are available:

  • max - scores are propagated by always taking the maximum between a parent's score and its children's scores.
  • fill - prediction scores are propagated without overwriting the scores already assigned to the parents.

Evaluation

Critical Assessment of protein Function Annotation (CAFA)

Previous CAFA challenges

In order to replicate CAFA results, you can simply adapt the input files.

  • No/partial knowledge can be reproduced by filtering/splitting the ground truth file
  • In order to exclude specific terms from the analyses, e.g. generic "binding" terms, you can directly modify the input ontology file
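A no/partial-knowledge split can be reproduced with a few lines of code. This sketch assumes the ground truth is a tab-separated file with the target identifier in the first column (as in the test_terms.tsv style); the function name and data are hypothetical:

```python
def filter_ground_truth(lines, keep_targets):
    """Keep only ground truth rows whose target is in keep_targets."""
    kept = []
    for line in lines:
        target = line.split("\t", 1)[0]  # first column = target ID
        if target in keep_targets:
            kept.append(line)
    return kept

rows = ["T1\tGO:0000001", "T2\tGO:0000002", "T3\tGO:0000001"]
print(filter_ground_truth(rows, {"T1", "T3"}))
# ['T1\tGO:0000001', 'T3\tGO:0000001']
```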

CAFA5 / Kaggle

Owing to its reliability and accuracy, the organizers selected CAFA-evaluator as the official evaluation software in the CAFA5 Kaggle competition. On Kaggle, the software is executed with the following command:

cafaeval go-basic.obo prediction_dir test_terms.tsv -ia IA.txt -prop fill -norm cafa -th_step 0.001 -max_terms 500

In the example above, the method prediction file should be placed inside the prediction_dir folder; it is evaluated against the test_terms.tsv file (not available to participants), which contains the ground truth.
