Merge pull request #17 from AuReMe/precomputed_db

Precomputed db
AuReMe · Jan 27, 2025 · acf38e8
2 parents 536536c + 45222c8
commit acf38e8
Showing 33 changed files with 787 additions and 9,342 deletions.
diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
@@ -58,7 +58,7 @@ jobs:
         pytest test_proteomes.py
         pytest test_clustering.py
         pytest test_annotation.py
-        pytest test_workflow.py
+        pytest test_workflow_uniprot.py
         pytest test_precomputed.py
         pytest test_eggnog.py
         pytest test_report.py
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,15 +1,30 @@
 # Changelog
 
-# EsMeCaTa v0.5.4 (2024-11-09)
+# EsMeCaTa v0.6.0 (2025-01-27)
 
 ## Add
 
-* Create database from different output folders of esmecata (`from_runs`).
+* New command `esmecata_create_db` to create database from different output folders of esmecata (`from_runs`).
+* Full release of `esmecata precomputed` associated with the first version of [esmecata precomputed database](https://doi.org/10.5281/zenodo.13354073).
+* Option threshold (`-t`) to precomputed.
+* Add `--gseapyCutOff` option to `gseapy_enrichr`.
+* A check after database creation to detect taxa with few predicted proteins compared to their higher affiliated taxon.
 * Check the good format of the gzip file.
+* Header `KEGG_reaction` in annotation_reference from `annotation_uniprot` to avoid issues with `esmecata_create_db`.
 
 ## Fix
 
 * Issue with protein IDs from UniParc during annotation (incorrect split on '|').
+* Fix issue in `get_taxon_obs_name` function.
+* Issues in test.
+
+## Modify
+
+* Add database version in log.
+* Rename `test_workflow.py` into `test_workflow_uniprot.py`, to better reflect what is done.
+* Update workflow figure.
+* Update readme.
+* Update article_data folder and the associated readme.
 
 # EsMeCaTa v0.5.4 (2024-11-06)
 

diff --git a/README.md b/README.md
diff --git a/article_data/README.md b/article_data/README.md
@@ -3,22 +3,26 @@
 ## Table of contents
 - [Input files from the article](#input-files-from-the-article)
   - [Table of contents](#table-of-contents)
-  - [Experiments](#experiments)
-    - [Manually selected taxa](#manually-selected-taxa)
-    - [MGnify validation](#mgnify-validation)
-    - [Biogas reactor](#biogas-reactor)
-    - [Algae symbionts](#algae-symbionts)
-  - [Old Experiments](#old-experiments)
+  - [Experiments (preprint of 2025)](#experiments-preprint-of-2025)
+    - [Datasets](#datasets)
+      - [Manually selected taxa](#manually-selected-taxa)
+      - [MGnify validation](#mgnify-validation)
+      - [Methanogenic reactor](#methanogenic-reactor)
+      - [Algae symbionts](#algae-symbionts)
+    - [Reproduce experiments](#reproduce-experiments)
+  - [Old Experiments (preprint of 2022)](#old-experiments-preprint-of-2022)
     - [Taxonomic affiliations from Gammaproteobacteria and Alveolata](#taxonomic-affiliations-from-gammaproteobacteria-and-alveolata)
     - [Taxonomic affiliations from 16S and rpoB](#taxonomic-affiliations-from-16s-and-rpob)
 
-## Experiments
+## Experiments (preprint of 2025)
 
-### Manually selected taxa
+### Datasets
+
+#### Manually selected taxa
 
 A folder containing an input file for esmecata containing 13 manually selected taxonomic affiliations from Gammaproteobacteria and Alveolata.
 
-### MGnify validation
+#### MGnify validation
 
 4 input files associated with dataset from MGnify:
 - [honeybee gut v1.0](https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/honeybee-gut/v1.0/)
@@ -29,15 +33,50 @@ A folder containing an input file for esmecata containing 13 manually selected t
 For each dataset, it contains all the metadata associated with the genomes of the dataset. The columns `observation_name` and `taxonomic_affiliation` are added so the file can be used by EsMeCaTa.
 The column `Completness` is used to filter the genomes (for the article by using the threshold of at least 90% of Completness).
 
-### Biogas reactor
+#### Methanogenic reactor
 
-An OTU table from a biogas reactor experiment containing the taxonomic assignment and the abundance of each OTU in different samples (corresponding to time of measurements of the biogas reactor).
+An OTU table from a methanogenic reactor experiment containing the taxonomic assignment and the abundance of each OTU in different samples (corresponding to the measurement time points of the methanogenic reactor).
 
-### Algae symbionts
+#### Algae symbionts
 
 Microbiotes from Metagenomes experiment ([Burgunter et al. 2020](https://doi.org/10.3389/fmars.2020.00085) and [KleinJan et al. 2023](https://doi.org/10.1111/mec.16766)). 35 MAGs were selected from KleinJan et al. 2023 with more than 90% completion.
 
-## Old Experiments
+### Reproduce experiments
+
+The experiments made in the article of EsMeCaTa (a preprint is available at [bioRxiv](https://doi.org/10.1101/2022.03.16.484574)) are available in this [Zenodo archive](https://zenodo.org/records/14502342).
+
+This archive also contains several precomputed databases that can be used to try to reproduce these experiments.
+
+To run these experiments, you will have to install esmecata with: `pip install esmecata`
+
+To do so, download one of the precomputed databases associated with the dataset you want to run (such as `precomputed_db_honeybee.zip`). Then use the `esmecata precomputed` command to apply this database to an input file. For example:
+
+```
+esmecata precomputed -i honeybee_esmecata_metdata.tsv -d precomputed_db_honeybee.zip -o output_folder_honeybee
+```
+
+For a better chance of reproducing the results, it is recommended to use the NCBI Taxonomy database associated with the dataset. The NCBI Taxonomy database can be downloaded from the Zenodo archive (file `ncbi_taxonomy_database.zip`). To know which version to use, refer to the following table:
+
+| dataset                   | UniProt | NCBI Taxonomy |
+|---------------------------|---------|---------------|
+| Toy example               | 2023_04 | 09-2023       |
+| E. siliculosus microbiota | 2023_02 | 04-2023       |
+| Honeybee gut              | 2023_05 | 12-2023       |
+| Human Oral                | 2023_05 | 12-2023       |
+| Marine                    | 2023_05 | 12-2023       |
+| Pig Gut                   | 2023_05 | 12-2023       |
+| Methanogenic reactor      | 2024_01 | 01-2024       |
+
+You can update the NCBI Taxonomy database used by ete3 with the following command (here we use as example `taxdmp_2024-01.tar.gz` associated with the methanogenic reactor):
+
+```
+python3 -c "from ete3 import NCBITaxa; ncbi = NCBITaxa(); ncbi.update_taxonomy_database('taxdmp_2024-01.tar.gz')"
+```
+
+Running `esmecata precomputed` will then create an output folder containing the predictions for the associated organisms of the community.
+
+## Old Experiments (preprint of 2022)
 
 In the [first preprint](https://www.biorxiv.org/content/10.1101/2022.03.16.484574v1) two experiments were performed:
 

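As a reading aid for the "Reproduce experiments" section added to `article_data/README.md`, here is a minimal Python sketch that records the dataset/version table from that section and assembles the `esmecata precomputed` call as an argument list. The names `VERSIONS` and `precomputed_command` are hypothetical and not part of EsMeCaTa; only the `-i`, `-d` and `-o` options and the file names from the README example are taken from the diff.

```python
# Versions copied from the table in the README section:
# dataset -> (UniProt release, NCBI Taxonomy month-year).
VERSIONS = {
    "Toy example": ("2023_04", "09-2023"),
    "E. siliculosus microbiota": ("2023_02", "04-2023"),
    "Honeybee gut": ("2023_05", "12-2023"),
    "Human Oral": ("2023_05", "12-2023"),
    "Marine": ("2023_05", "12-2023"),
    "Pig Gut": ("2023_05", "12-2023"),
    "Methanogenic reactor": ("2024_01", "01-2024"),
}


def precomputed_command(input_tsv, database_zip, output_folder):
    """Build the argument list for `esmecata precomputed` (hypothetical helper)."""
    return [
        "esmecata", "precomputed",
        "-i", input_tsv,
        "-d", database_zip,
        "-o", output_folder,
    ]


# File names below are the ones shown in the README example.
cmd = precomputed_command(
    "honeybee_esmecata_metdata.tsv",
    "precomputed_db_honeybee.zip",
    "output_folder_honeybee",
)
uniprot_version, taxonomy_version = VERSIONS["Honeybee gut"]
print(" ".join(cmd))
print(f"UniProt {uniprot_version}, NCBI Taxonomy {taxonomy_version}")
```

Once esmecata is installed, the list could be passed to `subprocess.run(cmd, check=True)`; the sketch itself only builds and prints the command.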
diff --git a/...le_data/biogas_reactor/biogas_reactor.tsv → ...anogenic_reactor/methanogenic_reactor.tsv b/...le_data/biogas_reactor/biogas_reactor.tsv → ...anogenic_reactor/methanogenic_reactor.tsv
diff --git a/esmecata/__init__.py b/esmecata/__init__.py
@@ -1,4 +1,4 @@
-# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
+# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
 # Univ. Grenoble Alpes, Inria, Microcosme
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -13,4 +13,4 @@
 # You should have received a copy of the GNU General Public License
 # along with this program. If not, see <http://www.gnu.org/licenses/>
 
-__version__ = '0.5.5'
+__version__ = '0.6.0'
diff --git a/esmecata/__main__.py b/esmecata/__main__.py
@@ -1,4 +1,4 @@
-# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
+# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
 # Univ. Grenoble Alpes, Inria, Microcosme
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -139,7 +139,7 @@ def main():
         '--update-affiliations',
         dest='update_affiliations',
         help='''If the taxonomic affiliations were assigned from an outdated taxonomic database, this can lead to taxon not be found in ete3 database. \
-            This option tries to udpate the taxonomic affiliations using the lowest taxon name.''',
+            This option tries to update the taxonomic affiliations using the lowest taxon name.''',
         required=False,
         action='store_true',
         default=None)
@@ -386,7 +386,8 @@ def main():
         help='Use precomputed database to create estimated data for the run.',
         parents=[
             parent_parser_i_taxon, parent_parser_d, parent_parser_o,
-            parent_parser_rank_limit, parent_parser_update_affiliation
+            parent_parser_rank_limit, parent_parser_update_affiliation,
+            parent_parser_thr
             ],
         allow_abbrev=False)
 
@@ -460,7 +461,8 @@ def main():
                                 args.linclust, args.minimal_number_proteomes, args.update_affiliations,
                                 args.option_bioservices, args.eggnog_tmp_dir, args.no_dbmem)
     elif args.cmd == 'precomputed':
-        precomputed_parse_affiliation(args.input, args.database, args.output, args.rank_limit, args.update_affiliations)
+        precomputed_parse_affiliation(args.input, args.database, args.output, args.rank_limit, args.update_affiliations,
+                                      args.threshold_clustering)
 
     logger.info("--- Total runtime %.2f seconds ---" % (time.time() - start_time))
     logger.warning(f'--- Logs written in {log_file_path} ---')

diff --git a/esmecata/__main_create_database__.py b/esmecata/__main_create_database__.py
@@ -1,4 +1,4 @@
-# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
+# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
 # Univ. Grenoble Alpes, Inria, Microcosme
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by

diff --git a/esmecata/__main_gseapy__.py b/esmecata/__main_gseapy__.py
@@ -1,4 +1,4 @@
-# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
+# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
 # Univ. Grenoble Alpes, Inria, Microcosme
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -119,6 +119,16 @@ def main():
         metavar='INT',
         default=None)
 
+    parent_parser_gseapyCutOff = argparse.ArgumentParser(add_help=False)
+    parent_parser_gseapyCutOff.add_argument(
+        '--gseapyCutOff',
+        dest='gseapyCutOff',
+        required=False,
+        type=float,
+        help='Adjust-Pval cutoff for gseapy enrichr, default: 0.05 (--cut-off argument of gseapy).',
+        metavar='FLOAT',
+        default=0.05)
+
     # subparsers
     subparsers = parser.add_subparsers(
         title='subcommands',
@@ -130,7 +140,7 @@ def main():
         help='Extract enriched functions from groups (either chosen from tax_rank or manually selected) using gseapy enrichr and orsum.',
         parents=[
             parent_parser_f, parent_parser_o, parent_parser_grouping, parent_parser_t, parent_parser_taxa_list,
-            parent_parser_e, parent_parser_g, parent_parser_orsumMinTermSize
+            parent_parser_e, parent_parser_g, parent_parser_orsumMinTermSize, parent_parser_gseapyCutOff
             ],
         allow_abbrev=False)
 
@@ -158,7 +168,8 @@ def main():
 
     if args.cmd == 'gseapy_enrichr':
         taxon_rank_annotation_enrichment(args.input_folder, args.output, args.grouping, taxon_rank=args.taxon_rank, taxa_lists_file=args.taxa_list,
-                                         enzyme_data_file=args.enzyme_file, go_basic_obo_file=args.go_file, orsum_minterm_size=args.orsumMinTermSize)
+                                         enzyme_data_file=args.enzyme_file, go_basic_obo_file=args.go_file, orsum_minterm_size=args.orsumMinTermSize,
+                                         selected_adjust_pvalue_cutoff=args.gseapyCutOff)
 
     logger.info("--- Total runtime %.2f seconds ---" % (time.time() - start_time))
     logger.warning(f'--- Logs written in {log_file_path} ---')
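The `--gseapyCutOff` option above follows the parent-parser pattern used throughout this file. The following stand-alone sketch shows how such an option behaves once attached to a subcommand: the default applies when the flag is omitted, and the value is converted to `float` otherwise. The program name `demo` and the variable names are illustrative assumptions; only the option definition mirrors the diff.

```python
import argparse

# Parent parser that carries only the cutoff option, as in the diff.
parent_parser_cutoff = argparse.ArgumentParser(add_help=False)
parent_parser_cutoff.add_argument(
    '--gseapyCutOff',
    dest='gseapyCutOff',
    required=False,
    type=float,
    help='Adjust-Pval cutoff for gseapy enrichr, default: 0.05.',
    metavar='FLOAT',
    default=0.05)

# Minimal main parser with one subcommand inheriting the option.
parser = argparse.ArgumentParser(prog='demo')  # hypothetical program name
subparsers = parser.add_subparsers(title='subcommands', dest='cmd')
subparsers.add_parser(
    'gseapy_enrichr',
    parents=[parent_parser_cutoff],
    allow_abbrev=False)

# Without the flag, the default is used; with it, the string is parsed as a float.
default_args = parser.parse_args(['gseapy_enrichr'])
custom_args = parser.parse_args(['gseapy_enrichr', '--gseapyCutOff', '0.01'])
print(default_args.gseapyCutOff, custom_args.gseapyCutOff)
```

Because `allow_abbrev=False` is set on the subcommand, a shortened flag such as `--gseapy` would be rejected rather than silently matched.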

diff --git a/esmecata/__main_report__.py b/esmecata/__main_report__.py
@@ -1,4 +1,4 @@
-# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
+# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
 # Univ. Grenoble Alpes, Inria, Microcosme
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by

diff --git a/esmecata/core/annotation.py b/esmecata/core/annotation.py
@@ -1,4 +1,4 @@
-# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
+# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
 # Univ. Grenoble Alpes, Inria, Microcosme
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -1029,22 +1029,23 @@ def write_annotation_reference(protein_annotations, reference_proteins, annotati
     with open(annotation_reference_file, 'w') as output_tsv:
         csvwriter = csv.writer(output_tsv, delimiter='\t')
         if expression_output_dict:
-            csvwriter.writerow(['protein_cluster', 'cluster_members', 'protein_name', 'gene_name', 'GO', 'EC', 'Induction', 'Tissue_Specificity', 'Disruption_Phenotype'])
+            csvwriter.writerow(['protein_cluster', 'cluster_members', 'protein_name', 'gene_name', 'GO', 'EC', 'KEGG_reaction', 'Induction', 'Tissue_Specificity', 'Disruption_Phenotype'])
         else:
-            csvwriter.writerow(['protein_cluster', 'cluster_members', 'protein_name', 'gene_name', 'GO', 'EC'])
+            csvwriter.writerow(['protein_cluster', 'cluster_members', 'protein_name', 'gene_name', 'GO', 'EC', 'KEGG_reaction'])
         for protein in protein_annotations:
             protein_name = protein_annotations[protein][0]
             gene_name = protein_annotations[protein][3]
             cluster_members = ','.join(reference_proteins[protein])
             gos = ','.join(sorted(list(protein_annotations[protein][1])))
             ecs = ','.join(sorted(list(protein_annotations[protein][2])))
+            kegg_reaction = ''
             if expression_output_dict:
                 induction = expression_output_dict[protein][0]
                 tissue_specificity = expression_output_dict[protein][1]
                 disruption = expression_output_dict[protein][2]
-                csvwriter.writerow([protein, cluster_members, protein_name, gene_name, gos, ecs, induction, tissue_specificity, disruption])
+                csvwriter.writerow([protein, cluster_members, protein_name, gene_name, gos, ecs, kegg_reaction, induction, tissue_specificity, disruption])
             else:
-                csvwriter.writerow([protein, cluster_members, protein_name, gene_name, gos, ecs])
+                csvwriter.writerow([protein, cluster_members, protein_name, gene_name, gos, ecs, kegg_reaction])
 
 
 def create_pathologic(base_filename, annotated_protein_to_keeps, reference_proteins, pathologic_output_file):
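Since this hunk adds a `KEGG_reaction` column to both the header and the data rows, it is worth checking that the two stay the same width. The sketch below illustrates one pitfall when editing such header lists: Python concatenates adjacent string literals, so a missing comma merges two column names without raising an error. The row values are made-up placeholders and the in-memory buffer stands in for the real output file.

```python
import csv
import io

# Adjacent string literals are concatenated, so a missing comma
# silently turns two column names into one.
merged = ['EC', 'KEGG_reaction' 'Induction']
separate = ['EC', 'KEGG_reaction', 'Induction']
print(merged)    # two elements, the second being the merged name
print(separate)  # three elements

# Header without expression data, as written in the diff.
header = ['protein_cluster', 'cluster_members', 'protein_name',
          'gene_name', 'GO', 'EC', 'KEGG_reaction']
# Placeholder row; kegg_reaction is left empty, as in the diff.
row = ['P1', 'P1,P2', 'example protein', 'geneA', 'GO:0008150', '1.1.1.1', '']

# Write to an in-memory buffer instead of annotation_reference_file.
buffer = io.StringIO()
csvwriter = csv.writer(buffer, delimiter='\t')
if len(header) != len(row):
    raise ValueError('Header and row have different numbers of columns.')
csvwriter.writerow(header)
csvwriter.writerow(row)
print(buffer.getvalue())
```

A length check of this kind is cheap and would flag a merged header name as soon as the first row is written.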

diff --git a/esmecata/core/clustering.py b/esmecata/core/clustering.py
@@ -1,4 +1,4 @@
-# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
+# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
 # Univ. Grenoble Alpes, Inria, Microcosme
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by

diff --git a/esmecata/core/eggnog.py b/esmecata/core/eggnog.py
@@ -1,4 +1,4 @@
-# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
+# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
 # Univ. Grenoble Alpes, Inria, Microcosme
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by