Merge pull request #17 from AuReMe/precomputed_db
Precomputed db
ArnaudBelcour authored Jan 27, 2025
2 parents 536536c + 45222c8 commit acf38e8
Showing 33 changed files with 787 additions and 9,342 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
@@ -58,7 +58,7 @@ jobs:
pytest test_proteomes.py
pytest test_clustering.py
pytest test_annotation.py
pytest test_workflow.py
pytest test_workflow_uniprot.py
pytest test_precomputed.py
pytest test_eggnog.py
pytest test_report.py
19 changes: 17 additions & 2 deletions CHANGELOG.md
@@ -1,15 +1,30 @@
# Changelog

# EsMeCaTa v0.5.4 (2024-11-09)
# EsMeCaTa v0.6.0 (2025-01-27)

## Add

* Create database from different output folders of esmecata (`from_runs`).
* New command `esmecata_create_db` to create database from different output folders of esmecata (`from_runs`).
* Full release of `esmecata precomputed` associated with the first version of [esmecata precomputed database](https://doi.org/10.5281/zenodo.13354073).
* Option threshold (`-t`) to precomputed.
* Add `--gseapyCutOff` option to `gseapy_enrichr`.
* A check after database creation to detect taxa with few predicted proteins compared to their higher affiliated taxon.
* Check that the gzip file is correctly formatted.
* Header `KEGG_reaction` in annotation_reference from `annotation_uniprot` to avoid issues with `esmecata_create_db`.
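
The changelog does not say how the gzip format check is implemented. As a minimal sketch of what such a check could look like (the helper name `is_valid_gzip` is hypothetical and is not taken from EsMeCaTa), one option is to decompress the whole file and treat any decompression error as a sign of a malformed archive:

```python
import gzip
import zlib


def is_valid_gzip(path):
    """Return True if the file at `path` can be fully decompressed as gzip.

    Hypothetical helper for illustration; not EsMeCaTa's own implementation.
    """
    try:
        with gzip.open(path, 'rb') as handle:
            # Read in chunks so large archives are not loaded into memory at once.
            while handle.read(1024 * 1024):
                pass
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile is a subclass of OSError (bad header or checksum),
        # EOFError signals a truncated file, zlib.error a corrupted stream.
        return False
    return True
```

Reading to the end of the stream, rather than only checking the two magic bytes, also catches truncated downloads, at the cost of one full pass over the file.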

## Fix

* Issue with protein IDs from UniParc during annotation (incorrect split on '|').
* Fix issue in `get_taxon_obs_name` function.
* Issues in test.

## Modify

* Add database version in log.
* Rename `test_workflow.py` into `test_workflow_uniprot.py`, to better reflect what is done.
* Update workflow figure.
* Update readme.
* Update article_data folder and the associated readme.

# EsMeCaTa v0.5.4 (2024-11-06)

91 changes: 61 additions & 30 deletions README.md

Large diffs are not rendered by default.

65 changes: 52 additions & 13 deletions article_data/README.md
@@ -3,22 +3,26 @@
## Table of contents
- [Input files from the article](#input-files-from-the-article)
- [Table of contents](#table-of-contents)
- [Experiments](#experiments)
- [Manually selected taxa](#manually-selected-taxa)
- [MGnify validation](#mgnify-validation)
- [Biogas reactor](#biogas-reactor)
- [Algae symbionts](#algae-symbionts)
- [Old Experiments](#old-experiments)
- [Experiments (preprint of 2025)](#experiments-preprint-of-2025)
- [Datasets](#datasets)
- [Manually selected taxa](#manually-selected-taxa)
- [MGnify validation](#mgnify-validation)
- [Methanogenic reactor](#methanogenic-reactor)
- [Algae symbionts](#algae-symbionts)
- [Reproduce experiments](#reproduce-experiments)
- [Old Experiments (preprint of 2022)](#old-experiments-preprint-of-2022)
- [Taxonomic affiliations from Gammaproteobacteria and Alveolata](#taxonomic-affiliations-from-gammaproteobacteria-and-alveolata)
- [Taxonomic affiliations from 16S and rpoB](#taxonomic-affiliations-from-16s-and-rpob)

## Experiments
## Experiments (preprint of 2025)

### Manually selected taxa
### Datasets

#### Manually selected taxa

A folder containing an input file for esmecata containing 13 manually selected taxonomic affiliations from Gammaproteobacteria and Alveolata.

### MGnify validation
#### MGnify validation

4 input files associated with dataset from MGnify:
- [honeybee gut v1.0](https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/honeybee-gut/v1.0/)
@@ -29,15 +33,50 @@ A folder containing an input file for esmecata containing 13 manually selected t
For each dataset, it contains all the metadata associated with the genomes of the dataset. The columns `observation_name` and `taxonomic_affiliation` are added so the file can be used by EsMeCaTa.
The column `Completness` is used to filter the genomes (for the article, only genomes with at least 90% completeness were kept).

### Biogas reactor
#### Methanogenic reactor

An OTU table from a biogas reactor experiment containing the taxonomic assignment and the abundance of each OTU in different samples (corresponding to time of measurements of the biogas reactor).
An OTU table from a methanogenic reactor experiment containing the taxonomic assignment and the abundance of each OTU in different samples (corresponding to time of measurements of the biogas reactor).

### Algae symbionts
#### Algae symbionts

Microbiotes from Metagenomes experiment ([Burgunter et al. 2020](https://doi.org/10.3389/fmars.2020.00085) and [KleinJan et al. 2023](https://doi.org/10.1111/mec.16766)). 35 MAGs were selected from KleinJan et al. 2023 with more than 90% completion.

## Old Experiments
### Reproduce experiments

The experiments made in the article of EsMeCaTa (a preprint is available at [bioRxiv](https://doi.org/10.1101/2022.03.16.484574)) are available in this [Zenodo archive](https://zenodo.org/records/14502342).

Furthermore, this archive contains several precomputed databases that can be used to try to reproduce these experiments.

To run these experiments, you will have to install esmecata with: `pip install esmecata`

To do so, download one of the precomputed databases associated with the dataset you want to run (such as `precomputed_db_honeybee.zip`). Then use the `esmecata precomputed` command to apply this database to an input file. For example:

```
esmecata precomputed -i honeybee_esmecata_metdata.tsv -d precomputed_db_honeybee.zip -o output_folder_honeybee
```

For a better chance of reproducing the results, it is recommended to use the NCBI Taxonomy database associated with the dataset. The NCBI Taxonomy database can be downloaded from the Zenodo archive (file `ncbi_taxonomy_database.zip`). To know which version to use, refer to the following table:

| dataset | UniProt | NCBI Taxonomy |
|---------------------------|---------|---------------|
| Toy example | 2023_04 | 09-2023 |
| E. siliculosus microbiota | 2023_02 | 04-2023 |
| Honeybee gut | 2023_05 | 12-2023 |
| Human Oral | 2023_05 | 12-2023 |
| Marine | 2023_05 | 12-2023 |
| Pig Gut | 2023_05 | 12-2023 |
| Methanogenic reactor | 2024_01 | 01-2024 |

You can update the NCBI Taxonomy database used by ete3 with the following command (here we use as example `taxdmp_2024-01.tar.gz` associated with the methanogenic reactor):

```
python3 -c "from ete3 import NCBITaxa; ncbi = NCBITaxa(); ncbi.update_taxonomy_database('taxdmp_2024-01.tar.gz')"
```
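
The version table and the ete3 command above can be combined in a small helper. The following is a sketch only: the names `DATASET_VERSIONS`, `taxdump_filename` and `update_ete3_taxonomy` are hypothetical, and the `taxdmp_YYYY-MM.tar.gz` naming pattern is an assumption generalised from the single `taxdmp_2024-01.tar.gz` example, so check the actual file names inside `ncbi_taxonomy_database.zip` before relying on it.

```python
# Versions copied from the table above: dataset -> (UniProt release, NCBI Taxonomy MM-YYYY).
DATASET_VERSIONS = {
    'Toy example': ('2023_04', '09-2023'),
    'E. siliculosus microbiota': ('2023_02', '04-2023'),
    'Honeybee gut': ('2023_05', '12-2023'),
    'Human Oral': ('2023_05', '12-2023'),
    'Marine': ('2023_05', '12-2023'),
    'Pig Gut': ('2023_05', '12-2023'),
    'Methanogenic reactor': ('2024_01', '01-2024'),
}


def taxdump_filename(dataset):
    """Return the assumed taxdump archive name for a dataset of the table."""
    _uniprot_release, ncbi_version = DATASET_VERSIONS[dataset]
    month, year = ncbi_version.split('-')
    # Assumed pattern, inferred from 'taxdmp_2024-01.tar.gz'.
    return f'taxdmp_{year}-{month}.tar.gz'


def update_ete3_taxonomy(taxdump_path):
    """Same call as the one-liner above; ete3 is imported only when needed."""
    from ete3 import NCBITaxa
    ncbi = NCBITaxa()
    ncbi.update_taxonomy_database(taxdump_path)
```

For example, `update_ete3_taxonomy(taxdump_filename('Methanogenic reactor'))` would perform the same update as the command shown above, provided the archive is present in the working directory.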

The `esmecata precomputed` command will create an output folder containing the predictions for the organisms of the community.

## Old Experiments (preprint of 2022)

In the [first preprint](https://www.biorxiv.org/content/10.1101/2022.03.16.484574v1) two experiments were performed:

4 changes: 2 additions & 2 deletions esmecata/__init__.py
@@ -1,4 +1,4 @@
# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Univ. Grenoble Alpes, Inria, Microcosme
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -13,4 +13,4 @@
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>

__version__ = '0.5.5'
__version__ = '0.6.0'
10 changes: 6 additions & 4 deletions esmecata/__main__.py
@@ -1,4 +1,4 @@
# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Univ. Grenoble Alpes, Inria, Microcosme
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -139,7 +139,7 @@ def main():
'--update-affiliations',
dest='update_affiliations',
help='''If the taxonomic affiliations were assigned from an outdated taxonomic database, this can lead to taxon not be found in ete3 database. \
This option tries to udpate the taxonomic affiliations using the lowest taxon name.''',
This option tries to update the taxonomic affiliations using the lowest taxon name.''',
required=False,
action='store_true',
default=None)
@@ -386,7 +386,8 @@ def main():
help='Use precomputed database to create estimated data for the run.',
parents=[
parent_parser_i_taxon, parent_parser_d, parent_parser_o,
parent_parser_rank_limit, parent_parser_update_affiliation
parent_parser_rank_limit, parent_parser_update_affiliation,
parent_parser_thr
],
allow_abbrev=False)

@@ -460,7 +461,8 @@ def main():
args.linclust, args.minimal_number_proteomes, args.update_affiliations,
args.option_bioservices, args.eggnog_tmp_dir, args.no_dbmem)
elif args.cmd == 'precomputed':
precomputed_parse_affiliation(args.input, args.database, args.output, args.rank_limit, args.update_affiliations)
precomputed_parse_affiliation(args.input, args.database, args.output, args.rank_limit, args.update_affiliations,
args.threshold_clustering)

logger.info("--- Total runtime %.2f seconds ---" % (time.time() - start_time))
logger.warning(f'--- Logs written in {log_file_path} ---')
2 changes: 1 addition & 1 deletion esmecata/__main_create_database__.py
@@ -1,4 +1,4 @@
# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Univ. Grenoble Alpes, Inria, Microcosme
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
17 changes: 14 additions & 3 deletions esmecata/__main_gseapy__.py
@@ -1,4 +1,4 @@
# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Univ. Grenoble Alpes, Inria, Microcosme
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -119,6 +119,16 @@ def main():
metavar='INT',
default=None)

parent_parser_gseapyCutOff = argparse.ArgumentParser(add_help=False)
parent_parser_gseapyCutOff.add_argument(
'--gseapyCutOff',
dest='gseapyCutOff',
required=False,
type=float,
help='Adjust-Pval cutoff for gseapy enrichr, default: 0.05 (--cut-off argument of gseapy).',
metavar='FLOAT',
default=0.05)

# subparsers
subparsers = parser.add_subparsers(
title='subcommands',
@@ -130,7 +140,7 @@ def main():
help='Extract enriched functions from groups (either chosen from tax_rank or manually selected) using gseapy enrichr and orsum.',
parents=[
parent_parser_f, parent_parser_o, parent_parser_grouping, parent_parser_t, parent_parser_taxa_list,
parent_parser_e, parent_parser_g, parent_parser_orsumMinTermSize
parent_parser_e, parent_parser_g, parent_parser_orsumMinTermSize, parent_parser_gseapyCutOff
],
allow_abbrev=False)

@@ -158,7 +168,8 @@ def main():

if args.cmd == 'gseapy_enrichr':
taxon_rank_annotation_enrichment(args.input_folder, args.output, args.grouping, taxon_rank=args.taxon_rank, taxa_lists_file=args.taxa_list,
enzyme_data_file=args.enzyme_file, go_basic_obo_file=args.go_file, orsum_minterm_size=args.orsumMinTermSize)
enzyme_data_file=args.enzyme_file, go_basic_obo_file=args.go_file, orsum_minterm_size=args.orsumMinTermSize,
selected_adjust_pvalue_cutoff=args.gseapyCutOff)

logger.info("--- Total runtime %.2f seconds ---" % (time.time() - start_time))
logger.warning(f'--- Logs written in {log_file_path} ---')
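
The hunks above add `--gseapyCutOff` through argparse's parent-parser pattern. A reduced, self-contained sketch of that pattern follows; the `build_parser` wrapper and the `demo` program name are hypothetical, and only the option definition mirrors the diff.

```python
import argparse


def build_parser():
    """Build a reduced parser with a single subcommand (illustration only)."""
    # Parent parser holding the shared option; add_help=False avoids a duplicate -h.
    parent_parser_gseapyCutOff = argparse.ArgumentParser(add_help=False)
    parent_parser_gseapyCutOff.add_argument(
        '--gseapyCutOff',
        dest='gseapyCutOff',
        required=False,
        type=float,
        help='Adjust-Pval cutoff for gseapy enrichr, default: 0.05.',
        metavar='FLOAT',
        default=0.05)

    parser = argparse.ArgumentParser(prog='demo')
    subparsers = parser.add_subparsers(title='subcommands', dest='cmd')
    # The subcommand inherits every argument declared on its parents.
    subparsers.add_parser(
        'gseapy_enrichr',
        parents=[parent_parser_gseapyCutOff],
        allow_abbrev=False)
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args(['gseapy_enrichr', '--gseapyCutOff', '0.01'])
    print(args.cmd, args.gseapyCutOff)
```

Because the option carries a default, `args.gseapyCutOff` is always a float, so it can be forwarded as `selected_adjust_pvalue_cutoff` without a `None` check.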
2 changes: 1 addition & 1 deletion esmecata/__main_report__.py
@@ -1,4 +1,4 @@
# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Univ. Grenoble Alpes, Inria, Microcosme
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
11 changes: 6 additions & 5 deletions esmecata/core/annotation.py
@@ -1,4 +1,4 @@
# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Univ. Grenoble Alpes, Inria, Microcosme
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -1029,22 +1029,23 @@ def write_annotation_reference(protein_annotations, reference_proteins, annotati
with open(annotation_reference_file, 'w') as output_tsv:
csvwriter = csv.writer(output_tsv, delimiter='\t')
if expression_output_dict:
csvwriter.writerow(['protein_cluster', 'cluster_members', 'protein_name', 'gene_name', 'GO', 'EC', 'Induction', 'Tissue_Specificity', 'Disruption_Phenotype'])
csvwriter.writerow(['protein_cluster', 'cluster_members', 'protein_name', 'gene_name', 'GO', 'EC', 'KEGG_reaction', 'Induction', 'Tissue_Specificity', 'Disruption_Phenotype'])
else:
csvwriter.writerow(['protein_cluster', 'cluster_members', 'protein_name', 'gene_name', 'GO', 'EC'])
csvwriter.writerow(['protein_cluster', 'cluster_members', 'protein_name', 'gene_name', 'GO', 'EC', 'KEGG_reaction'])
for protein in protein_annotations:
protein_name = protein_annotations[protein][0]
gene_name = protein_annotations[protein][3]
cluster_members = ','.join(reference_proteins[protein])
gos = ','.join(sorted(list(protein_annotations[protein][1])))
ecs = ','.join(sorted(list(protein_annotations[protein][2])))
kegg_reaction = ''
if expression_output_dict:
induction = expression_output_dict[protein][0]
tissue_specificity = expression_output_dict[protein][1]
disruption = expression_output_dict[protein][2]
csvwriter.writerow([protein, cluster_members, protein_name, gene_name, gos, ecs, induction, tissue_specificity, disruption])
csvwriter.writerow([protein, cluster_members, protein_name, gene_name, gos, ecs, kegg_reaction, induction, tissue_specificity, disruption])
else:
csvwriter.writerow([protein, cluster_members, protein_name, gene_name, gos, ecs])
csvwriter.writerow([protein, cluster_members, protein_name, gene_name, gos, ecs, kegg_reaction])


def create_pathologic(base_filename, annotated_protein_to_keeps, reference_proteins, pathologic_output_file):
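
Since this hunk adds a `KEGG_reaction` column to both the header and the data rows, the two must stay the same length. The sketch below illustrates one way to guard that invariant; `build_header` and `write_rows` are hypothetical helpers, not functions from `annotation.py`, and only the column names come from the diff.

```python
import csv
import io

BASE_COLUMNS = ['protein_cluster', 'cluster_members', 'protein_name',
                'gene_name', 'GO', 'EC', 'KEGG_reaction']
EXPRESSION_COLUMNS = ['Induction', 'Tissue_Specificity', 'Disruption_Phenotype']


def build_header(with_expression):
    """Return the column names, with or without the expression columns."""
    if with_expression:
        return BASE_COLUMNS + EXPRESSION_COLUMNS
    return list(BASE_COLUMNS)


def write_rows(rows, with_expression=False):
    """Write the header and rows as TSV text, rejecting rows of the wrong width."""
    header = build_header(with_expression)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
    writer.writerow(header)
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f'expected {len(header)} fields, got {len(row)}')
        writer.writerow(row)
    return buffer.getvalue()


# Python concatenates adjacent string literals, so a forgotten comma between
# two column names silently merges them into a single header field.
assert len(['KEGG_reaction' 'Induction']) == 1
```

Building the header from named lists keeps the column count in one place, which makes a merged or missing column name easier to spot than in a long inline list.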
2 changes: 1 addition & 1 deletion esmecata/core/clustering.py
@@ -1,4 +1,4 @@
# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Univ. Grenoble Alpes, Inria, Microcosme
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
2 changes: 1 addition & 1 deletion esmecata/core/eggnog.py
@@ -1,4 +1,4 @@
# Copyright (C) 2021-2024 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Copyright (C) 2021-2025 Arnaud Belcour - Inria, Univ Rennes, CNRS, IRISA Dyliss
# Univ. Grenoble Alpes, Inria, Microcosme
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by