fixes incomplete extraction of sample ids from filenames #28

Closed
49 commits
0b3757d
FEAT: started work on eggnog annotation stuff.
Keegan-Evans Jul 13, 2022
f0acccc
IMP: linting....next to move dependencies
Keegan-Evans Jul 13, 2022
5fafa65
IMP: just adding eggnog stuff
Keegan-Evans Jul 15, 2022
8727fe3
IMP: merging in altair fixes from main
Keegan-Evans Jul 15, 2022
7fd7738
IMP: just getting started on adding mapper.
Keegan-Evans Jul 19, 2022
4b9e1de
FEAT: initial pass at reference db downloader.
Keegan-Evans Aug 8, 2022
8e8170d
IMP: more work on ancillary database downloader
Keegan-Evans Aug 10, 2022
d324cbb
IMP: getting closer.
Keegan-Evans Aug 12, 2022
556d426
IMP: clean up trying to get tests passing.
Keegan-Evans Aug 13, 2022
0e73a1d
BUG: downloader functional/tests passing
Keegan-Evans Aug 15, 2022
430c3b2
TEST: Tests for taxa checker utility.
Keegan-Evans Aug 16, 2022
c624434
TEST: fixing failure from incorrect setup....
Keegan-Evans Aug 16, 2022
b63f1b0
TEST: skipping tests with actual downloads
Keegan-Evans Aug 17, 2022
97837ae
BUG: fixing linting
Keegan-Evans Aug 17, 2022
629fc00
BUG: fixing linting
Keegan-Evans Aug 17, 2022
a465c87
IMP: cleaning somethings up....
Keegan-Evans Aug 25, 2022
3c237a0
YEP: getting there
Keegan-Evans Sep 1, 2022
dfbe560
IMP: Formats & Types updated and sorted, started combining the
Keegan-Evans Sep 8, 2022
0b03d7c
FEAT: diamond_search method
Keegan-Evans Oct 11, 2022
c3a77fd
IMP: implementing functionality for search_diamond
Keegan-Evans Oct 11, 2022
dcf0078
FEAT: eggnog_diamond_search now working?
Keegan-Evans Oct 27, 2022
f3b8b3c
FEAT: diamond seed ortholog search for eggnogmapper
Keegan-Evans Oct 28, 2022
04b5e2a
LINT: cleaning up first draft
Keegan-Evans Oct 31, 2022
8dadc18
FEAT: add eggnog_annotate_seed_orthologs
Keegan-Evans Nov 4, 2022
46675c9
FEAT: eggnog annotation working!
Keegan-Evans Nov 7, 2022
e53cbac
FEAT: starting to add usage examples
Keegan-Evans Nov 8, 2022
ada9cc4
FEAT: added multi-cpu utilization
Keegan-Evans Dec 12, 2022
63a9114
IMP: linting
Keegan-Evans Dec 12, 2022
484f4d0
FEAT: add read eggnog database into memory.
Keegan-Evans Dec 12, 2022
2db601a
GETTING q2-types-genomics and moshpit on same page
Keegan-Evans Dec 13, 2022
627e9a1
reorganizing for just eggnog stuff
Keegan-Evans Dec 13, 2022
8a3feee
IMP: dependency specification update.
Keegan-Evans Dec 14, 2022
9b4a7f1
FEAT: Generate FT on eggnog diamond search
Keegan-Evans Jan 27, 2023
c8ed0e8
EOD MONDAY
Keegan-Evans Jan 30, 2023
547f15c
BACKUP before cleanup
Keegan-Evans Feb 2, 2023
562f934
IMP: fixing linting issues
Keegan-Evans Feb 2, 2023
83c96d7
REBASE: add metabat2 changes back in
Keegan-Evans Feb 2, 2023
5b8be74
BUG: remove artifacts from merge
Keegan-Evans Feb 2, 2023
9195e42
TEST: added test data/reference artifacts and a basic test for eggnog
Keegan-Evans Feb 7, 2023
10106f8
LINT: cleanup test commit
Keegan-Evans Feb 7, 2023
5119220
ONLY LOCAL FAILING are not eggnog related
Keegan-Evans Feb 20, 2023
1fd630a
lint setup
Keegan-Evans Feb 20, 2023
bdb0d41
added dependency on q2_types_genomics
Keegan-Evans Feb 20, 2023
c94799b
make types genomics available?
Keegan-Evans Feb 20, 2023
5f10625
TEST: added general test to eggnog annotater.
Keegan-Evans Feb 21, 2023
5f87345
add --dbmem parameter to address issue with very long runtime
gregcaporaso Mar 3, 2023
29afd5d
Merge pull request #1 from gregcaporaso/eggnog_pr
Keegan-Evans Mar 6, 2023
ae7216a
TEST: revert imports and fix test_small_good_hits
Keegan-Evans Mar 6, 2023
df515ec
fixes incomplete extraction of sample ids from filenames
gregcaporaso Apr 4, 2023
5 changes: 5 additions & 0 deletions .gitignore
@@ -130,3 +130,8 @@ dmypy.json

 # PyCharm configuration
 .idea/
+
+# temp eggnogmapper database files
+e5.proteomes.faa
+e5.taxid_info.tsv
+*.dmnd
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -1,2 +1,4 @@
 include versioneer.py
+include eggnog-mapper
+include diamond
 include q2_moshpit/_version.py
6 changes: 6 additions & 0 deletions ci/recipe/meta.yaml
@@ -22,6 +22,10 @@ requirements:
     - samtools
     - qiime2 {{ qiime2_epoch }}.*
     - q2-types-genomics {{ qiime2_epoch }}.*
+    - q2templates {{ qiime2_epoch }}.*
+    - eggnog-mapper >=2.1.10
+    - diamond
+    - click

 test:
   requires:
@@ -30,6 +34,8 @@ test:
   imports:
     - q2_moshpit
     - qiime2.plugins.moshpit
+    - qiime2.plugins.types_genomics
+    - q2_types_genomics
   commands:
     - pytest --cov q2_moshpit --pyargs q2_moshpit

4 changes: 3 additions & 1 deletion q2_moshpit/__init__.py
@@ -8,9 +8,11 @@

 from .kraken2 import classification, database
 from .metabat2 import metabat2
+from . import diamond, annotation
+

 from ._version import get_versions
 __version__ = get_versions()['version']
 del get_versions

-__all__ = ['metabat2', 'classification', 'database']
+__all__ = ['metabat2', 'classification', 'database', 'diamond', 'annotation']
11 changes: 11 additions & 0 deletions q2_moshpit/annotation/__init__.py
@@ -0,0 +1,11 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2022, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------


from ._method import eggnog_annotate_seed_orthologs
__all__ = ['eggnog_annotate_seed_orthologs']
62 changes: 62 additions & 0 deletions q2_moshpit/annotation/_method.py
@@ -0,0 +1,62 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2022, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------


# plugin imports
from q2_types_genomics.genome_data import OrthologFileFmt, SeedOrthologDirFmt
from q2_types_genomics.feature_data import OrthologAnnotationDirFmt
from q2_types_genomics.reference_db import EggnogRefDirFmt

# library imports
import subprocess
import shutil
import tempfile
import os


def eggnog_annotate_seed_orthologs(hits_table: SeedOrthologDirFmt,
                                   eggnog_db: EggnogRefDirFmt,
                                   db_in_memory: bool = False,
                                   ) -> OrthologAnnotationDirFmt:

    eggnog_db_fp = eggnog_db.path
    temp = tempfile.TemporaryDirectory()

    # run analysis
    for relpath, obj_path in hits_table.seed_orthologs.iter_views(
            OrthologFileFmt):
        sample_label = str(relpath).rsplit(r'.', 2)[0]

        _annotate_seed_orthologs_runner(seed_ortholog=obj_path,
                                        eggnog_db=eggnog_db_fp,
                                        sample_label=sample_label,
                                        output_loc=temp.name,
                                        db_in_memory=db_in_memory)

    # INSTANTIATE RESULT OBJECT
    result = OrthologAnnotationDirFmt()

    for item in os.listdir(temp.name):
        shutil.copy(os.path.join(temp.name, item), result.path)

    return result


def _annotate_seed_orthologs_runner(seed_ortholog, eggnog_db, sample_label,
                                    output_loc, db_in_memory):

    # at this point instead of being able to specify the type of target
    # orthologs, we want to annotate _all_.

    cmds = ['emapper.py', '-m', 'no_search', '--annotate_hits_table',
            str(seed_ortholog), '--data_dir', str(eggnog_db),
            '-o', str(sample_label), '--output_dir', str(output_loc)]
    if db_in_memory:
        cmds.append('--dbmem')

    subprocess.run(cmds, check=True)
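
Review note: the argument list built in _annotate_seed_orthologs_runner can be inspected without invoking emapper.py. The following is a minimal sketch, assuming a hypothetical helper named build_annotate_cmd that is not part of this PR; the flags are only those that appear in the diff above, and the paths in the usage lines are invented for illustration.

```python
def build_annotate_cmd(seed_ortholog, eggnog_db, sample_label, output_loc,
                       db_in_memory=False):
    """Return the emapper.py argument list without executing it.

    Hypothetical helper for illustration only; it mirrors the list built
    in _annotate_seed_orthologs_runner in this diff.
    """
    cmd = ['emapper.py', '-m', 'no_search', '--annotate_hits_table',
           str(seed_ortholog), '--data_dir', str(eggnog_db),
           '-o', str(sample_label), '--output_dir', str(output_loc)]
    if db_in_memory:
        # Appended only when requested, as in the diff.
        cmd.append('--dbmem')
    return cmd


# Illustrative paths only; nothing is executed here.
cmd = build_annotate_cmd('hits/s1.emapper.seed_orthologs', 'eggnog_db',
                         's1', 'out', db_in_memory=True)
print(' '.join(cmd))
```

Keeping list construction separate from subprocess.run in this way would let a unit test check the arguments without a local eggNOG database; whether that split is worthwhile here is a judgment call for the maintainers.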
7 changes: 7 additions & 0 deletions q2_moshpit/annotation/tests/__init__.py
@@ -0,0 +1,7 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2022, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------
Binary file added q2_moshpit/annotation/tests/data/.DS_Store
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,2 @@
1000565.METUNv1_03812 1000565.METUNv1_03812 4.71e-264 714.0 COG0012@1|root,COG0012@2|Bacteria,1MVM4@1224|Proteobacteria,2VJ1W@28216|Betaproteobacteria,2KUD2@206389|Rhodocyclales 206389|Rhodocyclales J ATPase that binds to both the 70S ribosome and the 50S ribosomal subunit in a nucleotide-independent manner ychF - - ko:K06942 - - - - ko00000,ko03009 - - - MMR_HSR1,YchF-GTPase_C
362663.ECP_0061 362663.ECP_0061 0.0 1624.0 COG0417@1|root,COG0417@2|Bacteria,1MVY9@1224|Proteobacteria,1RMQ1@1236|Gammaproteobacteria,3XPER@561|Escherichia 1236|Gammaproteobacteria L DNA polymerase polB GO:0003674,GO:0003824,GO:0003887,GO:0004518,GO:0004527,GO:0004529,GO:0004536,GO:0006139,GO:0006259,GO:0006260,GO:0006261,GO:0006281,GO:0006725,GO:0006807,GO:0006950,GO:0006974,GO:0007154,GO:0008150,GO:0008152,GO:0008296,GO:0008408,GO:0009058,GO:0009059,GO:0009432,GO:0009605,GO:0009987,GO:0009991,GO:0016740,GO:0016772,GO:0016779,GO:0016787,GO:0016788,GO:0016796,GO:0016895,GO:0018130,GO:0019438,GO:0031668,GO:0033554,GO:0034061,GO:0034641,GO:0034645,GO:0034654,GO:0043170,GO:0044237,GO:0044238,GO:0044249,GO:0044260,GO:0044271,GO:0045004,GO:0045005,GO:0046483,GO:0050896,GO:0051716,GO:0071496,GO:0071704,GO:0071897,GO:0090304,GO:0090305,GO:0140097,GO:1901360,GO:1901362,GO:1901576 2.7.7.7 ko:K02336 - - - - ko00000,ko01000,ko03400 - - - DNA_pol_B,DNA_pol_B_exo1
@@ -0,0 +1,2 @@
1000565.METUNv1_03812 1000565.METUNv1_03812 4.71e-264 714.0 1 363 1 363 100.0 100.0 100.0
362663.ECP_0061 362663.ECP_0061 0.0 1624.0 1 783 1 783 100.0 100.0 100.0
36 changes: 36 additions & 0 deletions q2_moshpit/annotation/tests/test_annotate.py
@@ -0,0 +1,36 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2022, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------


from qiime2.plugin.testing import TestPluginBase
from .._method import eggnog_annotate_seed_orthologs
from q2_types_genomics.genome_data import SeedOrthologDirFmt, OrthologFileFmt
from q2_types_genomics.reference_db import EggnogRefDirFmt
import pandas as pd
import pandas.testing as pdt

class TestAnnotate(TestPluginBase):
    package = 'q2_moshpit.annotation.tests'

    def test_small_good_hits(self):
        so_fp = self.get_data_path('good_hits/')
        seed_orthologs = SeedOrthologDirFmt(so_fp, mode='r')

        egg_db_fp = self.get_data_path('eggnog_db/')
        egg_db = EggnogRefDirFmt(egg_db_fp, mode='r')

        obs_obj = eggnog_annotate_seed_orthologs(hits_table=seed_orthologs,
                                                 eggnog_db=egg_db)

        exp_fp = self.get_data_path('expected/test_output.emapper.annotations')
        exp = OrthologFileFmt(exp_fp, mode='r').view(pd.DataFrame)

        for rel_path, obj in obs_obj.annotations.iter_views(OrthologFileFmt):
            obs = obj.view(pd.DataFrame)
            pdt.assert_frame_equal(obs, exp)

12 changes: 12 additions & 0 deletions q2_moshpit/diamond/__init__.py
@@ -0,0 +1,12 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2022, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------


from ._method import eggnog_diamond_search, extract_ft_from_seed_orthologs

__all__ = ['eggnog_diamond_search', 'extract_ft_from_seed_orthologs']
84 changes: 84 additions & 0 deletions q2_moshpit/diamond/_method.py
@@ -0,0 +1,84 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2022, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------


from q2_types_genomics.per_sample_data import ContigSequencesDirFmt
from q2_types_genomics.genome_data import SeedOrthologDirFmt, OrthologFileFmt
from q2_types.feature_data import DNAFASTAFormat
from q2_types_genomics.reference_db import DiamondDatabaseDirFmt

import os
import subprocess
import shutil
import tempfile
import re
import pandas as pd


def eggnog_diamond_search(input_sequences: ContigSequencesDirFmt,
                          diamond_db: DiamondDatabaseDirFmt,
                          num_cpus: int = 1, db_in_memory: bool = False
                          ) -> (SeedOrthologDirFmt, pd.DataFrame):

    diamond_db_fp = os.path.join(str(diamond_db), 'ref_db.dmnd')
    temp = tempfile.TemporaryDirectory()

    # run analysis
    for relpath, obj_path in input_sequences.sequences.iter_views(
            DNAFASTAFormat):
        sample_label = str(relpath).rsplit(r'_', 1)[0]

        _diamond_search_runner(input_path=obj_path,
                               diamond_db=diamond_db_fp,
                               sample_label=sample_label,
                               output_loc=temp.name,
                               num_cpus=num_cpus,
                               db_in_memory=db_in_memory)

    result = SeedOrthologDirFmt()

    for item in os.listdir(temp.name):
        if re.match(r".*\.seed_orthologs", item):
            shutil.copy(os.path.join(temp.name, item), result.path)

    ft = extract_ft_from_seed_orthologs(result)

    return (result, ft)


def extract_ft_from_seed_orthologs(seed_orthologs: SeedOrthologDirFmt
                                   ) -> pd.DataFrame:

    per_sample_counts = []

    for sample_path, obj in seed_orthologs.seed_orthologs.iter_views(
            OrthologFileFmt):
        # str.rsplit takes a literal separator, not a regex, so split on '.'
        sample_name = str(sample_path).rsplit('.', 2)[0]
        sample_df = obj.view(pd.DataFrame)
        sample_feature_counts = sample_df.value_counts(['sseqid'])
        sample_feature_counts.name = str(sample_name)
        per_sample_counts.append(sample_feature_counts)

    df = pd.DataFrame(per_sample_counts)
    df.fillna(0, inplace=True)
    df.columns = [x[0] for x in df.columns.to_series()]

    return df


def _diamond_search_runner(input_path, diamond_db, sample_label, output_loc,
                           num_cpus, db_in_memory):

    cmds = ['emapper.py', '-i', str(input_path), '-o', sample_label,
            '-m', 'diamond', '--no_annot', '--dmnd_db', str(diamond_db),
            '--itype', 'metagenome', '--output_dir', output_loc, '--cpu',
            str(num_cpus), '--dmnd_ignore_warnings']
    if db_in_memory:
        cmds.append('--dbmem')

    subprocess.run(cmds, check=True)
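
Review note: the methods in this file derive a sample label by splitting the file name from the right. The following is a small sketch of how str.rsplit behaves, assuming file names of the form <label>.emapper.seed_orthologs (an assumption based on the '.seed_orthologs' filter above); the helper name label_from_filename is hypothetical and not part of this PR.

```python
def label_from_filename(filename, sep='.', maxsplit=2):
    """Drop the last `maxsplit` sep-delimited parts of a file name."""
    return str(filename).rsplit(sep, maxsplit)[0]


name = 'sample_1.emapper.seed_orthologs'

# A plain '.' separator strips the two trailing extensions.
print(label_from_filename(name))

# str.rsplit matches its separator literally, not as a regular
# expression, so a regex-style escape leaves the name unchanged.
print(label_from_filename(name, sep=r'\.'))
```

Because the split counts from the right, a label that itself contains dots is preserved as long as exactly two extensions follow it.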
7 changes: 7 additions & 0 deletions q2_moshpit/diamond/tests/__init__.py
@@ -0,0 +1,7 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2022, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------
Binary file added q2_moshpit/diamond/tests/data/tiny_diamond_db.qza
Binary file not shown.
Binary file added q2_moshpit/diamond/tests/data/tiny_test_data.qza
Binary file not shown.
28 changes: 28 additions & 0 deletions q2_moshpit/diamond/tests/test_diamond.py
@@ -0,0 +1,28 @@
# ----------------------------------------------------------------------------
# Copyright (c) 2022, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------


from qiime2.plugin.testing import TestPluginBase
from .._method import eggnog_diamond_search
from q2_types_genomics.reference_db import DiamondDatabaseDirFmt
from q2_types_genomics.per_sample_data import ContigSequencesDirFmt


class TestDiamond(TestPluginBase):
    package = 'q2_moshpit.diamond.tests'

    def test_good_small_search(self):
        input_sequences = ContigSequencesDirFmt(
            self.get_data_path('tiny_test_data.qza'), mode='r')

        diamond_db = DiamondDatabaseDirFmt(
            self.get_data_path('tiny_diamond_db.qza'), mode='r')

        eggnog_diamond_search(
            input_sequences=input_sequences,
            diamond_db=diamond_db)
10 changes: 6 additions & 4 deletions q2_moshpit/metabat2/metabat2.py
Original file line number Diff line number Diff line change
@@ -18,14 +18,16 @@
 from q2_moshpit.metabat2.utils import _process_metabat2_arg


-def _get_sample_name_from_path(fp):
-    return os.path.splitext(os.path.basename(fp))[0].split('_')[0]
+def _get_sample_name_from_path(fp, suffix):
+    return os.path.basename(fp).rsplit(suffix, maxsplit=1)[0]


 def _assert_samples(contigs_fps, maps_fps) -> dict:
     contigs_fps, maps_fps = sorted(contigs_fps), sorted(maps_fps)
-    contig_samps = [_get_sample_name_from_path(x) for x in contigs_fps]
-    map_samps = [_get_sample_name_from_path(x) for x in maps_fps]
+    contig_samps = [_get_sample_name_from_path(x, '_contigs.fa')
+                    for x in contigs_fps]
+    map_samps = [_get_sample_name_from_path(x, '_alignment.bam')
+                 for x in maps_fps]
     if set(contig_samps) != set(map_samps):
         raise Exception('Contigs and alignment maps should belong to the '
                         'same sample set. You provided contigs for '
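
Review note: the effect of the change to _get_sample_name_from_path shows up on a sample ID that itself contains underscores. The sketch below copies the two one-line bodies from the diff under the hypothetical names old_sample_name and new_sample_name; the file path is invented for illustration.

```python
import os


def old_sample_name(fp):
    # Previous behaviour: keep only the text before the first underscore.
    return os.path.splitext(os.path.basename(fp))[0].split('_')[0]


def new_sample_name(fp, suffix):
    # New behaviour: strip a known suffix from the end of the file name.
    return os.path.basename(fp).rsplit(suffix, maxsplit=1)[0]


fp = '/data/sample_A_1_contigs.fa'  # illustrative path only

print(old_sample_name(fp))
print(new_sample_name(fp, '_contigs.fa'))
```

With the old version, any two samples whose IDs share the text before the first underscore would collapse to the same name; stripping an explicit suffix avoids that, provided every file name actually ends with the expected suffix.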